* cros_ec_lpcs on framework: packet too long (4 bytes, expected 0)
@ 2023-06-20 8:12 David Rheinsberg
2023-06-20 9:53 ` Tzung-Bi Shih
2023-06-21 3:25 ` Dustin Howett
0 siblings, 2 replies; 7+ messages in thread
From: David Rheinsberg @ 2023-06-20 8:12 UTC (permalink / raw)
To: chrome-platform; +Cc: Benson Leung, Guenter Roeck
Hi
(bcc: Dustin Howett and Tzung-Bi Shih, who signed off the original framework patches)
Using the cros-ec over lpc device on the Framework-13, I occasionally get:
cros_ec_lpcs cros_ec_lpcs.0: packet too long (4 bytes, expected 0)
Afterwards, the entire EC seems to be inactive and none of its controllers work, anymore (temperature sensors are stale, keyboard defunct, etc.). A reboot fixes the issues.
I cannot trigger this issue reliably, yet it seems to happen exclusively under heavy load. Do you have any recommendations how to debug this further?
I failed tracing where the error happens and why any further functionality of the EC is disabled thereafter. Does the driver end communication on an error? Or is this likely a firmware issue and just indicative of the firmware failing?
If you have any recommendations how to enabled the cros-tracing/debug features, I'd gladly run a custom kernel for a while to see where the failure originates.
Thanks
David
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: cros_ec_lpcs on framework: packet too long (4 bytes, expected 0)
2023-06-20 8:12 cros_ec_lpcs on framework: packet too long (4 bytes, expected 0) David Rheinsberg
@ 2023-06-20 9:53 ` Tzung-Bi Shih
2023-06-20 10:33 ` David Rheinsberg
2023-06-21 3:25 ` Dustin Howett
1 sibling, 1 reply; 7+ messages in thread
From: Tzung-Bi Shih @ 2023-06-20 9:53 UTC (permalink / raw)
To: David Rheinsberg; +Cc: chrome-platform, Benson Leung, Guenter Roeck
On Tue, Jun 20, 2023 at 10:12:34AM +0200, David Rheinsberg wrote:
> Using the cros-ec over lpc device on the Framework-13, I occasionally get:
>
> cros_ec_lpcs cros_ec_lpcs.0: packet too long (4 bytes, expected 0)
>
> Afterwards, the entire EC seems to be inactive and none of its controllers work, anymore (temperature sensors are stale, keyboard defunct, etc.). A reboot fixes the issues.
To be clear, does an AP reboot fix the issue? Or does it need an EC reboot?
Will it trigger some watchdog mechanisms and thus a reboot after waiting for
a specific duration (e.g. 30 seconds)?
> I cannot trigger this issue reliably, yet it seems to happen exclusively under heavy load. Do you have any recommendations how to debug this further?
>
> I failed tracing where the error happens and why any further functionality of the EC is disabled thereafter. Does the driver end communication on an error? Or is this likely a firmware issue and just indicative of the firmware failing?
>
> If you have any recommendations how to enabled the cros-tracing/debug features, I'd gladly run a custom kernel for a while to see where the failure originates.
Could you get the AP and EC console log? AFAIK, the "packet too long" is only
indicating an error of the EC command. There should be some other error
messages directly related to the system becomes inactive.
Try to get the consoles would be the most directly helpful. Except the logs,
ramoops, and stacktraces, for example, if the console is still available when
the issue happens, you could use SysRq to get all stack backtraces.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: cros_ec_lpcs on framework: packet too long (4 bytes, expected 0)
2023-06-20 9:53 ` Tzung-Bi Shih
@ 2023-06-20 10:33 ` David Rheinsberg
2023-06-21 4:57 ` Tzung-Bi Shih
0 siblings, 1 reply; 7+ messages in thread
From: David Rheinsberg @ 2023-06-20 10:33 UTC (permalink / raw)
To: Tzung-Bi Shih; +Cc: chrome-platform, Benson Leung, Guenter Roeck
Hi
On Tue, Jun 20, 2023, at 11:53 AM, Tzung-Bi Shih wrote:
> On Tue, Jun 20, 2023 at 10:12:34AM +0200, David Rheinsberg wrote:
>> Using the cros-ec over lpc device on the Framework-13, I occasionally get:
>>
>> cros_ec_lpcs cros_ec_lpcs.0: packet too long (4 bytes, expected 0)
>>
>> Afterwards, the entire EC seems to be inactive and none of its controllers work, anymore (temperature sensors are stale, keyboard defunct, etc.). A reboot fixes the issues.
>
> To be clear, does an AP reboot fix the issue? Or does it need an EC reboot?
> Will it trigger some watchdog mechanisms and thus a reboot after waiting for
> a specific duration (e.g. 30 seconds)?
A linux system reboot fixed the issue. I did *NOT* perform a cold-boot or explicitly rebooted the EC. If the EC is not automatically rebooted with the AP, then no EC-reboot was necessary to fix the issue.
No watchdog mechanism was triggered. Logs show I waited for 30min.
>> I cannot trigger this issue reliably, yet it seems to happen exclusively under heavy load. Do you have any recommendations how to debug this further?
>>
>> I failed tracing where the error happens and why any further functionality of the EC is disabled thereafter. Does the driver end communication on an error? Or is this likely a firmware issue and just indicative of the firmware failing?
>>
>> If you have any recommendations how to enabled the cros-tracing/debug features, I'd gladly run a custom kernel for a while to see where the failure originates.
>
> Could you get the AP and EC console log? AFAIK, the "packet too long" is only
> indicating an error of the EC command. There should be some other error
> messages directly related to the system becomes inactive.
There is nothing in the kernel logs besides the mentioned message +-30min of the occurrence. I did not enable debug-level messages, though.
> Try to get the consoles would be the most directly helpful. Except the logs,
> ramoops, and stacktraces, for example, if the console is still available when
> the issue happens, you could use SysRq to get all stack backtraces.
The system is (almost) fully operational when the issue happens. It is just that some hardware features do not work, anymore. I noticed the temperature sensors reporting stale data and the `fn`-key having no effect anymore.
The kernel log was quiet around the time the issue started, except for the mentioned message.
Is there a particular kernel thread you would like to see the stacktrace for? I can dump the full system stack trace and see whether anything looks particularly weird. But given that normal system operation was not interrupted, I am unsure that I will find anything.
I am unsure which device driver triggered the ec-lpc transaction, but it did not show any further error message. I am not even sure whether it is pkg-xfer or cmd-xfer, since both use the same error message. I can try placing an ftrace marker in the error-path and hopefully get a kernel stacktrace of the error?
Can I use the chardev to test whether the ec-driver is still responsive?
If you insist, I can provide the full kernel-log. I will also try to fetch a sysrq stacktrace dump next time I trigger it.
Thanks a lot for helping!
David
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: cros_ec_lpcs on framework: packet too long (4 bytes, expected 0)
2023-06-20 8:12 cros_ec_lpcs on framework: packet too long (4 bytes, expected 0) David Rheinsberg
2023-06-20 9:53 ` Tzung-Bi Shih
@ 2023-06-21 3:25 ` Dustin Howett
2023-06-26 13:04 ` David Rheinsberg
1 sibling, 1 reply; 7+ messages in thread
From: Dustin Howett @ 2023-06-21 3:25 UTC (permalink / raw)
To: David Rheinsberg; +Cc: chrome-platform, Benson Leung, Guenter Roeck
Hi David,
I've been investigating this exact phenomenon with the team over at Framework
Computer!
I can't comment on the EC ceasing to function (as I've only observed gentler
failure states[^1] on the 11th and 12th generation Intel Core 13" laptop), but
that is an interesting outcome. What system and EC firmware versions are you
using on that device?
On those devices, it occurs because exchanges from cros_ec_lpcs collide on the
bus with exchanges initiated by ACPI. There is a trivial locking mechanism used
by the ACPI control methods for EC communication, but it cannot be shared with
the host OS.
Excerpting a post of mine[^2] on the Framework Community Forum:
> The power and battery state of the machine are managed by ACPI, and the ACPI
> methods for querying those things call the EC directly. When it does so, it
> uses a mutex that can’t be shared with the OS. There’s also a couple
> ACPI-driven exchanges that occur during wake from sleep. Now, because
> cros_ec_lpcs (Linux) and CrosEC (Windows) use the LPC bus directly, an inflight
> request from ACPI can collide with an inflight request from one of these
> drivers.
Incidentally, their lead firmware engineer filed a similar bug[^3] on
my Windows EC
driver bearing the same interface as cros_ec_lpcs.
We don't currently have any leads on a permanent solution, but some members of
the community forum are testing out placing the cros_ec_debugfs driver on the
list of disallowed modules. Doing so reduces the chatter on the host command
interface, since it no longer needs to query the console periodically.
Cheers,
Dustin
[^1]: On those devices, the media keys stop working. Everything else continues
to work fine, including the host command interface and therefore the console.
[^2]: https://community.frame.work/t/tracking-fn-key-stops-working-on-popos-after-a-while/21208/49
[^3]: https://github.com/DHowett/FrameworkWindowsUtils/issues/3
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: cros_ec_lpcs on framework: packet too long (4 bytes, expected 0)
2023-06-20 10:33 ` David Rheinsberg
@ 2023-06-21 4:57 ` Tzung-Bi Shih
2023-06-26 13:09 ` David Rheinsberg
0 siblings, 1 reply; 7+ messages in thread
From: Tzung-Bi Shih @ 2023-06-21 4:57 UTC (permalink / raw)
To: David Rheinsberg; +Cc: chrome-platform, Benson Leung, Guenter Roeck
On Tue, Jun 20, 2023 at 12:33:35PM +0200, David Rheinsberg wrote:
> On Tue, Jun 20, 2023, at 11:53 AM, Tzung-Bi Shih wrote:
> > On Tue, Jun 20, 2023 at 10:12:34AM +0200, David Rheinsberg wrote:
> >> Using the cros-ec over lpc device on the Framework-13, I occasionally get:
> >>
> >> cros_ec_lpcs cros_ec_lpcs.0: packet too long (4 bytes, expected 0)
> >>
> >> Afterwards, the entire EC seems to be inactive and none of its controllers work, anymore (temperature sensors are stale, keyboard defunct, etc.). A reboot fixes the issues.
> >
> > To be clear, does an AP reboot fix the issue? Or does it need an EC reboot?
> > Will it trigger some watchdog mechanisms and thus a reboot after waiting for
> > a specific duration (e.g. 30 seconds)?
>
> A linux system reboot fixed the issue. I did *NOT* perform a cold-boot or explicitly rebooted the EC. If the EC is not automatically rebooted with the AP, then no EC-reboot was necessary to fix the issue.
>
> No watchdog mechanism was triggered. Logs show I waited for 30min.
My intention was to simply bisect between AP vs. EC. From the description
you provided, it sounds like someone was waiting for some events forever.
However, the system was still running so that it didn't trigger any watchdogs
or lockup detectors.
> >> I cannot trigger this issue reliably, yet it seems to happen exclusively under heavy load. Do you have any recommendations how to debug this further?
> >>
> >> I failed tracing where the error happens and why any further functionality of the EC is disabled thereafter. Does the driver end communication on an error? Or is this likely a firmware issue and just indicative of the firmware failing?
> >>
> >> If you have any recommendations how to enabled the cros-tracing/debug features, I'd gladly run a custom kernel for a while to see where the failure originates.
> >
> > Could you get the AP and EC console log? AFAIK, the "packet too long" is only
> > indicating an error of the EC command. There should be some other error
> > messages directly related to the system becomes inactive.
>
> There is nothing in the kernel logs besides the mentioned message +-30min of the occurrence. I did not enable debug-level messages, though.
The issue you encountered should be at least an error. I don't think it needs
to enable debug-level messages to see other hints.
Could you get EC console log? Maybe there are some more hint messages.
> > Try to get the consoles would be the most directly helpful. Except the logs,
> > ramoops, and stacktraces, for example, if the console is still available when
> > the issue happens, you could use SysRq to get all stack backtraces.
>
> The system is (almost) fully operational when the issue happens. It is just that some hardware features do not work, anymore. I noticed the temperature sensors reporting stale data and the `fn`-key having no effect anymore.
>
> The kernel log was quiet around the time the issue started, except for the mentioned message.
>
> Is there a particular kernel thread you would like to see the stacktrace for? I can dump the full system stack trace and see whether anything looks particularly weird. But given that normal system operation was not interrupted, I am unsure that I will find anything.
No. I thought kernel was stucked from your previous message but it wasn't per
your further comments.
> I am unsure which device driver triggered the ec-lpc transaction, but it did not show any further error message. I am not even sure whether it is pkg-xfer or cmd-xfer, since both use the same error message. I can try placing an ftrace marker in the error-path and hopefully get a kernel stacktrace of the error?
`pkg_xfer` for protocol version 3; `cmd_xfer` for legacy. In most cases, it
should fall into protocol version 3. I'm not sure if finding the device
driver would be helpful. The driver didn't print error messages as the driver
doesn't think it was an error.
> Can I use the chardev to test whether the ec-driver is still responsive?
You can send hello message to EC via the chardev. See [1] for reference.
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/cros-ec-tests.git/tree/cros/tests/cros_ec_mcu.py?h=main#n48
> If you insist, I can provide the full kernel-log. I will also try to fetch a sysrq stacktrace dump next time I trigger it.
No. As I replied above, the kernel doesn't look like stucked. The sysrq
is unneeded in the case.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: cros_ec_lpcs on framework: packet too long (4 bytes, expected 0)
2023-06-21 3:25 ` Dustin Howett
@ 2023-06-26 13:04 ` David Rheinsberg
0 siblings, 0 replies; 7+ messages in thread
From: David Rheinsberg @ 2023-06-26 13:04 UTC (permalink / raw)
To: Dustin L. Howett; +Cc: chrome-platform, Benson Leung, Guenter Roeck
Hi
On Wed, Jun 21, 2023, at 5:25 AM, Dustin Howett wrote:
[...]
> We don't currently have any leads on a permanent solution, but some members of
> the community forum are testing out placing the cros_ec_debugfs driver on the
> list of disallowed modules. Doing so reduces the chatter on the host command
> interface, since it no longer needs to query the console periodically.
So you are saying the ACPI entries may communicate on the same bus as the cros-ec driver, and thus you get interleaved messages? That is quite unfortunate. I wonder whether retrying on checksum-mismatch would help?
I assume the kernel cannot lock around ACPI execution, since it doesn't reliably know when the lock is required? (or is it even triggered asynchronously? My acpi knowledge is abysmal..)
Thanks a lot for the heads-up! I haven't been able to trigger it again. If it starts getting annoying, I might try the checksum-retry idea. It might need some random delay, though, to avoid racing the acpi code which I guess does something similar?
Thanks
David
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: cros_ec_lpcs on framework: packet too long (4 bytes, expected 0)
2023-06-21 4:57 ` Tzung-Bi Shih
@ 2023-06-26 13:09 ` David Rheinsberg
0 siblings, 0 replies; 7+ messages in thread
From: David Rheinsberg @ 2023-06-26 13:09 UTC (permalink / raw)
To: Tzung-Bi Shih; +Cc: chrome-platform, Benson Leung, Guenter Roeck
Hi
On Wed, Jun 21, 2023, at 6:57 AM, Tzung-Bi Shih wrote:
>> There is nothing in the kernel logs besides the mentioned message +-30min of the occurrence. I did not enable debug-level messages, though.
>
> The issue you encountered should be at least an error. I don't think it needs
> to enable debug-level messages to see other hints.
>
> Could you get EC console log? Maybe there are some more hint messages.
Not sure what the "EC console log" is. Do you mean kmsgs from the cros-ec driver?
>> I am unsure which device driver triggered the ec-lpc transaction, but it did not show any further error message. I am not even sure whether it is pkg-xfer or cmd-xfer, since both use the same error message. I can try placing an ftrace marker in the error-path and hopefully get a kernel stacktrace of the error?
>
> `pkg_xfer` for protocol version 3; `cmd_xfer` for legacy. In most cases, it
> should fall into protocol version 3. I'm not sure if finding the device
> driver would be helpful. The driver didn't print error messages as the driver
> doesn't think it was an error.
I wouldn't be too surprised if the error paths of the calling driver does not print anything on error. But it might not help much either way.
>> Can I use the chardev to test whether the ec-driver is still responsive?
>
> You can send hello message to EC via the chardev. See [1] for reference.
Thanks, this is quite nice for debugging!
Given Dustin's comment, this seems like a known issue with the acpi tables accessing the same bus without a shared lock. This sounds like a plausible cause of the issue I am seeing. Lets see whether that leads us somewhere.
Thanks a lot for the hints!
David
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-06-26 13:10 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-20 8:12 cros_ec_lpcs on framework: packet too long (4 bytes, expected 0) David Rheinsberg
2023-06-20 9:53 ` Tzung-Bi Shih
2023-06-20 10:33 ` David Rheinsberg
2023-06-21 4:57 ` Tzung-Bi Shih
2023-06-26 13:09 ` David Rheinsberg
2023-06-21 3:25 ` Dustin Howett
2023-06-26 13:04 ` David Rheinsberg
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.