* Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs on-disk corruption [Was: .../ btrfs going read-only]
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-10-02 12:29 UTC (permalink / raw)
To: linux-kernel-bugs, Linux regressions mailing list,
Basavaraj Natikar
Cc: Jiri Kosina, linux-input, Benjamin Tissoires,
akshata.mukundshetty, LKML, Skyler, Richard, linux-btrfs,
Limonciello, Mario
[CCing Richard, who apparently faces the same problem according to a
recent comment in the bugzilla ticket mentioned earlier:
https://bugzilla.kernel.org/show_bug.cgi?id=219331#c8
CCing Mario, who might be interested in this and is a good contact when
it comes to issues with AMD stuff like this.
CCing the Btrfs list as JFYI, as all three reporters afaics see Btrfs
misbehavior or corruptions due to this.
Considered bringing Linus in, but decided to wait a bit before doing so.]
On 01.10.24 23:40, Chris Hixon wrote:
> On 10/1/2024, 12:56:49 PM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
>> Basavaraj Natikar, I noticed a report about a regression in
>> bugzilla.kernel.org that appears to be caused by a change of yours:
>>
>> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
>> [v6.9-rc1]
>>
>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>> I decided to write this mail. To quote from
>> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>>
>>> I am getting bad page map errors on kernel version 6.9 or newer.
>>> They always appear within a few minutes of the system being on, if
>>> not immediately upon booting. My system is a Dell Inspiron 7405.
> [...]
>>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>>> [ 23.580724] rfkill: input handler enabled
>>> [ 25.652067] rfkill: input handler disabled
>
>>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
>
> No sensors detected - do we all have that in common?
Skyler, Richard?
>>> [...]
>> See the ticket for more details and the bisection result. Skyler, the
>> reporter (CCed), later also added:
>>
>>> Occasionally I will not get the usual bad page map error, but
>>> instead some BTRFS errors followed by the file system going read-only.
>>
>> Note, we had an earlier regression caused by this change reported by
>> Chris Hixon that maybe was not solved completely:
>> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@hixontech.com/
>
> This looks like the same issue I reported.
And sounds a lot like what Richard sees, who also sees disk corruption
with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
>> Chris Hixon: do you still encounter errors, or was your issue
>> resolved/vanished somehow?
>
> I still encounter errors with every kernel/patch I've tested. I've blacklisted
> the amd_sfh module as a workaround, but when the module is inserted, a crash
> similar to those reported will happen soon after the (45 second?)
> detection/initialization timeout. It seems to affect whatever part of the
> kernel next becomes active. I've had disk corruption as well, when BTRFS is
> affected by the memory corruption,
Skyler, did you see btrfs disk corruption as well, just like Chris and
Richard did?
> so I've ended up testing on a USB stick I
> can reformat if necessary. I haven't tested new patches/kernels in a while
> though. I'll get back to you after I've tried the latest mainline. Also note
> that I've tried Fedora Rawhide's debug kernel,
From what I see it seems all three of you are using Fedora. Wonder if
that is a coincidence.
> which has a ton of debugging
> options including KASAN, but nothing seems to point the finger at something
> originating in amd_sfh code. Is it possible the hardware itself (the mp2/sfh
> chip) is corrupting memory somehow after some misstep in
> initialization/de-initialization? Also if you look at my report, you'll see I
> have no devices/sensors detected by amd_sfh - I wonder if other reporters all
> have this in common? (noted in dmesg output above from another user)
Given that Basavaraj Natikar never really addressed Chris's earlier report
from months ago and the severity of the problem, I wonder if we
should revert the culprit to resolve this quickly, unless some proper
fix comes into sight soon. Sadly, from a quick look, that would require
multiple reverts afaics. :-/
Ciao, Thorsten
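
The "no sensors detected" question raised above can be checked mechanically against a saved dmesg log. The following is a minimal sketch, assuming the two `pcie_mp2_amd` message formats quoted in this thread are representative. The function name `find_sfh_failures` and the regular expressions are illustrative only and are not part of any kernel tooling.

```python
import re

# Patterns modeled on the two pcie_mp2_amd lines quoted in this thread.
# They are assumptions about the message format, not a stable kernel interface.
DISCOVER_RE = re.compile(
    r"pcie_mp2_amd (?P<dev>[0-9a-f:.]+): Failed to discover, "
    r"sensors not enabled is (?P<mask>\d+)"
)
INIT_FAIL_RE = re.compile(
    r"pcie_mp2_amd (?P<dev>[0-9a-f:.]+): "
    r"amd_sfh_hid_client_init failed err (?P<err>-?\d+)"
)


def find_sfh_failures(log_text):
    """Collect amd_sfh discovery/init failures per PCI device from a dmesg dump."""
    devices = {}
    for line in log_text.splitlines():
        m = DISCOVER_RE.search(line)
        if m:
            devices.setdefault(m["dev"], {})["sensor_mask"] = int(m["mask"])
        m = INIT_FAIL_RE.search(line)
        if m:
            # On Linux, errno 95 is EOPNOTSUPP.
            devices.setdefault(m["dev"], {})["init_err"] = int(m["err"])
    return devices


# Sample taken from the log lines quoted in the report.
sample = """\
[ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
[ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
"""
result = find_sfh_failures(sample)
print(result)
```

A device entry with `sensor_mask` 0 and `init_err` -95 would match the signature the reporters are comparing; an empty result only means these two messages were not found, not that the system is unaffected.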
* Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs on-disk corruption [Was: .../ btrfs going read-only]
From: Basavaraj Natikar @ 2024-10-02 14:02 UTC (permalink / raw)
To: Richard Shaw, Linux regressions mailing list
Cc: linux-kernel-bugs, Basavaraj Natikar, Jiri Kosina, linux-input,
Benjamin Tissoires, akshata.mukundshetty, LKML, Skyler,
linux-btrfs, Limonciello, Mario
On 10/2/2024 6:19 PM, Richard Shaw wrote:
> On Wed, Oct 2, 2024 at 7:30 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>
> >> Basavaraj Natikar, I noticed a report about a regression in
> >> bugzilla.kernel.org that appears to be caused by a change of yours:
> >>
> >> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
> >> [v6.9-rc1]
> >>
> >> As many (most?) kernel developers don't keep an eye on the bug tracker,
> >> I decided to write this mail. To quote from
> >> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
> >>
> >>> I am getting bad page map errors on kernel version 6.9 or newer.
> >>> They always appear within a few minutes of the system being on, if
> >>> not immediately upon booting. My system is a Dell Inspiron 7405.
> > [...]
> >>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
> >>> [ 23.580724] rfkill: input handler enabled
> >>> [ 25.652067] rfkill: input handler disabled
> >
> >>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
> >>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
> >
> > No sensors detected - do we all have that in common?
>
On all of the systems that show this issue, no sensors are supported.
>
> My last log was with 6.11.0-debug[1] and found this:
>
> [ 40.178603] kernel: pcie_mp2_amd 0000:04:00.7: Failed to discover, sensors not enabled is 0
> [ 40.178904] kernel: pcie_mp2_amd 0000:04:00.7: amd_sfh_hid_client_init failed err -95
> [ 43.913688] kernel: Oops: general protection fault, probably for non-canonical address 0x3ffe71b40000848: 0000 [#1] PREEMPT SMP KASAN NOPTI
Since I am unable to reproduce this issue, I added a debug patch to the bug ID.
Could you please try it?
Thanks,
--
Basavaraj
>
> Interestingly the first OOPS was right after the amd_sfh tried to load
> (if I'm interpreting the above correctly).
>
> >> See the ticket for more details and the bisection result. Skyler, the
> >> reporter (CCed), later also added:
> >>
> >>> Occasionally I will not get the usual bad page map error, but
> >>> instead some BTRFS errors followed by the file system going read-only.
> >>
> >> Note, we had an earlier regression caused by this change reported by
> >> Chris Hixon that maybe was not solved completely:
> >> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@hixontech.com/
> >
> > This looks like the same issue I reported.
>
> And sounds a lot like what Richard sees, who also sees disk corruption
> with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
>
> <snip>
>
> > I still encounter errors with every kernel/patch I've tested. I've blacklisted
> > the amd_sfh module as a workaround, but when the module is inserted, a crash
> > similar to those reported will happen soon after the (45 second?)
> > detection/initialization timeout. It seems to affect whatever part of the
> > kernel next becomes active. I've had disk corruption as well, when BTRFS is
> > affected by the memory corruption,
>
> Skyler, did you see btrfs disk corruption as well, just like Chris and
> Richard did?
>
>
> Yes, most of the time the btrfs write checker catches the problem but
> not always. I've had to reinstall F40 3 times while debugging this
> issue for uncorrectable errors. When I run the debug kernel I think it
> brings the system to a halt so fast it doesn't have time to write the
> corruption to disk.
>
> From what I see it seems all three of you are using Fedora. Wonder if
> that is a coincidence.
>
>
> Possibly. Can't say there isn't some patch we're using that's helping
> cause or expose the issue but Fedora tends to run the newest packages
> (including the Linux kernel) so can sometimes be the early warning
> system for other distros.
>
> Thanks,
> Richard
>
> [1] https://bugzilla-attachments.redhat.com/attachment.cgi?id=2049688
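
Richard's observation that the first Oops came right after amd_sfh tried to load can be quantified from the timestamps. The following is a minimal sketch, assuming dmesg lines in the `[ seconds.micros] message` form quoted above. The helper name `seconds_from_sfh_failure_to_oops` is hypothetical, and a short gap is only a hint, not proof of causation.

```python
import re

# Matches "[ 40.178904] rest of message" as seen in the quoted logs.
TS_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")


def seconds_from_sfh_failure_to_oops(log_text):
    """Seconds between the amd_sfh init failure and the first later Oops, or None."""
    fail_ts = None
    for line in log_text.splitlines():
        m = TS_RE.match(line.strip())
        if not m:
            continue
        ts, msg = float(m["ts"]), m["msg"]
        if fail_ts is None and "amd_sfh_hid_client_init failed" in msg:
            fail_ts = ts
        elif fail_ts is not None and "Oops:" in msg:
            return round(ts - fail_ts, 6)
    return None


# The three lines from Richard's 6.11.0-debug log, unwrapped.
sample = """\
[ 40.178603] kernel: pcie_mp2_amd 0000:04:00.7: Failed to discover, sensors not enabled is 0
[ 40.178904] kernel: pcie_mp2_amd 0000:04:00.7: amd_sfh_hid_client_init failed err -95
[ 43.913688] kernel: Oops: general protection fault, probably for non-canonical address 0x3ffe71b40000848: 0000 [#1] PREEMPT SMP KASAN NOPTI
"""
delta = seconds_from_sfh_failure_to_oops(sample)
print(delta)
```

For the quoted log, the gap is a little under four seconds.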
* Re: [regression] AMD SFH Driver Causes Memory Errors / Page Faults / btrfs on-disk corruption [Was: .../ btrfs going read-only]
From: Chris Hixon @ 2024-10-03 17:49 UTC (permalink / raw)
To: Linux regressions mailing list, Basavaraj Natikar
Cc: Jiri Kosina, linux-input, Benjamin Tissoires,
akshata.mukundshetty, LKML, Skyler, Richard, linux-btrfs,
Limonciello, Mario
On 10/2/2024, 6:29:59 AM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
> [CCing Richard, who apparently faces the same problem according to a
> recent comment in the bugzilla ticket mentioned earlier:
> https://bugzilla.kernel.org/show_bug.cgi?id=219331#c8
>
> CCing Mario, who might be interested in this and is a good contact when
> it comes to issues with AMD stuff like this.
>
> CCing the Btrfs list as JFYI, as all three reporters afaics see Btrfs
> misbehavior or corruptions due to this.
>
> Considered bringing Linus in, but decided to wait a bit before doing so.]
This patch from Basavaraj Natikar seems to solve the issue for me:
https://lore.kernel.org/linux-input/20241003160454.3017229-1-Basavaraj.Natikar@amd.com/
Tested-by: Chris Hixon <linux-kernel-bugs@hixontech.com>
My original report:
https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@hixontech.com/
Reported-by: Chris Hixon <linux-kernel-bugs@hixontech.com>
Thanks!
>
> On 01.10.24 23:40, Chris Hixon wrote:
>> On 10/1/2024, 12:56:49 PM, "Linux regression tracking (Thorsten Leemhuis)" wrote:
>
>>> Basavaraj Natikar, I noticed a report about a regression in
>>> bugzilla.kernel.org that appears to be caused by a change of yours:
>>>
>>> 2105e8e00da467 ("HID: amd_sfh: Improve boot time when SFH is available")
>>> [v6.9-rc1]
>>>
>>> As many (most?) kernel developers don't keep an eye on the bug tracker,
>>> I decided to write this mail. To quote from
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219331 :
>>>
>>>> I am getting bad page map errors on kernel version 6.9 or newer.
>>>> They always appear within a few minutes of the system being on, if
>>>> not immediately upon booting. My system is a Dell Inspiron 7405.
>> [...]
>>>> [ 23.234632] systemd-journald[611]: File /var/log/journal/a4e3170bc5be4f52a2080fb7b9f93cf0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
>>>> [ 23.580724] rfkill: input handler enabled
>>>> [ 25.652067] rfkill: input handler disabled
>>
>>>> [ 34.222362] pcie_mp2_amd 0000:03:00.7: Failed to discover, sensors not enabled is 0
>>>> [ 34.222379] pcie_mp2_amd 0000:03:00.7: amd_sfh_hid_client_init failed err -95
>>
>> No sensors detected - do we all have that in common?
>
> Skyler, Richard?
>
>>>> [...]
>>> See the ticket for more details and the bisection result. Skyler, the
>>> reporter (CCed), later also added:
>>>
>>>> Occasionally I will not get the usual bad page map error, but
>>>> instead some BTRFS errors followed by the file system going read-only.
>>>
>>> Note, we had an earlier regression caused by this change reported by
>>> Chris Hixon that maybe was not solved completely:
>>> https://lore.kernel.org/all/3b129b1f-8636-456a-80b4-0f6cce0eef63@hixontech.com/
>>
>> This looks like the same issue I reported.
>
> And sounds a lot like what Richard sees, who also sees disk corruption
> with Btrfs (see https://bugzilla.redhat.com/show_bug.cgi?id=2314331 ).
>
>>> Chris Hixon: do you still encounter errors, or was your issue
>>> resolved/vanished somehow?
>>
>> I still encounter errors with every kernel/patch I've tested. I've blacklisted
>> the amd_sfh module as a workaround, but when the module is inserted, a crash
>> similar to those reported will happen soon after the (45 second?)
>> detection/initialization timeout. It seems to affect whatever part of the
>> kernel next becomes active. I've had disk corruption as well, when BTRFS is
>> affected by the memory corruption,
>
> Skyler, did you see btrfs disk corruption as well, just like Chris and
> Richard did?
>
>> so I've ended up testing on a USB stick I
>> can reformat if necessary. I haven't tested new patches/kernels in a while
>> though. I'll get back to you after I've tried the latest mainline. Also note
>> that I've tried Fedora Rawhide's debug kernel,
>
> From what I see it seems all three of you are using Fedora. Wonder if
> that is a coincidence.
Note: I don't think it's a Fedora issue. I've had the problem on multiple
distros, with any kernel >= 6.9 - anything with the "bad" commit.
>> which has a ton of debugging
>> options including KASAN, but nothing seems to point the finger at something
>> originating in amd_sfh code. Is it possible the hardware itself (the mp2/sfh
>> chip) is corrupting memory somehow after some misstep in
>> initialization/de-initialization? Also if you look at my report, you'll see I
>> have no devices/sensors detected by amd_sfh - I wonder if other reporters all
>> have this in common? (noted in dmesg output above from another user)
>
> Given that Basavaraj Natikar never really addressed Chris's earlier report
> from months ago and the severity of the problem, I wonder if we
> should revert the culprit to resolve this quickly, unless some proper
> fix comes into sight soon. Sadly, from a quick look, that would require
> multiple reverts afaics. :-/
>
> Ciao, Thorsten
>
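
Chris's note that the problem shows up on any kernel >= 6.9 suggests a quick triage check on a `uname -r` style release string. The following is a minimal sketch, assuming only what the thread states: commit 2105e8e00da467 first appeared in v6.9-rc1. The function name `may_contain_commit` is hypothetical, and the check knows nothing about later fixes, reverts, or distro backports.

```python
import re

# Per the thread, the suspected commit first appeared in v6.9-rc1.
FIRST_WITH_COMMIT = (6, 9)


def may_contain_commit(release):
    """True if a `uname -r` style release string is 6.9 or newer."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(m[1]), int(m[2])) >= FIRST_WITH_COMMIT


# Illustrative release strings; only the leading major.minor is inspected.
print(may_contain_commit("6.11.0-debug"))
print(may_contain_commit("6.8.12"))
```

A `True` result only says the kernel is new enough to include the commit; whether amd_sfh actually misbehaves still depends on the hardware and on any fix applied since.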