From: Sricharan Ramabadhran <sricharan@codeaurora.org>
To: Konrad Dybcio <konrad.dybcio@somainline.org>,
Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Manivannan Sadhasivam <mani@kernel.org>,
pragalla@codeaurora.org, ~postmarketos/upstreaming@lists.sr.ht,
martin.botka@somainline.org,
angelogioacchino.delregno@somainline.org,
marijn.suijten@somainline.org, jamipkettunen@somainline.org,
Richard Weinberger <richard@nod.at>,
Vignesh Raghavendra <vigneshr@ti.com>,
linux-mtd@lists.infradead.org, linux-arm-msm@vger.kernel.org,
linux-kernel@vger.kernel.org, mdalam@codeaurora.org,
bbhatt@codeaurora.org, hemantk@codeaurora.org
Subject: Re: [PATCH] mtd: nand: raw: qcom_nandc: Don't clear_bam_transaction on READID
Date: Wed, 2 Feb 2022 12:54:42 +0530
Message-ID: <c63d5410-7f08-80fe-28ac-f4867038ff30@codeaurora.org>
In-Reply-To: <d79bf21d-5a90-0074-cef6-896f66e80d28@somainline.org>
Hi Konrad/Miquel,
On 2/1/2022 9:21 PM, Konrad Dybcio wrote:
>
> On 01/02/2022 14:52, Miquel Raynal wrote:
>> Hi Konrad,
>>
>> konrad.dybcio@somainline.org wrote on Mon, 31 Jan 2022 20:54:12 +0100:
>>
>>> On 31/01/2022 15:13, Sricharan Ramabadhran wrote:
>>>> Hi Konrad,
>>>>
>>>> On 1/31/2022 3:39 PM, Konrad Dybcio wrote:
>>>>> On 28/01/2022 18:50, Sricharan Ramabadhran wrote:
>>>>>> Hi Konrad,
>>>>>>
>>>>>> On 1/28/2022 9:55 AM, Sricharan Ramabadhran wrote:
>>>>>>> Hi Miquel,
>>>>>>>
>>>>>>> On 1/26/2022 4:12 PM, Miquel Raynal wrote:
>>>>>>>> Hi Mani,
>>>>>>>>
>>>>>>>> mani@kernel.org wrote on Wed, 26 Jan 2022 16:03:16 +0530:
>>>>>>>>> On Wed, Jan 26, 2022 at 11:16:13AM +0100, Miquel Raynal wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> miquel.raynal@bootlin.com wrote on Fri, 14 Jan 2022 08:27:18 +0100:
>>>>>>>>>>> Hi Konrad,
>>>>>>>>>>>
>>>>>>>>>>> konrad.dybcio@somainline.org wrote on Thu, 13 Jan 2022 19:44:26 +0100:
>>>>>>>>>>>> While I have absolutely 0 idea why and how, running clear_bam_transaction
>>>>>>>>>>>> when READID is issued makes the DMA totally clog up and refuse to function
>>>>>>>>>>>> at all on mdm9607. In fact, it is so bad that all the data gets garbled
>>>>>>>>>>>> and after a short while in the nand probe flow, the CPU decides that
>>>>>>>>>>>> seppuku is the only option.
>>>>>>>>>>>>
>>>>>>>>>>>> Removing _READID from the if condition makes it work like a charm, I can
>>>>>>>>>>>> read data and mount partitions without a problem.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Konrad Dybcio <konrad.dybcio@somainline.org>
>>>>>>>>>>>> ---
>>>>>>>>>>>> This is totally just an observation which took me an inhumane amount of
>>>>>>>>>>>> debug prints to find.. perhaps there's a better reason behind this, but
>>>>>>>>>>>> I can't seem to find any answers.. Therefore, this is a BIG RFC!
>>>>>>>>>>> I'm adding two people from codeaurora who worked a lot on this driver.
>>>>>>>>>>> Hopefully they will have an idea :)
>>>>>>>>>> Sadre, I've spent a significant amount of time reviewing your patches,
>>>>>>>>>> now it's your turn to not take a month to answer your peers'
>>>>>>>>>> proposals.
>>>>>>>>>>
>>>>>>>>>> Please help reviewing this patch.
>>>>>>>>> Sorry. I was hoping that Qcom folks would chime in as I don't have any idea
>>>>>>>>> about the mdm9607 platform. It could be that the mail server migration from
>>>>>>>>> codeaurora to quicinc put a barrier here.
>>>>>>>>>
>>>>>>>>> Let me ping them internally.
>>>>>>>> Oh, ok, I didn't know. Thanks!
>>>>>>> Sorry Miquel, somehow we did not get this email in our inbox.
>>>>>>> Thanks to Mani for pinging us, we will test this today and get back.
>>>>>> We could not reproduce this issue on our ipq boards (we do not have an
>>>>>> mdm9607 right now), and the cause is not obvious.
>>>>>> Can you please share the debug logs you collected for the above, stage by stage?
>>>>> I won't have access to the board for about two weeks, sorry.
>>>>>
>>>>> When I get to it, I'll surely try to send you the logs, though there
>>>>> wasn't much more than just something jumping to who-knows-where
>>>>> after clear_bam_transaction was called, resulting in values associated with
>>>>> the NAND being all zeroed out in pr_err/_debug/etc.
>>>>>
>>>> Ok sure. So was the READID command itself failing, or the subsequent one?
>>>> We can check which parameter reset by clear_bam_transaction is causing the
>>>> failure. Meanwhile, looping in Pradeep, who has access to the board and so
>>>> is in a better position to debug.
>>> I'm sorry I have so few details on hand, and no kernel tree (no
>>> access to that machine either, for now).
>>>
>>> I will try to describe to the best of my abilities what I recall.
>>>
>>> My methodology of making sure things don't go haywire was to print the oob size
>>> of our NAND basically every two lines of code (yes, I was very desperate at one point),
>>> as that was zeroed out when *the bug* happened,
>> This does look like a pointer error at some point and some kernel data
>> has been corrupted very badly by the driver.
>>
>>> leading to a kernel bug/panic/stall
>>> (can't recall what exactly it was, but it said something along the lines of
>>> "no support for oob size 0" and then it didn't fail gracefully, leading to
>>> some bad jumps and ultimately a dead platform..)
>>>
>>> after hours of digging, I found out that everything goes fine until
>>> clear_bam_transaction is called,
>> Do you remember if this function was called for the first time when
>> this happened?
>
> I think so, if I recall correctly there are no more callers in this
> path, as readid is the first nand command executed in flash probe flow.
>
>
>
>>
>>> after that gets executed every nand op starts reading all zeroes
>>> (for example in JEDEC ID check)
>>>
>>> so I added the changes from this patch, and things magically started working...
>>> My suspicion is that the underlying FIFO isn't fully drained (is it a FIFO on 9607?
>>> bah, I work on too many socs at once)
>> I don't see it in the list of supported devices, what's the exact
>> compatible used?
>
> qcom,ipq4019-nand
>
>
>
>>
>>> and this function only makes Linux think it is, without actually
>>> draining it, and the leftover commands get executed with some parts of
>>> them getting overwritten, resulting in the famous garbage in - garbage out
>>> situation, but that's only a guesstimate..
>> I would bet on a non-allocated bam-ish pointer that is reset to zero
>> in the clear_bam_transaction() helper.
>>
>> Can you get your hands on the board again?
>
> Sure, but as I mentioned previously, only in about 2 weeks, I can't
> really do any dev before then.. :(
>
>
>
>> It would be nice to check if the allocation always occurs before use,
>> and if so, for how many bytes.
>>
>> If the pointer is not dangling, then perhaps something else smashes
>> that pointer.
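To make the class of bug suspected above concrete, here is a purely hypothetical user-space sketch. None of these names, sizes, or layouts come from qcom_nandc.c, and nothing here claims this is what actually happens on mdm9607. It only shows how a reset routine that clears more entries than were reserved at allocation time can silently zero an unrelated neighbouring value (standing in for the oob size), and how a bounds check would catch it.

```c
#include <stddef.h>
#include <string.h>

/* All names and sizes below are invented for illustration only. */
#define ENTRY_SIZE 16   /* pretend size of one transaction entry */
#define ARENA_SIZE 128  /* one backing buffer, so every write stays in bounds */

struct arena {
	unsigned char mem[ARENA_SIZE];
	size_t txn_entries; /* entries reserved when the transaction was set up */
	size_t oob_off;     /* offset of the unrelated neighbouring value */
};

/*
 * Reserve 'entries_at_alloc' entries at the start of the buffer and place
 * the neighbouring value right behind them. The caller is assumed to keep
 * entries_at_alloc * ENTRY_SIZE + sizeof(unsigned int) within ARENA_SIZE.
 */
static void arena_init(struct arena *a, size_t entries_at_alloc,
		       unsigned int oobsize)
{
	memset(a->mem, 0, sizeof(a->mem));
	a->txn_entries = entries_at_alloc;
	a->oob_off = entries_at_alloc * ENTRY_SIZE;
	memcpy(a->mem + a->oob_off, &oobsize, sizeof(oobsize));
}

static unsigned int arena_oobsize(const struct arena *a)
{
	unsigned int v;

	memcpy(&v, a->mem + a->oob_off, sizeof(v));
	return v;
}

/* Unchecked reset: trusts a count that may have grown since allocation. */
static void reset_unchecked(struct arena *a, size_t entries_now)
{
	size_t bytes = entries_now * ENTRY_SIZE;

	if (bytes > ARENA_SIZE) /* keep the demo itself well defined */
		bytes = ARENA_SIZE;
	memset(a->mem, 0, bytes);
}

/* Checked reset: refuses to clear more than was reserved. */
static int reset_checked(struct arena *a, size_t entries_now)
{
	if (entries_now > a->txn_entries)
		return -1;
	memset(a->mem, 0, entries_now * ENTRY_SIZE);
	return 0;
}
```

With `arena_init(&a, 1, 64)`, calling `reset_checked(&a, 4)` returns -1 and leaves the stored value alone, whereas `reset_unchecked(&a, 4)` clears the first 64 bytes and `arena_oobsize(&a)` reads back as 0 afterwards.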
>
>
> Konrad
>
>>
>>> Do note this somehow worked fine on 5.11 and then broke on 5.12/13.
>>> I went as far as replacing most of the kernel with the updated/downgraded
>>> parts via git checkout (I tried many combinations), to no avail.. I even
>>> tried different compilers and optimization levels, thinking it could have
>>> been a codegen issue, but no luck either.
>>>
>>> I.. do understand this email is a total mess to read, as much as it was to
>>> write, but without access to my code and the machine itself I can't give you
>>> solid details, and the fact this situation is far from ordinary doesn't help
>>> either..
>>>
>>> The latest (ancient, not quite pretty, but probably working if my memory is
>>> correct) version of my patches for the mdm9607 is available at [1], I will
>>> push the new revision after I get access to the workstation.
>>>
+ a few more people who have access to the board.

Going by the description, this looks like kernel memory corruption, so we
can try out a KASAN build.
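A minimal sketch of a config fragment for such a build, assuming the tree in use supports KASAN on this architecture (not verified here for mdm9607). The symbols are upstream Kconfig options; picking outline instrumentation is only a suggestion to keep the image small:

```
# Assumption: generic (software) KASAN is available for the target arch.
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
# Outline instrumentation: slower than inline, but a smaller kernel image.
CONFIG_KASAN_OUTLINE=y
# Lets KASAN reports include allocation and free stack traces.
CONFIG_STACKTRACE=y
```

With this enabled, an out-of-bounds or use-after-free write in the probe path should be reported at the faulting access, rather than surfacing later as a zeroed oob size.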
Since you mentioned it worked until 5.11, did you bisect the
driver up to the 5.11 head and confirm that it worked?
Regards,
Sricharan