From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan.Glauber@cavium.com (Jan Glauber)
Date: Mon, 15 Oct 2018 15:09:22 +0000
Subject: DMA remote memcpy requests
In-Reply-To:
References: <20181012090937.GA12289@arm.com>
Message-ID: <20181015150912.GA8789@hc>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Mon, Oct 15, 2018 at 02:34:35PM +0000, Adam Cottrel wrote:
> Dear Robin/Jan/Will,
>
> Any thoughts on what I can do to further diagnose the root cause?

Hi Adam,

from your description this sounds like it:
- only happens under memory pressure
- only happens when you combine Atheros DMA with something else
  (or does the MMC stress test trigger any faults on its own?)

With that I would look through all the allocations in the Atheros driver
and especially look for any missing error handling.

But that's just my 2 cents, maybe Robin or Will can give better advice
here...

Regards,
Jan

>
> Best,
> Adam
>
> > -----Original Message-----
> > From: Robin Murphy
> > Sent: 12 October 2018 11:47
> > To: Adam Cottrel ; Will Deacon
> >
> > Cc: linux-arm-kernel at lists.infradead.org; rric at kernel.org;
> > jglauber at cavium.com; jnair at caviumnetworks.com; sgoutham at cavium.com
> > Subject: Re: DMA remote memcpy requests
> >
> > On 12/10/18 10:48, Adam Cottrel wrote:
> > > Hi Will,
> > >
> > > Thank you for getting back to me.
> > >
> > >> [+Robin and Cavium folks -- it's usually best to cc people as well as
> > >> mailing the list]
> > > I will remember this for the future. Thanks for the advice.
> > >
> > >>> I am using the ATH10K on Linux 4.14.4 with an Arm Cavium processor.
> > >>> During heavy loading, I am seeing that target-initiated DMA requests
> > >>> are being silently dropped under extreme IO memory pressure and it
> > >>> is proving very difficult to isolate the root cause.
> > >>
> > >> Is this ThunderX 1 or 2 or something else? Can you reproduce the
> > >> issue with mainline?
> > > I am using:-
> > > model = "Cavium ThunderX CN81XX board";
> > > compatible = "cavium,thunder-81xx";
> > >
> > > Yes - the issue can be reproduced on the mainline, but here is a link
> > > to the code branch that I am using:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k
> > >
> > >>> The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
> > >>> (32-bit) which are then copied to a shared ring buffer. The target
> >
> > That's the first alarm bell - phys_addr_t is still going to be 64-bit
> > on any arm64 platform. If the device is expecting 32-bit addresses but
> > somehow doesn't have its DMA mask set appropriately, then if you have
> > more than 3GB or so of RAM there's the potential for addresses to get
> > truncated such that the DMA *does* happen, but to the wrong place.
> >
> > However, with SMMU translation enabled (i.e. not just passthrough),
> > then I'd expect that same situation to cause more or less all DMA to
> > fail, so if you've successfully tested that setup it must be something
> > much more subtle :/
> >
> > >>> then initiates the memcpy operation (for target-to-host reads), but
> > >>> I do not have any means of debugging the target directly, and so I
> > >>> am looking for software hooks on the host that might help debug
> > >>> this complex problem.
> > >>
> > >> How does the firmware use the DMA API, or are you referring to a
> > >> driver? If the latter, could you point us to the code, please? Is it
> > >> using the streaming API, or is this a coherent allocation?
> > > The code is using the ARM64 DMA API. It cuts corners in places (!!)
> > > but for the most part, it follows the rules. In local tests, I have
> > > added memory barriers (e.g. dmb(SY)) and even put in low-level
> > > flush/invalidate calls (DC CIVAC) to try and eliminate
> > > cache-coherency type problems.
> > >
> > > The receive fault can be observed in the Rx handler, which can be
> > > found on line 528 of ce.c:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k/ce.c
> > >
> > > The memory is allocated by the Rx post buffer function, which is on
> > > line 760 of pci.c:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k/pci.c
> > >
> > > To better observe the fault, I made the following changes:-
> > > + On allocation, I use memset to clear the skb->data (pci.c::770)
> > > + On receive, I check that the data is not zero (ce.c::555)
> > > + If the data is not yet written, I exit the Rx IRQ handler and try
> > >   again.
> > >
> > > In tests, the code works as expected under normal operation; however,
> > > once I start to simulate a heavy memory pressure situation, the Rx
> > > handler starts to fail. This failure (if allowed to continue) will
> > > eventually tear down the entire module and crash the target firmware,
> > > presumably because they are seeing similar dropouts on the transmit
> > > path.
> > >
> > > When the fault is happening, if I poll the target registers (e.g.
> > > write counters over MMIO) I can see that they are still sending us
> > > new messages. In other words, they have silently failed to send the
> > > data, or rather we have silently failed to accept the memory copy. I
> > > am not able to access the target firmware directly, but I have been
> > > reliably informed that the DMA memcpy operation is initiated by the
> > > target.
> > >
> > > My memory pressure test uses a large dd copy to create a lot of
> > > dirty memory pages. This always creates the fault; however, without
> > > any memory pressure the code runs beautifully...
> >
> > Are you able to characterise whether it's actually the memory pressure
> > itself that changes the behaviour (e.g. difficulty in allocating new
> > SKBs), or is it just that there's suddenly a lot more work going on in
> > general?
> > Those aren't exactly the most powerful CPU cores, and with only 2 or 4
> > of them it doesn't seem impossible that the system could simply get
> > loaded to the point where it can't keep up and starts dropping things
> > on the floor.
> >
> > Robin.
> >
> > >>> Please can someone explain the low-level operation of DMA once it
> > >>> becomes a target-initiated memcpy function?
> > >>
> > >> I think we need a better handle on the issue first.
> > >
> > > I fully agree - please tell me what you want to know :-D
> > >
> > >>> p.s. I have tested with and without the IOMMU, and I have
> > >>> eliminated issues such as cache coherency being the root cause.
> > >>
> > >> Right, not sure how the SMMU would help here.
> > >
> > > Understood, and thanks for taking the time to reply. I look forward
> > > to hearing your thoughts, as I would like to fix this issue once and
> > > for all.
> > >
> > > Best,
> > > Adam
> > >
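[Editorial note: the kind of missing error handling Jan suggests auditing
for is, classically, an unchecked streaming-DMA mapping. A sketch of the
expected pattern follows; the function and variable names are
illustrative, not the actual ath10k code, and this fragment only builds
inside a kernel tree.]

```c
/* Hypothetical Rx-buffer post path (not the real ath10k functions). */
static int post_rx_buffer(struct device *dev, struct sk_buff *skb, u32 len)
{
	dma_addr_t paddr;

	paddr = dma_map_single(dev, skb->data, len, DMA_FROM_DEVICE);

	/* Under memory or IOMMU pressure the mapping itself can fail.
	 * A driver that skips this check hands a bogus address to the
	 * device, and the resulting DMA is silently lost - which is
	 * consistent with the symptom described in this thread. */
	if (dma_mapping_error(dev, paddr))
		return -ENOMEM;

	/* ... write paddr into the ring descriptor and publish it ... */
	return 0;
}
```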