From: Jan.Glauber@cavium.com (Jan Glauber)
To: linux-arm-kernel@lists.infradead.org
Subject: DMA remote memcpy requests
Date: Mon, 15 Oct 2018 15:09:22 +0000
Message-ID: <20181015150912.GA8789@hc>
In-Reply-To: <DM6PR04MB405918E803AF4FCC34B12083F2FD0@DM6PR04MB4059.namprd04.prod.outlook.com>
On Mon, Oct 15, 2018 at 02:34:35PM +0000, Adam Cottrel wrote:
> Dear Robin/Jan/Will,
>
> Any thoughts on what I can do to further diagnose the root cause?
Hi Adam,
from your description this sounds like it:
- only happens under memory pressure
- only happens when you combine atheros DMA with something else (or does
the MMC stress test trigger any faults on its own?)
With that I would look through all the allocations in the atheros
driver and especially look for any missing error handling. But that's
just my 2 cents, maybe Robin or Will can give better advice here...
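As a rough illustration of the kind of error handling to look for, here is
a minimal userspace sketch of a receive-ring refill loop that stays
consistent when an allocation fails. It is not ath10k code: rx_ring,
alloc_rx_buf, alloc_budget and post_rx_bufs are hypothetical names, and the
allocator is a stand-in that fails on purpose to imitate memory pressure.

```c
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 8

/* Hypothetical receive ring: each slot holds one buffer posted to the device. */
struct rx_ring {
    void *buf[RING_SIZE];
    size_t write_index;   /* next slot to fill */
    size_t posted;        /* number of slots currently filled */
};

/* Stand-in allocator (assumption): fails once its budget is used up,
 * imitating an allocation failure under memory pressure. */
static int alloc_budget = 0;

static void *alloc_rx_buf(size_t len)
{
    if (alloc_budget <= 0)
        return NULL;
    alloc_budget--;
    return calloc(1, len);
}

/* Post up to 'want' buffers and return how many were actually posted.
 * On allocation failure the loop stops without touching the ring, so the
 * write index only ever covers slots that really hold a buffer. */
static size_t post_rx_bufs(struct rx_ring *ring, size_t want, size_t len)
{
    size_t done = 0;

    while (done < want && ring->posted < RING_SIZE) {
        void *p = alloc_rx_buf(len);

        if (!p)
            break;   /* do not advance the index on failure */

        ring->buf[ring->write_index] = p;
        ring->write_index = (ring->write_index + 1) % RING_SIZE;
        ring->posted++;
        done++;
    }
    return done;
}
```

The point of the sketch is only that the caller gets told how many buffers
were posted and that a failed allocation never leaves a slot advertised
without memory behind it. Whether the real driver has a path that misses
this is exactly what would need checking.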
Regards,
Jan
>
> Best,
> Adam
>
> > -----Original Message-----
> > From: Robin Murphy <robin.murphy@arm.com>
> > Sent: 12 October 2018 11:47
> > To: Adam Cottrel <adam.cottrel@veea.com>; Will Deacon
> > <will.deacon@arm.com>
> > Cc: linux-arm-kernel at lists.infradead.org; rric at kernel.org;
> > jglauber at cavium.com; jnair at caviumnetworks.com; sgoutham at cavium.com
> > Subject: Re: DMA remote memcpy requests
> >
> > On 12/10/18 10:48, Adam Cottrel wrote:
> > > Hi Will,
> > >
> > > Thank you for getting back to me.
> > >
> > >> [+Robin and Cavium folks -- it's usually best to cc people as well as
> > >> mailing the list]
> > > I will remember this for future. Thanks for the advice.
> > >
> > >>> I am using the ATH10K on Linux 4.14.4 with an Arm Cavium processor.
> > >>> During heavy loading, I am seeing that target initiated DMA requests
> > >>> are being silently dropped under extreme IO memory pressure and it
> > >>> is proving very difficult to isolate the root cause.
> > >>
> > >> Is this ThunderX 1 or 2 or something else? Can you reproduce the
> > >> issue with mainline?
> > > I am using:-
> > > model = "Cavium ThunderX CN81XX board";
> > > compatible = "cavium,thunder-81xx";
> > >
> > > Yes - the issue can be reproduced on the mainline, but here is a link
> > > to the code branch that I am using:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > > th/ath10k
> > >
> > >>> The ATH10K firmware uses the DMA API to set up phys_addr_t pointers
> > >>> (32-bit) which are then copied to a shared ring buffer. The target
> >
> > That's the first alarm bell - phys_addr_t is still going to be 64-bit on any arm64
> > platform. If the device is expecting 32-bit addresses but somehow doesn't
> > have its DMA mask set appropriately, then if you have more than 3GB or so
> > of RAM there's the potential for addresses to get truncated such that the
> > DMA *does* happen, but to the wrong place.
> >
> > However, with SMMU translation enabled (i.e. not just passthrough), then
> > I'd expect that same situation to cause more or less all DMA to fail, so if
> > you've successfully tested that setup it must be something much more
> > subtle :/
> >
> > >>> then initiates the memcpy operation (for target-to-host reads), but
> > >>> I do not have any means of debugging the target directly, and so I
> > >>> am looking for software hooks on the host that might help debug this
> > >>> complex problem.
> > >>
> > >> How does the firmware use the DMA API, or are you referring to a
> > >> driver? If the latter, could you point us to the code, please? Is it
> > >> using the streaming API, or is this a coherent allocation?
> > > The code is using the ARM64 DMA API. It cuts corners in places (!!) but
> > > for the most part, it follows the rules. In local tests, I have added
> > > memory barriers (e.g. dmb(SY)) and even put in low-level flush/invalidate
> > > calls (DC CIVAC) to try and eliminate cache-coherency type problems.
> > >
> > > The receive fault can be observed in the Rx handler which can be found
> > > on line 528 of ce.c:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > > th/ath10k/ce.c
> > >
> > > The memory is allocated by the Rx post buffer function which is on
> > > line 760 of pci.c:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > > th/ath10k/pci.c
> > >
> > > To better observe the fault, I made the following change:-
> > > + On allocation, I use memset to clear the skb->data (pci.c::770)
> > > + On receive, I check that the data is not zero (ce.c::555)
> > > + If the data is not yet written, I exit the Rx IRQ handler and try again.
> > >
> > > In tests, the code works as expected under normal operation, however
> > > once I start to simulate a heavy memory pressure situation then the Rx
> > > handler starts to fail. This failure (if allowed to continue) will
> > > eventually tear down the entire module and crash the target firmware
> > > because presumably they are seeing similar dropouts on the transmit path.
> > >
> > > When the fault is happening, if I poll the target registers (e.g. write
> > > counters over MMIO) I can see that they are still sending us new messages.
> > > In other words, they have silently failed to send the data, or rather we
> > > have silently failed to accept the memory copy. I am not able to access
> > > the target firmware directly, but I have been reliably informed that the
> > > DMA memcpy operation is initiated by the target.
> > >
> > > My memory pressure test uses a large dd copy to create a lot of dirty
> > > memory pages. This always creates the fault, however without any memory
> > > pressure the code runs beautifully...
> >
> > Are you able to characterise whether it's actually the memory pressure
> > itself that changes the behaviour (e.g. difficulty in allocating new SKBs),
> > or is it just that there's suddenly a lot more work going on in general?
> > Those aren't exactly the most powerful CPU cores, and with only 2 or 4 of
> > them it doesn't seem impossible that the system could simply get loaded to
> > the point where it can't keep up and starts dropping things on the floor.
> >
> > Robin.
> >
> > >>> Please can someone explain the low-level operation of DMA once it
> > >>> becomes a target initiated memcpy function?
> > >>
> > >> I think we need a better handle on the issue first.
> > >
> > > I fully agree - please tell me what you want to know :-D
> > >
> > >>> p.s. I have tested with and without the IOMMU, and I have eliminated
> > >>> issues such as cache coherency being the root cause.
> > >>
> > >> Right, not sure how the SMMU would help here.
> > >
> > > Understood, and thanks for taking the time to reply, and I look forward
> > > to hearing your thoughts as I would like to fix this issue once and for
> > > all.
> > >
> > > Best,
> > > Adam
> > >