* Bad DMA from Marvell 9230
@ 2014-03-27 6:57 Benjamin Herrenschmidt
2014-03-27 15:19 ` Tejun Heo
2014-05-30 7:06 ` Jérôme Carretero
0 siblings, 2 replies; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2014-03-27 6:57 UTC (permalink / raw)
To: Tejun Heo; +Cc: Bartlomiej Zolnierkiewicz, linux-ide, LKML
Hi Folks !
Does this ring any bells?
I've been trying a 9230 on a power box here (a 9235 on the same machine
works fine) and it blows up with an IOMMU violation early during init.
From what I can tell the scenario is:
- We still haven't issued any command per se; all our DMA command
buffers etc. are still all 0's at the point of the error.
- The core libata calls the AHCI driver's ahci_hardreset() for each
port in a separate thread. They all call sata_link_hardreset().
- This in turn calls sata_link_resume(), which writes to the SCR_CONTROL
register as follows:
	scontrol = (scontrol & 0x0f0) | 0x300;
	if ((rc = sata_scr_write(link, SCR_CONTROL, scontrol)))
	{
		printk(" -> sata_link_resume FAIL 2\n");
		return rc;
	}
	/*
	 * Some PHYs react badly if SStatus is pounded
	 * immediately after resuming. Delay 200ms before
	 * debouncing.
	 */
	ata_msleep(link->ap, 200);
I get the interrupt from the IOMMU about 2ms after the write to
SCR_CONTROL.
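For readers less familiar with the register, the SCR_CONTROL write above can be decoded as follows. This is a standalone sketch, not kernel code: the field layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) follows the SATA SControl definition, while the macro names and resume_scontrol() are made up here for illustration.

```c
#include <stdint.h>

/* SControl fields: DET[3:0], SPD[7:4], IPM[11:8]. */
#define SCR_DET(v) ((uint32_t)(v) & 0xf)
#define SCR_SPD(v) (((uint32_t)(v) >> 4) & 0xf)
#define SCR_IPM(v) (((uint32_t)(v) >> 8) & 0xf)

/*
 * Hypothetical helper using the same expression as the snippet above:
 * keep the speed limit (SPD), clear DET so no reset is requested, and
 * set IPM to 3 (Partial and Slumber transitions both disabled).
 */
uint32_t resume_scontrol(uint32_t scontrol)
{
	return (scontrol & 0x0f0) | 0x300;
}
```

For instance, resume_scontrol(0x321) gives 0x320: DET goes from 1 to 0, SPD stays at 2 and IPM stays at 3. So the write that precedes the IOMMU fault is the one that releases the PHY reset request, which fits the fault showing up shortly after it.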
Now, barring misinterpretation of some bits on my side, it looks like
the bad DMA is a DMA *read* from address 0 (which we never map,
typically to catch driver bugs).
I went through a few theories with this one but so far none held. I
don't think it's a D2H FIS issue since the DMA pointers for that appear
to be set up properly, the memory is mapped, etc.
I thought the chip might incorrectly/inadvertently try to (pre)fetch a
command. At that point all 32 command slots are 0's, so if it
ignored the size it might try to fetch from command address 0.
So I added a loop to fill all 32 slots with a valid command address
in ahci_hardreset:
+	for (i = 0; i < 32; i++)
+		ahci_fill_cmd_slot(pp, i, 0);
 	rc = sata_link_hardreset(link, timing, deadline, &online,
 				 ahci_check_ready);
But that had basically no effect.
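To make the prefetch theory concrete, here is a simplified userspace model of the command list. It is only a sketch under stated assumptions: struct cmd_hdr mirrors the 32-byte AHCI command header layout but ignores endianness, and slots_pointing_at_zero() and fill_all_slots() are hypothetical helpers, not the kernel's ahci_fill_cmd_slot(). The base address and stride passed in are arbitrary illustration values.

```c
#include <stdint.h>

#define NUM_SLOTS 32

/* Simplified command header: 32 bytes per slot, 32 slots per port. */
struct cmd_hdr {
	uint32_t opts;        /* flags, command FIS length, PRDT length */
	uint32_t status;      /* byte count transferred */
	uint32_t tbl_addr;    /* command table base, low 32 bits */
	uint32_t tbl_addr_hi; /* command table base, high 32 bits */
	uint32_t reserved[4];
};

/* Hypothetical: count slots whose command table base address is 0. */
int slots_pointing_at_zero(const struct cmd_hdr *list)
{
	int i, n = 0;

	for (i = 0; i < NUM_SLOTS; i++)
		if (list[i].tbl_addr == 0 && list[i].tbl_addr_hi == 0)
			n++;
	return n;
}

/* Hypothetical: point every slot at its own (non-zero) command table. */
void fill_all_slots(struct cmd_hdr *list, uint64_t tbl_base, uint32_t stride)
{
	int i;

	for (i = 0; i < NUM_SLOTS; i++) {
		uint64_t addr = tbl_base + (uint64_t)i * stride;

		list[i].opts = 0;
		list[i].status = 0;
		list[i].tbl_addr = (uint32_t)(addr & 0xffffffff);
		list[i].tbl_addr_hi = (uint32_t)(addr >> 32);
	}
}
```

With a zeroed list every slot points at address 0, so a chip that fetched a command table without checking whether the slot was actually issued would read from 0. After filling, no slot points there. Since the fill loop made no difference on the real hardware, the stray read presumably does not come from a command table fetch.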
I've contacted Marvell, but I was wondering if anybody here had already
experienced something similar or has an idea of what else the chip
might be doing wrong so we can try to find a workaround ?
Cheers,
Ben.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Bad DMA from Marvell 9230
2014-03-27 6:57 Bad DMA from Marvell 9230 Benjamin Herrenschmidt
@ 2014-03-27 15:19 ` Tejun Heo
2014-04-05 2:35 ` Robert Hancock
2014-05-30 7:06 ` Jérôme Carretero
1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2014-03-27 15:19 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Bartlomiej Zolnierkiewicz, linux-ide, LKML
On Thu, Mar 27, 2014 at 05:57:37PM +1100, Benjamin Herrenschmidt wrote:
> I've contacted Marvell, but I was wondering if anybody here had already
> experienced something similar or has an idea of what else the chip
> might be doing wrong so we can try to find a workaround ?
No idea. First time to hear such problem. :(
--
tejun
* Re: Bad DMA from Marvell 9230
2014-03-27 15:19 ` Tejun Heo
@ 2014-04-05 2:35 ` Robert Hancock
0 siblings, 0 replies; 11+ messages in thread
From: Robert Hancock @ 2014-04-05 2:35 UTC (permalink / raw)
To: Tejun Heo, Benjamin Herrenschmidt
Cc: Bartlomiej Zolnierkiewicz, linux-ide, LKML
On 27/03/14 09:19 AM, Tejun Heo wrote:
> On Thu, Mar 27, 2014 at 05:57:37PM +1100, Benjamin Herrenschmidt wrote:
>> I've contacted Marvell, but I was wondering if anybody here had already
>> experienced something similar or has an idea of what else the chip
>> might be doing wrong so we can try to find a workaround ?
>
> No idea. First time to hear such problem. :(
>
There are other Marvell controllers that do DMA requests from the wrong
PCI function ID and cause IOMMU issues, so it seems like testing on such
systems (or just validating the DMA transactions done by the controller
by some other means) isn't something that Marvell likes to do.
Presumably reading from address 0 is normally fine without an IOMMU, so
this problem wouldn't be noticed otherwise.
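As background on the "wrong PCI function ID" case: an IOMMU looks up translations by the requester ID carried in each DMA transaction, so a request tagged with a function the OS never set up has no mappings at all. The sketch below shows the encoding (bus in bits 15:8, device in bits 7:3, function in bits 2:0); requester_id() and dma_matches_owner() are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* PCI requester ID: bus[15:8], device[7:3], function[2:0]. */
uint16_t requester_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
	return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/*
 * Hypothetical check: does the ID on a DMA request match the function
 * whose mappings were set up? A mismatch is what an IOMMU would reject.
 */
int dma_matches_owner(uint16_t owner_id, uint16_t dma_id)
{
	return owner_id == dma_id;
}
```

A controller sitting at 01:00.0 has requester ID 0x0100; if it issues DMA as 01:00.1 the ID becomes 0x0101 and the lookup fails even though the address itself may be perfectly valid. That is a different failure from a read of address 0 issued with the correct ID.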
* Re: Bad DMA from Marvell 9230
2014-03-27 6:57 Bad DMA from Marvell 9230 Benjamin Herrenschmidt
2014-03-27 15:19 ` Tejun Heo
@ 2014-05-30 7:06 ` Jérôme Carretero
2014-05-30 10:37 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 11+ messages in thread
From: Jérôme Carretero @ 2014-05-30 7:06 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Tejun Heo, Bartlomiej Zolnierkiewicz, linux-ide, LKML
On Thu, 27 Mar 2014 17:57:37 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> I've been trying a 9230 on a power box here (a 9235 on the same
> machine works fine) and it blows up with an IOMMU violation early
> during init.
Hi,
That's https://bugzilla.kernel.org/show_bug.cgi?id=42679
if you haven't already found it.
--
Jérôme
* Re: Bad DMA from Marvell 9230
2014-05-30 7:06 ` Jérôme Carretero
@ 2014-05-30 10:37 ` Benjamin Herrenschmidt
2014-05-30 13:58 ` Jérôme Carretero
0 siblings, 1 reply; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-30 10:37 UTC (permalink / raw)
To: Jérôme Carretero
Cc: Tejun Heo, Bartlomiej Zolnierkiewicz, linux-ide, LKML
On Fri, 2014-05-30 at 03:06 -0400, Jérôme Carretero wrote:
> On Thu, 27 Mar 2014 17:57:37 +1100
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> > I've been trying a 9230 on a power box here (a 9235 on the same
> > machine works fine) and it blows up with an IOMMU violation early
> > during init.
>
> Hi,
>
> That's https://bugzilla.kernel.org/show_bug.cgi?id=42679
> if you haven't already found it.
Somewhat... It's not the phantom function issue: the error I capture in
my IOMMU shows that it's trying to read from address 0, which is
unmapped, but with the right initiator.
This device is a pile of crap. We've talked to Marvell's support channel,
sent driver traces, etc., but they didn't admit anything.
We've switched to a 9235 instead which seems to work fine.
Cheers,
Ben.
* Re: Bad DMA from Marvell 9230
2014-05-30 10:37 ` Benjamin Herrenschmidt
@ 2014-05-30 13:58 ` Jérôme Carretero
2014-05-30 14:13 ` Roger Heflin
2014-05-30 20:59 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 11+ messages in thread
From: Jérôme Carretero @ 2014-05-30 13:58 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Tejun Heo, Bartlomiej Zolnierkiewicz, linux-ide, LKML,
Alex Williamson
On Fri, 30 May 2014 20:37:58 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> We've switched to a 9235 instead which seems to work fine.
Weird (I hadn't seen that you reported the 9235 working...), I have
IOMMU problems with a 9235...
What system are you running it on (when you say "power box", is it a
beefy x86 computer or literally a PowerPC)?
For me, AMD 990FX chipset, latest master linux.
My board works fine* on another non-IOMMU system.
--
Jérôme
* with issues with port multipliers
Link to Benjamin's first message: https://lkml.org/lkml/2014/3/27/43
* Re: Bad DMA from Marvell 9230
2014-05-30 13:58 ` Jérôme Carretero
@ 2014-05-30 14:13 ` Roger Heflin
2014-05-30 15:14 ` Jérôme Carretero
2014-05-30 21:06 ` Benjamin Herrenschmidt
2014-05-30 20:59 ` Benjamin Herrenschmidt
1 sibling, 2 replies; 11+ messages in thread
From: Roger Heflin @ 2014-05-30 14:13 UTC (permalink / raw)
To: Jérôme Carretero
Cc: Benjamin Herrenschmidt, Tejun Heo, Bartlomiej Zolnierkiewicz,
linux-ide, LKML, Alex Williamson
I had a 9230. On older kernels it worked "ok" so long as you did not
do any SMART commands; I removed it and went to something that works.
Marvell appears to be hit and miss, with some cards/chips working
right and some not...
Do enough smartcmds and the entire board (all 4 ports) locked up and
required a reboot, I quit doing smartcmds and stability went way up,
but it was still not 100% stable.
Supplier support "claimed" it to be a Linux AHCI bug, as they "claim"
that their board correctly supports AHCI, even though all other AHCI
boards work right in this exact same use case in the exact same
machine.
On Fri, May 30, 2014 at 8:58 AM, Jérôme Carretero <cJ-ko@zougloub.eu> wrote:
> On Fri, 30 May 2014 20:37:58 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
>> We've switched to a 9235 instead which seems to work fine.
>
> Weird (I hadn't seen that you reported the 9235 working...), I have
> IOMMU problems with a 9235...
>
> What system are you running it on (when you say "power box", is it a
> beefy x86 computer or literally a PowerPC)?
> For me, AMD 990FX chipset, latest master linux.
> My board works fine* on another non-IOMMU system.
>
> --
> Jérôme
>
> * with issues with port multipliers
>
> Link to Benjamin's first message: https://lkml.org/lkml/2014/3/27/43
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
* Re: Bad DMA from Marvell 9230
2014-05-30 14:13 ` Roger Heflin
@ 2014-05-30 15:14 ` Jérôme Carretero
2014-05-30 21:06 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 11+ messages in thread
From: Jérôme Carretero @ 2014-05-30 15:14 UTC (permalink / raw)
To: Roger Heflin
Cc: Benjamin Herrenschmidt, Tejun Heo, Bartlomiej Zolnierkiewicz,
linux-ide, LKML, Alex Williamson
On Fri, 30 May 2014 09:13:43 -0500
Roger Heflin <rogerheflin@gmail.com> wrote:
> I had a 9230...
> [...]
> Supplier support "claimed" it to be a Linux AHCI bug as the "claim"
> that their board correctly supports AHCI, even though all other AHCI
> boards work right in this exact same use case in the exact same
> machine.
Does somebody know about another supplier that provides equivalent
SATA adapters that behave well, are robust and support FIS switching,
and don't come with proprietary drivers/utilities but rather *support*
their linux driver?
I'd bite the bullet and get a better, more expensive device, but it
doesn't seem to come with appropriate software support either.
There are some RAID adapters that don't expose the disks unless we
create RAID volumes, with an ugly CLI, and where we don't know
what gets written where on the disk if we create one volume
per disk and do software RAID later.
Not tempted to use that.
Or waste PCIe slots and use more el-cheapo ASMedia 1061 PCIe-1x
devices... do these work well?
--
Jérôme
* Re: Bad DMA from Marvell 9230
2014-05-30 13:58 ` Jérôme Carretero
2014-05-30 14:13 ` Roger Heflin
@ 2014-05-30 20:59 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-30 20:59 UTC (permalink / raw)
To: Jérôme Carretero
Cc: Tejun Heo, Bartlomiej Zolnierkiewicz, linux-ide, LKML,
Alex Williamson
On Fri, 2014-05-30 at 09:58 -0400, Jérôme Carretero wrote:
> Weird (I hadn't seen that you reported the 9235 working...), I have
> IOMMU problems with a 9235...
>
> What system are you running it on (when you say "power box", is it a
> beefy x86 computer or literally a PowerPC)?
> For me, AMD 990FX chipset, latest master linux.
> My board works fine* on another non-IOMMU system.
A PowerPC POWER8 prototype machine :-)
Cheers,
Ben.
* Re: Bad DMA from Marvell 9230
2014-05-30 14:13 ` Roger Heflin
2014-05-30 15:14 ` Jérôme Carretero
@ 2014-05-30 21:06 ` Benjamin Herrenschmidt
2014-05-30 23:08 ` Roger Heflin
1 sibling, 1 reply; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-30 21:06 UTC (permalink / raw)
To: Roger Heflin
Cc: Jérôme Carretero, Tejun Heo, Bartlomiej Zolnierkiewicz,
linux-ide, LKML, Alex Williamson
On Fri, 2014-05-30 at 09:13 -0500, Roger Heflin wrote:
> Do enough smartcmds and the entire board (all 4 ports) locked up and
> required a reboot, I quit doing smartcmds and stability went way up,
> but it was still not 100% stable.
Any chance you can give me an example of "enough smartcmds" ? IE a
script or something that reliably breaks it for you ? I'd like to try on
my 9235.
Cheers,
Ben.
* Re: Bad DMA from Marvell 9230
2014-05-30 21:06 ` Benjamin Herrenschmidt
@ 2014-05-30 23:08 ` Roger Heflin
0 siblings, 0 replies; 11+ messages in thread
From: Roger Heflin @ 2014-05-30 23:08 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Jérôme Carretero, Tejun Heo, Bartlomiej Zolnierkiewicz,
linux-ide, LKML, Alex Williamson
Pretty much any SMART commands. I was running something that got all
of the SMART stats once per hour per disk, and this made it crash about
once per week. If you were pushing the disks hard, it appeared to make
a crash under the SMART commands even more likely. Removing the
commands took things up to 2-3 months between crashes.
I suspect that if you just ran a simple smartctl --all /dev/sdX a few
times a minute, a board with the issue would almost certainly crash in
less than a day. I did not figure out myself that the SMART commands
were crashing it; someone else's post indicated that they had
determined that, so I worked out what I had issuing smartcmds, removed
it, and things got much better.
As for finding good vendors, I know others on the md-raid list have
given up on cheap and found decent but more expensive controllers.
I would expect LSI and Adaptec to care enough about their names to
make a decent quality product. There appears to be a 4-port (one 8087
port, JBOD/non-RAID) Adaptec card that may be some variant of Marvell,
at about $130 US on Newegg; given that it is Adaptec, they may have
made the Marvell actually work. There are a number of 8-port non-RAID
cards at around $250-$300 that would probably work great if you wanted
to pay that much; these cards have 2x 8087 ports and need an
8087->4 SATA cable. Given how nice it is to have a machine that just
mostly works without messing around with it, I would probably pay the
extra for stability.
Last time I looked at the 2-port PCIe x1 cards, I found enough
indications of instability to expect that I would have to put
several hours (or more) of testing/crashing/RMA pain in to figure out
which ones worked. I went so far as crossing out any motherboards
with non-AMD/non-Intel SATA ports, as I have been burned before by
large motherboard vendors doing a bad job of integrating other vendors'
(possibly bad) SATA ports. It is a sad state, but it has also been this
way for a long time.
Thread overview (11 messages):
2014-03-27 6:57 Bad DMA from Marvell 9230 Benjamin Herrenschmidt
2014-03-27 15:19 ` Tejun Heo
2014-04-05 2:35 ` Robert Hancock
2014-05-30 7:06 ` Jérôme Carretero
2014-05-30 10:37 ` Benjamin Herrenschmidt
2014-05-30 13:58 ` Jérôme Carretero
2014-05-30 14:13 ` Roger Heflin
2014-05-30 15:14 ` Jérôme Carretero
2014-05-30 21:06 ` Benjamin Herrenschmidt
2014-05-30 23:08 ` Roger Heflin
2014-05-30 20:59 ` Benjamin Herrenschmidt