SATA hang on 8315E triggered by heavy flash write?

* SATA hang on 8315E triggered by heavy flash write?
From: Anthony Foiani @ 2013-05-15  8:12 UTC
  To: linuxppc-dev

Greetings.

We're using a board derived from the MPC8315E.  Fairly regularly, the
SATA connection will freeze up while we are writing to flash memory:

  [  839.806884] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x6 frozen
  [  839.814201] ata2.00: failed command: WRITE DMA
  [  839.818814] ata2.00: cmd ca/00:08:58:95:21/00:00:00:00:00/e1 tag 0 dma 4096 out
  [  839.818838]          res 50/00:00:98:00:18/00:00:00:00:00/e1 Emask 0x10 (ATA bus error)
  [  839.834222] ata2.00: status: { DRDY }
  [  839.838046] ata2: hard resetting link
  [  839.867942] ata2: setting speed (in hard reset)
  [  849.959859] ata2: No Signature Update
  [  850.131872] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [  850.138219] ata2.00: link online but device misclassified
  [  855.143882] ata2.00: qc timeout (cmd 0xec)
  [  855.148144] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
  [  855.154375] ata2.00: revalidation failed (errno=-5)
  [  855.159376] ata2: hard resetting link
  [  855.659847] ata2: Hardreset failed, not off-lined 0
  [  855.671839] ata2: setting speed (in hard reset)
  [  865.259851] ata2: No Signature Update
  ...
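For anyone reading the "cmd" line above, here is a minimal sketch of how
its fields can be unpacked.  It assumes the field order libata uses in
its error report (command/feature:nsect:lbal:lbam:lbah/hob.../device)
and only handles 28-bit LBA commands such as WRITE DMA (0xca) and READ
DMA (0xc8); decode_taskfile is a hypothetical helper name, not a kernel
API.

```python
# Hypothetical helper: unpack the "cmd" taskfile that libata prints for a
# failed command.  Assumed field order:
#   command/feature:nsect:lbal:lbam:lbah/hob_*:...:hob_*/device
ATA_COMMANDS = {0xC8: "READ DMA", 0xCA: "WRITE DMA", 0xEC: "IDENTIFY DEVICE"}


def decode_taskfile(cmd: str) -> dict:
    head, low, _hob, dev = cmd.split("/")
    command = int(head, 16)
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    device = int(dev, 16)
    # LBA28: the low nibble of the device register carries LBA bits 27:24.
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {
        "command": ATA_COMMANDS.get(command, hex(command)),
        "lba": lba,
        # For 28-bit commands a sector count of 0 means 256 sectors.
        "sectors": nsect or 256,
    }


print(decode_taskfile("ca/00:08:58:95:21/00:00:00:00:00/e1"))
```

With the line from the log this gives a WRITE DMA of 8 sectors (4096
bytes, matching "dma 4096 out") at LBA 18978136.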

The previous times I saw this, it would eventually recover, and our
device would keep on working correctly.  

However, now that we're doing many more operations per second, it
seems that it doesn't have time to recover, and we get a permanent error:

  [  925.883824] ata2: No Signature Update
  [  926.055893] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [  926.062236] ata2.00: link online but device misclassified
  [  926.067846] ata2: EH complete
  [  926.071337] sd 1:0:0:0: [sda] Unhandled error code
  [  926.076283] sd 1:0:0:0: [sda]  Result: hostbyte=0x04 driverbyte=0x00
  [  926.082758] sd 1:0:0:0: [sda] CDB: cdb[0]=0x2a: 2a 00 01 21 95 58 00 00 08 00
  [  926.090150] end_request: I/O error, dev sda, sector 18978136
  [  926.096132] sd 1:0:0:0: [sda] Unhandled error code
  [  926.101037] sd 1:0:0:0: [sda]  Result: hostbyte=0x04 driverbyte=0x00
  [  926.107504] sd 1:0:0:0: [sda] CDB: cdb[0]=0x2a: 2a 00 01 22 2a 00 00 00 10 00
  [  926.114894] end_request: I/O error, dev sda, sector 19016192
  [  926.120811] sd 1:0:0:0: [sda] Unhandled error code
  [  926.125719] sd 1:0:0:0: [sda]  Result: hostbyte=0x04 driverbyte=0x00
  [  926.132225] sd 1:0:0:0: [sda] CDB: cdb[0]=0x2a: 2a 00 00 00 00 08 00 00 08 00
  [  926.139545] end_request: I/O error, dev sda, sector 8
  [  926.144690] Buffer I/O error on device sda1, logical block 0
  [  926.150437] lost page write due to I/O error on sda1
  [  926.155674] sd 1:0:0:0: [sda] Unhandled error code
  [  926.160614] sd 1:0:0:0: [sda]  Result: hostbyte=0x04 driverbyte=0x00
  [  926.167073] sd 1:0:0:0: [sda] CDB: cdb[0]=0x2a: 2a 00 01 18 00 18 00 00 08 00
  [  926.174451] end_request: I/O error, dev sda, sector 18350104
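The CDB bytes in these messages can be cross-checked against the
reported sector numbers.  A minimal sketch, assuming the standard
10-byte SCSI WRITE(10) layout (opcode 0x2a, big-endian LBA in bytes 2-5,
transfer length in bytes 7-8); decode_write10 is a hypothetical helper
name.

```python
def decode_write10(cdb_hex: str) -> tuple:
    """Return (lba, blocks) for a SCSI WRITE(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex() skips the spaces between bytes
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


for cdb in (
    "2a 00 01 21 95 58 00 00 08 00",
    "2a 00 01 22 2a 00 00 00 10 00",
    "2a 00 00 00 00 08 00 00 08 00",
    "2a 00 01 18 00 18 00 00 08 00",
):
    print(cdb, "->", decode_write10(cdb))
```

The decoded LBAs (18978136, 19016192, 8, 18350104) agree with the
sectors named in the end_request lines, so these are ordinary queued
writes being failed after recovery gave up, not malformed requests.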

And eventually I get nothing but:

  [ 1177.852326] EXT2-fs (sda1): error: read_inode_bitmap: Cannot read inode bitmap - block_group = 70, inode_bitmap = 2293761
  [ 1177.871983] EXT2-fs (sda1): previous I/O error to superblock detected
  [ 1177.872016] 
  [ 1177.880192] EXT2-fs (sda1): error: read_inode_bitmap: Cannot read inode bitmap - block_group = 70, inode_bitmap = 2293761
  [ 1177.893731] EXT2-fs (sda1): previous I/O error to superblock detected
  [ 1177.893764] 

At this point, /dev/sda is pretty much unusable, and I have to do at
least a reboot to recover.  (I don't recall if I had to do a power
cycle at this point, though.)

I suspect that it is related to erratum eLBC-A001 (from MPC8315E Chip
Errata, Rev. 3, 09/2011):

  eLBC-A001:

  Simultaneous FCM and GPCM or UPM operation may erroneously trigger
  bus monitor timeout

  Description: Devices: MPC8315E, MPC8314E
  When the FCM is in the middle of a long transaction, such as NAND
  erase or write, another transaction on the GPCM or UPM triggers the
  bus monitor to start immediately for the GPCM or UPM, even though
  the GPCM or UPM is still waiting for the FCM to finish and has not
  yet started its transaction. If the bus monitor timeout value is not
  programmed for a sufficiently large value, the local bus monitor may
  time out. This timeout corrupts the current NAND Flash operation and
  terminate the GPCM or UPM operation.

  Impact: Local bus monitor may time out unexpectedly and corrupt the
  NAND transaction.

  Workaround: Set the local bus monitor timeout value to the maximum
  by setting LBCR[BMT] = 0 and LBCR[BMTPS] = 0xF.

  Fix plan: No plans to fix

But it seems that erratum is already fixed:

  http://patchwork.ozlabs.org/patch/96339/
  (git patch d08e44570e)
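
To make the workaround concrete, here is a minimal sketch of the
register arithmetic it describes: clear LBCR[BMT] and set LBCR[BMTPS] to
0xF.  The field masks below are assumptions (BMT in bits 0x0000FF00,
BMTPS in bits 0x0000000F) and should be checked against the reference
manual and the kernel's fsl_lbc.h; the function names are hypothetical.

```python
# Assumed field masks for the eLBC LBCR register; verify before relying on them.
LBCR_BMT = 0x0000FF00    # bus monitor timing
LBCR_BMTPS = 0x0000000F  # bus monitor timer prescale


def apply_elbc_a001(lbcr: int) -> int:
    """Return LBCR with BMT cleared and BMTPS set to 0xF (maximum timeout)."""
    return ((lbcr & ~LBCR_BMT) | LBCR_BMTPS) & 0xFFFFFFFF


def has_elbc_a001_workaround(lbcr: int) -> bool:
    """True if a raw LBCR value already has the workaround settings."""
    return (lbcr & LBCR_BMT) == 0 and (lbcr & LBCR_BMTPS) == LBCR_BMTPS


print(hex(apply_elbc_a001(0x0004FF00)))
```

A check like has_elbc_a001_workaround() could be run against an LBCR
value dumped from the running board to confirm that the workaround is
actually in effect.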

Am I reading that correctly?  (I'm already writing only one flash
sector at a time, but it might be that even a single 0x10000-byte
sector takes long enough to trigger the issue.)  I also verified that
I have the relevant property in my device tree:

  localbus@e0005000 {
    ...
    compatible = "fsl,mpc8315-elbc", "fsl,elbc", "simple-bus";

So, my questions are:

1. Is anyone else seeing something like this?

2. Is there an obvious way for our code to detect that we're in the
   middle of error recovery, so that we can hold off on disk writes
   until recovery is complete?

3. Is there any chance that the 1.5Gbps limiting code might have
   exacerbated the problems?

4. Should I open a support request with Freescale, or if someone from
   Freescale is already reading this, could you look to see if anyone
   else has reported it?

The kernel is 3.4.36; cpuinfo says:

  / # cat /proc/cpuinfo
  processor       : 0
  cpu             : e300c3
  clock           : 266.666664MHz
  revision        : 2.0 (pvr 8085 0020)
  bogomips        : 66.66
  timebase        : 33333333
  platform        : MPC831x RDB
  Memory          : 256 MB

The device on that SATA link is an InnoDisk SSD:

  / # dmesg | grep 'ata2'
  [    7.729684] ata2: SATA max UDMA/133 irq 45
  [    7.973996] ata2: setting speed (in hard reset)
  [    7.984849] ata2: Signature Update detected @ 0 msecs
  [    8.390553] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [    8.471244] ata2.00: ATA-8: InnoDisk Corp. - mSATA D150Q, 110520B, max UDMA/133
  [    8.478690] ata2.00: 31277232 sectors, multi 16: LBA48 
  [    8.484562] ata2.00: configured for UDMA/133
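
The "SStatus 113 SControl 310" values in the link-up message can be
split into their fields.  A minimal sketch, assuming the usual SATA
register layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and
that the kernel prints both values in hex; decode_sreg is a hypothetical
helper name.

```python
def decode_sreg(value_hex: str) -> dict:
    """Split an SStatus/SControl value (hex string) into DET, SPD and IPM."""
    v = int(value_hex, 16)
    return {"det": v & 0xF, "spd": (v >> 4) & 0xF, "ipm": (v >> 8) & 0xF}


print("SStatus ", decode_sreg("113"))
print("SControl", decode_sreg("310"))
```

Under that layout, SStatus 113 reads as DET=3 (device present, PHY
communication established), SPD=1 (1.5 Gbps) and IPM=1 (active), while
SControl 310 has SPD=1, which is consistent with the speed limit being
requested by the host side.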

As always, any further hints would be very welcome.

Best regards,
Anthony Foiani

* Re: SATA hang on 8315E triggered by heavy flash write?
From: Scott Wood @ 2013-05-21 21:44 UTC
  To: Anthony Foiani; +Cc: linuxppc-dev, Shaohui.Xie

On 05/15/2013 03:12:21 AM, Anthony Foiani wrote:
> At this point, /dev/sda is pretty much unusable, and I have to do at
> least a reboot to recover.  (I don't recall if I had to do a power
> cycle at this point, though.)
>
> I suspect that it is related to errata eLBC-A001 (from MPC8315E Chip
> Errata, Rev. 3, 09/2011):
>
>   eLBC-A001:
>
>   Simultaneous FCM and GPCM or UPM operation may erroneously trigger
>   bus monitor timeout
>
>   Description: Devices: MPC8315E, MPC8314E
>   When the FCM is in the middle of a long transaction, such as NAND
>   erase or write, another transaction on the GPCM or UPM triggers the
>   bus monitor to start immediately for the GPCM or UPM, even though
>   the GPCM or UPM is still waiting for the FCM to finish and has not
>   yet started its transaction. If the bus monitor timeout value is not
>   programmed for a sufficiently large value, the local bus monitor may
>   time out. This timeout corrupts the current NAND Flash operation and
>   terminate the GPCM or UPM operation.
>
>   Impact: Local bus monitor may time out unexpectedly and corrupt the
>   NAND transaction.
>
>   Workaround: Set the local bus monitor timeout value to the maximum
>   by setting LBCR[BMT] = 0 and LBCR[BMTPS] = 0xF.
>
>   Fix plan: No plans to fix
>
> But it seems that erratum is already fixed:
>
>   http://patchwork.ozlabs.org/patch/96339/
>   (git patch d08e44570e)
>
> Am I reading that correctly?

Yes, that erratum has been worked around.

> (I'm already writing only one flash
> sector at a time, but it might be that even a single 0x10000-byte
> sector takes long enough to trigger the issue.)

I don't think this erratum is relevant.  Unlike NAND, NOR flash does
not involve holding the localbus for extended periods of time.  I also
don't see how it would interact with SATA, which is separate from the
localbus.  Are you seeing any errors on the localbus, or just on SATA?

> I also verified that
> I have the relevant property in my device tree:
>
>   localbus@e0005000 {
>     ...
>     compatible = "fsl,mpc8315-elbc", "fsl,elbc", "simple-bus";
>
> So, my questions are:
>
> 1. Is anyone else seeing something like this?
>
> 2. Is there an obvious way for our code to detect that we're in the
>    middle of error recovery, so we can not write to the disk until
>    recovery is complete?
>
> 3. Is there any chance that the 1.5Gbps limiting code might have
>    exacerbated the problems?
>
> 4. Should I open a support request with Freescale, or if someone from
>    Freescale is already reading this, could you look to see if anyone
>    else has reported it?

Hopefully Shaohui (our SATA person) can answer these.  If you don't get
an answer, go ahead and open an official support request.

-Scott

* Re: SATA hang on 8315E triggered by heavy flash write?
From: Anthony Foiani @ 2013-05-22  4:16 UTC
  To: Scott Wood; +Cc: linuxppc-dev, Shaohui.Xie


Scott --

Scott Wood <scottwood@freescale.com> writes:

> On 05/15/2013 03:12:21 AM, Anthony Foiani wrote:
>> At this point, /dev/sda is pretty much unusable, and I have to do
>> at least a reboot to recover.  (I don't recall if I had to do a
>> power cycle at this point, though.)

For whatever it's worth, a hard boot (full power cycle) is indeed
necessary at this point.

>> I suspect that it is related to errata eLBC-A001 (from MPC8315E
>> Chip Errata, Rev. 3, 09/2011):
>> ...
>> But it seems that erratum is already fixed:
>> 
>>   http://patchwork.ozlabs.org/patch/96339/
>>   (git patch d08e44570e)
>> 
>> Am I reading that correctly?
>
> Yes, that erratum has been worked around.

Ok, thanks for the confirmation.

>> (I'm already writing only one flash sector at a time, but it might
>> be that even a single 0x10000-byte sector takes long enough to
>> trigger the issue.)
>
> I don't think this erratum is relevant.  Unlike NAND, NOR flash does
> not involve holding the localbus for extended periods of time.

I wasn't sure about the mechanism of the erratum, and it seemed
awfully close, so I thought I'd go fishing.  Guess I missed.  :(

It is NOR writes, btw; I do both in my application, but the initial
error always seems to occur during a NOR write.  (In this device,
kernel + devtree go into NOR flash, ramdisk goes into NAND flash, and
data goes to SSD... stop laughing.)

Here's the most recent hang.  First, to compare the application log
timestamps with the kernel log timestamps:

  # mix of kernel and application log, note that kernel is about +12s.
  +0.537506 main.0 [0]: rc: fork took 9.376ms
  [   12.892323] PHY: mdio@e0024520:01 - Link is Up - 100/Full
  +1.603034 main.0 [0]: schs: ctor: done

The console output is:

  # console log
  [318334.294126] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen
  [318334.301515] ata2.00: PHY RDY changed
  [318334.305301] ata2.00: failed command: WRITE DMA
  [318334.309991] ata2.00: cmd ca/00:08:b0:00:18/00:00:00:00:00/e1 tag 0 dma 4096 out
  [318334.310015]          res 50/00:00:08:61:25/00:00:00:00:00/e1 Emask 0x10 (ATA bus error)
  [318334.325689] ata2.00: status: { DRDY }
  [318334.329717] ata2: hard resetting link
  [318334.836038] ata2: Hardreset failed, not off-lined 0
  [318334.848407] ata2: setting speed (in hard reset)
  [318344.456050] ata2: No Signature Update
  [318344.631916] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [318344.638354] ata2.00: link online but device misclassified
  [318349.643897] ata2.00: qc timeout (cmd 0xec)
  [318349.648268] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
  [318349.654562] ata2.00: revalidation failed (errno=-5)
  [318349.659667] ata2: hard resetting link
  [318350.163864] ata2: Hardreset failed, not off-lined 0
  [318350.175869] ata2: setting speed (in hard reset)
  [318359.771956] ata2: No Signature Update
  [318359.947901] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [318359.954342] ata2.00: link online but device misclassified
  [318369.959921] ata2.00: qc timeout (cmd 0xec)
  [318369.964279] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
  [318369.970567] ata2.00: revalidation failed (errno=-5)
  [318369.975658] ata2: hard resetting link
  [318370.479933] ata2: Hardreset failed, not off-lined 0
  [318370.491880] ata2: setting speed (in hard reset)
  [318380.083892] ata2: No Signature Update

And my application log:

  # application log
  +318320.957019 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x180000 from buf[0x80000]; attempt 1/3
  +318322.498346 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x190000 from buf[0x90000]; attempt 1/3
  +318323.849995 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1a0000 from buf[0xa0000]; attempt 1/3
  +318325.262559 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1b0000 from buf[0xb0000]; attempt 1/3
  +318326.703213 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1c0000 from buf[0xc0000]; attempt 1/3
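
To line up the two logs, the mixed snippet above can be turned into
bounds on the clock offset.  This is only a sketch: it assumes the mixed
lines appear in true chronological order and that the two clocks do not
drift over the roughly 3.7 days of uptime; the function names are
hypothetical.

```python
def offset_bounds(app_before: float, kernel_ts: float, app_after: float) -> tuple:
    """Bounds on (kernel clock - app clock), given a kernel line that was
    logged between two application lines."""
    return kernel_ts - app_after, kernel_ts - app_before


def to_app_window(kernel_ts: float, lo: float, hi: float) -> tuple:
    """Earliest and latest application-clock time for a kernel timestamp."""
    return kernel_ts - hi, kernel_ts - lo


# From the mixed log: app +0.537506, kernel 12.892323, app +1.603034.
lo, hi = offset_bounds(0.537506, 12.892323, 1.603034)
print("offset between %.6f and %.6f s" % (lo, hi))

# First line of the ata2.00 exception in the console log.
start, end = to_app_window(318334.294126, lo, hi)
print("exception at app time +%.6f .. +%.6f" % (start, end))
```

Under those assumptions the offset is between about 11.29 s and 12.35 s,
which places the exception at roughly +318321.9 to +318323.0 on the
application clock, i.e. inside the run of NOR writes logged above.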

> I also don't see how it would interact with SATA, which is separate
> from the localbus.  

No idea.  Is there some other shared resource that might be taxed by
this type of load?

I do get a few other errors, usually just once or twice per boot:

  [ 4231.619368] NOHZ: local_softirq_pending 100
  [ 4232.249935] NOHZ: local_softirq_pending 100
  [ 4232.312241] NOHZ: local_softirq_pending 100
  [ 4232.424523] NOHZ: local_softirq_pending 100
  [ 4233.139146] NOHZ: local_softirq_pending 100
  [ 4233.328540] NOHZ: local_softirq_pending 100
  [ 4233.655909] NOHZ: local_softirq_pending 100
  [ 4234.106578] NOHZ: local_softirq_pending 100
  [ 4234.853966] NOHZ: local_softirq_pending 100
  [ 4235.375208] NOHZ: local_softirq_pending 100
  [11072.027818] hrtimer: interrupt took 126210 ns

They seem harmless, though, and (as the timestamps indicate) the
machine happily ran for 3-4 days after those issues.

> Are you seeing any errors on the localbus, or just on SATA?

I'm not seeing any errors in the console log -- but I'm not using the
LBC for anything other than flash writes, SFAIK.  (Unless I2C is
handled through the LBC, in which case, I have frequent (~50-100/s)
small transactions all the time -- but the hangs always coincide with
flash writes, and not with the I2C traffic that is going on all the
time...)

> Hopefully Shaohui (our SATA person) can answer these.  If you don't
> get an answer, go ahead and open an official support request.

I have a (lousy) workaround in hand: don't touch the disk during flash
updates.  (The flash writes are software updates, which will hopefully
be fairly rare once I'm done developing this thing.  Until then,
though, I'm updating it multiple times a day, and have hit this quite
a few times by now.)

So there's no great hurry.  If Shaohui can find something in the next
week or so, that'd be fantastic; otherwise, I'll open a request.

Thanks again!

Best regards,
Anthony Foiani

* RE: SATA hang on 8315E triggered by heavy flash write?
From: Xie Shaohui-B21989 @ 2013-05-22  6:15 UTC
  To: Anthony Foiani; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org

Hi, Anthony Foiani,

Please confirm which operation is key to reproducing the error:
1. Only update NOR for a long enough time, e.g. tens of seconds, and see
   if the error happens;
2. Only read/write the SSD, with no NOR operations, and see if the error
   happens;
3. Start reading/writing the SSD and keep it running, then start reading
   NOR; if no error occurs for a long time, start writing NOR and see how
   long it takes for the error to happen.

Best Regards,
Shaohui Xie



* Re: SATA hang on 8315E triggered by heavy flash write?
From: Anthony Foiani @ 2013-05-23  5:52 UTC
  To: Xie Shaohui-B21989; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org


Shaohui --

Thanks for the quick reply!  Please find my investigation and results
below.

Xie Shaohui-B21989 <B21989@freescale.com> writes:

> 1. only update NOR for a long enough time, for ex. tens of seconds,
>    see if error happens;

It seems that I can do this without any errors:

  / # flash_erase /dev/mtd1 0 0
  Erasing 64 Kibyte @ 7f0000 -- 100 % complete 
  / # dd if=/dev/zero of=/dev/mtd1 
  dd: writing '/dev/mtd1': No space left on device
  16385+0 records in
  16384+0 records out
  8388608 bytes (8.0MB) copied, 62.399439 seconds, 131.3KB/s
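
As a rough sanity check on those dd numbers, the arithmetic below
derives the write rate and the average time per 0x10000-byte sector.  It
is only a sketch: it assumes dd's "KB" means 1024 bytes and that the
programming time is spread evenly over the device, and it excludes
erase time, since the erase was done beforehand.

```python
total_bytes = 8388608   # bytes written to /dev/mtd1, from the dd output
elapsed_s = 62.399439   # seconds, from the dd output
SECTOR = 0x10000        # 64 KiB NOR sector size

rate_kib_s = total_bytes / elapsed_s / 1024  # write rate in KiB/s
sectors = total_bytes // SECTOR              # number of 64 KiB sectors
per_sector_s = elapsed_s / sectors           # average programming time per sector

print("%.1f KiB/s, %d sectors, %.3f s per sector" % (rate_kib_s, sectors, per_sector_s))
```

That works out to 128 sectors at a little under half a second each, so
even a single-sector write keeps the NOR device busy for a period that
is long compared with individual SATA commands.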

> 2. only r/w SSD without NOR operation, see if error happens;

Again, no problem:

  /ssd # ls -al biggie.bin
  -rw-r--r--    1 root     root     2330607084 May 22 19:34 biggie.bin
  /ssd # ls -alh biggie.bin
  -rw-r--r--    1 root     root        2.2G May 22 19:34 biggie.bin
  /ssd # time cp biggie.bin biggie2.bin
  real    3m 27.55s
  user    0m 2.60s
  sys     2m 16.13s

> 3. r/w SSD first and keep it run, then start to read NOR, if no
>    error for a long time, then start to write NOR, see how long the
>    error will happen.

Doing a NOR read during heavy SATA r/w seems to succeed, with no
errors on the console:

  [window 1]
  /ssd # time cp biggie.bin biggie2.bin

  [window 2]
  / # dd if=/dev/mtd1 of=/dev/null
  16384+0 records in
  16384+0 records out
  8388608 bytes (8.0MB) copied, 6.380613 seconds, 1.3MB/s

Doing a NOR write fails almost instantly (within a second):

  [window 1]
  /ssd # time cp biggie.bin biggie2.bin

  [window 2]
  / # dd if=/dev/zero of=/dev/mtd1 

  [console]
  [ 5160.269106] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x6 frozen
  [ 5160.276387] ata2.00: failed command: READ DMA
  [ 5160.280905] ata2.00: cmd c8/00:00:60:f3:01/00:00:00:00:00/e0 tag 0 dma 131072 in
  [ 5160.280928]          res 50/00:00:f0:c0:48/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
  [ 5160.296386] ata2.00: status: { DRDY }
  [ 5160.300195] ata2: hard resetting link
  [ 5160.347858] ata2: setting speed (in hard reset)
  [ 5170.439981] ata2: No Signature Update
  [ 5170.611901] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [ 5170.618204] ata2.00: link online but device misclassified
  [ 5175.623918] ata2.00: qc timeout (cmd 0xec)
  [ 5175.628147] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
  [ 5175.634347] ata2.00: revalidation failed (errno=-5)
  [ 5175.639373] ata2: hard resetting link
  [ 5176.143847] ata2: Hardreset failed, not off-lined 0
  [ 5176.155867] ata2: setting speed (in hard reset)
  [ 5185.743871] ata2: No Signature Update
  [ 5185.915900] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [ 5185.922203] ata2.00: link online but device misclassified
  [ 5195.927910] ata2.00: qc timeout (cmd 0xec)
  [ 5195.932140] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
  [ 5195.938342] ata2.00: revalidation failed (errno=-5)
  [ 5195.943430] ata2: hard resetting link
  [ 5196.443885] ata2: Hardreset failed, not off-lined 0
  ...

At this point, a hard reset / full power cycle is needed to recover.

The board is an MPC8315ERDB derivative, and I'm running a patched
3.4.36 kernel.

I've uploaded some (possibly) relevant files to:

  http://foiani.home.dyndns.org/~tony/linux/ppc-sata-issues-201305/

There is a diff from 3.4.36, a devtree, and a kernel config.

Please let me know if there is any more information that I can
contribute.

Best regards,
Anthony Foiani

* RE: SATA hang on 8315E triggered by heavy flash write?
From: Xie Shaohui-B21989 @ 2013-05-23  6:04 UTC
  To: Anthony Foiani; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org

Hi, Anthony Foiani,

Thanks for the confirmation.

So it seems that the NOR write breaks the signal integrity of SATA.
I don't have the schematic or the board right now; could you please
measure the signals related to the NOR write to see if anything is
abnormal?  Does the board use an FPGA or CPLD to control the signals?

If you stop the NOR write, can SATA recover and work again?

Best Regards,
Shaohui Xie


> -----Original Message-----
> From: Anthony Foiani [mailto:tkil@scrye.com]
> Sent: Thursday, May 23, 2013 1:52 PM
> To: Xie Shaohui-B21989
> Cc: Wood Scott-B07421; linuxppc-dev@lists.ozlabs.org
> Subject: Re: SATA hang on 8315E triggered by heavy flash write?
>=20
>=20
> Shaohui --
>=20
> Thanks for the quick reply!  Please find my investigation and results
> below.
>=20
> Xie Shaohui-B21989 <B21989@freescale.com> writes:
>=20
> > 1. only update NOR for a long enough time, for ex. tens of seconds,
> >    see if error happens;
>=20
> It seems that I can do this without any errors:
>=20
>   / # flash_erase /dev/mtd1 0 0
>   Erasing 64 Kibyte @ 7f0000 -- 100 % complete
>   / # dd if=3D/dev/zero of=3D/dev/mtd1
>   dd: writing '/dev/mtd1': No space left on device
>   16385+0 records in
>   16384+0 records out
>   8388608 bytes (8.0MB) copied, 62.399439 seconds, 131.3KB/s
>=20
> > 2. only r/w SSD without NOR operation, see if error happens;
>=20
> Again, no problem:
>=20
>   /ssd # ls -al biggie.bin
>   -rw-r--r--    1 root     root     2330607084 May 22 19:34 biggie.bin
>   /ssd # ls -alh biggie.bin
>   -rw-r--r--    1 root     root        2.2G May 22 19:34 biggie.bin
>   /ssd # time cp biggie.bin biggie2.bin
>   real    3m 27.55s
>   user    0m 2.60s
>   sys     2m 16.13s
>=20
> > 3. r/w SSD first and keep it run, then start to read NOR, if no
> >    error for a long time, then start to write NOR, see how long the
> >    error will happen.
>=20
> Doing a NOR read during heavy SATA r/w seems to succeed, with no errors
> on the console:
>=20
>   [window 1]
>   /ssd # time cp biggie.bin biggie2.bin
>=20
>   [window 2]
>   / # dd if=3D/dev/mtd1 of=3D/dev/null
>   16384+0 records in
>   16384+0 records out
>   8388608 bytes (8.0MB) copied, 6.380613 seconds, 1.3MB/s
>=20
> Doing a NOR write fails almost instantly (within a second):
>=20
>   [window 1]
>   /ssd # time cp biggie.bin biggie2.bin
>=20
>   [window 2]
>   / # dd if=3D/dev/zero of=3D/dev/mtd1
>=20
>   [console]
>   [ 5160.269106] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x0 action
> 0x6 frozen
>   [ 5160.276387] ata2.00: failed command: READ DMA
>   [ 5160.280905] ata2.00: cmd c8/00:00:60:f3:01/00:00:00:00:00/e0 tag 0
> dma 131072 in
>   [ 5160.280928]          res 50/00:00:f0:c0:48/00:00:00:00:00/e0 Emask
> 0x10 (ATA bus error)
>   [ 5160.296386] ata2.00: status: { DRDY }
>   [ 5160.300195] ata2: hard resetting link
>   [ 5160.347858] ata2: setting speed (in hard reset)
>   [ 5170.439981] ata2: No Signature Update
>   [ 5170.611901] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
>   [ 5170.618204] ata2.00: link online but device misclassified
>   [ 5175.623918] ata2.00: qc timeout (cmd 0xec)
>   [ 5175.628147] ata2.00: failed to IDENTIFY (I/O error, err_mask=3D0x4)
>   [ 5175.634347] ata2.00: revalidation failed (errno=3D-5)
>   [ 5175.639373] ata2: hard resetting link
>   [ 5176.143847] ata2: Hardreset failed, not off-lined 0
>   [ 5176.155867] ata2: setting speed (in hard reset)
>   [ 5185.743871] ata2: No Signature Update
>   [ 5185.915900] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
>   [ 5185.922203] ata2.00: link online but device misclassified
>   [ 5195.927910] ata2.00: qc timeout (cmd 0xec)
>   [ 5195.932140] ata2.00: failed to IDENTIFY (I/O error, err_mask=3D0x4)
>   [ 5195.938342] ata2.00: revalidation failed (errno=3D-5)
>   [ 5195.943430] ata2: hard resetting link
>   [ 5196.443885] ata2: Hardreset failed, not off-lined 0
>   ...
>
> At this point, a hard reset / full power cycle is needed to recover.
>
> The board is an MPC8315ERDB derivative, and I'm running a patched
> 3.4.36 kernel.
>
> I've uploaded some (possibly) relevant files to:
>
>   http://foiani.home.dyndns.org/~tony/linux/ppc-sata-issues-201305/
>
> There is a diff from 3.4.36, a devtree, and a kernel config.
>
> Please let me know if there is any more information that I can contribute.
>
> Best regards,
> Anthony Foiani


* Re: SATA hang on 8315E triggered by heavy flash write?
  2013-05-23  6:04         ` Xie Shaohui-B21989
@ 2013-05-23 15:10           ` Anthony Foiani
  2013-05-23 15:49             ` Anthony Foiani
  2013-05-27  7:50             ` Xie Shaohui-B21989
  0 siblings, 2 replies; 12+ messages in thread
From: Anthony Foiani @ 2013-05-23 15:10 UTC (permalink / raw)
  To: Xie Shaohui-B21989; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org

Shaohui --

Xie Shaohui-B21989 <B21989@freescale.com> writes:

> Thanks for the confirmation. 

You're very welcome.

> So it seems the NOR write break the signal Integrity of SATA.
> I don't have schematic and board right now, could you please measure
> signals related to NOR write to see if anything abnormal? Is the board
> use FPGA or CPLD to control signal?

I'll have to pass these questions on to my hardware vendor; I'm not
equipped to do this level of hardware debugging (neither hardware nor
knowledge!).

> If stop NOR write, could the SATA recover and work?

Earlier in my development, I was seeing this error and it would
recover:

  ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen
  ata2: PHY RDY changed
  ata2: hard resetting link
  ata2: Signature Update detected @ 0 msecs
  ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  ata2.00: configured for UDMA/133
  ata2: EH complete

At the current time, however, it seems that it does not recover.
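
For anyone reading these logs, the Emask / err_mask values can be decoded with a small table. The sketch below assumes the bit layout of the AC_ERR_* flags in include/linux/libata.h; the helper name err_mask_name is hypothetical and not part of libata.

```c
#include <stddef.h>

/* Bit positions follow the AC_ERR_* flags in include/linux/libata.h. */
static const char *const err_names[] = {
    "device error",     /* bit 0: AC_ERR_DEV */
    "HSM violation",    /* bit 1: AC_ERR_HSM */
    "timeout",          /* bit 2: AC_ERR_TIMEOUT */
    "media error",      /* bit 3: AC_ERR_MEDIA */
    "ATA bus error",    /* bit 4: AC_ERR_ATA_BUS */
    "host bus error",   /* bit 5: AC_ERR_HOST_BUS */
    "internal error",   /* bit 6: AC_ERR_SYSTEM */
    "invalid argument", /* bit 7: AC_ERR_INVALID */
};

/*
 * Hypothetical helper: return a label for the lowest set bit in mask,
 * or NULL when none of bits 0-7 is set.
 */
const char *err_mask_name(unsigned int mask)
{
    size_t i;

    for (i = 0; i < sizeof(err_names) / sizeof(err_names[0]); i++)
        if (mask & (1u << i))
            return err_names[i];
    return NULL;
}
```

With this table, Emask 0x10 maps to the ATA bus error that the kernel prints itself, and err_mask=0x4 on the failed IDENTIFY maps to a timeout, which matches the "qc timeout" line that precedes it.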

I don't know whether this is due to the speed limiting code, or if
it's because we are doing more disk accesses (when the actual product
is up and running).

I can re-do the tests with the speed limit disabled, but I won't be
able to get to that for a few hours yet.  You can read about the speed
limit issues in this thread:

  http://article.gmane.org/gmane.linux.ports.ppc.embedded/50652

And my final patch (yes, a year later):

  http://article.gmane.org/gmane.linux.ports.ppc.embedded/58969

Please don't laugh too hard when you read it.  :)

Thanks again for your help.  I'll try to get the results of testing
w/o speed limit to you within a day or two.

Best regards,
Anthony Foiani


* Re: SATA hang on 8315E triggered by heavy flash write?
  2013-05-23 15:10           ` Anthony Foiani
@ 2013-05-23 15:49             ` Anthony Foiani
  2013-05-27  7:50             ` Xie Shaohui-B21989
  1 sibling, 0 replies; 12+ messages in thread
From: Anthony Foiani @ 2013-05-23 15:49 UTC (permalink / raw)
  To: Xie Shaohui-B21989; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org

Shaohui --

Apologies, a minor clarification is needed:

Anthony Foiani <tkil@scrye.com> writes:

> Shaohui --
>
> Xie Shaohui-B21989 <B21989@freescale.com> writes:
>
>> If stop NOR write, could the SATA recover and work?
>
> Earlier in my development, I was seeing this error and it would
> recover:
>
>   ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen
>   ata2: PHY RDY changed
>   ata2: hard resetting link
>   ata2: Signature Update detected @ 0 msecs
>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
>   ata2.00: configured for UDMA/133
>   ata2: EH complete

In this case, it would recover *even as the NOR write continued*.

Here's an example where it froze and recovered twice.  The application
starts about 12s after the kernel, so 945s for the kernel should be
933s for the application.

Also, note that this case already has the speed limit code included
(see message at 945.928702), so I don't think I need to do a separate
test.

  [console]
  [  945.902543] ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen
  [  945.909584] ata2: PHY RDY changed
  [  945.913048] ata2: hard resetting link
  [  945.928702] ata2: setting speed (in hard reset)
  [  945.939864] ata2: Signature Update detected @ 0 msecs
  [  946.115888] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [  946.128548] ata2.00: configured for UDMA/133
  [  946.133021] ata2: EH complete
  [  952.537180] ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen
  [  952.544208] ata2: PHY RDY changed
  [  952.547626] ata2: hard resetting link
  [  952.558319] ata2: setting speed (in hard reset)
  [  953.076730] ata2: Signature Update detected @ 508 msecs
  [  953.251866] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  [  953.259389] ata2.00: configured for UDMA/133
  [  953.263892] ata2: EH complete

  [application]
  +924.152278 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x100000 from buf[0x0]; attempt 1/3
  +925.599739 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x110000 from buf[0x10000]; attempt 1/3
  +927.018239 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x120000 from buf[0x20000]; attempt 1/3
  +928.414069 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x130000 from buf[0x30000]; attempt 1/3
  +929.872850 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x140000 from buf[0x40000]; attempt 1/3
  +931.341634 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x150000 from buf[0x50000]; attempt 1/3
  +932.724024 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x160000 from buf[0x60000]; attempt 1/3
  [first freeze is about here]
  +934.146203 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x170000 from buf[0x70000]; attempt 1/3
  +935.569069 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x180000 from buf[0x80000]; attempt 1/3
  +936.875275 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x190000 from buf[0x90000]; attempt 1/3
  +938.205302 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1a0000 from buf[0xa0000]; attempt 1/3
  +939.519662 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1b0000 from buf[0xb0000]; attempt 1/3
  +940.873656 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1c0000 from buf[0xc0000]; attempt 1/3
  [second freeze is about here]
  +942.230740 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1d0000 from buf[0xd0000]; attempt 1/3
  +943.641994 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1e0000 from buf[0xe0000]; attempt 1/3
  +944.938454 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x1f0000 from buf[0xf0000]; attempt 1/3
  +946.236491 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x200000 from buf[0x100000]; attempt 1/3
  +947.607673 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x210000 from buf[0x110000]; attempt 1/3
  +948.919213 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x220000 from buf[0x120000]; attempt 1/3
  +950.151386 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x230000 from buf[0x130000]; attempt 1/3
  +951.502522 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x240000 from buf[0x140000]; attempt 1/3
  +952.851177 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x250000 from buf[0x150000]; attempt 1/3
  +954.082897 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x260000 from buf[0x160000]; attempt 1/3
  +955.315338 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x270000 from buf[0x170000]; attempt 1/3
  +956.559639 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x280000 from buf[0x180000]; attempt 1/3
  +957.845503 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x290000 from buf[0x190000]; attempt 1/3
  +959.100007 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x2a0000 from buf[0x1a0000]; attempt 1/3
  +960.347982 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x2b0000 from buf[0x1b0000]; attempt 1/3
  +961.545344 sw-upd.0 [29]: fm: nor0: write: writing 0x10000 @0x2c0000 from buf[0x1c0000]; attempt 1/3
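
The timestamp correlation described above (kernel time minus roughly 12 s gives application time) can be sketched as follows. The 12 s offset is the approximate figure from this message, and the function names are made up for illustration.

```c
#include <stddef.h>

/* Approximate offset from this message: the application starts ~12 s after the kernel. */
#define APP_START_OFFSET_S 12.0

/* Convert a kernel log timestamp (seconds) to the application's time base. */
double kernel_to_app_time(double kernel_ts)
{
    return kernel_ts - APP_START_OFFSET_S;
}

/*
 * Given write start times sorted in ascending order, return the index of
 * the last write that started at or before t, or -1 if there is none.
 */
int last_write_before(const double *starts, size_t n, double t)
{
    int idx = -1;
    size_t i;

    for (i = 0; i < n && starts[i] <= t; i++)
        idx = (int)i;
    return idx;
}
```

For the first freeze, kernel time 945.902543 becomes roughly 933.9 in application time, which falls after the write that started at +932.724024 and before the one at +934.146203. Because the offset is only approximate, the result is good to about a second at best.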

Best regards,
Anthony Foiani


* RE: SATA hang on 8315E triggered by heavy flash write?
  2013-05-23 15:10           ` Anthony Foiani
  2013-05-23 15:49             ` Anthony Foiani
@ 2013-05-27  7:50             ` Xie Shaohui-B21989
  2013-05-28  0:29               ` Anthony Foiani
  1 sibling, 1 reply; 12+ messages in thread
From: Xie Shaohui-B21989 @ 2013-05-27  7:50 UTC (permalink / raw)
  To: Anthony Foiani; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org

> > So it seems the NOR write break the signal Integrity of SATA.
> > I don't have schematic and board right now, could you please measure
> > signals related to NOR write to see if anything abnormal? Is the board
> > use FPGA or CPLD to control signal?
>
> I'll have to pass these questions on to my hardware vendor; I'm not
> equipped to do this level of hardware debugging (neither hardware nor
> knowledge!).
>
> > If stop NOR write, could the SATA recover and work?
>
> Earlier in my development, I was seeing this error and it would
> recover:
>
>   ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen
>   ata2: PHY RDY changed
>   ata2: hard resetting link
>   ata2: Signature Update detected @ 0 msecs
>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
>   ata2.00: configured for UDMA/133
>   ata2: EH complete
>
> At the current time, however, it seems that it does not recover.
>
> I don't know whether this is due to the speed limiting code, or if it's
> because we are doing more disk accesses (when the actual product is up
> and running).
[S.H] It seems it's not due to the speed limiting code; 1.5Gbps is still
used to recover the link.
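
The "SStatus 113 SControl 310" values in the log can be split into their fields to check the negotiated and permitted speed. This is a minimal sketch, assuming the standard SATA SCR layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and that the log prints both registers in hex; the struct and function names are hypothetical.

```c
/* Fields shared by the SStatus and SControl registers. */
struct scr_fields {
    unsigned int det; /* bits 3:0  - device detection */
    unsigned int spd; /* bits 7:4  - speed (negotiated in SStatus, limit in SControl) */
    unsigned int ipm; /* bits 11:8 - interface power management */
};

/* Split a raw SStatus or SControl value into its three low fields. */
struct scr_fields decode_scr(unsigned int reg)
{
    struct scr_fields f;

    f.det = reg & 0xf;
    f.spd = (reg >> 4) & 0xf;
    f.ipm = (reg >> 8) & 0xf;
    return f;
}

/* Map an SPD field value to a link rate; 0 means none / no restriction. */
const char *spd_rate(unsigned int spd)
{
    switch (spd) {
    case 1:  return "1.5 Gbps";
    case 2:  return "3.0 Gbps";
    default: return "none/unrestricted";
    }
}
```

SStatus 0x113 decodes to DET=3 (device present, PHY communication established), SPD=1 and IPM=1 (active), and SControl 0x310 decodes to SPD=1, meaning the link is limited to Gen1. Both are consistent with the 1.5 Gbps reported in the log.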

>
> I can re-do the tests with the speed limit disabled, but I won't be able
> to get to that for a few hours yet.  You can read about the speed limit
> issues in this thread:
>
>   http://article.gmane.org/gmane.linux.ports.ppc.embedded/50652
>
> And my final patch (yes, a year later):
>
>   http://article.gmane.org/gmane.linux.ports.ppc.embedded/58969
[S.H] For the speed limit issue, I checked the 3.4.rc7 kernel; there seems
to be a place that can be used to limit the speed for 8315:
	if (!of_device_is_compatible(ofdev->dev.of_node, "fsl,mpc8315-sata")) {
		temp = ioread32(csr_base + TRANSCFG);
		temp = temp & 0xffffffe0;
		iowrite32(temp | TRANSCFG_RX_WATER_MARK, csr_base + TRANSCFG);
	} else {
		/* The speed limitation code for 8315 could be put here:
		 * just move the original code wrapped by
		 * "#ifdef CONFIG_MPC8315_DS" here.
		 * Please let me know if you will give it a try. */
	}
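
The first branch of the snippet above is a plain read-modify-write of the low five bits of TRANSCFG. A userspace sketch of just that arithmetic follows; the function name is hypothetical, and the watermark is passed in as a parameter because the driver's actual TRANSCFG_RX_WATER_MARK value is not shown in this thread.

```c
#include <stdint.h>

/*
 * Clear bits 4:0 of reg and OR in value, mirroring
 * "temp & 0xffffffe0" followed by "temp | TRANSCFG_RX_WATER_MARK".
 * Like the snippet, this does not mask value, so the caller is
 * expected to pass something that fits in five bits.
 */
uint32_t set_low5(uint32_t reg, uint32_t value)
{
    return (reg & 0xffffffe0u) | value;
}
```

All upper bits of the register are preserved, so the write only changes the field being programmed.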


Best Regards,
Shaohui Xie


* Re: SATA hang on 8315E triggered by heavy flash write?
  2013-05-27  7:50             ` Xie Shaohui-B21989
@ 2013-05-28  0:29               ` Anthony Foiani
  2013-05-30  7:32                 ` Xie Shaohui-B21989
  0 siblings, 1 reply; 12+ messages in thread
From: Anthony Foiani @ 2013-05-28  0:29 UTC (permalink / raw)
  To: Xie Shaohui-B21989; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org

Shaohui --

Once again, thanks for the reply.

Xie Shaohui-B21989 <B21989@freescale.com> writes:

> it seems [recovery or lack of recovery is] not due to speed limiting
> code, 1.5Gbps is still used to recover link.

Right, I noticed this in my later email on this topic.

> for the speed limit issue, I checked 3.4.rc7 kernel, there seems a
> place can be used to limit the speed for 8315:
>
> 	if (!of_device_is_compatible(ofdev->dev.of_node, "fsl,mpc8315-sata")) {
> 		temp = ioread32(csr_base + TRANSCFG);
> 		temp = temp & 0xffffffe0;
> 		iowrite32(temp | TRANSCFG_RX_WATER_MARK, csr_base + TRANSCFG);
> 	} else {
> 		/* the speed limitation code for 8315 may can be put here.
> 		* just move the original code which wrapped by "#ifdef CONFIG_MPC8315_DS" here.
> 		* please let me know if you will give a try. */
> 	}

It's not clear that all uses of the MPC8315 SATA controller have this
problem.  It obviously occurred on the 8315DS, but apparently that
board never made it to production; it might or might not happen on the
8315ERDB; and it clearly happens on my vendor's board.

Given this lack of knowledge, Scott Wood was very hesitant to
implement any far-reaching changes, because we could not pin down
exactly what had to be tested.

My original patch:

  http://article.gmane.org/gmane.linux.ports.ppc.embedded/58710

did the speed limiting directly in the sata_fsl code; Jeff Garzik
mentioned that there was some existing infrastructure for this, and I
inferred that he would prefer that I use it.

So that's what my final patch does, and it's keyed off a single OF
value.

There's a small chance that I'll have the opportunity to move the
project to 3.9 (especially if that release gets declared
long-term-stable).  Even then, I unfortunately won't have the
bandwidth to pursue getting any of these approaches into mainline.
The best I can do is get them onto the lists so that others might be
able to benefit later.

Either way, thanks again.  I'll try to put together a package for my
vendor to test with; once I have demonstrated that there is a problem
with their hardware, they have been gracious about accepting that
result and pursuing it with Freescale if necessary.

Thanks again for your help!

Best regards,
Anthony Foiani


* RE: SATA hang on 8315E triggered by heavy flash write?
  2013-05-28  0:29               ` Anthony Foiani
@ 2013-05-30  7:32                 ` Xie Shaohui-B21989
  2013-06-01  4:24                   ` Anthony Foiani
  0 siblings, 1 reply; 12+ messages in thread
From: Xie Shaohui-B21989 @ 2013-05-30  7:32 UTC (permalink / raw)
  To: Anthony Foiani; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org

Hi, Anthony Foiani,

I found an MPC8315ERDB rev1.0 board and did some tests.
First, there is no speed limit issue on the board, so it seems it may only
happen on the MPC8315DS board.
Second, SATA works well with NOR write operations on the ERDB board.
So the two issues you are seeing should be board issues.


Best Regards,
Shaohui Xie


* Re: SATA hang on 8315E triggered by heavy flash write?
  2013-05-30  7:32                 ` Xie Shaohui-B21989
@ 2013-06-01  4:24                   ` Anthony Foiani
  0 siblings, 0 replies; 12+ messages in thread
From: Anthony Foiani @ 2013-06-01  4:24 UTC (permalink / raw)
  To: Xie Shaohui-B21989; +Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org

Shaohui, greetings --

Xie Shaohui-B21989 <B21989@freescale.com> writes:

> I found a MPC8315ERDB rev1.0 board and did some tests.

I bet that was fun.  :) Thanks for going the extra mile and finding
that hardware.  Were you able to unearth an 8315DS of any sort?

> First there is no limit speed issue on the board, so it seems it may
> only happen on the MPC8315DS board.

To be clear, the board we're using does boot and run just fine at
3Gbps most of the time; the CONFIG_MPC8315_DS fix was one suggested by
our vendor, but even then, I suspect it was basically prophylactic.

Or, perhaps, it was conflated with the NOR / SATA issue -- see below.

> Second, the SATA can work well with NOR write operation on the ERDB
> board. So the two issues happened to you should be board issues.

Very possibly!

Our vendor has identified at least one possible error with the wiring
/ routing on this board, and has suggested a hardware modification.
Their fix makes sense, but any hardware modification introduces the
risk of breaking one of the few prototype boards.

Since we're very close to software delivery, and we have a workaround
in hand -- namely, don't write to the disk during flash operations -- my
team has decided that we'll go with the software workaround until
initial delivery.
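
One way such a workaround could be structured is an advisory lock that both the flash updater and the disk writers take before doing I/O. The sketch below uses flock(2) on a lock file; it is an illustration under that assumption, not the implementation used in the product, and the function names and lock path are hypothetical.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Take the exclusive I/O lock on the file at path. With nonblock set,
 * fail instead of waiting when another holder has it. Returns an fd
 * (>= 0) that represents the held lock, or -1 on failure.
 */
int io_lock_acquire(const char *path, int nonblock)
{
    int fd = open(path, O_CREAT | O_RDWR, 0644);

    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | (nonblock ? LOCK_NB : 0)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Drop the lock taken by io_lock_acquire() and close its fd. */
void io_lock_release(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

The flash updater would hold the lock for the duration of each NOR write, and disk writers would take it around their own writes, so the two never overlap. The lock is advisory: it only protects against processes that also take it.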

We should be able to do this modification and associated testing in
mid-July; at that point, I'll report back with our findings.

Thanks again for all your help; you and Scott have been extremely
helpful and have provided excellent support.  Apologies if it turns
out that it was all due to a wiring error.  :(

Best regards,
Anthony Foiani


end of thread, other threads:[~2013-06-01  4:24 UTC | newest]

Thread overview: 12+ messages
2013-05-15  8:12 SATA hang on 8315E triggered by heavy flash write? Anthony Foiani
2013-05-21 21:44 ` Scott Wood
2013-05-22  4:16   ` Anthony Foiani
2013-05-22  6:15     ` Xie Shaohui-B21989
2013-05-23  5:52       ` Anthony Foiani
2013-05-23  6:04         ` Xie Shaohui-B21989
2013-05-23 15:10           ` Anthony Foiani
2013-05-23 15:49             ` Anthony Foiani
2013-05-27  7:50             ` Xie Shaohui-B21989
2013-05-28  0:29               ` Anthony Foiani
2013-05-30  7:32                 ` Xie Shaohui-B21989
2013-06-01  4:24                   ` Anthony Foiani
