* Possible regression in 8250_dw driver @ 2023-05-23 8:59 Richard Tresidder 2023-05-23 11:13 ` Ilpo Järvinen 2023-05-24 11:20 ` Linux regression tracking #adding (Thorsten Leemhuis) 0 siblings, 2 replies; 12+ messages in thread From: Richard Tresidder @ 2023-05-23 8:59 UTC (permalink / raw) To: linux-serial Hi We seem to be getting corruption of received data from a ublox GPS unit To me it looks like a fifo overrun of some sort?? background: I'm attempting to use 6.3.3 as a new base for one of our systems. Previously it was using 5.1.7 as a base. The uart in question is one of the two in the Cyclone V SOC HPS. And to muddy the waters the linux console TTYS0 is the other Uart from the same HPS core Yet the console appears to be working ok. Note all other libs and apps are at the same revision and build, it is only the kernel that is different. Both versions of the kernel are also built using the same bitbake bsdk.. Seeing the following with 6.3.3: 00000000: 45 58 54 20 43 4F 52 45 20 33 2E 30 31 20 28 31 | EXT CORE 3.01 (1 00000010: 31 31 31 34 31 29 00 00 00 00 00 00 00 00 30 30 | 11141)........00 00000020: 30 38 30 30 30 30 00 00 52 4F 4D 20 42 41 53 45 | 080000..ROM BASE 00000030: 20 32 2E 30 31 20 28 37 35 33 33 31 53 00 00 00 | 2.01 (75331S... 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 53 42 41 53 | ............SBAS 00000050: 3B 49 4D 45 53 3B 51 5A 53 53 00 00 00 00 00 00 | ;IMES;QZSS...... 00000060: 00 00 00 00 00 00 00 00 00 00 01 3D 29 00 00 00 | ...........=)... 00000070: 00 00 00 00 00 00 46 57 56 45 52 3D 54 49 4D 20 | ......FWVER=TIM 00000080: 31 2E 31 30 00 00 00 00 00 00 00 00 00 00 00 00 | 1.10............ 00000090: 00 00 00 00 50 52 4F 54 56 45 52 3D 32 32 2E 30 | ....PROTVER=22.0 000000a0: 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 0............... 000000b0: 00 00 4D 4F 44 3D 4C 45 41 2D 4D 38 54 2D 30 00 | ..MOD=LEA-M8T-0. 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 000000d0: 46 49 53 3D 30 78 45 46 34 30 31 35 20 28 31 30 | FIS=0xEF4015 (10 000000e0: 30 31 31 31 29 00 00 00 00 00 00 00 00 00 47 50 | 0111).........GP 000000f0: 53 3B 47 4C 4F 3B 47 41 4C 3B 42 44 00 00 00 00 | S;GLO;GAL;BD.... 00000100: 00 00 | .. But should be seeing this as shown on 5.1.7: Excuse the offset (due to this frame also showing the packet id's and lengths) But the body of the frame is what we should be seeing. 00000000: B5 62 0A 04 FA 00 45 58 54 20 43 4F 52 45 20 33 | µb..ú.EXT CORE 3 00000010: 2E 30 31 20 28 31 31 31 31 34 31 29 00 00 00 00 | .01 (111141).... 00000020: 00 00 00 00 30 30 30 38 30 30 30 30 00 00 52 4F | ....00080000..RO 00000030: 4D 20 42 41 53 45 20 32 2E 30 31 20 28 37 35 33 | M BASE 2.01 (753 00000040: 33 31 29 00 00 00 00 00 00 00 00 00 46 57 56 45 | 31).........FWVE 00000050: 52 3D 54 49 4D 20 31 2E 31 30 00 00 00 00 00 00 | R=TIM 1.10...... 00000060: 00 00 00 00 00 00 00 00 00 00 50 52 4F 54 56 45 | ..........PROTVE 00000070: 52 3D 32 32 2E 30 30 00 00 00 00 00 00 00 00 00 | R=22.00......... 00000080: 00 00 00 00 00 00 00 00 4D 4F 44 3D 4C 45 41 2D | ........MOD=LEA- 00000090: 4D 38 54 2D 30 00 00 00 00 00 00 00 00 00 00 00 | M8T-0........... 000000A0: 00 00 00 00 00 00 46 49 53 3D 30 78 45 46 34 30 | ......FIS=0xEF40 000000B0: 31 35 20 28 31 30 30 31 31 31 29 00 00 00 00 00 | 15 (100111)..... 000000C0: 00 00 00 00 47 50 53 3B 47 4C 4F 3B 47 41 4C 3B | ....GPS;GLO;GAL; 000000D0: 42 44 53 00 00 00 00 00 00 00 00 00 00 00 00 00 | BDS............. 000000E0: 00 00 53 42 41 53 3B 49 4D 45 53 3B 51 5A 53 53 | ..SBAS;IMES;QZSS 000000F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ 00000100: 01 3D | .=. As you can see it looks like the frame thats received on the 6.3.3 kernel is mangled? This same message is just being requested over and over again from the GPS unit. The offset where the tears occur looks to be pretty similar between each poll request. Usually the 1 at the end of the (75331 is where the first tear occurs. I'd appreciate some quidance in how to track this down as there appears to have been a reasonable amount of work done to this driver and the serial core between these two versions. Cheers Richard Tresidder ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-23 8:59 Possible regression in 8250_dw driver Richard Tresidder @ 2023-05-23 11:13 ` Ilpo Järvinen 2023-05-24 7:52 ` Richard Tresidder 2023-05-24 11:20 ` Linux regression tracking #adding (Thorsten Leemhuis) 1 sibling, 1 reply; 12+ messages in thread From: Ilpo Järvinen @ 2023-05-23 11:13 UTC (permalink / raw) To: Richard Tresidder; +Cc: linux-serial [-- Attachment #1: Type: text/plain, Size: 4798 bytes --] On Tue, 23 May 2023, Richard Tresidder wrote: > Hi > We seem to be getting corruption of received data from a ublox GPS unit > To me it looks like a fifo overrun of some sort?? Overruns should be logged (in dmesg or /proc/tty/driver/serial). > background: > I'm attempting to use 6.3.3 as a new base for one of our systems. > Previously it was using 5.1.7 as a base. > The uart in question is one of the two in the Cyclone V SOC HPS. > And to muddy the waters the linux console TTYS0 is the other Uart from the > same HPS core > Yet the console appears to be working ok. Maybe some of the DMA related changes triggering some unexpected behavior. Console doesn't use DMA so that could explain the difference. > Note all other libs and apps are at the same revision and build, it is only > the kernel that is different. > Both versions of the kernel are also built using the same bitbake bsdk.. > > Seeing the following with 6.3.3: > > 00000000: 45 58 54 20 43 4F 52 45 20 33 2E 30 31 20 28 31 | EXT CORE 3.01 (1 > 00000010: 31 31 31 34 31 29 00 00 00 00 00 00 00 00 30 30 | 11141)........00 > 00000020: 30 38 30 30 30 30 00 00 52 4F 4D 20 42 41 53 45 | 080000..ROM BASE > 00000030: 20 32 2E 30 31 20 28 37 35 33 33 31 53 00 00 00 | 2.01 (75331S... > 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 53 42 41 53 | ............SBAS > 00000050: 3B 49 4D 45 53 3B 51 5A 53 53 00 00 00 00 00 00 | ;IMES;QZSS...... > 00000060: 00 00 00 00 00 00 00 00 00 00 01 3D 29 00 00 00 | ...........=)... > 00000070: 00 00 00 00 00 00 46 57 56 45 52 3D 54 49 4D 20 | ......FWVER=TIM > 00000080: 31 2E 31 30 00 00 00 00 00 00 00 00 00 00 00 00 | 1.10............ > 00000090: 00 00 00 00 50 52 4F 54 56 45 52 3D 32 32 2E 30 | ....PROTVER=22.0 > 000000a0: 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 0............... > 000000b0: 00 00 4D 4F 44 3D 4C 45 41 2D 4D 38 54 2D 30 00 | ..MOD=LEA-M8T-0. > 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ > 000000d0: 46 49 53 3D 30 78 45 46 34 30 31 35 20 28 31 30 | FIS=0xEF4015 (10 > 000000e0: 30 31 31 31 29 00 00 00 00 00 00 00 00 00 47 50 | 0111).........GP > 000000f0: 53 3B 47 4C 4F 3B 47 41 4C 3B 42 44 00 00 00 00 | S;GLO;GAL;BD.... > 00000100: 00 00 | .. > > But should be seeing this as shown on 5.1.7: > Excuse the offset (due to this frame also showing the packet id's and lengths) > But the body of the frame is what we should be seeing. > > 00000000: B5 62 0A 04 FA 00 45 58 54 20 43 4F 52 45 20 33 | µb..ú.EXT CORE 3 > 00000010: 2E 30 31 20 28 31 31 31 31 34 31 29 00 00 00 00 | .01 (111141).... > 00000020: 00 00 00 00 30 30 30 38 30 30 30 30 00 00 52 4F | ....00080000..RO > 00000030: 4D 20 42 41 53 45 20 32 2E 30 31 20 28 37 35 33 | M BASE 2.01 (753 > 00000040: 33 31 29 00 00 00 00 00 00 00 00 00 46 57 56 45 | 31).........FWVE > 00000050: 52 3D 54 49 4D 20 31 2E 31 30 00 00 00 00 00 00 | R=TIM 1.10...... > 00000060: 00 00 00 00 00 00 00 00 00 00 50 52 4F 54 56 45 | ..........PROTVE > 00000070: 52 3D 32 32 2E 30 30 00 00 00 00 00 00 00 00 00 | R=22.00......... > 00000080: 00 00 00 00 00 00 00 00 4D 4F 44 3D 4C 45 41 2D | ........MOD=LEA- > 00000090: 4D 38 54 2D 30 00 00 00 00 00 00 00 00 00 00 00 | M8T-0........... > 000000A0: 00 00 00 00 00 00 46 49 53 3D 30 78 45 46 34 30 | ......FIS=0xEF40 > 000000B0: 31 35 20 28 31 30 30 31 31 31 29 00 00 00 00 00 | 15 (100111)..... > 000000C0: 00 00 00 00 47 50 53 3B 47 4C 4F 3B 47 41 4C 3B | ....GPS;GLO;GAL; > 000000D0: 42 44 53 00 00 00 00 00 00 00 00 00 00 00 00 00 | BDS............. > 000000E0: 00 00 53 42 41 53 3B 49 4D 45 53 3B 51 5A 53 53 | ..SBAS;IMES;QZSS > 000000F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ > 00000100: 01 3D | .=. > > As you can see it looks like the frame thats received on the 6.3.3 kernel is > mangled? > This same message is just being requested over and over again from the GPS > unit. > > The offset where the tears occur looks to be pretty similar between each poll > request. > Usually the 1 at the end of the (75331 is where the first tear occurs. > > I'd appreciate some quidance in how to track this down as there appears to > have been a reasonable amount of work done to this driver and the serial core > between these two versions. A few ideas: - try without dma_rx_complete() calling p->dma->rx_dma(p) - revert 90b8596ac46043e4a782d9111f5b285251b13756 - Try the revert in https://lore.kernel.org/all/316ab583-d217-a332-d161-8225b0cee227@redhat.com/2-0001-Revert-serial-8250-use-THRE-__stop_tx-also-with-DMA.patch (for e8ffbb71f783 and f8d6e9d3ca5c) But finding the culprit with git bisect would be the most helpful here. -- i. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-23 11:13 ` Ilpo Järvinen @ 2023-05-24 7:52 ` Richard Tresidder 2023-05-24 10:06 ` Ilpo Järvinen 0 siblings, 1 reply; 12+ messages in thread From: Richard Tresidder @ 2023-05-24 7:52 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: linux-serial On 23/05/2023 7:13 pm, Ilpo Järvinen wrote: > On Tue, 23 May 2023, Richard Tresidder wrote: > >> Hi >> We seem to be getting corruption of received data from a ublox GPS unit >> To me it looks like a fifo overrun of some sort?? > Overruns should be logged (in dmesg or /proc/tty/driver/serial). Hmm wasn't seeing anything in either. > >> background: >> I'm attempting to use 6.3.3 as a new base for one of our systems. >> Previously it was using 5.1.7 as a base. >> The uart in question is one of the two in the Cyclone V SOC HPS. >> And to muddy the waters the linux console TTYS0 is the other Uart from the >> same HPS core >> Yet the console appears to be working ok. > Maybe some of the DMA related changes triggering some unexpected behavior. > > Console doesn't use DMA so that could explain the difference. > >> Note all other libs and apps are at the same revision and build, it is only >> the kernel that is different. >> Both versions of the kernel are also built using the same bitbake bsdk.. >> >> Seeing the following with 6.3.3: >> >> 00000000: 45 58 54 20 43 4F 52 45 20 33 2E 30 31 20 28 31 | EXT CORE 3.01 (1 >> 00000010: 31 31 31 34 31 29 00 00 00 00 00 00 00 00 30 30 | 11141)........00 >> 00000020: 30 38 30 30 30 30 00 00 52 4F 4D 20 42 41 53 45 | 080000..ROM BASE >> 00000030: 20 32 2E 30 31 20 28 37 35 33 33 31 53 00 00 00 | 2.01 (75331S... >> 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 53 42 41 53 | ............SBAS >> 00000050: 3B 49 4D 45 53 3B 51 5A 53 53 00 00 00 00 00 00 | ;IMES;QZSS...... >> 00000060: 00 00 00 00 00 00 00 00 00 00 01 3D 29 00 00 00 | ...........=)... >> 00000070: 00 00 00 00 00 00 46 57 56 45 52 3D 54 49 4D 20 | ......FWVER=TIM >> 00000080: 31 2E 31 30 00 00 00 00 00 00 00 00 00 00 00 00 | 1.10............ >> 00000090: 00 00 00 00 50 52 4F 54 56 45 52 3D 32 32 2E 30 | ....PROTVER=22.0 >> 000000a0: 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 0............... >> 000000b0: 00 00 4D 4F 44 3D 4C 45 41 2D 4D 38 54 2D 30 00 | ..MOD=LEA-M8T-0. >> 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ >> 000000d0: 46 49 53 3D 30 78 45 46 34 30 31 35 20 28 31 30 | FIS=0xEF4015 (10 >> 000000e0: 30 31 31 31 29 00 00 00 00 00 00 00 00 00 47 50 | 0111).........GP >> 000000f0: 53 3B 47 4C 4F 3B 47 41 4C 3B 42 44 00 00 00 00 | S;GLO;GAL;BD.... >> 00000100: 00 00 | .. >> >> But should be seeing this as shown on 5.1.7: >> Excuse the offset (due to this frame also showing the packet id's and lengths) >> But the body of the frame is what we should be seeing. >> >> 00000000: B5 62 0A 04 FA 00 45 58 54 20 43 4F 52 45 20 33 | µb..ú.EXT CORE 3 >> 00000010: 2E 30 31 20 28 31 31 31 31 34 31 29 00 00 00 00 | .01 (111141).... >> 00000020: 00 00 00 00 30 30 30 38 30 30 30 30 00 00 52 4F | ....00080000..RO >> 00000030: 4D 20 42 41 53 45 20 32 2E 30 31 20 28 37 35 33 | M BASE 2.01 (753 >> 00000040: 33 31 29 00 00 00 00 00 00 00 00 00 46 57 56 45 | 31).........FWVE >> 00000050: 52 3D 54 49 4D 20 31 2E 31 30 00 00 00 00 00 00 | R=TIM 1.10...... >> 00000060: 00 00 00 00 00 00 00 00 00 00 50 52 4F 54 56 45 | ..........PROTVE >> 00000070: 52 3D 32 32 2E 30 30 00 00 00 00 00 00 00 00 00 | R=22.00......... >> 00000080: 00 00 00 00 00 00 00 00 4D 4F 44 3D 4C 45 41 2D | ........MOD=LEA- >> 00000090: 4D 38 54 2D 30 00 00 00 00 00 00 00 00 00 00 00 | M8T-0........... >> 000000A0: 00 00 00 00 00 00 46 49 53 3D 30 78 45 46 34 30 | ......FIS=0xEF40 >> 000000B0: 31 35 20 28 31 30 30 31 31 31 29 00 00 00 00 00 | 15 (100111)..... >> 000000C0: 00 00 00 00 47 50 53 3B 47 4C 4F 3B 47 41 4C 3B | ....GPS;GLO;GAL; >> 000000D0: 42 44 53 00 00 00 00 00 00 00 00 00 00 00 00 00 | BDS............. >> 000000E0: 00 00 53 42 41 53 3B 49 4D 45 53 3B 51 5A 53 53 | ..SBAS;IMES;QZSS >> 000000F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ >> 00000100: 01 3D | .=. >> >> As you can see it looks like the frame thats received on the 6.3.3 kernel is >> mangled? >> This same message is just being requested over and over again from the GPS >> unit. >> >> The offset where the tears occur looks to be pretty similar between each poll >> request. >> Usually the 1 at the end of the (75331 is where the first tear occurs. >> >> I'd appreciate some quidance in how to track this down as there appears to >> have been a reasonable amount of work done to this driver and the serial core >> between these two versions. > A few ideas: > - try without dma_rx_complete() calling p->dma->rx_dma(p) > - revert 90b8596ac46043e4a782d9111f5b285251b13756 > - Try the revert in https://lore.kernel.org/all/316ab583-d217-a332-d161-8225b0cee227@redhat.com/2-0001-Revert-serial-8250-use-THRE-__stop_tx-also-with-DMA.patch > (for e8ffbb71f783 and f8d6e9d3ca5c) > > But finding the culprit with git bisect would be the most helpful here. > Bisect wasn't an easy option as I'd applied a pile of patches after the interesting commits for our system to run. I'm not a git master :) So I just started reverting the patches that had been applied to the 8250 folder. Worked backwards from head. After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce serial: 8250_dma: Fix DMA Rx rearm race Things started to work again. I reset everything and then just reverted that individual patch and things work. So that looks like the culprit.. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-24 7:52 ` Richard Tresidder @ 2023-05-24 10:06 ` Ilpo Järvinen 2023-05-25 3:44 ` Richard Tresidder 0 siblings, 1 reply; 12+ messages in thread From: Ilpo Järvinen @ 2023-05-24 10:06 UTC (permalink / raw) To: Richard Tresidder; +Cc: linux-serial [-- Attachment #1: Type: text/plain, Size: 4681 bytes --] On Wed, 24 May 2023, Richard Tresidder wrote: > On 23/05/2023 7:13 pm, Ilpo Järvinen wrote: > > On Tue, 23 May 2023, Richard Tresidder wrote: > > > > > background: > > > I'm attempting to use 6.3.3 as a new base for one of our systems. > > > Previously it was using 5.1.7 as a base. > > > The uart in question is one of the two in the Cyclone V SOC HPS. > > > And to muddy the waters the linux console TTYS0 is the other Uart from the > > > same HPS core > > > Yet the console appears to be working ok. > > Maybe some of the DMA related changes triggering some unexpected behavior. > > > > Console doesn't use DMA so that could explain the difference. > > > > > Note all other libs and apps are at the same revision and build, it is > > > only > > > the kernel that is different. > > > Both versions of the kernel are also built using the same bitbake bsdk.. > > > > > > > > > As you can see it looks like the frame thats received on the 6.3.3 kernel > > > is > > > mangled? > > > This same message is just being requested over and over again from the GPS > > > unit. > > > > > > The offset where the tears occur looks to be pretty similar between each > > > poll > > > request. > > > Usually the 1 at the end of the (75331 is where the first tear occurs. > > > > > > I'd appreciate some quidance in how to track this down as there appears to > > > have been a reasonable amount of work done to this driver and the serial > > > core > > > between these two versions. > > A few ideas: > > - try without dma_rx_complete() calling p->dma->rx_dma(p) > > - revert 90b8596ac46043e4a782d9111f5b285251b13756 > > - Try the revert in > > https://lore.kernel.org/all/316ab583-d217-a332-d161-8225b0cee227@redhat.com/2-0001-Revert-serial-8250-use-THRE-__stop_tx-also-with-DMA.patch > > (for e8ffbb71f783 and f8d6e9d3ca5c) > > > > But finding the culprit with git bisect would be the most helpful here. > > > Bisect wasn't an easy option as I'd applied a pile of patches after the > interesting commits for our system to run. > I'm not a git master :) > > So I just started reverting the patches that had been applied to the 8250 > folder. > Worked backwards from head. > > After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce > serial: 8250_dma: Fix DMA Rx rearm race > > Things started to work again. > > I reset everything and then just reverted that individual patch and things > work. > So that looks like the culprit.. Okay, that also worked great, thanks a lot. It gives something concrete to work with to find a fix. The commit itself looks very much correct, however, my guess is that when serial8250_rx_dma_flush() does dmaengine_pause() dma_status somehow does not become DMA_PAUSED which leads to not properly flushing DMA Rx transaction. Can you try the following debug patch (if my guess is correct, besides triggering the WARN_ON_ONCE, it also works around the issue): [PATCH] DEBUG: 8250_dma: Detect DMA_PAUSED vs DMA_IN_PROGRESS inconsistency Detect DMA state not returning DMA_PAUSED after dmaengine_pause() was issued. Not intended for upstream. --- drivers/tty/serial/8250/8250_dma.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c index 7fa66501792d..3dedacd9f104 100644 --- a/drivers/tty/serial/8250/8250_dma.c +++ b/drivers/tty/serial/8250/8250_dma.c @@ -38,7 +38,7 @@ static void __dma_tx_complete(void *param) spin_unlock_irqrestore(&p->port.lock, flags); } -static void __dma_rx_complete(struct uart_8250_port *p) +static void __dma_rx_complete(struct uart_8250_port *p, bool paused) { struct uart_8250_dma *dma = p->dma; struct tty_port *tty_port = &p->port.state->port; @@ -52,8 +52,12 @@ static void __dma_rx_complete(struct uart_8250_port *p) * anything in such case. */ dma_status = dmaengine_tx_status(dma->rxchan, dma->rx_cookie, &state); - if (dma_status == DMA_IN_PROGRESS) - return; + if (dma_status == DMA_IN_PROGRESS) { + if (!paused) + return; + + WARN_ON_ONCE(paused); + } count = dma->rx_size - state.residue; @@ -72,7 +76,7 @@ static void dma_rx_complete(void *param) spin_lock_irqsave(&p->port.lock, flags); if (dma->rx_running) - __dma_rx_complete(p); + __dma_rx_complete(p, false); /* * Cannot be combined with the previous check because __dma_rx_complete() @@ -172,7 +176,7 @@ void serial8250_rx_dma_flush(struct uart_8250_port *p) if (dma->rx_running) { dmaengine_pause(dma->rxchan); - __dma_rx_complete(p); + __dma_rx_complete(p, true); dmaengine_terminate_async(dma->rxchan); } } -- tg: (ac9a78681b92..) debug/dma-paused (depends on: tty-next) ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-24 10:06 ` Ilpo Järvinen @ 2023-05-25 3:44 ` Richard Tresidder 2023-05-25 8:43 ` Ilpo Järvinen 0 siblings, 1 reply; 12+ messages in thread From: Richard Tresidder @ 2023-05-25 3:44 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: linux-serial Richard Tresidder ** On 24/05/2023 6:06 pm, Ilpo Järvinen wrote: > On Wed, 24 May 2023, Richard Tresidder wrote: >> On 23/05/2023 7:13 pm, Ilpo Järvinen wrote: >>> On Tue, 23 May 2023, Richard Tresidder wrote: >>> >>>> background: >>>> I'm attempting to use 6.3.3 as a new base for one of our systems. >>>> Previously it was using 5.1.7 as a base. >>>> The uart in question is one of the two in the Cyclone V SOC HPS. >>>> And to muddy the waters the linux console TTYS0 is the other Uart from the >>>> same HPS core >>>> Yet the console appears to be working ok. >>> Maybe some of the DMA related changes triggering some unexpected behavior. >>> >>> Console doesn't use DMA so that could explain the difference. >>> >>>> Note all other libs and apps are at the same revision and build, it is >>>> only >>>> the kernel that is different. >>>> Both versions of the kernel are also built using the same bitbake bsdk.. >>>> >>>> >>>> As you can see it looks like the frame thats received on the 6.3.3 kernel >>>> is >>>> mangled? >>>> This same message is just being requested over and over again from the GPS >>>> unit. >>>> >>>> The offset where the tears occur looks to be pretty similar between each >>>> poll >>>> request. >>>> Usually the 1 at the end of the (75331 is where the first tear occurs. >>>> >>>> I'd appreciate some quidance in how to track this down as there appears to >>>> have been a reasonable amount of work done to this driver and the serial >>>> core >>>> between these two versions. >>> A few ideas: >>> - try without dma_rx_complete() calling p->dma->rx_dma(p) >>> - revert 90b8596ac46043e4a782d9111f5b285251b13756 >>> - Try the revert in >>> https://lore.kernel.org/all/316ab583-d217-a332-d161-8225b0cee227@redhat.com/2-0001-Revert-serial-8250-use-THRE-__stop_tx-also-with-DMA.patch >>> (for e8ffbb71f783 and f8d6e9d3ca5c) >>> >>> But finding the culprit with git bisect would be the most helpful here. >>> >> Bisect wasn't an easy option as I'd applied a pile of patches after the >> interesting commits for our system to run. >> I'm not a git master :) >> >> So I just started reverting the patches that had been applied to the 8250 >> folder. >> Worked backwards from head. >> >> After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce >> serial: 8250_dma: Fix DMA Rx rearm race >> >> Things started to work again. >> >> I reset everything and then just reverted that individual patch and things >> work. >> So that looks like the culprit.. > Okay, that also worked great, thanks a lot. It gives something concrete to > work with to find a fix. > > The commit itself looks very much correct, however, my guess is that when > serial8250_rx_dma_flush() does dmaengine_pause() dma_status somehow > does not become DMA_PAUSED which leads to not properly flushing DMA Rx > transaction. Can you try the following debug patch (if my guess is > correct, besides triggering the WARN_ON_ONCE, it also works around the > issue): > > [PATCH] DEBUG: 8250_dma: Detect DMA_PAUSED vs DMA_IN_PROGRESS inconsistency > > Detect DMA state not returning DMA_PAUSED after dmaengine_pause() was > issued. > > Not intended for upstream. Thanks Ilpo I can confirm that the patch below works. Got the WARN_ON_ONCE dump so it's taking that path. I think the problem here is that the pl330 DMA controller thats in these SOC's doesn't "really" support pause according to the driver. It doesn't look like it can ever set "DMA_PAUSED" I'm not sure of the flow through of that though. There is some commenting in the pl330 driver about the pause routine. Cheers Richard Tresidder > --- > drivers/tty/serial/8250/8250_dma.c | 14 +++++++++----- > 1 file changed, 9 insertions(+), 5 deletions(-) > > diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c > index 7fa66501792d..3dedacd9f104 100644 > --- a/drivers/tty/serial/8250/8250_dma.c > +++ b/drivers/tty/serial/8250/8250_dma.c > @@ -38,7 +38,7 @@ static void __dma_tx_complete(void *param) > spin_unlock_irqrestore(&p->port.lock, flags); > } > > -static void __dma_rx_complete(struct uart_8250_port *p) > +static void __dma_rx_complete(struct uart_8250_port *p, bool paused) > { > struct uart_8250_dma *dma = p->dma; > struct tty_port *tty_port = &p->port.state->port; > @@ -52,8 +52,12 @@ static void __dma_rx_complete(struct uart_8250_port *p) > * anything in such case. > */ > dma_status = dmaengine_tx_status(dma->rxchan, dma->rx_cookie, &state); > - if (dma_status == DMA_IN_PROGRESS) > - return; > + if (dma_status == DMA_IN_PROGRESS) { > + if (!paused) > + return; > + > + WARN_ON_ONCE(paused); > + } > > count = dma->rx_size - state.residue; > > @@ -72,7 +76,7 @@ static void dma_rx_complete(void *param) > > spin_lock_irqsave(&p->port.lock, flags); > if (dma->rx_running) > - __dma_rx_complete(p); > + __dma_rx_complete(p, false); > > /* > * Cannot be combined with the previous check because __dma_rx_complete() > @@ -172,7 +176,7 @@ void serial8250_rx_dma_flush(struct uart_8250_port *p) > > if (dma->rx_running) { > dmaengine_pause(dma->rxchan); > - __dma_rx_complete(p); > + __dma_rx_complete(p, true); > dmaengine_terminate_async(dma->rxchan); > } > } > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-25 3:44 ` Richard Tresidder @ 2023-05-25 8:43 ` Ilpo Järvinen 2023-05-25 9:27 ` Richard Tresidder 0 siblings, 1 reply; 12+ messages in thread From: Ilpo Järvinen @ 2023-05-25 8:43 UTC (permalink / raw) To: Richard Tresidder; +Cc: linux-serial [-- Attachment #1: Type: text/plain, Size: 7687 bytes --] On Thu, 25 May 2023, Richard Tresidder wrote: > > > > > Richard Tresidder > > > > > > > > > ** > > > > > > > On 24/05/2023 6:06 pm, Ilpo Järvinen wrote: > > > On Wed, 24 May 2023, Richard Tresidder wrote: > > > On 23/05/2023 7:13 pm, Ilpo Järvinen wrote: > > > > On Tue, 23 May 2023, Richard Tresidder wrote: > > > > > > > > > background: > > > > > I'm attempting to use 6.3.3 as a new base for one of our systems. > > > > > Previously it was using 5.1.7 as a base. > > > > > The uart in question is one of the two in the Cyclone V SOC HPS. > > > > > And to muddy the waters the linux console TTYS0 is the other Uart from > > > > > the > > > > > same HPS core > > > > > Yet the console appears to be working ok. > > > > Maybe some of the DMA related changes triggering some unexpected > > > > behavior. > > > > > > > > Console doesn't use DMA so that could explain the difference. > > > > > > > > > Note all other libs and apps are at the same revision and build, it is > > > > > only > > > > > the kernel that is different. > > > > > Both versions of the kernel are also built using the same bitbake > > > > > bsdk.. > > > > > > > > > > > > > > > As you can see it looks like the frame thats received on the 6.3.3 > > > > > kernel > > > > > is > > > > > mangled? > > > > > This same message is just being requested over and over again from the > > > > > GPS > > > > > unit. > > > > > > > > > > The offset where the tears occur looks to be pretty similar between > > > > > each > > > > > poll > > > > > request. > > > > > Usually the 1 at the end of the (75331 is where the first tear occurs. > > > > > > > > > > I'd appreciate some quidance in how to track this down as there > > > > > appears to > > > > > have been a reasonable amount of work done to this driver and the > > > > > serial > > > > > core > > > > > between these two versions. > > > > A few ideas: > > > > - try without dma_rx_complete() calling p->dma->rx_dma(p) > > > > - revert 90b8596ac46043e4a782d9111f5b285251b13756 > > > > - Try the revert in > > > > https://lore.kernel.org/all/316ab583-d217-a332-d161-8225b0cee227@redhat.com/2-0001-Revert-serial-8250-use-THRE-__stop_tx-also-with-DMA.patch > > > > (for e8ffbb71f783 and f8d6e9d3ca5c) > > > > > > > > But finding the culprit with git bisect would be the most helpful here. > > > > > > > Bisect wasn't an easy option as I'd applied a pile of patches after the > > > interesting commits for our system to run. > > > I'm not a git master :) > > > > > > So I just started reverting the patches that had been applied to the 8250 > > > folder. > > > Worked backwards from head. > > > > > > After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce > > > serial: 8250_dma: Fix DMA Rx rearm race > > > > > > Things started to work again. > > > > > > I reset everything and then just reverted that individual patch and things > > > work. > > > So that looks like the culprit.. > > Okay, that also worked great, thanks a lot. It gives something concrete to > > work with to find a fix. > > > > The commit itself looks very much correct, however, my guess is that when > > serial8250_rx_dma_flush() does dmaengine_pause() dma_status somehow > > does not become DMA_PAUSED which leads to not properly flushing DMA Rx > > transaction. Can you try the following debug patch (if my guess is > > correct, besides triggering the WARN_ON_ONCE, it also works around the > > issue): > > > > [PATCH] DEBUG: 8250_dma: Detect DMA_PAUSED vs DMA_IN_PROGRESS inconsistency > > > > Detect DMA state not returning DMA_PAUSED after dmaengine_pause() was > > issued. > > > > Not intended for upstream. > > Thanks Ilpo > I can confirm that the patch below works. Got the WARN_ON_ONCE dump so it's > taking that path. > I think the problem here is that the pl330 DMA controller thats in these SOC's > doesn't "really" support pause according to the driver. For this flush usecase, it is enough to support pause + read residue + terminate which is supported according to the function comment for pl330_pause(). > It doesn't look like it can ever set "DMA_PAUSED" Why not? It pauses the transfer and is even able to allow reading the transferred byte count. What it is claimed to not be able to do is resume. Note that w/o resume serial8250_tx_dma() XON/XOFF processing will not work but that's unrelated to this issue (I'll probably need to do another patch to fix that on 8250 DMA side). > I'm not sure of the flow through of that though. > There is some commenting in the pl330 driver about the pause routine. Maybe the patch below will help with pl330 DMA status (based on somewhat limited understanding of all the moving parts): [PATCH] dmaengine: pl330: Set DMA_PAUSED when pausing pl330_pause() does not set anything to indicate paused condition which causes pl330_tx_status() to return DMA_IN_PROGRESS. This breaks 8250 DMA code after the fix in commit 57e9af7831dc ("serial: 8250_dma: Fix DMA Rx rearm race"). The function comment for pl330_pause() claims pause is supported but resume is not which is enough for 8250 DMA flush to work as long as DMA status is reported correctly when appropriate. Add PAUSED state for descriptor and mark BUSY descriptors with PAUSED in pl330_pause(). Return DMA_PAUSED from pl330_tx_status() when the descriptor is PAUSED. Fixes: 88987d2c7534 ("dmaengine: pl330: add DMA_PAUSE feature") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> --- drivers/dma/pl330.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/dma/pl330.c b/drivers/dma/pl330.c index 0d9257fbdfb0..daad25f2c498 100644 --- a/drivers/dma/pl330.c +++ b/drivers/dma/pl330.c @@ -403,6 +403,12 @@ enum desc_status { * of a channel can be BUSY at any time. */ BUSY, + /* + * Pause was called while descriptor was BUSY. Due to hardware + * limitations, only termination is possible for descriptors + * that have been paused. + */ + PAUSED, /* * Sitting on the channel work_list but xfer done * by PL330 core @@ -2041,7 +2047,7 @@ static inline void fill_queue(struct dma_pl330_chan *pch) list_for_each_entry(desc, &pch->work_list, node) { /* If already submitted */ - if (desc->status == BUSY) + if (desc->status == BUSY || desc->status == PAUSED) continue; ret = pl330_submit_req(pch->thread, desc); @@ -2326,6 +2332,7 @@ static int pl330_pause(struct dma_chan *chan) { struct dma_pl330_chan *pch = to_pchan(chan); struct pl330_dmac *pl330 = pch->dmac; + struct dma_pl330_desc *desc; unsigned long flags; pm_runtime_get_sync(pl330->ddma.dev); @@ -2335,6 +2342,10 @@ static int pl330_pause(struct dma_chan *chan) _stop(pch->thread); spin_unlock(&pl330->lock); + list_for_each_entry(desc, &pch->work_list, node) { + if (desc->status == BUSY) + desc->status = PAUSED; + } spin_unlock_irqrestore(&pch->lock, flags); pm_runtime_mark_last_busy(pl330->ddma.dev); pm_runtime_put_autosuspend(pl330->ddma.dev); @@ -2425,7 +2436,7 @@ pl330_tx_status(struct dma_chan *chan, dma_cookie_t cookie, else if (running && desc == running) transferred = pl330_get_current_xferred_count(pch, desc); - else if (desc->status == BUSY) + else if (desc->status == BUSY || desc->status == PAUSED) /* * Busy but not running means either just enqueued, * or finished and not yet marked done @@ -2442,6 +2453,9 @@ pl330_tx_status(struct dma_chan *chan, dma_cookie_t cookie, case DONE: ret = DMA_COMPLETE; break; + case PAUSED: + ret = DMA_PAUSED; + break; case PREP: case BUSY: ret = DMA_IN_PROGRESS; -- tg: (5af9d9508d7d..) fix/pl330-paused (depends on: debug/dma-paused) ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-25 8:43 ` Ilpo Järvinen @ 2023-05-25 9:27 ` Richard Tresidder 2023-05-25 9:38 ` Ilpo Järvinen 0 siblings, 1 reply; 12+ messages in thread From: Richard Tresidder @ 2023-05-25 9:27 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: linux-serial On 25/05/2023 4:43 pm, Ilpo Järvinen wrote: > On Thu, 25 May 2023, Richard Tresidder wrote: > >> Richard Tresidder >> >> ** >> >> On 24/05/2023 6:06 pm, Ilpo Järvinen wrote: >> >>> On Wed, 24 May 2023, Richard Tresidder wrote: >>>> On 23/05/2023 7:13 pm, Ilpo Järvinen wrote: >>>>> On Tue, 23 May 2023, Richard Tresidder wrote: >>>>> >>>>>> background: >>>>>> I'm attempting to use 6.3.3 as a new base for one of our systems. >>>>>> Previously it was using 5.1.7 as a base. >>>>>> The uart in question is one of the two in the Cyclone V SOC HPS. >>>>>> And to muddy the waters the linux console TTYS0 is the other Uart from >>>>>> the >>>>>> same HPS core >>>>>> Yet the console appears to be working ok. >>>>> Maybe some of the DMA related changes triggering some unexpected >>>>> behavior. >>>>> >>>>> Console doesn't use DMA so that could explain the difference. >>>>> >>>>>> Note all other libs and apps are at the same revision and build, it is >>>>>> only >>>>>> the kernel that is different. >>>>>> Both versions of the kernel are also built using the same bitbake >>>>>> bsdk.. >>>>>> >>>>>> >>>>>> As you can see it looks like the frame thats received on the 6.3.3 >>>>>> kernel >>>>>> is >>>>>> mangled? >>>>>> This same message is just being requested over and over again from the >>>>>> GPS >>>>>> unit. >>>>>> >>>>>> The offset where the tears occur looks to be pretty similar between >>>>>> each >>>>>> poll >>>>>> request. >>>>>> Usually the 1 at the end of the (75331 is where the first tear occurs. >>>>>> >>>>>> I'd appreciate some quidance in how to track this down as there >>>>>> appears to >>>>>> have been a reasonable amount of work done to this driver and the >>>>>> serial >>>>>> core >>>>>> between these two versions. >>>>> A few ideas: >>>>> - try without dma_rx_complete() calling p->dma->rx_dma(p) >>>>> - revert 90b8596ac46043e4a782d9111f5b285251b13756 >>>>> - Try the revert in >>>>> https://lore.kernel.org/all/316ab583-d217-a332-d161-8225b0cee227@redhat.com/2-0001-Revert-serial-8250-use-THRE-__stop_tx-also-with-DMA.patch >>>>> (for e8ffbb71f783 and f8d6e9d3ca5c) >>>>> >>>>> But finding the culprit with git bisect would be the most helpful here. >>>>> >>>> Bisect wasn't an easy option as I'd applied a pile of patches after the >>>> interesting commits for our system to run. >>>> I'm not a git master :) >>>> >>>> So I just started reverting the patches that had been applied to the 8250 >>>> folder. >>>> Worked backwards from head. >>>> >>>> After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce >>>> serial: 8250_dma: Fix DMA Rx rearm race >>>> >>>> Things started to work again. >>>> >>>> I reset everything and then just reverted that individual patch and things >>>> work. >>>> So that looks like the culprit.. >>> Okay, that also worked great, thanks a lot. It gives something concrete to >>> work with to find a fix. >>> >>> The commit itself looks very much correct, however, my guess is that when >>> serial8250_rx_dma_flush() does dmaengine_pause() dma_status somehow >>> does not become DMA_PAUSED which leads to not properly flushing DMA Rx >>> transaction. Can you try the following debug patch (if my guess is >>> correct, besides triggering the WARN_ON_ONCE, it also works around the >>> issue): >>> >>> [PATCH] DEBUG: 8250_dma: Detect DMA_PAUSED vs DMA_IN_PROGRESS inconsistency >>> >>> Detect DMA state not returning DMA_PAUSED after dmaengine_pause() was >>> issued. >>> >>> Not intended for upstream. >> Thanks Ilpo >> I can confirm that the patch below works. Got the WARN_ON_ONCE dump so it's >> taking that path. >> I think the problem here is that the pl330 DMA controller thats in these SOC's >> doesn't "really" support pause according to the driver. > For this flush usecase, it is enough to support pause + read residue + > terminate which is supported according to the function comment for > pl330_pause(). Yep agree given the 8250 serial code immediately terminates the dma after the pause during the flush anyway.. >> It doesn't look like it can ever set "DMA_PAUSED" > Why not? It pauses the transfer and is even able to allow reading the > transferred byte count. > > What it is claimed to not be able to do is resume. Note that w/o resume > serial8250_tx_dma() XON/XOFF processing will not work but that's > unrelated to this issue (I'll probably need to do another patch to fix > that on 8250 DMA side). Yep I just meant it doesn't look capable of reporting DMA_PAUSED Which your patch below probably fixes. and it kind of does a stop without a resume capability so must be terminated after this. Though I'm not sure of the implications of reporting paused without resume capability. Kind of an odd one.. I'll check your pl330 patch out tomorrow, and see what else might break. RT >> I'm not sure of the flow through of that though. >> There is some commenting in the pl330 driver about the pause routine. > Maybe the patch below will help with pl330 DMA status (based on somewhat > limited understanding of all the moving parts): > > > [PATCH] dmaengine: pl330: Set DMA_PAUSED when pausing > > pl330_pause() does not set anything to indicate paused condition which > causes pl330_tx_status() to return DMA_IN_PROGRESS. This breaks 8250 > DMA code after the fix in commit 57e9af7831dc ("serial: 8250_dma: Fix > DMA Rx rearm race"). The function comment for pl330_pause() claims > pause is supported but resume is not which is enough for 8250 DMA flush > to work as long as DMA status is reported correctly when appropriate. > > Add PAUSED state for descriptor and mark BUSY descriptors with PAUSED > in pl330_pause(). Return DMA_PAUSED from pl330_tx_status() when the > descriptor is PAUSED. > > Fixes: 88987d2c7534 ("dmaengine: pl330: add DMA_PAUSE feature") > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> > > --- > drivers/dma/pl330.c | 18 ++++++++++++++++-- > 1 file changed, 16 insertions(+), 2 deletions(-) > > diff --git a/drivers/dma/pl330.c b/drivers/dma/pl330.c > index 0d9257fbdfb0..daad25f2c498 100644 > --- a/drivers/dma/pl330.c > +++ b/drivers/dma/pl330.c > @@ -403,6 +403,12 @@ enum desc_status { > * of a channel can be BUSY at any time. > */ > BUSY, > + /* > + * Pause was called while descriptor was BUSY. Due to hardware > + * limitations, only termination is possible for descriptors > + * that have been paused. > + */ > + PAUSED, > /* > * Sitting on the channel work_list but xfer done > * by PL330 core > @@ -2041,7 +2047,7 @@ static inline void fill_queue(struct dma_pl330_chan *pch) > list_for_each_entry(desc, &pch->work_list, node) { > > /* If already submitted */ > - if (desc->status == BUSY) > + if (desc->status == BUSY || desc->status == PAUSED) > continue; > > ret = pl330_submit_req(pch->thread, desc); > @@ -2326,6 +2332,7 @@ static int pl330_pause(struct dma_chan *chan) > { > struct dma_pl330_chan *pch = to_pchan(chan); > struct pl330_dmac *pl330 = pch->dmac; > + struct dma_pl330_desc *desc; > unsigned long flags; > > pm_runtime_get_sync(pl330->ddma.dev); > @@ -2335,6 +2342,10 @@ static int pl330_pause(struct dma_chan *chan) > _stop(pch->thread); > spin_unlock(&pl330->lock); > > + list_for_each_entry(desc, &pch->work_list, node) { > + if (desc->status == BUSY) > + desc->status = PAUSED; > + } > spin_unlock_irqrestore(&pch->lock, flags); > pm_runtime_mark_last_busy(pl330->ddma.dev); > pm_runtime_put_autosuspend(pl330->ddma.dev); > @@ -2425,7 +2436,7 @@ pl330_tx_status(struct dma_chan *chan, dma_cookie_t cookie, > else if (running && desc == running) > transferred = > pl330_get_current_xferred_count(pch, desc); > - else if (desc->status == BUSY) > + else if (desc->status == BUSY || desc->status == PAUSED) > /* > * Busy but not running means either just enqueued, > * or finished and not yet marked done > @@ -2442,6 +2453,9 @@ pl330_tx_status(struct dma_chan *chan, dma_cookie_t cookie, > case DONE: > ret = DMA_COMPLETE; > break; > + case PAUSED: > + ret = DMA_PAUSED; > + break; > case PREP: > case BUSY: > ret = DMA_IN_PROGRESS; > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-25 9:27 ` Richard Tresidder @ 2023-05-25 9:38 ` Ilpo Järvinen 2023-05-26 7:07 ` Richard Tresidder 0 siblings, 1 reply; 12+ messages in thread From: Ilpo Järvinen @ 2023-05-25 9:38 UTC (permalink / raw) To: Richard Tresidder; +Cc: linux-serial [-- Attachment #1: Type: text/plain, Size: 6649 bytes --] On Thu, 25 May 2023, Richard Tresidder wrote: > On 25/05/2023 4:43 pm, Ilpo Järvinen wrote: > > On Thu, 25 May 2023, Richard Tresidder wrote: > > > On 24/05/2023 6:06 pm, Ilpo Järvinen wrote: > > > > On Wed, 24 May 2023, Richard Tresidder wrote: > > > > > So I just started reverting the patches that had been applied to the > > > > > 8250 > > > > > folder. > > > > > Worked backwards from head. > > > > > > > > > > After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce > > > > > serial: 8250_dma: Fix DMA Rx rearm race > > > > > > > > > > Things started to work again. > > > > > > > > > > I reset everything and then just reverted that individual patch and > > > > > things > > > > > work. > > > > > So that looks like the culprit.. > > > > Okay, that also worked great, thanks a lot. It gives something concrete > > > > to > > > > work with to find a fix. > > > > > > > > The commit itself looks very much correct, however, my guess is that > > > > when > > > > serial8250_rx_dma_flush() does dmaengine_pause() dma_status somehow > > > > does not become DMA_PAUSED which leads to not properly flushing DMA Rx > > > > transaction. Can you try the following debug patch (if my guess is > > > > correct, besides triggering the WARN_ON_ONCE, it also works around the > > > > issue): > > > > > > > > [PATCH] DEBUG: 8250_dma: Detect DMA_PAUSED vs DMA_IN_PROGRESS > > > > inconsistency > > > > > > > > Detect DMA state not returning DMA_PAUSED after dmaengine_pause() was > > > > issued. > > > > > > > > Not intended for upstream. > > > Thanks Ilpo > > > I can confirm that the patch below works. Got the WARN_ON_ONCE dump so > > > it's > > > taking that path. > > > I think the problem here is that the pl330 DMA controller thats in these > > > SOC's > > > doesn't "really" support pause according to the driver. > > For this flush usecase, it is enough to support pause + read residue + > > terminate which is supported according to the function comment for > > pl330_pause(). > > Yep agree given the 8250 serial code immediately terminates the dma after the > pause during the flush anyway.. > > > > It doesn't look like it can ever set "DMA_PAUSED" > > Why not? It pauses the transfer and is even able to allow reading the > > transferred byte count. > > > > What it is claimed to not be able to do is resume. Note that w/o resume > > serial8250_tx_dma() XON/XOFF processing will not work but that's > > unrelated to this issue (I'll probably need to do another patch to fix > > that on 8250 DMA side). > > Yep I just meant it doesn't look capable of reporting DMA_PAUSED > Which your patch below probably fixes. > and it kind of does a stop without a resume capability so must be terminated > after this. > Though I'm not sure of the implications of reporting paused without resume > capability. > Kind of an odd one.. dma_slave_caps has a bool for both pause and resume support (I only found out about the second flag for resume today) so not unheard of it seems. -- i. > I'll check your pl330 patch out tomorrow, and see what else might break. > > > > I'm not sure of the flow through of that though. > > > There is some commenting in the pl330 driver about the pause routine. > > Maybe the patch below will help with pl330 DMA status (based on somewhat > > limited understanding of all the moving parts): > > > > > > [PATCH] dmaengine: pl330: Set DMA_PAUSED when pausing > > > > pl330_pause() does not set anything to indicate paused condition which > > causes pl330_tx_status() to return DMA_IN_PROGRESS. This breaks 8250 > > DMA code after the fix in commit 57e9af7831dc ("serial: 8250_dma: Fix > > DMA Rx rearm race"). The function comment for pl330_pause() claims > > pause is supported but resume is not which is enough for 8250 DMA flush > > to work as long as DMA status is reported correctly when appropriate. > > > > Add PAUSED state for descriptor and mark BUSY descriptors with PAUSED > > in pl330_pause(). Return DMA_PAUSED from pl330_tx_status() when the > > descriptor is PAUSED. > > > > Fixes: 88987d2c7534 ("dmaengine: pl330: add DMA_PAUSE feature") > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> > > > > --- > > drivers/dma/pl330.c | 18 ++++++++++++++++-- > > 1 file changed, 16 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/dma/pl330.c b/drivers/dma/pl330.c > > index 0d9257fbdfb0..daad25f2c498 100644 > > --- a/drivers/dma/pl330.c > > +++ b/drivers/dma/pl330.c > > @@ -403,6 +403,12 @@ enum desc_status { > > * of a channel can be BUSY at any time. > > */ > > BUSY, > > + /* > > + * Pause was called while descriptor was BUSY. Due to hardware > > + * limitations, only termination is possible for descriptors > > + * that have been paused. > > + */ > > + PAUSED, > > /* > > * Sitting on the channel work_list but xfer done > > * by PL330 core > > @@ -2041,7 +2047,7 @@ static inline void fill_queue(struct dma_pl330_chan > > *pch) > > list_for_each_entry(desc, &pch->work_list, node) { > > /* If already submitted */ > > - if (desc->status == BUSY) > > + if (desc->status == BUSY || desc->status == PAUSED) > > continue; > > ret = pl330_submit_req(pch->thread, desc); > > @@ -2326,6 +2332,7 @@ static int pl330_pause(struct dma_chan *chan) > > { > > struct dma_pl330_chan *pch = to_pchan(chan); > > struct pl330_dmac *pl330 = pch->dmac; > > + struct dma_pl330_desc *desc; > > unsigned long flags; > > pm_runtime_get_sync(pl330->ddma.dev); > > @@ -2335,6 +2342,10 @@ static int pl330_pause(struct dma_chan *chan) > > _stop(pch->thread); > > spin_unlock(&pl330->lock); > > + list_for_each_entry(desc, &pch->work_list, node) { > > + if (desc->status == BUSY) > > + desc->status = PAUSED; > > + } > > spin_unlock_irqrestore(&pch->lock, flags); > > pm_runtime_mark_last_busy(pl330->ddma.dev); > > pm_runtime_put_autosuspend(pl330->ddma.dev); > > @@ -2425,7 +2436,7 @@ pl330_tx_status(struct dma_chan *chan, dma_cookie_t > > cookie, > > else if (running && desc == running) > > transferred = > > pl330_get_current_xferred_count(pch, desc); > > - else if (desc->status == BUSY) > > + else if (desc->status == BUSY || desc->status == PAUSED) > > /* > > * Busy but not running means either just enqueued, > > * or finished and not yet marked done > > @@ -2442,6 +2453,9 @@ pl330_tx_status(struct dma_chan *chan, dma_cookie_t > > cookie, > > case DONE: > > ret = DMA_COMPLETE; > > break; > > + case PAUSED: > > + ret = DMA_PAUSED; > > + break; > > case PREP: > > case BUSY: > > ret = DMA_IN_PROGRESS; > > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-25 9:38 ` Ilpo Järvinen @ 2023-05-26 7:07 ` Richard Tresidder 2023-05-26 10:16 ` Ilpo Järvinen 0 siblings, 1 reply; 12+ messages in thread From: Richard Tresidder @ 2023-05-26 7:07 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: linux-serial On 25/05/2023 5:38 pm, Ilpo Järvinen wrote: > On Thu, 25 May 2023, Richard Tresidder wrote: >> On 25/05/2023 4:43 pm, Ilpo Järvinen wrote: >>> On Thu, 25 May 2023, Richard Tresidder wrote: >>>> On 24/05/2023 6:06 pm, Ilpo Järvinen wrote: >>>>> On Wed, 24 May 2023, Richard Tresidder wrote: >>>>>> So I just started reverting the patches that had been applied to the >>>>>> 8250 >>>>>> folder. >>>>>> Worked backwards from head. >>>>>> >>>>>> After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce >>>>>> serial: 8250_dma: Fix DMA Rx rearm race >>>>>> >>>>>> Things started to work again. >>>>>> >>>>>> I reset everything and then just reverted that individual patch and >>>>>> things >>>>>> work. >>>>>> So that looks like the culprit.. >>>>> Okay, that also worked great, thanks a lot. It gives something concrete >>>>> to >>>>> work with to find a fix. >>>>> >>>>> The commit itself looks very much correct, however, my guess is that >>>>> when >>>>> serial8250_rx_dma_flush() does dmaengine_pause() dma_status somehow >>>>> does not become DMA_PAUSED which leads to not properly flushing DMA Rx >>>>> transaction. Can you try the following debug patch (if my guess is >>>>> correct, besides triggering the WARN_ON_ONCE, it also works around the >>>>> issue): >>>>> >>>>> [PATCH] DEBUG: 8250_dma: Detect DMA_PAUSED vs DMA_IN_PROGRESS >>>>> inconsistency >>>>> >>>>> Detect DMA state not returning DMA_PAUSED after dmaengine_pause() was >>>>> issued. >>>>> >>>>> Not intended for upstream. >>>> Thanks Ilpo >>>> I can confirm that the patch below works. Got the WARN_ON_ONCE dump so >>>> it's >>>> taking that path. >>>> I think the problem here is that the pl330 DMA controller thats in these >>>> SOC's >>>> doesn't "really" support pause according to the driver. >>> For this flush usecase, it is enough to support pause + read residue + >>> terminate which is supported according to the function comment for >>> pl330_pause(). >> Yep agree given the 8250 serial code immediately terminates the dma after the >> pause during the flush anyway.. >> >>>> It doesn't look like it can ever set "DMA_PAUSED" >>> Why not? It pauses the transfer and is even able to allow reading the >>> transferred byte count. >>> >>> What it is claimed to not be able to do is resume. Note that w/o resume >>> serial8250_tx_dma() XON/XOFF processing will not work but that's >>> unrelated to this issue (I'll probably need to do another patch to fix >>> that on 8250 DMA side). >> Yep I just meant it doesn't look capable of reporting DMA_PAUSED >> Which your patch below probably fixes. >> and it kind of does a stop without a resume capability so must be terminated >> after this. >> Though I'm not sure of the implications of reporting paused without resume >> capability. >> Kind of an odd one.. > dma_slave_caps has a bool for both pause and resume support (I only found > out about the second flag for resume today) so not unheard of it seems. I've just tested the pl330 patch and it looks good. No longer get the WARN_ON_ONCE triggered in the 8250_dma driver so it looks like Paused is correctly being set now and we're avoiding that path. Wonder how many others dma drivers are afflicted in the same manner? A quick grep seems to infer a lot more drivers set the pause function then have something in them setting the status to DMA_PAUSED so probably good to keep the 8250 fix in also for the time being. Cheers Richard Tresidder ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-26 7:07 ` Richard Tresidder @ 2023-05-26 10:16 ` Ilpo Järvinen 2023-05-26 10:44 ` Richard Tresidder 0 siblings, 1 reply; 12+ messages in thread From: Ilpo Järvinen @ 2023-05-26 10:16 UTC (permalink / raw) To: Richard Tresidder; +Cc: linux-serial [-- Attachment #1: Type: text/plain, Size: 4431 bytes --] On Fri, 26 May 2023, Richard Tresidder wrote: > On 25/05/2023 5:38 pm, Ilpo Järvinen wrote: > > > On Thu, 25 May 2023, Richard Tresidder wrote: > > > On 25/05/2023 4:43 pm, Ilpo Järvinen wrote: > > > > On Thu, 25 May 2023, Richard Tresidder wrote: > > > > > On 24/05/2023 6:06 pm, Ilpo Järvinen wrote: > > > > > > On Wed, 24 May 2023, Richard Tresidder wrote: > > > > > > > So I just started reverting the patches that had been applied to > > > > > > > the > > > > > > > 8250 > > > > > > > folder. > > > > > > > Worked backwards from head. > > > > > > > > > > > > > > After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce > > > > > > > serial: 8250_dma: Fix DMA Rx rearm race > > > > > > > > > > > > > > Things started to work again. > > > > > > > > > > > > > > I reset everything and then just reverted that individual patch > > > > > > > and > > > > > > > things > > > > > > > work. > > > > > > > So that looks like the culprit.. > > > > > > Okay, that also worked great, thanks a lot. It gives something > > > > > > concrete > > > > > > to > > > > > > work with to find a fix. > > > > > > > > > > > > The commit itself looks very much correct, however, my guess is that > > > > > > when > > > > > > serial8250_rx_dma_flush() does dmaengine_pause() dma_status somehow > > > > > > does not become DMA_PAUSED which leads to not properly flushing DMA > > > > > > Rx > > > > > > transaction. Can you try the following debug patch (if my guess is > > > > > > correct, besides triggering the WARN_ON_ONCE, it also works around > > > > > > the > > > > > > issue): > > > > > > > > > > > > [PATCH] DEBUG: 8250_dma: Detect DMA_PAUSED vs DMA_IN_PROGRESS > > > > > > inconsistency > > > > > > > > > > > > Detect DMA state not returning DMA_PAUSED after dmaengine_pause() > > > > > > was > > > > > > issued. > > > > > > > > > > > > Not intended for upstream. > > > > > Thanks Ilpo > > > > > I can confirm that the patch below works. Got the WARN_ON_ONCE > > > > > dump so > > > > > it's > > > > > taking that path. > > > > > I think the problem here is that the pl330 DMA controller thats in > > > > > these > > > > > SOC's > > > > > doesn't "really" support pause according to the driver. > > > > For this flush usecase, it is enough to support pause + read residue + > > > > terminate which is supported according to the function comment for > > > > pl330_pause(). > > > Yep agree given the 8250 serial code immediately terminates the dma after > > > the > > > pause during the flush anyway.. > > > > > > > > It doesn't look like it can ever set "DMA_PAUSED" > > > > Why not? It pauses the transfer and is even able to allow reading the > > > > transferred byte count. > > > > > > > > What it is claimed to not be able to do is resume. Note that w/o resume > > > > serial8250_tx_dma() XON/XOFF processing will not work but that's > > > > unrelated to this issue (I'll probably need to do another patch to fix > > > > that on 8250 DMA side). > > > Yep I just meant it doesn't look capable of reporting DMA_PAUSED > > > Which your patch below probably fixes. > > > and it kind of does a stop without a resume capability so must be > > > terminated > > > after this. > > > Though I'm not sure of the implications of reporting paused without resume > > > capability. > > > Kind of an odd one.. > > dma_slave_caps has a bool for both pause and resume support (I only found > > out about the second flag for resume today) so not unheard of it seems. > > I've just tested the pl330 patch and it looks good. > No longer get the WARN_ON_ONCE triggered in the 8250_dma driver so it looks > like > Paused is correctly being set now and we're avoiding that path. Thanks a lot for testing the fix and also for your invaluable help in narrowing down the problem. It would have been very hard to find the culprit otherwise. Can I add your Tested-by tag before sending the official version of the fix out? > Wonder how many others dma drivers are afflicted in the same manner? > A quick grep seems to infer a lot more drivers set the pause function then > have > something in them setting the status to DMA_PAUSED so probably good to keep > the > 8250 fix in also for the time being. There might be others, but then the patch which covered the issue has gone to stables already months ago and you're the first to report it AFAIK so perhaps not that wide-spread. I'll think how to deal with this on 8250_dma side. -- i. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-26 10:16 ` Ilpo Järvinen @ 2023-05-26 10:44 ` Richard Tresidder 0 siblings, 0 replies; 12+ messages in thread From: Richard Tresidder @ 2023-05-26 10:44 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: linux-serial Richard Tresidder On 26/05/2023 6:16 pm, Ilpo Järvinen wrote: > On Fri, 26 May 2023, Richard Tresidder wrote: > >> On 25/05/2023 5:38 pm, Ilpo Järvinen wrote: >> >>> On Thu, 25 May 2023, Richard Tresidder wrote: >>>> On 25/05/2023 4:43 pm, Ilpo Järvinen wrote: >>>>> On Thu, 25 May 2023, Richard Tresidder wrote: >>>>>> On 24/05/2023 6:06 pm, Ilpo Järvinen wrote: >>>>>>> On Wed, 24 May 2023, Richard Tresidder wrote: >>>>>>>> So I just started reverting the patches that had been applied to >>>>>>>> the >>>>>>>> 8250 >>>>>>>> folder. >>>>>>>> Worked backwards from head. >>>>>>>> >>>>>>>> After reverting 57e9af7831dcf211c5c689c2a6f209f4abdf0bce >>>>>>>> serial: 8250_dma: Fix DMA Rx rearm race >>>>>>>> >>>>>>>> Things started to work again. >>>>>>>> >>>>>>>> I reset everything and then just reverted that individual patch >>>>>>>> and >>>>>>>> things >>>>>>>> work. >>>>>>>> So that looks like the culprit.. >>>>>>> Okay, that also worked great, thanks a lot. It gives something >>>>>>> concrete >>>>>>> to >>>>>>> work with to find a fix. >>>>>>> >>>>>>> The commit itself looks very much correct, however, my guess is that >>>>>>> when >>>>>>> serial8250_rx_dma_flush() does dmaengine_pause() dma_status somehow >>>>>>> does not become DMA_PAUSED which leads to not properly flushing DMA >>>>>>> Rx >>>>>>> transaction. Can you try the following debug patch (if my guess is >>>>>>> correct, besides triggering the WARN_ON_ONCE, it also works around >>>>>>> the >>>>>>> issue): >>>>>>> >>>>>>> [PATCH] DEBUG: 8250_dma: Detect DMA_PAUSED vs DMA_IN_PROGRESS >>>>>>> inconsistency >>>>>>> >>>>>>> Detect DMA state not returning DMA_PAUSED after dmaengine_pause() >>>>>>> was >>>>>>> issued. >>>>>>> >>>>>>> Not intended for upstream. >>>>>> Thanks Ilpo >>>>>> I can confirm that the patch below works. Got the WARN_ON_ONCE >>>>>> dump so >>>>>> it's >>>>>> taking that path. >>>>>> I think the problem here is that the pl330 DMA controller thats in >>>>>> these >>>>>> SOC's >>>>>> doesn't "really" support pause according to the driver. >>>>> For this flush usecase, it is enough to support pause + read residue + >>>>> terminate which is supported according to the function comment for >>>>> pl330_pause(). >>>> Yep agree given the 8250 serial code immediately terminates the dma after >>>> the >>>> pause during the flush anyway.. >>>> >>>>>> It doesn't look like it can ever set "DMA_PAUSED" >>>>> Why not? It pauses the transfer and is even able to allow reading the >>>>> transferred byte count. >>>>> >>>>> What it is claimed to not be able to do is resume. Note that w/o resume >>>>> serial8250_tx_dma() XON/XOFF processing will not work but that's >>>>> unrelated to this issue (I'll probably need to do another patch to fix >>>>> that on 8250 DMA side). >>>> Yep I just meant it doesn't look capable of reporting DMA_PAUSED >>>> Which your patch below probably fixes. >>>> and it kind of does a stop without a resume capability so must be >>>> terminated >>>> after this. >>>> Though I'm not sure of the implications of reporting paused without resume >>>> capability. >>>> Kind of an odd one.. >>> dma_slave_caps has a bool for both pause and resume support (I only found >>> out about the second flag for resume today) so not unheard of it seems. >> I've just tested the pl330 patch and it looks good. >> No longer get the WARN_ON_ONCE triggered in the 8250_dma driver so it looks >> like >> Paused is correctly being set now and we're avoiding that path. > Thanks a lot for testing the fix and also for your invaluable help in > narrowing down the problem. It would have been very hard to find the > culprit otherwise. > > Can I add your Tested-by tag before sending the official version of the > fix out? No problems thanks for turning the patches around so quickly. Yep you can add a Tested-by Tag Have a great weekend! Cheers Richard Tresidder > >> Wonder how many others dma drivers are afflicted in the same manner? >> A quick grep seems to infer a lot more drivers set the pause function then >> have >> something in them setting the status to DMA_PAUSED so probably good to keep >> the >> 8250 fix in also for the time being. > There might be others, but then the patch which covered the issue has gone > to stables already months ago and you're the first to report it AFAIK so > perhaps not that wide-spread. > > I'll think how to deal with this on 8250_dma side. > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Possible regression in 8250_dw driver 2023-05-23 8:59 Possible regression in 8250_dw driver Richard Tresidder 2023-05-23 11:13 ` Ilpo Järvinen @ 2023-05-24 11:20 ` Linux regression tracking #adding (Thorsten Leemhuis) 1 sibling, 0 replies; 12+ messages in thread From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-05-24 11:20 UTC (permalink / raw) To: Richard Tresidder, linux-serial; +Cc: Linux kernel regressions list [CCing the regression list, as it should be in the loop for regressions: https://docs.kernel.org/admin-guide/reporting-regressions.html] [TLDR: I'm adding this report to the list of tracked Linux kernel regressions; the text you find below is based on a few templates paragraphs you might have encountered already in similar form. See link in footer if these mails annoy you.] On 23.05.23 10:59, Richard Tresidder wrote: > We seem to be getting corruption of received data from a ublox GPS unit > To me it looks like a fifo overrun of some sort?? > > background: > I'm attempting to use 6.3.3 as a new base for one of our systems. > Previously it was using 5.1.7 as a base. > The uart in question is one of the two in the Cyclone V SOC HPS. > And to muddy the waters the linux console TTYS0 is the other Uart from > the same HPS core > Yet the console appears to be working ok. > Note all other libs and apps are at the same revision and build, it is > only the kernel that is different. > Both versions of the kernel are also built using the same bitbake bsdk.. > > Seeing the following with 6.3.3: > [...] Thanks for the report. To be sure the issue doesn't fall through the cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression tracking bot: #regzbot ^introduced 57e9af7831dcf211c5c689c2a #regzbot title serial: 8250_dma: data corruption with ublox GPS unit #regzbot ignore-activity This isn't a regression? This issue or a fix for it are already discussed somewhere else? It was fixed already? You want to clarify when the regression started to happen? Or point out I got the title or something else totally wrong? Then just reply and tell me -- ideally while also telling regzbot about it, as explained by the page listed in the footer of this mail. Developers: When fixing the issue, remember to add 'Link:' tags pointing to the report (the parent of this mail). See page linked in footer for details. Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2023-05-26 10:44 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-05-23 8:59 Possible regression in 8250_dw driver Richard Tresidder 2023-05-23 11:13 ` Ilpo Järvinen 2023-05-24 7:52 ` Richard Tresidder 2023-05-24 10:06 ` Ilpo Järvinen 2023-05-25 3:44 ` Richard Tresidder 2023-05-25 8:43 ` Ilpo Järvinen 2023-05-25 9:27 ` Richard Tresidder 2023-05-25 9:38 ` Ilpo Järvinen 2023-05-26 7:07 ` Richard Tresidder 2023-05-26 10:16 ` Ilpo Järvinen 2023-05-26 10:44 ` Richard Tresidder 2023-05-24 11:20 ` Linux regression tracking #adding (Thorsten Leemhuis)
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox