Created attachment 293: dma-error-backtrace
When configuring bonding mode 4 with members that use the iavf driver (for
Intel 700 series NICs), we see these DMA errors:
"EAL: Cannot set up DMA remapping, error 12 (Cannot allocate memory)"
When this happens we also see TX errors on the devices, so I dumped the DMA
vaddrs and enabled TX descriptor dumps for iavf, and saw the following:
DMA errors occurring at:
iova=0x2351200000, len=2097152
iova=0x2351400000, len=2097152
iova=0x2351600000, len=2097152
iova=0x2351800000, len=2097152
iova=0x2351a00000, len=2097152
iova=0x2351c00000, len=2097152
TX descriptor dumps:
Queue 0 Tx_data_desc 0: QW0: 0x000000235137f8c0 QW1: 0x000001f000000040
Queue 0 Tx_data_desc 0: QW0: 0x000000235137f8c0 QW1: 0x000001f000000050
Queue 0 Tx_data_desc 1: QW0: 0x000000235137fb00 QW1: 0x000001f000000040
Queue 0 Tx_data_desc 1: QW0: 0x000000235137fb00 QW1: 0x000001f000000050
Queue 0 Tx_data_desc 0: QW0: 0x000000235197f8c0 QW1: 0x000001f000000040
Queue 0 Tx_data_desc 0: QW0: 0x000000235197f8c0 QW1: 0x000001f000000050
Queue 0 Tx_data_desc 2: QW0: 0x000000235137fd40 QW1: 0x000001f000000040
Queue 0 Tx_data_desc 2: QW0: 0x000000235137fd40 QW1: 0x000001f000000050
Queue 0 Tx_data_desc 1: QW0: 0x000000235197fb00 QW1: 0x000001f000000040
Queue 0 Tx_data_desc 1: QW0: 0x000000235197fb00 QW1: 0x000001f000000050
So the DMA errors are probably the root cause of the TX errors. To find out
why the DMA errors occur, I added an abort on DMA error to generate a
coredump. I've attached the backtrace of the interesting threads.
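The link between the two dumps above can be checked with a little arithmetic. The sketch below is only an illustration, assuming that QW0 of each TX data descriptor holds the buffer DMA address and that each failed mapping covers the range [iova, iova + len). The helper name find_region is made up for this example. It tests whether each QW0 address from the dump falls inside one of the regions that failed to map:

```python
# Regions from the "DMA errors occurring at" list: (iova, len).
# len=2097152 is 2 MiB, i.e. one 2 MiB hugepage per failed mapping.
failed_regions = [
    (0x2351200000, 2097152),
    (0x2351400000, 2097152),
    (0x2351600000, 2097152),
    (0x2351800000, 2097152),
    (0x2351a00000, 2097152),
    (0x2351c00000, 2097152),
]

# Unique QW0 values from the TX descriptor dumps
# (assumed here to be the buffer DMA addresses).
tx_buffer_addrs = [
    0x235137f8c0,
    0x235137fb00,
    0x235137fd40,
    0x235197f8c0,
    0x235197fb00,
]


def find_region(addr, regions):
    """Return the start iova of the region containing addr, or None."""
    for iova, length in regions:
        if iova <= addr < iova + length:
            return iova
    return None


# Map every TX buffer address to the failed region it lies in (if any).
matches = {addr: find_region(addr, failed_regions) for addr in tx_buffer_addrs}

for addr, iova in matches.items():
    where = hex(iova) if iova is not None else "no failed region"
    print(f"{hex(addr)} -> {where}")
```

Under those assumptions, every TX buffer address in the dump lies inside a region whose DMA mapping failed (0x2351200000 or 0x2351800000), which is consistent with the TX errors being a consequence of the failed remapping.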
Looking at the backtrace, it looks like the LSC callback is called at the same
time as we're starting the iavf member devices, and this seems to cause the DMA
errors. I say that because when I synchronized the threads, the DMA errors
disappeared. So far we have two workarounds for this problem:
1. Synchronize threads with locks
2. Pre-allocate more memory, so the heap never needs to expand and no DMA
remapping is done.
Can someone explain why these DMA errors occur when the threads are not
synchronized? What would be the proper fix for this?