* [PATCH] e100 rx: or s and el bits @ 2007-05-01 11:24 Milton Miller 2007-05-01 15:01 ` David Acker 2007-05-01 15:21 ` [PATCH] e100 rx: or s and el bits Kok, Auke 0 siblings, 2 replies; 51+ messages in thread From: Milton Miller @ 2007-05-01 11:24 UTC (permalink / raw) To: Scott Feldman, Jeff Garzik Cc: Auke Kok, e1000-devel, netdev, linux-kernel, Jesse Brandeburg, John Ronciak, Jeff Kirsher, Andrew Morton In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description talks about emulating another driver by setting addtional bits and the being unable to test when submitted. Seeing the & operator to set more bits made me suspicious, and indeed the bits are defined in positive logic: cb_s = 0x4000, cb_el = 0x8000, So anding those together would be 0. I'm guessing they should be or'd, but don't have hardware here to test, much like the committed patch. In fact, I'll let someone else do the compile test too. I'll update the comment. (It looks like the end of list and s bits would not be set in the template nor cleared when linking in recieve skbs, so as long as the kernel can keep up with the 100Mb card we wouldn't see the adapter go off the linked list, possibly explaining any successful use of this patch written against 2.6.14). Signed-off-by: Milton Miller <miltonm@bga.com> --- linux-2.6/drivers/net/e100.c.orig 2007-05-01 05:19:17.000000000 -0500 +++ linux-2.6/drivers/net/e100.c 2007-05-01 05:22:14.000000000 -0500 @@ -947,7 +947,7 @@ static void e100_get_defaults(struct nic ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i)); /* Template for a freshly allocated RFD */ - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); + nic->blank_rfd.command = cpu_to_le16(cb_el | cb_s); nic->blank_rfd.rbd = 0xFFFFFFFF; nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN); @@ -1769,13 +1769,13 @@ static int e100_rx_alloc_skb(struct nic } /* Link the RFD to end of RFA by linking previous RFD to - * this one, and clearing EL bit of previous. */ + * this one, and clearing EL and S bit of previous. */ if(rx->prev->skb) { struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; put_unaligned(cpu_to_le32(rx->dma_addr), (u32 *)&prev_rfd->link); wmb(); - prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s); + prev_rfd->command &= ~cpu_to_le16(cb_el | cb_s); pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, sizeof(struct rfd), PCI_DMA_TODEVICE); } ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] e100 rx: or s and el bits 2007-05-01 11:24 [PATCH] e100 rx: or s and el bits Milton Miller @ 2007-05-01 15:01 ` David Acker 2007-05-02 20:21 ` David Acker 2007-05-01 15:21 ` [PATCH] e100 rx: or s and el bits Kok, Auke 1 sibling, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-01 15:01 UTC (permalink / raw) To: Milton Miller Cc: Scott Feldman, Jeff Garzik, e1000-devel, netdev, linux-kernel, Andrew Morton, John Ronciak, Jesse Brandeburg, Jeff Kirsher, Auke Kok Milton Miller wrote: > In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description > talks about emulating another driver by setting addtional bits and > the being unable to test when submitted. Seeing the & operator to > set more bits made me suspicious, and indeed the bits are defined > in positive logic: > > cb_s = 0x4000, > cb_el = 0x8000, > > So anding those together would be 0. I'm guessing they should > be or'd, but don't have hardware here to test, much like the > committed patch. In fact, I'll let someone else do the compile > test too. I'll update the comment. > I wonder if this worked for me because the hardware also spun on the link field being NULL? Since the RU base is also set to 0, the calculated physical address would be 0 as well. I would imagine if the hardware tried to read/write to very low addresses across PCI, there would be issues. I will retest with a small receive pool to try to hit the problem. It seems to apply to a pretty recent git pull from linus's tree. I manually merged this into the 2.6.18.4 kernel we are using. With the original in kernel driver (just EL bit, no S bit), I had two tests that would always end in horrible memory corruption on a PXA255 based system. One is a 12 hour bidirectional TCP test using iperf with the ethernet port sending packets to a wireless card and vice versa. The other is a similar configuration running a 12 hour UDP test sending 20 megabits/second in each direction. Even through the original S-bit patch seems broken, we have had days of continuous traffic through it without issue where previously we could never go more than 6 hours. I will let folks know how it goes. In the UDP test, the ethernet side often gets ahead of the available buffers due to CPU and PCI usage by the wireless card and its driver. I will also run these tests with the new patch and with a smaller receive pool (default is 256) to make the pool run out more often. -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] e100 rx: or s and el bits 2007-05-01 15:01 ` David Acker @ 2007-05-02 20:21 ` David Acker 2007-05-04 21:43 ` David Acker 0 siblings, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-02 20:21 UTC (permalink / raw) To: Milton Miller Cc: Scott Feldman, Jeff Garzik, e1000-devel, netdev, linux-kernel, Andrew Morton, John Ronciak, Jesse Brandeburg, Jeff Kirsher, Auke Kok David Acker wrote: > Milton Miller wrote: >> In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description >> talks about emulating another driver by setting addtional bits and >> the being unable to test when submitted. Seeing the & operator to >> set more bits made me suspicious, and indeed the bits are defined >> in positive logic: >> >> cb_s = 0x4000, >> cb_el = 0x8000, >> >> So anding those together would be 0. I'm guessing they should >> be or'd, but don't have hardware here to test, much like the >> committed patch. In fact, I'll let someone else do the compile >> test too. I'll update the comment. >> > > I wonder if this worked for me because the hardware also spun on the > link field being NULL? Since the RU base is also set to 0, the > calculated physical address would be 0 as well. I would imagine if the > hardware tried to read/write to very low addresses across PCI, there > would be issues. I will retest with a small receive pool to try to hit > the problem. > I will also run these tests with the new patch and with a smaller > receive pool (default is 256) to make the pool run out more often. So far my testing has shown both the original and the new version of the S-bit patch work in that no corruption seemed to occur over long term runs. The previous S-bit patch may have only worked due to something specific about how my PCI companion chip handles I/O to low memory addresses (from dereferencing a link address of 0). Perhaps the e100 handles the NULL link as well, but given that the manual does not seem to state what happens when the hardware encounters a buffer with a link of 0, I think Milton's fix is the proper way to do it. The old eepro driver did set both bits although it did it with a hardcoded constant. I will continue testing with slab debug on but that will take longer. Has anyone tried this on other platforms? -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] e100 rx: or s and el bits 2007-05-02 20:21 ` David Acker @ 2007-05-04 21:43 ` David Acker 2007-05-06 6:36 ` Milton Miller 0 siblings, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-04 21:43 UTC (permalink / raw) To: Milton Miller Cc: Scott Feldman, Jeff Garzik, e1000-devel, netdev, linux-kernel, Andrew Morton, John Ronciak, Jesse Brandeburg, Jeff Kirsher, Auke Kok David Acker wrote: > So far my testing has shown both the original and the new version of the > S-bit patch work in that no corruption seemed to occur over long term > runs. I spoke too soon. Further testing has not gone well. If I use the default settings for CPU saver and drop the receive pool down to 16 buffers I can cause problems with various forms of the patch. With the original S-bit patch I can get: e100: eth0: e100_update_stats: exec cuc_dump_reset failed At this point sends and receives are failing and their counters are not changing. When a raw socket opened on the port tries to send it gets errno 11, Resource temporarily unavailable. When this error occurred, I added a reset call through the tx timeout schedule_work(&nic->tx_timeout_task); This made the port come back but obviously this is somewhat of a kludge and eventually, the entire system hangs or get an oops from memory corruption. Apparently my earlier tests got lucky with a large receive pool size. The only way the original patch was stable for me was by disabling CPU saver, and setting the NAPI weight (default of 16) as large as the default receive pool size (256) so that all buffers were claimed and reallocated in one NAPI call. None of this should be needed so there is definitely a problem with the original patch. The updated patch produced a different issue. We got an RNR interrupt indicating the receive unit got ahead of the software. The S-bit patch removed any handling of this issue as it assumed the hardware would spin on the sbit. Apparently if both the S-bit and the EL-bit are set on the same RFD, it follows the EL-bit handling. Printing the stat/ack and status bytes on the RNR interrupts I get: status 01001000 = 0x48 = RUS of 0010 = No Resources, CUS of 01 = Suspended stat/ack 01010000 = 0x50 = FR, RNR or 00010000 = 0x10 = RNR Notice that the RUS went into No Resources and not suspended. Thus clearing the S-bit does not wake it up; it needs a new start command. I could not find documentation that states that the S-bit need only be cleared to take the RU out of suspended state. Before the S-bit patch the driver tried to track this need but that version of the driver didn't work for me either. By the way, I am using, "Intel 8255x 10/100 Mbps Ethernet Controller Family, Open Source Software Developer Manual, January 2006" as my documentation. This got me looking at just how in the world this worked on the old eepro100 driver. It had another difference; it did not reap the last rx buffer in the chain. It set a postponed bit and then picked it up on the next interrupt after more buffers had been allocated. It then noticed that the RU was in a suspended or no resources state and did a softreset. I don't believe this avoid the last buffer trick really fixes the race. Imagine the following: 1. 4 buffers in receive pool, all freshly allocated 2. Hardware consumes 3 buffers 3. Software processes 3 buffers, begins to allocate new buffers 4. Hardware writes status bits into buffer 4 while software updates link and command word bits in buffer 4. They share a cache line and corrupt each other. This appears to be possible with any of the versions of this driver I have seen. The problem is one of packet ownership. Once the driver gives a list of buffers to hardware, hardware owns them all. The driver can not safely change these buffers. Sadly, this means that the idea of the driver "staying ahead" of the hardware such that the hardware never runs out of resources will not work here. Once the driver gives the hardware a packet with S or EL bits set, it must let the hardware encounter the packet and return it to software. I think the driver needs to protect the last entry in the ring by putting the S-bit on the entry before it. The first time the driver allocates a block of packets, it writes a new S-bit out on the next to last packet. As buffers complete it allocates more packets in the chain but does not set a new S-bit since the old one will stop hardware. It can not clear the old S-bit because the driver does not own the buffer, hardware does. After processing the s-bit packet the hardware will interrupt with a stat/ack of RNR and RUS of suspended. When software processes a packet with an old S-bit it allocates new buffers and sets the s-bit on the new next to last packet. The above case changes now: 1. 4 buffers numbered 1-4 in a receive pool, all freshly allocated. S-bit is on buffer 3. 2. Hardware consumes 3 buffers, hits S-bit, RNR interrupts 3. Software processes 3 buffers, begins to allocate new buffers 4. Software sends resume once buffers are allocated, S-bit is on buffer 2. 5. Hardware gets resume. When it processed buffer 3, it saved the link to buffer 4 and thus resumes at buffer 4. Here is a different flow where the software stays ahead: 1. 4 buffers numbered 1-4 in a receive pool, all freshly allocated. S-bit is on buffer 3. 2. Hardware consumes 2 buffers (1, 2). 3. Software processes buffers 1, 2, begins to allocate new buffers 4. Software buffers 1, 2 are allocated 5. Hardware consumes 1 buffer (#3) and hits S-bit, RNR interrupts. 6. Software consumes 1 buffer, (#3) and finds the old S-bit. It allocates a new buffer 3 and sets the S-bit on buffer 2. 7. Software sends resume, hardware continues at buffer 4. In this setup, software will send a resume command every RING_SIZE packets. RNR interrupts will also occur every RING_SIZE packets. When hardware is faster than software, it will process RING_SIZE packets, RNR interrupt and wait for software to process all of them. When software is faster then hardware, hardware will still process RING_SIZE packets before interrupting but software will only need to allocate 1 packet or so before sending the resume so hardware will wait much less time. This will probably slow things down since on a fast CPU, software will normally stay ahead of the hardware and the only PCI operations from the driver would be interrupt acks. With this change, we have PCI operations every 256 packets. I don't see how else to do this in a safe way on ARM (at least PXA255). I am testing this over the weekend with a 16-buffer receive pool. If all goes well, I will send a patch early next week. It will basically back out the S-bit patch and then make the changes noted above. -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] e100 rx: or s and el bits 2007-05-04 21:43 ` David Acker @ 2007-05-06 6:36 ` Milton Miller 2007-05-07 15:27 ` David Acker 2007-05-14 18:26 ` [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) David Acker 0 siblings, 2 replies; 51+ messages in thread From: Milton Miller @ 2007-05-06 6:36 UTC (permalink / raw) To: David Acker Cc: Auke Kok, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher [dropping Andrew, Jeff, and LKML] On May 4, 2007, at 4:43 PM, David Acker wrote: > David Acker wrote: >> So far my testing has shown both the original and the new version of >> the S-bit patch work in that no corruption seemed to occur over long >> term runs. > > I spoke too soon. Further testing has not gone well. If I use the > default settings for CPU saver and drop the receive pool down to 16 > buffers I can cause problems with various forms of the patch. With > the original S-bit patch I can get: ... > > The updated patch produced a different issue. We got an RNR interrupt > indicating the receive unit got ahead of the software. The S-bit > patch removed any handling of this issue as it assumed the hardware > would spin > on the sbit. Apparently if both the S-bit and the EL-bit are set on > the same RFD, it follows the EL-bit handling. Printing the stat/ack > and status bytes on the RNR interrupts I get: > > status > 01001000 = 0x48 = RUS of 0010 = No Resources, CUS of 01 = Suspended > > stat/ack > 01010000 = 0x50 = FR, RNR > or > 00010000 = 0x10 = RNR > > Notice that the RUS went into No Resources and not suspended. Thus > clearing the S-bit does not wake it up; it needs a new start command. > I could not find documentation that states that the S-bit need only be > cleared to take the RU out of suspended state. Before the S-bit patch > the driver tried to track this need but that version of the driver > didn't work for me either. By the way, I am using, "Intel 8255x > 10/100 Mbps Ethernet Controller Family, Open Source Software Developer > Manual, January 2006" as my documentation. > > This got me looking at just how in the world this worked on the old > eepro100 driver. It had another difference; it did not reap the last > rx > buffer in the chain. It set a postponed bit and then picked it up on > the next interrupt after more buffers had been allocated. It then > noticed that the RU was in a suspended or no resources state and did a > softreset. > > I don't believe this avoid the last buffer trick really fixes the > race. Imagine the following: > 1. 4 buffers in receive pool, all freshly allocated > 2. Hardware consumes 3 buffers > 3. Software processes 3 buffers, begins to allocate new buffers > 4. Hardware writes status bits into buffer 4 while software updates > link and command word bits in buffer 4. They share a cache line and > corrupt each other. > > This appears to be possible with any of the versions of this driver I > have seen. The problem is one of packet ownership. Once the driver > gives a list of buffers to hardware, hardware owns them all. The > driver can not safely change these buffers. Sadly, this means that > the idea of the driver "staying ahead" of the hardware such that the > hardware never runs out of resources will not work here. Once the > driver gives the hardware a packet with S or EL bits set, it must let > the hardware encounter the packet and return it to software. > > I think the driver needs to protect the last entry in the ring by > putting the S-bit on the entry before it. The first time the driver > allocates a block of packets, it writes a new S-bit out on the next to > last packet. As buffers complete it allocates more packets in the > chain but does not set a new S-bit since the old one will stop > hardware. It can not clear the old S-bit because the driver does not > own the buffer, hardware does. After processing the s-bit packet the > hardware will interrupt with a stat/ack of RNR and RUS of suspended. > When software processes a packet with an old S-bit it allocates new > buffers and sets the s-bit on the new next to last packet. > > The above case changes now: > 1. 4 buffers numbered 1-4 in a receive pool, all freshly allocated. > S-bit is on buffer 3. > 2. Hardware consumes 3 buffers, hits S-bit, RNR interrupts > 3. Software processes 3 buffers, begins to allocate new buffers > 4. Software sends resume once buffers are allocated, S-bit is on > buffer 2. > 5. Hardware gets resume. When it processed buffer 3, it saved the > link to buffer 4 and thus resumes at buffer 4. > > > Here is a different flow where the software stays ahead: > 1. 4 buffers numbered 1-4 in a receive pool, all freshly allocated. > S-bit is on buffer 3. > 2. Hardware consumes 2 buffers (1, 2). > 3. Software processes buffers 1, 2, begins to allocate new buffers > 4. Software buffers 1, 2 are allocated > 5. Hardware consumes 1 buffer (#3) and hits S-bit, RNR interrupts. > 6. Software consumes 1 buffer, (#3) and finds the old S-bit. It > allocates a new buffer 3 and sets the S-bit on buffer 2. > 7. Software sends resume, hardware continues at buffer 4. > > In this setup, software will send a resume command every RING_SIZE > packets. RNR interrupts will also occur every RING_SIZE packets. > When hardware is faster than software, it will process RING_SIZE > packets, RNR interrupt and wait for software to process all of them. > When software is faster then hardware, hardware will still process > RING_SIZE packets before interrupting but software will only need to > allocate 1 packet or so before sending the resume so hardware will > wait much less time. > > This will probably slow things down since on a fast CPU, software will > normally stay ahead of the hardware and the only PCI operations from > the driver would be interrupt acks. With this change, we have PCI > operations every 256 packets. I don't see how else to do this in a > safe way on ARM (at least PXA255). > > I am testing this over the weekend with a 16-buffer receive pool. If > all goes well, I will send a patch early next week. It will basically > back out the S-bit patch and then make the changes noted above. > While this will help the problem with the cache-incoherent DMA systems not running, it guarantees the hardware will stop every <ring-size> packets and wait for the cpu to respond to an interrupt. It would seem that this will lead to packet drops. [download manual from site in source file] In fact 6.4.3.4 says 82557 will start dropping frames immediately. Looking at the descriptions around page 101: (1) The link pointer, S, and EL is read when hw starts recieving the frame. (2) Its pretty clear EL overrides S from the order of the descriptions in the text. (3) 6.4.3.3.1 #4 looks intresting -- That is a RFD with size 0 skips frame fill and goes to the next packet. How about putting a zero length descriptor in consistent memory to suspend the rx unit before the last real frame? In other words fr0 -> fr1 ... frN-2 -> frN-1 -> WaitHere0 -> FrN. We could then have 2 such frames, and when we refill modify FrN to the new chain, with the WaitHere1 as its next-to-last, do the syncs, then clear the S bit on WaitHere0. When the rx passes WaitHere0 we can reclaim it for the next use (might want a slightly larger pool, basically need RxRingSize / RxRingFillBatch such frames. milton ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] e100 rx: or s and el bits 2007-05-06 6:36 ` Milton Miller @ 2007-05-07 15:27 ` David Acker 2007-05-14 18:26 ` [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) David Acker 1 sibling, 0 replies; 51+ messages in thread From: David Acker @ 2007-05-07 15:27 UTC (permalink / raw) To: Milton Miller Cc: Auke Kok, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher Milton Miller wrote: > While this will help the problem with the cache-incoherent DMA systems > not running, it guarantees the hardware will stop every <ring-size> > packets and wait for the cpu to respond to an interrupt. It would seem > that this will lead to packet drops. > Well, in NAPI mode, we the CPU may poll its way to the last buffer without having to go through an interrupt cycle. You are right that buffers would probably get dropped between the time the hardware hit the S-bit and the CPU caught up. > [download manual from site in source file] > > In fact 6.4.3.4 says 82557 will start dropping frames immediately. > > Looking at the descriptions around page 101: > (1) The link pointer, S, and EL is read when hw starts recieving the frame. > (2) Its pretty clear EL overrides S from the order of the descriptions > in the text. My testing confirms this. > (3) 6.4.3.3.1 #4 looks intresting -- That is a RFD with size 0 skips > frame fill and goes to the next packet. > > How about putting a zero length descriptor in consistent memory to > suspend the rx unit before the last real frame? In other words fr0 -> > fr1 ... frN-2 -> frN-1 -> WaitHere0 -> FrN. We could then have 2 such > frames, and when we refill modify FrN to the new chain, with the > WaitHere1 as its next-to-last, do the syncs, then clear the S bit on > WaitHere0. When the rx passes WaitHere0 we can reclaim it for the next > use (might want a slightly larger pool, basically need RxRingSize / > RxRingFillBatch such frames. Hmm...I will take a look at this. My test worked over the weekend by the way. A patch will be coming. -Ack ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-06 6:36 ` Milton Miller 2007-05-07 15:27 ` David Acker @ 2007-05-14 18:26 ` David Acker 2007-05-18 1:54 ` Jeff Garzik 1 sibling, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-14 18:26 UTC (permalink / raw) To: Milton Miller Cc: Auke Kok, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev Milton Miller wrote: > In fact 6.4.3.4 says 82557 will start dropping frames immediately. > > Looking at the descriptions around page 101: > (1) The link pointer, S, and EL is read when hw starts recieving the frame. > (2) Its pretty clear EL overrides S from the order of the descriptions > in the text. > (3) 6.4.3.3.1 #4 looks intresting -- That is a RFD with size 0 skips > frame fill and goes to the next packet. > > How about putting a zero length descriptor in consistent memory to > suspend the rx unit before the last real frame? In other words fr0 -> > fr1 ... frN-2 -> frN-1 -> WaitHere0 -> FrN. We could then have 2 such > frames, and when we refill modify FrN to the new chain, with the > WaitHere1 as its next-to-last, do the syncs, then clear the S bit on > WaitHere0. When the rx passes WaitHere0 we can reclaim it for the next > use (might want a slightly larger pool, basically need RxRingSize / > RxRingFillBatch such frames. > I managed to code up something similar to what Milton described here. Below is a patch against 2.6.21.1, before the S-bit patch. I ran it through two 12 hour UDP tests with 20 megabits per second in each direction for data going from an ethernet client to an 802.11g client. I had the receive pool sized at 16 buffers. --- On the ARM, their is a race condition between software allocating a new receive buffer and hardware writing into a buffer. The two race on touching the last Receive Frame Descriptor (RFD). It has its el-bit set and its next link equal to 0. When hardware encounters this buffer it attempts to write data to it and then update Status Word bits and Actual Count in the RFD. At the same time software may try to clear the el-bit and set the link address to a new buffer. Since the entire RFD is once cache-line, the two write operations can collide. This can lead to the receive unit stalling or freed receive buffers getting written to. The fix is to set the el-bit on and the size to 0 on the next to last buffer in the chain. When the hardware encounters this buffer it stops and does not write to it at all. The hardware issues an RNR interrupt with the receive unit in the No Resources state. When software allocates buffers, it can update the tail of the list because it knows the hardware will stop at the buffer before it. Once it has a new next to last buffer prepared, it can clear the el-bit and set the size on the previous one. The race on this buffer is safe since the link already points to a valid next buffer. If the hardware sees the el-bit cleared without the size set, it will move on to the next buffer and complete that one in error. If it sees the size set but the el-bit still set, it will complete that buffer and then RNR interrupt and wait. Signed-off-by: David Acker <dacker@roinet.com> --- --- linux-2.6.21.1/drivers/net/e100.c.orig 2007-05-11 11:27:11.000000000 -0400 +++ linux-2.6.21.1/drivers/net/e100.c 2007-05-11 11:39:06.000000000 -0400 @@ -1781,13 +1781,12 @@ static int e100_rx_alloc_skb(struct nic } /* Link the RFD to end of RFA by linking previous RFD to - * this one, and clearing EL bit of previous. */ + * this one. We are safe to touch the previous RFD because + * it is protected the before last buffer's el bit being set */ if(rx->prev->skb) { struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; put_unaligned(cpu_to_le32(rx->dma_addr), (u32 *)&prev_rfd->link); - wmb(); - prev_rfd->command &= ~cpu_to_le16(cb_el); pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, sizeof(struct rfd), PCI_DMA_TODEVICE); } @@ -1861,6 +1860,8 @@ static void e100_rx_clean(struct nic *ni struct rx *rx; int restart_required = 0; struct rx *rx_to_start = NULL; + struct rx *old_before_last_rx, *new_before_last_rx; + struct rfd *old_before_last_rfd, *new_before_last_rfd; /* are we already rnr? then pay attention!!! this ensures that * the state machine progression never allows a start with a @@ -1885,12 +1886,43 @@ static void e100_rx_clean(struct nic *ni if(restart_required) rx_to_start = nic->rx_to_clean; + old_before_last_rx = nic->rx_to_use->prev->prev; + old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data; + /* Alloc new skbs to refill list */ for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) { if(unlikely(e100_rx_alloc_skb(nic, rx))) break; /* Better luck next time (see watchdog) */ } + new_before_last_rx = nic->rx_to_use->prev->prev; + if (new_before_last_rx != old_before_last_rx) { + /* Set the el-bit on the buffer that is before the last buffer. + * This lets us update the next pointer on the last buffer without + * worrying about hardware touching it. + * We set the size to 0 to prevent hardware from touching this buffer. + * When the hardware hits the before last buffer with el-bit and size + * of 0, it will RNR interrupt, the RUS will go into the No Resources + * state. It will not complete nor write to this buffer. */ + new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data; + new_before_last_rfd->size = 0; + new_before_last_rfd->command |= cpu_to_le16(cb_el); + pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + + /* Now that we have a new stopping point, we can clear the old + * stopping point. + * Note: There appears to be a race here where the hardware + * can complete this buffer with the el-bit set but with the + * size also set. The hardware RNR interrupts, the RUS + * goes into the No Resources state. */ + old_before_last_rfd->command &= ~cpu_to_le16(cb_el); + wmb(); + old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN); + pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + } + if(restart_required) { // ack the rnr? writeb(stat_ack_rnr, &nic->csr->scb.stat_ack); @@ -1926,6 +1958,7 @@ static int e100_rx_alloc_list(struct nic { struct rx *rx; unsigned int i, count = nic->params.rfds.count; + struct rfd *before_last; nic->rx_to_use = nic->rx_to_clean = NULL; nic->ru_running = RU_UNINITIALIZED; @@ -1942,6 +1975,20 @@ static int e100_rx_alloc_list(struct nic } } + /* Set the el-bit on the buffer that is before the last buffer. + * This lets us update the next pointer on the last buffer without + * worrying about hardware touching it. + * We set the size to 0 to prevent hardware from touching this buffer. + * When the hardware hits the before last buffer with el-bit and size + * of 0, it will RNR interrupt, the RUS will go into the No Resources + * state. It will not complete nor write to this buffer. */ + rx = nic->rxs->prev->prev; + before_last = (struct rfd *)rx->skb->data; + before_last->command |= cpu_to_le16(cb_el); + before_last->size = 0; + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + nic->rx_to_use = nic->rx_to_clean = nic->rxs; nic->ru_running = RU_SUSPENDED; ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-14 18:26 ` [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) David Acker @ 2007-05-18 1:54 ` Jeff Garzik 2007-05-18 3:47 ` Kok, Auke 0 siblings, 1 reply; 51+ messages in thread From: Jeff Garzik @ 2007-05-18 1:54 UTC (permalink / raw) To: David Acker, Auke Kok Cc: Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev Can you resend against the latest kernel (2.6.22-rc1)? And what does Intel think? Jeff ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 1:54 ` Jeff Garzik @ 2007-05-18 3:47 ` Kok, Auke 2007-05-18 14:07 ` David Acker 0 siblings, 1 reply; 51+ messages in thread From: Kok, Auke @ 2007-05-18 3:47 UTC (permalink / raw) To: Jeff Garzik Cc: David Acker, Auke Kok, Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev Jeff Garzik wrote: > Can you resend against the latest kernel (2.6.22-rc1)? > > And what does Intel think? I'm expecting at least a reply from Milton as the patch was sent to him. I haven't yet tested it but will certainly do so. At first glance it looks OK, and I'll try to put it under my colleague's noses who know e100 best. A resend against 2.6.22-rc1 would be nice. Auke ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 3:47 ` Kok, Auke @ 2007-05-18 14:07 ` David Acker 2007-05-18 14:20 ` David Acker 0 siblings, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-18 14:07 UTC (permalink / raw) To: Kok, Auke Cc: Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev Kok, Auke wrote: > Jeff Garzik wrote: >> Can you resend against the latest kernel (2.6.22-rc1)? >> >> And what does Intel think? > > I'm expecting at least a reply from Milton as the patch was sent to him. > I haven't yet tested it but will certainly do so. At first glance it > looks OK, and I'll try to put it under my colleague's noses who know > e100 best. > > A resend against 2.6.22-rc1 would be nice. > Done. Below is a patch against 2.6.22-rc1. It combines removing the s-bit patch and applying the patch I previously sent. On the ARM, their is a race condition between software allocating a new receive buffer and hardware writing into a buffer. The two race on touching the last Receive Frame Descriptor (RFD). It has its el-bit set and its next link equal to 0. When hardware encounters this buffer it attempts to write data to it and then update Status Word bits and Actual Count in the RFD. At the same time software may try to clear the el-bit and set the link address to a new buffer. Since the entire RFD is once cache-line, the two write operations can collide. This can lead to the receive unit stalling or freed receive buffers getting written to. The fix is to set the el-bit on and the size to 0 on the next to last buffer in the chain. When the hardware encounters this buffer it stops and does not write to it at all. The hardware issues an RNR interrupt with the receive unit in the No Resources state. When software allocates buffers, it can update the tail of the list because it knows the hardware will stop at the buffer before it. Once it has a new next to last buffer prepared, it can clear the el-bit and set the size on the previous one. The race on this buffer is safe since the link already points to a valid next buffer. If the hardware sees the el-bit cleared without the size set, it will move on to the next buffer and complete that one in error. If it sees the size set but the el-bit still set, it will complete that buffer and then RNR interrupt and wait. Signed-off-by: David Acker <dacker@roinet.com> --- --- linux-2.6.22-rc1/drivers/net/e100.c.orig 2007-05-18 10:01:52.000000000 -0400 +++ linux-2.6.22-rc1/drivers/net/e100.c 2007-05-18 09:59:33.000000000 -0400 @@ -285,6 +285,12 @@ enum scb_status { rus_mask = 0x3C, }; +enum ru_state { + RU_SUSPENDED = 0, + RU_RUNNING = 1, + RU_UNINITIALIZED = -1, +}; + enum scb_stat_ack { stat_ack_not_ours = 0x00, stat_ack_sw_gen = 0x04, @@ -526,6 +532,7 @@ struct nic { struct rx *rx_to_use; struct rx *rx_to_clean; struct rfd blank_rfd; + enum ru_state ru_running; spinlock_t cb_lock ____cacheline_aligned; spinlock_t cmd_lock; @@ -947,7 +954,7 @@ static void e100_get_defaults(struct nic ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i)); /* Template for a freshly allocated RFD */ - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); + nic->blank_rfd.command = cpu_to_le16(cb_el); nic->blank_rfd.rbd = 0xFFFFFFFF; nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN); @@ -1742,11 +1749,19 @@ static int e100_alloc_cbs(struct nic *ni return 0; } -static inline void e100_start_receiver(struct nic *nic) +static inline void e100_start_receiver(struct nic *nic, struct rx *rx) { - /* Start if RFA is non-NULL */ - if(nic->rx_to_clean->skb) - e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr); + if(!nic->rxs) return; + if(RU_SUSPENDED != nic->ru_running) return; + + /* handle init time starts */ + if(!rx) rx = nic->rxs; + + /* (Re)start RU if suspended or idle and RFA is non-NULL */ + if(rx->skb) { + e100_exec_cmd(nic, ruc_start, rx->dma_addr); + nic->ru_running = RU_RUNNING; + } } #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN) @@ -1769,13 +1784,12 @@ static int e100_rx_alloc_skb(struct nic } /* Link the RFD to end of RFA by linking previous RFD to - * this one, and clearing EL bit of previous. */ + * this one. We are safe to touch the previous RFD because + * it is protected the before last buffer's el bit being set */ if(rx->prev->skb) { struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; put_unaligned(cpu_to_le32(rx->dma_addr), (u32 *)&prev_rfd->link); - wmb(); - prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s); pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, sizeof(struct rfd), PCI_DMA_TODEVICE); } @@ -1813,6 +1827,10 @@ static int e100_rx_indicate(struct nic * pci_unmap_single(nic->pdev, rx->dma_addr, RFD_BUF_LEN, PCI_DMA_FROMDEVICE); + /* this allows for a fast restart without re-enabling interrupts */ + if(le16_to_cpu(rfd->command) & cb_el) + nic->ru_running = RU_SUSPENDED; + /* Pull off the RFD and put the actual data (minus eth hdr) */ skb_reserve(skb, sizeof(struct rfd)); skb_put(skb, actual_size); @@ -1843,18 +1861,78 @@ static void e100_rx_clean(struct nic *ni unsigned int work_to_do) { struct rx *rx; + int restart_required = 0; + struct rx *rx_to_start = NULL; + struct rx *old_before_last_rx, *new_before_last_rx; + struct rfd *old_before_last_rfd, *new_before_last_rfd; + + /* are we already rnr? then pay attention!!! this ensures that + * the state machine progression never allows a start with a + * partially cleaned list, avoiding a race between hardware + * and rx_to_clean when in NAPI mode */ + if(RU_SUSPENDED == nic->ru_running) + restart_required = 1; /* Indicate newly arrived packets */ for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) { - if(e100_rx_indicate(nic, rx, work_done, work_to_do)) + int err = e100_rx_indicate(nic, rx, work_done, work_to_do); + if(-EAGAIN == err) { + /* hit quota so have more work to do, restart once + * cleanup is complete */ + restart_required = 0; + break; + } else if(-ENODATA == err) break; /* No more to clean */ } + /* save our starting point as the place we'll restart the receiver */ + if(restart_required) + rx_to_start = nic->rx_to_clean; + + old_before_last_rx = nic->rx_to_use->prev->prev; + old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data; + /* Alloc new skbs to refill list */ for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) { if(unlikely(e100_rx_alloc_skb(nic, rx))) break; /* Better luck next time (see watchdog) */ } + + new_before_last_rx = nic->rx_to_use->prev->prev; + if (new_before_last_rx != old_before_last_rx) { + /* Set the el-bit on the buffer that is before the last buffer. + * This lets us update the next pointer on the last buffer without + * worrying about hardware touching it. + * We set the size to 0 to prevent hardware from touching this buffer. + * When the hardware hits the before last buffer with el-bit and size + * of 0, it will RNR interrupt, the RUS will go into the No Resources + * state. It will not complete nor write to this buffer. */ + new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data; + new_before_last_rfd->size = 0; + new_before_last_rfd->command |= cpu_to_le16(cb_el); + pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + + /* Now that we have a new stopping point, we can clear the old + * stopping point. + * Note: There appears to be a race here where the hardware + * can complete this buffer with the el-bit set but with the + * size also set. The hardware RNR interrupts, the RUS + * goes into the No Resources state. */ + old_before_last_rfd->command &= ~cpu_to_le16(cb_el); + wmb(); + old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN); + pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + } + + if(restart_required) { + // ack the rnr? + writeb(stat_ack_rnr, &nic->csr->scb.stat_ack); + e100_start_receiver(nic, rx_to_start); + if(work_done) + (*work_done)++; + } } static void e100_rx_clean_list(struct nic *nic) @@ -1862,6 +1940,8 @@ static void e100_rx_clean_list(struct ni struct rx *rx; unsigned int i, count = nic->params.rfds.count; + nic->ru_running = RU_UNINITIALIZED; + if(nic->rxs) { for(rx = nic->rxs, i = 0; i < count; rx++, i++) { if(rx->skb) { @@ -1881,8 +1961,10 @@ static int e100_rx_alloc_list(struct nic { struct rx *rx; unsigned int i, count = nic->params.rfds.count; + struct rfd *before_last; nic->rx_to_use = nic->rx_to_clean = NULL; + nic->ru_running = RU_UNINITIALIZED; if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC))) return -ENOMEM; @@ -1896,7 +1978,22 @@ static int e100_rx_alloc_list(struct nic } } + /* Set the el-bit on the buffer that is before the last buffer. + * This lets us update the next pointer on the last buffer without + * worrying about hardware touching it. + * We set the size to 0 to prevent hardware from touching this buffer. + * When the hardware hits the before last buffer with el-bit and size + * of 0, it will RNR interrupt, the RUS will go into the No Resources + * state. It will not complete nor write to this buffer. */ + rx = nic->rxs->prev->prev; + before_last = (struct rfd *)rx->skb->data; + before_last->command |= cpu_to_le16(cb_el); + before_last->size = 0; + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + nic->rx_to_use = nic->rx_to_clean = nic->rxs; + nic->ru_running = RU_SUSPENDED; return 0; } @@ -1916,6 +2013,10 @@ static irqreturn_t e100_intr(int irq, vo /* Ack interrupt(s) */ iowrite8(stat_ack, &nic->csr->scb.stat_ack); + /* We hit Receive No Resource (RNR); restart RU after cleaning */ + if(stat_ack & stat_ack_rnr) + nic->ru_running = RU_SUSPENDED; + if(likely(netif_rx_schedule_prep(netdev))) { e100_disable_irq(nic); __netif_rx_schedule(netdev); @@ -2007,7 +2108,7 @@ static int e100_up(struct nic *nic) if((err = e100_hw_init(nic))) goto err_clean_cbs; e100_set_multicast_list(nic->netdev); - e100_start_receiver(nic); + e100_start_receiver(nic, NULL); mod_timer(&nic->watchdog, jiffies); if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED, nic->netdev->name, nic->netdev))) @@ -2088,7 +2189,7 @@ static int e100_loopback_test(struct nic mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR, BMCR_LOOPBACK); - e100_start_receiver(nic); + e100_start_receiver(nic, NULL); if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) { err = -ENOMEM; ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 14:07 ` David Acker @ 2007-05-18 14:20 ` David Acker 2007-05-18 15:29 ` Kok, Auke 0 siblings, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-18 14:20 UTC (permalink / raw) To: David Acker Cc: Kok, Auke, Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev David Acker wrote: > Kok, Auke wrote: >> Jeff Garzik wrote: >>> Can you resend against the latest kernel (2.6.22-rc1)? >>> >>> And what does Intel think? >> >> I'm expecting at least a reply from Milton as the patch was sent to >> him. I haven't yet tested it but will certainly do so. At first glance >> it looks OK, and I'll try to put it under my colleague's noses who >> know e100 best. >> >> A resend against 2.6.22-rc1 would be nice. >> > > Done. Below is a patch against 2.6.22-rc1. It combines removing the > s-bit patch and applying the patch I previously sent. Oops. I missed one state in that patch. Since the el-bit buffer will normally not complete due to a zero size, we need to check if the buffer with no data has the el-bit set. Without this, you have to wait for the interrupt. Sorry about that...this was in the code I tested on my embedded system but got lost in the regular kernel patch. On the ARM, their is a race condition between software allocating a new receive buffer and hardware writing into a buffer. The two race on touching the last Receive Frame Descriptor (RFD). It has its el-bit set and its next link equal to 0. When hardware encounters this buffer it attempts to write data to it and then update Status Word bits and Actual Count in the RFD. At the same time software may try to clear the el-bit and set the link address to a new buffer. Since the entire RFD is once cache-line, the two write operations can collide. This can lead to the receive unit stalling or freed receive buffers getting written to. The fix is to set the el-bit on and the size to 0 on the next to last buffer in the chain. When the hardware encounters this buffer it stops and does not write to it at all. The hardware issues an RNR interrupt with the receive unit in the No Resources state. When software allocates buffers, it can update the tail of the list because it knows the hardware will stop at the buffer before it. Once it has a new next to last buffer prepared, it can clear the el-bit and set the size on the previous one. The race on this buffer is safe since the link already points to a valid next buffer. If the hardware sees the el-bit cleared without the size set, it will move on to the next buffer and complete that one in error. If it sees the size set but the el-bit still set, it will complete that buffer and then RNR interrupt and wait. Signed-off-by: David Acker <dacker@roinet.com> --- --- linux-2.6.22-rc1/drivers/net/e100.c.orig 2007-05-18 10:16:03.000000000 -0400 +++ linux-2.6.22-rc1/drivers/net/e100.c 2007-05-18 10:15:53.000000000 -0400 @@ -285,6 +285,12 @@ enum scb_status { rus_mask = 0x3C, }; +enum ru_state { + RU_SUSPENDED = 0, + RU_RUNNING = 1, + RU_UNINITIALIZED = -1, +}; + enum scb_stat_ack { stat_ack_not_ours = 0x00, stat_ack_sw_gen = 0x04, @@ -526,6 +532,7 @@ struct nic { struct rx *rx_to_use; struct rx *rx_to_clean; struct rfd blank_rfd; + enum ru_state ru_running; spinlock_t cb_lock ____cacheline_aligned; spinlock_t cmd_lock; @@ -947,7 +954,7 @@ static void e100_get_defaults(struct nic ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i)); /* Template for a freshly allocated RFD */ - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); + nic->blank_rfd.command = cpu_to_le16(cb_el); nic->blank_rfd.rbd = 0xFFFFFFFF; nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN); @@ -1742,11 +1749,19 @@ static int e100_alloc_cbs(struct nic *ni return 0; } -static inline void e100_start_receiver(struct nic *nic) +static inline void e100_start_receiver(struct nic *nic, struct rx *rx) { - /* Start if RFA is non-NULL */ - if(nic->rx_to_clean->skb) - e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr); + if(!nic->rxs) return; + if(RU_SUSPENDED != nic->ru_running) return; + + /* handle init time starts */ + if(!rx) rx = nic->rxs; + + /* (Re)start RU if suspended or idle and RFA is non-NULL */ + if(rx->skb) { + e100_exec_cmd(nic, ruc_start, rx->dma_addr); + nic->ru_running = RU_RUNNING; + } } #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN) @@ -1769,13 +1784,12 @@ static int e100_rx_alloc_skb(struct nic } /* Link the RFD to end of RFA by linking previous RFD to - * this one, and clearing EL bit of previous. */ + * this one. We are safe to touch the previous RFD because + * it is protected the before last buffer's el bit being set */ if(rx->prev->skb) { struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; put_unaligned(cpu_to_le32(rx->dma_addr), (u32 *)&prev_rfd->link); - wmb(); - prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s); pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, sizeof(struct rfd), PCI_DMA_TODEVICE); } @@ -1801,8 +1815,12 @@ static int e100_rx_indicate(struct nic * DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status); /* If data isn't ready, nothing to indicate */ - if(unlikely(!(rfd_status & cb_complete))) + if(unlikely(!(rfd_status & cb_complete))) { + /* this allows for a fast restart without re-enabling interrupts */ + if(le16_to_cpu(rfd->command) & cb_el) + nic->ru_running = RU_SUSPENDED; return -ENODATA; + } /* Get actual data size */ actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF; @@ -1813,6 +1831,10 @@ static int e100_rx_indicate(struct nic * pci_unmap_single(nic->pdev, rx->dma_addr, RFD_BUF_LEN, PCI_DMA_FROMDEVICE); + /* this allows for a fast restart without re-enabling interrupts */ + if(le16_to_cpu(rfd->command) & cb_el) + nic->ru_running = RU_SUSPENDED; + /* Pull off the RFD and put the actual data (minus eth hdr) */ skb_reserve(skb, sizeof(struct rfd)); skb_put(skb, actual_size); @@ -1843,18 +1865,78 @@ static void e100_rx_clean(struct nic *ni unsigned int work_to_do) { struct rx *rx; + int restart_required = 0; + struct rx *rx_to_start = NULL; + struct rx *old_before_last_rx, *new_before_last_rx; + struct rfd *old_before_last_rfd, *new_before_last_rfd; + + /* are we already rnr? then pay attention!!! this ensures that + * the state machine progression never allows a start with a + * partially cleaned list, avoiding a race between hardware + * and rx_to_clean when in NAPI mode */ + if(RU_SUSPENDED == nic->ru_running) + restart_required = 1; /* Indicate newly arrived packets */ for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) { - if(e100_rx_indicate(nic, rx, work_done, work_to_do)) + int err = e100_rx_indicate(nic, rx, work_done, work_to_do); + if(-EAGAIN == err) { + /* hit quota so have more work to do, restart once + * cleanup is complete */ + restart_required = 0; + break; + } else if(-ENODATA == err) break; /* No more to clean */ } + /* save our starting point as the place we'll restart the receiver */ + if(restart_required) + rx_to_start = nic->rx_to_clean; + + old_before_last_rx = nic->rx_to_use->prev->prev; + old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data; + /* Alloc new skbs to refill list */ for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) { if(unlikely(e100_rx_alloc_skb(nic, rx))) break; /* Better luck next time (see watchdog) */ } + + new_before_last_rx = nic->rx_to_use->prev->prev; + if (new_before_last_rx != old_before_last_rx) { + /* Set the el-bit on the buffer that is before the last buffer. + * This lets us update the next pointer on the last buffer without + * worrying about hardware touching it. + * We set the size to 0 to prevent hardware from touching this buffer. + * When the hardware hits the before last buffer with el-bit and size + * of 0, it will RNR interrupt, the RUS will go into the No Resources + * state. It will not complete nor write to this buffer. */ + new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data; + new_before_last_rfd->size = 0; + new_before_last_rfd->command |= cpu_to_le16(cb_el); + pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + + /* Now that we have a new stopping point, we can clear the old + * stopping point. + * Note: There appears to be a race here where the hardware + * can complete this buffer with the el-bit set but with the + * size also set. The hardware RNR interrupts, the RUS + * goes into the No Resources state. */ + old_before_last_rfd->command &= ~cpu_to_le16(cb_el); + wmb(); + old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN); + pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + } + + if(restart_required) { + // ack the rnr? + writeb(stat_ack_rnr, &nic->csr->scb.stat_ack); + e100_start_receiver(nic, rx_to_start); + if(work_done) + (*work_done)++; + } } static void e100_rx_clean_list(struct nic *nic) @@ -1862,6 +1944,8 @@ static void e100_rx_clean_list(struct ni struct rx *rx; unsigned int i, count = nic->params.rfds.count; + nic->ru_running = RU_UNINITIALIZED; + if(nic->rxs) { for(rx = nic->rxs, i = 0; i < count; rx++, i++) { if(rx->skb) { @@ -1881,8 +1965,10 @@ static int e100_rx_alloc_list(struct nic { struct rx *rx; unsigned int i, count = nic->params.rfds.count; + struct rfd *before_last; nic->rx_to_use = nic->rx_to_clean = NULL; + nic->ru_running = RU_UNINITIALIZED; if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC))) return -ENOMEM; @@ -1896,7 +1982,22 @@ static int e100_rx_alloc_list(struct nic } } + /* Set the el-bit on the buffer that is before the last buffer. + * This lets us update the next pointer on the last buffer without + * worrying about hardware touching it. + * We set the size to 0 to prevent hardware from touching this buffer. + * When the hardware hits the before last buffer with el-bit and size + * of 0, it will RNR interrupt, the RUS will go into the No Resources + * state. It will not complete nor write to this buffer. */ + rx = nic->rxs->prev->prev; + before_last = (struct rfd *)rx->skb->data; + before_last->command |= cpu_to_le16(cb_el); + before_last->size = 0; + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + nic->rx_to_use = nic->rx_to_clean = nic->rxs; + nic->ru_running = RU_SUSPENDED; return 0; } @@ -1916,6 +2017,10 @@ static irqreturn_t e100_intr(int irq, vo /* Ack interrupt(s) */ iowrite8(stat_ack, &nic->csr->scb.stat_ack); + /* We hit Receive No Resource (RNR); restart RU after cleaning */ + if(stat_ack & stat_ack_rnr) + nic->ru_running = RU_SUSPENDED; + if(likely(netif_rx_schedule_prep(netdev))) { e100_disable_irq(nic); __netif_rx_schedule(netdev); @@ -2007,7 +2112,7 @@ static int e100_up(struct nic *nic) if((err = e100_hw_init(nic))) goto err_clean_cbs; e100_set_multicast_list(nic->netdev); - e100_start_receiver(nic); + e100_start_receiver(nic, NULL); mod_timer(&nic->watchdog, jiffies); if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED, nic->netdev->name, nic->netdev))) @@ -2088,7 +2193,7 @@ static int e100_loopback_test(struct nic mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR, BMCR_LOOPBACK); - e100_start_receiver(nic); + e100_start_receiver(nic, NULL); if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) { err = -ENOMEM; ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 14:20 ` David Acker @ 2007-05-18 15:29 ` Kok, Auke 2007-05-18 15:47 ` David Acker 0 siblings, 1 reply; 51+ messages in thread From: Kok, Auke @ 2007-05-18 15:29 UTC (permalink / raw) To: David Acker Cc: Kok, Auke, Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev David Acker wrote: > David Acker wrote: >> Kok, Auke wrote: >>> Jeff Garzik wrote: >>>> Can you resend against the latest kernel (2.6.22-rc1)? >>>> >>>> And what does Intel think? >>> I'm expecting at least a reply from Milton as the patch was sent to >>> him. I haven't yet tested it but will certainly do so. At first glance >>> it looks OK, and I'll try to put it under my colleague's noses who >>> know e100 best. >>> >>> A resend against 2.6.22-rc1 would be nice. >>> >> Done. Below is a patch against 2.6.22-rc1. It combines removing the >> s-bit patch and applying the patch I previously sent. > > Oops. I missed one state in that patch. Since the el-bit buffer will normally not > complete due to a zero size, we need to check if the buffer with no data has the el-bit > set. Without this, you have to wait for the interrupt. Sorry about that...this was in > the code I tested on my embedded system but got lost in the regular kernel patch. OK. Thanks. If you don't mind I'm going to have some testing on this patch done for a bit now (mostly x86 hardware of course) to see if there's no pitfalls in it. It'll be a few days because of the weekend before I get back on it. Auke > > On the ARM, their is a race condition between software allocating a new receive > buffer and hardware writing into a buffer. The two race on touching the last > Receive Frame Descriptor (RFD). It has its el-bit set and its next link equal > to 0. When hardware encounters this buffer it attempts to write data to it > and then update Status Word bits and Actual Count in the RFD. At the same time > software may try to clear the el-bit and set the link address to a new buffer. > Since the entire RFD is once cache-line, the two write operations can > collide. This can lead to the receive unit stalling or freed receive buffers > getting written to. > > The fix is to set the el-bit on and the size to 0 on the next to last buffer > in the chain. When the hardware encounters this buffer it stops and does > not write to it at all. The hardware issues an RNR interrupt with the > receive unit in the No Resources state. When software allocates buffers, > it can update the tail of the list because it knows the hardware will stop > at the buffer before it. Once it has a new next to last buffer prepared, > it can clear the el-bit and set the size on the previous one. The race on > this buffer is safe since the link already points to a valid next buffer. > If the hardware sees the el-bit cleared without the size set, it will > move on to the next buffer and complete that one in error. If it sees > the size set but the el-bit still set, it will complete that buffer > and then RNR interrupt and wait. > > > Signed-off-by: David Acker <dacker@roinet.com> > > --- > > > --- linux-2.6.22-rc1/drivers/net/e100.c.orig 2007-05-18 10:16:03.000000000 -0400 > +++ linux-2.6.22-rc1/drivers/net/e100.c 2007-05-18 10:15:53.000000000 -0400 > @@ -285,6 +285,12 @@ enum scb_status { > rus_mask = 0x3C, > }; > > +enum ru_state { > + RU_SUSPENDED = 0, > + RU_RUNNING = 1, > + RU_UNINITIALIZED = -1, > +}; > + > enum scb_stat_ack { > stat_ack_not_ours = 0x00, > stat_ack_sw_gen = 0x04, > @@ -526,6 +532,7 @@ struct nic { > struct rx *rx_to_use; > struct rx *rx_to_clean; > struct rfd blank_rfd; > + enum ru_state ru_running; > > spinlock_t cb_lock ____cacheline_aligned; > spinlock_t cmd_lock; > @@ -947,7 +954,7 @@ static void e100_get_defaults(struct nic > ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i)); > > /* Template for a freshly allocated RFD */ > - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); > + nic->blank_rfd.command = cpu_to_le16(cb_el); > nic->blank_rfd.rbd = 0xFFFFFFFF; > nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN); > > @@ -1742,11 +1749,19 @@ static int e100_alloc_cbs(struct nic *ni > return 0; > } > > -static inline void e100_start_receiver(struct nic *nic) > +static inline void e100_start_receiver(struct nic *nic, struct rx *rx) > { > - /* Start if RFA is non-NULL */ > - if(nic->rx_to_clean->skb) > - e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr); > + if(!nic->rxs) return; > + if(RU_SUSPENDED != nic->ru_running) return; > + > + /* handle init time starts */ > + if(!rx) rx = nic->rxs; > + > + /* (Re)start RU if suspended or idle and RFA is non-NULL */ > + if(rx->skb) { > + e100_exec_cmd(nic, ruc_start, rx->dma_addr); > + nic->ru_running = RU_RUNNING; > + } > } > > #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN) > @@ -1769,13 +1784,12 @@ static int e100_rx_alloc_skb(struct nic > } > > /* Link the RFD to end of RFA by linking previous RFD to > - * this one, and clearing EL bit of previous. */ > + * this one. We are safe to touch the previous RFD because > + * it is protected the before last buffer's el bit being set */ > if(rx->prev->skb) { > struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; > put_unaligned(cpu_to_le32(rx->dma_addr), > (u32 *)&prev_rfd->link); > - wmb(); > - prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s); > pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, > sizeof(struct rfd), PCI_DMA_TODEVICE); > } > @@ -1801,8 +1815,12 @@ static int e100_rx_indicate(struct nic * > DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status); > > /* If data isn't ready, nothing to indicate */ > - if(unlikely(!(rfd_status & cb_complete))) > + if(unlikely(!(rfd_status & cb_complete))) { > + /* this allows for a fast restart without re-enabling interrupts */ > + if(le16_to_cpu(rfd->command) & cb_el) > + nic->ru_running = RU_SUSPENDED; > return -ENODATA; > + } > > /* Get actual data size */ > actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF; > @@ -1813,6 +1831,10 @@ static int e100_rx_indicate(struct nic * > pci_unmap_single(nic->pdev, rx->dma_addr, > RFD_BUF_LEN, PCI_DMA_FROMDEVICE); > > + /* this allows for a fast restart without re-enabling interrupts */ > + if(le16_to_cpu(rfd->command) & cb_el) > + nic->ru_running = RU_SUSPENDED; > + > /* Pull off the RFD and put the actual data (minus eth hdr) */ > skb_reserve(skb, sizeof(struct rfd)); > skb_put(skb, actual_size); > @@ -1843,18 +1865,78 @@ static void e100_rx_clean(struct nic *ni > unsigned int work_to_do) > { > struct rx *rx; > + int restart_required = 0; > + struct rx *rx_to_start = NULL; > + struct rx *old_before_last_rx, *new_before_last_rx; > + struct rfd *old_before_last_rfd, *new_before_last_rfd; > + > + /* are we already rnr? then pay attention!!! this ensures that > + * the state machine progression never allows a start with a > + * partially cleaned list, avoiding a race between hardware > + * and rx_to_clean when in NAPI mode */ > + if(RU_SUSPENDED == nic->ru_running) > + restart_required = 1; > > /* Indicate newly arrived packets */ > for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) { > - if(e100_rx_indicate(nic, rx, work_done, work_to_do)) > + int err = e100_rx_indicate(nic, rx, work_done, work_to_do); > + if(-EAGAIN == err) { > + /* hit quota so have more work to do, restart once > + * cleanup is complete */ > + restart_required = 0; > + break; > + } else if(-ENODATA == err) > break; /* No more to clean */ > } > > + /* save our starting point as the place we'll restart the receiver */ > + if(restart_required) > + rx_to_start = nic->rx_to_clean; > + > + old_before_last_rx = nic->rx_to_use->prev->prev; > + old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data; > + > /* Alloc new skbs to refill list */ > for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) { > if(unlikely(e100_rx_alloc_skb(nic, rx))) > break; /* Better luck next time (see watchdog) */ > } > + > + new_before_last_rx = nic->rx_to_use->prev->prev; > + if (new_before_last_rx != old_before_last_rx) { > + /* Set the el-bit on the buffer that is before the last buffer. > + * This lets us update the next pointer on the last buffer without > + * worrying about hardware touching it. > + * We set the size to 0 to prevent hardware from touching this buffer. > + * When the hardware hits the before last buffer with el-bit and size > + * of 0, it will RNR interrupt, the RUS will go into the No Resources > + * state. It will not complete nor write to this buffer. */ > + new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data; > + new_before_last_rfd->size = 0; > + new_before_last_rfd->command |= cpu_to_le16(cb_el); > + pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr, > + sizeof(struct rfd), PCI_DMA_TODEVICE); > + > + /* Now that we have a new stopping point, we can clear the old > + * stopping point. > + * Note: There appears to be a race here where the hardware > + * can complete this buffer with the el-bit set but with the > + * size also set. The hardware RNR interrupts, the RUS > + * goes into the No Resources state. */ > + old_before_last_rfd->command &= ~cpu_to_le16(cb_el); > + wmb(); > + old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN); > + pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr, > + sizeof(struct rfd), PCI_DMA_TODEVICE); > + } > + > + if(restart_required) { > + // ack the rnr? > + writeb(stat_ack_rnr, &nic->csr->scb.stat_ack); > + e100_start_receiver(nic, rx_to_start); > + if(work_done) > + (*work_done)++; > + } > } > > static void e100_rx_clean_list(struct nic *nic) > @@ -1862,6 +1944,8 @@ static void e100_rx_clean_list(struct ni > struct rx *rx; > unsigned int i, count = nic->params.rfds.count; > > + nic->ru_running = RU_UNINITIALIZED; > + > if(nic->rxs) { > for(rx = nic->rxs, i = 0; i < count; rx++, i++) { > if(rx->skb) { > @@ -1881,8 +1965,10 @@ static int e100_rx_alloc_list(struct nic > { > struct rx *rx; > unsigned int i, count = nic->params.rfds.count; > + struct rfd *before_last; > > nic->rx_to_use = nic->rx_to_clean = NULL; > + nic->ru_running = RU_UNINITIALIZED; > > if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC))) > return -ENOMEM; > @@ -1896,7 +1982,22 @@ static int e100_rx_alloc_list(struct nic > } > } > > + /* Set the el-bit on the buffer that is before the last buffer. > + * This lets us update the next pointer on the last buffer without > + * worrying about hardware touching it. > + * We set the size to 0 to prevent hardware from touching this buffer. > + * When the hardware hits the before last buffer with el-bit and size > + * of 0, it will RNR interrupt, the RUS will go into the No Resources > + * state. It will not complete nor write to this buffer. */ > + rx = nic->rxs->prev->prev; > + before_last = (struct rfd *)rx->skb->data; > + before_last->command |= cpu_to_le16(cb_el); > + before_last->size = 0; > + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr, > + sizeof(struct rfd), PCI_DMA_TODEVICE); > + > nic->rx_to_use = nic->rx_to_clean = nic->rxs; > + nic->ru_running = RU_SUSPENDED; > > return 0; > } > @@ -1916,6 +2017,10 @@ static irqreturn_t e100_intr(int irq, vo > /* Ack interrupt(s) */ > iowrite8(stat_ack, &nic->csr->scb.stat_ack); > > + /* We hit Receive No Resource (RNR); restart RU after cleaning */ > + if(stat_ack & stat_ack_rnr) > + nic->ru_running = RU_SUSPENDED; > + > if(likely(netif_rx_schedule_prep(netdev))) { > e100_disable_irq(nic); > __netif_rx_schedule(netdev); > @@ -2007,7 +2112,7 @@ static int e100_up(struct nic *nic) > if((err = e100_hw_init(nic))) > goto err_clean_cbs; > e100_set_multicast_list(nic->netdev); > - e100_start_receiver(nic); > + e100_start_receiver(nic, NULL); > mod_timer(&nic->watchdog, jiffies); > if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED, > nic->netdev->name, nic->netdev))) > @@ -2088,7 +2193,7 @@ static int e100_loopback_test(struct nic > mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR, > BMCR_LOOPBACK); > > - e100_start_receiver(nic); > + e100_start_receiver(nic, NULL); > > if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) { > err = -ENOMEM; > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 15:29 ` Kok, Auke @ 2007-05-18 15:47 ` David Acker 2007-05-18 15:59 ` Kok, Auke 0 siblings, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-18 15:47 UTC (permalink / raw) To: Kok, Auke Cc: Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev Kok, Auke wrote: > David Acker wrote: >> David Acker wrote: >>> Done. Below is a patch against 2.6.22-rc1. It combines removing the >>> s-bit patch and applying the patch I previously sent. >> >> Oops. I missed one state in that patch. Since the el-bit buffer will >> normally not complete due to a zero size, we need to check if the >> buffer with no data has the el-bit set. Without this, you have to >> wait for the interrupt. Sorry about that...this was in the code I >> tested on my embedded system but got lost in the regular kernel patch. > > OK. Thanks. > > If you don't mind I'm going to have some testing on this patch done for > a bit now (mostly x86 hardware of course) to see if there's no pitfalls > in it. It'll be a few days because of the weekend before I get back on it. > Cool. I will see if I can get some more tests running over the weekend on our PXA255 platform. -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 15:47 ` David Acker @ 2007-05-18 15:59 ` Kok, Auke 2007-05-18 17:11 ` David Acker 0 siblings, 1 reply; 51+ messages in thread From: Kok, Auke @ 2007-05-18 15:59 UTC (permalink / raw) To: David Acker Cc: Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev David Acker wrote: > Kok, Auke wrote: >> David Acker wrote: >>> David Acker wrote: >>>> Done. Below is a patch against 2.6.22-rc1. It combines removing the >>>> s-bit patch and applying the patch I previously sent. >>> Oops. I missed one state in that patch. Since the el-bit buffer will >>> normally not complete due to a zero size, we need to check if the >>> buffer with no data has the el-bit set. Without this, you have to >>> wait for the interrupt. Sorry about that...this was in the code I >>> tested on my embedded system but got lost in the regular kernel patch. >> OK. Thanks. >> >> If you don't mind I'm going to have some testing on this patch done for >> a bit now (mostly x86 hardware of course) to see if there's no pitfalls >> in it. It'll be a few days because of the weekend before I get back on it. >> > > Cool. I will see if I can get some more tests running over the weekend on our PXA255 > platform. First impression just came in: It seems RX performance is dropped to 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code seems to misbehave and fluctuate, dropping below 10mbit after a few netperf runs and staying there... ideas? Auke ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 15:59 ` Kok, Auke @ 2007-05-18 17:11 ` David Acker 2007-05-18 17:47 ` Kok, Auke 2007-05-21 17:35 ` Milton Miller 0 siblings, 2 replies; 51+ messages in thread From: David Acker @ 2007-05-18 17:11 UTC (permalink / raw) To: Kok, Auke Cc: Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev Kok, Auke wrote: > First impression just came in: It seems RX performance is dropped to > 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code > seems to misbehave and fluctuate, dropping below 10mbit after a few > netperf runs and staying there... > > ideas? I found the problem. Another casualty of working with two different kernels at once...arg. The blank rfd needs to have its el-bit clear now. Here is the new and improved patch. On the ARM, their is a race condition between software allocating a new receive buffer and hardware writing into a buffer. The two race on touching the last Receive Frame Descriptor (RFD). It has its el-bit set and its next link equal to 0. When hardware encounters this buffer it attempts to write data to it and then update Status Word bits and Actual Count in the RFD. At the same time software may try to clear the el-bit and set the link address to a new buffer. Since the entire RFD is once cache-line, the two write operations can collide. This can lead to the receive unit stalling or freed receive buffers getting written to. The fix is to set the el-bit on and the size to 0 on the next to last buffer in the chain. When the hardware encounters this buffer it stops and does not write to it at all. The hardware issues an RNR interrupt with the receive unit in the No Resources state. When software allocates buffers, it can update the tail of the list because it knows the hardware will stop at the buffer before it. Once it has a new next to last buffer prepared, it can clear the el-bit and set the size on the previous one. The race on this buffer is safe since the link already points to a valid next buffer. If the hardware sees the el-bit cleared without the size set, it will move on to the next buffer and complete that one in error. If it sees the size set but the el-bit still set, it will complete that buffer and then RNR interrupt and wait. Signed-off-by: David Acker <dacker@roinet.com> --- --- linux-2.6.22-rc1/drivers/net/e100.c.orig 2007-05-18 13:08:01.000000000 -0400 +++ linux-2.6.22-rc1/drivers/net/e100.c 2007-05-18 13:08:24.000000000 -0400 @@ -285,6 +285,12 @@ enum scb_status { rus_mask = 0x3C, }; +enum ru_state { + RU_SUSPENDED = 0, + RU_RUNNING = 1, + RU_UNINITIALIZED = -1, +}; + enum scb_stat_ack { stat_ack_not_ours = 0x00, stat_ack_sw_gen = 0x04, @@ -526,6 +532,7 @@ struct nic { struct rx *rx_to_use; struct rx *rx_to_clean; struct rfd blank_rfd; + enum ru_state ru_running; spinlock_t cb_lock ____cacheline_aligned; spinlock_t cmd_lock; @@ -947,7 +954,7 @@ static void e100_get_defaults(struct nic ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i)); /* Template for a freshly allocated RFD */ - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); + nic->blank_rfd.command = 0; nic->blank_rfd.rbd = 0xFFFFFFFF; nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN); @@ -1742,11 +1749,19 @@ static int e100_alloc_cbs(struct nic *ni return 0; } -static inline void e100_start_receiver(struct nic *nic) +static inline void e100_start_receiver(struct nic *nic, struct rx *rx) { - /* Start if RFA is non-NULL */ - if(nic->rx_to_clean->skb) - e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr); + if(!nic->rxs) return; + if(RU_SUSPENDED != nic->ru_running) return; + + /* handle init time starts */ + if(!rx) rx = nic->rxs; + + /* (Re)start RU if suspended or idle and RFA is non-NULL */ + if(rx->skb) { + e100_exec_cmd(nic, ruc_start, rx->dma_addr); + nic->ru_running = RU_RUNNING; + } } #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN) @@ -1769,13 +1784,12 @@ static int e100_rx_alloc_skb(struct nic } /* Link the RFD to end of RFA by linking previous RFD to - * this one, and clearing EL bit of previous. */ + * this one. We are safe to touch the previous RFD because + * it is protected the before last buffer's el bit being set */ if(rx->prev->skb) { struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; put_unaligned(cpu_to_le32(rx->dma_addr), (u32 *)&prev_rfd->link); - wmb(); - prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s); pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, sizeof(struct rfd), PCI_DMA_TODEVICE); } @@ -1801,8 +1815,12 @@ static int e100_rx_indicate(struct nic * DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status); /* If data isn't ready, nothing to indicate */ - if(unlikely(!(rfd_status & cb_complete))) + if(unlikely(!(rfd_status & cb_complete))) { + /* this allows for a fast restart without re-enabling interrupts */ + if(le16_to_cpu(rfd->command) & cb_el) + nic->ru_running = RU_SUSPENDED; return -ENODATA; + } /* Get actual data size */ actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF; @@ -1813,6 +1831,10 @@ static int e100_rx_indicate(struct nic * pci_unmap_single(nic->pdev, rx->dma_addr, RFD_BUF_LEN, PCI_DMA_FROMDEVICE); + /* this allows for a fast restart without re-enabling interrupts */ + if(le16_to_cpu(rfd->command) & cb_el) + nic->ru_running = RU_SUSPENDED; + /* Pull off the RFD and put the actual data (minus eth hdr) */ skb_reserve(skb, sizeof(struct rfd)); skb_put(skb, actual_size); @@ -1843,18 +1865,78 @@ static void e100_rx_clean(struct nic *ni unsigned int work_to_do) { struct rx *rx; + int restart_required = 0; + struct rx *rx_to_start = NULL; + struct rx *old_before_last_rx, *new_before_last_rx; + struct rfd *old_before_last_rfd, *new_before_last_rfd; + + /* are we already rnr? then pay attention!!! this ensures that + * the state machine progression never allows a start with a + * partially cleaned list, avoiding a race between hardware + * and rx_to_clean when in NAPI mode */ + if(RU_SUSPENDED == nic->ru_running) + restart_required = 1; /* Indicate newly arrived packets */ for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) { - if(e100_rx_indicate(nic, rx, work_done, work_to_do)) + int err = e100_rx_indicate(nic, rx, work_done, work_to_do); + if(-EAGAIN == err) { + /* hit quota so have more work to do, restart once + * cleanup is complete */ + restart_required = 0; + break; + } else if(-ENODATA == err) break; /* No more to clean */ } + /* save our starting point as the place we'll restart the receiver */ + if(restart_required) + rx_to_start = nic->rx_to_clean; + + old_before_last_rx = nic->rx_to_use->prev->prev; + old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data; + /* Alloc new skbs to refill list */ for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) { if(unlikely(e100_rx_alloc_skb(nic, rx))) break; /* Better luck next time (see watchdog) */ } + + new_before_last_rx = nic->rx_to_use->prev->prev; + if (new_before_last_rx != old_before_last_rx) { + /* Set the el-bit on the buffer that is before the last buffer. + * This lets us update the next pointer on the last buffer without + * worrying about hardware touching it. + * We set the size to 0 to prevent hardware from touching this buffer. + * When the hardware hits the before last buffer with el-bit and size + * of 0, it will RNR interrupt, the RUS will go into the No Resources + * state. It will not complete nor write to this buffer. */ + new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data; + new_before_last_rfd->size = 0; + new_before_last_rfd->command |= cpu_to_le16(cb_el); + pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + + /* Now that we have a new stopping point, we can clear the old + * stopping point. + * Note: There appears to be a race here where the hardware + * can complete this buffer with the el-bit set but with the + * size also set. The hardware RNR interrupts, the RUS + * goes into the No Resources state. */ + old_before_last_rfd->command &= ~cpu_to_le16(cb_el); + wmb(); + old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN); + pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + } + + if(restart_required) { + // ack the rnr? + writeb(stat_ack_rnr, &nic->csr->scb.stat_ack); + e100_start_receiver(nic, rx_to_start); + if(work_done) + (*work_done)++; + } } static void e100_rx_clean_list(struct nic *nic) @@ -1862,6 +1944,8 @@ static void e100_rx_clean_list(struct ni struct rx *rx; unsigned int i, count = nic->params.rfds.count; + nic->ru_running = RU_UNINITIALIZED; + if(nic->rxs) { for(rx = nic->rxs, i = 0; i < count; rx++, i++) { if(rx->skb) { @@ -1881,8 +1965,10 @@ static int e100_rx_alloc_list(struct nic { struct rx *rx; unsigned int i, count = nic->params.rfds.count; + struct rfd *before_last; nic->rx_to_use = nic->rx_to_clean = NULL; + nic->ru_running = RU_UNINITIALIZED; if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC))) return -ENOMEM; @@ -1896,7 +1982,22 @@ static int e100_rx_alloc_list(struct nic } } + /* Set the el-bit on the buffer that is before the last buffer. + * This lets us update the next pointer on the last buffer without + * worrying about hardware touching it. + * We set the size to 0 to prevent hardware from touching this buffer. + * When the hardware hits the before last buffer with el-bit and size + * of 0, it will RNR interrupt, the RUS will go into the No Resources + * state. It will not complete nor write to this buffer. */ + rx = nic->rxs->prev->prev; + before_last = (struct rfd *)rx->skb->data; + before_last->command |= cpu_to_le16(cb_el); + before_last->size = 0; + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + nic->rx_to_use = nic->rx_to_clean = nic->rxs; + nic->ru_running = RU_SUSPENDED; return 0; } @@ -1916,6 +2017,10 @@ static irqreturn_t e100_intr(int irq, vo /* Ack interrupt(s) */ iowrite8(stat_ack, &nic->csr->scb.stat_ack); + /* We hit Receive No Resource (RNR); restart RU after cleaning */ + if(stat_ack & stat_ack_rnr) + nic->ru_running = RU_SUSPENDED; + if(likely(netif_rx_schedule_prep(netdev))) { e100_disable_irq(nic); __netif_rx_schedule(netdev); @@ -2007,7 +2112,7 @@ static int e100_up(struct nic *nic) if((err = e100_hw_init(nic))) goto err_clean_cbs; e100_set_multicast_list(nic->netdev); - e100_start_receiver(nic); + e100_start_receiver(nic, NULL); mod_timer(&nic->watchdog, jiffies); if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED, nic->netdev->name, nic->netdev))) @@ -2088,7 +2193,7 @@ static int e100_loopback_test(struct nic mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR, BMCR_LOOPBACK); - e100_start_receiver(nic); + e100_start_receiver(nic, NULL); if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) { err = -ENOMEM; ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 17:11 ` David Acker @ 2007-05-18 17:47 ` Kok, Auke 2007-05-21 17:35 ` Milton Miller 1 sibling, 0 replies; 51+ messages in thread From: Kok, Auke @ 2007-05-18 17:47 UTC (permalink / raw) To: David Acker, Milton Miller Cc: Jeff Garzik, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev David Acker wrote: > Kok, Auke wrote: >> First impression just came in: It seems RX performance is dropped to >> 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code >> seems to misbehave and fluctuate, dropping below 10mbit after a few >> netperf runs and staying there... >> >> ideas? > > I found the problem. Another casualty of working with two different kernels at once...arg. > The blank rfd needs to have its el-bit clear now. Here is the new and improved patch. > - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); > + nic->blank_rfd.command = 0; OK, that fixed it and it's getting 94mbit/tcp rx now without issues. Milton, can you look at this patch? I'd like some more feedback. Meanwhile I will try to get this tested on a variety of e100 NICs, which will take a few days. Thanks, Auke ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-18 17:11 ` David Acker 2007-05-18 17:47 ` Kok, Auke @ 2007-05-21 17:35 ` Milton Miller 2007-05-21 17:45 ` Kok, Auke 1 sibling, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-05-21 17:35 UTC (permalink / raw) To: David Acker Cc: Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, netdev, Kok, Auke On May 18, 2007, at 12:11 PM, David Acker wrote: > Kok, Auke wrote: >> First impression just came in: It seems RX performance is dropped to >> 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code >> seems to misbehave and fluctuate, dropping below 10mbit after a few >> netperf runs and staying there... >> ideas? > > I found the problem. Another casualty of working with two different > kernels at once...arg. > The blank rfd needs to have its el-bit clear now. Here is the new and > improved patch. > > On the ARM, their is a race condition between software allocating a > new receive > buffer and hardware writing into a buffer. The two race on touching > the last > Receive Frame Descriptor (RFD). It has its el-bit set and its next > link equal > to 0. When hardware encounters this buffer it attempts to write data > to it > and then update Status Word bits and Actual Count in the RFD. At the > same time > software may try to clear the el-bit and set the link address to a new > buffer. > Since the entire RFD is once cache-line, the two write operations can > collide. This can lead to the receive unit stalling or freed receive > buffers > getting written to. > > The fix is to set the el-bit on and the size to 0 on the next to last > buffer > in the chain. When the hardware encounters this buffer it stops and > does > not write to it at all. The hardware issues an RNR interrupt with the > receive unit in the No Resources state. When software allocates > buffers, > it can update the tail of the list because it knows the hardware will > stop > at the buffer before it. Once it has a new next to last buffer > prepared, > it can clear the el-bit and set the size on the previous one. The > race on > this buffer is safe since the link already points to a valid next > buffer. > If the hardware sees the el-bit cleared without the size set, it will > move on to the next buffer and complete that one in error. If it sees > the size set but the el-bit still set, it will complete that buffer > and then RNR interrupt and wait. > > > Signed-off-by: David Acker <dacker@roinet.com> > This patch doesn't apply. It appears white space damaged somewhere: (1) blank lines in diff are empty not <space> (2) unchanged lines starting with tab are <space><space><tab> After fixing the above I still get: patching file drivers/net/e100.c Hunk #1 FAILED at 285. Hunk #4 FAILED at 1749. Hunk #8 FAILED at 1865. Hunk #10 succeeded at 1965 with fuzz 1. Hunk #11 succeeded at 1982 with fuzz 1. 3 out of 14 hunks FAILED -- saving rejects to file drivers/net/e100.c.rej although I haven't figured out what is wrong. Proceeding with the review: Coding style: (1) if body on seperate line. (2) space after if before ( (3) The other enums in this driver are not ALL_CAPS (4) This driver doesn't do CONSTANT != value but value != enum (see nic->mac for examples) I don't understand why the old logic in start_receiver doesn't work. When would the rx passed not be rx_to_clean ? The driver sets EL (end-of-list) to stop the receiver. I was trying to determine if this is the right decision or we should use suspend bit. I think the EL is the right choice because it will count discarded frames while the card waits to be refilled, it is not obvious that those counters count in the suspended state. Also, we want to use the RU_START not RU_RESUME so that it reads the RFD again with the size (as opposed to just the S bit) However, we should not call this state RU_SUSPENDED because that is a different state in the hw. I also see that the driver sets the size from 0 back to frame size. While there is a wmb() between the removal of EL and the restore of the frame size, there is only one pci_sync for both of them. Since the size is in a separate word, depending on skb alignment it may be in a different cache line and we may end up with all orderings being visible to the device. This patch adds a lot of code that checks if the next RFD has EL set and assumes that if it sees the next frame to clean has EL set then the device will have seen it and entered the stopped state. In theory, the device might be held off on the PCI bus and not have yet read the descriptor and stopped. In addition, while it anticipates the RNR interrupt and restarts the receiver from the clean process, it doesn't clear the pending interrupt and the cpu will still take the interrupt, although it will read the status and not actually restart the RU. It would seem simpler to me to just check the device status on the way out after refilling and restoring the descriptor. If the device has stopped, rx_to_clean will still point to the no-longer-size-0 not EL buffer and will be the correct restart place. Also, I think this exposes us to a window with coherent hardware where the device will see EL cleared but size 0 and will skip completing the frame but continue on to the next RFD. The driver will then issue a start command while the RU is active, which is prohibited. This window also means that the device may have skipped this previously 0-length descriptor and gone on to the next one, so we will stick waiting for the device to write to this descriptor while it went on to the next one. The description of how we use the EL bit and size 0 should go in the big comment at the top describing the operation of the driver. Also, the description of size 0 should be separated from the description of EL being set, as those are independent. milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-21 17:35 ` Milton Miller @ 2007-05-21 17:45 ` Kok, Auke 2007-05-22 16:51 ` Milton Miller 0 siblings, 1 reply; 51+ messages in thread From: Kok, Auke @ 2007-05-21 17:45 UTC (permalink / raw) To: Milton Miller Cc: David Acker, Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, netdev Milton Miller wrote: > On May 18, 2007, at 12:11 PM, David Acker wrote: > >> Kok, Auke wrote: >>> First impression just came in: It seems RX performance is dropped to >>> 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code >>> seems to misbehave and fluctuate, dropping below 10mbit after a few >>> netperf runs and staying there... >>> ideas? >> I found the problem. Another casualty of working with two different >> kernels at once...arg. >> The blank rfd needs to have its el-bit clear now. Here is the new and >> improved patch. >> >> On the ARM, their is a race condition between software allocating a >> new receive >> buffer and hardware writing into a buffer. The two race on touching >> the last >> Receive Frame Descriptor (RFD). It has its el-bit set and its next >> link equal >> to 0. When hardware encounters this buffer it attempts to write data >> to it >> and then update Status Word bits and Actual Count in the RFD. At the >> same time >> software may try to clear the el-bit and set the link address to a new >> buffer. >> Since the entire RFD is once cache-line, the two write operations can >> collide. This can lead to the receive unit stalling or freed receive >> buffers >> getting written to. >> >> The fix is to set the el-bit on and the size to 0 on the next to last >> buffer >> in the chain. When the hardware encounters this buffer it stops and >> does >> not write to it at all. The hardware issues an RNR interrupt with the >> receive unit in the No Resources state. When software allocates >> buffers, >> it can update the tail of the list because it knows the hardware will >> stop >> at the buffer before it. Once it has a new next to last buffer >> prepared, >> it can clear the el-bit and set the size on the previous one. The >> race on >> this buffer is safe since the link already points to a valid next >> buffer. >> If the hardware sees the el-bit cleared without the size set, it will >> move on to the next buffer and complete that one in error. If it sees >> the size set but the el-bit still set, it will complete that buffer >> and then RNR interrupt and wait. >> >> >> Signed-off-by: David Acker <dacker@roinet.com> >> > > > This patch doesn't apply. It appears white space damaged somewhere: > > (1) blank lines in diff are empty not <space> > (2) unchanged lines starting with tab are <space><space><tab> > > After fixing the above I still get: > > patching file drivers/net/e100.c > Hunk #1 FAILED at 285. > Hunk #4 FAILED at 1749. > Hunk #8 FAILED at 1865. > Hunk #10 succeeded at 1965 with fuzz 1. > Hunk #11 succeeded at 1982 with fuzz 1. > 3 out of 14 hunks FAILED -- saving rejects to file > drivers/net/e100.c.rej > > although I haven't figured out what is wrong. > > Proceeding with the review: > > Coding style: > (1) if body on seperate line. > (2) space after if before ( > (3) The other enums in this driver are not ALL_CAPS > (4) This driver doesn't do CONSTANT != value but value != enum > (see nic->mac for examples) I sent Milton my copy of this patch which has these style issues corrected and applies cleanly to a recent git tree. If anyone else specifically wants a copy let me know. Auke ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-21 17:45 ` Kok, Auke @ 2007-05-22 16:51 ` Milton Miller 2007-05-22 22:07 ` David Acker 0 siblings, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-05-22 16:51 UTC (permalink / raw) To: Kok, Auke Cc: Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, netdev, David Acker On May 21, 2007, at 12:45 PM, Kok, Auke wrote: > Milton Miller wrote: >> On May 18, 2007, at 12:11 PM, David Acker wrote: >>> Kok, Auke wrote: >>>> First impression just came in: It seems RX performance is dropped >>>> to 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new >>>> code seems to misbehave and fluctuate, dropping below 10mbit after >>>> a few netperf runs and staying there... >>>> ideas? >>> I found the problem. Another casualty of working with two different >>> kernels at once...arg. >>> The blank rfd needs to have its el-bit clear now. Here is the new >>> and improved patch. ... >> Proceeding with the review: >> Coding style: >> (1) if body on seperate line. >> (2) space after if before ( >> (3) The other enums in this driver are not ALL_CAPS >> (4) This driver doesn't do CONSTANT != value but value != enum >> (see nic->mac for examples) > > I sent Milton my copy of this patch which has these style issues > corrected and > applies cleanly to a recent git tree. If anyone else specifically > wants a copy > let me know. > > Auke It addressed 1 and 2, and applies, but did not address 3 and 4. But the bigger point is it didn't address the holes I identified. I think we need to change the logic to reclaim the size from 0 only if we are restarting, and make rx_indicate look ahead to rx->next if it encounters a !EL size 0 buffer. Without this we are doing a "prohibited" rx_start to a running machine. The device can still see this size 0 !EL state. Also we will get stuck when the device finds the window between the two writes. We can remove some register pressure by finding old_before_last_rfd when we are ready to use it, just comparing old_before_last_rx to new. Also, as I pointed out, the rx_to_start change to start_reciever is compicated and unnecessary, as rx_to_clean can always be used and it was the starting point before the changes. As far as the RU_SUSPENDED, I don't think we need it, instead we should poll the device. Here is my proposal: rx_indicate can stop when it hits the packet with EL. If it hits a packet with size 0, look ahead to rx->next to see if it is complete, if so complete this one otherwise leave it as next to clean. After the rx_indicate loop, try to allocate more skbs. If we are successful, then fixup the before-next as we do now. Then check the device status for RNR, and if its stopped then set rx_to_clean rfd size back to ETH_LEN and restart the reciever. This does have a small hole: if we only add one packet at a time we will end up with all size 0 descriptors in the lopp. We can detect that and not remove EL from the old before-next unless we are restarting. That would imply moving the status poll before we allocate the list. milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-22 16:51 ` Milton Miller @ 2007-05-22 22:07 ` David Acker 2007-05-23 14:02 ` Milton Miller 0 siblings, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-22 22:07 UTC (permalink / raw) To: Milton Miller Cc: Kok, Auke, Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, netdev Milton Miller wrote: >>> Proceeding with the review: >>> Coding style: >>> (1) if body on seperate line. >>> (2) space after if before ( >>> (3) The other enums in this driver are not ALL_CAPS >>> (4) This driver doesn't do CONSTANT != value but value != enum >>> (see nic->mac for examples) >> >> I sent Milton my copy of this patch which has these style issues >> corrected and >> applies cleanly to a recent git tree. If anyone else specifically >> wants a copy >> let me know. >> >> Auke > > It addressed 1 and 2, and applies, but did not address 3 and 4. > Sorry about the style bugs. I will be more careful about that next time. Many of the issues you bring have been in the e100 for some time. If you ignore the s-bit patch, I basically did the the following: moved the el-bit to before the last buffer so that the last buffer was protected during chaining. set the el-bit buffer (next-to-last) size to 0 so that it is not written to while we are writing to it. This seemed the only way to protect the buffers we are changing while having code to stay ahead of the hardware. (3) The driver had these all caps constants before my patch. They went away with the s-bit patch. I just put them back. I agree they stick out but I wanted to leave style changes out of a bug fix patch. I agree about the name of the constant, RU_SUSPENDED. It is not accurate and I had a patch to fix this when I experimented with using the S-bit. > > But the bigger point is it didn't address the holes I identified. > I also see that the driver sets the size from 0 back to frame > size. While there is a wmb() between the removal of EL and the > restore of the frame size, there is only one pci_sync for both > of them. Since the size is in a separate word, depending on skb > alignment it may be in a different cache line and we may end up > with all orderings being visible to the device. Hmmm...interesting. The start of the data of the skb may not be aligned on a cache line. Ok DMA experts...what happens when you sync across cache lines? It should dump both of them, right? I guess the orderings could vary...hence why I saw completions with size set but the el-bit still set. So perhaps the skb alloc needs to be aligned by cache line? This issue existed before my patches as well. > This patch adds a lot of code that checks if the next RFD has > EL set and assumes that if it sees the next frame to clean has > EL set then the device will have seen it and entered the stopped > state. In theory, the device might be held off on the PCI bus > and not have yet read the descriptor and stopped. In addition, > while it anticipates the RNR interrupt and restarts the receiver > from the clean process, it doesn't clear the pending interrupt > and the cpu will still take the interrupt, although it will read > the status and not actually restart the RU. So the device is just before the el-bit buffer and stops. We get a poll call with interrupts disabled and see the el-bit buffer and decide we need to restart. First we alloc new buffers and move the el-bit/size 0 down. The restart occurs on the next poll (interrupts are not turned on if we did any work). In theory, the hardware could have read the buffer that no longer has a size of 0 and tried to use it when we hit it with the start. I am not sure how the hardware handles this. Perhaps it is ignored and thus harmless. We could poll the hardware to check before actually sending the start...it adds an extra pci operation but would avoid a command that is illegal in the manual. If we ignore a buffer with the el-bit set but without the complete bit, we must wait for the hardware to RNR interrupt. I have found that when a buffer has both the el-bit set and the size set to 0, it will not set the complete bit on that buffer. This was part of the reason why my first patch had hardware just using the list of buffers given all the way up until the end. > I think we need to change the logic to reclaim the size from 0 > only if we are restarting, and make rx_indicate look ahead to > rx->next if it encounters a !EL size 0 buffer. Without this we > are doing a "prohibited" rx_start to a running machine. The hardware is only restarted when we enter the RU_SUSPENDED state. This only happens when we: 1) get an RNR interrupt 2) get an EL-bit buffer without a completion 3) get an el-bit buffer with a completion (hardware saw size set but not el-bit clear) State 2) seems to be the problem. Get rid of that and we wait for an interrupt to tell us it saw the buffer in most cases. Of course, then we always wait. If we do not reclaim the size from 0, but clear the el-bit, that buffer will always return an error. > The > device can still see this size 0 !EL state. Also we will get > stuck when the device finds the window between the two writes. > This window also means that the device may have skipped this > previously 0-length descriptor and gone on to the next one, > so we will stick waiting for the device to write to this > descriptor while it went on to the next one. If the hardware sees both size 0 and EL, it RNR interrupts, does not write to the buffer at all and stops. If the hardware sees size 0 and !EL, (which I never saw in testing but could happen) it would return the buffer with an error and continue. This is ok since we already have written a new size 0 and EL. Our polling code would just find a packet completed in error. If the hardware sees size set and EL, it RNR interrupts, and does write data and status to the buffer, and then stops. I have see this one occur and it isn't a problem. If hardware sees size set and !EL, it writes data and continues. > We can remove some register pressure by finding old_before_last_rfd > when we are ready to use it, just comparing old_before_last_rx > to new. Sure. > > Also, as I pointed out, the rx_to_start change to start_reciever > is compicated and unnecessary, as rx_to_clean can always be used > and it was the starting point before the changes. This was in the driver before my patch and the s-bit patch. I agree that it could go away as rx_to_clean should not be capable of changing. > > As far as the RU_SUSPENDED, I don't think we need it, instead > we should poll the device. RU_SUSPENDED is pretty much used as it was before all the patches. The only waste I found is that if we find the EL-bit while interrupts are disabled (say during a NAPI poll), we set RU_SUSPENDED and then have to wait for the next poll to figure out that we need to restart the receiver. > > Here is my proposal: > rx_indicate can stop when it hits the packet with EL. If it > hits a packet with size 0, look ahead to rx->next to see if > it is complete, if so complete this one otherwise leave it > as next to clean. After the rx_indicate loop, try to allocate > more skbs. If we are successful, then fixup the before-next > as we do now. Then check the device status for RNR, and if > its stopped then set rx_to_clean rfd size back to ETH_LEN > and restart the reciever. > > This does have a small hole: if we only add one packet at > a time we will end up with all size 0 descriptors in the > lopp. We can detect that and not remove EL from the old > before-next unless we are restarting. That would imply > moving the status poll before we allocate the list. Can you send a patch or pseudo-code? It would help to see some main parts: 1. blank_rfd setup 2. single buffer alloc 3. el-bit handling (when do we set it and when do we clear it) 4. size handling (when do we set it and when do we clear it) What if we just fixed the alignment of rfd to make sure it is cache aligned? Then we know our syncs will be a single cache flush. -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-22 22:07 ` David Acker @ 2007-05-23 14:02 ` Milton Miller 2007-05-23 21:32 ` David Acker 2007-05-24 4:13 ` Milton Miller 0 siblings, 2 replies; 51+ messages in thread From: Milton Miller @ 2007-05-23 14:02 UTC (permalink / raw) To: David Acker Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, netdev, Scott Feldman I tried to remove anything we were in agreement with. On May 22, 2007, at 5:07 PM, David Acker wrote: > Milton Miller wrote: > Many of the issues you bring have been in the e100 for some time. If > you ignore the s-bit patch, I basically did the the following: > moved the el-bit to before the last buffer so that the last buffer was > protected during chaining. > set the el-bit buffer (next-to-last) size to 0 so that it is not > written to while we are writing to it. > > This seemed the only way to protect the buffers we are changing while > having code to stay ahead of the hardware. I agree with this part of the approach. I just think we need a bit more work on the "what to do when we are ready for hardware to not stop" part. >> But the bigger point is it didn't address the holes I identified. >> >> I also see that the driver sets the size from 0 back to frame >> size. While there is a wmb() between the removal of EL and the >> restore of the frame size, there is only one pci_sync for both >> of them. Since the size is in a separate word, depending on skb >> alignment it may be in a different cache line and we may end up >> with all orderings being visible to the device. > > Hmmm...interesting. The start of the data of the skb may not be > aligned on a cache line. Ok DMA experts...what happens when you sync > across cache lines? It should dump both of them, right? I guess the > orderings could vary...hence why I saw completions with size set but > the el-bit still set. So perhaps the skb alloc needs to be aligned by > cache line? This issue existed before my patches as well. The sync is required to push both cache lines, but there is no ordering guarantee. This probably is why you saw size and el set. Aligning the RFD to a cache line conflicts with aligning the payload (IP header and data) to a word boundary, and depending on cache line size it may be impossible to do both. And it won't fix the hole for coherent dma machines. It wasn't an issue before because we never set two fields. >> This patch adds a lot of code that checks if the next RFD has >> EL set and assumes that if it sees the next frame to clean has >> EL set then the device will have seen it and entered the stopped >> state. In theory, the device might be held off on the PCI bus >> and not have yet read the descriptor and stopped. In addition, >> while it anticipates the RNR interrupt and restarts the receiver >> from the clean process, it doesn't clear the pending interrupt >> and the cpu will still take the interrupt, although it will read >> the status and not actually restart the RU. > > So the device is just before the el-bit buffer and stops. We get a > poll call with interrupts disabled and see the el-bit buffer and > decide we need to restart. First we alloc new buffers and move the > el-bit/size 0 down. The restart occurs on the next poll (interrupts > are not turned on if we did any work). In theory, the hardware could > have read the buffer that no longer has a size of 0 and tried to use > it when we hit it with the start. I am not sure how the hardware > handles this. Perhaps it is ignored and thus harmless. We could poll > the hardware to check before actually sending the start...it adds an > extra pci operation but would avoid a command that is illegal in the > manual. > > If we ignore a buffer with the el-bit set but without the complete > bit, we must wait for the hardware to RNR interrupt. I have found > that when a buffer has both the el-bit set and the size set to 0, it > will not set the complete bit on that buffer. This was part of the > reason why my first patch had hardware just using the list of buffers > given all the way up until the end. My current reading of the manual is that the C bit will not be set on an RFD that is size 0. It goes on to processes EL and S, and decides to stop and interrupt RNR or suspend, or just go to the next packet. >> I think we need to change the logic to reclaim the size from 0 >> only if we are restarting, and make rx_indicate look ahead to >> rx->next if it encounters a !EL size 0 buffer. Without this we >> are doing a "prohibited" rx_start to a running machine. > > The hardware is only restarted when we enter the RU_SUSPENDED state. > This only happens when we: > 1) get an RNR interrupt > 2) get an EL-bit buffer without a completion > 3) get an el-bit buffer with a completion (hardware saw size set but > not el-bit clear) > > State 2) seems to be the problem. Get rid of that and we wait for an > interrupt to tell us it saw the buffer in most cases. Of course, then > we always wait. If we poll the device, I think it will be rare that it is not stopped when we see the EL bit bit it has not yet read the next RFD. But it is a race and I think a device read is worth avoiding something prohibited in the manual. > If we do not reclaim the size from 0, but clear the el-bit, that > buffer will always return an error. I disagree. If the size is 0, it is a special case and no error will be generated, instead the RU will just process EL and S then proceed to the next buffer. If the size was smaller than the incoming packet but not 0 then it would be completed in error. It does make are "did the device go past" logic harder. If the buffer ever was 0, then we need to see if the next never-zero buffer has completed, and reclaim all the buffers in between. >> The device can still see this size 0 !EL state. Also we will >> get stuck when the device finds the window between the two >> writes. > >> This window also means that the device may have skipped this >> previously 0-length descriptor and gone on to the next one, >> so we will stick waiting for the device to write to this >> descriptor while it went on to the next one. > > If the hardware sees both size 0 and EL, it RNR interrupts, does not > write to the buffer at all and stops. > > If the hardware sees size 0 and !EL, (which I never saw in testing but > could happen) it would return the buffer with an error and continue. > This is ok since we already have written a new size 0 and EL. Our > polling code would just find a packet completed in error. [A bit repetitious since we are replying to two threads] Again, I think there is no completion, error or not, and our clean will get stuck while the hardware goes on. > If the hardware sees size set and EL, it RNR interrupts, and does > write data and status to the buffer, and then stops. I have see this > one occur and it isn't a problem. > > If hardware sees size set and !EL, it writes data and continues. > > >> As far as the RU_SUSPENDED, I don't think we need it, instead >> we should poll the device. > RU_SUSPENDED is pretty much used as it was before all the patches. The > only waste I found is that if we find the EL-bit while interrupts are > disabled (say during a NAPI poll), we set RU_SUSPENDED and then have > to wait for the next poll to figure out that we need to restart the > receiver. I have no problem with reading the device when we think it may have hit EL and restarting on this poll. I even prefer it. If we do restart, we should take the time to ack the RNR interrupt so we don't process it again. >> Here is my proposal: >> rx_indicate can stop when it hits the packet with EL. If it >> hits a packet with size 0, look ahead to rx->next to see if >> it is complete, if so complete this one otherwise leave it >> as next to clean. After the rx_indicate loop, try to allocate >> more skbs. If we are successful, then fixup the before-next >> as we do now. Then check the device status for RNR, and if >> its stopped then set rx_to_clean rfd size back to ETH_LEN >> and restart the reciever. >> This does have a small hole: if we only add one packet at >> a time we will end up with all size 0 descriptors in the >> lopp. We can detect that and not remove EL from the old >> before-next unless we are restarting. That would imply >> moving the status poll before we allocate the list. > > Can you send a patch or pseudo-code? It would help to see some main > parts: > 1. blank_rfd setup > 2. single buffer alloc > 3. el-bit handling (when do we set it and when do we clear it) > 4. size handling (when do we set it and when do we clear it) I was afraid you were going to ask for code :-) I'll try to find some time to write some. I'll need to look for an adapter to do any testing. They exist in some of the machines at work. > What if we just fixed the alignment of rfd to make sure it is cache > aligned? Then we know our syncs will be a single cache flush. That doesn't help in the coherent dma case; the device can still read in between, and we need to handle the !EL size 0 case. Actually, I thought about it some more after yesterdays note. We can set the size !0 after making sure the device sees EL (wmb and pci_sync_for_device), as long as we remember this buffer used to be size 0 when doing RX clean, and look ahead for another complete packet as long as the size was ever 0. My proposal was just to use "size is still 0" as that flag, wasting the space allocated for that skb. Another aproach is to always allocate a rfd-only packet at the beginning of the loop, and inserting th ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-23 14:02 ` Milton Miller @ 2007-05-23 21:32 ` David Acker 2007-05-24 5:26 ` Milton Miller 2007-05-24 4:13 ` Milton Miller 1 sibling, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-23 21:32 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, netdev, Scott Feldman Milton Miller wrote: > I agree with this part of the approach. I just think we need > a bit more work on the "what to do when we are ready for > hardware to not stop" part. Agreed. > > The sync is required to push both cache lines, but there is no > ordering guarantee. This probably is why you saw size and el > set. Aligning the RFD to a cache line conflicts with aligning > the payload (IP header and data) to a word boundary, and > depending on cache line size it may be impossible to do both. > > And it won't fix the hole for coherent dma machines. Yep. > It wasn't an issue before because we never set two fields. Before the driver was setting link and clearing the el-bit on the last buffer. This resulted in badness when hardware saw the el-bit cleared but link set to 0. > My current reading of the manual is that the C bit will not be > set on an RFD that is size 0. It goes on to processes EL and > S, and decides to stop and interrupt RNR or suspend, or just > go to the next packet. I double checked this with a quick experiment and it appears you are correct. What about if we always did the following: set the size: sync(); clear el-bit sync() Then if the hardware sees just the size set, the packet completes but with the el-bit and we know we need to restart since it completed. If it sees the size == 0, and the el bit set, it stops and RNR interrupts. When we find a buffer that is not completed but has the el-bit set, we read the status byte of the status control block to see if the RU is in the no resources state. If it isn't, it means that we found that buffer before the hardware did and thus need to wait for it. We will either find it on the next poll or enable interrupts and get told about it by hardware. What do you think? -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-23 21:32 ` David Acker @ 2007-05-24 5:26 ` Milton Miller 2007-05-24 11:21 ` Milton Miller 2007-05-24 12:44 ` David Acker 0 siblings, 2 replies; 51+ messages in thread From: Milton Miller @ 2007-05-24 5:26 UTC (permalink / raw) To: David Acker Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik On May 23, 2007, at 4:32 PM, David Acker wrote: > Milton Miller wrote: >> My current reading of the manual is that the C bit will not be >> set on an RFD that is size 0. It goes on to processes EL and >> S, and decides to stop and interrupt RNR or suspend, or just >> go to the next packet. > I double checked this with a quick experiment and it appears you are > correct. > > What about if we always did the following: > set the size: > sync(); > clear el-bit > sync() > > Then if the hardware sees just the size set, the packet completes but > with the el-bit and we know we need to restart since it completed. > If it sees the size == 0, and the el bit set, it stops and RNR > interrupts. I think this is exposed to a hole and a race: we don't know if the hardware read the RFD before we set the size or after, just that it was before the EL bit was cleared. If it read it before the size was set, then it will not set the C bit. If it reads it after the size is set, it will complete it. For coherent DMA we can always observe the C bit. But for the incoherent DMA case, our store to clear the EL bit may overwrite the dma from the device to the beginning of the packet, or the write to EOF, F, and size, and/or the write to C, OK, and status bits to tell us its done. In the worst case, we would overwrite the beginning of the data but catch the C bit and even the actual size, and therefore would receive corrupted data. We can only detect the hardware went RNR when it does so or decide we won the race when it receives and completes the next frame. > When we find a buffer that is not completed but has the el-bit set, we > read the status byte of the status control block to see if the RU is > in the no resources state. If it isn't, it means that we found that > buffer before the hardware did and thus need to wait for it. We will > either find it on the next poll or enable interrupts and get told > about it by hardware. > > What do you think? I think the second part is good ... Ok here's my just-before-dinner brainstorming, as relayed after dinner: We add two flags to struct rx: one says this packet is EL, and one says it is or was size 0. We create a function, find_mark_el(struct nic, is_running) that is called after initial alloc and/or after refilling skb list. In find_mark_el, we start with rx_to_use (lets rename this rx_to_fill), and go back two entries, call this rx_to_mark. If is_running and rx_to_mark->prev->el then just return, leave EL alone. Otherwise, set EL and size to 0, set the two flags in the rx struct, sync the RFD, then search for the current EL (we could save the pointer, but its always the odl rx_to_use (fill) or its ->prev). Clear RFD->EL, sync, clear rx->el. Set size non-zero, sync, Leave the was_0 flag set if is_running (clear it only if we know reciever is stopped). At this point, if the receiver was stopped we can restart it,. We should do so in the caller. (We always restart at rx_to_clean). Restart should ack the RNR status before issuing the ru_start command to avoid a spurious RNR interrupt. In the receive loop, if RFD->C is not set, rx->was_0 is set and el is not set, then we need to check rx->next->RFD->C bit (after pci_sync_for_cpu). If the next packet C bit is set then we consider this skb as used, and just complete it with no data to the stack. Because find_mark_el will only advance EL if the receiver was stopped or we had more than 1 packet added, we can guarantee that if packet N has was_0 set, then packet N+1 will not have it set, so we have bounded lookahead. This adds two flags to struct rx, but that is allocated as a single array and the array size is filled out as forward and back pointers. Rx clean only has to look at was_0 on the last packet when it decides C is not set, and only if both are set does it peek at the next packet. In other words, we shouldn't worry about the added flags making it a non-power-of-2 size. By setting size != 0 after we have modified all other fields, the device will only write to this packet if we are done writing. If we loose the race and only partially complete, it will just go on to the next packet and we find it. If we totally loose, then the receiver will go RNR and we can reclaim all the buffer space we have allocated. Comments? Questions? We need to enforce a minimum rx ring size (3? 4?). We rely on other mechanisms to guarantee the RFD in this skb will not cache line conflict with the data in antoher scb (slab allocs of the skb should give us this). milton ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-24 5:26 ` Milton Miller @ 2007-05-24 11:21 ` Milton Miller 2007-05-24 12:51 ` David Acker 2007-05-29 15:58 ` David Acker 2007-05-24 12:44 ` David Acker 1 sibling, 2 replies; 51+ messages in thread From: Milton Miller @ 2007-05-24 11:21 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, e1000-devel, David Acker, netdev, Jesse Brandeburg, Kok, Auke, Jeff Kirsher, Scott Feldman, John Ronciak Some further thoughts ... On May 24, 2007, at 12:26 AM, Milton Miller wrote: > On May 23, 2007, at 4:32 PM, David Acker wrote: >> Milton Miller wrote: >>> My current reading of the manual is that the C bit will not be >>> set on an RFD that is size 0. It goes on to processes EL and >>> S, and decides to stop and interrupt RNR or suspend, or just >>> go to the next packet. >> I double checked this with a quick experiment and it appears you are >> correct. >> When we find a buffer that is not completed but has the el-bit set, >> we read the status byte of the status control block to see if the RU >> is in the no resources state. If it isn't, it means that we found >> that buffer before the hardware did and thus need to wait for it. >> We will either find it on the next poll or enable interrupts and get >> told about it by hardware. >> >> What do you think? > > I think the second part is good ... > > Ok here's my just-before-dinner brainstorming, as relayed after dinner: > > We add two flags to struct rx: one says this packet is EL, and one > says > it is or was size 0. We create a function, find_mark_el(struct nic, > is_running) that is called after initial alloc and/or after refilling > skb list. In find_mark_el, we start with rx_to_use (lets rename this > rx_to_fill), and go back two entries, call this rx_to_mark. If > is_running and rx_to_mark->prev->el then just return, leave EL alone. > Otherwise, set EL and size to 0, set the two flags in the rx struct, > sync the RFD, then search for the current EL (we could save the > pointer, > but its always the odl rx_to_use (fill) or its ->prev). Clear RFD->EL, > sync, clear rx->el. Set size non-zero, sync, Leave the was_0 flag > set > if is_running (clear it only if we know reciever is stopped). > > At this point, if the receiver was stopped we can restart it,. We > should do so in the caller. (We always restart at rx_to_clean). > Restart should ack the RNR status before issuing the ru_start command > to > avoid a spurious RNR interrupt. > > In the receive loop, if RFD->C is not set, rx->was_0 is set and el > is not set, then we need to check rx->next->RFD->C bit (after > pci_sync_for_cpu). If the next packet C bit is set then we consider > this skb as used, and just complete it with no data to the stack. > > Because find_mark_el will only advance EL if the receiver was stopped > or we had more than 1 packet added, we can guarantee that if packet > N has was_0 set, then packet N+1 will not have it set, so we have > bounded lookahead. > > This adds two flags to struct rx, but that is allocated as a single > array and the array size is filled out as forward and back pointers. > Rx clean only has to look at was_0 on the last packet when it > decides C is not set, and only if both are set does it peek at the > next packet. In other words, we shouldn't worry about the added > flags making it a non-power-of-2 size. > > By setting size != 0 after we have modified all other fields, the > device will only write to this packet if we are done writing. If > we loose the race and only partially complete, it will just go on > to the next packet and we find it. If we totally loose, then the > receiver will go RNR and we can reclaim all the buffer space we > have allocated. > > Comments? Questions? > > We need to enforce a minimum rx ring size (3? 4?). > > We rely on other mechanisms to guarantee the RFD in this skb > will not cache line conflict with the data in antoher scb > (slab allocs of the skb should give us this). Copying EL to a flag in rx is only to avoid additional reading of the rfd while it might be under dma. We do need the was_0 flag. Do we need to continue with the stop-before-last plan? As long as we set size to 0 with EL, we shoud be able to change the link, sync, set size 0, sync, and then set size. We still need the "advance at least 2 if not stopped" check when deciding to move the EL. This would break if the hardware in the dma path can read the multiple cache lines of the RFD out of order, so it may not be safe (if some pci host decided to prefetch data, and got the second line ahead of the first and didn't discard prefetch until pci bus disconnect). Actually, I am afraid I know hardware that might do that. [defer] Currently we handle failed allocs by doing a sw interrupt in the watchdog. Since we fail ifup if we can't alloc rxs, we can always start the reciever, even if we didn't alloc a new packet -- it will just read the RFD and go RNR again. We could make this sw interrupt conditional on rx_to_clean->el being set. However, looking further, it appears this 2s watchdog induced watchdog also covers makeing sure that we reattempt netif_schedule_prep_rx when it fails in e100_intr. Any change in this area should be in a seperate patch, and probably delayed at least one release. I also note that netpoll_controller calls e100_intr, which will call into the netif_rx_schedule only when a device interrupt is active. milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-24 11:21 ` Milton Miller @ 2007-05-24 12:51 ` David Acker 2007-05-24 14:25 ` Milton Miller 2007-05-29 15:58 ` David Acker 1 sibling, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-24 12:51 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, e1000-devel, netdev, Jesse Brandeburg, Kok, Auke, Jeff Kirsher, Scott Feldman, John Ronciak Milton Miller wrote: >> Ok here's my just-before-dinner brainstorming, as relayed after dinner: >> >> We add two flags to struct rx: one says this packet is EL, and one says >> it is or was size 0. We create a function, find_mark_el(struct nic, >> is_running) that is called after initial alloc and/or after refilling >> skb list. In find_mark_el, we start with rx_to_use (lets rename this >> rx_to_fill), and go back two entries, call this rx_to_mark. If >> is_running and rx_to_mark->prev->el then just return, leave EL alone. >> Otherwise, set EL and size to 0, set the two flags in the rx struct, >> sync the RFD, then search for the current EL (we could save the pointer, >> but its always the odl rx_to_use (fill) or its ->prev). Clear RFD->EL, >> sync, clear rx->el. Set size non-zero, sync, Leave the was_0 flag set >> if is_running (clear it only if we know reciever is stopped). >> >> At this point, if the receiver was stopped we can restart it,. We >> should do so in the caller. (We always restart at rx_to_clean). >> Restart should ack the RNR status before issuing the ru_start command to >> avoid a spurious RNR interrupt. >> >> In the receive loop, if RFD->C is not set, rx->was_0 is set and el >> is not set, then we need to check rx->next->RFD->C bit (after >> pci_sync_for_cpu). If the next packet C bit is set then we consider >> this skb as used, and just complete it with no data to the stack. >> >> Because find_mark_el will only advance EL if the receiver was stopped >> or we had more than 1 packet added, we can guarantee that if packet >> N has was_0 set, then packet N+1 will not have it set, so we have >> bounded lookahead. >> >> This adds two flags to struct rx, but that is allocated as a single >> array and the array size is filled out as forward and back pointers. >> Rx clean only has to look at was_0 on the last packet when it >> decides C is not set, and only if both are set does it peek at the >> next packet. In other words, we shouldn't worry about the added >> flags making it a non-power-of-2 size. >> >> By setting size != 0 after we have modified all other fields, the >> device will only write to this packet if we are done writing. If >> we loose the race and only partially complete, it will just go on >> to the next packet and we find it. If we totally loose, then the >> receiver will go RNR and we can reclaim all the buffer space we >> have allocated. >> >> Comments? Questions? This sounds pretty reasonable. I will take a stab at coding this up today; I always think better looking at code. >> We need to enforce a minimum rx ring size (3? 4?). The code currently limits ethtool -G ethX rx calls to 16. >> We rely on other mechanisms to guarantee the RFD in this skb >> will not cache line conflict with the data in antoher scb >> (slab allocs of the skb should give us this). Yep. > Copying EL to a flag in rx is only to avoid additional > reading of the rfd while it might be under dma. We do > need the was_0 flag. > > Do we need to continue with the stop-before-last plan? As > long as we set size to 0 with EL, we shoud be able to change > the link, sync, set size 0, sync, and then set size. Perhaps not. I can take a try at coding it without it. It would certainly make the driver easier to follow. > We still need the "advance at least 2 if not stopped" check when > deciding to move the EL. This would break if the hardware > in the dma path can read the multiple cache lines of the > RFD out of order, so it may not be safe (if some pci host > decided to prefetch data, and got the second line ahead of > the first and didn't discard prefetch until pci bus > disconnect). Actually, I am afraid I know hardware that > might do that. Hmm, me too. > > [defer] > Currently we handle failed allocs by doing a sw interrupt > in the watchdog. Since we fail ifup if we can't alloc > rxs, we can always start the reciever, even if we didn't > alloc a new packet -- it will just read the RFD and go RNR > again. We could make this sw interrupt conditional on > rx_to_clean->el being set. However, looking further, it > appears this 2s watchdog induced watchdog also covers > makeing sure that we reattempt netif_schedule_prep_rx when > it fails in e100_intr. Any change in this area should be > in a seperate patch, and probably delayed at least one > release. I also note that netpoll_controller calls > e100_intr, which will call into the netif_rx_schedule > only when a device interrupt is active. Agreed. I will get back when I have done some experiments with these ideas. Thanks for the replies! -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-24 12:51 ` David Acker @ 2007-05-24 14:25 ` Milton Miller 0 siblings, 0 replies; 51+ messages in thread From: Milton Miller @ 2007-05-24 14:25 UTC (permalink / raw) To: David Acker Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev On May 24, 2007, at 7:51 AM, David Acker wrote: > Milton Miller wrote: >> Comments? Questions? > This sounds pretty reasonable. I will take a stab at coding this up > today; I always think better looking at code. Thanks. By the way, find_mark_el should probably get passed the old fill point. The EL it will be clearing will either be on that rx or the previous one (or there is none for the initial alloc_list). >> Do we need to continue with the stop-before-last plan? As >> long as we set size to 0 with EL, we shoud be able to change >> the link, sync, set size 0, sync, and then set size. > Perhaps not. I can take a try at coding it without it. It would > certainly make the driver easier to follow. > >> We still need the "advance at least 2 if not stopped" check when >> deciding to move the EL. This would break if the hardware >> in the dma path can read the multiple cache lines of the >> RFD out of order, so it may not be safe (if some pci host >> decided to prefetch data, and got the second line ahead of >> the first and didn't discard prefetch until pci bus >> disconnect). Actually, I am afraid I know hardware that >> might do that. > Hmm, me too. > I think we should be conservative here, and keep the stop before the last and modify the last that we currently have. I know the above prefetch hardware exists on a system that can plug e100 cards. While prefetching multiple lines is based on the pci fetch code (read vs read line vs read multiple), its just too close to failing. The reason this behavior won't break us with the two packet version is we always have written the unaligned (and hence not atomic) write of the link field before we even expose the RFD to the hardware. Since data is aligned to two bytes, the write to the EL bit and the write to length will be atomic. In addition, if the hardware reorders the loads and gets length with EL, it will just RNR after filling in the packet and we will handle it just fine. Perhaps we should comment that we rely on 2 byte alignment, in case some crazy architecture sets ETH_HDR_ALIGN to an odd value. (It's arch dependent to allow trade offs of partial cache line writes from IO vs processing unaligned TCP and IP headers by the cpu). I wouldn't do BUG_ON but wouldn't object to BUILD_BUG_ON if someone thinks it's needed. The difference is a additional pointer follow in the new find_mark_el() from the new fill point to the mark point (and the availability of one more RFD for a packet). Either way we need to check its ->prev for being the being current EL when hardware is not stopped. milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-24 11:21 ` Milton Miller 2007-05-24 12:51 ` David Acker @ 2007-05-29 15:58 ` David Acker 2007-05-30 8:26 ` Milton Miller 1 sibling, 1 reply; 51+ messages in thread From: David Acker @ 2007-05-29 15:58 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, e1000-devel, netdev, Jesse Brandeburg, Kok, Auke, Jeff Kirsher, Scott Feldman, John Ronciak Ok, I finally got some time to code this out and study it and Ihave some questions. Milton Miller wrote: >> We add two flags to struct rx: one says this packet is EL, and one says >> it is or was size 0. We create a function, find_mark_el(struct nic, >> is_running) that is called after initial alloc and/or after refilling >> skb list. In find_mark_el, we start with rx_to_use (lets rename this >> rx_to_fill), and go back two entries, call this rx_to_mark. If >> is_running and rx_to_mark->prev->el then just return, leave EL alone. So if we start clean and then one packet completes, we can not clear the old marked entry? We must then add a per nic pointer to the current marked entry so that when the next packet completes, we can avoid scanning for it. Why do we need to check this again? >> Otherwise, set EL and size to 0, set the two flags in the rx struct, >> sync the RFD, then search for the current EL (we could save the pointer, >> but its always the odl rx_to_use (fill) or its ->prev). Clear RFD->EL, >> sync, clear rx->el. Set size non-zero, sync, Leave the was_0 flag set >> if is_running (clear it only if we know reciever is stopped). >> >> At this point, if the receiver was stopped we can restart it,. We >> should do so in the caller. (We always restart at rx_to_clean). >> Restart should ack the RNR status before issuing the ru_start command to >> avoid a spurious RNR interrupt. >> >> In the receive loop, if RFD->C is not set, rx->was_0 is set and el >> is not set, then we need to check rx->next->RFD->C bit (after >> pci_sync_for_cpu). If the next packet C bit is set then we consider >> this skb as used, and just complete it with no data to the stack. If the C-bit is not set, we can read the status to see if the RU is running (we cleared the EL bit before it read it but it has not found another packet yet) or not (it read the el-bit before we cleared it). This read lets us avoid going through an enable interrupts, wait for the RNR interrupt cycle. >> Because find_mark_el will only advance EL if the receiver was stopped >> or we had more than 1 packet added, we can guarantee that if packet >> N has was_0 set, then packet N+1 will not have it set, so we have >> bounded lookahead. Is this meant to be 1 packet added at any given call or 1 packet added since the last time we cleared? >> This adds two flags to struct rx, but that is allocated as a single >> array and the array size is filled out as forward and back pointers. >> Rx clean only has to look at was_0 on the last packet when it >> decides C is not set, and only if both are set does it peek at the >> next packet. In other words, we shouldn't worry about the added >> flags making it a non-power-of-2 size. >> >> By setting size != 0 after we have modified all other fields, the >> device will only write to this packet if we are done writing. If >> we loose the race and only partially complete, it will just go on >> to the next packet and we find it. If we totally loose, then the >> receiver will go RNR and we can reclaim all the buffer space we >> have allocated. >> >> Comments? Questions? >> Say we have an 8 buffer receive pool (0-7) at start. rx[6] is marked. hardware consumes rx[0] software sees one rx complete without an s-bit set. software notes that rx[6] is marked software allocates new buffer for rx[0] software runs find_mark_el with rx_to_use == rx[1], rx_to_mark == rx[7], is_running == true, marked_rx == rx[6] find_mark_el finds rx_to_mark->prev->el set and is_running true, and returns. hardware consumes rx[1] software sees one rx complete without an s-bit set. software notes that rx[6] is still marked software allocates new buffer for rx[1] software runs find_mark_el with rx_to_use == rx[2], rx_to_mark == rx[0], is_running == true, marked_rx == rx[6] find_mark_el does not find rx_to_mark->prev->el set so continues find_mark_el sets rx[0]->rfd->el rx[0]->rfd->size = 0, rx[0]->el, rx[0]->size0; find_mark_el clears rx[6]->rfd->el, syncs, sets rx[6]->rfd->size, syncs, clears rx[6]->rfd->el but leaves rx[6]->rfd->size0 set. Is this the correct flow? -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-29 15:58 ` David Acker @ 2007-05-30 8:26 ` Milton Miller 2007-06-01 20:45 ` David Acker 0 siblings, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-05-30 8:26 UTC (permalink / raw) To: David Acker Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev On May 29, 2007, at 10:58 AM, David Acker wrote: > Ok, I finally got some time to code this out and study it and Ihave > some > questions. > > Milton Miller wrote: >>> We add two flags to struct rx: one says this packet is EL, and one >>> says >>> it is or was size 0. We create a function, find_mark_el(struct nic, >>> is_running) that is called after initial alloc and/or after refilling >>> skb list. In find_mark_el, we start with rx_to_use (lets rename >>> this >>> rx_to_fill), and go back two entries, call this rx_to_mark. If >>> is_running and rx_to_mark->prev->el then just return, leave EL alone. > So if we start clean and then one packet completes, we can not clear > the > old marked entry? We must then add a per nic pointer to the current > marked entry so that when the next packet completes, we can avoid > scanning for it. Why do we need to check this again? I don't think we need a per-nic pointer, we just need to check if our to_mark->prev has the "This is EL" flag set. the to_mark is the to_fill->prev->prev (I think -- to_use in the code, which is the next to be allocated, not last allocated? I haven't studied that detail). We need this to avoid our refill code running in lock step with the hardware and exposing two packets in a row with a size that it reads as 0. This means we never need to scan ahead more than one packet. > >>> Otherwise, set EL and size to 0, set the two flags in the rx struct, >>> sync the RFD, then search for the current EL (we could save the >>> pointer, >>> but its always the odl rx_to_use (fill) or its ->prev). Clear >>> RFD->EL, >>> sync, clear rx->el. Set size non-zero, sync, Leave the was_0 flag >>> set >>> if is_running (clear it only if we know reciever is stopped). >>> >>> At this point, if the receiver was stopped we can restart it,. We >>> should do so in the caller. (We always restart at rx_to_clean). >>> Restart should ack the RNR status before issuing the ru_start >>> command to >>> avoid a spurious RNR interrupt. >>> >>> In the receive loop, if RFD->C is not set, rx->was_0 is set and el >>> is not set, then we need to check rx->next->RFD->C bit (after >>> pci_sync_for_cpu). If the next packet C bit is set then we consider >>> this skb as used, and just complete it with no data to the stack. > If the C-bit is not set, we can read the status to see if the RU is > running (we cleared the EL bit before it read it but it has not found > another packet yet) or not (it read the el-bit before we cleared it). > This read lets us avoid going through an enable interrupts, wait for > the > RNR interrupt cycle. Yes agreed. But I was pointing out if we never had set size to 0 on this packet (ie rx->was_0 is clear) we dont' even need to poll the device (saves a slow mmio load) or check the next packet (saves a pci_sync_for_cpu ie a cache line invalidate, and check for c bit). >>> Because find_mark_el will only advance EL if the receiver was stopped >>> or we had more than 1 packet added, we can guarantee that if packet >>> N has was_0 set, then packet N+1 will not have it set, so we have >>> bounded lookahead. > Is this meant to be 1 packet added at any given call or 1 packet added > since the last time we cleared? This would be we allocated one packet since the last time we moved EL. >>> This adds two flags to struct rx, but that is allocated as a single >>> array and the array size is filled out as forward and back pointers. >>> Rx clean only has to look at was_0 on the last packet when it >>> decides C is not set, and only if both are set does it peek at the >>> next packet. In other words, we shouldn't worry about the added >>> flags making it a non-power-of-2 size. >>> >>> By setting size != 0 after we have modified all other fields, the >>> device will only write to this packet if we are done writing. If >>> we loose the race and only partially complete, it will just go on >>> to the next packet and we find it. If we totally loose, then the >>> receiver will go RNR and we can reclaim all the buffer space we >>> have allocated. >>> >>> Comments? Questions? >>> > > > Say we have an 8 buffer receive pool (0-7) at start. rx[6] is marked. > > hardware consumes rx[0] > software sees one rx complete without an s-bit set. wihtout el bit set? software sees one packet complete, checks next and find was_el is not set so it assumes it is still pending and the device is running. > software notes that rx[6] is marked > software allocates new buffer for rx[0] > software runs find_mark_el with rx_to_use == rx[1], rx_to_mark == > rx[7], is_running == true, marked_rx == rx[6] > find_mark_el finds rx_to_mark->prev->el set and is_running true, and > returns. > > hardware consumes rx[1] > software sees one rx complete without an s-bit set. software sees one packet complete, checks next and find was_0 is not set so it assumes it is still pending and the device is running. > software notes that rx[6] is still marked > software allocates new buffer for rx[1] > software runs find_mark_el with rx_to_use == rx[2], rx_to_mark == > rx[0], is_running == true, marked_rx == rx[6] > find_mark_el does not find rx_to_mark->prev->el set so continues > find_mark_el sets rx[0]->rfd->el rx[0]->rfd->size = 0, rx[0]->el, > rx[0]->size0; > find_mark_el clears rx[6]->rfd->el, syncs, sets rx[6]->rfd->size, > syncs, clears rx[6]->rfd->el but leaves rx[6]->rfd->size0 set. > > Is this the correct flow? > Without thinking about the ordering of these updates, yes I think that is the right flow. although the "rx[6] is still marked" is not really a step, but notices that [0]->prev === [7]->el is not set. actually, lets do second part again with what I was thinking: > > hardware consumes rx[1] > software sees one rx complete without an s-bit set. software notes that rx_to_use (fill) is rx[1] > software allocates new buffer for rx[1], and moves rx_to_use to rx[2] software runs with find_mark_el with rx_to_use = rx[2], rx_was_to_use = rx[1], is_running = true find_mark_el walks backward and calculates rx_to_mark as rx[0]. it calculates rx_to_unmark as rx_was_to_use->prev == rx[7] initially, but since ->el is not set it backs up once more to rx[6]. It then compares rx_to_unmark is != rx_to_mark->prev and continues as above. Note that if is_running == true we would also continue; this should be the second clause in the || obviously. Your logic works, this adds some conditional branching but saves a pointer, not sure which is better. Mine can be used for initializing without special casing since it will just calculate rx_to_unmark as rx[n-2] and rx_to_mark as rx[n-2] != rx[n-2]->prev; unmarking a never marked still works, where as for yours the "was marked" must be explicitly set to something (hmm, rxs = rx[0] might work for that initial value until we mark -2??) again, the compare of rx->el instead of rx->rfd->el is to save cache traffic to the rfd under io. The rx->was_0 is required, so the el flag is free. milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-30 8:26 ` Milton Miller @ 2007-06-01 20:45 ` David Acker 2007-06-01 21:13 ` Jeff Garzik 2007-06-04 9:03 ` Milton Miller 0 siblings, 2 replies; 51+ messages in thread From: David Acker @ 2007-06-01 20:45 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev Milton Miller wrote: > Your logic works, this adds some conditional branching but saves a > pointer, not sure which is better. Mine can be used for initializing > without special casing since it will just calculate rx_to_unmark as > rx[n-2] and rx_to_mark as rx[n-2] != rx[n-2]->prev; unmarking a never > marked still works, where as for yours the "was marked" must be > explicitly set to something (hmm, rxs = rx[0] might work for that > initial value until we mark -2??) > > again, the compare of rx->el instead of rx->rfd->el is to save cache > traffic to the rfd under io. The rx->was_0 is required, so the el flag > is free. > Ok, I took a stab at coding and testing these ideas. Below is a patch against 2.6.22-rc3. Let me know what you think. Testing shows that we can hit any of the following scenarios: Find a buffer not complete with rx->el and rx->s0 set. I read back the status and see if the receiver is still running. Find a buffer not complete with rx->el not set and rx->s0 set. I check the next buffer and if it is complete, I free the skb and return 0. If the next buffer is not complete, I read the receiver's status to see if he is still running. Find a buffer that is complete with rx->el not set and rx->s0 set. It appears that hardware can read the rfd's el-bit, then software can clear the rfd el-bit and set the rfd size, and then hardware can come in and read the size. I am reading the status back, although I don't think that I have to in this instance. I am testing a version of this code patched against 2.6.18.4 on my PXA 255 based system. I will let you all know how it goes. On the ARM, their is a race condition between software allocating a new receive buffer and hardware writing into a buffer. The two race on touching the last Receive Frame Descriptor (RFD). It has its el-bit set and its next link equal to 0. When hardware encounters this buffer it attempts to write data to it and then update Status Word bits and Actual Count in the RFD. At the same time software may try to clear the el-bit and set the link address to a new buffer. Since the entire RFD is once cache-line, the two write operations can collide. This can lead to the receive unit stalling or freed receive buffers getting written to. The fix is to set the el-bit on and the size to 0 on the next to last buffer in the chain. When the hardware encounters this buffer it stops and does not write to it at all. The hardware issues an RNR interrupt with the receive unit in the No Resources state. When software allocates buffers, it can update the tail of the list when either it knows the hardware has stopped or the previous to the new one to mark marked. Once it has a new next to last buffer prepared, it can clear the el-bit and set the size on the previous one. The race on this buffer is safe since the link already points to a valid next buffer. We keep flags in our software descriptor to note if the el bit is set and if the size was 0. When we clear the RFD's el bit and set its size, we also clear the el flag but we leave the size was 0 bit set. This was we can find this buffer again later. If the hardware sees the el-bit cleared without the size set, it will move on to the next buffer and skip this one. If it sees the size set but the el-bit still set, it will complete that buffer and then RNR interrupt and wait. Signed-off-by: David Acker <dacker@roinet.com> --- --- linux-2.6.22-rc3/drivers/net/e100.c.orig 2007-06-01 16:13:16.000000000 -0400 +++ linux-2.6.22-rc3/drivers/net/e100.c 2007-06-01 16:20:36.000000000 -0400 @@ -281,10 +281,17 @@ struct csr { }; enum scb_status { + rus_no_res = 0x08, rus_ready = 0x10, rus_mask = 0x3C, }; +enum ru_state { + ru_suspended = 0, + ru_running = 1, + ru_uninitialized = -1, +}; + enum scb_stat_ack { stat_ack_not_ours = 0x00, stat_ack_sw_gen = 0x04, @@ -395,10 +402,16 @@ struct rfd { u16 size; }; +enum rx_flags { + rx_el = 0x01, + rx_s0 = 0x02, +}; + struct rx { struct rx *next, *prev; struct sk_buff *skb; dma_addr_t dma_addr; + u8 flags; }; #if defined(__BIG_ENDIAN_BITFIELD) @@ -526,6 +539,7 @@ struct nic { struct rx *rx_to_use; struct rx *rx_to_clean; struct rfd blank_rfd; + enum ru_state ru_running; spinlock_t cb_lock ____cacheline_aligned; spinlock_t cmd_lock; @@ -947,7 +961,7 @@ static void e100_get_defaults(struct nic ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i)); /* Template for a freshly allocated RFD */ - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); + nic->blank_rfd.command = 0; nic->blank_rfd.rbd = 0xFFFFFFFF; nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN); @@ -1742,11 +1756,55 @@ static int e100_alloc_cbs(struct nic *ni return 0; } -static inline void e100_start_receiver(struct nic *nic) +static void e100_find_mark_el(struct nic *nic, struct rx *marked_rx, int is_running) { - /* Start if RFA is non-NULL */ - if(nic->rx_to_clean->skb) - e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr); + struct rx *rx_to_mark = nic->rx_to_use->prev->prev; + struct rfd *rfd_to_mark; + + if(is_running && rx_to_mark->prev->flags & rx_el) + return; + + if(marked_rx == rx_to_mark) + return; + + rfd_to_mark = (struct rfd *) rx_to_mark->skb->data; + rfd_to_mark->command |= cpu_to_le16(cb_el); + rfd_to_mark->size = 0; + pci_dma_sync_single_for_cpu(nic->pdev, rx_to_mark->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + rx_to_mark->flags |= (rx_el | rx_s0); + + if(!marked_rx) + return; + + rfd_to_mark = (struct rfd *) marked_rx->skb->data; + rfd_to_mark->command &= ~cpu_to_le16(cb_el); + pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + + rfd_to_mark->size = cpu_to_le16(VLAN_ETH_FRAME_LEN); + pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_TODEVICE); + + if(is_running) + marked_rx->flags &= ~rx_el; + else + marked_rx->flags &= ~(rx_el | rx_s0); +} + +static inline void e100_start_receiver(struct nic *nic, struct rx *rx) +{ + if(!nic->rxs) return; + if(ru_suspended != nic->ru_running) return; + + /* handle init time starts */ + if(!rx) rx = nic->rxs; + + /* (Re)start RU if suspended or idle and RFA is non-NULL */ + if(rx->skb) { + e100_exec_cmd(nic, ruc_start, rx->dma_addr); + nic->ru_running = ru_running; + } } #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN) @@ -1774,8 +1832,6 @@ static int e100_rx_alloc_skb(struct nic struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; put_unaligned(cpu_to_le32(rx->dma_addr), (u32 *)&prev_rfd->link); - wmb(); - prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s); pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, sizeof(struct rfd), PCI_DMA_TODEVICE); } @@ -1789,6 +1845,7 @@ static int e100_rx_indicate(struct nic * struct sk_buff *skb = rx->skb; struct rfd *rfd = (struct rfd *)skb->data; u16 rfd_status, actual_size; + u8 status; if(unlikely(work_done && *work_done >= work_to_do)) return -EAGAIN; @@ -1800,9 +1857,46 @@ static int e100_rx_indicate(struct nic * DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status); - /* If data isn't ready, nothing to indicate */ - if(unlikely(!(rfd_status & cb_complete))) + /* If data isn't ready, nothing to indicate + * If both the el and s0 rx flags are set, we have hit the marked + * buffer but we don't know if hardware has seen it so we check + * the status. + * If only the s0 flag is set, we check the next buffer. + * If it is complete, we know that hardware saw the rfd el bit + * get cleared but did not see the rfd size get set so it + * skipped this buffer. We just return 0 and look at the + * next buffer. + * If only the s0 flag is set but the next buffer is + * not complete, we cleared the el flag as hardware + * hit this buffer. */ + if(unlikely(!(rfd_status & cb_complete))) { + u8 maskedFlags = rx->flags & (rx_el | rx_s0); + if(maskedFlags == (rx_el | rx_s0)) { + status = ioread8(&nic->csr->scb.status); + if(status & rus_no_res) { + nic->ru_running = ru_suspended; + } + } else if(maskedFlags == rx_s0) { + struct rx *next_rx = rx->next; + struct rfd *next_rfd = (struct rfd *)next_rx->skb->data; + pci_dma_sync_single_for_cpu(nic->pdev, next_rx->dma_addr, + sizeof(struct rfd), PCI_DMA_FROMDEVICE); + if(next_rfd->status & cpu_to_le16(cb_complete)) { + pci_unmap_single(nic->pdev, rx->dma_addr, + RFD_BUF_LEN, PCI_DMA_FROMDEVICE); + dev_kfree_skb_any(skb); + rx->skb = NULL; + rx->flags &= ~rx_s0; + return 0; + } else { + status = ioread8(&nic->csr->scb.status); + if(status & rus_no_res) { + nic->ru_running = ru_suspended; + } + } + } return -ENODATA; + } /* Get actual data size */ actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF; @@ -1813,6 +1907,15 @@ static int e100_rx_indicate(struct nic * pci_unmap_single(nic->pdev, rx->dma_addr, RFD_BUF_LEN, PCI_DMA_FROMDEVICE); + /* This happens when hardward sees the rfd el flag set + * but then sees the rfd size set as well */ + if(le16_to_cpu(rfd->command) & cb_el) { + status = ioread8(&nic->csr->scb.status); + if(status & rus_no_res) { + nic->ru_running = ru_suspended; + } + } + /* Pull off the RFD and put the actual data (minus eth hdr) */ skb_reserve(skb, sizeof(struct rfd)); skb_put(skb, actual_size); @@ -1842,19 +1945,46 @@ static int e100_rx_indicate(struct nic * static void e100_rx_clean(struct nic *nic, unsigned int *work_done, unsigned int work_to_do) { - struct rx *rx; + struct rx *rx, *marked_rx; + int restart_required = 0; + int err = 0; /* Indicate newly arrived packets */ for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) { - if(e100_rx_indicate(nic, rx, work_done, work_to_do)) - break; /* No more to clean */ + err = e100_rx_indicate(nic, rx, work_done, work_to_do); + /* Hit quota or no more to clean */ + if(-EAGAIN == err || -ENODATA == err) + break; } + /* On EAGAIN, hit quota so have more work to do, restart once + * cleanup is complete. + * Else, are we already rnr? then pay attention!!! this ensures that + * the state machine progression never allows a start with a + * partially cleaned list, avoiding a race between hardware + * and rx_to_clean when in NAPI mode */ + if(-EAGAIN != err && ru_suspended == nic->ru_running) + restart_required = 1; + + marked_rx = nic->rx_to_use->prev->prev; + while(!(marked_rx->flags & rx_el)) + marked_rx = marked_rx->prev; + /* Alloc new skbs to refill list */ for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) { if(unlikely(e100_rx_alloc_skb(nic, rx))) break; /* Better luck next time (see watchdog) */ } + + e100_find_mark_el(nic, marked_rx, !restart_required); + + if(restart_required) { + // ack the rnr? + iowrite8(stat_ack_rnr, &nic->csr->scb.stat_ack); + e100_start_receiver(nic, nic->rx_to_clean); + if(work_done) + (*work_done)++; + } } static void e100_rx_clean_list(struct nic *nic) @@ -1862,6 +1992,8 @@ static void e100_rx_clean_list(struct ni struct rx *rx; unsigned int i, count = nic->params.rfds.count; + nic->ru_running = ru_uninitialized; + if(nic->rxs) { for(rx = nic->rxs, i = 0; i < count; rx++, i++) { if(rx->skb) { @@ -1883,6 +2015,7 @@ static int e100_rx_alloc_list(struct nic unsigned int i, count = nic->params.rfds.count; nic->rx_to_use = nic->rx_to_clean = NULL; + nic->ru_running = ru_uninitialized; if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC))) return -ENOMEM; @@ -1897,6 +2030,9 @@ static int e100_rx_alloc_list(struct nic } nic->rx_to_use = nic->rx_to_clean = nic->rxs; + nic->ru_running = ru_suspended; + + e100_find_mark_el(nic, NULL, 0); return 0; } @@ -1916,6 +2052,10 @@ static irqreturn_t e100_intr(int irq, vo /* Ack interrupt(s) */ iowrite8(stat_ack, &nic->csr->scb.stat_ack); + /* We hit Receive No Resource (RNR); restart RU after cleaning */ + if(stat_ack & stat_ack_rnr) + nic->ru_running = ru_suspended; + if(likely(netif_rx_schedule_prep(netdev))) { e100_disable_irq(nic); __netif_rx_schedule(netdev); @@ -2007,7 +2147,7 @@ static int e100_up(struct nic *nic) if((err = e100_hw_init(nic))) goto err_clean_cbs; e100_set_multicast_list(nic->netdev); - e100_start_receiver(nic); + e100_start_receiver(nic, NULL); mod_timer(&nic->watchdog, jiffies); if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED, nic->netdev->name, nic->netdev))) @@ -2088,7 +2228,7 @@ static int e100_loopback_test(struct nic mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR, BMCR_LOOPBACK); - e100_start_receiver(nic); + e100_start_receiver(nic, NULL); if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) { err = -ENOMEM; ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-01 20:45 ` David Acker @ 2007-06-01 21:13 ` Jeff Garzik 2007-06-01 22:13 ` Kok, Auke 2007-06-04 9:03 ` Milton Miller 1 sibling, 1 reply; 51+ messages in thread From: Jeff Garzik @ 2007-06-01 21:13 UTC (permalink / raw) To: David Acker Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Milton Miller, Scott Feldman, John Ronciak, Jeff Kirsher David Acker wrote: > Milton Miller wrote: >> Your logic works, this adds some conditional branching but saves a >> pointer, not sure which is better. Mine can be used for initializing >> without special casing since it will just calculate rx_to_unmark as >> rx[n-2] and rx_to_mark as rx[n-2] != rx[n-2]->prev; unmarking a never >> marked still works, where as for yours the "was marked" must be >> explicitly set to something (hmm, rxs = rx[0] might work for that >> initial value until we mark -2??) >> >> again, the compare of rx->el instead of rx->rfd->el is to save cache >> traffic to the rfd under io. The rx->was_0 is required, so the el >> flag is free. >> > > Ok, I took a stab at coding and testing these ideas. Below is a patch > against 2.6.22-rc3. > Let me know what you think. Testing shows that we can hit any of the > following scenarios: > > Find a buffer not complete with rx->el and rx->s0 set. I read back the > status and see if > the receiver is still running. > Find a buffer not complete with rx->el not set and rx->s0 set. I check > the next buffer > and if it is complete, I free the skb and return 0. If the next buffer > is not > complete, I read the receiver's status to see if he is still running. > Find a buffer that is complete with rx->el not set and rx->s0 set. It > appears that hardware > can read the rfd's el-bit, then software can clear the rfd el-bit and > set the rfd size, and > then hardware can come in and read the size. I am reading the status > back, although > I don't think that I have to in this instance. > > I am testing a version of this code patched against 2.6.18.4 on my PXA > 255 based system. I will let you all know how it goes. > > On the ARM, their is a race condition between software allocating a new > receive > buffer and hardware writing into a buffer. The two race on touching the > last > Receive Frame Descriptor (RFD). It has its el-bit set and its next link > equal > to 0. When hardware encounters this buffer it attempts to write data to it > and then update Status Word bits and Actual Count in the RFD. At the > same time > software may try to clear the el-bit and set the link address to a new > buffer. > Since the entire RFD is once cache-line, the two write operations can > collide. This can lead to the receive unit stalling or freed receive > buffers > getting written to. > > The fix is to set the el-bit on and the size to 0 on the next to last > buffer > in the chain. When the hardware encounters this buffer it stops and does > not write to it at all. The hardware issues an RNR interrupt with the > receive unit in the No Resources state. When software allocates buffers, > it can update the tail of the list when either it knows the hardware > has stopped or the previous to the new one to mark marked. > Once it has a new next to last buffer prepared, it can clear the el-bit > and set the size on the previous one. The race on this buffer is safe > since the link already points to a valid next buffer. We keep flags > in our software descriptor to note if the el bit is set and if the size > was 0. When we clear the RFD's el bit and set its size, we also clear > the el flag but we leave the size was 0 bit set. This was we can find > this buffer again later. > > If the hardware sees the el-bit cleared without the size set, it will > move on to the next buffer and skip this one. If it sees > the size set but the el-bit still set, it will complete that buffer > and then RNR interrupt and wait. > > > Signed-off-by: David Acker <dacker@roinet.com> That seems to vaguely match my memory of what eepro100 was doing (or trying to do). I _really_ appreciate you working on this problem. Getting e100 driver stable for the long term, and ditching eepro100, is a big hurdle to cross. Getting this right is really one of the last steps. The patch looks OK at quick glance. Thanks, Jeff ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-01 21:13 ` Jeff Garzik @ 2007-06-01 22:13 ` Kok, Auke 0 siblings, 0 replies; 51+ messages in thread From: Kok, Auke @ 2007-06-01 22:13 UTC (permalink / raw) To: Jeff Garzik Cc: David Acker, Milton Miller, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, netdev Jeff Garzik wrote: > David Acker wrote: >> Milton Miller wrote: <snip> >> the el flag but we leave the size was 0 bit set. This was we can find >> this buffer again later. >> >> If the hardware sees the el-bit cleared without the size set, it will >> move on to the next buffer and skip this one. If it sees >> the size set but the el-bit still set, it will complete that buffer >> and then RNR interrupt and wait. >> >> >> Signed-off-by: David Acker <dacker@roinet.com> > > That seems to vaguely match my memory of what eepro100 was doing (or > trying to do). > > I _really_ appreciate you working on this problem. Getting e100 driver > stable for the long term, and ditching eepro100, is a big hurdle to > cross. Getting this right is really one of the last steps. yes, absolutely agreed. I'm very pleased with the attention and hope that is clear to everyone. > The patch looks OK at quick glance. Besides copying the style errors, it looks OK as well. I will attempt to allocate some testing time again early next week on a small library of e100 nics over here. Mostly x86, but still useful to spot obvious mistakes. Auke ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-01 20:45 ` David Acker 2007-06-01 21:13 ` Jeff Garzik @ 2007-06-04 9:03 ` Milton Miller 2007-06-05 13:34 ` David Acker 1 sibling, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-06-04 9:03 UTC (permalink / raw) To: David Acker Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik On Jun 1, 2007, at 3:45 PM, David Acker wrote: > Ok, I took a stab at coding and testing these ideas. Below is a patch > against 2.6.22-rc3. > Let me know what you think. I think you got most of the ideas. As Auke noted, your coding style is showing again. And your mailer again munged whitespace (fixed by s/^<space><space>/<space>/ s/^$/<space>/). > Testing shows that we can hit any of the following scenarios: > > Find a buffer not complete with rx->el and rx->s0 set. > I read back the status and see if the receiver is still running. > This means the hardware reached what we think is end of list, good. > Find a buffer not complete with rx->el not set and rx->s0 set. > I check the next buffer and if it is complete, I free the skb and > return 0. This is the case when hardware read the descriptor between el being removed and size being set. > If the next buffer is not complete, I read the receiver's status to > see if he is still running. > The polling of still running will save us waiting for irq or next poll when hardware read el before we cleared it, at the cost of an ioread when there is simply no packet ready and we hit a fill mark. > Find a buffer that is complete with rx->el not set and rx->s0 set. > It appears that hardware can read the rfd's el-bit, then software can > clear the rfd el-bit and set the rfd size, and then hardware can come > in and read the size. Yes, since the size is after the EL flag in the descriptor, this can happen since the pci read is not atomic. > I am reading the status back, although I don't think that I have to in > this instance. Actually, you are reading it when the rfd still has EL set. Since the cpu will never encounter that case, the if condition is never satisfied. How about creating a state unknown, for when we think we should check the device if its running. If we are in this state and then encounter a received packet without s0 set, we can set it back to running. We set it when we rx a packet with s0 set. We then move both io_status reads to the caller. > I am testing a version of this code patched against 2.6.18.4 on my PXA > 255 based system. I will let you all know how it goes. I'm assuming this is why the cleanup of the receiver start to always start on rx_to_clean got dropped again. :-) Also, I would like a few sentences in the Driver Operation section IV Receive big comment. Something like In order to keep updates to the RFD link field from colliding with hardware writes to mark packets complete, we use the feature that hardware will not write to a size 0 descriptor and mark the previous packet as end-of-list (EL). After updating the link, we remove EL and only then restore the size such that hardware may use the previous-to-end RFD. at the end of the first paragraph, and insert software before "no locking is required" in the second. > > On the ARM, their is a race condition between software allocating a > new receive > buffer and hardware writing into a buffer. The two race on touching > the last > Receive Frame Descriptor (RFD). It has its el-bit set and its next > link equal > to 0. When hardware encounters this buffer it attempts to write data > to it > and then update Status Word bits and Actual Count in the RFD. At the > same time > software may try to clear the el-bit and set the link address to a new > buffer. > Since the entire RFD is once cache-line, the two write operations can > collide. This can lead to the receive unit stalling or freed receive > buffers > getting written to. > > The fix is to set the el-bit on and the size to 0 on the next to last > buffer > in the chain. When the hardware encounters this buffer it stops and > does > not write to it at all. The hardware issues an RNR interrupt with the > receive unit in the No Resources state. When software allocates > buffers, > it can update the tail of the list when either it knows the hardware > has stopped or the previous to the new one to mark marked. > Once it has a new next to last buffer prepared, it can clear the el-bit > and set the size on the previous one. The race on this buffer is safe > since the link already points to a valid next buffer. We keep flags > in our software descriptor to note if the el bit is set and if the size > was 0. When we clear the RFD's el bit and set its size, we also clear > the el flag but we leave the size was 0 bit set. This was we can find > this buffer again later. > > If the hardware sees the el-bit cleared without the size set, it will > move on to the next buffer and skip this one. If it sees > the size set but the el-bit still set, it will complete that buffer > and then RNR interrupt and wait. > > > Signed-off-by: David Acker <dacker@roinet.com> > > --- > > --- linux-2.6.22-rc3/drivers/net/e100.c.orig 2007-06-01 > 16:13:16.000000000 -0400 > +++ linux-2.6.22-rc3/drivers/net/e100.c 2007-06-01 16:20:36.000000000 > -0400 > @@ -281,10 +281,17 @@ struct csr { > }; > > enum scb_status { > + rus_no_res = 0x08, > rus_ready = 0x10, > rus_mask = 0x3C, > }; > > +enum ru_state { > + ru_suspended = 0, > + ru_running = 1, > + ru_uninitialized = -1, > +}; > + If we are going to have uninitialized, it should be 0. But its only set (1) during teardown, and (2) during rx list allocation. In the same function its set to suspended. Also, lets call this stopped, not suspended. Suspended is a specific hardware state we are not using. See above for my comments about adding unknown or maybe_running. > enum scb_stat_ack { > stat_ack_not_ours = 0x00, > stat_ack_sw_gen = 0x04, > @@ -395,10 +402,16 @@ struct rfd { > u16 size; > }; > > +enum rx_flags { > + rx_el = 0x01, > + rx_s0 = 0x02, > +}; > + > struct rx { > struct rx *next, *prev; > struct sk_buff *skb; > dma_addr_t dma_addr; > + u8 flags; > }; > > #if defined(__BIG_ENDIAN_BITFIELD) > @@ -526,6 +539,7 @@ struct nic { > struct rx *rx_to_use; > struct rx *rx_to_clean; > struct rfd blank_rfd; > + enum ru_state ru_running; > > spinlock_t cb_lock ____cacheline_aligned; > spinlock_t cmd_lock; > @@ -947,7 +961,7 @@ static void e100_get_defaults(struct nic > ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i)); > > /* Template for a freshly allocated RFD */ > - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); > + nic->blank_rfd.command = 0; > nic->blank_rfd.rbd = 0xFFFFFFFF; > nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN); > > @@ -1742,11 +1756,55 @@ static int e100_alloc_cbs(struct nic *ni > return 0; > } > > -static inline void e100_start_receiver(struct nic *nic) > +static void e100_find_mark_el(struct nic *nic, struct rx *marked_rx, > int is_running) > { > - /* Start if RFA is non-NULL */ > - if(nic->rx_to_clean->skb) > - e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr); > + struct rx *rx_to_mark = nic->rx_to_use->prev->prev; > + struct rfd *rfd_to_mark; Let's just call this rfd or mark_rfd (since its used for both mark setting and clearing). > + > + if(is_running && rx_to_mark->prev->flags & rx_el) > + return; or just rx_to_mark == marked_rx (saves the load of flags). > + > + if(marked_rx == rx_to_mark) > + return; > + > + rfd_to_mark = (struct rfd *) rx_to_mark->skb->data; > + rfd_to_mark->command |= cpu_to_le16(cb_el); > + rfd_to_mark->size = 0; > + pci_dma_sync_single_for_cpu(nic->pdev, rx_to_mark->dma_addr, > + sizeof(struct rfd), PCI_DMA_TODEVICE); for_device, BIDIRECTIONAL > + rx_to_mark->flags |= (rx_el | rx_s0); > + > + if(!marked_rx) > + return; > + > + rfd_to_mark = (struct rfd *) marked_rx->skb->data; > + rfd_to_mark->command &= ~cpu_to_le16(cb_el); > + pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr, > + sizeof(struct rfd), PCI_DMA_TODEVICE); for_device BIDIRECTIONAL > + > + rfd_to_mark->size = cpu_to_le16(VLAN_ETH_FRAME_LEN); > + pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr, > + sizeof(struct rfd), PCI_DMA_TODEVICE); > for_device BIDIRECTIONAL > + > + if(is_running) > + marked_rx->flags &= ~rx_el; > + else > + marked_rx->flags &= ~(rx_el | rx_s0); > +} > + > +static inline void e100_start_receiver(struct nic *nic, struct rx *rx) > +{ > + if(!nic->rxs) return; > + if(ru_suspended != nic->ru_running) return; > Are these checks just paranoia? Can we call poll leading to rx_clean without setting up the receive list? > + > + /* handle init time starts */ > + if(!rx) rx = nic->rxs; Again, as the previous patch noted, we can drop rx as a parameter and always (re)start on rx_to_clean. > + > + /* (Re)start RU if suspended or idle and RFA is non-NULL */ > + if(rx->skb) { > + e100_exec_cmd(nic, ruc_start, rx->dma_addr); > + nic->ru_running = ru_running; > + } > } > > #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN) > @@ -1774,8 +1832,6 @@ static int e100_rx_alloc_skb(struct nic > struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; > put_unaligned(cpu_to_le32(rx->dma_addr), > (u32 *)&prev_rfd->link); > - wmb(); > - prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s); > pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, > sizeof(struct rfd), PCI_DMA_TODEVICE); > } > @@ -1789,6 +1845,7 @@ static int e100_rx_indicate(struct nic * > struct sk_buff *skb = rx->skb; > struct rfd *rfd = (struct rfd *)skb->data; > u16 rfd_status, actual_size; > + u8 status; > > if(unlikely(work_done && *work_done >= work_to_do)) > return -EAGAIN; > @@ -1800,9 +1857,46 @@ static int e100_rx_indicate(struct nic * > > DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status); > > - /* If data isn't ready, nothing to indicate */ > - if(unlikely(!(rfd_status & cb_complete))) > + /* If data isn't ready, nothing to indicate > + * If both the el and s0 rx flags are set, we have hit the marked > + * buffer but we don't know if hardware has seen it so we check > + * the status. > + * If only the s0 flag is set, we check the next buffer. > + * If it is complete, we know that hardware saw the rfd el bit > + * get cleared but did not see the rfd size get set so it > + * skipped this buffer. We just return 0 and look at the > + * next buffer. > + * If only the s0 flag is set but the next buffer is > + * not complete, we cleared the el flag as hardware > + * hit this buffer. */ > + if(unlikely(!(rfd_status & cb_complete))) { > + u8 maskedFlags = rx->flags & (rx_el | rx_s0); > + if(maskedFlags == (rx_el | rx_s0)) { > + status = ioread8(&nic->csr->scb.status); > + if(status & rus_no_res) { > + nic->ru_running = ru_suspended; > + } > + } else if(maskedFlags == rx_s0) { > + struct rx *next_rx = rx->next; > + struct rfd *next_rfd = (struct rfd *)next_rx->skb->data; > + pci_dma_sync_single_for_cpu(nic->pdev, next_rx->dma_addr, > + sizeof(struct rfd), PCI_DMA_FROMDEVICE); > + if(next_rfd->status & cpu_to_le16(cb_complete)) { > + pci_unmap_single(nic->pdev, rx->dma_addr, > + RFD_BUF_LEN, PCI_DMA_FROMDEVICE); > + dev_kfree_skb_any(skb); > + rx->skb = NULL; > + rx->flags &= ~rx_s0; > + return 0; > + } else { > + status = ioread8(&nic->csr->scb.status); > + if(status & rus_no_res) { > + nic->ru_running = ru_suspended; > + } > + } > + } > return -ENODATA; > + } > > /* Get actual data size */ > actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF; > @@ -1813,6 +1907,15 @@ static int e100_rx_indicate(struct nic * > pci_unmap_single(nic->pdev, rx->dma_addr, > RFD_BUF_LEN, PCI_DMA_FROMDEVICE); > > + /* This happens when hardward sees the rfd el flag set > + * but then sees the rfd size set as well */ > + if(le16_to_cpu(rfd->command) & cb_el) { The racing read of the dma device might see it, but our cpu will never see this since it wrote them in the other order. how about if rx-flags & s0 ? Also see my comments about ru_unknown. Where do you clear s0 when we encounter a packet with rfd status complete? Maybe we should just set flags to 0 when allocating the skb? > + status = ioread8(&nic->csr->scb.status); > + if(status & rus_no_res) { > + nic->ru_running = ru_suspended; > + } > + } > + > /* Pull off the RFD and put the actual data (minus eth hdr) */ > skb_reserve(skb, sizeof(struct rfd)); > skb_put(skb, actual_size); > @@ -1842,19 +1945,46 @@ static int e100_rx_indicate(struct nic * > static void e100_rx_clean(struct nic *nic, unsigned int *work_done, > unsigned int work_to_do) > { > - struct rx *rx; > + struct rx *rx, *marked_rx; > + int restart_required = 0; > + int err = 0; > > /* Indicate newly arrived packets */ > for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = > rx->next) { > - if(e100_rx_indicate(nic, rx, work_done, work_to_do)) > - break; /* No more to clean */ > + err = e100_rx_indicate(nic, rx, work_done, work_to_do); > + /* Hit quota or no more to clean */ > + if(-EAGAIN == err || -ENODATA == err) > + break; > } My ru_state of maybe_running would cause us to read the status and resolve it to either running or suspended here (after the clean loop). > > + /* On EAGAIN, hit quota so have more work to do, restart once > + * cleanup is complete. > + * Else, are we already rnr? then pay attention!!! this ensures that > + * the state machine progression never allows a start with a > + * partially cleaned list, avoiding a race between hardware > + * and rx_to_clean when in NAPI mode */ It took me a long while to figure out what race you are referring to here. At first I thought you were just moving a comment. I am totally lost by "are we already rnr? then pay attention!!!". This isn't a race, its more like "We don't restart the receiver when we stop processing due to NAPI quota because we don't yet know which rfd is the first that already contains data. That is, rx_to_clean points to something that might be a used rfd instead of a known clean one." > + if(-EAGAIN != err && ru_suspended == nic->ru_running) > + restart_required = 1; > + > + marked_rx = nic->rx_to_use->prev->prev; > + while(!(marked_rx->flags & rx_el)) > + marked_rx = marked_rx->prev; > + Not while. This should be if, followed by BUG_ON(!marked_rx->flags & rx_el). We can remove the BUG_ON later, but it points to an error in our logic if its not one of the two places. > /* Alloc new skbs to refill list */ > for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) { > if(unlikely(e100_rx_alloc_skb(nic, rx))) > break; /* Better luck next time (see watchdog) */ > } > + > + e100_find_mark_el(nic, marked_rx, !restart_required); > We could pass the true ru_status based value here, but since we won't be restoring the receiver it doesn't matter. > + > + if(restart_required) { > + // ack the rnr? > + iowrite8(stat_ack_rnr, &nic->csr->scb.stat_ack); > + e100_start_receiver(nic, nic->rx_to_clean); > + if(work_done) > + (*work_done)++; > Why is restarting the receiver work? Actually receiving the packet is work. > + } > } > > static void e100_rx_clean_list(struct nic *nic) > @@ -1862,6 +1992,8 @@ static void e100_rx_clean_list(struct ni > struct rx *rx; > unsigned int i, count = nic->params.rfds.count; > > + nic->ru_running = ru_uninitialized; > + > if(nic->rxs) { > for(rx = nic->rxs, i = 0; i < count; rx++, i++) { > if(rx->skb) { > @@ -1883,6 +2015,7 @@ static int e100_rx_alloc_list(struct nic > unsigned int i, count = nic->params.rfds.count; > > nic->rx_to_use = nic->rx_to_clean = NULL; > + nic->ru_running = ru_uninitialized; Here is the short duration of uninitialized, and its not the 0 case. > > if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC))) > return -ENOMEM; > @@ -1897,6 +2030,9 @@ static int e100_rx_alloc_list(struct nic > } > > nic->rx_to_use = nic->rx_to_clean = nic->rxs; > + nic->ru_running = ru_suspended; > + > + e100_find_mark_el(nic, NULL, 0); > > return 0; > } > @@ -1916,6 +2052,10 @@ static irqreturn_t e100_intr(int irq, vo > /* Ack interrupt(s) */ > iowrite8(stat_ack, &nic->csr->scb.stat_ack); > > + /* We hit Receive No Resource (RNR); restart RU after cleaning */ > + if(stat_ack & stat_ack_rnr) > + nic->ru_running = ru_suspended; > + > if(likely(netif_rx_schedule_prep(netdev))) { > e100_disable_irq(nic); > __netif_rx_schedule(netdev); > @@ -2007,7 +2147,7 @@ static int e100_up(struct nic *nic) > if((err = e100_hw_init(nic))) > goto err_clean_cbs; > e100_set_multicast_list(nic->netdev); > - e100_start_receiver(nic); > + e100_start_receiver(nic, NULL); (remove second arg to start_receiver, this diff will go away). > mod_timer(&nic->watchdog, jiffies); > if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED, > nic->netdev->name, nic->netdev))) > @@ -2088,7 +2228,7 @@ static int e100_loopback_test(struct nic > mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR, > BMCR_LOOPBACK); > > - e100_start_receiver(nic); > + e100_start_receiver(nic, NULL); > (and this one). > if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) { > err = -ENOMEM; > > Thanks for doing the work David. milton ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-04 9:03 ` Milton Miller @ 2007-06-05 13:34 ` David Acker 2007-06-05 16:14 ` Milton Miller 2007-06-05 16:14 ` Milton Miller 0 siblings, 2 replies; 51+ messages in thread From: David Acker @ 2007-06-05 13:34 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, netdev, Kok, Auke Milton Miller wrote: > > On Jun 1, 2007, at 3:45 PM, David Acker wrote: >> Ok, I took a stab at coding and testing these ideas. Below is a patch >> against 2.6.22-rc3. >> Let me know what you think. > > I think you got most of the ideas. As Auke noted, your coding style is > showing again. And your mailer again munged whitespace (fixed by > s/^<space><space>/<space>/ s/^$/<space>/). Sorry about the coding style. I instinctively followed what was there instead of kernel coding convention. I will look into how whitespace is getting screwed up. >> Find a buffer that is complete with rx->el not set and rx->s0 set. >> It appears that hardware can read the rfd's el-bit, then software >> can clear the rfd el-bit and set the rfd size, and then hardware can >> come in and read the size. > > Yes, since the size is after the EL flag in the descriptor, this can > happen since the pci read is not atomic. > >> I am reading the status back, although I don't think that I have to in >> this instance. > > Actually, you are reading it when the rfd still has EL set. Since the > cpu will never encounter that case, the if condition is never satisfied. In my tests, every time I found a completed rfd with the el-bit set, the receiver was in the out of resources state. > How about creating a state unknown, for when we think we should check > the device if its running. > If we are in this state and then encounter a received packet without s0 > set, we can set it back > to running. We set it when we rx a packet with s0 set. > > We then move both io_status reads to the caller. I can look into that as I clean this up. >> I am testing a version of this code patched against 2.6.18.4 on my PXA >> 255 based system. I will let you all know how it goes. The testing I did so far did well. I will try to get some more going tonight, hopefully on a cleaned up patch. > I'm assuming this is why the cleanup of the receiver start to always > start on rx_to_clean got dropped again. :-) Yep. I will get that in the next patch. > Also, I would like a few sentences in the Driver Operation section IV > Receive big comment. Something like > > In order to keep updates to the RFD link field from colliding with > hardware writes to mark packets complete, we use the feature that > hardware will not write to a size 0 descriptor and mark the previous > packet as end-of-list (EL). After updating the link, we remove EL and > only then restore the size such that hardware may use the > previous-to-end RFD. > > at the end of the first paragraph, and insert software before "no > locking is required" in the second. Sounds good to me. I will see if I can get into a cleaned up patch today and get it out by tomorrow. Thanks for dealing with me...I have been around kernel code for awhile but posting official patches to linux is new to me. -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 13:34 ` David Acker @ 2007-06-05 16:14 ` Milton Miller 2007-08-27 17:34 ` Kok, Auke 2007-06-05 16:14 ` Milton Miller 1 sibling, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-06-05 16:14 UTC (permalink / raw) To: David Acker Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik On Jun 5, 2007, at 8:34 AM, David Acker wrote: > Milton Miller wrote: >> On Jun 1, 2007, at 3:45 PM, David Acker wrote: >>> Ok, I took a stab at coding and testing these ideas. Below is a >>> patch against 2.6.22-rc3. >>> Let me know what you think. >> I think you got most of the ideas. As Auke noted, your coding style >> is showing again. And your mailer again munged whitespace (fixed by >> s/^<space><space>/<space>/ s/^$/<space>/). > Sorry about the coding style. I instinctively followed what was there > instead of kernel coding convention. I will look into how whitespace > is getting screwed up. I have to watch my coding style too (I like to indent the closing brace). At least the white space damage seems to be reversable. More than I can say for this mailer. >>> Find a buffer that is complete with rx->el not set and rx->s0 set. >>> It appears that hardware can read the rfd's el-bit, then >>> software can clear the rfd el-bit and set the rfd size, and then >>> hardware can come in and read the size. >> Yes, since the size is after the EL flag in the descriptor, this can >> happen since the pci read is not atomic. >>> I am reading the status back, although I don't think that I have to >>> in this instance. >> Actually, you are reading it when the rfd still has EL set. Since >> the cpu will never encounter that case, the if condition is never >> satisfied. > In my tests, every time I found a completed rfd with the el-bit set, > the receiver was in the out of resources state. Yes, if the EL was set, it would be a real hard race to find the completed packet with EL but not RNR. I was trying to refer to where you find a completed packet and then check for EL in the RFD. That is what I was claiming can not be observed by the cpu (unless the card writes the EL bit back, and not just the status u16). If the unless ... above is true, then please put a comment that the device can write RFD->EL back to 1 if we raced. >> How about creating a state unknown, for when we think we should check >> the device if its running. >> If we are in this state and then encounter a received packet without >> s0 set, we can set it back >> to running. We set it when we rx a packet with s0 set. >> We then move both io_status reads to the caller. > I can look into that as I clean this up. > > >>> I am testing a version of this code patched against 2.6.18.4 on my >>> PXA 255 based system. I will let you all know how it goes. > The testing I did so far did well. I will try to get some more going > tonight, hopefully on a cleaned up patch. Good to hear our expectiations match reality. > > >> I'm assuming this is why the cleanup of the receiver start to always >> start on rx_to_clean got dropped again. :-) > Yep. I will get that in the next patch. Ok. >> Also, I would like a few sentences in the Driver Operation section IV >> Receive big comment. Something like >> In order to keep updates to the RFD link field from colliding with >> hardware writes to mark packets complete, we use the feature that >> hardware will not write to a size 0 descriptor and mark the previous >> packet as end-of-list (EL). After updating the link, we remove EL >> and only then restore the size such that hardware may use the >> previous-to-end RFD. >> at the end of the first paragraph, and insert software before "no >> locking is required" in the second. > Sounds good to me. > > I will see if I can get into a cleaned up patch today and get it out > by tomorrow. Thanks for dealing with me...I have been around kernel > code for awhile but posting official patches to linux is new to me. > -Ack I've just learned by watching the lists over the last several years. Well, and actually writing the odd patch here and there. It occurs to me that I have been focusing on the code and not the changelog. I'll send a seperate reply on that thread shortly. One more thing I'll state here ... as per the perfect patch guidelines, it is preferred that the meta-discussion about the patch and its history go after the change log, seperated from it by a line of "--- " so that the patch application scripts can just extract the email subject as the title and through the firsst line of --- as the commit log. (This saves some manual editing). [1] http://kernelnewbies.org/UpstreamMerge milton ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 16:14 ` Milton Miller @ 2007-08-27 17:34 ` Kok, Auke 2007-08-27 18:32 ` David Acker 0 siblings, 1 reply; 51+ messages in thread From: Kok, Auke @ 2007-08-27 17:34 UTC (permalink / raw) To: Milton Miller, David Acker Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik Milton Miller wrote: > On Jun 5, 2007, at 8:34 AM, David Acker wrote: David, Milton, This was the last communication on-topic for the proposed changes to fix e100 on ARM. We're holding our breath here waiting for more, and would love to hear that this issue and fixes hasn't died off. Thanks, Auke >> Milton Miller wrote: >>> On Jun 1, 2007, at 3:45 PM, David Acker wrote: >>>> Ok, I took a stab at coding and testing these ideas. Below is a >>>> patch against 2.6.22-rc3. >>>> Let me know what you think. >>> I think you got most of the ideas. As Auke noted, your coding style >>> is showing again. And your mailer again munged whitespace (fixed by >>> s/^<space><space>/<space>/ s/^$/<space>/). >> Sorry about the coding style. I instinctively followed what was there >> instead of kernel coding convention. I will look into how whitespace >> is getting screwed up. > > I have to watch my coding style too (I like to indent the closing > brace). > > At least the white space damage seems to be reversable. More than I > can say for this mailer. > >>>> Find a buffer that is complete with rx->el not set and rx->s0 set. >>>> It appears that hardware can read the rfd's el-bit, then >>>> software can clear the rfd el-bit and set the rfd size, and then >>>> hardware can come in and read the size. >>> Yes, since the size is after the EL flag in the descriptor, this can >>> happen since the pci read is not atomic. >>>> I am reading the status back, although I don't think that I have to >>>> in this instance. >>> Actually, you are reading it when the rfd still has EL set. Since >>> the cpu will never encounter that case, the if condition is never >>> satisfied. >> In my tests, every time I found a completed rfd with the el-bit set, >> the receiver was in the out of resources state. > > Yes, if the EL was set, it would be a real hard race to find the > completed packet with EL but not RNR. I was trying to refer to where > you find a completed packet and then check for EL in the RFD. That is > what I was claiming can not be observed by the cpu (unless the card > writes the EL bit back, and not just the status u16). > > If the unless ... above is true, then please put a comment that the > device can write RFD->EL back to 1 if we raced. > > >>> How about creating a state unknown, for when we think we should check >>> the device if its running. >>> If we are in this state and then encounter a received packet without >>> s0 set, we can set it back >>> to running. We set it when we rx a packet with s0 set. >>> We then move both io_status reads to the caller. >> I can look into that as I clean this up. >> >> >>>> I am testing a version of this code patched against 2.6.18.4 on my >>>> PXA 255 based system. I will let you all know how it goes. >> The testing I did so far did well. I will try to get some more going >> tonight, hopefully on a cleaned up patch. > > Good to hear our expectiations match reality. > >> >>> I'm assuming this is why the cleanup of the receiver start to always >>> start on rx_to_clean got dropped again. :-) >> Yep. I will get that in the next patch. > > Ok. > >>> Also, I would like a few sentences in the Driver Operation section IV >>> Receive big comment. Something like >>> In order to keep updates to the RFD link field from colliding with >>> hardware writes to mark packets complete, we use the feature that >>> hardware will not write to a size 0 descriptor and mark the previous >>> packet as end-of-list (EL). After updating the link, we remove EL >>> and only then restore the size such that hardware may use the >>> previous-to-end RFD. >>> at the end of the first paragraph, and insert software before "no >>> locking is required" in the second. >> Sounds good to me. >> >> I will see if I can get into a cleaned up patch today and get it out >> by tomorrow. Thanks for dealing with me...I have been around kernel >> code for awhile but posting official patches to linux is new to me. >> -Ack > > I've just learned by watching the lists over the last several years. > Well, and actually writing the odd patch here and there. > > It occurs to me that I have been focusing on the code and not the > changelog. I'll send a seperate reply on that thread shortly. > > One more thing I'll state here ... as per the perfect patch guidelines, > it is preferred that the meta-discussion about the patch and its > history go after the change log, seperated from it by a line of "--- " > so that the patch application scripts can just extract the email > subject as the title and through the firsst line of --- as the commit > log. (This saves some manual editing). > > [1] http://kernelnewbies.org/UpstreamMerge > > milton > > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-08-27 17:34 ` Kok, Auke @ 2007-08-27 18:32 ` David Acker 0 siblings, 0 replies; 51+ messages in thread From: David Acker @ 2007-08-27 18:32 UTC (permalink / raw) To: Kok, Auke Cc: e1000-devel, netdev, Jesse Brandeburg, Milton Miller, Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik Kok, Auke wrote: > Milton Miller wrote: >> On Jun 5, 2007, at 8:34 AM, David Acker wrote: > > David, Milton, > > This was the last communication on-topic for the proposed changes to fix > e100 on ARM. We're holding our breath here waiting for more, and would > love to hear that this issue and fixes hasn't died off. > > Thanks, > > Auke I am sorry folks, this is my fault. I got pulled onto a fire on one of our other products. I have only recently come back to working on our product that uses the e100 on ARM. Based on my current time available to finish cleaning up this patch, I should have a new version available by the end of this week. -Ack > > > >>> Milton Miller wrote: >>>> On Jun 1, 2007, at 3:45 PM, David Acker wrote: >>>>> Ok, I took a stab at coding and testing these ideas. Below is a >>>>> patch against 2.6.22-rc3. >>>>> Let me know what you think. >>>> I think you got most of the ideas. As Auke noted, your coding >>>> style is showing again. And your mailer again munged whitespace >>>> (fixed by s/^<space><space>/<space>/ s/^$/<space>/). >>> Sorry about the coding style. I instinctively followed what was >>> there instead of kernel coding convention. I will look into how >>> whitespace is getting screwed up. >> >> I have to watch my coding style too (I like to indent the closing brace). >> >> At least the white space damage seems to be reversable. More than I >> can say for this mailer. >> >>>>> Find a buffer that is complete with rx->el not set and rx->s0 set. >>>>> It appears that hardware can read the rfd's el-bit, then >>>>> software can clear the rfd el-bit and set the rfd size, and then >>>>> hardware can come in and read the size. >>>> Yes, since the size is after the EL flag in the descriptor, this can >>>> happen since the pci read is not atomic. >>>>> I am reading the status back, although I don't think that I have to >>>>> in this instance. >>>> Actually, you are reading it when the rfd still has EL set. Since >>>> the cpu will never encounter that case, the if condition is never >>>> satisfied. >>> In my tests, every time I found a completed rfd with the el-bit set, >>> the receiver was in the out of resources state. >> >> Yes, if the EL was set, it would be a real hard race to find the >> completed packet with EL but not RNR. I was trying to refer to where >> you find a completed packet and then check for EL in the RFD. That is >> what I was claiming can not be observed by the cpu (unless the card >> writes the EL bit back, and not just the status u16). >> >> If the unless ... above is true, then please put a comment that the >> device can write RFD->EL back to 1 if we raced. >> >> >>>> How about creating a state unknown, for when we think we should >>>> check the device if its running. >>>> If we are in this state and then encounter a received packet without >>>> s0 set, we can set it back >>>> to running. We set it when we rx a packet with s0 set. >>>> We then move both io_status reads to the caller. >>> I can look into that as I clean this up. >>> >>> >>>>> I am testing a version of this code patched against 2.6.18.4 on my >>>>> PXA 255 based system. I will let you all know how it goes. >>> The testing I did so far did well. I will try to get some more going >>> tonight, hopefully on a cleaned up patch. >> >> Good to hear our expectiations match reality. >> >>> >>>> I'm assuming this is why the cleanup of the receiver start to always >>>> start on rx_to_clean got dropped again. :-) >>> Yep. I will get that in the next patch. >> >> Ok. >> >>>> Also, I would like a few sentences in the Driver Operation section >>>> IV Receive big comment. Something like >>>> In order to keep updates to the RFD link field from colliding with >>>> hardware writes to mark packets complete, we use the feature that >>>> hardware will not write to a size 0 descriptor and mark the previous >>>> packet as end-of-list (EL). After updating the link, we remove EL >>>> and only then restore the size such that hardware may use the >>>> previous-to-end RFD. >>>> at the end of the first paragraph, and insert software before "no >>>> locking is required" in the second. >>> Sounds good to me. >>> >>> I will see if I can get into a cleaned up patch today and get it out >>> by tomorrow. Thanks for dealing with me...I have been around kernel >>> code for awhile but posting official patches to linux is new to me. >>> -Ack >> >> I've just learned by watching the lists over the last several years. >> Well, and actually writing the odd patch here and there. >> >> It occurs to me that I have been focusing on the code and not the >> changelog. I'll send a seperate reply on that thread shortly. >> >> One more thing I'll state here ... as per the perfect patch >> guidelines, it is preferred that the meta-discussion about the patch >> and its history go after the change log, seperated from it by a line >> of "--- " so that the patch application scripts can just extract the >> email subject as the title and through the firsst line of --- as the >> commit log. (This saves some manual editing). >> >> [1] http://kernelnewbies.org/UpstreamMerge >> >> milton >> >> - >> To unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html ------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 13:34 ` David Acker 2007-06-05 16:14 ` Milton Miller @ 2007-06-05 16:14 ` Milton Miller 2007-06-05 17:27 ` Kok, Auke 1 sibling, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-06-05 16:14 UTC (permalink / raw) To: David Acker Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik First, a question especially to Auke and Jeff: Since this patch both reverts the broken change that is currently in -rc and creates the fixed driver, I'm not sure I like the subject stating "on ARM", although that is the feature of the rewrite, and was the intent of merging the previous patch. This is actually its a fix for all systems relative to current, including those where dma is not cache coherent, (unlike a simple revert). Should we just put a comment about reverting the previous patch early in the change log? Something like this: Fix the e100 receiver handling, supporting cache incoherent DMA. Discard the concept of setting the S (suspend) bit with the EL bit introduced in commit d52df4a35af569071fda3f4eb08e47cc7023f094. In addition to it not setting either bit, the hardware doesn't work that way. Thoughts? Here is the changelog portion of the latest patch (quoted), with my comments thrown in: > On the ARM, their is a race condition between software allocating a > new receive On systems that have cache incoherent DMA, including ARM, > buffer and hardware writing into a buffer. The two race on touching > the last > Receive Frame Descriptor (RFD). It has its el-bit set and its next > link equal > to 0. When hardware encounters this buffer it attempts to write data > to it > and then update Status Word bits and Actual Count in the RFD. At the > same time > software may try to clear the el-bit and set the link address to a new > buffer. > Since the entire RFD is once cache-line, the two write operations can > collide. This can lead to the receive unit stalling or freed receive > buffers > getting written to. This can lead to the receive unit stalling or interpreting random memory as its receive area. > > The fix is to set the el-bit on and the size to 0 on the next to last > buffer > in the chain. When the hardware encounters this buffer it stops and > does > not write to it at all. The hardware issues an RNR interrupt with the > receive unit in the No Resources state. When software allocates > buffers, > it can update the tail of the list when either it knows the hardware > has stopped or the previous to the new one to mark marked. Software can write to the tail of the list because it knows hardware will stop on the previous descriptor that was marked as the end of list. > Once it has a new next to last buffer prepared, it can clear the el-bit > and set the size on the previous one. The race on this buffer is safe > since the link already points to a valid next buffer. and we can handle the race setting the size (assuming aligned 16 bit writes are atomic with respect to the DMA read). This paragraph changed from third person (the software or hardware) to second person (we). > We keep flags > in our software descriptor to note if the el bit is set and if the size > was 0. When we clear the RFD's el bit and set its size, we also clear > the el flag but we leave the size was 0 bit set. This was we can find > this buffer again later. This way software can identify them when the race may have occurred when cleaning the ring. On these descriptors, it looks ahead and if the next one is complete then hardware must have skipped the current one. Logic is added to prevent two packets in a row being marked while the receiver is running to avoid running in lockstep with the hardware and thereby limiting the required lookahead. > If the hardware sees the el-bit cleared without the size set, it will > move on to the next buffer and skip this one. If it sees > the size set but the el-bit still set, it will complete that buffer > and then RNR interrupt and wait. These sentences should be moved to the mention of the race above to reducing mixing descriptions of the hardware and the software. milton ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 16:14 ` Milton Miller @ 2007-06-05 17:27 ` Kok, Auke 2007-06-05 17:39 ` Jeff Garzik 0 siblings, 1 reply; 51+ messages in thread From: Kok, Auke @ 2007-06-05 17:27 UTC (permalink / raw) To: Milton Miller, Jeff Garzik Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher, David Acker Milton Miller wrote: > First, a question especially to Auke and Jeff: > > Since this patch both reverts the broken change that is currently in > -rc and creates the fixed driver, I'm not sure I like the subject > stating "on ARM", although that is the feature of the rewrite, and was > the intent of merging the previous patch. This is actually its a fix > for all systems relative to current, including those where dma is not > cache coherent, (unlike a simple revert). > > Should we just put a comment about reverting the previous patch early > in the change log? yes > Something like this: > > Fix the e100 receiver handling, supporting cache incoherent DMA. > > Discard the concept of setting the S (suspend) bit with the EL bit > introduced in commit d52df4a35af569071fda3f4eb08e47cc7023f094. In > addition to it not setting either bit, the hardware doesn't work that > way. > > > Thoughts? the same comment I made about the coding style counts for this too: I will clean up the patch and gladly adjust the topic, which in this case seems the right thing to do. I am too grateful that you guys are digging into this so deeply to send you back with comments on style - I'll gladly fix that up :) > Here is the changelog portion of the latest patch (quoted), with my > comments thrown in: OK, I will buffer this info and make sure this gets picked up on the final version. this opens up another question: We need to make sure that now that we're getting closer to 2.6.22 we don't end up killing e100 in it. Should we drop the current fixes in it to be on the safe side and aim for 2.6.23? I would hate to see an untested codepath breaking e100 on something like ppc or mips... that will be very painful Jeff, your thoughts on that? Auke ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 17:27 ` Kok, Auke @ 2007-06-05 17:39 ` Jeff Garzik 2007-06-05 17:42 ` David Acker 2007-06-05 17:43 ` Kok, Auke 0 siblings, 2 replies; 51+ messages in thread From: Jeff Garzik @ 2007-06-05 17:39 UTC (permalink / raw) To: Kok, Auke Cc: Milton Miller, Jeff Garzik, David Acker, netdev, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote: > We need to make sure that now that we're getting closer to 2.6.22 we don't > end up killing e100 in it. Should we drop the current fixes in it to be on > the safe side and aim for 2.6.23? I would hate to see an untested codepath > breaking e100 on something like ppc or mips... that will be very painful I certainly agree with this assessment... I've been wondering if, based on all this recent work, we should revert the s-bit stuff and wait for 2.6.23. Jeff ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 17:39 ` Jeff Garzik @ 2007-06-05 17:42 ` David Acker 2007-06-05 17:43 ` Kok, Auke 1 sibling, 0 replies; 51+ messages in thread From: David Acker @ 2007-06-05 17:42 UTC (permalink / raw) To: Jeff Garzik Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Milton Miller, Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik Jeff Garzik wrote: > On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote: >> We need to make sure that now that we're getting closer to 2.6.22 we don't >> end up killing e100 in it. Should we drop the current fixes in it to be on >> the safe side and aim for 2.6.23? I would hate to see an untested codepath >> breaking e100 on something like ppc or mips... that will be very painful > > I certainly agree with this assessment... > > I've been wondering if, based on all this recent work, we should revert > the s-bit stuff and wait for 2.6.23. I think so. It will be no worse then it was and make the patch that fixes it clearer. -Ack ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 17:39 ` Jeff Garzik 2007-06-05 17:42 ` David Acker @ 2007-06-05 17:43 ` Kok, Auke 2007-06-05 17:56 ` Milton Miller 1 sibling, 1 reply; 51+ messages in thread From: Kok, Auke @ 2007-06-05 17:43 UTC (permalink / raw) To: Jeff Garzik Cc: Milton Miller, Jeff Garzik, David Acker, netdev, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman Jeff Garzik wrote: > On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote: >> We need to make sure that now that we're getting closer to 2.6.22 we don't >> end up killing e100 in it. Should we drop the current fixes in it to be on >> the safe side and aim for 2.6.23? I would hate to see an untested codepath >> breaking e100 on something like ppc or mips... that will be very painful > > I certainly agree with this assessment... > > I've been wondering if, based on all this recent work, we should revert > the s-bit stuff and wait for 2.6.23. Yes, that's my point. If Milton and David agree I think we should do so immediately. If so, do you want me to write a revert-patch or do you have some magic to do that for me? Auke ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 17:43 ` Kok, Auke @ 2007-06-05 17:56 ` Milton Miller 2007-06-05 23:33 ` Kok, Auke 0 siblings, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-06-05 17:56 UTC (permalink / raw) To: Kok, Auke Cc: Jeff Garzik, Jeff Garzik, e1000-devel, David Acker, netdev, Jesse Brandeburg, Jeff Kirsher, Scott Feldman, John Ronciak On Jun 5, 2007, at 12:43 PM, Kok, Auke wrote: > Jeff Garzik wrote: >> On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote: >>> We need to make sure that now that we're getting closer to 2.6.22 we >>> don't end up killing e100 in it. Should we drop the current fixes in >>> it to be on the safe side and aim for 2.6.23? I would hate to see an >>> untested codepath breaking e100 on something like ppc or mips... >>> that will be very painful >> I certainly agree with this assessment... >> I've been wondering if, based on all this recent work, we should >> revert >> the s-bit stuff and wait for 2.6.23. > > Yes, that's my point. If Milton and David agree I think we should do > so immediately. We definitely need something other than what is in now. > If so, do you want me to write a revert-patch or do you have some > magic to do that for me? > The simple git revert won't work because there have been other changes (ioread for instance) that conflict. milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 17:56 ` Milton Miller @ 2007-06-05 23:33 ` Kok, Auke 2007-06-05 23:44 ` Jeff Garzik 2007-06-06 2:26 ` Kok, Auke 0 siblings, 2 replies; 51+ messages in thread From: Kok, Auke @ 2007-06-05 23:33 UTC (permalink / raw) To: Jeff Garzik, Jeff Garzik Cc: e1000-devel, netdev, Jesse Brandeburg, Milton Miller, Scott Feldman, John Ronciak, Jeff Kirsher, David Acker Milton Miller wrote: > On Jun 5, 2007, at 12:43 PM, Kok, Auke wrote: > >> Jeff Garzik wrote: >>> On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote: >>>> We need to make sure that now that we're getting closer to 2.6.22 we >>>> don't end up killing e100 in it. Should we drop the current fixes in >>>> it to be on the safe side and aim for 2.6.23? I would hate to see an >>>> untested codepath breaking e100 on something like ppc or mips... >>>> that will be very painful >>> I certainly agree with this assessment... >>> I've been wondering if, based on all this recent work, we should >>> revert >>> the s-bit stuff and wait for 2.6.23. >> Yes, that's my point. If Milton and David agree I think we should do >> so immediately. > > We definitely need something other than what is in now. > > >> If so, do you want me to write a revert-patch or do you have some >> magic to do that for me? >> > > The simple git revert won't work because there have been other changes > (ioread for instance) that conflict. Hmm git-revert seems to do the job right. I checked it with git-show | patch -p1 -R and the results look OK. The two patches on top of the one we want to revert are unrelated enough to apply (manually it shows some fuzz, but otherwise it's OK). Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to revert the following patch for now: --- commit d52df4a35af569071fda3f4eb08e47cc7023f094 Author: Scott Feldman <sfeldma@pobox.com> Date: Wed Nov 9 02:18:52 2005 -0500 [netdrvr e100] experiment with doing RX in a similar manner to eepro100 I was going to say that eepro100's speedo_rx_link() does the same DMA abuse as e100, but then I noticed one little detail: eepro100 sets both EL (end of list) and S (suspend) bits in the RFD as it chains it to the RFD list. e100 was only setting the EL bit. Hmmm, that's interesting. That means that if HW reads a RFD with the S-bit set, it'll process that RFD and then suspend the receive unit. The receive unit will resume when SW clears the S-bit. There is no need for SW to restart the receive unit. Which means a lot of the receive unit state tracking code in the driver goes away. So here's a patch against 2.6.14. (Sorry for inlining it; the mailer I'm using now will mess with the word wrap). I can't test this on XScale (unless someone has an e100 module for Gumstix :) . It should be doing exactly what eepro100 does with RFDs. I don't believe this change will introduce a performance hit because the S-bit and EL-bit go hand-in-hand meaning if we're going to suspend because of the S- bit, we're on the last resource anyway, so we'll have to wait for SW to replenish. (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 commit) --- Cheers, Auke ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 23:33 ` Kok, Auke @ 2007-06-05 23:44 ` Jeff Garzik 2007-06-06 2:26 ` Kok, Auke 1 sibling, 0 replies; 51+ messages in thread From: Jeff Garzik @ 2007-06-05 23:44 UTC (permalink / raw) To: Kok, Auke Cc: Jeff Garzik, Milton Miller, e1000-devel, David Acker, netdev, Jesse Brandeburg, Jeff Kirsher, Scott Feldman, John Ronciak On Tue, Jun 05, 2007 at 04:33:12PM -0700, Kok, Auke wrote: > Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to > revert the following patch for now: Can you do me one more favor, and write a short paragraph describing why it is getting reverted, to paste into the commit message? You can do this yourself and have me git-pull the result, or I can do the revert and paste the paragraph from email, your choice. Jeff ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-05 23:33 ` Kok, Auke 2007-06-05 23:44 ` Jeff Garzik @ 2007-06-06 2:26 ` Kok, Auke 2007-06-06 9:28 ` Milton Miller 1 sibling, 1 reply; 51+ messages in thread From: Kok, Auke @ 2007-06-06 2:26 UTC (permalink / raw) To: Jeff Garzik, Jeff Garzik Cc: e1000-devel, netdev, Jesse Brandeburg, Milton Miller, Scott Feldman, John Ronciak, Jeff Kirsher, David Acker Kok, Auke wrote: > Milton Miller wrote: >> On Jun 5, 2007, at 12:43 PM, Kok, Auke wrote: >> >>> Jeff Garzik wrote: >>>> On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote: >>>>> We need to make sure that now that we're getting closer to 2.6.22 we >>>>> don't end up killing e100 in it. Should we drop the current fixes in >>>>> it to be on the safe side and aim for 2.6.23? I would hate to see an >>>>> untested codepath breaking e100 on something like ppc or mips... >>>>> that will be very painful >>>> I certainly agree with this assessment... >>>> I've been wondering if, based on all this recent work, we should >>>> revert >>>> the s-bit stuff and wait for 2.6.23. >>> Yes, that's my point. If Milton and David agree I think we should do >>> so immediately. >> We definitely need something other than what is in now. >> >> >>> If so, do you want me to write a revert-patch or do you have some >>> magic to do that for me? >>> >> The simple git revert won't work because there have been other changes >> (ioread for instance) that conflict. > > Hmm git-revert seems to do the job right. I checked it with git-show | patch -p1 > -R and the results look OK. The two patches on top of the one we want to revert > are unrelated enough to apply (manually it shows some fuzz, but otherwise it's OK). > > Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to revert the > following patch for now: > > --- > commit d52df4a35af569071fda3f4eb08e47cc7023f094 > Author: Scott Feldman <sfeldma@pobox.com> > Date: Wed Nov 9 02:18:52 2005 -0500 > > [netdrvr e100] experiment with doing RX in a similar manner to eepro100 > > I was going to say that eepro100's speedo_rx_link() does the same DMA > abuse as e100, but then I noticed one little detail: eepro100 sets both > EL (end of list) and S (suspend) bits in the RFD as it chains it to the > RFD list. e100 was only setting the EL bit. Hmmm, that's interesting. > That means that if HW reads a RFD with the S-bit set, it'll process > that RFD and then suspend the receive unit. The receive unit will > resume when SW clears the S-bit. There is no need for SW to restart > the receive unit. Which means a lot of the receive unit state tracking > code in the driver goes away. > > So here's a patch against 2.6.14. (Sorry for inlining it; the mailer > I'm using now will mess with the word wrap). I can't test this on > XScale (unless someone has an e100 module for Gumstix :) . It should > be doing exactly what eepro100 does with RFDs. I don't believe this > change will introduce a performance hit because the S-bit and EL-bit go > hand-in-hand meaning if we're going to suspend because of the S- bit, > we're on the last resource anyway, so we'll have to wait for SW to > replenish. > (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 commit) > --- > A little bit more is needed to explain why we're reverting it for now. Jeff, please insert this into the revert commit. Auke -- This patch attempted to fix e100 for non-cache coherent memory architectures by using the cb style code that eepro100 had and using the EL and s bits from the RFD list. Unfortunately the hardware doesn't work exactly like this and therefore this patch actually breaks e100 on those systems. Reverting the change brings it back to the previously known good state for 2.6.22. The pending rewrite in progress to this code can then be safely merged later. Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> --- ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-06 2:26 ` Kok, Auke @ 2007-06-06 9:28 ` Milton Miller 2007-06-11 15:58 ` Milton Miller 0 siblings, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-06-06 9:28 UTC (permalink / raw) To: Kok, Auke Cc: Jeff Garzik, Jeff Garzik, e1000-devel, David Acker, netdev, Jesse Brandeburg, Jeff Kirsher, Scott Feldman, John Ronciak On Jun 5, 2007, at 9:26 PM, Kok, Auke wrote: > Kok, Auke wrote: >>> >> Hmm git-revert seems to do the job right. I checked it with git-show >> | patch -p1 -R and the results look OK. The two patches on top of the >> one we want to revert are unrelated enough to apply (manually it >> shows some fuzz, but otherwise it's OK). >> Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to >> revert the following patch for now: >> --- >> commit d52df4a35af569071fda3f4eb08e47cc7023f094 >> Author: Scott Feldman <sfeldma@pobox.com> >> Date: Wed Nov 9 02:18:52 2005 -0500 >> [netdrvr e100] experiment with doing RX in a similar manner to >> eepro100 >> I was going to say that eepro100's speedo_rx_link() does the >> same DMA >> abuse as e100, but then I noticed one little detail: eepro100 >> sets both >> EL (end of list) and S (suspend) bits in the RFD as it chains it >> to the >> RFD list. e100 was only setting the EL bit. Hmmm, that's >> interesting. >> That means that if HW reads a RFD with the S-bit set, it'll >> process >> that RFD and then suspend the receive unit. The receive unit >> will >> resume when SW clears the S-bit. There is no need for SW to >> restart >> the receive unit. Which means a lot of the receive unit state >> tracking >> code in the driver goes away. >> So here's a patch against 2.6.14. (Sorry for inlining it; the >> mailer >> I'm using now will mess with the word wrap). I can't test this >> on >> XScale (unless someone has an e100 module for Gumstix :) . It >> should >> be doing exactly what eepro100 does with RFDs. I don't believe >> this >> change will introduce a performance hit because the S-bit and >> EL-bit go >> hand-in-hand meaning if we're going to suspend because of the S- >> bit, >> we're on the last resource anyway, so we'll have to wait for SW >> to >> replenish. >> (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 >> commit) >> --- > > A little bit more is needed to explain why we're reverting it for now. > Jeff, please insert this into the revert commit. > > Auke > > -- > This patch attempted to fix e100 for non-cache coherent memory > architectures by using the cb style code that eepro100 had and using > the EL and s bits from the RFD list. Unfortunately the hardware > doesn't work exactly like this and therefore this patch actually > breaks e100 on those systems. on all systems. (Both the &| typo and the removed restart logic). > Reverting the change brings it back to the previously known good state > for 2.6.22. The pending rewrite in progress to this code can then be > safely merged later. > > Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> > --- milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-06 9:28 ` Milton Miller @ 2007-06-11 15:58 ` Milton Miller 2007-06-15 14:39 ` Jeff Garzik 0 siblings, 1 reply; 51+ messages in thread From: Milton Miller @ 2007-06-11 15:58 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, Jeff Garzik, e1000-devel, David Acker, netdev, Jesse Brandeburg, Kok, Auke, Jeff Kirsher, Scott Feldman, John Ronciak On Jun 6, 2007, at 4:28 AM, Milton Miller wrote: > > On Jun 5, 2007, at 9:26 PM, Kok, Auke wrote: > >> Kok, Auke wrote: >>>> >>> Hmm git-revert seems to do the job right. I checked it with git-show >>> | patch -p1 -R and the results look OK. The two patches on top of >>> the one we want to revert are unrelated enough to apply (manually it >>> shows some fuzz, but otherwise it's OK). >>> Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` >>> to revert the following patch for now: >>> --- >>> commit d52df4a35af569071fda3f4eb08e47cc7023f094 >>> Author: Scott Feldman <sfeldma@pobox.com> >>> Date: Wed Nov 9 02:18:52 2005 -0500 >>> [netdrvr e100] experiment with doing RX in a similar manner to >>> eepro100 >>> I was going to say that eepro100's speedo_rx_link() does the >>> same DMA >>> abuse as e100, but then I noticed one little detail: eepro100 >>> sets both >>> EL (end of list) and S (suspend) bits in the RFD as it chains >>> it to the >>> RFD list. e100 was only setting the EL bit. Hmmm, that's >>> interesting. >>> That means that if HW reads a RFD with the S-bit set, it'll >>> process >>> that RFD and then suspend the receive unit. The receive unit >>> will >>> resume when SW clears the S-bit. There is no need for SW to >>> restart >>> the receive unit. Which means a lot of the receive unit state >>> tracking >>> code in the driver goes away. >>> So here's a patch against 2.6.14. (Sorry for inlining it; the >>> mailer >>> I'm using now will mess with the word wrap). I can't test this >>> on >>> XScale (unless someone has an e100 module for Gumstix :) . It >>> should >>> be doing exactly what eepro100 does with RFDs. I don't believe >>> this >>> change will introduce a performance hit because the S-bit and >>> EL-bit go >>> hand-in-hand meaning if we're going to suspend because of the >>> S- bit, >>> we're on the last resource anyway, so we'll have to wait for SW >>> to >>> replenish. >>> (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 >>> commit) >>> --- >> >> A little bit more is needed to explain why we're reverting it for >> now. Jeff, please insert this into the revert commit. >> >> Auke >> >> -- >> This patch attempted to fix e100 for non-cache coherent memory >> architectures by using the cb style code that eepro100 had and using >> the EL and s bits from the RFD list. Unfortunately the hardware >> doesn't work exactly like this and therefore this patch actually >> breaks e100 on those systems. > > on all systems. (Both the &| typo and the removed restart logic). > >> Reverting the change brings it back to the previously known good >> state for 2.6.22. The pending rewrite in progress to this code can >> then be safely merged later. >> >> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> >> Jeff I don't see this in netdev upstream-fixes. Are you waiting on Auke to respond to this? milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-06-11 15:58 ` Milton Miller @ 2007-06-15 14:39 ` Jeff Garzik 0 siblings, 0 replies; 51+ messages in thread From: Jeff Garzik @ 2007-06-15 14:39 UTC (permalink / raw) To: Milton Miller Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman, John Ronciak, Jeff Kirsher, David Acker, Jeff Garzik Milton Miller wrote: > > On Jun 6, 2007, at 4:28 AM, Milton Miller wrote: > >> >> On Jun 5, 2007, at 9:26 PM, Kok, Auke wrote: >> >>> Kok, Auke wrote: >>>>> >>>> Hmm git-revert seems to do the job right. I checked it with git-show >>>> | patch -p1 -R and the results look OK. The two patches on top of >>>> the one we want to revert are unrelated enough to apply (manually it >>>> shows some fuzz, but otherwise it's OK). >>>> Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` >>>> to revert the following patch for now: >>>> --- >>>> commit d52df4a35af569071fda3f4eb08e47cc7023f094 >>>> Author: Scott Feldman <sfeldma@pobox.com> >>>> Date: Wed Nov 9 02:18:52 2005 -0500 >>>> [netdrvr e100] experiment with doing RX in a similar manner to >>>> eepro100 >>>> I was going to say that eepro100's speedo_rx_link() does the >>>> same DMA >>>> abuse as e100, but then I noticed one little detail: eepro100 >>>> sets both >>>> EL (end of list) and S (suspend) bits in the RFD as it chains >>>> it to the >>>> RFD list. e100 was only setting the EL bit. Hmmm, that's >>>> interesting. >>>> That means that if HW reads a RFD with the S-bit set, it'll >>>> process >>>> that RFD and then suspend the receive unit. The receive unit >>>> will >>>> resume when SW clears the S-bit. There is no need for SW to >>>> restart >>>> the receive unit. Which means a lot of the receive unit state >>>> tracking >>>> code in the driver goes away. >>>> So here's a patch against 2.6.14. (Sorry for inlining it; the >>>> mailer >>>> I'm using now will mess with the word wrap). I can't test this on >>>> XScale (unless someone has an e100 module for Gumstix :) . It >>>> should >>>> be doing exactly what eepro100 does with RFDs. I don't believe >>>> this >>>> change will introduce a performance hit because the S-bit and >>>> EL-bit go >>>> hand-in-hand meaning if we're going to suspend because of the >>>> S- bit, >>>> we're on the last resource anyway, so we'll have to wait for SW >>>> to >>>> replenish. >>>> (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 >>>> commit) >>>> --- >>> >>> A little bit more is needed to explain why we're reverting it for >>> now. Jeff, please insert this into the revert commit. >>> >>> Auke >>> >>> -- >>> This patch attempted to fix e100 for non-cache coherent memory >>> architectures by using the cb style code that eepro100 had and using >>> the EL and s bits from the RFD list. Unfortunately the hardware >>> doesn't work exactly like this and therefore this patch actually >>> breaks e100 on those systems. >> >> on all systems. (Both the &| typo and the removed restart logic). >> >>> Reverting the change brings it back to the previously known good >>> state for 2.6.22. The pending rewrite in progress to this code can >>> then be safely merged later. >>> >>> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> >>> > > Jeff I don't see this in netdev upstream-fixes. Are you waiting on Auke > to respond to this? It should now be upstream, as of a day or two ago. Sorry for the delaying, was travelling, and wound up accidentally losing Auke's commit description. Jeff ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-24 5:26 ` Milton Miller 2007-05-24 11:21 ` Milton Miller @ 2007-05-24 12:44 ` David Acker 1 sibling, 0 replies; 51+ messages in thread From: David Acker @ 2007-05-24 12:44 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, netdev, e1000-devel, Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman, Kok, Auke Milton Miller wrote: > On May 23, 2007, at 4:32 PM, David Acker wrote: >> Milton Miller wrote: >>> My current reading of the manual is that the C bit will not be >>> set on an RFD that is size 0. It goes on to processes EL and >>> S, and decides to stop and interrupt RNR or suspend, or just >>> go to the next packet. >> I double checked this with a quick experiment and it appears you are >> correct. >> >> What about if we always did the following: >> set the size: >> sync(); >> clear el-bit >> sync() >> >> Then if the hardware sees just the size set, the packet completes but >> with the el-bit and we know we need to restart since it completed. >> If it sees the size == 0, and the el bit set, it stops and RNR >> interrupts. > > I think this is exposed to a hole and a race: we don't know if the > hardware > read the RFD before we set the size or after, just that it was before > the EL > bit was cleared. If it read it before the size was set, then it will not > set the C bit. If it reads it after the size is set, it will complete it. Yep...I too got sidetracked! My test time got lost to two 2 month old twins needing to be fed or else! :-) > > For coherent DMA we can always observe the C bit. But for the > incoherent DMA > case, our store to clear the EL bit may overwrite the dma from the > device to > the beginning of the packet, or the write to EOF, F, and size, and/or the > write to C, OK, and status bits to tell us its done. In the worst case, we > would overwrite the beginning of the data but catch the C bit and even the > actual size, and therefore would receive corrupted data. > > We can only detect the hardware went RNR when it does so or decide we > won the > race when it receives and completes the next frame. Yes, I agree. >> When we find a buffer that is not completed but has the el-bit set, we >> read the status byte of the status control block to see if the RU is >> in the no resources state. If it isn't, it means that we found that >> buffer before the hardware did and thus need to wait for it. We will >> either find it on the next poll or enable interrupts and get told >> about it by hardware. >> >> What do you think? > > I think the second part is good ... Cool. That part seemed to work well in my tests. I will reply to your next mail to discuss your plan so that I get it all in one message. -Ack ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) 2007-05-23 14:02 ` Milton Miller 2007-05-23 21:32 ` David Acker @ 2007-05-24 4:13 ` Milton Miller 1 sibling, 0 replies; 51+ messages in thread From: Milton Miller @ 2007-05-24 4:13 UTC (permalink / raw) To: Milton Miller Cc: Jeff Garzik, e1000-devel, David Acker, netdev, Jesse Brandeburg, Kok, Auke, Jeff Kirsher, Scott Feldman, John Ronciak Auke Kok pointed out I had left an unfinished thought this morning ... well, here's a completion, but I will mostly think about David's latest proposal. I think I was debating proposing this, then got side tracked then hit send. On May 23, 2007, at 9:02 AM, Milton Miller wrote: > >> What if we just fixed the alignment of rfd to make sure it is cache >> aligned? Then we know our syncs will be a single cache flush. > > That doesn't help in the coherent dma case; the device can still > read in between, and we need to handle the !EL size 0 case. > > Actually, I thought about it some more after yesterdays note. > We can set the size !0 after making sure the device sees EL > (wmb and pci_sync_for_device), as long as we remember this > buffer used to be size 0 when doing RX clean, and look ahead > for another complete packet as long as the size was ever 0. > > My proposal was just to use "size is still 0" as that flag, > wasting the space allocated for that skb. > > Another aproach is to always allocate a rfd-only packet > at the beginning of the loop, and inserting th ... inserting this dummy skb before the last successfuly allocated one, moving the links around. This has the benefit of not allocating a 1500+32 byte buffer that we don't intend to fill, but is a lot more complicated insert and move skb around when filling. We now return you to your previously scheduled program. Now back to thinking about David's proposal, and my thoughts about how I might do the allocate non-zero size. milton ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] e100 rx: or s and el bits 2007-05-01 11:24 [PATCH] e100 rx: or s and el bits Milton Miller 2007-05-01 15:01 ` David Acker @ 2007-05-01 15:21 ` Kok, Auke 1 sibling, 0 replies; 51+ messages in thread From: Kok, Auke @ 2007-05-01 15:21 UTC (permalink / raw) To: Milton Miller Cc: Scott Feldman, Jeff Garzik, e1000-devel, netdev, linux-kernel, Andrew Morton, John Ronciak, Jesse Brandeburg, Jeff Kirsher Milton Miller wrote: > In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description > talks about emulating another driver by setting addtional bits and > the being unable to test when submitted. Seeing the & operator to > set more bits made me suspicious, and indeed the bits are defined > in positive logic: > > cb_s = 0x4000, > cb_el = 0x8000, > > So anding those together would be 0. I'm guessing they should > be or'd, but don't have hardware here to test, much like the > committed patch. In fact, I'll let someone else do the compile > test too. I'll update the comment. > > (It looks like the end of list and s bits would not be set > in the template nor cleared when linking in recieve skbs, so > as long as the kernel can keep up with the 100Mb card we > wouldn't see the adapter go off the linked list, possibly > explaining any successful use of this patch written against > 2.6.14). yes, that definately doesn't seem right. I'll try to run this and watch out for David Ackers report as well. Thanks, Auke > > Signed-off-by: Milton Miller <miltonm@bga.com> > > > --- linux-2.6/drivers/net/e100.c.orig 2007-05-01 05:19:17.000000000 -0500 > +++ linux-2.6/drivers/net/e100.c 2007-05-01 05:22:14.000000000 -0500 > @@ -947,7 +947,7 @@ static void e100_get_defaults(struct nic > ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i)); > > /* Template for a freshly allocated RFD */ > - nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s); > + nic->blank_rfd.command = cpu_to_le16(cb_el | cb_s); > nic->blank_rfd.rbd = 0xFFFFFFFF; > nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN); > > @@ -1769,13 +1769,13 @@ static int e100_rx_alloc_skb(struct nic > } > > /* Link the RFD to end of RFA by linking previous RFD to > - * this one, and clearing EL bit of previous. */ > + * this one, and clearing EL and S bit of previous. */ > if(rx->prev->skb) { > struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data; > put_unaligned(cpu_to_le32(rx->dma_addr), > (u32 *)&prev_rfd->link); > wmb(); > - prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s); > + prev_rfd->command &= ~cpu_to_le16(cb_el | cb_s); > pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr, > sizeof(struct rfd), PCI_DMA_TODEVICE); > } ^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2007-08-27 18:32 UTC | newest] Thread overview: 51+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-05-01 11:24 [PATCH] e100 rx: or s and el bits Milton Miller 2007-05-01 15:01 ` David Acker 2007-05-02 20:21 ` David Acker 2007-05-04 21:43 ` David Acker 2007-05-06 6:36 ` Milton Miller 2007-05-07 15:27 ` David Acker 2007-05-14 18:26 ` [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) David Acker 2007-05-18 1:54 ` Jeff Garzik 2007-05-18 3:47 ` Kok, Auke 2007-05-18 14:07 ` David Acker 2007-05-18 14:20 ` David Acker 2007-05-18 15:29 ` Kok, Auke 2007-05-18 15:47 ` David Acker 2007-05-18 15:59 ` Kok, Auke 2007-05-18 17:11 ` David Acker 2007-05-18 17:47 ` Kok, Auke 2007-05-21 17:35 ` Milton Miller 2007-05-21 17:45 ` Kok, Auke 2007-05-22 16:51 ` Milton Miller 2007-05-22 22:07 ` David Acker 2007-05-23 14:02 ` Milton Miller 2007-05-23 21:32 ` David Acker 2007-05-24 5:26 ` Milton Miller 2007-05-24 11:21 ` Milton Miller 2007-05-24 12:51 ` David Acker 2007-05-24 14:25 ` Milton Miller 2007-05-29 15:58 ` David Acker 2007-05-30 8:26 ` Milton Miller 2007-06-01 20:45 ` David Acker 2007-06-01 21:13 ` Jeff Garzik 2007-06-01 22:13 ` Kok, Auke 2007-06-04 9:03 ` Milton Miller 2007-06-05 13:34 ` David Acker 2007-06-05 16:14 ` Milton Miller 2007-08-27 17:34 ` Kok, Auke 2007-08-27 18:32 ` David Acker 2007-06-05 16:14 ` Milton Miller 2007-06-05 17:27 ` Kok, Auke 2007-06-05 17:39 ` Jeff Garzik 2007-06-05 17:42 ` David Acker 2007-06-05 17:43 ` Kok, Auke 2007-06-05 17:56 ` Milton Miller 2007-06-05 23:33 ` Kok, Auke 2007-06-05 23:44 ` Jeff Garzik 2007-06-06 2:26 ` Kok, Auke 2007-06-06 9:28 ` Milton Miller 2007-06-11 15:58 ` Milton Miller 2007-06-15 14:39 ` Jeff Garzik 2007-05-24 12:44 ` David Acker 2007-05-24 4:13 ` Milton Miller 2007-05-01 15:21 ` [PATCH] e100 rx: or s and el bits Kok, Auke
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).