[PATCH] e100 rx: or s and el bits

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] e100 rx: or s and el bits
@ 2007-05-01 11:24 Milton Miller
  2007-05-01 15:01 ` David Acker
  2007-05-01 15:21 ` [PATCH] e100 rx: or s and el bits Kok, Auke
  0 siblings, 2 replies; 51+ messages in thread
From: Milton Miller @ 2007-05-01 11:24 UTC (permalink / raw)
  To: Scott Feldman, Jeff Garzik
  Cc: Auke Kok, e1000-devel, netdev, linux-kernel, Jesse Brandeburg,
	John Ronciak, Jeff Kirsher, Andrew Morton

In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description
talks about emulating another driver by setting addtional bits and
the being unable to test when submitted.  Seeing the & operator to
set more bits made me suspicious, and indeed the bits are defined
in positive logic:

	cb_s      = 0x4000,
	cb_el     = 0x8000,

So anding those together would be 0.   I'm guessing they should
be or'd, but don't have hardware here to test, much like the
committed patch.  In fact, I'll let someone else do the compile
test too.  I'll update the comment.

(It looks like the end of list and s bits would not be set
in the template nor cleared when linking in recieve skbs, so
as long as the kernel can keep up with the 100Mb card we 
wouldn't see the adapter go off the linked list, possibly
explaining any successful use of this patch written against
2.6.14).

Signed-off-by: Milton Miller <miltonm@bga.com>


--- linux-2.6/drivers/net/e100.c.orig	2007-05-01 05:19:17.000000000 -0500
+++ linux-2.6/drivers/net/e100.c	2007-05-01 05:22:14.000000000 -0500
@@ -947,7 +947,7 @@ static void e100_get_defaults(struct nic
 		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
 
 	/* Template for a freshly allocated RFD */
-	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
+	nic->blank_rfd.command = cpu_to_le16(cb_el | cb_s);
 	nic->blank_rfd.rbd = 0xFFFFFFFF;
 	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
 
@@ -1769,13 +1769,13 @@ static int e100_rx_alloc_skb(struct nic 
 	}
 
 	/* Link the RFD to end of RFA by linking previous RFD to
-	 * this one, and clearing EL bit of previous.  */
+	 * this one, and clearing EL and S bit of previous.  */
 	if(rx->prev->skb) {
 		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
 		put_unaligned(cpu_to_le32(rx->dma_addr),
 			(u32 *)&prev_rfd->link);
 		wmb();
-		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
+		prev_rfd->command &= ~cpu_to_le16(cb_el | cb_s);
 		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
 			sizeof(struct rfd), PCI_DMA_TODEVICE);
 	}

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] e100 rx: or s and el bits
  2007-05-01 11:24 [PATCH] e100 rx: or s and el bits Milton Miller
@ 2007-05-01 15:01 ` David Acker
  2007-05-02 20:21   ` David Acker
  2007-05-01 15:21 ` [PATCH] e100 rx: or s and el bits Kok, Auke
  1 sibling, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-01 15:01 UTC (permalink / raw)
  To: Milton Miller
  Cc: Scott Feldman, Jeff Garzik, e1000-devel, netdev, linux-kernel,
	Andrew Morton, John Ronciak, Jesse Brandeburg, Jeff Kirsher,
	Auke Kok

Milton Miller wrote:
> In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description
> talks about emulating another driver by setting addtional bits and
> the being unable to test when submitted.  Seeing the & operator to
> set more bits made me suspicious, and indeed the bits are defined
> in positive logic:
> 
> 	cb_s      = 0x4000,
> 	cb_el     = 0x8000,
> 
> So anding those together would be 0.   I'm guessing they should
> be or'd, but don't have hardware here to test, much like the
> committed patch.  In fact, I'll let someone else do the compile
> test too.  I'll update the comment.
> 

I wonder if this worked for me because the hardware also spun on the 
link field being NULL?  Since the RU base is also set to 0, the 
calculated physical address would be 0 as well.  I would imagine if the 
hardware tried to read/write to very low addresses across PCI, there 
would be issues.  I will retest with a small receive pool to try to hit 
the problem.

It seems to apply to a pretty recent git pull from linus's tree.  I 
manually merged this into the 2.6.18.4 kernel we are using.  With the 
original in kernel driver (just EL bit, no S bit), I had two tests that 
would always end in horrible memory corruption on a PXA255 based system. 
  One is a 12 hour bidirectional TCP test using iperf with the ethernet 
port sending packets to a wireless card and vice versa.  The other is a 
similar configuration running a 12 hour UDP test sending 20 
megabits/second in each direction.  Even through the original S-bit 
patch seems broken, we have had days of continuous traffic through it 
without issue where previously we could never go more than 6 hours.

I will let folks know how it goes.  In the UDP test, the ethernet side 
often gets ahead of the available buffers due to CPU and PCI usage by 
the wireless card and its driver.

I will also run these tests with the new patch and with a smaller 
receive pool (default is 256) to make the pool run out more often.

-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] e100 rx: or s and el bits
  2007-05-01 15:01 ` David Acker
@ 2007-05-02 20:21   ` David Acker
  2007-05-04 21:43     ` David Acker
  0 siblings, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-02 20:21 UTC (permalink / raw)
  To: Milton Miller
  Cc: Scott Feldman, Jeff Garzik, e1000-devel, netdev, linux-kernel,
	Andrew Morton, John Ronciak, Jesse Brandeburg, Jeff Kirsher,
	Auke Kok

David Acker wrote:
> Milton Miller wrote:
>> In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description
>> talks about emulating another driver by setting addtional bits and
>> the being unable to test when submitted.  Seeing the & operator to
>> set more bits made me suspicious, and indeed the bits are defined
>> in positive logic:
>>
>>     cb_s      = 0x4000,
>>     cb_el     = 0x8000,
>>
>> So anding those together would be 0.   I'm guessing they should
>> be or'd, but don't have hardware here to test, much like the
>> committed patch.  In fact, I'll let someone else do the compile
>> test too.  I'll update the comment.
>>
> 
> I wonder if this worked for me because the hardware also spun on the 
> link field being NULL?  Since the RU base is also set to 0, the 
> calculated physical address would be 0 as well.  I would imagine if the 
> hardware tried to read/write to very low addresses across PCI, there 
> would be issues.  I will retest with a small receive pool to try to hit 
> the problem.
> I will also run these tests with the new patch and with a smaller 
> receive pool (default is 256) to make the pool run out more often.

So far my testing has shown both the original and the new version of the 
S-bit patch work in that no corruption seemed to occur over long term 
runs.  The previous S-bit patch may have only worked due to something 
specific about how my PCI companion chip handles I/O to low memory 
addresses (from dereferencing a link address of 0).  Perhaps the e100 
handles the NULL link as well, but given that the manual does not seem 
to state what happens when the hardware encounters a buffer with a link 
of 0, I think Milton's fix is the proper way to do it.  The old eepro 
driver did set both bits although it did it with a hardcoded constant.

I will continue testing with slab debug on but that will take longer. 
Has anyone tried this on other platforms?
-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] e100 rx: or s and el bits
  2007-05-02 20:21   ` David Acker
@ 2007-05-04 21:43     ` David Acker
  2007-05-06  6:36       ` Milton Miller
  0 siblings, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-04 21:43 UTC (permalink / raw)
  To: Milton Miller
  Cc: Scott Feldman, Jeff Garzik, e1000-devel, netdev, linux-kernel,
	Andrew Morton, John Ronciak, Jesse Brandeburg, Jeff Kirsher,
	Auke Kok

David Acker wrote:
> So far my testing has shown both the original and the new version of the 
> S-bit patch work in that no corruption seemed to occur over long term 
> runs.

I spoke too soon.  Further testing has not gone well.  If I use the 
default settings for CPU saver and drop the receive pool down to 16 
buffers I can cause problems with various forms of the patch.  With the 
original S-bit patch I can get:
e100: eth0: e100_update_stats: exec cuc_dump_reset failed

At this point sends and receives are failing and their counters are not
changing.  When a raw socket opened on the port tries to send it gets
errno 11, Resource temporarily unavailable.

When this error occurred, I added a reset call through the tx timeout
schedule_work(&nic->tx_timeout_task);

This made the port come back but obviously this is somewhat of a kludge 
and eventually, the entire system hangs or get an oops from memory 
corruption.

Apparently my earlier tests got lucky with a large receive pool size.
The only way the original patch was stable for me was by disabling CPU
saver, and setting the NAPI weight (default of 16) as large as the
default receive pool size (256) so that all buffers were claimed and
reallocated in one NAPI call.  None of this should be needed so there is 
  definitely a problem with the original patch.

The updated patch produced a different issue.  We got an RNR interrupt
indicating the receive unit got ahead of the software.  The S-bit patch
removed any handling of this issue as it assumed the hardware would spin
on the sbit.  Apparently if both the S-bit and the EL-bit are set on the 
same RFD, it follows the EL-bit handling.  Printing the stat/ack and 
status bytes on the RNR interrupts I get:
status
01001000 = 0x48 = RUS of 0010 = No Resources, CUS of 01 = Suspended

stat/ack
01010000 = 0x50 = FR, RNR
or
00010000 = 0x10 = RNR

Notice that the RUS went into No Resources and not suspended.  Thus 
clearing the S-bit does not wake it up; it needs a new start command. 
I could not find documentation that states that the S-bit need only be 
cleared to take the RU out of suspended state.  Before the S-bit patch 
the driver tried to track this need but that version of the driver 
didn't work for me either.  By the way, I am using, "Intel 8255x 10/100 
Mbps Ethernet Controller Family, Open Source Software Developer Manual, 
January 2006" as my documentation.

This got me looking at just how in the world this worked on the old
eepro100 driver.  It had another difference; it did not reap the last rx
buffer in the chain.  It set a postponed bit and then picked it up on
the next interrupt after more buffers had been allocated.  It then
noticed that the RU was in a suspended or no resources state and did a
softreset.

I don't believe this avoid the last buffer trick really fixes the race. 
  Imagine the following:
1. 4 buffers in receive pool, all freshly allocated
2. Hardware consumes 3 buffers
3. Software processes 3 buffers, begins to allocate new buffers
4. Hardware writes status bits into buffer 4 while software updates link 
and command word bits in buffer 4.  They share a cache line and corrupt 
each other.

This appears to be possible with any of the versions of this driver I 
have seen.  The problem is one of packet ownership.  Once the driver 
gives a list of buffers to hardware, hardware owns them all.  The driver 
can not safely change these buffers.  Sadly, this means that the idea of 
the driver "staying ahead" of the hardware such that the hardware never 
runs out of resources will not work here.  Once the driver gives the 
hardware a packet with S or EL bits set, it must let the hardware 
encounter the packet and return it to software.

I think the driver needs to protect the last entry in the ring by 
putting the S-bit on the entry before it.  The first time the driver 
allocates a block of packets, it writes a new S-bit out on the next to 
last packet.  As buffers complete it allocates more packets in the chain 
but does not set a new S-bit since the old one will stop hardware.  It 
can not clear the old S-bit because the driver does not own the buffer, 
hardware does.  After processing the s-bit packet the hardware will 
interrupt with a stat/ack of RNR and RUS of suspended.
When software processes a packet with an old S-bit it allocates new 
buffers and sets the s-bit on the new next to last packet.

The above case changes now:
1. 4 buffers numbered 1-4 in a receive pool, all freshly allocated. 
S-bit is on buffer 3.
2. Hardware consumes 3 buffers, hits S-bit, RNR interrupts
3. Software processes 3 buffers, begins to allocate new buffers
4. Software sends resume once buffers are allocated, S-bit is on buffer 2.
5. Hardware gets resume.  When it processed buffer 3, it saved the link 
to buffer 4 and thus resumes at buffer 4.

Here is a different flow where the software stays ahead:
1. 4 buffers numbered 1-4 in a receive pool, all freshly allocated. 
S-bit is on buffer 3.
2. Hardware consumes 2 buffers (1, 2).
3. Software processes buffers 1, 2, begins to allocate new buffers
4. Software buffers 1, 2 are allocated
5. Hardware consumes 1 buffer (#3) and hits S-bit, RNR interrupts.
6. Software consumes 1 buffer, (#3) and finds the old S-bit.  It 
allocates a new buffer 3 and sets the S-bit on buffer 2.
7. Software sends resume, hardware continues at buffer 4.

In this setup, software will send a resume command every RING_SIZE 
packets.  RNR interrupts will also occur every RING_SIZE packets.  When 
hardware is faster than software, it will process RING_SIZE packets, RNR 
interrupt and wait for software to process all of them.  When software 
is faster then hardware, hardware will still process RING_SIZE packets 
before interrupting but software will only need to allocate 1 packet or 
so before sending the resume so hardware will wait much less time.

This will probably slow things down since on a fast CPU, software will 
normally stay ahead of the hardware and the only PCI operations from the 
driver would be interrupt acks.  With this change, we have PCI 
operations every 256 packets.  I don't see how else to do this in a safe 
way on ARM (at least PXA255).

I am testing this over the weekend with a 16-buffer receive pool.  If 
all goes well, I will send a patch early next week.  It will basically 
back out the S-bit patch and then make the changes noted above.

-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] e100 rx: or s and el bits
  2007-05-04 21:43     ` David Acker
@ 2007-05-06  6:36       ` Milton Miller
  2007-05-07 15:27         ` David Acker
  2007-05-14 18:26         ` [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) David Acker
  0 siblings, 2 replies; 51+ messages in thread
From: Milton Miller @ 2007-05-06  6:36 UTC (permalink / raw)
  To: David Acker
  Cc: Auke Kok, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher

[dropping Andrew, Jeff, and LKML]

On May 4, 2007, at 4:43 PM, David Acker wrote:
> David Acker wrote:
>> So far my testing has shown both the original and the new version of 
>> the S-bit patch work in that no corruption seemed to occur over long 
>> term runs.
>
> I spoke too soon.  Further testing has not gone well.  If I use the 
> default settings for CPU saver and drop the receive pool down to 16 
> buffers I can cause problems with various forms of the patch.  With 
> the original S-bit patch I can get:
...
>
> The updated patch produced a different issue.  We got an RNR interrupt 
> indicating the receive unit got ahead of the software.  The S-bit 
> patch removed any handling of this issue as it assumed the hardware 
> would spin
> on the sbit.  Apparently if both the S-bit and the EL-bit are set on 
> the same RFD, it follows the EL-bit handling.  Printing the stat/ack 
> and status bytes on the RNR interrupts I get:
>
> status
> 01001000 = 0x48 = RUS of 0010 = No Resources, CUS of 01 = Suspended
>
> stat/ack
> 01010000 = 0x50 = FR, RNR
> or
> 00010000 = 0x10 = RNR
>
> Notice that the RUS went into No Resources and not suspended.  Thus 
> clearing the S-bit does not wake it up; it needs a new start command. 
> I could not find documentation that states that the S-bit need only be 
> cleared to take the RU out of suspended state.  Before the S-bit patch 
> the driver tried to track this need but that version of the driver 
> didn't work for me either.  By the way, I am using, "Intel 8255x 
> 10/100 Mbps Ethernet Controller Family, Open Source Software Developer 
> Manual, January 2006" as my documentation.
>
> This got me looking at just how in the world this worked on the old
> eepro100 driver.  It had another difference; it did not reap the last 
> rx
> buffer in the chain.  It set a postponed bit and then picked it up on
> the next interrupt after more buffers had been allocated.  It then
> noticed that the RU was in a suspended or no resources state and did a
> softreset.
>
> I don't believe this avoid the last buffer trick really fixes the 
> race.  Imagine the following:
> 1. 4 buffers in receive pool, all freshly allocated
> 2. Hardware consumes 3 buffers
> 3. Software processes 3 buffers, begins to allocate new buffers
> 4. Hardware writes status bits into buffer 4 while software updates 
> link and command word bits in buffer 4.  They share a cache line and 
> corrupt each other.
>
> This appears to be possible with any of the versions of this driver I 
> have seen.  The problem is one of packet ownership.  Once the driver 
> gives a list of buffers to hardware, hardware owns them all.  The 
> driver can not safely change these buffers.  Sadly, this means that 
> the idea of the driver "staying ahead" of the hardware such that the 
> hardware never runs out of resources will not work here.  Once the 
> driver gives the hardware a packet with S or EL bits set, it must let 
> the hardware encounter the packet and return it to software.
>
> I think the driver needs to protect the last entry in the ring by 
> putting the S-bit on the entry before it.  The first time the driver 
> allocates a block of packets, it writes a new S-bit out on the next to 
> last packet.  As buffers complete it allocates more packets in the 
> chain but does not set a new S-bit since the old one will stop 
> hardware.  It can not clear the old S-bit because the driver does not 
> own the buffer, hardware does.  After processing the s-bit packet the 
> hardware will interrupt with a stat/ack of RNR and RUS of suspended.
> When software processes a packet with an old S-bit it allocates new 
> buffers and sets the s-bit on the new next to last packet.
>
> The above case changes now:
> 1. 4 buffers numbered 1-4 in a receive pool, all freshly allocated. 
> S-bit is on buffer 3.
> 2. Hardware consumes 3 buffers, hits S-bit, RNR interrupts
> 3. Software processes 3 buffers, begins to allocate new buffers
> 4. Software sends resume once buffers are allocated, S-bit is on 
> buffer 2.
> 5. Hardware gets resume.  When it processed buffer 3, it saved the 
> link to buffer 4 and thus resumes at buffer 4.
>
>
> Here is a different flow where the software stays ahead:
> 1. 4 buffers numbered 1-4 in a receive pool, all freshly allocated. 
> S-bit is on buffer 3.
> 2. Hardware consumes 2 buffers (1, 2).
> 3. Software processes buffers 1, 2, begins to allocate new buffers
> 4. Software buffers 1, 2 are allocated
> 5. Hardware consumes 1 buffer (#3) and hits S-bit, RNR interrupts.
> 6. Software consumes 1 buffer, (#3) and finds the old S-bit.  It 
> allocates a new buffer 3 and sets the S-bit on buffer 2.
> 7. Software sends resume, hardware continues at buffer 4.
>
> In this setup, software will send a resume command every RING_SIZE 
> packets.  RNR interrupts will also occur every RING_SIZE packets.  
> When hardware is faster than software, it will process RING_SIZE 
> packets, RNR interrupt and wait for software to process all of them.  
> When software is faster then hardware, hardware will still process 
> RING_SIZE packets before interrupting but software will only need to 
> allocate 1 packet or so before sending the resume so hardware will 
> wait much less time.
>
> This will probably slow things down since on a fast CPU, software will 
> normally stay ahead of the hardware and the only PCI operations from 
> the driver would be interrupt acks.  With this change, we have PCI 
> operations every 256 packets.  I don't see how else to do this in a 
> safe way on ARM (at least PXA255).
>
> I am testing this over the weekend with a 16-buffer receive pool.  If 
> all goes well, I will send a patch early next week.  It will basically 
> back out the S-bit patch and then make the changes noted above.
>

While this will help the problem with the cache-incoherent DMA systems 
not running, it guarantees the hardware will stop every <ring-size> 
packets and wait for the cpu to respond to an interrupt.  It would seem 
that this will lead to packet drops.

[download manual from site in source file]

In fact 6.4.3.4 says 82557 will start dropping frames immediately.

Looking at the descriptions around page 101:
(1) The link pointer, S, and EL is read when hw starts recieving the 
frame.
(2) Its pretty clear EL overrides S from the order of the descriptions 
in the text.
(3) 6.4.3.3.1 #4 looks intresting -- That is a RFD with size 0 skips 
frame fill and goes to the next packet.

How about putting a zero length descriptor in consistent memory to 
suspend the rx unit before the last real frame?   In other words  fr0 
-> fr1 ... frN-2 -> frN-1 -> WaitHere0 -> FrN.   We could then have 2 
such frames, and when we refill modify FrN to the new chain, with the 
WaitHere1 as its next-to-last, do the syncs, then clear the S bit on 
WaitHere0.   When the rx passes WaitHere0 we can reclaim it for the 
next use (might want a slightly larger pool, basically need RxRingSize 
/ RxRingFillBatch such frames.

milton


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] e100 rx: or s and el bits
  2007-05-06  6:36       ` Milton Miller
@ 2007-05-07 15:27         ` David Acker
  2007-05-14 18:26         ` [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) David Acker
  1 sibling, 0 replies; 51+ messages in thread
From: David Acker @ 2007-05-07 15:27 UTC (permalink / raw)
  To: Milton Miller
  Cc: Auke Kok, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher

Milton Miller wrote:
> While this will help the problem with the cache-incoherent DMA systems 
> not running, it guarantees the hardware will stop every <ring-size> 
> packets and wait for the cpu to respond to an interrupt.  It would seem 
> that this will lead to packet drops.
> 
Well, in NAPI mode, we the CPU may poll its way to the last buffer 
without having to go through an interrupt cycle.  You are right that 
buffers would probably get dropped between the time the hardware hit the 
S-bit and the CPU caught up.

> [download manual from site in source file]
> 
> In fact 6.4.3.4 says 82557 will start dropping frames immediately.
> 
> Looking at the descriptions around page 101:
> (1) The link pointer, S, and EL is read when hw starts recieving the frame.
> (2) Its pretty clear EL overrides S from the order of the descriptions 
> in the text.
My testing confirms this.

> (3) 6.4.3.3.1 #4 looks intresting -- That is a RFD with size 0 skips 
> frame fill and goes to the next packet.
> 
> How about putting a zero length descriptor in consistent memory to 
> suspend the rx unit before the last real frame?   In other words  fr0 -> 
> fr1 ... frN-2 -> frN-1 -> WaitHere0 -> FrN.   We could then have 2 such 
> frames, and when we refill modify FrN to the new chain, with the 
> WaitHere1 as its next-to-last, do the syncs, then clear the S bit on 
> WaitHere0.   When the rx passes WaitHere0 we can reclaim it for the next 
> use (might want a slightly larger pool, basically need RxRingSize / 
> RxRingFillBatch such frames.

Hmm...I will take a look at this.  My test worked over the weekend by 
the way.  A patch will be coming.
-Ack

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-06  6:36       ` Milton Miller
  2007-05-07 15:27         ` David Acker
@ 2007-05-14 18:26         ` David Acker
  2007-05-18  1:54           ` Jeff Garzik
  1 sibling, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-14 18:26 UTC (permalink / raw)
  To: Milton Miller
  Cc: Auke Kok, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, Scott Feldman, netdev

Milton Miller wrote:
> In fact 6.4.3.4 says 82557 will start dropping frames immediately.
> 
> Looking at the descriptions around page 101:
> (1) The link pointer, S, and EL is read when hw starts recieving the frame.
> (2) Its pretty clear EL overrides S from the order of the descriptions 
> in the text.
> (3) 6.4.3.3.1 #4 looks intresting -- That is a RFD with size 0 skips 
> frame fill and goes to the next packet.
> 
> How about putting a zero length descriptor in consistent memory to 
> suspend the rx unit before the last real frame?   In other words  fr0 -> 
> fr1 ... frN-2 -> frN-1 -> WaitHere0 -> FrN.   We could then have 2 such 
> frames, and when we refill modify FrN to the new chain, with the 
> WaitHere1 as its next-to-last, do the syncs, then clear the S bit on 
> WaitHere0.   When the rx passes WaitHere0 we can reclaim it for the next 
> use (might want a slightly larger pool, basically need RxRingSize / 
> RxRingFillBatch such frames.
> 

I managed to code up something similar to what Milton described here.  Below
is a patch against 2.6.21.1, before the S-bit patch.  I ran it through two
12 hour UDP tests with 20 megabits per second in each direction for data
going from an ethernet client to an 802.11g client.  I had the receive pool
sized at 16 buffers.

---

On the ARM, their is a race condition between software allocating a new receive
buffer and hardware writing into a buffer.  The two race on touching the last
Receive Frame Descriptor (RFD).  It has its el-bit set and its next link equal
to 0.  When hardware encounters this buffer it attempts to write data to it
and then update Status Word bits and Actual Count in the RFD.  At the same time
software may try to clear the el-bit and set the link address to a new buffer.
Since the entire RFD is once cache-line, the two write operations can
collide.  This can lead to the receive unit stalling or freed receive buffers
getting written to.

The fix is to set the el-bit on and the size to 0 on the next to last buffer
in the chain.  When the hardware encounters this buffer it stops and does
not write to it at all.  The hardware issues an RNR interrupt with the
receive unit in the No Resources state.  When software allocates buffers,
it can update the tail of the list because it knows the hardware will stop
at the buffer before it.  Once it has a new next to last buffer prepared,
it can clear the el-bit and set the size on the previous one.  The race on
this buffer is safe since the link already points to a valid next buffer.
If the hardware sees the el-bit cleared without the size set, it will
move on to the next buffer and complete that one in error.  If it sees
the size set but the el-bit still set, it will complete that buffer
and then RNR interrupt and wait.


Signed-off-by: David Acker <dacker@roinet.com>

---

--- linux-2.6.21.1/drivers/net/e100.c.orig	2007-05-11 11:27:11.000000000 -0400
+++ linux-2.6.21.1/drivers/net/e100.c	2007-05-11 11:39:06.000000000 -0400
@@ -1781,13 +1781,12 @@ static int e100_rx_alloc_skb(struct nic
  	}

  	/* Link the RFD to end of RFA by linking previous RFD to
-	 * this one, and clearing EL bit of previous.  */
+	 * this one.  We are safe to touch the previous RFD because
+	 * it is protected the before last buffer's el bit being set  */
  	if(rx->prev->skb) {
  		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
  		put_unaligned(cpu_to_le32(rx->dma_addr),
  			(u32 *)&prev_rfd->link);
-		wmb();
-		prev_rfd->command &= ~cpu_to_le16(cb_el);
  		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
  			sizeof(struct rfd), PCI_DMA_TODEVICE);
  	}
@@ -1861,6 +1860,8 @@ static void e100_rx_clean(struct nic *ni
  	struct rx *rx;
  	int restart_required = 0;
  	struct rx *rx_to_start = NULL;
+	struct rx *old_before_last_rx, *new_before_last_rx;
+	struct rfd *old_before_last_rfd, *new_before_last_rfd;

  	/* are we already rnr? then pay attention!!! this ensures that
  	 * the state machine progression never allows a start with a
@@ -1885,12 +1886,43 @@ static void e100_rx_clean(struct nic *ni
  	if(restart_required)
  		rx_to_start = nic->rx_to_clean;

+	old_before_last_rx = nic->rx_to_use->prev->prev;
+	old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
+
  	/* Alloc new skbs to refill list */
  	for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
  		if(unlikely(e100_rx_alloc_skb(nic, rx)))
  			break; /* Better luck next time (see watchdog) */
  	}

+	new_before_last_rx = nic->rx_to_use->prev->prev;
+	if (new_before_last_rx != old_before_last_rx) {
+		/* Set the el-bit on the buffer that is before the last buffer.
+		 * This lets us update the next pointer on the last buffer without
+		 * worrying about hardware touching it.
+		 * We set the size to 0 to prevent hardware from touching this buffer.
+		 * When the hardware hits the before last buffer with el-bit and size
+		 * of 0, it will RNR interrupt, the RUS will go into the No Resources
+		 * state.  It will not complete nor write to this buffer. */
+		new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data;
+		new_before_last_rfd->size = 0;
+		new_before_last_rfd->command |= cpu_to_le16(cb_el);
+		pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr,
+			sizeof(struct rfd), PCI_DMA_TODEVICE);
+
+		/* Now that we have a new stopping point, we can clear the old
+		 * stopping point.
+		 * Note: There appears to be a race here where the hardware
+		 * can complete this buffer with the el-bit set but with the
+		 * size also set.  The hardware RNR interrupts, the RUS
+		 * goes into the No Resources state. */
+		old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
+		wmb();
+		old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+		pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr,
+			sizeof(struct rfd), PCI_DMA_TODEVICE);
+	}
+
  	if(restart_required) {
  		// ack the rnr?
  		writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
@@ -1926,6 +1958,7 @@ static int e100_rx_alloc_list(struct nic
  {
  	struct rx *rx;
  	unsigned int i, count = nic->params.rfds.count;
+	struct rfd *before_last;

  	nic->rx_to_use = nic->rx_to_clean = NULL;
  	nic->ru_running = RU_UNINITIALIZED;
@@ -1942,6 +1975,20 @@ static int e100_rx_alloc_list(struct nic
  		}
  	}

+	/* Set the el-bit on the buffer that is before the last buffer.
+	 * This lets us update the next pointer on the last buffer without
+	 * worrying about hardware touching it.
+	 * We set the size to 0 to prevent hardware from touching this buffer.
+	 * When the hardware hits the before last buffer with el-bit and size
+	 * of 0, it will RNR interrupt, the RUS will go into the No Resources
+	 * state.  It will not complete nor write to this buffer. */
+	rx = nic->rxs->prev->prev;
+	before_last = (struct rfd *)rx->skb->data;
+	before_last->command |= cpu_to_le16(cb_el);
+	before_last->size = 0;
+	pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
+		sizeof(struct rfd), PCI_DMA_TODEVICE);
+
  	nic->rx_to_use = nic->rx_to_clean = nic->rxs;
  	nic->ru_running = RU_SUSPENDED;



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-14 18:26         ` [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) David Acker
@ 2007-05-18  1:54           ` Jeff Garzik
  2007-05-18  3:47             ` Kok, Auke
  0 siblings, 1 reply; 51+ messages in thread
From: Jeff Garzik @ 2007-05-18  1:54 UTC (permalink / raw)
  To: David Acker, Auke Kok
  Cc: Milton Miller, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, Scott Feldman, netdev

Can you resend against the latest kernel (2.6.22-rc1)?

And what does Intel think?

	Jeff




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18  1:54           ` Jeff Garzik
@ 2007-05-18  3:47             ` Kok, Auke
  2007-05-18 14:07               ` David Acker
  0 siblings, 1 reply; 51+ messages in thread
From: Kok, Auke @ 2007-05-18  3:47 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David Acker, Auke Kok, Milton Miller, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, Scott Feldman, netdev

Jeff Garzik wrote:
> Can you resend against the latest kernel (2.6.22-rc1)?
> 
> And what does Intel think?

I'm expecting at least a reply from Milton as the patch was sent to him. I 
haven't yet tested it but will certainly do so. At first glance it looks OK, and 
I'll try to put it under my colleague's noses who know e100 best.

A resend against 2.6.22-rc1 would be nice.

Auke

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18  3:47             ` Kok, Auke
@ 2007-05-18 14:07               ` David Acker
  2007-05-18 14:20                 ` David Acker
  0 siblings, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-18 14:07 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, Scott Feldman, netdev

Kok, Auke wrote:
> Jeff Garzik wrote:
>> Can you resend against the latest kernel (2.6.22-rc1)?
>>
>> And what does Intel think?
> 
> I'm expecting at least a reply from Milton as the patch was sent to him. 
> I haven't yet tested it but will certainly do so. At first glance it 
> looks OK, and I'll try to put it under my colleague's noses who know 
> e100 best.
> 
> A resend against 2.6.22-rc1 would be nice.
> 

Done.  Below is a patch against 2.6.22-rc1.  It combines removing the s-bit patch and 
applying the patch I previously sent.


On the ARM, their is a race condition between software allocating a new receive
buffer and hardware writing into a buffer.  The two race on touching the last
Receive Frame Descriptor (RFD).  It has its el-bit set and its next link equal
to 0.  When hardware encounters this buffer it attempts to write data to it
and then update Status Word bits and Actual Count in the RFD.  At the same time
software may try to clear the el-bit and set the link address to a new buffer.
Since the entire RFD is once cache-line, the two write operations can
collide.  This can lead to the receive unit stalling or freed receive buffers
getting written to.

The fix is to set the el-bit on and the size to 0 on the next to last buffer
in the chain.  When the hardware encounters this buffer it stops and does
not write to it at all.  The hardware issues an RNR interrupt with the
receive unit in the No Resources state.  When software allocates buffers,
it can update the tail of the list because it knows the hardware will stop
at the buffer before it.  Once it has a new next to last buffer prepared,
it can clear the el-bit and set the size on the previous one.  The race on
this buffer is safe since the link already points to a valid next buffer.
If the hardware sees the el-bit cleared without the size set, it will
move on to the next buffer and complete that one in error.  If it sees
the size set but the el-bit still set, it will complete that buffer
and then RNR interrupt and wait.


Signed-off-by: David Acker <dacker@roinet.com>

---


--- linux-2.6.22-rc1/drivers/net/e100.c.orig	2007-05-18 10:01:52.000000000 -0400
+++ linux-2.6.22-rc1/drivers/net/e100.c	2007-05-18 09:59:33.000000000 -0400
@@ -285,6 +285,12 @@ enum scb_status {
  	rus_mask         = 0x3C,
  };

+enum ru_state  {
+	RU_SUSPENDED = 0,
+	RU_RUNNING	 = 1,
+	RU_UNINITIALIZED = -1,
+};
+
  enum scb_stat_ack {
  	stat_ack_not_ours    = 0x00,
  	stat_ack_sw_gen      = 0x04,
@@ -526,6 +532,7 @@ struct nic {
  	struct rx *rx_to_use;
  	struct rx *rx_to_clean;
  	struct rfd blank_rfd;
+	enum ru_state ru_running;

  	spinlock_t cb_lock			____cacheline_aligned;
  	spinlock_t cmd_lock;
@@ -947,7 +954,7 @@ static void e100_get_defaults(struct nic
  		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));

  	/* Template for a freshly allocated RFD */
-	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
+	nic->blank_rfd.command = cpu_to_le16(cb_el);
  	nic->blank_rfd.rbd = 0xFFFFFFFF;
  	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);

@@ -1742,11 +1749,19 @@ static int e100_alloc_cbs(struct nic *ni
  	return 0;
  }

-static inline void e100_start_receiver(struct nic *nic)
+static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
  {
-	/* Start if RFA is non-NULL */
-	if(nic->rx_to_clean->skb)
-		e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
+	if(!nic->rxs) return;
+	if(RU_SUSPENDED != nic->ru_running) return;
+
+	/* handle init time starts */
+	if(!rx) rx = nic->rxs;
+
+	/* (Re)start RU if suspended or idle and RFA is non-NULL */
+	if(rx->skb) {
+		e100_exec_cmd(nic, ruc_start, rx->dma_addr);
+		nic->ru_running = RU_RUNNING;
+	}
  }

  #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
@@ -1769,13 +1784,12 @@ static int e100_rx_alloc_skb(struct nic
  	}

  	/* Link the RFD to end of RFA by linking previous RFD to
-	 * this one, and clearing EL bit of previous.  */
+	 * this one.  We are safe to touch the previous RFD because
+	 * it is protected the before last buffer's el bit being set  */
  	if(rx->prev->skb) {
  		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
  		put_unaligned(cpu_to_le32(rx->dma_addr),
  			(u32 *)&prev_rfd->link);
-		wmb();
-		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
  		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
  			sizeof(struct rfd), PCI_DMA_TODEVICE);
  	}
@@ -1813,6 +1827,10 @@ static int e100_rx_indicate(struct nic *
  	pci_unmap_single(nic->pdev, rx->dma_addr,
  		RFD_BUF_LEN, PCI_DMA_FROMDEVICE);

+	/* this allows for a fast restart without re-enabling interrupts */
+	if(le16_to_cpu(rfd->command) & cb_el)
+		nic->ru_running = RU_SUSPENDED;
+
  	/* Pull off the RFD and put the actual data (minus eth hdr) */
  	skb_reserve(skb, sizeof(struct rfd));
  	skb_put(skb, actual_size);
@@ -1843,18 +1861,78 @@ static void e100_rx_clean(struct nic *ni
  	unsigned int work_to_do)
  {
  	struct rx *rx;
+	int restart_required = 0;
+	struct rx *rx_to_start = NULL;
+	struct rx *old_before_last_rx, *new_before_last_rx;
+	struct rfd *old_before_last_rfd, *new_before_last_rfd;
+
+	/* are we already rnr? then pay attention!!! this ensures that
+	 * the state machine progression never allows a start with a
+	 * partially cleaned list, avoiding a race between hardware
+	 * and rx_to_clean when in NAPI mode */
+	if(RU_SUSPENDED == nic->ru_running)
+		restart_required = 1;

  	/* Indicate newly arrived packets */
  	for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
-		if(e100_rx_indicate(nic, rx, work_done, work_to_do))
+		int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
+		if(-EAGAIN == err) {
+			/* hit quota so have more work to do, restart once
+			 * cleanup is complete */
+			restart_required = 0;
+			break;
+		} else if(-ENODATA == err)
  			break; /* No more to clean */
  	}

+	/* save our starting point as the place we'll restart the receiver */
+	if(restart_required)
+		rx_to_start = nic->rx_to_clean;
+
+	old_before_last_rx = nic->rx_to_use->prev->prev;
+	old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
+
  	/* Alloc new skbs to refill list */
  	for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
  		if(unlikely(e100_rx_alloc_skb(nic, rx)))
  			break; /* Better luck next time (see watchdog) */
  	}
+
+	new_before_last_rx = nic->rx_to_use->prev->prev;
+	if (new_before_last_rx != old_before_last_rx) {
+		/* Set the el-bit on the buffer that is before the last buffer.
+		 * This lets us update the next pointer on the last buffer without
+		 * worrying about hardware touching it.
+		 * We set the size to 0 to prevent hardware from touching this buffer.
+		 * When the hardware hits the before last buffer with el-bit and size
+		 * of 0, it will RNR interrupt, the RUS will go into the No Resources
+		 * state.  It will not complete nor write to this buffer. */
+		new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data;
+		new_before_last_rfd->size = 0;
+		new_before_last_rfd->command |= cpu_to_le16(cb_el);
+		pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr,
+			sizeof(struct rfd), PCI_DMA_TODEVICE);
+
+		/* Now that we have a new stopping point, we can clear the old
+		 * stopping point.
+		 * Note: There appears to be a race here where the hardware
+		 * can complete this buffer with the el-bit set but with the
+		 * size also set.  The hardware RNR interrupts, the RUS
+		 * goes into the No Resources state. */
+		old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
+		wmb();
+		old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+		pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr,
+			sizeof(struct rfd), PCI_DMA_TODEVICE);
+	}
+
+	if(restart_required) {
+		// ack the rnr?
+		writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
+		e100_start_receiver(nic, rx_to_start);
+		if(work_done)
+			(*work_done)++;
+	}
  }

  static void e100_rx_clean_list(struct nic *nic)
@@ -1862,6 +1940,8 @@ static void e100_rx_clean_list(struct ni
  	struct rx *rx;
  	unsigned int i, count = nic->params.rfds.count;

+	nic->ru_running = RU_UNINITIALIZED;
+
  	if(nic->rxs) {
  		for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
  			if(rx->skb) {
@@ -1881,8 +1961,10 @@ static int e100_rx_alloc_list(struct nic
  {
  	struct rx *rx;
  	unsigned int i, count = nic->params.rfds.count;
+	struct rfd *before_last;

  	nic->rx_to_use = nic->rx_to_clean = NULL;
+	nic->ru_running = RU_UNINITIALIZED;

  	if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
  		return -ENOMEM;
@@ -1896,7 +1978,22 @@ static int e100_rx_alloc_list(struct nic
  		}
  	}

+	/* Set the el-bit on the buffer that is before the last buffer.
+	 * This lets us update the next pointer on the last buffer without
+	 * worrying about hardware touching it.
+	 * We set the size to 0 to prevent hardware from touching this buffer.
+	 * When the hardware hits the before last buffer with el-bit and size
+	 * of 0, it will RNR interrupt, the RUS will go into the No Resources
+	 * state.  It will not complete nor write to this buffer. */
+	rx = nic->rxs->prev->prev;
+	before_last = (struct rfd *)rx->skb->data;
+	before_last->command |= cpu_to_le16(cb_el);
+	before_last->size = 0;
+	pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
+		sizeof(struct rfd), PCI_DMA_TODEVICE);
+
  	nic->rx_to_use = nic->rx_to_clean = nic->rxs;
+	nic->ru_running = RU_SUSPENDED;

  	return 0;
  }
@@ -1916,6 +2013,10 @@ static irqreturn_t e100_intr(int irq, vo
  	/* Ack interrupt(s) */
  	iowrite8(stat_ack, &nic->csr->scb.stat_ack);

+	/* We hit Receive No Resource (RNR); restart RU after cleaning */
+	if(stat_ack & stat_ack_rnr)
+		nic->ru_running = RU_SUSPENDED;
+
  	if(likely(netif_rx_schedule_prep(netdev))) {
  		e100_disable_irq(nic);
  		__netif_rx_schedule(netdev);
@@ -2007,7 +2108,7 @@ static int e100_up(struct nic *nic)
  	if((err = e100_hw_init(nic)))
  		goto err_clean_cbs;
  	e100_set_multicast_list(nic->netdev);
-	e100_start_receiver(nic);
+	e100_start_receiver(nic, NULL);
  	mod_timer(&nic->watchdog, jiffies);
  	if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
  		nic->netdev->name, nic->netdev)))
@@ -2088,7 +2189,7 @@ static int e100_loopback_test(struct nic
  		mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
  			BMCR_LOOPBACK);

-	e100_start_receiver(nic);
+	e100_start_receiver(nic, NULL);

  	if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
  		err = -ENOMEM;


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18 14:07               ` David Acker
@ 2007-05-18 14:20                 ` David Acker
  2007-05-18 15:29                   ` Kok, Auke
  0 siblings, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-18 14:20 UTC (permalink / raw)
  To: David Acker
  Cc: Kok, Auke, Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, Scott Feldman, netdev

David Acker wrote:
> Kok, Auke wrote:
>> Jeff Garzik wrote:
>>> Can you resend against the latest kernel (2.6.22-rc1)?
>>>
>>> And what does Intel think?
>>
>> I'm expecting at least a reply from Milton as the patch was sent to 
>> him. I haven't yet tested it but will certainly do so. At first glance 
>> it looks OK, and I'll try to put it under my colleague's noses who 
>> know e100 best.
>>
>> A resend against 2.6.22-rc1 would be nice.
>>
> 
> Done.  Below is a patch against 2.6.22-rc1.  It combines removing the 
> s-bit patch and applying the patch I previously sent.

Oops.  I missed one state in that patch.  Since the el-bit buffer will normally not 
complete due to a zero size, we need to check if the buffer with no data has the el-bit 
set.  Without this, you have to wait for the interrupt.  Sorry about that...this was in 
the code I tested on my embedded system but got lost in the regular kernel patch.

On the ARM, their is a race condition between software allocating a new receive
buffer and hardware writing into a buffer.  The two race on touching the last
Receive Frame Descriptor (RFD).  It has its el-bit set and its next link equal
to 0.  When hardware encounters this buffer it attempts to write data to it
and then update Status Word bits and Actual Count in the RFD.  At the same time
software may try to clear the el-bit and set the link address to a new buffer.
Since the entire RFD is once cache-line, the two write operations can
collide.  This can lead to the receive unit stalling or freed receive buffers
getting written to.

The fix is to set the el-bit on and the size to 0 on the next to last buffer
in the chain.  When the hardware encounters this buffer it stops and does
not write to it at all.  The hardware issues an RNR interrupt with the
receive unit in the No Resources state.  When software allocates buffers,
it can update the tail of the list because it knows the hardware will stop
at the buffer before it.  Once it has a new next to last buffer prepared,
it can clear the el-bit and set the size on the previous one.  The race on
this buffer is safe since the link already points to a valid next buffer.
If the hardware sees the el-bit cleared without the size set, it will
move on to the next buffer and complete that one in error.  If it sees
the size set but the el-bit still set, it will complete that buffer
and then RNR interrupt and wait.


Signed-off-by: David Acker <dacker@roinet.com>

---


--- linux-2.6.22-rc1/drivers/net/e100.c.orig	2007-05-18 10:16:03.000000000 -0400
+++ linux-2.6.22-rc1/drivers/net/e100.c	2007-05-18 10:15:53.000000000 -0400
@@ -285,6 +285,12 @@ enum scb_status {
  	rus_mask         = 0x3C,
  };

+enum ru_state  {
+	RU_SUSPENDED = 0,
+	RU_RUNNING	 = 1,
+	RU_UNINITIALIZED = -1,
+};
+
  enum scb_stat_ack {
  	stat_ack_not_ours    = 0x00,
  	stat_ack_sw_gen      = 0x04,
@@ -526,6 +532,7 @@ struct nic {
  	struct rx *rx_to_use;
  	struct rx *rx_to_clean;
  	struct rfd blank_rfd;
+	enum ru_state ru_running;

  	spinlock_t cb_lock			____cacheline_aligned;
  	spinlock_t cmd_lock;
@@ -947,7 +954,7 @@ static void e100_get_defaults(struct nic
  		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));

  	/* Template for a freshly allocated RFD */
-	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
+	nic->blank_rfd.command = cpu_to_le16(cb_el);
  	nic->blank_rfd.rbd = 0xFFFFFFFF;
  	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);

@@ -1742,11 +1749,19 @@ static int e100_alloc_cbs(struct nic *ni
  	return 0;
  }

-static inline void e100_start_receiver(struct nic *nic)
+static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
  {
-	/* Start if RFA is non-NULL */
-	if(nic->rx_to_clean->skb)
-		e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
+	if(!nic->rxs) return;
+	if(RU_SUSPENDED != nic->ru_running) return;
+
+	/* handle init time starts */
+	if(!rx) rx = nic->rxs;
+
+	/* (Re)start RU if suspended or idle and RFA is non-NULL */
+	if(rx->skb) {
+		e100_exec_cmd(nic, ruc_start, rx->dma_addr);
+		nic->ru_running = RU_RUNNING;
+	}
  }

  #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
@@ -1769,13 +1784,12 @@ static int e100_rx_alloc_skb(struct nic
  	}

  	/* Link the RFD to end of RFA by linking previous RFD to
-	 * this one, and clearing EL bit of previous.  */
+	 * this one.  We are safe to touch the previous RFD because
+	 * it is protected the before last buffer's el bit being set  */
  	if(rx->prev->skb) {
  		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
  		put_unaligned(cpu_to_le32(rx->dma_addr),
  			(u32 *)&prev_rfd->link);
-		wmb();
-		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
  		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
  			sizeof(struct rfd), PCI_DMA_TODEVICE);
  	}
@@ -1801,8 +1815,12 @@ static int e100_rx_indicate(struct nic *
  	DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);

  	/* If data isn't ready, nothing to indicate */
-	if(unlikely(!(rfd_status & cb_complete)))
+	if(unlikely(!(rfd_status & cb_complete))) {
+		/* this allows for a fast restart without re-enabling interrupts */
+		if(le16_to_cpu(rfd->command) & cb_el)
+			nic->ru_running = RU_SUSPENDED;
  		return -ENODATA;
+	}

  	/* Get actual data size */
  	actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
@@ -1813,6 +1831,10 @@ static int e100_rx_indicate(struct nic *
  	pci_unmap_single(nic->pdev, rx->dma_addr,
  		RFD_BUF_LEN, PCI_DMA_FROMDEVICE);

+	/* this allows for a fast restart without re-enabling interrupts */
+	if(le16_to_cpu(rfd->command) & cb_el)
+		nic->ru_running = RU_SUSPENDED;
+
  	/* Pull off the RFD and put the actual data (minus eth hdr) */
  	skb_reserve(skb, sizeof(struct rfd));
  	skb_put(skb, actual_size);
@@ -1843,18 +1865,78 @@ static void e100_rx_clean(struct nic *ni
  	unsigned int work_to_do)
  {
  	struct rx *rx;
+	int restart_required = 0;
+	struct rx *rx_to_start = NULL;
+	struct rx *old_before_last_rx, *new_before_last_rx;
+	struct rfd *old_before_last_rfd, *new_before_last_rfd;
+
+	/* are we already rnr? then pay attention!!! this ensures that
+	 * the state machine progression never allows a start with a
+	 * partially cleaned list, avoiding a race between hardware
+	 * and rx_to_clean when in NAPI mode */
+	if(RU_SUSPENDED == nic->ru_running)
+		restart_required = 1;

  	/* Indicate newly arrived packets */
  	for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
-		if(e100_rx_indicate(nic, rx, work_done, work_to_do))
+		int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
+		if(-EAGAIN == err) {
+			/* hit quota so have more work to do, restart once
+			 * cleanup is complete */
+			restart_required = 0;
+			break;
+		} else if(-ENODATA == err)
  			break; /* No more to clean */
  	}

+	/* save our starting point as the place we'll restart the receiver */
+	if(restart_required)
+		rx_to_start = nic->rx_to_clean;
+
+	old_before_last_rx = nic->rx_to_use->prev->prev;
+	old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
+
  	/* Alloc new skbs to refill list */
  	for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
  		if(unlikely(e100_rx_alloc_skb(nic, rx)))
  			break; /* Better luck next time (see watchdog) */
  	}
+
+	new_before_last_rx = nic->rx_to_use->prev->prev;
+	if (new_before_last_rx != old_before_last_rx) {
+		/* Set the el-bit on the buffer that is before the last buffer.
+		 * This lets us update the next pointer on the last buffer without
+		 * worrying about hardware touching it.
+		 * We set the size to 0 to prevent hardware from touching this buffer.
+		 * When the hardware hits the before last buffer with el-bit and size
+		 * of 0, it will RNR interrupt, the RUS will go into the No Resources
+		 * state.  It will not complete nor write to this buffer. */
+		new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data;
+		new_before_last_rfd->size = 0;
+		new_before_last_rfd->command |= cpu_to_le16(cb_el);
+		pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr,
+			sizeof(struct rfd), PCI_DMA_TODEVICE);
+
+		/* Now that we have a new stopping point, we can clear the old
+		 * stopping point.
+		 * Note: There appears to be a race here where the hardware
+		 * can complete this buffer with the el-bit set but with the
+		 * size also set.  The hardware RNR interrupts, the RUS
+		 * goes into the No Resources state. */
+		old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
+		wmb();
+		old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+		pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr,
+			sizeof(struct rfd), PCI_DMA_TODEVICE);
+	}
+
+	if(restart_required) {
+		// ack the rnr?
+		writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
+		e100_start_receiver(nic, rx_to_start);
+		if(work_done)
+			(*work_done)++;
+	}
  }

  static void e100_rx_clean_list(struct nic *nic)
@@ -1862,6 +1944,8 @@ static void e100_rx_clean_list(struct ni
  	struct rx *rx;
  	unsigned int i, count = nic->params.rfds.count;

+	nic->ru_running = RU_UNINITIALIZED;
+
  	if(nic->rxs) {
  		for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
  			if(rx->skb) {
@@ -1881,8 +1965,10 @@ static int e100_rx_alloc_list(struct nic
  {
  	struct rx *rx;
  	unsigned int i, count = nic->params.rfds.count;
+	struct rfd *before_last;

  	nic->rx_to_use = nic->rx_to_clean = NULL;
+	nic->ru_running = RU_UNINITIALIZED;

  	if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
  		return -ENOMEM;
@@ -1896,7 +1982,22 @@ static int e100_rx_alloc_list(struct nic
  		}
  	}

+	/* Set the el-bit on the buffer that is before the last buffer.
+	 * This lets us update the next pointer on the last buffer without
+	 * worrying about hardware touching it.
+	 * We set the size to 0 to prevent hardware from touching this buffer.
+	 * When the hardware hits the before last buffer with el-bit and size
+	 * of 0, it will RNR interrupt, the RUS will go into the No Resources
+	 * state.  It will not complete nor write to this buffer. */
+	rx = nic->rxs->prev->prev;
+	before_last = (struct rfd *)rx->skb->data;
+	before_last->command |= cpu_to_le16(cb_el);
+	before_last->size = 0;
+	pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
+		sizeof(struct rfd), PCI_DMA_TODEVICE);
+
  	nic->rx_to_use = nic->rx_to_clean = nic->rxs;
+	nic->ru_running = RU_SUSPENDED;

  	return 0;
  }
@@ -1916,6 +2017,10 @@ static irqreturn_t e100_intr(int irq, vo
  	/* Ack interrupt(s) */
  	iowrite8(stat_ack, &nic->csr->scb.stat_ack);

+	/* We hit Receive No Resource (RNR); restart RU after cleaning */
+	if(stat_ack & stat_ack_rnr)
+		nic->ru_running = RU_SUSPENDED;
+
  	if(likely(netif_rx_schedule_prep(netdev))) {
  		e100_disable_irq(nic);
  		__netif_rx_schedule(netdev);
@@ -2007,7 +2112,7 @@ static int e100_up(struct nic *nic)
  	if((err = e100_hw_init(nic)))
  		goto err_clean_cbs;
  	e100_set_multicast_list(nic->netdev);
-	e100_start_receiver(nic);
+	e100_start_receiver(nic, NULL);
  	mod_timer(&nic->watchdog, jiffies);
  	if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
  		nic->netdev->name, nic->netdev)))
@@ -2088,7 +2193,7 @@ static int e100_loopback_test(struct nic
  		mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
  			BMCR_LOOPBACK);

-	e100_start_receiver(nic);
+	e100_start_receiver(nic, NULL);

  	if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
  		err = -ENOMEM;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18 14:20                 ` David Acker
@ 2007-05-18 15:29                   ` Kok, Auke
  2007-05-18 15:47                     ` David Acker
  0 siblings, 1 reply; 51+ messages in thread
From: Kok, Auke @ 2007-05-18 15:29 UTC (permalink / raw)
  To: David Acker
  Cc: Kok, Auke, Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, Scott Feldman, netdev

David Acker wrote:
> David Acker wrote:
>> Kok, Auke wrote:
>>> Jeff Garzik wrote:
>>>> Can you resend against the latest kernel (2.6.22-rc1)?
>>>>
>>>> And what does Intel think?
>>> I'm expecting at least a reply from Milton as the patch was sent to 
>>> him. I haven't yet tested it but will certainly do so. At first glance 
>>> it looks OK, and I'll try to put it under my colleague's noses who 
>>> know e100 best.
>>>
>>> A resend against 2.6.22-rc1 would be nice.
>>>
>> Done.  Below is a patch against 2.6.22-rc1.  It combines removing the 
>> s-bit patch and applying the patch I previously sent.
> 
> Oops.  I missed one state in that patch.  Since the el-bit buffer will normally not 
> complete due to a zero size, we need to check if the buffer with no data has the el-bit 
> set.  Without this, you have to wait for the interrupt.  Sorry about that...this was in 
> the code I tested on my embedded system but got lost in the regular kernel patch.

OK. Thanks.

If you don't mind I'm going to have some testing on this patch done for a bit 
now (mostly x86 hardware of course) to see if there's no pitfalls in it. It'll 
be a few days because of the weekend before I get back on it.

Auke

> 
> On the ARM, their is a race condition between software allocating a new receive
> buffer and hardware writing into a buffer.  The two race on touching the last
> Receive Frame Descriptor (RFD).  It has its el-bit set and its next link equal
> to 0.  When hardware encounters this buffer it attempts to write data to it
> and then update Status Word bits and Actual Count in the RFD.  At the same time
> software may try to clear the el-bit and set the link address to a new buffer.
> Since the entire RFD is once cache-line, the two write operations can
> collide.  This can lead to the receive unit stalling or freed receive buffers
> getting written to.
> 
> The fix is to set the el-bit on and the size to 0 on the next to last buffer
> in the chain.  When the hardware encounters this buffer it stops and does
> not write to it at all.  The hardware issues an RNR interrupt with the
> receive unit in the No Resources state.  When software allocates buffers,
> it can update the tail of the list because it knows the hardware will stop
> at the buffer before it.  Once it has a new next to last buffer prepared,
> it can clear the el-bit and set the size on the previous one.  The race on
> this buffer is safe since the link already points to a valid next buffer.
> If the hardware sees the el-bit cleared without the size set, it will
> move on to the next buffer and complete that one in error.  If it sees
> the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.
> 
> 
> Signed-off-by: David Acker <dacker@roinet.com>
> 
> ---
> 
> 
> --- linux-2.6.22-rc1/drivers/net/e100.c.orig	2007-05-18 10:16:03.000000000 -0400
> +++ linux-2.6.22-rc1/drivers/net/e100.c	2007-05-18 10:15:53.000000000 -0400
> @@ -285,6 +285,12 @@ enum scb_status {
>   	rus_mask         = 0x3C,
>   };
> 
> +enum ru_state  {
> +	RU_SUSPENDED = 0,
> +	RU_RUNNING	 = 1,
> +	RU_UNINITIALIZED = -1,
> +};
> +
>   enum scb_stat_ack {
>   	stat_ack_not_ours    = 0x00,
>   	stat_ack_sw_gen      = 0x04,
> @@ -526,6 +532,7 @@ struct nic {
>   	struct rx *rx_to_use;
>   	struct rx *rx_to_clean;
>   	struct rfd blank_rfd;
> +	enum ru_state ru_running;
> 
>   	spinlock_t cb_lock			____cacheline_aligned;
>   	spinlock_t cmd_lock;
> @@ -947,7 +954,7 @@ static void e100_get_defaults(struct nic
>   		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
> 
>   	/* Template for a freshly allocated RFD */
> -	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
> +	nic->blank_rfd.command = cpu_to_le16(cb_el);
>   	nic->blank_rfd.rbd = 0xFFFFFFFF;
>   	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
> 
> @@ -1742,11 +1749,19 @@ static int e100_alloc_cbs(struct nic *ni
>   	return 0;
>   }
> 
> -static inline void e100_start_receiver(struct nic *nic)
> +static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
>   {
> -	/* Start if RFA is non-NULL */
> -	if(nic->rx_to_clean->skb)
> -		e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
> +	if(!nic->rxs) return;
> +	if(RU_SUSPENDED != nic->ru_running) return;
> +
> +	/* handle init time starts */
> +	if(!rx) rx = nic->rxs;
> +
> +	/* (Re)start RU if suspended or idle and RFA is non-NULL */
> +	if(rx->skb) {
> +		e100_exec_cmd(nic, ruc_start, rx->dma_addr);
> +		nic->ru_running = RU_RUNNING;
> +	}
>   }
> 
>   #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
> @@ -1769,13 +1784,12 @@ static int e100_rx_alloc_skb(struct nic
>   	}
> 
>   	/* Link the RFD to end of RFA by linking previous RFD to
> -	 * this one, and clearing EL bit of previous.  */
> +	 * this one.  We are safe to touch the previous RFD because
> +	 * it is protected the before last buffer's el bit being set  */
>   	if(rx->prev->skb) {
>   		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
>   		put_unaligned(cpu_to_le32(rx->dma_addr),
>   			(u32 *)&prev_rfd->link);
> -		wmb();
> -		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
>   		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
>   			sizeof(struct rfd), PCI_DMA_TODEVICE);
>   	}
> @@ -1801,8 +1815,12 @@ static int e100_rx_indicate(struct nic *
>   	DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
> 
>   	/* If data isn't ready, nothing to indicate */
> -	if(unlikely(!(rfd_status & cb_complete)))
> +	if(unlikely(!(rfd_status & cb_complete))) {
> +		/* this allows for a fast restart without re-enabling interrupts */
> +		if(le16_to_cpu(rfd->command) & cb_el)
> +			nic->ru_running = RU_SUSPENDED;
>   		return -ENODATA;
> +	}
> 
>   	/* Get actual data size */
>   	actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
> @@ -1813,6 +1831,10 @@ static int e100_rx_indicate(struct nic *
>   	pci_unmap_single(nic->pdev, rx->dma_addr,
>   		RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
> 
> +	/* this allows for a fast restart without re-enabling interrupts */
> +	if(le16_to_cpu(rfd->command) & cb_el)
> +		nic->ru_running = RU_SUSPENDED;
> +
>   	/* Pull off the RFD and put the actual data (minus eth hdr) */
>   	skb_reserve(skb, sizeof(struct rfd));
>   	skb_put(skb, actual_size);
> @@ -1843,18 +1865,78 @@ static void e100_rx_clean(struct nic *ni
>   	unsigned int work_to_do)
>   {
>   	struct rx *rx;
> +	int restart_required = 0;
> +	struct rx *rx_to_start = NULL;
> +	struct rx *old_before_last_rx, *new_before_last_rx;
> +	struct rfd *old_before_last_rfd, *new_before_last_rfd;
> +
> +	/* are we already rnr? then pay attention!!! this ensures that
> +	 * the state machine progression never allows a start with a
> +	 * partially cleaned list, avoiding a race between hardware
> +	 * and rx_to_clean when in NAPI mode */
> +	if(RU_SUSPENDED == nic->ru_running)
> +		restart_required = 1;
> 
>   	/* Indicate newly arrived packets */
>   	for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
> -		if(e100_rx_indicate(nic, rx, work_done, work_to_do))
> +		int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> +		if(-EAGAIN == err) {
> +			/* hit quota so have more work to do, restart once
> +			 * cleanup is complete */
> +			restart_required = 0;
> +			break;
> +		} else if(-ENODATA == err)
>   			break; /* No more to clean */
>   	}
> 
> +	/* save our starting point as the place we'll restart the receiver */
> +	if(restart_required)
> +		rx_to_start = nic->rx_to_clean;
> +
> +	old_before_last_rx = nic->rx_to_use->prev->prev;
> +	old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
> +
>   	/* Alloc new skbs to refill list */
>   	for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
>   		if(unlikely(e100_rx_alloc_skb(nic, rx)))
>   			break; /* Better luck next time (see watchdog) */
>   	}
> +
> +	new_before_last_rx = nic->rx_to_use->prev->prev;
> +	if (new_before_last_rx != old_before_last_rx) {
> +		/* Set the el-bit on the buffer that is before the last buffer.
> +		 * This lets us update the next pointer on the last buffer without
> +		 * worrying about hardware touching it.
> +		 * We set the size to 0 to prevent hardware from touching this buffer.
> +		 * When the hardware hits the before last buffer with el-bit and size
> +		 * of 0, it will RNR interrupt, the RUS will go into the No Resources
> +		 * state.  It will not complete nor write to this buffer. */
> +		new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data;
> +		new_before_last_rfd->size = 0;
> +		new_before_last_rfd->command |= cpu_to_le16(cb_el);
> +		pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr,
> +			sizeof(struct rfd), PCI_DMA_TODEVICE);
> +
> +		/* Now that we have a new stopping point, we can clear the old
> +		 * stopping point.
> +		 * Note: There appears to be a race here where the hardware
> +		 * can complete this buffer with the el-bit set but with the
> +		 * size also set.  The hardware RNR interrupts, the RUS
> +		 * goes into the No Resources state. */
> +		old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
> +		wmb();
> +		old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
> +		pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr,
> +			sizeof(struct rfd), PCI_DMA_TODEVICE);
> +	}
> +
> +	if(restart_required) {
> +		// ack the rnr?
> +		writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
> +		e100_start_receiver(nic, rx_to_start);
> +		if(work_done)
> +			(*work_done)++;
> +	}
>   }
> 
>   static void e100_rx_clean_list(struct nic *nic)
> @@ -1862,6 +1944,8 @@ static void e100_rx_clean_list(struct ni
>   	struct rx *rx;
>   	unsigned int i, count = nic->params.rfds.count;
> 
> +	nic->ru_running = RU_UNINITIALIZED;
> +
>   	if(nic->rxs) {
>   		for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
>   			if(rx->skb) {
> @@ -1881,8 +1965,10 @@ static int e100_rx_alloc_list(struct nic
>   {
>   	struct rx *rx;
>   	unsigned int i, count = nic->params.rfds.count;
> +	struct rfd *before_last;
> 
>   	nic->rx_to_use = nic->rx_to_clean = NULL;
> +	nic->ru_running = RU_UNINITIALIZED;
> 
>   	if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
>   		return -ENOMEM;
> @@ -1896,7 +1982,22 @@ static int e100_rx_alloc_list(struct nic
>   		}
>   	}
> 
> +	/* Set the el-bit on the buffer that is before the last buffer.
> +	 * This lets us update the next pointer on the last buffer without
> +	 * worrying about hardware touching it.
> +	 * We set the size to 0 to prevent hardware from touching this buffer.
> +	 * When the hardware hits the before last buffer with el-bit and size
> +	 * of 0, it will RNR interrupt, the RUS will go into the No Resources
> +	 * state.  It will not complete nor write to this buffer. */
> +	rx = nic->rxs->prev->prev;
> +	before_last = (struct rfd *)rx->skb->data;
> +	before_last->command |= cpu_to_le16(cb_el);
> +	before_last->size = 0;
> +	pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
> +		sizeof(struct rfd), PCI_DMA_TODEVICE);
> +
>   	nic->rx_to_use = nic->rx_to_clean = nic->rxs;
> +	nic->ru_running = RU_SUSPENDED;
> 
>   	return 0;
>   }
> @@ -1916,6 +2017,10 @@ static irqreturn_t e100_intr(int irq, vo
>   	/* Ack interrupt(s) */
>   	iowrite8(stat_ack, &nic->csr->scb.stat_ack);
> 
> +	/* We hit Receive No Resource (RNR); restart RU after cleaning */
> +	if(stat_ack & stat_ack_rnr)
> +		nic->ru_running = RU_SUSPENDED;
> +
>   	if(likely(netif_rx_schedule_prep(netdev))) {
>   		e100_disable_irq(nic);
>   		__netif_rx_schedule(netdev);
> @@ -2007,7 +2112,7 @@ static int e100_up(struct nic *nic)
>   	if((err = e100_hw_init(nic)))
>   		goto err_clean_cbs;
>   	e100_set_multicast_list(nic->netdev);
> -	e100_start_receiver(nic);
> +	e100_start_receiver(nic, NULL);
>   	mod_timer(&nic->watchdog, jiffies);
>   	if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
>   		nic->netdev->name, nic->netdev)))
> @@ -2088,7 +2193,7 @@ static int e100_loopback_test(struct nic
>   		mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
>   			BMCR_LOOPBACK);
> 
> -	e100_start_receiver(nic);
> +	e100_start_receiver(nic, NULL);
> 
>   	if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
>   		err = -ENOMEM;
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18 15:29                   ` Kok, Auke
@ 2007-05-18 15:47                     ` David Acker
  2007-05-18 15:59                       ` Kok, Auke
  0 siblings, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-18 15:47 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, Scott Feldman, netdev

Kok, Auke wrote:
> David Acker wrote:
>> David Acker wrote:
>>> Done.  Below is a patch against 2.6.22-rc1.  It combines removing the 
>>> s-bit patch and applying the patch I previously sent.
>>
>> Oops.  I missed one state in that patch.  Since the el-bit buffer will 
>> normally not complete due to a zero size, we need to check if the 
>> buffer with no data has the el-bit set.  Without this, you have to 
>> wait for the interrupt.  Sorry about that...this was in the code I 
>> tested on my embedded system but got lost in the regular kernel patch.
> 
> OK. Thanks.
> 
> If you don't mind I'm going to have some testing on this patch done for 
> a bit now (mostly x86 hardware of course) to see if there's no pitfalls 
> in it. It'll be a few days because of the weekend before I get back on it.
> 

Cool.  I will see if I can get some more tests running over the weekend on our PXA255 
platform.

-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18 15:47                     ` David Acker
@ 2007-05-18 15:59                       ` Kok, Auke
  2007-05-18 17:11                         ` David Acker
  0 siblings, 1 reply; 51+ messages in thread
From: Kok, Auke @ 2007-05-18 15:59 UTC (permalink / raw)
  To: David Acker
  Cc: Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, Scott Feldman, netdev

David Acker wrote:
> Kok, Auke wrote:
>> David Acker wrote:
>>> David Acker wrote:
>>>> Done.  Below is a patch against 2.6.22-rc1.  It combines removing the 
>>>> s-bit patch and applying the patch I previously sent.
>>> Oops.  I missed one state in that patch.  Since the el-bit buffer will 
>>> normally not complete due to a zero size, we need to check if the 
>>> buffer with no data has the el-bit set.  Without this, you have to 
>>> wait for the interrupt.  Sorry about that...this was in the code I 
>>> tested on my embedded system but got lost in the regular kernel patch.
>> OK. Thanks.
>>
>> If you don't mind I'm going to have some testing on this patch done for 
>> a bit now (mostly x86 hardware of course) to see if there's no pitfalls 
>> in it. It'll be a few days because of the weekend before I get back on it.
>>
> 
> Cool.  I will see if I can get some more tests running over the weekend on our PXA255 
> platform.

First impression just came in: It seems RX performance is dropped to 10mbit. TX 
is unaffected and runs at 94mbit/tcp, but RX the new code seems to misbehave and 
  fluctuate, dropping below 10mbit after a few netperf runs and staying there...

ideas?

Auke

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18 15:59                       ` Kok, Auke
@ 2007-05-18 17:11                         ` David Acker
  2007-05-18 17:47                           ` Kok, Auke
  2007-05-21 17:35                           ` Milton Miller
  0 siblings, 2 replies; 51+ messages in thread
From: David Acker @ 2007-05-18 17:11 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Jeff Garzik, Milton Miller, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, Scott Feldman, netdev

Kok, Auke wrote:
> First impression just came in: It seems RX performance is dropped to 
> 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code 
> seems to misbehave and  fluctuate, dropping below 10mbit after a few 
> netperf runs and staying there...
> 
> ideas?

I found the problem.  Another casualty of working with two different kernels at once...arg.
The blank rfd needs to have its el-bit clear now.  Here is the new and improved patch.

On the ARM, their is a race condition between software allocating a new receive
buffer and hardware writing into a buffer.  The two race on touching the last
Receive Frame Descriptor (RFD).  It has its el-bit set and its next link equal
to 0.  When hardware encounters this buffer it attempts to write data to it
and then update Status Word bits and Actual Count in the RFD.  At the same time
software may try to clear the el-bit and set the link address to a new buffer.
Since the entire RFD is once cache-line, the two write operations can
collide.  This can lead to the receive unit stalling or freed receive buffers
getting written to.

The fix is to set the el-bit on and the size to 0 on the next to last buffer
in the chain.  When the hardware encounters this buffer it stops and does
not write to it at all.  The hardware issues an RNR interrupt with the
receive unit in the No Resources state.  When software allocates buffers,
it can update the tail of the list because it knows the hardware will stop
at the buffer before it.  Once it has a new next to last buffer prepared,
it can clear the el-bit and set the size on the previous one.  The race on
this buffer is safe since the link already points to a valid next buffer.
If the hardware sees the el-bit cleared without the size set, it will
move on to the next buffer and complete that one in error.  If it sees
the size set but the el-bit still set, it will complete that buffer
and then RNR interrupt and wait.


Signed-off-by: David Acker <dacker@roinet.com>

---

--- linux-2.6.22-rc1/drivers/net/e100.c.orig	2007-05-18 13:08:01.000000000 -0400
+++ linux-2.6.22-rc1/drivers/net/e100.c	2007-05-18 13:08:24.000000000 -0400
@@ -285,6 +285,12 @@ enum scb_status {
  	rus_mask         = 0x3C,
  };

+enum ru_state  {
+	RU_SUSPENDED = 0,
+	RU_RUNNING	 = 1,
+	RU_UNINITIALIZED = -1,
+};
+
  enum scb_stat_ack {
  	stat_ack_not_ours    = 0x00,
  	stat_ack_sw_gen      = 0x04,
@@ -526,6 +532,7 @@ struct nic {
  	struct rx *rx_to_use;
  	struct rx *rx_to_clean;
  	struct rfd blank_rfd;
+	enum ru_state ru_running;

  	spinlock_t cb_lock			____cacheline_aligned;
  	spinlock_t cmd_lock;
@@ -947,7 +954,7 @@ static void e100_get_defaults(struct nic
  		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));

  	/* Template for a freshly allocated RFD */
-	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
+	nic->blank_rfd.command = 0;
  	nic->blank_rfd.rbd = 0xFFFFFFFF;
  	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);

@@ -1742,11 +1749,19 @@ static int e100_alloc_cbs(struct nic *ni
  	return 0;
  }

-static inline void e100_start_receiver(struct nic *nic)
+static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
  {
-	/* Start if RFA is non-NULL */
-	if(nic->rx_to_clean->skb)
-		e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
+	if(!nic->rxs) return;
+	if(RU_SUSPENDED != nic->ru_running) return;
+
+	/* handle init time starts */
+	if(!rx) rx = nic->rxs;
+
+	/* (Re)start RU if suspended or idle and RFA is non-NULL */
+	if(rx->skb) {
+		e100_exec_cmd(nic, ruc_start, rx->dma_addr);
+		nic->ru_running = RU_RUNNING;
+	}
  }

  #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
@@ -1769,13 +1784,12 @@ static int e100_rx_alloc_skb(struct nic
  	}

  	/* Link the RFD to end of RFA by linking previous RFD to
-	 * this one, and clearing EL bit of previous.  */
+	 * this one.  We are safe to touch the previous RFD because
+	 * it is protected the before last buffer's el bit being set  */
  	if(rx->prev->skb) {
  		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
  		put_unaligned(cpu_to_le32(rx->dma_addr),
  			(u32 *)&prev_rfd->link);
-		wmb();
-		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
  		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
  			sizeof(struct rfd), PCI_DMA_TODEVICE);
  	}
@@ -1801,8 +1815,12 @@ static int e100_rx_indicate(struct nic *
  	DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);

  	/* If data isn't ready, nothing to indicate */
-	if(unlikely(!(rfd_status & cb_complete)))
+	if(unlikely(!(rfd_status & cb_complete))) {
+		/* this allows for a fast restart without re-enabling interrupts */
+		if(le16_to_cpu(rfd->command) & cb_el)
+			nic->ru_running = RU_SUSPENDED;
  		return -ENODATA;
+	}

  	/* Get actual data size */
  	actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
@@ -1813,6 +1831,10 @@ static int e100_rx_indicate(struct nic *
  	pci_unmap_single(nic->pdev, rx->dma_addr,
  		RFD_BUF_LEN, PCI_DMA_FROMDEVICE);

+	/* this allows for a fast restart without re-enabling interrupts */
+	if(le16_to_cpu(rfd->command) & cb_el)
+		nic->ru_running = RU_SUSPENDED;
+
  	/* Pull off the RFD and put the actual data (minus eth hdr) */
  	skb_reserve(skb, sizeof(struct rfd));
  	skb_put(skb, actual_size);
@@ -1843,18 +1865,78 @@ static void e100_rx_clean(struct nic *ni
  	unsigned int work_to_do)
  {
  	struct rx *rx;
+	int restart_required = 0;
+	struct rx *rx_to_start = NULL;
+	struct rx *old_before_last_rx, *new_before_last_rx;
+	struct rfd *old_before_last_rfd, *new_before_last_rfd;
+
+	/* are we already rnr? then pay attention!!! this ensures that
+	 * the state machine progression never allows a start with a
+	 * partially cleaned list, avoiding a race between hardware
+	 * and rx_to_clean when in NAPI mode */
+	if(RU_SUSPENDED == nic->ru_running)
+		restart_required = 1;

  	/* Indicate newly arrived packets */
  	for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
-		if(e100_rx_indicate(nic, rx, work_done, work_to_do))
+		int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
+		if(-EAGAIN == err) {
+			/* hit quota so have more work to do, restart once
+			 * cleanup is complete */
+			restart_required = 0;
+			break;
+		} else if(-ENODATA == err)
  			break; /* No more to clean */
  	}

+	/* save our starting point as the place we'll restart the receiver */
+	if(restart_required)
+		rx_to_start = nic->rx_to_clean;
+
+	old_before_last_rx = nic->rx_to_use->prev->prev;
+	old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
+
  	/* Alloc new skbs to refill list */
  	for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
  		if(unlikely(e100_rx_alloc_skb(nic, rx)))
  			break; /* Better luck next time (see watchdog) */
  	}
+
+	new_before_last_rx = nic->rx_to_use->prev->prev;
+	if (new_before_last_rx != old_before_last_rx) {
+		/* Set the el-bit on the buffer that is before the last buffer.
+		 * This lets us update the next pointer on the last buffer without
+		 * worrying about hardware touching it.
+		 * We set the size to 0 to prevent hardware from touching this buffer.
+		 * When the hardware hits the before last buffer with el-bit and size
+		 * of 0, it will RNR interrupt, the RUS will go into the No Resources
+		 * state.  It will not complete nor write to this buffer. */
+		new_before_last_rfd = (struct rfd *)new_before_last_rx->skb->data;
+		new_before_last_rfd->size = 0;
+		new_before_last_rfd->command |= cpu_to_le16(cb_el);
+		pci_dma_sync_single_for_device(nic->pdev, new_before_last_rx->dma_addr,
+			sizeof(struct rfd), PCI_DMA_TODEVICE);
+
+		/* Now that we have a new stopping point, we can clear the old
+		 * stopping point.
+		 * Note: There appears to be a race here where the hardware
+		 * can complete this buffer with the el-bit set but with the
+		 * size also set.  The hardware RNR interrupts, the RUS
+		 * goes into the No Resources state. */
+		old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
+		wmb();
+		old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+		pci_dma_sync_single_for_device(nic->pdev, old_before_last_rx->dma_addr,
+			sizeof(struct rfd), PCI_DMA_TODEVICE);
+	}
+
+	if(restart_required) {
+		// ack the rnr?
+		writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
+		e100_start_receiver(nic, rx_to_start);
+		if(work_done)
+			(*work_done)++;
+	}
  }

  static void e100_rx_clean_list(struct nic *nic)
@@ -1862,6 +1944,8 @@ static void e100_rx_clean_list(struct ni
  	struct rx *rx;
  	unsigned int i, count = nic->params.rfds.count;

+	nic->ru_running = RU_UNINITIALIZED;
+
  	if(nic->rxs) {
  		for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
  			if(rx->skb) {
@@ -1881,8 +1965,10 @@ static int e100_rx_alloc_list(struct nic
  {
  	struct rx *rx;
  	unsigned int i, count = nic->params.rfds.count;
+	struct rfd *before_last;

  	nic->rx_to_use = nic->rx_to_clean = NULL;
+	nic->ru_running = RU_UNINITIALIZED;

  	if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
  		return -ENOMEM;
@@ -1896,7 +1982,22 @@ static int e100_rx_alloc_list(struct nic
  		}
  	}

+	/* Set the el-bit on the buffer that is before the last buffer.
+	 * This lets us update the next pointer on the last buffer without
+	 * worrying about hardware touching it.
+	 * We set the size to 0 to prevent hardware from touching this buffer.
+	 * When the hardware hits the before last buffer with el-bit and size
+	 * of 0, it will RNR interrupt, the RUS will go into the No Resources
+	 * state.  It will not complete nor write to this buffer. */
+	rx = nic->rxs->prev->prev;
+	before_last = (struct rfd *)rx->skb->data;
+	before_last->command |= cpu_to_le16(cb_el);
+	before_last->size = 0;
+	pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
+		sizeof(struct rfd), PCI_DMA_TODEVICE);
+
  	nic->rx_to_use = nic->rx_to_clean = nic->rxs;
+	nic->ru_running = RU_SUSPENDED;

  	return 0;
  }
@@ -1916,6 +2017,10 @@ static irqreturn_t e100_intr(int irq, vo
  	/* Ack interrupt(s) */
  	iowrite8(stat_ack, &nic->csr->scb.stat_ack);

+	/* We hit Receive No Resource (RNR); restart RU after cleaning */
+	if(stat_ack & stat_ack_rnr)
+		nic->ru_running = RU_SUSPENDED;
+
  	if(likely(netif_rx_schedule_prep(netdev))) {
  		e100_disable_irq(nic);
  		__netif_rx_schedule(netdev);
@@ -2007,7 +2112,7 @@ static int e100_up(struct nic *nic)
  	if((err = e100_hw_init(nic)))
  		goto err_clean_cbs;
  	e100_set_multicast_list(nic->netdev);
-	e100_start_receiver(nic);
+	e100_start_receiver(nic, NULL);
  	mod_timer(&nic->watchdog, jiffies);
  	if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
  		nic->netdev->name, nic->netdev)))
@@ -2088,7 +2193,7 @@ static int e100_loopback_test(struct nic
  		mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
  			BMCR_LOOPBACK);

-	e100_start_receiver(nic);
+	e100_start_receiver(nic, NULL);

  	if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
  		err = -ENOMEM;


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18 17:11                         ` David Acker
@ 2007-05-18 17:47                           ` Kok, Auke
  2007-05-21 17:35                           ` Milton Miller
  1 sibling, 0 replies; 51+ messages in thread
From: Kok, Auke @ 2007-05-18 17:47 UTC (permalink / raw)
  To: David Acker, Milton Miller
  Cc: Jeff Garzik, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, Scott Feldman, netdev

David Acker wrote:
> Kok, Auke wrote:
>> First impression just came in: It seems RX performance is dropped to 
>> 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code 
>> seems to misbehave and  fluctuate, dropping below 10mbit after a few 
>> netperf runs and staying there...
>>
>> ideas?
> 
> I found the problem.  Another casualty of working with two different kernels at once...arg.
> The blank rfd needs to have its el-bit clear now.  Here is the new and improved patch.


> -	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
> +	nic->blank_rfd.command = 0;

OK, that fixed it and it's getting 94mbit/tcp rx now without issues. Milton, can 
you look at this patch? I'd like some more feedback. Meanwhile I will try to get 
this tested on a variety of e100 NICs, which will take a few days.

Thanks,

Auke

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-18 17:11                         ` David Acker
  2007-05-18 17:47                           ` Kok, Auke
@ 2007-05-21 17:35                           ` Milton Miller
  2007-05-21 17:45                             ` Kok, Auke
  1 sibling, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-05-21 17:35 UTC (permalink / raw)
  To: David Acker
  Cc: Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, netdev, Kok, Auke

On May 18, 2007, at 12:11 PM, David Acker wrote:

> Kok, Auke wrote:
>> First impression just came in: It seems RX performance is dropped to 
>> 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code 
>> seems to misbehave and  fluctuate, dropping below 10mbit after a few 
>> netperf runs and staying there...
>> ideas?
>
> I found the problem.  Another casualty of working with two different 
> kernels at once...arg.
> The blank rfd needs to have its el-bit clear now.  Here is the new and 
> improved patch.
>
> On the ARM, their is a race condition between software allocating a 
> new receive
> buffer and hardware writing into a buffer.  The two race on touching 
> the last
> Receive Frame Descriptor (RFD).  It has its el-bit set and its next 
> link equal
> to 0.  When hardware encounters this buffer it attempts to write data 
> to it
> and then update Status Word bits and Actual Count in the RFD.  At the 
> same time
> software may try to clear the el-bit and set the link address to a new 
> buffer.
> Since the entire RFD is once cache-line, the two write operations can
> collide.  This can lead to the receive unit stalling or freed receive 
> buffers
> getting written to.
>
> The fix is to set the el-bit on and the size to 0 on the next to last 
> buffer
> in the chain.  When the hardware encounters this buffer it stops and 
> does
> not write to it at all.  The hardware issues an RNR interrupt with the
> receive unit in the No Resources state.  When software allocates 
> buffers,
> it can update the tail of the list because it knows the hardware will 
> stop
> at the buffer before it.  Once it has a new next to last buffer 
> prepared,
> it can clear the el-bit and set the size on the previous one.  The 
> race on
> this buffer is safe since the link already points to a valid next 
> buffer.
> If the hardware sees the el-bit cleared without the size set, it will
> move on to the next buffer and complete that one in error.  If it sees
> the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.
>
>
> Signed-off-by: David Acker <dacker@roinet.com>
>

This patch doesn't apply.  It appears white space damaged somewhere:

(1) blank lines in diff are empty not <space>
(2) unchanged lines starting with tab are <space><space><tab>

After fixing the above I still get:

patching file drivers/net/e100.c
Hunk #1 FAILED at 285.
Hunk #4 FAILED at 1749.
Hunk #8 FAILED at 1865.
Hunk #10 succeeded at 1965 with fuzz 1.
Hunk #11 succeeded at 1982 with fuzz 1.
3 out of 14 hunks FAILED -- saving rejects to file 
drivers/net/e100.c.rej

although I haven't figured out what is wrong.

Proceeding with the review:

Coding style:
(1) if body on seperate line.
(2) space after if before (
(3) The other enums in this driver are not ALL_CAPS
(4) This driver doesn't do CONSTANT != value but value != enum
     (see nic->mac for examples)

I don't understand why the old logic in start_receiver doesn't work.
When would the rx passed not be rx_to_clean ?

The driver sets EL (end-of-list) to stop the receiver.  I was trying
to determine if this is the right decision or we should use suspend
bit.  I think the EL is the right choice because it will count
discarded frames while the card waits to be refilled, it is not
obvious that those counters count in the suspended state.  Also,
we want to use the RU_START not RU_RESUME so that it reads the
RFD again with the size (as opposed to just the S bit)

However, we should not call this state RU_SUSPENDED because
that is a different state in the hw.

I also see that the driver sets the size from 0 back to frame
size.  While there is a wmb() between the removal of EL and the
restore of the frame size, there is only one pci_sync for both
of them.  Since the size is in a separate word, depending on skb
alignment it may be in a different cache line and we may end up
with all orderings being visible to the device.

This patch adds a lot of code that checks if the next RFD has
EL set and assumes that if it sees the next frame to clean has
EL set then the device will have seen it and entered the stopped
state.   In theory, the device might be held off on the PCI bus
and not have yet read the descriptor and stopped.   In addition,
while it anticipates the RNR interrupt and restarts the receiver
from the clean process, it doesn't clear the pending interrupt
and the cpu will still take the interrupt, although it will read
the status and not actually restart the RU.

It would seem simpler to me to just check the device status on
the way out after refilling and restoring the descriptor.  If
the device has stopped, rx_to_clean will still point to the
no-longer-size-0 not EL buffer and will be the correct restart
place.

Also, I think this exposes us to a window with coherent hardware
where the device will see EL cleared but size 0 and will skip
completing the frame but continue on to the next RFD.  The driver
will then issue a start command while the RU is active, which is
prohibited.

This window also means that the device may have skipped this
previously 0-length descriptor and gone on to the next one,
so we will stick waiting for the device to write to this
descriptor while it went on to the next one.

The description of how we use the EL bit and size 0 should go
in the big comment at the top describing the operation of the
driver.  Also, the description of size 0 should be separated
from the description of EL being set, as those are independent.

milton

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-21 17:35                           ` Milton Miller
@ 2007-05-21 17:45                             ` Kok, Auke
  2007-05-22 16:51                               ` Milton Miller
  0 siblings, 1 reply; 51+ messages in thread
From: Kok, Auke @ 2007-05-21 17:45 UTC (permalink / raw)
  To: Milton Miller
  Cc: David Acker, Jeff Garzik, Scott Feldman, e1000-devel,
	Jeff Kirsher, John Ronciak, Jesse Brandeburg, netdev

Milton Miller wrote:
> On May 18, 2007, at 12:11 PM, David Acker wrote:
> 
>> Kok, Auke wrote:
>>> First impression just came in: It seems RX performance is dropped to 
>>> 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new code 
>>> seems to misbehave and  fluctuate, dropping below 10mbit after a few 
>>> netperf runs and staying there...
>>> ideas?
>> I found the problem.  Another casualty of working with two different 
>> kernels at once...arg.
>> The blank rfd needs to have its el-bit clear now.  Here is the new and 
>> improved patch.
>>
>> On the ARM, their is a race condition between software allocating a 
>> new receive
>> buffer and hardware writing into a buffer.  The two race on touching 
>> the last
>> Receive Frame Descriptor (RFD).  It has its el-bit set and its next 
>> link equal
>> to 0.  When hardware encounters this buffer it attempts to write data 
>> to it
>> and then update Status Word bits and Actual Count in the RFD.  At the 
>> same time
>> software may try to clear the el-bit and set the link address to a new 
>> buffer.
>> Since the entire RFD is once cache-line, the two write operations can
>> collide.  This can lead to the receive unit stalling or freed receive 
>> buffers
>> getting written to.
>>
>> The fix is to set the el-bit on and the size to 0 on the next to last 
>> buffer
>> in the chain.  When the hardware encounters this buffer it stops and 
>> does
>> not write to it at all.  The hardware issues an RNR interrupt with the
>> receive unit in the No Resources state.  When software allocates 
>> buffers,
>> it can update the tail of the list because it knows the hardware will 
>> stop
>> at the buffer before it.  Once it has a new next to last buffer 
>> prepared,
>> it can clear the el-bit and set the size on the previous one.  The 
>> race on
>> this buffer is safe since the link already points to a valid next 
>> buffer.
>> If the hardware sees the el-bit cleared without the size set, it will
>> move on to the next buffer and complete that one in error.  If it sees
>> the size set but the el-bit still set, it will complete that buffer
>> and then RNR interrupt and wait.
>>
>>
>> Signed-off-by: David Acker <dacker@roinet.com>
>>
> 
> 
> This patch doesn't apply.  It appears white space damaged somewhere:
> 
> (1) blank lines in diff are empty not <space>
> (2) unchanged lines starting with tab are <space><space><tab>
> 
> After fixing the above I still get:
> 
> patching file drivers/net/e100.c
> Hunk #1 FAILED at 285.
> Hunk #4 FAILED at 1749.
> Hunk #8 FAILED at 1865.
> Hunk #10 succeeded at 1965 with fuzz 1.
> Hunk #11 succeeded at 1982 with fuzz 1.
> 3 out of 14 hunks FAILED -- saving rejects to file 
> drivers/net/e100.c.rej
> 
> although I haven't figured out what is wrong.
> 
> Proceeding with the review:
> 
> Coding style:
> (1) if body on seperate line.
> (2) space after if before (
> (3) The other enums in this driver are not ALL_CAPS
> (4) This driver doesn't do CONSTANT != value but value != enum
>      (see nic->mac for examples)

I sent Milton my copy of this patch which has these style issues corrected and 
applies cleanly to a recent git tree. If anyone else specifically wants a copy 
let me know.

Auke

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-21 17:45                             ` Kok, Auke
@ 2007-05-22 16:51                               ` Milton Miller
  2007-05-22 22:07                                 ` David Acker
  0 siblings, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-05-22 16:51 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, netdev, David Acker

On May 21, 2007, at 12:45 PM, Kok, Auke wrote:

> Milton Miller wrote:
>> On May 18, 2007, at 12:11 PM, David Acker wrote:
>>> Kok, Auke wrote:
>>>> First impression just came in: It seems RX performance is dropped 
>>>> to 10mbit. TX is unaffected and runs at 94mbit/tcp, but RX the new 
>>>> code seems to misbehave and  fluctuate, dropping below 10mbit after 
>>>> a few netperf runs and staying there...
>>>> ideas?
>>> I found the problem.  Another casualty of working with two different 
>>> kernels at once...arg.
>>> The blank rfd needs to have its el-bit clear now.  Here is the new 
>>> and improved patch.
...
>> Proceeding with the review:
>> Coding style:
>> (1) if body on seperate line.
>> (2) space after if before (
>> (3) The other enums in this driver are not ALL_CAPS
>> (4) This driver doesn't do CONSTANT != value but value != enum
>>      (see nic->mac for examples)
>
> I sent Milton my copy of this patch which has these style issues 
> corrected and
> applies cleanly to a recent git tree. If anyone else specifically 
> wants a copy
> let me know.
>
> Auke

It addressed 1 and 2, and applies, but did not address 3 and 4.

But the bigger point is it didn't address the holes I identified.

I think we need to change the logic to reclaim the size from 0
only if we are restarting, and make rx_indicate look ahead to
rx->next if it encounters a !EL size 0 buffer.  Without this we
are doing a "prohibited" rx_start to a running machine.  The
device can still see this size 0 !EL state.  Also we will get
stuck when the device finds the window between the two writes.

We can remove some register pressure by finding old_before_last_rfd
when we are ready to use it, just comparing old_before_last_rx
to new.

Also, as I pointed out, the rx_to_start change to start_reciever
is compicated and unnecessary, as rx_to_clean can always be used
and it was the starting point before the changes.

As far as the RU_SUSPENDED, I don't think we need it, instead
we should poll the device.

Here is my proposal:
rx_indicate can stop when it hits the packet with EL.  If it
hits a packet with size 0, look ahead to rx->next to see if
it is complete, if so complete this one otherwise leave it
as next to clean.  After the rx_indicate loop, try to allocate
more skbs.  If we are successful, then fixup the before-next
as we do now.  Then check the device status for RNR, and if
its stopped then set rx_to_clean rfd size back to ETH_LEN
and restart the reciever.

This does have a small hole: if we only add one packet at
a time we will end up with all size 0 descriptors in the
lopp.   We can detect that and not remove EL from the old
before-next unless we are restarting.  That would imply
moving the status poll before we allocate the list.

milton

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-22 16:51                               ` Milton Miller
@ 2007-05-22 22:07                                 ` David Acker
  2007-05-23 14:02                                   ` Milton Miller
  0 siblings, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-22 22:07 UTC (permalink / raw)
  To: Milton Miller
  Cc: Kok, Auke, Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, netdev

Milton Miller wrote:
>>> Proceeding with the review:
>>> Coding style:
>>> (1) if body on seperate line.
>>> (2) space after if before (
>>> (3) The other enums in this driver are not ALL_CAPS
>>> (4) This driver doesn't do CONSTANT != value but value != enum
>>>      (see nic->mac for examples)
>>
>> I sent Milton my copy of this patch which has these style issues 
>> corrected and
>> applies cleanly to a recent git tree. If anyone else specifically 
>> wants a copy
>> let me know.
>>
>> Auke
> 
> It addressed 1 and 2, and applies, but did not address 3 and 4.
> 

Sorry about the style bugs.  I will be more careful about that next time.

Many of the issues you bring have been in the e100 for some time.  If 
you ignore the s-bit patch, I basically did the the following:
moved the el-bit to before the last buffer so that the last buffer was 
protected during chaining.
set the el-bit buffer (next-to-last) size to 0 so that it is not written 
to while we are writing to it.

This seemed the only way to protect the buffers we are changing while 
having code to stay ahead of the hardware.

(3) The driver had these all caps constants before my patch.  They went 
away with the s-bit patch.  I just put them back.  I agree they stick 
out but I wanted to leave style changes out of a bug fix patch.

I agree about the name of the constant, RU_SUSPENDED.  It is not 
accurate and I had a patch to fix this when I experimented with using 
the S-bit.

> 
> But the bigger point is it didn't address the holes I identified.

> I also see that the driver sets the size from 0 back to frame
> size.  While there is a wmb() between the removal of EL and the
> restore of the frame size, there is only one pci_sync for both
> of them.  Since the size is in a separate word, depending on skb
> alignment it may be in a different cache line and we may end up
> with all orderings being visible to the device. 
Hmmm...interesting.  The start of the data of the skb may not be aligned 
on a cache line.  Ok DMA experts...what happens when you sync across 
cache lines?  It should dump both of them, right?  I guess the orderings 
could vary...hence why I saw completions with size set but the el-bit 
still set.  So perhaps the skb alloc needs to be aligned by cache line? 
     This issue existed before my patches as well.


> This patch adds a lot of code that checks if the next RFD has
> EL set and assumes that if it sees the next frame to clean has
> EL set then the device will have seen it and entered the stopped
> state.   In theory, the device might be held off on the PCI bus
> and not have yet read the descriptor and stopped.   In addition,
> while it anticipates the RNR interrupt and restarts the receiver
> from the clean process, it doesn't clear the pending interrupt
> and the cpu will still take the interrupt, although it will read
> the status and not actually restart the RU. 
So the device is just before the el-bit buffer and stops.  We get a poll 
call with interrupts disabled and see the el-bit buffer and decide we 
need to restart.  First we alloc new buffers and move the el-bit/size 0 
down.  The restart occurs on the next poll (interrupts are not turned on 
if we did any work).  In theory, the hardware could have read the buffer 
that no longer has a size of 0 and tried to use it when we hit it with 
the start.  I am not sure how the hardware handles this.  Perhaps it is 
ignored and thus harmless.  We could poll the hardware to check before 
actually sending the start...it adds an extra pci operation but would 
avoid a command that is illegal in the manual.

If we ignore a buffer with the el-bit set but without the complete bit, 
we must wait for the hardware to RNR interrupt.  I have found that when 
a buffer has both the el-bit set and the size set to 0, it will not set 
the complete bit on that buffer.  This was part of the reason why my 
first patch had hardware just using the list of buffers given all the 
way up until the end.

> I think we need to change the logic to reclaim the size from 0
> only if we are restarting, and make rx_indicate look ahead to
> rx->next if it encounters a !EL size 0 buffer.  Without this we
> are doing a "prohibited" rx_start to a running machine.
The hardware is only restarted when we enter the RU_SUSPENDED state. 
This only happens when we:
1) get an RNR interrupt
2) get an EL-bit buffer without a completion
3) get an el-bit buffer with a completion (hardware saw size set but not 
el-bit clear)

State 2) seems to be the problem.  Get rid of that and we wait for an 
interrupt to tell us it saw the buffer in most cases.  Of course, then 
we always wait.

If we do not reclaim the size from 0, but clear the el-bit, that buffer 
will always return an error.

 > The
> device can still see this size 0 !EL state.  Also we will get
> stuck when the device finds the window between the two writes.

 > This window also means that the device may have skipped this
 > previously 0-length descriptor and gone on to the next one,
 > so we will stick waiting for the device to write to this
 > descriptor while it went on to the next one.
If the hardware sees both size 0 and EL, it RNR interrupts, does not 
write to the buffer at all and stops.

If the hardware sees size 0 and !EL, (which I never saw in testing but 
could happen) it would return the buffer with an error and continue. 
This is ok since we already have written a new size 0 and EL.  Our 
polling code would just find a packet completed in error.

If the hardware sees size set and EL, it RNR interrupts, and does write 
data and status to the buffer, and then stops.  I have see this one 
occur and it isn't a problem.

If hardware sees size set and !EL, it writes data and continues.


> We can remove some register pressure by finding old_before_last_rfd
> when we are ready to use it, just comparing old_before_last_rx
> to new.
Sure.

> 
> Also, as I pointed out, the rx_to_start change to start_reciever
> is compicated and unnecessary, as rx_to_clean can always be used
> and it was the starting point before the changes.
This was in the driver before my patch and the s-bit patch.  I agree 
that it could go away as rx_to_clean should not be capable of changing.

> 
> As far as the RU_SUSPENDED, I don't think we need it, instead
> we should poll the device.
RU_SUSPENDED is pretty much used as it was before all the patches. The 
only waste I found is that if we find the EL-bit while interrupts are 
disabled (say during a NAPI poll), we set RU_SUSPENDED and then have to 
wait for the next poll to figure out that we need to restart the receiver.

> 
> Here is my proposal:
> rx_indicate can stop when it hits the packet with EL.  If it
> hits a packet with size 0, look ahead to rx->next to see if
> it is complete, if so complete this one otherwise leave it
> as next to clean.  After the rx_indicate loop, try to allocate
> more skbs.  If we are successful, then fixup the before-next
> as we do now.  Then check the device status for RNR, and if
> its stopped then set rx_to_clean rfd size back to ETH_LEN
> and restart the reciever.
> 
> This does have a small hole: if we only add one packet at
> a time we will end up with all size 0 descriptors in the
> lopp.   We can detect that and not remove EL from the old
> before-next unless we are restarting.  That would imply
> moving the status poll before we allocate the list.
Can you send a patch or pseudo-code?  It would help to see some main parts:
1. blank_rfd setup
2. single buffer alloc
3. el-bit handling (when do we set it and when do we clear it)
4. size handling (when do we set it and when do we clear it)


What if we just fixed the alignment of rfd to make sure it is cache 
aligned?  Then we know our syncs will be a single cache flush.
-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-22 22:07                                 ` David Acker
@ 2007-05-23 14:02                                   ` Milton Miller
  2007-05-23 21:32                                     ` David Acker
  2007-05-24  4:13                                     ` Milton Miller
  0 siblings, 2 replies; 51+ messages in thread
From: Milton Miller @ 2007-05-23 14:02 UTC (permalink / raw)
  To: David Acker
  Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, netdev, Scott Feldman

I tried to remove anything we were in agreement with.

On May 22, 2007, at 5:07 PM, David Acker wrote:
> Milton Miller wrote:

> Many of the issues you bring have been in the e100 for some time.  If 
> you ignore the s-bit patch, I basically did the the following:
> moved the el-bit to before the last buffer so that the last buffer was 
> protected during chaining.
> set the el-bit buffer (next-to-last) size to 0 so that it is not 
> written to while we are writing to it.
>
> This seemed the only way to protect the buffers we are changing while 
> having code to stay ahead of the hardware.

I agree with this part of the approach.  I just think we need
a bit more work on the "what to do when we are ready for
hardware to not stop" part.

>> But the bigger point is it didn't address the holes I identified.
>>
>> I also see that the driver sets the size from 0 back to frame
>> size.  While there is a wmb() between the removal of EL and the
>> restore of the frame size, there is only one pci_sync for both
>> of them.  Since the size is in a separate word, depending on skb
>> alignment it may be in a different cache line and we may end up
>> with all orderings being visible to the device.
>
> Hmmm...interesting.  The start of the data of the skb may not be 
> aligned on a cache line.  Ok DMA experts...what happens when you sync 
> across cache lines?  It should dump both of them, right?  I guess the 
> orderings could vary...hence why I saw completions with size set but 
> the el-bit still set.  So perhaps the skb alloc needs to be aligned by 
> cache line?     This issue existed before my patches as well.

The sync is required to push both cache lines, but there is no
ordering guarantee.  This probably is why you saw size and el
set.  Aligning the RFD to a cache line conflicts with aligning
the payload (IP header and data) to a word boundary, and
depending on cache line size it may be impossible to do both.

And it won't fix the hole for coherent dma machines.

It wasn't an issue before because we never set two fields.

>> This patch adds a lot of code that checks if the next RFD has
>> EL set and assumes that if it sees the next frame to clean has
>> EL set then the device will have seen it and entered the stopped
>> state.   In theory, the device might be held off on the PCI bus
>> and not have yet read the descriptor and stopped.   In addition,
>> while it anticipates the RNR interrupt and restarts the receiver
>> from the clean process, it doesn't clear the pending interrupt
>> and the cpu will still take the interrupt, although it will read
>> the status and not actually restart the RU.
>
> So the device is just before the el-bit buffer and stops.  We get a 
> poll call with interrupts disabled and see the el-bit buffer and 
> decide we need to restart.  First we alloc new buffers and move the 
> el-bit/size 0 down.  The restart occurs on the next poll (interrupts 
> are not turned on if we did any work).  In theory, the hardware could 
> have read the buffer that no longer has a size of 0 and tried to use 
> it when we hit it with the start.  I am not sure how the hardware 
> handles this.  Perhaps it is ignored and thus harmless.  We could poll 
> the hardware to check before actually sending the start...it adds an 
> extra pci operation but would avoid a command that is illegal in the 
> manual.
>
> If we ignore a buffer with the el-bit set but without the complete 
> bit, we must wait for the hardware to RNR interrupt.  I have found 
> that when a buffer has both the el-bit set and the size set to 0, it 
> will not set the complete bit on that buffer.  This was part of the 
> reason why my first patch had hardware just using the list of buffers 
> given all the way up until the end.


My current reading of the manual is that the C bit will not be
set on an RFD that is size 0.  It goes on to processes EL and
S, and decides to stop and interrupt RNR or suspend, or just
go to the next packet.

>> I think we need to change the logic to reclaim the size from 0
>> only if we are restarting, and make rx_indicate look ahead to
>> rx->next if it encounters a !EL size 0 buffer.  Without this we
>> are doing a "prohibited" rx_start to a running machine.
>
> The hardware is only restarted when we enter the RU_SUSPENDED state. 
> This only happens when we:
> 1) get an RNR interrupt
> 2) get an EL-bit buffer without a completion
> 3) get an el-bit buffer with a completion (hardware saw size set but 
> not el-bit clear)
>
> State 2) seems to be the problem.  Get rid of that and we wait for an 
> interrupt to tell us it saw the buffer in most cases.  Of course, then 
> we always wait.

If we poll the device, I think it will be rare that it
is not stopped when we see the EL bit bit it has not
yet read the next RFD.  But it is a race and I think
a device read is worth avoiding something prohibited
in the manual.

> If we do not reclaim the size from 0, but clear the el-bit, that 
> buffer will always return an error.

I disagree.  If the size is 0, it is a special case and no
error will be generated, instead the RU will just process
EL and S then proceed to the next buffer.

If the size was smaller than the incoming packet but not 0
then it would be completed in error.

It does make are "did the device go past" logic harder.
If the buffer ever was 0, then we need to see if the
next never-zero buffer has completed, and reclaim all
the buffers in between.

>> The device can still see this size 0 !EL state.  Also we will
>> get stuck when the device finds the window between the two
>> writes.
>
>> This window also means that the device may have skipped this
>> previously 0-length descriptor and gone on to the next one,
>> so we will stick waiting for the device to write to this
>> descriptor while it went on to the next one.
>
> If the hardware sees both size 0 and EL, it RNR interrupts, does not 
> write to the buffer at all and stops.
>
> If the hardware sees size 0 and !EL, (which I never saw in testing but 
> could happen) it would return the buffer with an error and continue. 
> This is ok since we already have written a new size 0 and EL.  Our 
> polling code would just find a packet completed in error.

[A bit repetitious since we are replying to two threads]

Again, I think there is no completion, error or not, and our
clean will get stuck while the hardware goes on.

> If the hardware sees size set and EL, it RNR interrupts, and does 
> write data and status to the buffer, and then stops.  I have see this 
> one occur and it isn't a problem.
>
> If hardware sees size set and !EL, it writes data and continues.
>
>
>> As far as the RU_SUSPENDED, I don't think we need it, instead
>> we should poll the device.
> RU_SUSPENDED is pretty much used as it was before all the patches. The 
> only waste I found is that if we find the EL-bit while interrupts are 
> disabled (say during a NAPI poll), we set RU_SUSPENDED and then have 
> to wait for the next poll to figure out that we need to restart the 
> receiver.

I have no problem with reading the device when we think it may
have hit EL and restarting on this poll.  I even prefer it.

If we do restart, we should take the time to ack the RNR
interrupt so we don't process it again.

>> Here is my proposal:
>> rx_indicate can stop when it hits the packet with EL.  If it
>> hits a packet with size 0, look ahead to rx->next to see if
>> it is complete, if so complete this one otherwise leave it
>> as next to clean.  After the rx_indicate loop, try to allocate
>> more skbs.  If we are successful, then fixup the before-next
>> as we do now.  Then check the device status for RNR, and if
>> its stopped then set rx_to_clean rfd size back to ETH_LEN
>> and restart the reciever.
>> This does have a small hole: if we only add one packet at
>> a time we will end up with all size 0 descriptors in the
>> lopp.   We can detect that and not remove EL from the old
>> before-next unless we are restarting.  That would imply
>> moving the status poll before we allocate the list.
>
> Can you send a patch or pseudo-code?  It would help to see some main 
> parts:
> 1. blank_rfd setup
> 2. single buffer alloc
> 3. el-bit handling (when do we set it and when do we clear it)
> 4. size handling (when do we set it and when do we clear it)

I was afraid you were going to ask for code :-)   I'll try
to find some time to write some.   I'll need to look for
an adapter to do any testing.  They exist in some of the
machines at work.

> What if we just fixed the alignment of rfd to make sure it is cache 
> aligned?  Then we know our syncs will be a single cache flush.

That doesn't help in the coherent dma case; the device can still
read in between, and we need to handle the !EL size 0 case.

Actually, I thought about it some more after yesterdays note.
We can set the size !0 after making sure the device sees EL
(wmb and pci_sync_for_device), as long as we remember this
buffer used to be size 0 when doing RX clean, and look ahead
for another complete packet as long as the size was ever 0.

My proposal was just to use "size is still 0" as that flag,
wasting the space allocated for that skb.

Another aproach is to always allocate a rfd-only packet
at the beginning of the loop, and inserting th


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-23 14:02                                   ` Milton Miller
@ 2007-05-23 21:32                                     ` David Acker
  2007-05-24  5:26                                       ` Milton Miller
  2007-05-24  4:13                                     ` Milton Miller
  1 sibling, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-23 21:32 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, netdev, Scott Feldman

Milton Miller wrote:
> I agree with this part of the approach.  I just think we need
> a bit more work on the "what to do when we are ready for
> hardware to not stop" part.
Agreed.

>
> The sync is required to push both cache lines, but there is no
> ordering guarantee.  This probably is why you saw size and el
> set.  Aligning the RFD to a cache line conflicts with aligning
> the payload (IP header and data) to a word boundary, and
> depending on cache line size it may be impossible to do both.
> 
> And it won't fix the hole for coherent dma machines.
Yep.

> It wasn't an issue before because we never set two fields.
Before the driver was setting link and clearing the el-bit on the last 
buffer.  This resulted in badness when hardware saw the el-bit cleared 
but link set to 0.

> My current reading of the manual is that the C bit will not be
> set on an RFD that is size 0.  It goes on to processes EL and
> S, and decides to stop and interrupt RNR or suspend, or just
> go to the next packet.
I double checked this with a quick experiment and it appears you are 
correct.

What about if we always did the following:
set the size:
sync();
clear el-bit
sync()

Then if the hardware sees just the size set, the packet completes but 
with the el-bit and we know we need to restart since it completed.
If it sees the size == 0, and the el bit set, it stops and RNR interrupts.

When we find a buffer that is not completed but has the el-bit set, we 
read the status byte of the status control block to see if the RU is in 
the no resources state.  If it isn't, it means that we found that buffer 
  before the hardware did and thus need to wait for it.  We will either 
find it on the next poll or enable interrupts and get told about it by 
hardware.

What do you think?
-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-23 21:32                                     ` David Acker
@ 2007-05-24  5:26                                       ` Milton Miller
  2007-05-24 11:21                                         ` Milton Miller
  2007-05-24 12:44                                         ` David Acker
  0 siblings, 2 replies; 51+ messages in thread
From: Milton Miller @ 2007-05-24  5:26 UTC (permalink / raw)
  To: David Acker
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher, Jeff Garzik

On May 23, 2007, at 4:32 PM, David Acker wrote:
> Milton Miller wrote:
>> My current reading of the manual is that the C bit will not be
>> set on an RFD that is size 0.  It goes on to processes EL and
>> S, and decides to stop and interrupt RNR or suspend, or just
>> go to the next packet.
> I double checked this with a quick experiment and it appears you are 
> correct.
>
> What about if we always did the following:
> set the size:
> sync();
> clear el-bit
> sync()
>
> Then if the hardware sees just the size set, the packet completes but 
> with the el-bit and we know we need to restart since it completed.
> If it sees the size == 0, and the el bit set, it stops and RNR 
> interrupts.

I think this is exposed to a hole and a race:  we don't know if the 
hardware
read the RFD before we set the size or after, just that it was before 
the EL
bit was cleared.  If it read it before the size was set, then it will 
not
set the C bit.  If it reads it after the size is set, it will complete 
it.

For coherent DMA we can always observe the C bit.  But for the 
incoherent DMA
case, our store to clear the EL bit may overwrite the dma from the 
device to
the beginning of the packet, or the write to EOF, F, and size, and/or 
the
write to C, OK, and status bits to tell us its done.  In the worst 
case, we
would overwrite the beginning of the data but catch the C bit and even 
the
actual size, and therefore would receive corrupted data.

We can only detect the hardware went RNR when it does so or decide we 
won the
race when it receives and completes the next frame.

> When we find a buffer that is not completed but has the el-bit set, we 
> read the status byte of the status control block to see if the RU is 
> in the no resources state.  If it isn't, it means that we found that 
> buffer  before the hardware did and thus need to wait for it.  We will 
> either find it on the next poll or enable interrupts and get told 
> about it by hardware.
>
> What do you think?

I think the second part is good ...

Ok here's my just-before-dinner brainstorming, as relayed after dinner:

We add two flags to struct rx:  one says this packet is EL, and one says
it is or was size 0.   We create a function, find_mark_el(struct nic,
is_running) that is called after initial alloc and/or after refilling
skb list.   In find_mark_el, we start with rx_to_use (lets rename this
rx_to_fill), and go back two entries, call this rx_to_mark.   If
is_running and rx_to_mark->prev->el then just return, leave EL alone.
Otherwise, set EL and size to 0, set the two flags in the rx struct,
sync the RFD, then search for the current EL (we could save the pointer,
but its always the odl rx_to_use (fill) or its ->prev).  Clear RFD->EL,
sync,  clear rx->el.  Set size non-zero, sync,  Leave the was_0 flag set
if is_running (clear it only if we know reciever is stopped).

At this point, if the receiver was stopped we can restart it,.  We
should do so in the caller.  (We always restart at rx_to_clean).
Restart should ack the RNR status before issuing the ru_start command to
avoid a spurious RNR interrupt.

In the receive loop, if RFD->C is not set, rx->was_0 is set and el
is not set, then we need to check rx->next->RFD->C bit (after
pci_sync_for_cpu).   If the next packet C bit is set then we consider
this skb as used, and just complete it with no data to the stack.

Because find_mark_el will only advance EL if the receiver was stopped
or we had more than 1 packet added, we can guarantee that if packet
N has was_0 set, then packet N+1 will not have it set, so we have
bounded lookahead.

This adds two flags to struct rx, but that is allocated as a single
array and the array size is filled out as forward and back pointers.
Rx clean only has to look at was_0 on the last packet when it
decides C is not set, and only if both are set does it peek at the
next packet.  In other words, we shouldn't worry about the added
flags making it a non-power-of-2 size.

By setting size != 0 after we have modified all other fields, the
device will only write to this packet if we are done writing.  If
we loose the race and only partially complete, it will just go on
to the next packet and we find it.  If we totally loose, then the
receiver will go RNR and we can reclaim all the buffer space we
have allocated.

Comments?  Questions?

We need to enforce a minimum rx ring size (3? 4?).

We rely on other mechanisms to guarantee the RFD in this skb
will not cache line conflict with the data in antoher scb
(slab allocs of the skb should give us this).

milton

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-24  5:26                                       ` Milton Miller
@ 2007-05-24 11:21                                         ` Milton Miller
  2007-05-24 12:51                                           ` David Acker
  2007-05-29 15:58                                           ` David Acker
  2007-05-24 12:44                                         ` David Acker
  1 sibling, 2 replies; 51+ messages in thread
From: Milton Miller @ 2007-05-24 11:21 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, e1000-devel, David Acker, netdev, Jesse Brandeburg,
	Kok, Auke, Jeff Kirsher, Scott Feldman, John Ronciak

Some further thoughts ...

On May 24, 2007, at 12:26 AM, Milton Miller wrote:
> On May 23, 2007, at 4:32 PM, David Acker wrote:
>> Milton Miller wrote:
>>> My current reading of the manual is that the C bit will not be
>>> set on an RFD that is size 0.  It goes on to processes EL and
>>> S, and decides to stop and interrupt RNR or suspend, or just
>>> go to the next packet.
>> I double checked this with a quick experiment and it appears you are 
>> correct.
>> When we find a buffer that is not completed but has the el-bit set, 
>> we read the status byte of the status control block to see if the RU 
>> is in the no resources state.  If it isn't, it means that we found 
>> that buffer  before the hardware did and thus need to wait for it.  
>> We will either find it on the next poll or enable interrupts and get 
>> told about it by hardware.
>>
>> What do you think?
>
> I think the second part is good ...
>
> Ok here's my just-before-dinner brainstorming, as relayed after dinner:
>
> We add two flags to struct rx:  one says this packet is EL, and one 
> says
> it is or was size 0.   We create a function, find_mark_el(struct nic,
> is_running) that is called after initial alloc and/or after refilling
> skb list.   In find_mark_el, we start with rx_to_use (lets rename this
> rx_to_fill), and go back two entries, call this rx_to_mark.   If
> is_running and rx_to_mark->prev->el then just return, leave EL alone.
> Otherwise, set EL and size to 0, set the two flags in the rx struct,
> sync the RFD, then search for the current EL (we could save the 
> pointer,
> but its always the odl rx_to_use (fill) or its ->prev).  Clear RFD->EL,
> sync,  clear rx->el.  Set size non-zero, sync,  Leave the was_0 flag 
> set
> if is_running (clear it only if we know reciever is stopped).
>
> At this point, if the receiver was stopped we can restart it,.  We
> should do so in the caller.  (We always restart at rx_to_clean).
> Restart should ack the RNR status before issuing the ru_start command 
> to
> avoid a spurious RNR interrupt.
>
> In the receive loop, if RFD->C is not set, rx->was_0 is set and el
> is not set, then we need to check rx->next->RFD->C bit (after
> pci_sync_for_cpu).   If the next packet C bit is set then we consider
> this skb as used, and just complete it with no data to the stack.
>
> Because find_mark_el will only advance EL if the receiver was stopped
> or we had more than 1 packet added, we can guarantee that if packet
> N has was_0 set, then packet N+1 will not have it set, so we have
> bounded lookahead.
>
> This adds two flags to struct rx, but that is allocated as a single
> array and the array size is filled out as forward and back pointers.
> Rx clean only has to look at was_0 on the last packet when it
> decides C is not set, and only if both are set does it peek at the
> next packet.  In other words, we shouldn't worry about the added
> flags making it a non-power-of-2 size.
>
> By setting size != 0 after we have modified all other fields, the
> device will only write to this packet if we are done writing.  If
> we loose the race and only partially complete, it will just go on
> to the next packet and we find it.  If we totally loose, then the
> receiver will go RNR and we can reclaim all the buffer space we
> have allocated.
>
> Comments?  Questions?
>
> We need to enforce a minimum rx ring size (3? 4?).
>
> We rely on other mechanisms to guarantee the RFD in this skb
> will not cache line conflict with the data in antoher scb
> (slab allocs of the skb should give us this).

Copying EL to a flag in rx is only to avoid additional
reading of the rfd while it might be under dma.  We do
need the was_0 flag.

Do we need to continue with the stop-before-last plan?  As
long as we set size to 0 with EL, we shoud be able to change
the link, sync, set size 0, sync, and then set size.   We
still need the "advance at least 2 if not stopped" check when
deciding to move the EL.  This would break if the hardware
in the dma path can read the multiple cache lines of the
RFD out of order, so it may not be safe (if some pci host
decided to prefetch data, and got the second line ahead of
the first and didn't discard prefetch until pci bus
disconnect).  Actually, I am afraid I know hardware that
might do that.

[defer]
Currently we handle failed allocs by doing a sw interrupt
in the watchdog.  Since we fail ifup if we can't alloc
rxs, we can always start the reciever, even if we didn't
alloc a new packet -- it will just read the RFD and go RNR
again.   We could make this sw interrupt conditional on
rx_to_clean->el being set.   However, looking further, it
appears this 2s watchdog induced watchdog also covers
makeing sure that we reattempt netif_schedule_prep_rx when
it fails in e100_intr.   Any change in this area should be
in a seperate patch, and probably delayed at least one
release.    I also note that netpoll_controller calls
e100_intr, which will call into the netif_rx_schedule
only when a device interrupt is active.

milton


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-24 11:21                                         ` Milton Miller
@ 2007-05-24 12:51                                           ` David Acker
  2007-05-24 14:25                                             ` Milton Miller
  2007-05-29 15:58                                           ` David Acker
  1 sibling, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-24 12:51 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, e1000-devel, netdev, Jesse Brandeburg, Kok, Auke,
	Jeff Kirsher, Scott Feldman, John Ronciak

Milton Miller wrote:
>> Ok here's my just-before-dinner brainstorming, as relayed after dinner:
>>
>> We add two flags to struct rx:  one says this packet is EL, and one says
>> it is or was size 0.   We create a function, find_mark_el(struct nic,
>> is_running) that is called after initial alloc and/or after refilling
>> skb list.   In find_mark_el, we start with rx_to_use (lets rename this
>> rx_to_fill), and go back two entries, call this rx_to_mark.   If
>> is_running and rx_to_mark->prev->el then just return, leave EL alone.
>> Otherwise, set EL and size to 0, set the two flags in the rx struct,
>> sync the RFD, then search for the current EL (we could save the pointer,
>> but its always the odl rx_to_use (fill) or its ->prev).  Clear RFD->EL,
>> sync,  clear rx->el.  Set size non-zero, sync,  Leave the was_0 flag set
>> if is_running (clear it only if we know reciever is stopped).
>>
>> At this point, if the receiver was stopped we can restart it,.  We
>> should do so in the caller.  (We always restart at rx_to_clean).
>> Restart should ack the RNR status before issuing the ru_start command to
>> avoid a spurious RNR interrupt.
>>
>> In the receive loop, if RFD->C is not set, rx->was_0 is set and el
>> is not set, then we need to check rx->next->RFD->C bit (after
>> pci_sync_for_cpu).   If the next packet C bit is set then we consider
>> this skb as used, and just complete it with no data to the stack.
>>
>> Because find_mark_el will only advance EL if the receiver was stopped
>> or we had more than 1 packet added, we can guarantee that if packet
>> N has was_0 set, then packet N+1 will not have it set, so we have
>> bounded lookahead.
>>
>> This adds two flags to struct rx, but that is allocated as a single
>> array and the array size is filled out as forward and back pointers.
>> Rx clean only has to look at was_0 on the last packet when it
>> decides C is not set, and only if both are set does it peek at the
>> next packet.  In other words, we shouldn't worry about the added
>> flags making it a non-power-of-2 size.
>>
>> By setting size != 0 after we have modified all other fields, the
>> device will only write to this packet if we are done writing.  If
>> we loose the race and only partially complete, it will just go on
>> to the next packet and we find it.  If we totally loose, then the
>> receiver will go RNR and we can reclaim all the buffer space we
>> have allocated.
>>
>> Comments?  Questions?
This sounds pretty reasonable.  I will take a stab at coding this up 
today; I always think better looking at code.


>> We need to enforce a minimum rx ring size (3? 4?).
The code currently limits ethtool -G ethX rx calls to 16.

>> We rely on other mechanisms to guarantee the RFD in this skb
>> will not cache line conflict with the data in antoher scb
>> (slab allocs of the skb should give us this).
Yep.


> Copying EL to a flag in rx is only to avoid additional
> reading of the rfd while it might be under dma.  We do
> need the was_0 flag.
> 
> Do we need to continue with the stop-before-last plan?  As
> long as we set size to 0 with EL, we shoud be able to change
> the link, sync, set size 0, sync, and then set size.
Perhaps not.  I can take a try at coding it without it.  It would 
certainly make the driver easier to follow.

> We still need the "advance at least 2 if not stopped" check when
> deciding to move the EL.  This would break if the hardware
> in the dma path can read the multiple cache lines of the
> RFD out of order, so it may not be safe (if some pci host
> decided to prefetch data, and got the second line ahead of
> the first and didn't discard prefetch until pci bus
> disconnect).  Actually, I am afraid I know hardware that
> might do that.
Hmm, me too.

> 
> [defer]
> Currently we handle failed allocs by doing a sw interrupt
> in the watchdog.  Since we fail ifup if we can't alloc
> rxs, we can always start the reciever, even if we didn't
> alloc a new packet -- it will just read the RFD and go RNR
> again.   We could make this sw interrupt conditional on
> rx_to_clean->el being set.   However, looking further, it
> appears this 2s watchdog induced watchdog also covers
> makeing sure that we reattempt netif_schedule_prep_rx when
> it fails in e100_intr.   Any change in this area should be
> in a seperate patch, and probably delayed at least one
> release.    I also note that netpoll_controller calls
> e100_intr, which will call into the netif_rx_schedule
> only when a device interrupt is active.
Agreed.

I will get back when I have done some experiments with these ideas. 
Thanks for the replies!
-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-24 12:51                                           ` David Acker
@ 2007-05-24 14:25                                             ` Milton Miller
  0 siblings, 0 replies; 51+ messages in thread
From: Milton Miller @ 2007-05-24 14:25 UTC (permalink / raw)
  To: David Acker
  Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, Scott Feldman, netdev

On May 24, 2007, at 7:51 AM, David Acker wrote:
> Milton Miller wrote:
>> Comments?  Questions?
> This sounds pretty reasonable.  I will take a stab at coding this up 
> today; I always think better looking at code.

Thanks.

By the way, find_mark_el should probably get passed
the old fill point.   The EL it will be clearing will
either be on that rx or the previous one (or there is
none for the initial alloc_list).

>> Do we need to continue with the stop-before-last plan?  As
>> long as we set size to 0 with EL, we shoud be able to change
>> the link, sync, set size 0, sync, and then set size.
> Perhaps not.  I can take a try at coding it without it.  It would 
> certainly make the driver easier to follow.
>
>> We still need the "advance at least 2 if not stopped" check when
>> deciding to move the EL.  This would break if the hardware
>> in the dma path can read the multiple cache lines of the
>> RFD out of order, so it may not be safe (if some pci host
>> decided to prefetch data, and got the second line ahead of
>> the first and didn't discard prefetch until pci bus
>> disconnect).  Actually, I am afraid I know hardware that
>> might do that.
> Hmm, me too.
>

I think we should be conservative here, and keep the stop
before the last and modify the last that we currently have.
I know the above prefetch hardware exists on a system that
can plug e100 cards.  While prefetching multiple lines is
based on the pci fetch code (read vs read line vs read
multiple), its just too close to failing.

The reason this behavior won't break us with the two packet
version is we always have written the unaligned (and hence
not atomic) write of the link field before we even expose
the RFD to the hardware.   Since data is aligned to two
bytes, the write to the EL bit and the write to length will
be atomic.  In addition, if the hardware reorders the loads
and gets length with EL, it will just RNR after filling
in the packet and we will handle it just fine.

Perhaps we should comment that we rely on 2 byte alignment,
in case some crazy architecture sets ETH_HDR_ALIGN to an
odd value.  (It's arch dependent to allow trade offs of
partial cache line writes from IO vs processing unaligned
TCP and IP headers by the cpu).  I wouldn't do BUG_ON
but wouldn't object to BUILD_BUG_ON if someone thinks
it's needed.

The difference is a additional pointer follow in the new
find_mark_el() from the new fill point to the mark point
(and the availability of one more RFD for a packet).
Either way we need to check its ->prev for being the
being current EL when hardware is not stopped.

milton

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-24 11:21                                         ` Milton Miller
  2007-05-24 12:51                                           ` David Acker
@ 2007-05-29 15:58                                           ` David Acker
  2007-05-30  8:26                                             ` Milton Miller
  1 sibling, 1 reply; 51+ messages in thread
From: David Acker @ 2007-05-29 15:58 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, e1000-devel, netdev, Jesse Brandeburg, Kok, Auke,
	Jeff Kirsher, Scott Feldman, John Ronciak

Ok, I finally got some time to code this out and study it and Ihave some
questions.

Milton Miller wrote:
>> We add two flags to struct rx:  one says this packet is EL, and one says
>> it is or was size 0.   We create a function, find_mark_el(struct nic,
>> is_running) that is called after initial alloc and/or after refilling
>> skb list.   In find_mark_el, we start with rx_to_use (lets rename this
>> rx_to_fill), and go back two entries, call this rx_to_mark.   If
>> is_running and rx_to_mark->prev->el then just return, leave EL alone.
So if we start clean and then one packet completes, we can not clear the
old marked entry?  We must then add a per nic pointer to the current
marked entry so that when the next packet completes, we can avoid
scanning for it.  Why do we need to check this again?

>> Otherwise, set EL and size to 0, set the two flags in the rx struct,
>> sync the RFD, then search for the current EL (we could save the pointer,
>> but its always the odl rx_to_use (fill) or its ->prev).  Clear RFD->EL,
>> sync,  clear rx->el.  Set size non-zero, sync,  Leave the was_0 flag set
>> if is_running (clear it only if we know reciever is stopped).
>>
>> At this point, if the receiver was stopped we can restart it,.  We
>> should do so in the caller.  (We always restart at rx_to_clean).
>> Restart should ack the RNR status before issuing the ru_start command to
>> avoid a spurious RNR interrupt.
>>
>> In the receive loop, if RFD->C is not set, rx->was_0 is set and el
>> is not set, then we need to check rx->next->RFD->C bit (after
>> pci_sync_for_cpu).   If the next packet C bit is set then we consider
>> this skb as used, and just complete it with no data to the stack.
If the C-bit is not set, we can read the status to see if the RU is
running (we cleared the EL bit before it read it but it has not found
another packet yet) or not (it read the el-bit before we cleared it).
This read lets us avoid going through an enable interrupts, wait for the
RNR interrupt cycle.



>> Because find_mark_el will only advance EL if the receiver was stopped
>> or we had more than 1 packet added, we can guarantee that if packet
>> N has was_0 set, then packet N+1 will not have it set, so we have
>> bounded lookahead.
Is this meant to be 1 packet added at any given call or 1 packet added
since the last time we cleared?


>> This adds two flags to struct rx, but that is allocated as a single
>> array and the array size is filled out as forward and back pointers.
>> Rx clean only has to look at was_0 on the last packet when it
>> decides C is not set, and only if both are set does it peek at the
>> next packet.  In other words, we shouldn't worry about the added
>> flags making it a non-power-of-2 size.
>>
>> By setting size != 0 after we have modified all other fields, the
>> device will only write to this packet if we are done writing.  If
>> we loose the race and only partially complete, it will just go on
>> to the next packet and we find it.  If we totally loose, then the
>> receiver will go RNR and we can reclaim all the buffer space we
>> have allocated.
>>
>> Comments?  Questions?
>>


Say we have an 8 buffer receive pool (0-7) at start.  rx[6] is marked.

hardware consumes rx[0]
software sees one rx complete without an s-bit set.
software notes that rx[6] is marked
software allocates new buffer for rx[0]
software runs find_mark_el with rx_to_use == rx[1], rx_to_mark == rx[7], 
is_running == true, marked_rx == rx[6]
find_mark_el finds rx_to_mark->prev->el set and is_running true, and 
returns.

hardware consumes rx[1]
software sees one rx complete without an s-bit set.
software notes that rx[6] is still marked
software allocates new buffer for rx[1]
software runs find_mark_el with rx_to_use == rx[2], rx_to_mark == rx[0], 
is_running == true, marked_rx == rx[6]
find_mark_el does not find rx_to_mark->prev->el set so continues
find_mark_el sets rx[0]->rfd->el rx[0]->rfd->size = 0, rx[0]->el, 
rx[0]->size0;
find_mark_el clears rx[6]->rfd->el, syncs, sets rx[6]->rfd->size, syncs, 
clears rx[6]->rfd->el but leaves rx[6]->rfd->size0 set.

Is this the correct flow?

-Ack


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-29 15:58                                           ` David Acker
@ 2007-05-30  8:26                                             ` Milton Miller
  2007-06-01 20:45                                               ` David Acker
  0 siblings, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-05-30  8:26 UTC (permalink / raw)
  To: David Acker
  Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, Scott Feldman, netdev

On May 29, 2007, at 10:58 AM, David Acker wrote:

> Ok, I finally got some time to code this out and study it and Ihave 
> some
> questions.
>
> Milton Miller wrote:
>>> We add two flags to struct rx:  one says this packet is EL, and one 
>>> says
>>> it is or was size 0.   We create a function, find_mark_el(struct nic,
>>> is_running) that is called after initial alloc and/or after refilling
>>> skb list.   In find_mark_el, we start with rx_to_use (lets rename 
>>> this
>>> rx_to_fill), and go back two entries, call this rx_to_mark.   If
>>> is_running and rx_to_mark->prev->el then just return, leave EL alone.
> So if we start clean and then one packet completes, we can not clear 
> the
> old marked entry?  We must then add a per nic pointer to the current
> marked entry so that when the next packet completes, we can avoid
> scanning for it.  Why do we need to check this again?

I don't think we need a per-nic pointer, we just need to check if our
to_mark->prev has the "This is EL" flag set.  the to_mark is the
to_fill->prev->prev  (I think -- to_use in the code, which is the
next to be allocated, not last allocated?  I haven't studied that 
detail).

We need this to avoid our refill code running in lock step with the
hardware and exposing two packets in a row with a size that it reads
as 0.

This means we never need to scan ahead more than one packet.

>
>>> Otherwise, set EL and size to 0, set the two flags in the rx struct,
>>> sync the RFD, then search for the current EL (we could save the 
>>> pointer,
>>> but its always the odl rx_to_use (fill) or its ->prev).  Clear 
>>> RFD->EL,
>>> sync,  clear rx->el.  Set size non-zero, sync,  Leave the was_0 flag 
>>> set
>>> if is_running (clear it only if we know reciever is stopped).
>>>
>>> At this point, if the receiver was stopped we can restart it,.  We
>>> should do so in the caller.  (We always restart at rx_to_clean).
>>> Restart should ack the RNR status before issuing the ru_start 
>>> command to
>>> avoid a spurious RNR interrupt.
>>>
>>> In the receive loop, if RFD->C is not set, rx->was_0 is set and el
>>> is not set, then we need to check rx->next->RFD->C bit (after
>>> pci_sync_for_cpu).   If the next packet C bit is set then we consider
>>> this skb as used, and just complete it with no data to the stack.
> If the C-bit is not set, we can read the status to see if the RU is
> running (we cleared the EL bit before it read it but it has not found
> another packet yet) or not (it read the el-bit before we cleared it).
> This read lets us avoid going through an enable interrupts, wait for 
> the
> RNR interrupt cycle.

Yes agreed.  But I was pointing out if we never had set size to 0
on this packet (ie rx->was_0 is clear) we dont' even need to poll
the device (saves a slow mmio load) or check the next packet (saves
a pci_sync_for_cpu ie a cache line invalidate, and check for c bit).

>>> Because find_mark_el will only advance EL if the receiver was stopped
>>> or we had more than 1 packet added, we can guarantee that if packet
>>> N has was_0 set, then packet N+1 will not have it set, so we have
>>> bounded lookahead.
> Is this meant to be 1 packet added at any given call or 1 packet added
> since the last time we cleared?

This would be we allocated one packet since the last time we moved EL.

>>> This adds two flags to struct rx, but that is allocated as a single
>>> array and the array size is filled out as forward and back pointers.
>>> Rx clean only has to look at was_0 on the last packet when it
>>> decides C is not set, and only if both are set does it peek at the
>>> next packet.  In other words, we shouldn't worry about the added
>>> flags making it a non-power-of-2 size.
>>>
>>> By setting size != 0 after we have modified all other fields, the
>>> device will only write to this packet if we are done writing.  If
>>> we loose the race and only partially complete, it will just go on
>>> to the next packet and we find it.  If we totally loose, then the
>>> receiver will go RNR and we can reclaim all the buffer space we
>>> have allocated.
>>>
>>> Comments?  Questions?
>>>
>
>
> Say we have an 8 buffer receive pool (0-7) at start.  rx[6] is marked.
>
> hardware consumes rx[0]
> software sees one rx complete without an s-bit set.

wihtout el bit set?

software sees one packet complete, checks next and find was_el is not 
set so it assumes it is still pending and the device is running.

> software notes that rx[6] is marked
> software allocates new buffer for rx[0]
> software runs find_mark_el with rx_to_use == rx[1], rx_to_mark == 
> rx[7], is_running == true, marked_rx == rx[6]
> find_mark_el finds rx_to_mark->prev->el set and is_running true, and 
> returns.
>
> hardware consumes rx[1]
> software sees one rx complete without an s-bit set.

software sees one packet complete, checks next and find was_0 is not 
set so it assumes it is still pending and the device is running.

> software notes that rx[6] is still marked
> software allocates new buffer for rx[1]
> software runs find_mark_el with rx_to_use == rx[2], rx_to_mark == 
> rx[0], is_running == true, marked_rx == rx[6]
> find_mark_el does not find rx_to_mark->prev->el set so continues
> find_mark_el sets rx[0]->rfd->el rx[0]->rfd->size = 0, rx[0]->el, 
> rx[0]->size0;
> find_mark_el clears rx[6]->rfd->el, syncs, sets rx[6]->rfd->size, 
> syncs, clears rx[6]->rfd->el but leaves rx[6]->rfd->size0 set.
>
> Is this the correct flow?
>

Without thinking about the ordering of these updates, yes I think that 
is the right flow.

although the "rx[6] is still marked" is not really a step, but notices
that [0]->prev === [7]->el is not set.

actually, lets do second part again with what I was thinking:
>
> hardware consumes rx[1]
> software sees one rx complete without an s-bit set.

software notes that rx_to_use (fill) is rx[1]

> software allocates new buffer for rx[1], and moves rx_to_use to rx[2]

software runs with find_mark_el with rx_to_use = rx[2], rx_was_to_use = 
rx[1], is_running = true

find_mark_el walks backward and calculates rx_to_mark as rx[0].   it 
calculates rx_to_unmark as rx_was_to_use->prev == rx[7] initially, but 
since ->el is not set it backs up once more to rx[6].  It then compares 
rx_to_unmark is != rx_to_mark->prev and continues as above.   Note that 
if is_running == true we would also continue; this should be the second 
clause in the || obviously.

Your logic works, this adds some conditional branching but saves a 
pointer, not sure which is better.  Mine can be used for initializing 
without special casing since it will just calculate rx_to_unmark as 
rx[n-2] and rx_to_mark as rx[n-2] != rx[n-2]->prev; unmarking a never 
marked still works, where as for yours the "was marked" must be 
explicitly set to something (hmm, rxs = rx[0] might work for that 
initial value until we mark -2??)

again, the compare of rx->el instead of rx->rfd->el is to save cache 
traffic to the rfd under io.  The rx->was_0 is required, so the el flag 
is free.

milton

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-30  8:26                                             ` Milton Miller
@ 2007-06-01 20:45                                               ` David Acker
  2007-06-01 21:13                                                 ` Jeff Garzik
  2007-06-04  9:03                                                 ` Milton Miller
  0 siblings, 2 replies; 51+ messages in thread
From: David Acker @ 2007-06-01 20:45 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, Kok, Auke, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, Scott Feldman, netdev

Milton Miller wrote:
> Your logic works, this adds some conditional branching but saves a 
> pointer, not sure which is better.  Mine can be used for initializing 
> without special casing since it will just calculate rx_to_unmark as 
> rx[n-2] and rx_to_mark as rx[n-2] != rx[n-2]->prev; unmarking a never 
> marked still works, where as for yours the "was marked" must be 
> explicitly set to something (hmm, rxs = rx[0] might work for that 
> initial value until we mark -2??)
> 
> again, the compare of rx->el instead of rx->rfd->el is to save cache 
> traffic to the rfd under io.  The rx->was_0 is required, so the el flag 
> is free.
> 

Ok, I took a stab at coding and testing these ideas.  Below is a patch against 2.6.22-rc3.
Let me know what you think.  Testing shows that we can hit any of the following scenarios:

Find a buffer not complete with rx->el and rx->s0 set. I read back the status and see if
the receiver is still running.
Find a buffer not complete with rx->el not set and rx->s0 set.  I check the next buffer
and if it is complete, I free the skb and return 0.  If the next buffer is not
complete, I read the receiver's status to see if he is still running.
Find a buffer that is complete with rx->el not set and rx->s0 set.  It appears that hardware
can read the rfd's el-bit, then software can clear the rfd el-bit and set the rfd size, and
then hardware can come in and read the size.  I am reading the status back, although
I don't think that I have to in this instance.

I am testing a version of this code patched against 2.6.18.4 on my PXA 255 based system.  I will let 
you all know how it goes.

On the ARM, their is a race condition between software allocating a new receive
buffer and hardware writing into a buffer.  The two race on touching the last
Receive Frame Descriptor (RFD).  It has its el-bit set and its next link equal
to 0.  When hardware encounters this buffer it attempts to write data to it
and then update Status Word bits and Actual Count in the RFD.  At the same time
software may try to clear the el-bit and set the link address to a new buffer.
Since the entire RFD is once cache-line, the two write operations can
collide.  This can lead to the receive unit stalling or freed receive buffers
getting written to.

The fix is to set the el-bit on and the size to 0 on the next to last buffer
in the chain.  When the hardware encounters this buffer it stops and does
not write to it at all.  The hardware issues an RNR interrupt with the
receive unit in the No Resources state.  When software allocates buffers,
it can update the tail of the list when either it knows the hardware
has stopped or the previous to the new one to mark marked.
Once it has a new next to last buffer prepared, it can clear the el-bit
and set the size on the previous one.  The race on this buffer is safe
since the link already points to a valid next buffer.  We keep flags
in our software descriptor to note if the el bit is set and if the size
was 0.  When we clear the RFD's el bit and set its size, we also clear
the el flag but we leave the size was 0 bit set.  This was we can find
this buffer again later.

If the hardware sees the el-bit cleared without the size set, it will
move on to the next buffer and skip this one.  If it sees
the size set but the el-bit still set, it will complete that buffer
and then RNR interrupt and wait.


Signed-off-by: David Acker <dacker@roinet.com>

---

--- linux-2.6.22-rc3/drivers/net/e100.c.orig	2007-06-01 16:13:16.000000000 -0400
+++ linux-2.6.22-rc3/drivers/net/e100.c	2007-06-01 16:20:36.000000000 -0400
@@ -281,10 +281,17 @@ struct csr {
  };

  enum scb_status {
+	rus_no_res       = 0x08,
  	rus_ready        = 0x10,
  	rus_mask         = 0x3C,
  };

+enum ru_state  {
+	ru_suspended = 0,
+	ru_running	 = 1,
+	ru_uninitialized = -1,
+};
+
  enum scb_stat_ack {
  	stat_ack_not_ours    = 0x00,
  	stat_ack_sw_gen      = 0x04,
@@ -395,10 +402,16 @@ struct rfd {
  	u16 size;
  };

+enum rx_flags {
+	rx_el = 0x01,
+	rx_s0 = 0x02,
+};
+
  struct rx {
  	struct rx *next, *prev;
  	struct sk_buff *skb;
  	dma_addr_t dma_addr;
+	u8 flags;
  };

  #if defined(__BIG_ENDIAN_BITFIELD)
@@ -526,6 +539,7 @@ struct nic {
  	struct rx *rx_to_use;
  	struct rx *rx_to_clean;
  	struct rfd blank_rfd;
+	enum ru_state ru_running;

  	spinlock_t cb_lock			____cacheline_aligned;
  	spinlock_t cmd_lock;
@@ -947,7 +961,7 @@ static void e100_get_defaults(struct nic
  		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));

  	/* Template for a freshly allocated RFD */
-	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
+	nic->blank_rfd.command = 0;
  	nic->blank_rfd.rbd = 0xFFFFFFFF;
  	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);

@@ -1742,11 +1756,55 @@ static int e100_alloc_cbs(struct nic *ni
  	return 0;
  }

-static inline void e100_start_receiver(struct nic *nic)
+static void e100_find_mark_el(struct nic *nic, struct rx *marked_rx, int is_running)
  {
-	/* Start if RFA is non-NULL */
-	if(nic->rx_to_clean->skb)
-		e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
+	struct rx *rx_to_mark = nic->rx_to_use->prev->prev;
+	struct rfd *rfd_to_mark;
+
+	if(is_running && rx_to_mark->prev->flags & rx_el)
+		return;
+
+	if(marked_rx == rx_to_mark)
+		return;
+
+	rfd_to_mark = (struct rfd *) rx_to_mark->skb->data;
+	rfd_to_mark->command |= cpu_to_le16(cb_el);
+	rfd_to_mark->size = 0;
+	pci_dma_sync_single_for_cpu(nic->pdev, rx_to_mark->dma_addr,
+		sizeof(struct rfd), PCI_DMA_TODEVICE);
+	rx_to_mark->flags |= (rx_el | rx_s0);
+
+	if(!marked_rx)
+		return;
+
+	rfd_to_mark = (struct rfd *) marked_rx->skb->data;
+	rfd_to_mark->command &= ~cpu_to_le16(cb_el);
+	pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr,
+		sizeof(struct rfd), PCI_DMA_TODEVICE);
+
+	rfd_to_mark->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+	pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr,
+		sizeof(struct rfd), PCI_DMA_TODEVICE);
+
+	if(is_running)
+		marked_rx->flags &= ~rx_el;
+	else
+		marked_rx->flags &= ~(rx_el | rx_s0);
+}
+
+static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
+{
+	if(!nic->rxs) return;
+	if(ru_suspended != nic->ru_running) return;
+
+	/* handle init time starts */
+	if(!rx) rx = nic->rxs;
+
+	/* (Re)start RU if suspended or idle and RFA is non-NULL */
+	if(rx->skb) {
+		e100_exec_cmd(nic, ruc_start, rx->dma_addr);
+		nic->ru_running = ru_running;
+	}
  }

  #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
@@ -1774,8 +1832,6 @@ static int e100_rx_alloc_skb(struct nic
  		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
  		put_unaligned(cpu_to_le32(rx->dma_addr),
  			(u32 *)&prev_rfd->link);
-		wmb();
-		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
  		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
  			sizeof(struct rfd), PCI_DMA_TODEVICE);
  	}
@@ -1789,6 +1845,7 @@ static int e100_rx_indicate(struct nic *
  	struct sk_buff *skb = rx->skb;
  	struct rfd *rfd = (struct rfd *)skb->data;
  	u16 rfd_status, actual_size;
+	u8 status;

  	if(unlikely(work_done && *work_done >= work_to_do))
  		return -EAGAIN;
@@ -1800,9 +1857,46 @@ static int e100_rx_indicate(struct nic *

  	DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);

-	/* If data isn't ready, nothing to indicate */
-	if(unlikely(!(rfd_status & cb_complete)))
+	/* If data isn't ready, nothing to indicate
+	 * If both the el and s0 rx flags are set, we have hit the marked
+	 * buffer but we don't know if hardware has seen it so we check
+	 * the status.
+	 * If only the s0 flag is set, we check the next buffer.
+	 * If it is complete, we know that hardware saw the rfd el bit
+	 * get cleared but did not see the rfd size get set so it
+	 * skipped this buffer.  We just return 0 and look at the
+	 * next buffer.
+	 * If only the s0 flag is set but the next buffer is
+	 * not complete, we cleared the el flag as hardware
+	 * hit this buffer. */
+	if(unlikely(!(rfd_status & cb_complete))) {
+		u8 maskedFlags = rx->flags & (rx_el | rx_s0);
+		if(maskedFlags == (rx_el | rx_s0)) {
+			status = ioread8(&nic->csr->scb.status);
+			if(status & rus_no_res) {
+				nic->ru_running = ru_suspended;
+			}
+		} else if(maskedFlags == rx_s0) {
+			struct rx *next_rx = rx->next;
+			struct rfd *next_rfd = (struct rfd *)next_rx->skb->data;
+			pci_dma_sync_single_for_cpu(nic->pdev, next_rx->dma_addr,
+				sizeof(struct rfd), PCI_DMA_FROMDEVICE);
+			if(next_rfd->status & cpu_to_le16(cb_complete)) {
+				pci_unmap_single(nic->pdev, rx->dma_addr,
+					RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
+				dev_kfree_skb_any(skb);
+				rx->skb = NULL;
+				rx->flags &= ~rx_s0;
+				return 0;
+			} else {
+				status = ioread8(&nic->csr->scb.status);
+				if(status & rus_no_res) {
+					nic->ru_running = ru_suspended;
+				}
+			}
+		}
  		return -ENODATA;
+	}

  	/* Get actual data size */
  	actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
@@ -1813,6 +1907,15 @@ static int e100_rx_indicate(struct nic *
  	pci_unmap_single(nic->pdev, rx->dma_addr,
  		RFD_BUF_LEN, PCI_DMA_FROMDEVICE);

+	/* This happens when hardward sees the rfd el flag set
+	 * but then sees the rfd size set as well */
+	if(le16_to_cpu(rfd->command) & cb_el) {
+		status = ioread8(&nic->csr->scb.status);
+		if(status & rus_no_res) {
+			nic->ru_running = ru_suspended;
+		}
+	}
+
  	/* Pull off the RFD and put the actual data (minus eth hdr) */
  	skb_reserve(skb, sizeof(struct rfd));
  	skb_put(skb, actual_size);
@@ -1842,19 +1945,46 @@ static int e100_rx_indicate(struct nic *
  static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
  	unsigned int work_to_do)
  {
-	struct rx *rx;
+	struct rx *rx, *marked_rx;
+	int restart_required = 0;
+	int err = 0;

  	/* Indicate newly arrived packets */
  	for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
-		if(e100_rx_indicate(nic, rx, work_done, work_to_do))
-			break; /* No more to clean */
+		err = e100_rx_indicate(nic, rx, work_done, work_to_do);
+		/* Hit quota or no more to clean */
+		if(-EAGAIN == err || -ENODATA == err)
+			break;
  	}

+	/* On EAGAIN, hit quota so have more work to do, restart once
+	 * cleanup is complete.
+	 * Else, are we already rnr? then pay attention!!! this ensures that
+	 * the state machine progression never allows a start with a
+	 * partially cleaned list, avoiding a race between hardware
+	 * and rx_to_clean when in NAPI mode */
+	if(-EAGAIN != err && ru_suspended == nic->ru_running)
+		restart_required = 1;
+
+	marked_rx = nic->rx_to_use->prev->prev;
+	while(!(marked_rx->flags & rx_el))
+		marked_rx = marked_rx->prev;
+
  	/* Alloc new skbs to refill list */
  	for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
  		if(unlikely(e100_rx_alloc_skb(nic, rx)))
  			break; /* Better luck next time (see watchdog) */
  	}
+
+	e100_find_mark_el(nic, marked_rx, !restart_required);
+
+	if(restart_required) {
+		// ack the rnr?
+		iowrite8(stat_ack_rnr, &nic->csr->scb.stat_ack);
+		e100_start_receiver(nic, nic->rx_to_clean);
+		if(work_done)
+			(*work_done)++;
+	}
  }

  static void e100_rx_clean_list(struct nic *nic)
@@ -1862,6 +1992,8 @@ static void e100_rx_clean_list(struct ni
  	struct rx *rx;
  	unsigned int i, count = nic->params.rfds.count;

+	nic->ru_running = ru_uninitialized;
+
  	if(nic->rxs) {
  		for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
  			if(rx->skb) {
@@ -1883,6 +2015,7 @@ static int e100_rx_alloc_list(struct nic
  	unsigned int i, count = nic->params.rfds.count;

  	nic->rx_to_use = nic->rx_to_clean = NULL;
+	nic->ru_running = ru_uninitialized;

  	if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
  		return -ENOMEM;
@@ -1897,6 +2030,9 @@ static int e100_rx_alloc_list(struct nic
  	}

  	nic->rx_to_use = nic->rx_to_clean = nic->rxs;
+	nic->ru_running = ru_suspended;
+
+	e100_find_mark_el(nic, NULL, 0);

  	return 0;
  }
@@ -1916,6 +2052,10 @@ static irqreturn_t e100_intr(int irq, vo
  	/* Ack interrupt(s) */
  	iowrite8(stat_ack, &nic->csr->scb.stat_ack);

+	/* We hit Receive No Resource (RNR); restart RU after cleaning */
+	if(stat_ack & stat_ack_rnr)
+		nic->ru_running = ru_suspended;
+
  	if(likely(netif_rx_schedule_prep(netdev))) {
  		e100_disable_irq(nic);
  		__netif_rx_schedule(netdev);
@@ -2007,7 +2147,7 @@ static int e100_up(struct nic *nic)
  	if((err = e100_hw_init(nic)))
  		goto err_clean_cbs;
  	e100_set_multicast_list(nic->netdev);
-	e100_start_receiver(nic);
+	e100_start_receiver(nic, NULL);
  	mod_timer(&nic->watchdog, jiffies);
  	if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
  		nic->netdev->name, nic->netdev)))
@@ -2088,7 +2228,7 @@ static int e100_loopback_test(struct nic
  		mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
  			BMCR_LOOPBACK);

-	e100_start_receiver(nic);
+	e100_start_receiver(nic, NULL);

  	if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
  		err = -ENOMEM;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-01 20:45                                               ` David Acker
@ 2007-06-01 21:13                                                 ` Jeff Garzik
  2007-06-01 22:13                                                   ` Kok, Auke
  2007-06-04  9:03                                                 ` Milton Miller
  1 sibling, 1 reply; 51+ messages in thread
From: Jeff Garzik @ 2007-06-01 21:13 UTC (permalink / raw)
  To: David Acker
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
	Scott Feldman, John Ronciak, Jeff Kirsher

David Acker wrote:
> Milton Miller wrote:
>> Your logic works, this adds some conditional branching but saves a 
>> pointer, not sure which is better.  Mine can be used for initializing 
>> without special casing since it will just calculate rx_to_unmark as 
>> rx[n-2] and rx_to_mark as rx[n-2] != rx[n-2]->prev; unmarking a never 
>> marked still works, where as for yours the "was marked" must be 
>> explicitly set to something (hmm, rxs = rx[0] might work for that 
>> initial value until we mark -2??)
>>
>> again, the compare of rx->el instead of rx->rfd->el is to save cache 
>> traffic to the rfd under io.  The rx->was_0 is required, so the el 
>> flag is free.
>>
> 
> Ok, I took a stab at coding and testing these ideas.  Below is a patch 
> against 2.6.22-rc3.
> Let me know what you think.  Testing shows that we can hit any of the 
> following scenarios:
> 
> Find a buffer not complete with rx->el and rx->s0 set. I read back the 
> status and see if
> the receiver is still running.
> Find a buffer not complete with rx->el not set and rx->s0 set.  I check 
> the next buffer
> and if it is complete, I free the skb and return 0.  If the next buffer 
> is not
> complete, I read the receiver's status to see if he is still running.
> Find a buffer that is complete with rx->el not set and rx->s0 set.  It 
> appears that hardware
> can read the rfd's el-bit, then software can clear the rfd el-bit and 
> set the rfd size, and
> then hardware can come in and read the size.  I am reading the status 
> back, although
> I don't think that I have to in this instance.
> 
> I am testing a version of this code patched against 2.6.18.4 on my PXA 
> 255 based system.  I will let you all know how it goes.
> 
> On the ARM, their is a race condition between software allocating a new 
> receive
> buffer and hardware writing into a buffer.  The two race on touching the 
> last
> Receive Frame Descriptor (RFD).  It has its el-bit set and its next link 
> equal
> to 0.  When hardware encounters this buffer it attempts to write data to it
> and then update Status Word bits and Actual Count in the RFD.  At the 
> same time
> software may try to clear the el-bit and set the link address to a new 
> buffer.
> Since the entire RFD is once cache-line, the two write operations can
> collide.  This can lead to the receive unit stalling or freed receive 
> buffers
> getting written to.
> 
> The fix is to set the el-bit on and the size to 0 on the next to last 
> buffer
> in the chain.  When the hardware encounters this buffer it stops and does
> not write to it at all.  The hardware issues an RNR interrupt with the
> receive unit in the No Resources state.  When software allocates buffers,
> it can update the tail of the list when either it knows the hardware
> has stopped or the previous to the new one to mark marked.
> Once it has a new next to last buffer prepared, it can clear the el-bit
> and set the size on the previous one.  The race on this buffer is safe
> since the link already points to a valid next buffer.  We keep flags
> in our software descriptor to note if the el bit is set and if the size
> was 0.  When we clear the RFD's el bit and set its size, we also clear
> the el flag but we leave the size was 0 bit set.  This was we can find
> this buffer again later.
> 
> If the hardware sees the el-bit cleared without the size set, it will
> move on to the next buffer and skip this one.  If it sees
> the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.
> 
> 
> Signed-off-by: David Acker <dacker@roinet.com>

That seems to vaguely match my memory of what eepro100 was doing (or 
trying to do).

I _really_ appreciate you working on this problem.  Getting e100 driver 
stable for the long term, and ditching eepro100, is a big hurdle to 
cross.  Getting this right is really one of the last steps.

The patch looks OK at quick glance.

Thanks,

	Jeff




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-01 21:13                                                 ` Jeff Garzik
@ 2007-06-01 22:13                                                   ` Kok, Auke
  0 siblings, 0 replies; 51+ messages in thread
From: Kok, Auke @ 2007-06-01 22:13 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David Acker, Milton Miller, Kok, Auke, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, Scott Feldman, netdev

Jeff Garzik wrote:
> David Acker wrote:
>> Milton Miller wrote:

<snip>

>> the el flag but we leave the size was 0 bit set.  This was we can find
>> this buffer again later.
>>
>> If the hardware sees the el-bit cleared without the size set, it will
>> move on to the next buffer and skip this one.  If it sees
>> the size set but the el-bit still set, it will complete that buffer
>> and then RNR interrupt and wait.
>>
>>
>> Signed-off-by: David Acker <dacker@roinet.com>
> 
> That seems to vaguely match my memory of what eepro100 was doing (or 
> trying to do).
> 
> I _really_ appreciate you working on this problem.  Getting e100 driver 
> stable for the long term, and ditching eepro100, is a big hurdle to 
> cross.  Getting this right is really one of the last steps.

yes, absolutely agreed. I'm very pleased with the attention and hope that is 
clear to everyone.

> The patch looks OK at quick glance.

Besides copying the style errors, it looks OK as well. I will attempt to 
allocate some testing time again early next week on a small library of e100 nics 
over here. Mostly x86, but still useful to spot obvious mistakes.


Auke

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-01 20:45                                               ` David Acker
  2007-06-01 21:13                                                 ` Jeff Garzik
@ 2007-06-04  9:03                                                 ` Milton Miller
  2007-06-05 13:34                                                   ` David Acker
  1 sibling, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-06-04  9:03 UTC (permalink / raw)
  To: David Acker
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher, Jeff Garzik


On Jun 1, 2007, at 3:45 PM, David Acker wrote:
> Ok, I took a stab at coding and testing these ideas.  Below is a patch 
> against 2.6.22-rc3.
> Let me know what you think.

I think you got most of the ideas.   As Auke noted, your coding style 
is showing again.   And your mailer again munged whitespace (fixed by 
s/^<space><space>/<space>/ s/^$/<space>/).

>   Testing shows that we can hit any of the following scenarios:
>
> Find a buffer not complete with rx->el and rx->s0 set.
> 	I read back the status and see if the receiver is still running.
>

This means the hardware reached what we think is end of list, good.

> Find a buffer not complete with rx->el not set and rx->s0 set.
> 	I check the next buffer and if it is complete, I free the skb and 
> return 0.	
This is the case when hardware read the descriptor between el being 
removed and size being set.


> If the next buffer is not  complete, I read the receiver's status to 
> see if he is still running.
>

The polling of still running will save us waiting for irq or next poll 
when hardware read el before we cleared it, at the cost of an ioread 
when there is simply no packet ready and we hit a fill mark.

> Find a buffer that is complete with rx->el not set and rx->s0 set.
> 	It appears that hardware can read the rfd's el-bit, then software can 
> clear the rfd el-bit and set the rfd size, and then hardware can come 
> in and read the size.

Yes, since the size is after the EL flag in the descriptor, this can 
happen since the pci read is not atomic.

> I am reading the status back, although I don't think that I have to in 
> this instance.

Actually, you are reading it when the rfd still has EL set.  Since the 
cpu will never encounter that case, the if condition is never 
satisfied.

How about creating a state unknown, for when we think we should check 
the device if its running.
If we are in this state and then encounter a received packet without s0 
set, we can set it back
to running.   We set it when we rx a packet with s0 set.

We then move both io_status reads to the caller.


> I am testing a version of this code patched against 2.6.18.4 on my PXA 
> 255 based system.  I will let you all know how it goes.

I'm assuming this is why the cleanup of the receiver start to always 
start on rx_to_clean got dropped again. :-)

Also, I would like a few sentences in the Driver Operation section IV 
Receive big comment.  Something like

In order to keep updates to the RFD link field from colliding with 
hardware writes to mark packets complete, we use the feature that 
hardware will not write to a size 0 descriptor and mark the previous 
packet as end-of-list (EL).   After updating the link, we remove EL and 
only then restore the size such that hardware may use the 
previous-to-end RFD.

at the end of the first paragraph, and insert software before "no 
locking is required" in the second.

>
> On the ARM, their is a race condition between software allocating a 
> new receive
> buffer and hardware writing into a buffer.  The two race on touching 
> the last
> Receive Frame Descriptor (RFD).  It has its el-bit set and its next 
> link equal
> to 0.  When hardware encounters this buffer it attempts to write data 
> to it
> and then update Status Word bits and Actual Count in the RFD.  At the 
> same time
> software may try to clear the el-bit and set the link address to a new 
> buffer.
> Since the entire RFD is once cache-line, the two write operations can
> collide.  This can lead to the receive unit stalling or freed receive 
> buffers
> getting written to.
>
> The fix is to set the el-bit on and the size to 0 on the next to last 
> buffer
> in the chain.  When the hardware encounters this buffer it stops and 
> does
> not write to it at all.  The hardware issues an RNR interrupt with the
> receive unit in the No Resources state.  When software allocates 
> buffers,
> it can update the tail of the list when either it knows the hardware
> has stopped or the previous to the new one to mark marked.
> Once it has a new next to last buffer prepared, it can clear the el-bit
> and set the size on the previous one.  The race on this buffer is safe
> since the link already points to a valid next buffer.  We keep flags
> in our software descriptor to note if the el bit is set and if the size
> was 0.  When we clear the RFD's el bit and set its size, we also clear
> the el flag but we leave the size was 0 bit set.  This was we can find
> this buffer again later.
>
> If the hardware sees the el-bit cleared without the size set, it will
> move on to the next buffer and skip this one.  If it sees
> the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.
>
>
> Signed-off-by: David Acker <dacker@roinet.com>
>
> ---
>
> --- linux-2.6.22-rc3/drivers/net/e100.c.orig	2007-06-01 
> 16:13:16.000000000 -0400
> +++ linux-2.6.22-rc3/drivers/net/e100.c	2007-06-01 16:20:36.000000000 
> -0400
> @@ -281,10 +281,17 @@ struct csr {
>  };
>
>  enum scb_status {
> +	rus_no_res       = 0x08,
>  	rus_ready        = 0x10,
>  	rus_mask         = 0x3C,
>  };
>
> +enum ru_state  {
> +	ru_suspended = 0,
> +	ru_running	 = 1,
> +	ru_uninitialized = -1,
> +};
> +

If we are going to have uninitialized, it should be 0.  But its only 
set (1) during teardown, and (2) during rx list allocation.  In the 
same function its set to suspended.

Also, lets call this stopped, not suspended.  Suspended is a specific 
hardware state we are not using.

See above for my comments about adding unknown or maybe_running.

>  enum scb_stat_ack {
>  	stat_ack_not_ours    = 0x00,
>  	stat_ack_sw_gen      = 0x04,
> @@ -395,10 +402,16 @@ struct rfd {
>  	u16 size;
>  };
>
> +enum rx_flags {
> +	rx_el = 0x01,
> +	rx_s0 = 0x02,
> +};
> +
>  struct rx {
>  	struct rx *next, *prev;
>  	struct sk_buff *skb;
>  	dma_addr_t dma_addr;
> +	u8 flags;
>  };
>
>  #if defined(__BIG_ENDIAN_BITFIELD)
> @@ -526,6 +539,7 @@ struct nic {
>  	struct rx *rx_to_use;
>  	struct rx *rx_to_clean;
>  	struct rfd blank_rfd;
> +	enum ru_state ru_running;
>
>  	spinlock_t cb_lock			____cacheline_aligned;
>  	spinlock_t cmd_lock;
> @@ -947,7 +961,7 @@ static void e100_get_defaults(struct nic
>  		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
>
>  	/* Template for a freshly allocated RFD */
> -	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
> +	nic->blank_rfd.command = 0;
>  	nic->blank_rfd.rbd = 0xFFFFFFFF;
>  	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>
> @@ -1742,11 +1756,55 @@ static int e100_alloc_cbs(struct nic *ni
>  	return 0;
>  }
>
> -static inline void e100_start_receiver(struct nic *nic)
> +static void e100_find_mark_el(struct nic *nic, struct rx *marked_rx, 
> int is_running)
>  {
> -	/* Start if RFA is non-NULL */
> -	if(nic->rx_to_clean->skb)
> -		e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
> +	struct rx *rx_to_mark = nic->rx_to_use->prev->prev;
> +	struct rfd *rfd_to_mark;

Let's just call this rfd or mark_rfd (since its used for both mark 
setting and clearing).

> +
> +	if(is_running && rx_to_mark->prev->flags & rx_el)
> +		return;

or just rx_to_mark == marked_rx (saves the load of flags).

> +
> +	if(marked_rx == rx_to_mark)
> +		return;
> +
> +	rfd_to_mark = (struct rfd *) rx_to_mark->skb->data;
> +	rfd_to_mark->command |= cpu_to_le16(cb_el);
> +	rfd_to_mark->size = 0;
> +	pci_dma_sync_single_for_cpu(nic->pdev, rx_to_mark->dma_addr,
> +		sizeof(struct rfd), PCI_DMA_TODEVICE);

for_device, BIDIRECTIONAL

> +	rx_to_mark->flags |= (rx_el | rx_s0);
> +
> +	if(!marked_rx)
> +		return;
> +
> +	rfd_to_mark = (struct rfd *) marked_rx->skb->data;
> +	rfd_to_mark->command &= ~cpu_to_le16(cb_el);
> +	pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr,
> +		sizeof(struct rfd), PCI_DMA_TODEVICE);

for_device BIDIRECTIONAL

> +
> +	rfd_to_mark->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
> +	pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr,
> +		sizeof(struct rfd), PCI_DMA_TODEVICE);
>
for_device BIDIRECTIONAL

> +
> +	if(is_running)
> +		marked_rx->flags &= ~rx_el;
> +	else
> +		marked_rx->flags &= ~(rx_el | rx_s0);
> +}
> +
> +static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
> +{
> +	if(!nic->rxs) return;
> +	if(ru_suspended != nic->ru_running) return;
>
Are these checks just paranoia?  Can we call poll leading to rx_clean 
without setting up the receive list?

> +
> +	/* handle init time starts */
> +	if(!rx) rx = nic->rxs;

Again, as the previous patch noted, we can drop rx as a parameter and 
always (re)start on rx_to_clean.

> +
> +	/* (Re)start RU if suspended or idle and RFA is non-NULL */
> +	if(rx->skb) {
> +		e100_exec_cmd(nic, ruc_start, rx->dma_addr);
> +		nic->ru_running = ru_running;
> +	}
>  }
>
>  #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
> @@ -1774,8 +1832,6 @@ static int e100_rx_alloc_skb(struct nic
>  		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
>  		put_unaligned(cpu_to_le32(rx->dma_addr),
>  			(u32 *)&prev_rfd->link);
> -		wmb();
> -		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
>  		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
>  			sizeof(struct rfd), PCI_DMA_TODEVICE);
>  	}
> @@ -1789,6 +1845,7 @@ static int e100_rx_indicate(struct nic *
>  	struct sk_buff *skb = rx->skb;
>  	struct rfd *rfd = (struct rfd *)skb->data;
>  	u16 rfd_status, actual_size;
> +	u8 status;
>
>  	if(unlikely(work_done && *work_done >= work_to_do))
>  		return -EAGAIN;
> @@ -1800,9 +1857,46 @@ static int e100_rx_indicate(struct nic *
>
>  	DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
>
> -	/* If data isn't ready, nothing to indicate */
> -	if(unlikely(!(rfd_status & cb_complete)))
> +	/* If data isn't ready, nothing to indicate
> +	 * If both the el and s0 rx flags are set, we have hit the marked
> +	 * buffer but we don't know if hardware has seen it so we check
> +	 * the status.
> +	 * If only the s0 flag is set, we check the next buffer.
> +	 * If it is complete, we know that hardware saw the rfd el bit
> +	 * get cleared but did not see the rfd size get set so it
> +	 * skipped this buffer.  We just return 0 and look at the
> +	 * next buffer.
> +	 * If only the s0 flag is set but the next buffer is
> +	 * not complete, we cleared the el flag as hardware
> +	 * hit this buffer. */
> +	if(unlikely(!(rfd_status & cb_complete))) {
> +		u8 maskedFlags = rx->flags & (rx_el | rx_s0);
> +		if(maskedFlags == (rx_el | rx_s0)) {
> +			status = ioread8(&nic->csr->scb.status);
> +			if(status & rus_no_res) {
> +				nic->ru_running = ru_suspended;
> +			}
> +		} else if(maskedFlags == rx_s0) {
> +			struct rx *next_rx = rx->next;
> +			struct rfd *next_rfd = (struct rfd *)next_rx->skb->data;
> +			pci_dma_sync_single_for_cpu(nic->pdev, next_rx->dma_addr,
> +				sizeof(struct rfd), PCI_DMA_FROMDEVICE);
> +			if(next_rfd->status & cpu_to_le16(cb_complete)) {
> +				pci_unmap_single(nic->pdev, rx->dma_addr,
> +					RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
> +				dev_kfree_skb_any(skb);
> +				rx->skb = NULL;
> +				rx->flags &= ~rx_s0;
> +				return 0;
> +			} else {
> +				status = ioread8(&nic->csr->scb.status);
> +				if(status & rus_no_res) {
> +					nic->ru_running = ru_suspended;
> +				}
> +			}
> +		}
>  		return -ENODATA;
> +	}
>
>  	/* Get actual data size */
>  	actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
> @@ -1813,6 +1907,15 @@ static int e100_rx_indicate(struct nic *
>  	pci_unmap_single(nic->pdev, rx->dma_addr,
>  		RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
>
> +	/* This happens when hardward sees the rfd el flag set
> +	 * but then sees the rfd size set as well */
> +	if(le16_to_cpu(rfd->command) & cb_el) {

The racing read of the dma device might see it, but our cpu will never 
see this since it wrote them in the other order.

how about if rx-flags & s0 ?
Also see my comments about ru_unknown.

Where do you clear s0 when we encounter a packet with rfd status 
complete?

Maybe we should just set flags to 0 when allocating the skb?

> +		status = ioread8(&nic->csr->scb.status);
> +		if(status & rus_no_res) {
> +			nic->ru_running = ru_suspended;
> +		}
> +	}
> +
>  	/* Pull off the RFD and put the actual data (minus eth hdr) */
>  	skb_reserve(skb, sizeof(struct rfd));
>  	skb_put(skb, actual_size);
> @@ -1842,19 +1945,46 @@ static int e100_rx_indicate(struct nic *
>  static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
>  	unsigned int work_to_do)
>  {
> -	struct rx *rx;
> +	struct rx *rx, *marked_rx;
> +	int restart_required = 0;
> +	int err = 0;
>
>  	/* Indicate newly arrived packets */
>  	for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = 
> rx->next) {
> -		if(e100_rx_indicate(nic, rx, work_done, work_to_do))
> -			break; /* No more to clean */
> +		err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> +		/* Hit quota or no more to clean */
> +		if(-EAGAIN == err || -ENODATA == err)
> +			break;
>  	}

My ru_state of maybe_running would cause us to read the status and 
resolve it to either running or suspended here (after the clean loop).
>
> +	/* On EAGAIN, hit quota so have more work to do, restart once
> +	 * cleanup is complete.
> +	 * Else, are we already rnr? then pay attention!!! this ensures that
> +	 * the state machine progression never allows a start with a
> +	 * partially cleaned list, avoiding a race between hardware
> +	 * and rx_to_clean when in NAPI mode */

It took me a long while to figure out what race you are referring to 
here.   At first I thought you were just moving a comment.  I am 
totally lost by "are we already rnr? then pay attention!!!".

This isn't a race, its more like "We don't restart the receiver when we 
stop processing due to NAPI quota because we don't yet know which rfd 
is the first that already contains data.  That is, rx_to_clean points 
to something that might be a used rfd instead of a known clean one."



> +	if(-EAGAIN != err && ru_suspended == nic->ru_running)
> +		restart_required = 1;
> +
> +	marked_rx = nic->rx_to_use->prev->prev;
> +	while(!(marked_rx->flags & rx_el))
> +		marked_rx = marked_rx->prev;
> +

Not while.  This should be if, followed by BUG_ON(!marked_rx->flags & 
rx_el).   We can remove the BUG_ON later, but it points to an error in 
our logic if its not one of the two places.

>  	/* Alloc new skbs to refill list */
>  	for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
>  		if(unlikely(e100_rx_alloc_skb(nic, rx)))
>  			break; /* Better luck next time (see watchdog) */
>  	}
> +
> +	e100_find_mark_el(nic, marked_rx, !restart_required);
>

We could pass the true ru_status based value here, but since we won't 
be restoring the receiver it doesn't matter.

> +
> +	if(restart_required) {
> +		// ack the rnr?
> +		iowrite8(stat_ack_rnr, &nic->csr->scb.stat_ack);
> +		e100_start_receiver(nic, nic->rx_to_clean);
> +		if(work_done)
> +			(*work_done)++;
>

Why is restarting the receiver work?   Actually receiving the packet is 
work.

> +	}
>  }
>
>  static void e100_rx_clean_list(struct nic *nic)
> @@ -1862,6 +1992,8 @@ static void e100_rx_clean_list(struct ni
>  	struct rx *rx;
>  	unsigned int i, count = nic->params.rfds.count;
>
> +	nic->ru_running = ru_uninitialized;
> +
>  	if(nic->rxs) {
>  		for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
>  			if(rx->skb) {
> @@ -1883,6 +2015,7 @@ static int e100_rx_alloc_list(struct nic
>  	unsigned int i, count = nic->params.rfds.count;
>
>  	nic->rx_to_use = nic->rx_to_clean = NULL;
> +	nic->ru_running = ru_uninitialized;

Here is the short duration of uninitialized, and its not the 0 case.
>
>  	if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
>  		return -ENOMEM;
> @@ -1897,6 +2030,9 @@ static int e100_rx_alloc_list(struct nic
>  	}
>
>  	nic->rx_to_use = nic->rx_to_clean = nic->rxs;
> +	nic->ru_running = ru_suspended;
> +
> +	e100_find_mark_el(nic, NULL, 0);
>
>  	return 0;
>  }
> @@ -1916,6 +2052,10 @@ static irqreturn_t e100_intr(int irq, vo
>  	/* Ack interrupt(s) */
>  	iowrite8(stat_ack, &nic->csr->scb.stat_ack);
>
> +	/* We hit Receive No Resource (RNR); restart RU after cleaning */
> +	if(stat_ack & stat_ack_rnr)
> +		nic->ru_running = ru_suspended;
> +
>  	if(likely(netif_rx_schedule_prep(netdev))) {
>  		e100_disable_irq(nic);
>  		__netif_rx_schedule(netdev);
> @@ -2007,7 +2147,7 @@ static int e100_up(struct nic *nic)
>  	if((err = e100_hw_init(nic)))
>  		goto err_clean_cbs;
>  	e100_set_multicast_list(nic->netdev);
> -	e100_start_receiver(nic);
> +	e100_start_receiver(nic, NULL);

(remove second arg to start_receiver, this diff will go away).

>  	mod_timer(&nic->watchdog, jiffies);
>  	if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
>  		nic->netdev->name, nic->netdev)))
> @@ -2088,7 +2228,7 @@ static int e100_loopback_test(struct nic
>  		mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
>  			BMCR_LOOPBACK);
>
> -	e100_start_receiver(nic);
> +	e100_start_receiver(nic, NULL);
>
(and this one).

>  	if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
>  		err = -ENOMEM;
>
>

Thanks for doing the work David.

milton


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-04  9:03                                                 ` Milton Miller
@ 2007-06-05 13:34                                                   ` David Acker
  2007-06-05 16:14                                                     ` Milton Miller
  2007-06-05 16:14                                                     ` Milton Miller
  0 siblings, 2 replies; 51+ messages in thread
From: David Acker @ 2007-06-05 13:34 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, Scott Feldman, e1000-devel, Jeff Kirsher,
	John Ronciak, Jesse Brandeburg, netdev, Kok, Auke

Milton Miller wrote:
> 
> On Jun 1, 2007, at 3:45 PM, David Acker wrote:
>> Ok, I took a stab at coding and testing these ideas.  Below is a patch 
>> against 2.6.22-rc3.
>> Let me know what you think.
> 
> I think you got most of the ideas.   As Auke noted, your coding style is 
> showing again.   And your mailer again munged whitespace (fixed by 
> s/^<space><space>/<space>/ s/^$/<space>/).
Sorry about the coding style.  I instinctively followed what was there 
instead of kernel coding convention.  I will look into how whitespace is 
getting screwed up.

>> Find a buffer that is complete with rx->el not set and rx->s0 set.
>>     It appears that hardware can read the rfd's el-bit, then software 
>> can clear the rfd el-bit and set the rfd size, and then hardware can 
>> come in and read the size.
> 
> Yes, since the size is after the EL flag in the descriptor, this can 
> happen since the pci read is not atomic.
> 
>> I am reading the status back, although I don't think that I have to in 
>> this instance.
> 
> Actually, you are reading it when the rfd still has EL set.  Since the 
> cpu will never encounter that case, the if condition is never satisfied.
In my tests, every time I found a completed rfd with the el-bit set, the 
receiver was in the out of resources state.


> How about creating a state unknown, for when we think we should check 
> the device if its running.
> If we are in this state and then encounter a received packet without s0 
> set, we can set it back
> to running.   We set it when we rx a packet with s0 set.
> 
> We then move both io_status reads to the caller.
I can look into that as I clean this up.


>> I am testing a version of this code patched against 2.6.18.4 on my PXA 
>> 255 based system.  I will let you all know how it goes.
The testing I did so far did well.  I will try to get some more going 
tonight, hopefully on a cleaned up patch.


> I'm assuming this is why the cleanup of the receiver start to always 
> start on rx_to_clean got dropped again. :-)
Yep.  I will get that in the next patch.


> Also, I would like a few sentences in the Driver Operation section IV 
> Receive big comment.  Something like
> 
> In order to keep updates to the RFD link field from colliding with 
> hardware writes to mark packets complete, we use the feature that 
> hardware will not write to a size 0 descriptor and mark the previous 
> packet as end-of-list (EL).   After updating the link, we remove EL and 
> only then restore the size such that hardware may use the 
> previous-to-end RFD.
> 
> at the end of the first paragraph, and insert software before "no 
> locking is required" in the second.
Sounds good to me.

I will see if I can get into a cleaned up patch today and get it out by 
tomorrow.  Thanks for dealing with me...I have been around kernel code 
for awhile but posting official patches to linux is new to me.
-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 13:34                                                   ` David Acker
@ 2007-06-05 16:14                                                     ` Milton Miller
  2007-08-27 17:34                                                       ` Kok, Auke
  2007-06-05 16:14                                                     ` Milton Miller
  1 sibling, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-06-05 16:14 UTC (permalink / raw)
  To: David Acker
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher, Jeff Garzik

On Jun 5, 2007, at 8:34 AM, David Acker wrote:
> Milton Miller wrote:
>> On Jun 1, 2007, at 3:45 PM, David Acker wrote:
>>> Ok, I took a stab at coding and testing these ideas.  Below is a 
>>> patch against 2.6.22-rc3.
>>> Let me know what you think.
>> I think you got most of the ideas.   As Auke noted, your coding style 
>> is showing again.   And your mailer again munged whitespace (fixed by 
>> s/^<space><space>/<space>/ s/^$/<space>/).
> Sorry about the coding style.  I instinctively followed what was there 
> instead of kernel coding convention.  I will look into how whitespace 
> is getting screwed up.

I have to watch my coding style too (I like to indent the closing 
brace).

At least the white space damage seems to be reversable.  More than I 
can say for this mailer.

>>> Find a buffer that is complete with rx->el not set and rx->s0 set.
>>>     It appears that hardware can read the rfd's el-bit, then 
>>> software can clear the rfd el-bit and set the rfd size, and then 
>>> hardware can come in and read the size.
>> Yes, since the size is after the EL flag in the descriptor, this can 
>> happen since the pci read is not atomic.
>>> I am reading the status back, although I don't think that I have to 
>>> in this instance.
>> Actually, you are reading it when the rfd still has EL set.  Since 
>> the cpu will never encounter that case, the if condition is never 
>> satisfied.
> In my tests, every time I found a completed rfd with the el-bit set, 
> the receiver was in the out of resources state.

Yes, if the EL was set, it would be a real hard race to find the 
completed packet with EL but not RNR.   I was trying to refer to where 
you find a completed packet and then check for EL in the RFD.  That is 
what I was claiming can not be observed by the cpu (unless the card 
writes the EL bit back, and not just the status u16).

If the unless ... above is true, then please put a comment that the 
device can write RFD->EL back to 1 if we raced.


>> How about creating a state unknown, for when we think we should check 
>> the device if its running.
>> If we are in this state and then encounter a received packet without 
>> s0 set, we can set it back
>> to running.   We set it when we rx a packet with s0 set.
>> We then move both io_status reads to the caller.
> I can look into that as I clean this up.
>
>
>>> I am testing a version of this code patched against 2.6.18.4 on my 
>>> PXA 255 based system.  I will let you all know how it goes.
> The testing I did so far did well.  I will try to get some more going 
> tonight, hopefully on a cleaned up patch.

Good to hear our expectiations match reality.

>
>
>> I'm assuming this is why the cleanup of the receiver start to always 
>> start on rx_to_clean got dropped again. :-)
> Yep.  I will get that in the next patch.

Ok.

>> Also, I would like a few sentences in the Driver Operation section IV 
>> Receive big comment.  Something like
>> In order to keep updates to the RFD link field from colliding with 
>> hardware writes to mark packets complete, we use the feature that 
>> hardware will not write to a size 0 descriptor and mark the previous 
>> packet as end-of-list (EL).   After updating the link, we remove EL 
>> and only then restore the size such that hardware may use the 
>> previous-to-end RFD.
>> at the end of the first paragraph, and insert software before "no 
>> locking is required" in the second.
> Sounds good to me.
>
> I will see if I can get into a cleaned up patch today and get it out 
> by tomorrow.  Thanks for dealing with me...I have been around kernel 
> code for awhile but posting official patches to linux is new to me.
> -Ack

I've just learned by watching the lists over the last several years.  
Well, and actually writing the odd patch here and there.

It occurs to me that I have been focusing on the code and not the 
changelog.   I'll send a seperate reply on that thread shortly.

One more thing I'll state here ... as per the perfect patch guidelines, 
it is preferred that the meta-discussion about the patch and its 
history go after the change log, seperated from it by a line of "--- " 
so that the patch application scripts can just extract the email 
subject as the title and through the firsst line of --- as the commit 
log.  (This saves some manual editing).

[1] http://kernelnewbies.org/UpstreamMerge

milton


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 16:14                                                     ` Milton Miller
@ 2007-08-27 17:34                                                       ` Kok, Auke
  2007-08-27 18:32                                                         ` David Acker
  0 siblings, 1 reply; 51+ messages in thread
From: Kok, Auke @ 2007-08-27 17:34 UTC (permalink / raw)
  To: Milton Miller, David Acker
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher, Jeff Garzik

Milton Miller wrote:
> On Jun 5, 2007, at 8:34 AM, David Acker wrote:

David, Milton,

This was the last communication on-topic for the proposed changes to fix e100 on 
ARM. We're holding our breath here waiting for more, and would love to hear that 
this issue and fixes hasn't died off.

Thanks,

Auke



>> Milton Miller wrote:
>>> On Jun 1, 2007, at 3:45 PM, David Acker wrote:
>>>> Ok, I took a stab at coding and testing these ideas.  Below is a 
>>>> patch against 2.6.22-rc3.
>>>> Let me know what you think.
>>> I think you got most of the ideas.   As Auke noted, your coding style 
>>> is showing again.   And your mailer again munged whitespace (fixed by 
>>> s/^<space><space>/<space>/ s/^$/<space>/).
>> Sorry about the coding style.  I instinctively followed what was there 
>> instead of kernel coding convention.  I will look into how whitespace 
>> is getting screwed up.
> 
> I have to watch my coding style too (I like to indent the closing 
> brace).
> 
> At least the white space damage seems to be reversable.  More than I 
> can say for this mailer.
> 
>>>> Find a buffer that is complete with rx->el not set and rx->s0 set.
>>>>     It appears that hardware can read the rfd's el-bit, then 
>>>> software can clear the rfd el-bit and set the rfd size, and then 
>>>> hardware can come in and read the size.
>>> Yes, since the size is after the EL flag in the descriptor, this can 
>>> happen since the pci read is not atomic.
>>>> I am reading the status back, although I don't think that I have to 
>>>> in this instance.
>>> Actually, you are reading it when the rfd still has EL set.  Since 
>>> the cpu will never encounter that case, the if condition is never 
>>> satisfied.
>> In my tests, every time I found a completed rfd with the el-bit set, 
>> the receiver was in the out of resources state.
> 
> Yes, if the EL was set, it would be a real hard race to find the 
> completed packet with EL but not RNR.   I was trying to refer to where 
> you find a completed packet and then check for EL in the RFD.  That is 
> what I was claiming can not be observed by the cpu (unless the card 
> writes the EL bit back, and not just the status u16).
> 
> If the unless ... above is true, then please put a comment that the 
> device can write RFD->EL back to 1 if we raced.
> 
> 
>>> How about creating a state unknown, for when we think we should check 
>>> the device if its running.
>>> If we are in this state and then encounter a received packet without 
>>> s0 set, we can set it back
>>> to running.   We set it when we rx a packet with s0 set.
>>> We then move both io_status reads to the caller.
>> I can look into that as I clean this up.
>>
>>
>>>> I am testing a version of this code patched against 2.6.18.4 on my 
>>>> PXA 255 based system.  I will let you all know how it goes.
>> The testing I did so far did well.  I will try to get some more going 
>> tonight, hopefully on a cleaned up patch.
> 
> Good to hear our expectiations match reality.
> 
>>
>>> I'm assuming this is why the cleanup of the receiver start to always 
>>> start on rx_to_clean got dropped again. :-)
>> Yep.  I will get that in the next patch.
> 
> Ok.
> 
>>> Also, I would like a few sentences in the Driver Operation section IV 
>>> Receive big comment.  Something like
>>> In order to keep updates to the RFD link field from colliding with 
>>> hardware writes to mark packets complete, we use the feature that 
>>> hardware will not write to a size 0 descriptor and mark the previous 
>>> packet as end-of-list (EL).   After updating the link, we remove EL 
>>> and only then restore the size such that hardware may use the 
>>> previous-to-end RFD.
>>> at the end of the first paragraph, and insert software before "no 
>>> locking is required" in the second.
>> Sounds good to me.
>>
>> I will see if I can get into a cleaned up patch today and get it out 
>> by tomorrow.  Thanks for dealing with me...I have been around kernel 
>> code for awhile but posting official patches to linux is new to me.
>> -Ack
> 
> I've just learned by watching the lists over the last several years.  
> Well, and actually writing the odd patch here and there.
> 
> It occurs to me that I have been focusing on the code and not the 
> changelog.   I'll send a seperate reply on that thread shortly.
> 
> One more thing I'll state here ... as per the perfect patch guidelines, 
> it is preferred that the meta-discussion about the patch and its 
> history go after the change log, seperated from it by a line of "--- " 
> so that the patch application scripts can just extract the email 
> subject as the title and through the firsst line of --- as the commit 
> log.  (This saves some manual editing).
> 
> [1] http://kernelnewbies.org/UpstreamMerge
> 
> milton
> 
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-08-27 17:34                                                       ` Kok, Auke
@ 2007-08-27 18:32                                                         ` David Acker
  0 siblings, 0 replies; 51+ messages in thread
From: David Acker @ 2007-08-27 18:32 UTC (permalink / raw)
  To: Kok, Auke
  Cc: e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
	Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik

Kok, Auke wrote:
> Milton Miller wrote:
>> On Jun 5, 2007, at 8:34 AM, David Acker wrote:
> 
> David, Milton,
> 
> This was the last communication on-topic for the proposed changes to fix 
> e100 on ARM. We're holding our breath here waiting for more, and would 
> love to hear that this issue and fixes hasn't died off.
> 
> Thanks,
> 
> Auke

I am sorry folks, this is my fault.  I got pulled onto a fire on one of 
our other products.  I have only recently come back to working on our 
product that uses the e100 on ARM.  Based on my current time available 
to finish cleaning up this patch, I should have a new version available 
by the end of this week.
-Ack


> 
> 
> 
>>> Milton Miller wrote:
>>>> On Jun 1, 2007, at 3:45 PM, David Acker wrote:
>>>>> Ok, I took a stab at coding and testing these ideas.  Below is a 
>>>>> patch against 2.6.22-rc3.
>>>>> Let me know what you think.
>>>> I think you got most of the ideas.   As Auke noted, your coding 
>>>> style is showing again.   And your mailer again munged whitespace 
>>>> (fixed by s/^<space><space>/<space>/ s/^$/<space>/).
>>> Sorry about the coding style.  I instinctively followed what was 
>>> there instead of kernel coding convention.  I will look into how 
>>> whitespace is getting screwed up.
>>
>> I have to watch my coding style too (I like to indent the closing brace).
>>
>> At least the white space damage seems to be reversable.  More than I 
>> can say for this mailer.
>>
>>>>> Find a buffer that is complete with rx->el not set and rx->s0 set.
>>>>>     It appears that hardware can read the rfd's el-bit, then 
>>>>> software can clear the rfd el-bit and set the rfd size, and then 
>>>>> hardware can come in and read the size.
>>>> Yes, since the size is after the EL flag in the descriptor, this can 
>>>> happen since the pci read is not atomic.
>>>>> I am reading the status back, although I don't think that I have to 
>>>>> in this instance.
>>>> Actually, you are reading it when the rfd still has EL set.  Since 
>>>> the cpu will never encounter that case, the if condition is never 
>>>> satisfied.
>>> In my tests, every time I found a completed rfd with the el-bit set, 
>>> the receiver was in the out of resources state.
>>
>> Yes, if the EL was set, it would be a real hard race to find the 
>> completed packet with EL but not RNR.   I was trying to refer to where 
>> you find a completed packet and then check for EL in the RFD.  That is 
>> what I was claiming can not be observed by the cpu (unless the card 
>> writes the EL bit back, and not just the status u16).
>>
>> If the unless ... above is true, then please put a comment that the 
>> device can write RFD->EL back to 1 if we raced.
>>
>>
>>>> How about creating a state unknown, for when we think we should 
>>>> check the device if its running.
>>>> If we are in this state and then encounter a received packet without 
>>>> s0 set, we can set it back
>>>> to running.   We set it when we rx a packet with s0 set.
>>>> We then move both io_status reads to the caller.
>>> I can look into that as I clean this up.
>>>
>>>
>>>>> I am testing a version of this code patched against 2.6.18.4 on my 
>>>>> PXA 255 based system.  I will let you all know how it goes.
>>> The testing I did so far did well.  I will try to get some more going 
>>> tonight, hopefully on a cleaned up patch.
>>
>> Good to hear our expectiations match reality.
>>
>>>
>>>> I'm assuming this is why the cleanup of the receiver start to always 
>>>> start on rx_to_clean got dropped again. :-)
>>> Yep.  I will get that in the next patch.
>>
>> Ok.
>>
>>>> Also, I would like a few sentences in the Driver Operation section 
>>>> IV Receive big comment.  Something like
>>>> In order to keep updates to the RFD link field from colliding with 
>>>> hardware writes to mark packets complete, we use the feature that 
>>>> hardware will not write to a size 0 descriptor and mark the previous 
>>>> packet as end-of-list (EL).   After updating the link, we remove EL 
>>>> and only then restore the size such that hardware may use the 
>>>> previous-to-end RFD.
>>>> at the end of the first paragraph, and insert software before "no 
>>>> locking is required" in the second.
>>> Sounds good to me.
>>>
>>> I will see if I can get into a cleaned up patch today and get it out 
>>> by tomorrow.  Thanks for dealing with me...I have been around kernel 
>>> code for awhile but posting official patches to linux is new to me.
>>> -Ack
>>
>> I've just learned by watching the lists over the last several years.  
>> Well, and actually writing the odd patch here and there.
>>
>> It occurs to me that I have been focusing on the code and not the 
>> changelog.   I'll send a seperate reply on that thread shortly.
>>
>> One more thing I'll state here ... as per the perfect patch 
>> guidelines, it is preferred that the meta-discussion about the patch 
>> and its history go after the change log, seperated from it by a line 
>> of "--- " so that the patch application scripts can just extract the 
>> email subject as the title and through the firsst line of --- as the 
>> commit log.  (This saves some manual editing).
>>
>> [1] http://kernelnewbies.org/UpstreamMerge
>>
>> milton
>>
>> -
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 13:34                                                   ` David Acker
  2007-06-05 16:14                                                     ` Milton Miller
@ 2007-06-05 16:14                                                     ` Milton Miller
  2007-06-05 17:27                                                       ` Kok, Auke
  1 sibling, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-06-05 16:14 UTC (permalink / raw)
  To: David Acker
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher, Jeff Garzik

First, a question especially to Auke and Jeff:

Since this patch both reverts the broken change that is currently in 
-rc and creates the fixed driver, I'm not sure I like the subject 
stating "on ARM", although that is the feature of the rewrite, and was 
the intent of merging the previous patch.  This is actually its a fix 
for all systems relative to current, including those where dma is not 
cache coherent, (unlike a simple revert).

Should we just put a comment about reverting the previous patch early 
in the change log?

Something like this:

Fix the e100 receiver handling, supporting cache incoherent DMA.

Discard the concept of setting the S (suspend) bit with the EL bit 
introduced in commit d52df4a35af569071fda3f4eb08e47cc7023f094.  In 
addition to it not setting either bit, the hardware doesn't work that 
way.

Thoughts?

Here is the changelog portion of the latest patch (quoted), with my 
comments thrown in:

> On the ARM, their is a race condition between software allocating a 
> new receive

On systems that have cache incoherent DMA, including ARM,

> buffer and hardware writing into a buffer.  The two race on touching 
> the last
> Receive Frame Descriptor (RFD).  It has its el-bit set and its next 
> link equal
> to 0.  When hardware encounters this buffer it attempts to write data 
> to it
> and then update Status Word bits and Actual Count in the RFD.  At the 
> same time
> software may try to clear the el-bit and set the link address to a new 
> buffer.
> Since the entire RFD is once cache-line, the two write operations can
> collide.  This can lead to the receive unit stalling or freed receive 
> buffers
> getting written to.

This can lead to the receive unit stalling or interpreting random 
memory as its receive area.

>
> The fix is to set the el-bit on and the size to 0 on the next to last 
> buffer
> in the chain.  When the hardware encounters this buffer it stops and 
> does
> not write to it at all.  The hardware issues an RNR interrupt with the
> receive unit in the No Resources state.  When software allocates 
> buffers,
> it can update the tail of the list when either it knows the hardware
> has stopped or the previous to the new one to mark marked.

Software can write to the tail of the list because it knows hardware 
will stop on the previous descriptor that was marked as the end of 
list.

> Once it has a new next to last buffer prepared, it can clear the el-bit
> and set the size on the previous one.  The race on this buffer is safe
> since the link already points to a valid next buffer.

and we can handle the race setting the size (assuming aligned 16 bit 
writes are atomic with respect to the DMA read).

This paragraph changed from third person (the software or hardware) to 
second person (we).

>   We keep flags
> in our software descriptor to note if the el bit is set and if the size
> was 0.  When we clear the RFD's el bit and set its size, we also clear
> the el flag but we leave the size was 0 bit set.  This was we can find
> this buffer again later.

This way software can identify them when the race may have occurred 
when cleaning the ring.  On these descriptors, it looks ahead and if 
the next one is complete then hardware must have skipped the current 
one.  Logic is added to prevent two packets in a row being marked while 
the receiver is running to avoid running in lockstep with the hardware 
and thereby limiting the required lookahead.

> If the hardware sees the el-bit cleared without the size set, it will
> move on to the next buffer and skip this one.  If it sees
> the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.

These sentences should be moved to the mention of the race above to 
reducing mixing descriptions of the hardware and the software.

milton

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 16:14                                                     ` Milton Miller
@ 2007-06-05 17:27                                                       ` Kok, Auke
  2007-06-05 17:39                                                         ` Jeff Garzik
  0 siblings, 1 reply; 51+ messages in thread
From: Kok, Auke @ 2007-06-05 17:27 UTC (permalink / raw)
  To: Milton Miller, Jeff Garzik
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher, David Acker

Milton Miller wrote:
> First, a question especially to Auke and Jeff:
> 
> Since this patch both reverts the broken change that is currently in 
> -rc and creates the fixed driver, I'm not sure I like the subject 
> stating "on ARM", although that is the feature of the rewrite, and was 
> the intent of merging the previous patch.  This is actually its a fix 
> for all systems relative to current, including those where dma is not 
> cache coherent, (unlike a simple revert).
> 
> Should we just put a comment about reverting the previous patch early 
> in the change log?

yes

> Something like this:
> 
> Fix the e100 receiver handling, supporting cache incoherent DMA.
> 
> Discard the concept of setting the S (suspend) bit with the EL bit 
> introduced in commit d52df4a35af569071fda3f4eb08e47cc7023f094.  In 
> addition to it not setting either bit, the hardware doesn't work that 
> way.
> 
> 
> Thoughts?

the same comment I made about the coding style counts for this too: I will clean 
up the patch and gladly adjust the topic, which in this case seems the right 
thing to do. I am too grateful that you guys are digging into this so deeply to 
send you back with comments on style - I'll gladly fix that up :)

> Here is the changelog portion of the latest patch (quoted), with my 
> comments thrown in:

OK, I will buffer this info and make sure this gets picked up on the final version.

this opens up another question:

We need to make sure that now that we're getting closer to 2.6.22 we don't end 
up killing e100 in it. Should we drop the current fixes in it to be on the safe 
side and aim for 2.6.23? I would hate to see an untested codepath breaking e100 
on something like ppc or mips... that will be very painful

Jeff, your thoughts on that?

Auke

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 17:27                                                       ` Kok, Auke
@ 2007-06-05 17:39                                                         ` Jeff Garzik
  2007-06-05 17:42                                                           ` David Acker
  2007-06-05 17:43                                                           ` Kok, Auke
  0 siblings, 2 replies; 51+ messages in thread
From: Jeff Garzik @ 2007-06-05 17:39 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Milton Miller, Jeff Garzik, David Acker, netdev, e1000-devel,
	Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman

On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote:
> We need to make sure that now that we're getting closer to 2.6.22 we don't 
> end up killing e100 in it. Should we drop the current fixes in it to be on 
> the safe side and aim for 2.6.23? I would hate to see an untested codepath 
> breaking e100 on something like ppc or mips... that will be very painful

I certainly agree with this assessment...

I've been wondering if, based on all this recent work, we should revert
the s-bit stuff and wait for 2.6.23.

	Jeff




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 17:39                                                         ` Jeff Garzik
@ 2007-06-05 17:42                                                           ` David Acker
  2007-06-05 17:43                                                           ` Kok, Auke
  1 sibling, 0 replies; 51+ messages in thread
From: David Acker @ 2007-06-05 17:42 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
	Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik

Jeff Garzik wrote:
> On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote:
>> We need to make sure that now that we're getting closer to 2.6.22 we don't 
>> end up killing e100 in it. Should we drop the current fixes in it to be on 
>> the safe side and aim for 2.6.23? I would hate to see an untested codepath 
>> breaking e100 on something like ppc or mips... that will be very painful
> 
> I certainly agree with this assessment...
> 
> I've been wondering if, based on all this recent work, we should revert
> the s-bit stuff and wait for 2.6.23.

I think so.  It will be no worse then it was and make the patch that 
fixes it clearer.
-Ack

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 17:39                                                         ` Jeff Garzik
  2007-06-05 17:42                                                           ` David Acker
@ 2007-06-05 17:43                                                           ` Kok, Auke
  2007-06-05 17:56                                                             ` Milton Miller
  1 sibling, 1 reply; 51+ messages in thread
From: Kok, Auke @ 2007-06-05 17:43 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Milton Miller, Jeff Garzik, David Acker, netdev, e1000-devel,
	Jeff Kirsher, John Ronciak, Jesse Brandeburg, Scott Feldman

Jeff Garzik wrote:
> On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote:
>> We need to make sure that now that we're getting closer to 2.6.22 we don't 
>> end up killing e100 in it. Should we drop the current fixes in it to be on 
>> the safe side and aim for 2.6.23? I would hate to see an untested codepath 
>> breaking e100 on something like ppc or mips... that will be very painful
> 
> I certainly agree with this assessment...
> 
> I've been wondering if, based on all this recent work, we should revert
> the s-bit stuff and wait for 2.6.23.

Yes, that's my point. If Milton and David agree I think we should do so immediately.

If so, do you want me to write a revert-patch or do you have some magic to do 
that for me?

Auke

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 17:43                                                           ` Kok, Auke
@ 2007-06-05 17:56                                                             ` Milton Miller
  2007-06-05 23:33                                                               ` Kok, Auke
  0 siblings, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-06-05 17:56 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Jeff Garzik, Jeff Garzik, e1000-devel, David Acker, netdev,
	Jesse Brandeburg, Jeff Kirsher, Scott Feldman, John Ronciak


On Jun 5, 2007, at 12:43 PM, Kok, Auke wrote:

> Jeff Garzik wrote:
>> On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote:
>>> We need to make sure that now that we're getting closer to 2.6.22 we 
>>> don't end up killing e100 in it. Should we drop the current fixes in 
>>> it to be on the safe side and aim for 2.6.23? I would hate to see an 
>>> untested codepath breaking e100 on something like ppc or mips... 
>>> that will be very painful
>> I certainly agree with this assessment...
>> I've been wondering if, based on all this recent work, we should 
>> revert
>> the s-bit stuff and wait for 2.6.23.
>
> Yes, that's my point. If Milton and David agree I think we should do 
> so immediately.

We definitely need something other than what is in now.


> If so, do you want me to write a revert-patch or do you have some 
> magic to do that for me?
>

The simple git revert won't work because there have been other changes 
(ioread for instance) that conflict.

milton


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 17:56                                                             ` Milton Miller
@ 2007-06-05 23:33                                                               ` Kok, Auke
  2007-06-05 23:44                                                                 ` Jeff Garzik
  2007-06-06  2:26                                                                 ` Kok, Auke
  0 siblings, 2 replies; 51+ messages in thread
From: Kok, Auke @ 2007-06-05 23:33 UTC (permalink / raw)
  To: Jeff Garzik, Jeff Garzik
  Cc: e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
	Scott Feldman, John Ronciak, Jeff Kirsher, David Acker

Milton Miller wrote:
> On Jun 5, 2007, at 12:43 PM, Kok, Auke wrote:
> 
>> Jeff Garzik wrote:
>>> On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote:
>>>> We need to make sure that now that we're getting closer to 2.6.22 we 
>>>> don't end up killing e100 in it. Should we drop the current fixes in 
>>>> it to be on the safe side and aim for 2.6.23? I would hate to see an 
>>>> untested codepath breaking e100 on something like ppc or mips... 
>>>> that will be very painful
>>> I certainly agree with this assessment...
>>> I've been wondering if, based on all this recent work, we should 
>>> revert
>>> the s-bit stuff and wait for 2.6.23.
>> Yes, that's my point. If Milton and David agree I think we should do 
>> so immediately.
> 
> We definitely need something other than what is in now.
> 
> 
>> If so, do you want me to write a revert-patch or do you have some 
>> magic to do that for me?
>>
> 
> The simple git revert won't work because there have been other changes 
> (ioread for instance) that conflict.

Hmm git-revert seems to do the job right. I checked it with git-show | patch -p1 
-R and the results look OK. The two patches on top of the one we want to revert 
are unrelated enough to apply (manually it shows some fuzz, but otherwise it's OK).

Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to revert the 
following patch for now:

---
commit  d52df4a35af569071fda3f4eb08e47cc7023f094
Author: Scott Feldman <sfeldma@pobox.com>
Date:   Wed Nov 9 02:18:52 2005 -0500

     [netdrvr e100] experiment with doing RX in a similar manner to eepro100

     I was going to say that eepro100's speedo_rx_link() does the same DMA
     abuse as e100, but then I noticed one little detail: eepro100 sets  both
     EL (end of list) and S (suspend) bits in the RFD as it chains it  to the
     RFD list.  e100 was only setting the EL bit.  Hmmm, that's  interesting.
     That means that if HW reads a RFD with the S-bit set,  it'll process
     that RFD and then suspend the receive unit.  The  receive unit will
     resume when SW clears the S-bit.  There is no need  for SW to restart
     the receive unit.  Which means a lot of the receive  unit state tracking
     code in the driver goes away.

     So here's a patch against 2.6.14.  (Sorry for inlining it; the mailer
     I'm using now will mess with the word wrap).  I can't test this on
     XScale (unless someone has an e100 module for Gumstix :) .  It should
     be doing exactly what eepro100 does with RFDs.  I don't believe this
     change will introduce a performance hit because the S-bit and EL-bit  go
     hand-in-hand meaning if we're going to suspend because of the S- bit,
     we're on the last resource anyway, so we'll have to wait for SW  to
     replenish.
     (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 commit)
---

Cheers,

Auke

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 23:33                                                               ` Kok, Auke
@ 2007-06-05 23:44                                                                 ` Jeff Garzik
  2007-06-06  2:26                                                                 ` Kok, Auke
  1 sibling, 0 replies; 51+ messages in thread
From: Jeff Garzik @ 2007-06-05 23:44 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Jeff Garzik, Milton Miller, e1000-devel, David Acker, netdev,
	Jesse Brandeburg, Jeff Kirsher, Scott Feldman, John Ronciak

On Tue, Jun 05, 2007 at 04:33:12PM -0700, Kok, Auke wrote:
> Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to 
> revert the following patch for now:

Can you do me one more favor, and write a short paragraph describing why
it is getting reverted, to paste into the commit message?

You can do this yourself and have me git-pull the result, or I can do
the revert and paste the paragraph from email, your choice.

	Jeff




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-05 23:33                                                               ` Kok, Auke
  2007-06-05 23:44                                                                 ` Jeff Garzik
@ 2007-06-06  2:26                                                                 ` Kok, Auke
  2007-06-06  9:28                                                                   ` Milton Miller
  1 sibling, 1 reply; 51+ messages in thread
From: Kok, Auke @ 2007-06-06  2:26 UTC (permalink / raw)
  To: Jeff Garzik, Jeff Garzik
  Cc: e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
	Scott Feldman, John Ronciak, Jeff Kirsher, David Acker

Kok, Auke wrote:
> Milton Miller wrote:
>> On Jun 5, 2007, at 12:43 PM, Kok, Auke wrote:
>>
>>> Jeff Garzik wrote:
>>>> On Tue, Jun 05, 2007 at 10:27:19AM -0700, Kok, Auke wrote:
>>>>> We need to make sure that now that we're getting closer to 2.6.22 we 
>>>>> don't end up killing e100 in it. Should we drop the current fixes in 
>>>>> it to be on the safe side and aim for 2.6.23? I would hate to see an 
>>>>> untested codepath breaking e100 on something like ppc or mips... 
>>>>> that will be very painful
>>>> I certainly agree with this assessment...
>>>> I've been wondering if, based on all this recent work, we should 
>>>> revert
>>>> the s-bit stuff and wait for 2.6.23.
>>> Yes, that's my point. If Milton and David agree I think we should do 
>>> so immediately.
>> We definitely need something other than what is in now.
>>
>>
>>> If so, do you want me to write a revert-patch or do you have some 
>>> magic to do that for me?
>>>
>> The simple git revert won't work because there have been other changes 
>> (ioread for instance) that conflict.
> 
> Hmm git-revert seems to do the job right. I checked it with git-show | patch -p1 
> -R and the results look OK. The two patches on top of the one we want to revert 
> are unrelated enough to apply (manually it shows some fuzz, but otherwise it's OK).
> 
> Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to revert the 
> following patch for now:
> 
> ---
> commit  d52df4a35af569071fda3f4eb08e47cc7023f094
> Author: Scott Feldman <sfeldma@pobox.com>
> Date:   Wed Nov 9 02:18:52 2005 -0500
> 
>      [netdrvr e100] experiment with doing RX in a similar manner to eepro100
> 
>      I was going to say that eepro100's speedo_rx_link() does the same DMA
>      abuse as e100, but then I noticed one little detail: eepro100 sets  both
>      EL (end of list) and S (suspend) bits in the RFD as it chains it  to the
>      RFD list.  e100 was only setting the EL bit.  Hmmm, that's  interesting.
>      That means that if HW reads a RFD with the S-bit set,  it'll process
>      that RFD and then suspend the receive unit.  The  receive unit will
>      resume when SW clears the S-bit.  There is no need  for SW to restart
>      the receive unit.  Which means a lot of the receive  unit state tracking
>      code in the driver goes away.
> 
>      So here's a patch against 2.6.14.  (Sorry for inlining it; the mailer
>      I'm using now will mess with the word wrap).  I can't test this on
>      XScale (unless someone has an e100 module for Gumstix :) .  It should
>      be doing exactly what eepro100 does with RFDs.  I don't believe this
>      change will introduce a performance hit because the S-bit and EL-bit  go
>      hand-in-hand meaning if we're going to suspend because of the S- bit,
>      we're on the last resource anyway, so we'll have to wait for SW  to
>      replenish.
>      (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 commit)
> ---
> 

A little bit more is needed to explain why we're reverting it for now. Jeff, 
please insert this into the revert commit.

Auke

--
This patch attempted to fix e100 for non-cache coherent memory architectures by 
using the cb style code that eepro100 had and using the EL and s bits from the 
RFD list. Unfortunately the hardware doesn't work exactly like this and 
therefore this patch actually breaks e100 on those systems. Reverting the change 
brings it back to the previously known good state for 2.6.22. The pending 
rewrite in progress to this code can then be safely merged later.

Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-06  2:26                                                                 ` Kok, Auke
@ 2007-06-06  9:28                                                                   ` Milton Miller
  2007-06-11 15:58                                                                     ` Milton Miller
  0 siblings, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-06-06  9:28 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Jeff Garzik, Jeff Garzik, e1000-devel, David Acker, netdev,
	Jesse Brandeburg, Jeff Kirsher, Scott Feldman, John Ronciak


On Jun 5, 2007, at 9:26 PM, Kok, Auke wrote:

> Kok, Auke wrote:
>>>
>> Hmm git-revert seems to do the job right. I checked it with git-show 
>> | patch -p1 -R and the results look OK. The two patches on top of the 
>> one we want to revert are unrelated enough to apply (manually it 
>> shows some fuzz, but otherwise it's OK).
>> Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` to 
>> revert the following patch for now:
>> ---
>> commit  d52df4a35af569071fda3f4eb08e47cc7023f094
>> Author: Scott Feldman <sfeldma@pobox.com>
>> Date:   Wed Nov 9 02:18:52 2005 -0500
>>      [netdrvr e100] experiment with doing RX in a similar manner to 
>> eepro100
>>      I was going to say that eepro100's speedo_rx_link() does the 
>> same DMA
>>      abuse as e100, but then I noticed one little detail: eepro100 
>> sets  both
>>      EL (end of list) and S (suspend) bits in the RFD as it chains it 
>>  to the
>>      RFD list.  e100 was only setting the EL bit.  Hmmm, that's  
>> interesting.
>>      That means that if HW reads a RFD with the S-bit set,  it'll 
>> process
>>      that RFD and then suspend the receive unit.  The  receive unit 
>> will
>>      resume when SW clears the S-bit.  There is no need  for SW to 
>> restart
>>      the receive unit.  Which means a lot of the receive  unit state 
>> tracking
>>      code in the driver goes away.
>>      So here's a patch against 2.6.14.  (Sorry for inlining it; the 
>> mailer
>>      I'm using now will mess with the word wrap).  I can't test this 
>> on
>>      XScale (unless someone has an e100 module for Gumstix :) .  It 
>> should
>>      be doing exactly what eepro100 does with RFDs.  I don't believe 
>> this
>>      change will introduce a performance hit because the S-bit and 
>> EL-bit  go
>>      hand-in-hand meaning if we're going to suspend because of the S- 
>> bit,
>>      we're on the last resource anyway, so we'll have to wait for SW  
>> to
>>      replenish.
>>      (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 
>> commit)
>> ---
>
> A little bit more is needed to explain why we're reverting it for now. 
> Jeff, please insert this into the revert commit.
>
> Auke
>
> --
> This patch attempted to fix e100 for non-cache coherent memory 
> architectures by using the cb style code that eepro100 had and using 
> the EL and s bits from the RFD list. Unfortunately the hardware 
> doesn't work exactly like this and therefore this patch actually 
> breaks e100 on those systems.

on all systems.  (Both the &| typo and the removed restart logic).

> Reverting the change brings it back to the previously known good state 
> for 2.6.22. The pending rewrite in progress to this code can then be 
> safely merged later.
>
> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
> ---

milton


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-06  9:28                                                                   ` Milton Miller
@ 2007-06-11 15:58                                                                     ` Milton Miller
  2007-06-15 14:39                                                                       ` Jeff Garzik
  0 siblings, 1 reply; 51+ messages in thread
From: Milton Miller @ 2007-06-11 15:58 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, Jeff Garzik, e1000-devel, David Acker, netdev,
	Jesse Brandeburg, Kok, Auke, Jeff Kirsher, Scott Feldman,
	John Ronciak


On Jun 6, 2007, at 4:28 AM, Milton Miller wrote:

>
> On Jun 5, 2007, at 9:26 PM, Kok, Auke wrote:
>
>> Kok, Auke wrote:
>>>>
>>> Hmm git-revert seems to do the job right. I checked it with git-show 
>>> | patch -p1 -R and the results look OK. The two patches on top of 
>>> the one we want to revert are unrelated enough to apply (manually it 
>>> shows some fuzz, but otherwise it's OK).
>>> Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` 
>>> to revert the following patch for now:
>>> ---
>>> commit  d52df4a35af569071fda3f4eb08e47cc7023f094
>>> Author: Scott Feldman <sfeldma@pobox.com>
>>> Date:   Wed Nov 9 02:18:52 2005 -0500
>>>      [netdrvr e100] experiment with doing RX in a similar manner to 
>>> eepro100
>>>      I was going to say that eepro100's speedo_rx_link() does the 
>>> same DMA
>>>      abuse as e100, but then I noticed one little detail: eepro100 
>>> sets  both
>>>      EL (end of list) and S (suspend) bits in the RFD as it chains 
>>> it  to the
>>>      RFD list.  e100 was only setting the EL bit.  Hmmm, that's  
>>> interesting.
>>>      That means that if HW reads a RFD with the S-bit set,  it'll 
>>> process
>>>      that RFD and then suspend the receive unit.  The  receive unit 
>>> will
>>>      resume when SW clears the S-bit.  There is no need  for SW to 
>>> restart
>>>      the receive unit.  Which means a lot of the receive  unit state 
>>> tracking
>>>      code in the driver goes away.
>>>      So here's a patch against 2.6.14.  (Sorry for inlining it; the 
>>> mailer
>>>      I'm using now will mess with the word wrap).  I can't test this 
>>> on
>>>      XScale (unless someone has an e100 module for Gumstix :) .  It 
>>> should
>>>      be doing exactly what eepro100 does with RFDs.  I don't believe 
>>> this
>>>      change will introduce a performance hit because the S-bit and 
>>> EL-bit  go
>>>      hand-in-hand meaning if we're going to suspend because of the 
>>> S- bit,
>>>      we're on the last resource anyway, so we'll have to wait for SW 
>>>  to
>>>      replenish.
>>>      (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 
>>> commit)
>>> ---
>>
>> A little bit more is needed to explain why we're reverting it for 
>> now. Jeff, please insert this into the revert commit.
>>
>> Auke
>>
>> --
>> This patch attempted to fix e100 for non-cache coherent memory 
>> architectures by using the cb style code that eepro100 had and using 
>> the EL and s bits from the RFD list. Unfortunately the hardware 
>> doesn't work exactly like this and therefore this patch actually 
>> breaks e100 on those systems.
>
> on all systems.  (Both the &| typo and the removed restart logic).
>
>> Reverting the change brings it back to the previously known good 
>> state for 2.6.22. The pending rewrite in progress to this code can 
>> then be safely merged later.
>>
>> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
>>

Jeff I don't see this in netdev upstream-fixes.  Are you waiting on 
Auke to respond to this?

milton


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-06-11 15:58                                                                     ` Milton Miller
@ 2007-06-15 14:39                                                                       ` Jeff Garzik
  0 siblings, 0 replies; 51+ messages in thread
From: Jeff Garzik @ 2007-06-15 14:39 UTC (permalink / raw)
  To: Milton Miller
  Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Scott Feldman,
	John Ronciak, Jeff Kirsher, David Acker, Jeff Garzik

Milton Miller wrote:
> 
> On Jun 6, 2007, at 4:28 AM, Milton Miller wrote:
> 
>>
>> On Jun 5, 2007, at 9:26 PM, Kok, Auke wrote:
>>
>>> Kok, Auke wrote:
>>>>>
>>>> Hmm git-revert seems to do the job right. I checked it with git-show 
>>>> | patch -p1 -R and the results look OK. The two patches on top of 
>>>> the one we want to revert are unrelated enough to apply (manually it 
>>>> shows some fuzz, but otherwise it's OK).
>>>> Jeff, please `git-revert d52df4a35af569071fda3f4eb08e47cc7023f094` 
>>>> to revert the following patch for now:
>>>> ---
>>>> commit  d52df4a35af569071fda3f4eb08e47cc7023f094
>>>> Author: Scott Feldman <sfeldma@pobox.com>
>>>> Date:   Wed Nov 9 02:18:52 2005 -0500
>>>>      [netdrvr e100] experiment with doing RX in a similar manner to 
>>>> eepro100
>>>>      I was going to say that eepro100's speedo_rx_link() does the 
>>>> same DMA
>>>>      abuse as e100, but then I noticed one little detail: eepro100 
>>>> sets  both
>>>>      EL (end of list) and S (suspend) bits in the RFD as it chains 
>>>> it  to the
>>>>      RFD list.  e100 was only setting the EL bit.  Hmmm, that's  
>>>> interesting.
>>>>      That means that if HW reads a RFD with the S-bit set,  it'll 
>>>> process
>>>>      that RFD and then suspend the receive unit.  The  receive unit 
>>>> will
>>>>      resume when SW clears the S-bit.  There is no need  for SW to 
>>>> restart
>>>>      the receive unit.  Which means a lot of the receive  unit state 
>>>> tracking
>>>>      code in the driver goes away.
>>>>      So here's a patch against 2.6.14.  (Sorry for inlining it; the 
>>>> mailer
>>>>      I'm using now will mess with the word wrap).  I can't test this on
>>>>      XScale (unless someone has an e100 module for Gumstix :) .  It 
>>>> should
>>>>      be doing exactly what eepro100 does with RFDs.  I don't believe 
>>>> this
>>>>      change will introduce a performance hit because the S-bit and 
>>>> EL-bit  go
>>>>      hand-in-hand meaning if we're going to suspend because of the 
>>>> S- bit,
>>>>      we're on the last resource anyway, so we'll have to wait for SW 
>>>>  to
>>>>      replenish.
>>>>      (cherry picked from 29e79da9495261119e3b2e4e7c72507348e75976 
>>>> commit)
>>>> ---
>>>
>>> A little bit more is needed to explain why we're reverting it for 
>>> now. Jeff, please insert this into the revert commit.
>>>
>>> Auke
>>>
>>> -- 
>>> This patch attempted to fix e100 for non-cache coherent memory 
>>> architectures by using the cb style code that eepro100 had and using 
>>> the EL and s bits from the RFD list. Unfortunately the hardware 
>>> doesn't work exactly like this and therefore this patch actually 
>>> breaks e100 on those systems.
>>
>> on all systems.  (Both the &| typo and the removed restart logic).
>>
>>> Reverting the change brings it back to the previously known good 
>>> state for 2.6.22. The pending rewrite in progress to this code can 
>>> then be safely merged later.
>>>
>>> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
>>>
> 
> Jeff I don't see this in netdev upstream-fixes.  Are you waiting on Auke 
> to respond to this?

It should now be upstream, as of a day or two ago.

Sorry for the delaying, was travelling, and wound up accidentally losing 
Auke's commit description.

	Jeff




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-24  5:26                                       ` Milton Miller
  2007-05-24 11:21                                         ` Milton Miller
@ 2007-05-24 12:44                                         ` David Acker
  1 sibling, 0 replies; 51+ messages in thread
From: David Acker @ 2007-05-24 12:44 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, netdev, e1000-devel, Jeff Kirsher, John Ronciak,
	Jesse Brandeburg, Scott Feldman, Kok, Auke

Milton Miller wrote:
> On May 23, 2007, at 4:32 PM, David Acker wrote:
>> Milton Miller wrote:
>>> My current reading of the manual is that the C bit will not be
>>> set on an RFD that is size 0.  It goes on to processes EL and
>>> S, and decides to stop and interrupt RNR or suspend, or just
>>> go to the next packet.
>> I double checked this with a quick experiment and it appears you are 
>> correct.
>>
>> What about if we always did the following:
>> set the size:
>> sync();
>> clear el-bit
>> sync()
>>
>> Then if the hardware sees just the size set, the packet completes but 
>> with the el-bit and we know we need to restart since it completed.
>> If it sees the size == 0, and the el bit set, it stops and RNR 
>> interrupts.
> 
> I think this is exposed to a hole and a race:  we don't know if the 
> hardware
> read the RFD before we set the size or after, just that it was before 
> the EL
> bit was cleared.  If it read it before the size was set, then it will not
> set the C bit.  If it reads it after the size is set, it will complete it.
Yep...I too got sidetracked!  My test time got lost to two 2 month old 
twins needing to be fed or else! :-)

> 
> For coherent DMA we can always observe the C bit.  But for the 
> incoherent DMA
> case, our store to clear the EL bit may overwrite the dma from the 
> device to
> the beginning of the packet, or the write to EOF, F, and size, and/or the
> write to C, OK, and status bits to tell us its done.  In the worst case, we
> would overwrite the beginning of the data but catch the C bit and even the
> actual size, and therefore would receive corrupted data.
> 
> We can only detect the hardware went RNR when it does so or decide we 
> won the
> race when it receives and completes the next frame.
Yes, I agree.

>> When we find a buffer that is not completed but has the el-bit set, we 
>> read the status byte of the status control block to see if the RU is 
>> in the no resources state.  If it isn't, it means that we found that 
>> buffer  before the hardware did and thus need to wait for it.  We will 
>> either find it on the next poll or enable interrupts and get told 
>> about it by hardware.
>>
>> What do you think?
> 
> I think the second part is good ...
Cool.  That part seemed to work well in my tests.

I will reply to your next mail to discuss your plan so that I get it all 
in one message.
-Ack

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)
  2007-05-23 14:02                                   ` Milton Miller
  2007-05-23 21:32                                     ` David Acker
@ 2007-05-24  4:13                                     ` Milton Miller
  1 sibling, 0 replies; 51+ messages in thread
From: Milton Miller @ 2007-05-24  4:13 UTC (permalink / raw)
  To: Milton Miller
  Cc: Jeff Garzik, e1000-devel, David Acker, netdev, Jesse Brandeburg,
	Kok, Auke, Jeff Kirsher, Scott Feldman, John Ronciak

Auke Kok pointed out I had left an unfinished thought this
morning ... well, here's a completion, but I will mostly
think about David's latest proposal.

I think I was debating proposing this, then got side
tracked then hit send.

On May 23, 2007, at 9:02 AM, Milton Miller wrote:
>
>> What if we just fixed the alignment of rfd to make sure it is cache 
>> aligned?  Then we know our syncs will be a single cache flush.
>
> That doesn't help in the coherent dma case; the device can still
> read in between, and we need to handle the !EL size 0 case.
>
> Actually, I thought about it some more after yesterdays note.
> We can set the size !0 after making sure the device sees EL
> (wmb and pci_sync_for_device), as long as we remember this
> buffer used to be size 0 when doing RX clean, and look ahead
> for another complete packet as long as the size was ever 0.
>
> My proposal was just to use "size is still 0" as that flag,
> wasting the space allocated for that skb.
>
> Another aproach is to always allocate a rfd-only packet
> at the beginning of the loop, and inserting th

... inserting this dummy skb before the last successfuly
allocated one, moving the links around.   This has the
benefit of not allocating a 1500+32 byte buffer that we
don't intend to fill, but is a lot more complicated
insert and move skb around when filling.

We now return you to your previously scheduled program.

Now back to thinking about David's proposal, and my thoughts
about how I might do the allocate non-zero size.

milton

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] e100 rx: or s and el bits
  2007-05-01 11:24 [PATCH] e100 rx: or s and el bits Milton Miller
  2007-05-01 15:01 ` David Acker
@ 2007-05-01 15:21 ` Kok, Auke
  1 sibling, 0 replies; 51+ messages in thread
From: Kok, Auke @ 2007-05-01 15:21 UTC (permalink / raw)
  To: Milton Miller
  Cc: Scott Feldman, Jeff Garzik, e1000-devel, netdev, linux-kernel,
	Andrew Morton, John Ronciak, Jesse Brandeburg, Jeff Kirsher

Milton Miller wrote:
> In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description
> talks about emulating another driver by setting addtional bits and
> the being unable to test when submitted.  Seeing the & operator to
> set more bits made me suspicious, and indeed the bits are defined
> in positive logic:
> 
> 	cb_s      = 0x4000,
> 	cb_el     = 0x8000,
> 
> So anding those together would be 0.   I'm guessing they should
> be or'd, but don't have hardware here to test, much like the
> committed patch.  In fact, I'll let someone else do the compile
> test too.  I'll update the comment.
> 
> (It looks like the end of list and s bits would not be set
> in the template nor cleared when linking in recieve skbs, so
> as long as the kernel can keep up with the 100Mb card we 
> wouldn't see the adapter go off the linked list, possibly
> explaining any successful use of this patch written against
> 2.6.14).

yes, that definately doesn't seem right. I'll try to run this and watch out for 
David Ackers report as well.

Thanks,

Auke


> 
> Signed-off-by: Milton Miller <miltonm@bga.com>
> 
> 
> --- linux-2.6/drivers/net/e100.c.orig	2007-05-01 05:19:17.000000000 -0500
> +++ linux-2.6/drivers/net/e100.c	2007-05-01 05:22:14.000000000 -0500
> @@ -947,7 +947,7 @@ static void e100_get_defaults(struct nic
>  		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
>  
>  	/* Template for a freshly allocated RFD */
> -	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
> +	nic->blank_rfd.command = cpu_to_le16(cb_el | cb_s);
>  	nic->blank_rfd.rbd = 0xFFFFFFFF;
>  	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>  
> @@ -1769,13 +1769,13 @@ static int e100_rx_alloc_skb(struct nic 
>  	}
>  
>  	/* Link the RFD to end of RFA by linking previous RFD to
> -	 * this one, and clearing EL bit of previous.  */
> +	 * this one, and clearing EL and S bit of previous.  */
>  	if(rx->prev->skb) {
>  		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
>  		put_unaligned(cpu_to_le32(rx->dma_addr),
>  			(u32 *)&prev_rfd->link);
>  		wmb();
> -		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
> +		prev_rfd->command &= ~cpu_to_le16(cb_el | cb_s);
>  		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
>  			sizeof(struct rfd), PCI_DMA_TODEVICE);
>  	}

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2007-08-27 18:32 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-01 11:24 [PATCH] e100 rx: or s and el bits Milton Miller
2007-05-01 15:01 ` David Acker
2007-05-02 20:21   ` David Acker
2007-05-04 21:43     ` David Acker
2007-05-06  6:36       ` Milton Miller
2007-05-07 15:27         ` David Acker
2007-05-14 18:26         ` [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits) David Acker
2007-05-18  1:54           ` Jeff Garzik
2007-05-18  3:47             ` Kok, Auke
2007-05-18 14:07               ` David Acker
2007-05-18 14:20                 ` David Acker
2007-05-18 15:29                   ` Kok, Auke
2007-05-18 15:47                     ` David Acker
2007-05-18 15:59                       ` Kok, Auke
2007-05-18 17:11                         ` David Acker
2007-05-18 17:47                           ` Kok, Auke
2007-05-21 17:35                           ` Milton Miller
2007-05-21 17:45                             ` Kok, Auke
2007-05-22 16:51                               ` Milton Miller
2007-05-22 22:07                                 ` David Acker
2007-05-23 14:02                                   ` Milton Miller
2007-05-23 21:32                                     ` David Acker
2007-05-24  5:26                                       ` Milton Miller
2007-05-24 11:21                                         ` Milton Miller
2007-05-24 12:51                                           ` David Acker
2007-05-24 14:25                                             ` Milton Miller
2007-05-29 15:58                                           ` David Acker
2007-05-30  8:26                                             ` Milton Miller
2007-06-01 20:45                                               ` David Acker
2007-06-01 21:13                                                 ` Jeff Garzik
2007-06-01 22:13                                                   ` Kok, Auke
2007-06-04  9:03                                                 ` Milton Miller
2007-06-05 13:34                                                   ` David Acker
2007-06-05 16:14                                                     ` Milton Miller
2007-08-27 17:34                                                       ` Kok, Auke
2007-08-27 18:32                                                         ` David Acker
2007-06-05 16:14                                                     ` Milton Miller
2007-06-05 17:27                                                       ` Kok, Auke
2007-06-05 17:39                                                         ` Jeff Garzik
2007-06-05 17:42                                                           ` David Acker
2007-06-05 17:43                                                           ` Kok, Auke
2007-06-05 17:56                                                             ` Milton Miller
2007-06-05 23:33                                                               ` Kok, Auke
2007-06-05 23:44                                                                 ` Jeff Garzik
2007-06-06  2:26                                                                 ` Kok, Auke
2007-06-06  9:28                                                                   ` Milton Miller
2007-06-11 15:58                                                                     ` Milton Miller
2007-06-15 14:39                                                                       ` Jeff Garzik
2007-05-24 12:44                                         ` David Acker
2007-05-24  4:13                                     ` Milton Miller
2007-05-01 15:21 ` [PATCH] e100 rx: or s and el bits Kok, Auke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).