* [PATCH] Fix e100 on systems that have cache incoherent DMA
@ 2007-08-31 20:54 David Acker
2007-09-04 17:02 ` Kok, Auke
2007-09-07 16:31 ` Kok, Auke
0 siblings, 2 replies; 28+ messages in thread
From: David Acker @ 2007-08-31 20:54 UTC (permalink / raw)
To: Auke Kok
Cc: John Ronciak, Jesse Brandeburg, Jeff Kirsher, Milton Miller,
Jeff Garzik, netdev, e1000-devel, Scott Feldman
On the systems that have cache incoherent DMA, including ARM, there is a
race condition between software allocating a new receive buffer and hardware
writing into a buffer. The two race on touching the last Receive Frame
Descriptor (RFD). It has its el-bit set and its next link equal to 0.
When hardware encounters this buffer it attempts to write data to it and
then update Status Word bits and Actual Count in the RFD. At the same time
software may try to clear the el-bit and set the link address to a new buffer.
Since the entire RFD is once cache-line, the two write operations can collide.
This can lead to the receive unit stalling or interpreting random memory as
its receive area.
The fix is to set the el-bit on and the size to 0 on the next to last buffer
in the chain. When the hardware encounters this buffer it stops and does not
write to it at all. The hardware issues an RNR interrupt with the receive
unit in the No Resources state. Software can write to the tail of the list
because it knows hardware will stop on the previous descriptor that was
marked as the end of list.
Once it has a new next to last buffer prepared, it can clear the el-bit and
set the size on the previous one. The race on this buffer is safe since
the link already points to a valid next buffer and the software can handle
the race setting the size (assuming aligned 16 bit writes are atomic with
respect to the DMA read). If the hardware sees the el-bit cleared without
the size set, it will move on to the next buffer and skip this one. If it
sees the size set but the el-bit still set, it will complete that buffer
and then RNR interrupt and wait.
Flags are kept in the software descriptor to note if the el bit is set and if
the size was 0. When software clears the RFD's el bit and set its size, it
also clears the el flag but leaves the size was 0 bit set. This way software
can identify them when the race may have occurred when cleaning the ring.
On these descriptors, it looks ahead and if the next one is complete then
hardware must have skipped the current one. Logic is added to prevent two
packets in a row being marked while the receiver is running to avoid running
in lockstep with the hardware and thereby limiting the required lookahead.
This is a patch for 2.6.23-rc4.
Signed-off-by: David Acker <dacker@roinet.com>
---
--- linux-2.6.23-rc4/drivers/net/e100.c.orig 2007-08-30 13:32:10.000000000 -0400
+++ linux-2.6.23-rc4/drivers/net/e100.c 2007-08-30 15:42:07.000000000 -0400
@@ -106,6 +106,13 @@
* the RFD, the RFD must be dma_sync'ed to maintain a consistent
* view from software and hardware.
*
+ * In order to keep updates to the RFD link field from colliding with
+ * hardware writes to mark packets complete, we use the feature that
+ * hardware will not write to a size 0 descriptor and mark the previous
+ * packet as end-of-list (EL). After updating the link, we remove EL
+ * and only then restore the size such that hardware may use the
+ * previous-to-end RFD.
+ *
* Under typical operation, the receive unit (RU) is start once,
* and the controller happily fills RFDs as frames arrive. If
* replacement RFDs cannot be allocated, or the RU goes non-active,
@@ -281,14 +288,14 @@ struct csr {
};
enum scb_status {
+ rus_no_res = 0x08,
rus_ready = 0x10,
rus_mask = 0x3C,
};
enum ru_state {
- RU_SUSPENDED = 0,
- RU_RUNNING = 1,
- RU_UNINITIALIZED = -1,
+ ru_stopped = 0,
+ ru_running = 1,
};
enum scb_stat_ack {
@@ -401,10 +408,16 @@ struct rfd {
u16 size;
};
+enum rx_flags {
+ rx_el = 0x01,
+ rx_s0 = 0x02,
+};
+
struct rx {
struct rx *next, *prev;
struct sk_buff *skb;
dma_addr_t dma_addr;
+ u8 flags;
};
#if defined(__BIG_ENDIAN_BITFIELD)
@@ -952,7 +965,7 @@ static void e100_get_defaults(struct nic
((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
/* Template for a freshly allocated RFD */
- nic->blank_rfd.command = cpu_to_le16(cb_el);
+ nic->blank_rfd.command = 0;
nic->blank_rfd.rbd = 0xFFFFFFFF;
nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
@@ -1753,18 +1766,48 @@ static int e100_alloc_cbs(struct nic *ni
return 0;
}
-static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
+static void e100_find_mark_el(struct nic *nic, struct rx *marked_rx, int is_running)
{
- if(!nic->rxs) return;
- if(RU_SUSPENDED != nic->ru_running) return;
+ struct rx *rx = nic->rx_to_use->prev->prev;
+ struct rfd *rfd;
+
+ if (marked_rx == rx)
+ return;
+
+ rfd = (struct rfd *) rx->skb->data;
+ rfd->command |= cpu_to_le16(cb_el);
+ rfd->size = 0;
+ pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
+ sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
+ rx->flags |= (rx_el | rx_s0);
+
+ if (!marked_rx)
+ return;
+
+ rfd = (struct rfd *) marked_rx->skb->data;
+ rfd->command &= ~cpu_to_le16(cb_el);
+ pci_dma_sync_single_for_device(nic->pdev, marked_rx->dma_addr,
+ sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
+
+ rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+ pci_dma_sync_single_for_device(nic->pdev, marked_rx->dma_addr,
+ sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
- /* handle init time starts */
- if(!rx) rx = nic->rxs;
+ if (is_running)
+ marked_rx->flags &= ~rx_el;
+ else
+ marked_rx->flags &= ~(rx_el | rx_s0);
+}
+
+static inline void e100_start_receiver(struct nic *nic)
+{
+ if(!nic->rxs) return;
+ if (ru_stopped != nic->ru_running) return;
/* (Re)start RU if suspended or idle and RFA is non-NULL */
- if(rx->skb) {
- e100_exec_cmd(nic, ruc_start, rx->dma_addr);
- nic->ru_running = RU_RUNNING;
+ if (nic->rx_to_clean->skb) {
+ e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
+ nic->ru_running = ru_running;
}
}
@@ -1793,8 +1836,6 @@ static int e100_rx_alloc_skb(struct nic
struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
put_unaligned(cpu_to_le32(rx->dma_addr),
(u32 *)&prev_rfd->link);
- wmb();
- prev_rfd->command &= ~cpu_to_le16(cb_el);
pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
sizeof(struct rfd), PCI_DMA_TODEVICE);
}
@@ -1808,6 +1849,7 @@ static int e100_rx_indicate(struct nic *
struct sk_buff *skb = rx->skb;
struct rfd *rfd = (struct rfd *)skb->data;
u16 rfd_status, actual_size;
+ u8 status;
if(unlikely(work_done && *work_done >= work_to_do))
return -EAGAIN;
@@ -1819,9 +1861,47 @@ static int e100_rx_indicate(struct nic *
DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
- /* If data isn't ready, nothing to indicate */
- if(unlikely(!(rfd_status & cb_complete)))
+ /*
+ * If data isn't ready, nothing to indicate
+ * If both the el and s0 rx flags are set, we have hit the marked
+ * buffer but we don't know if hardware has seen it so we check
+ * the status.
+ * If only the s0 flag is set, we check the next buffer.
+ * If it is complete, we know that hardware saw the rfd el bit
+ * get cleared but did not see the rfd size get set so it
+ * skipped this buffer. We just return 0 and look at the
+ * next buffer.
+ * If only the s0 flag is set but the next buffer is
+ * not complete, we cleared the el flag as hardware
+ * hit this buffer.
+ */
+ if (unlikely(!(rfd_status & cb_complete))) {
+ u8 maskedFlags = rx->flags & (rx_el | rx_s0);
+ if (maskedFlags == (rx_el | rx_s0)) {
+ status = readb(&nic->csr->scb.status);
+ if (status & rus_no_res)
+ nic->ru_running = ru_stopped;
+ } else if (maskedFlags == rx_s0) {
+ struct rx *next_rx = rx->next;
+ struct rfd *next_rfd = (struct rfd *)next_rx->skb->data;
+ pci_dma_sync_single_for_cpu(nic->pdev,
+ next_rx->dma_addr, sizeof(struct rfd),
+ PCI_DMA_FROMDEVICE);
+ if (next_rfd->status & cpu_to_le16(cb_complete)) {
+ pci_unmap_single(nic->pdev, rx->dma_addr,
+ RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
+ dev_kfree_skb_any(skb);
+ rx->skb = NULL;
+ rx->flags &= ~rx_s0;
+ return 0;
+ } else {
+ status = readb(&nic->csr->scb.status);
+ if (status & rus_no_res)
+ nic->ru_running = ru_stopped;
+ }
+ }
return -ENODATA;
+ }
/* Get actual data size */
actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
@@ -1832,9 +1912,15 @@ static int e100_rx_indicate(struct nic *
pci_unmap_single(nic->pdev, rx->dma_addr,
RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
- /* this allows for a fast restart without re-enabling interrupts */
- if(le16_to_cpu(rfd->command) & cb_el)
- nic->ru_running = RU_SUSPENDED;
+ /*
+ * This happens when hardward sees the rfd el flag set
+ * but then sees the rfd size set as well
+ */
+ if (le16_to_cpu(rfd->command) & cb_el) {
+ status = readb(&nic->csr->scb.status);
+ if (status & rus_no_res)
+ nic->ru_running = ru_stopped;
+ }
/* Pull off the RFD and put the actual data (minus eth hdr) */
skb_reserve(skb, sizeof(struct rfd));
@@ -1865,32 +1951,34 @@ static int e100_rx_indicate(struct nic *
static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
unsigned int work_to_do)
{
- struct rx *rx;
+ struct rx *rx, *marked_rx;
int restart_required = 0;
- struct rx *rx_to_start = NULL;
-
- /* are we already rnr? then pay attention!!! this ensures that
- * the state machine progression never allows a start with a
- * partially cleaned list, avoiding a race between hardware
- * and rx_to_clean when in NAPI mode */
- if(RU_SUSPENDED == nic->ru_running)
- restart_required = 1;
+ int err = 0;
/* Indicate newly arrived packets */
for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
- int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
- if(-EAGAIN == err) {
- /* hit quota so have more work to do, restart once
- * cleanup is complete */
- restart_required = 0;
+ err = e100_rx_indicate(nic, rx, work_done, work_to_do);
+ /* Hit quota or no more to clean */
+ if(-EAGAIN == err || -ENODATA == err)
break;
- } else if(-ENODATA == err)
- break; /* No more to clean */
}
- /* save our starting point as the place we'll restart the receiver */
- if(restart_required)
- rx_to_start = nic->rx_to_clean;
+ /*
+ * On EAGAIN, hit quota so have more work to do, restart once
+ * cleanup is complete.
+ * Else, are we already rnr? then pay attention!!! this ensures that
+ * the state machine progression never allows a start with a
+ * partially cleaned list, avoiding a race between hardware
+ * and rx_to_clean when in NAPI mode
+ */
+ if(-EAGAIN != err && ru_stopped == nic->ru_running)
+ restart_required = 1;
+
+ marked_rx = nic->rx_to_use->prev->prev;
+ if (!(marked_rx->flags & rx_el)) {
+ marked_rx = marked_rx->prev;
+ BUG_ON(!marked_rx->flags & rx_el);
+ }
/* Alloc new skbs to refill list */
for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
@@ -1898,10 +1986,12 @@ static void e100_rx_clean(struct nic *ni
break; /* Better luck next time (see watchdog) */
}
+ e100_find_mark_el(nic, marked_rx, !restart_required);
+
if(restart_required) {
// ack the rnr?
writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
- e100_start_receiver(nic, rx_to_start);
+ e100_start_receiver(nic);
if(work_done)
(*work_done)++;
}
@@ -1912,8 +2002,6 @@ static void e100_rx_clean_list(struct ni
struct rx *rx;
unsigned int i, count = nic->params.rfds.count;
- nic->ru_running = RU_UNINITIALIZED;
-
if(nic->rxs) {
for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
if(rx->skb) {
@@ -1935,7 +2023,6 @@ static int e100_rx_alloc_list(struct nic
unsigned int i, count = nic->params.rfds.count;
nic->rx_to_use = nic->rx_to_clean = NULL;
- nic->ru_running = RU_UNINITIALIZED;
if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
return -ENOMEM;
@@ -1950,7 +2037,9 @@ static int e100_rx_alloc_list(struct nic
}
nic->rx_to_use = nic->rx_to_clean = nic->rxs;
- nic->ru_running = RU_SUSPENDED;
+ nic->ru_running = ru_stopped;
+
+ e100_find_mark_el(nic, NULL, 0);
return 0;
}
@@ -1971,8 +2060,8 @@ static irqreturn_t e100_intr(int irq, vo
iowrite8(stat_ack, &nic->csr->scb.stat_ack);
/* We hit Receive No Resource (RNR); restart RU after cleaning */
- if(stat_ack & stat_ack_rnr)
- nic->ru_running = RU_SUSPENDED;
+ if (stat_ack & stat_ack_rnr)
+ nic->ru_running = ru_stopped;
if(likely(netif_rx_schedule_prep(netdev))) {
e100_disable_irq(nic);
@@ -2065,7 +2154,7 @@ static int e100_up(struct nic *nic)
if((err = e100_hw_init(nic)))
goto err_clean_cbs;
e100_set_multicast_list(nic->netdev);
- e100_start_receiver(nic, NULL);
+ e100_start_receiver(nic);
mod_timer(&nic->watchdog, jiffies);
if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
nic->netdev->name, nic->netdev)))
@@ -2146,7 +2235,7 @@ static int e100_loopback_test(struct nic
mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
BMCR_LOOPBACK);
- e100_start_receiver(nic, NULL);
+ e100_start_receiver(nic);
if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
err = -ENOMEM;
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-08-31 20:54 [PATCH] Fix e100 on systems that have cache incoherent DMA David Acker
@ 2007-09-04 17:02 ` Kok, Auke
2007-09-07 16:31 ` Kok, Auke
1 sibling, 0 replies; 28+ messages in thread
From: Kok, Auke @ 2007-09-04 17:02 UTC (permalink / raw)
To: David Acker
Cc: e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik
David Acker wrote:
> On the systems that have cache incoherent DMA, including ARM, there is a
> race condition between software allocating a new receive buffer and hardware
> writing into a buffer. The two race on touching the last Receive Frame
> Descriptor (RFD). It has its el-bit set and its next link equal to 0.
> When hardware encounters this buffer it attempts to write data to it and
> then update Status Word bits and Actual Count in the RFD. At the same time
> software may try to clear the el-bit and set the link address to a new buffer.
>
> Since the entire RFD is once cache-line, the two write operations can collide.
> This can lead to the receive unit stalling or interpreting random memory as
> its receive area.
>
> The fix is to set the el-bit on and the size to 0 on the next to last buffer
> in the chain. When the hardware encounters this buffer it stops and does not
> write to it at all. The hardware issues an RNR interrupt with the receive
> unit in the No Resources state. Software can write to the tail of the list
> because it knows hardware will stop on the previous descriptor that was
> marked as the end of list.
>
> Once it has a new next to last buffer prepared, it can clear the el-bit and
> set the size on the previous one. The race on this buffer is safe since
> the link already points to a valid next buffer and the software can handle
> the race setting the size (assuming aligned 16 bit writes are atomic with
> respect to the DMA read). If the hardware sees the el-bit cleared without
> the size set, it will move on to the next buffer and skip this one. If it
> sees the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.
>
> Flags are kept in the software descriptor to note if the el bit is set and if
> the size was 0. When software clears the RFD's el bit and set its size, it
> also clears the el flag but leaves the size was 0 bit set. This way software
> can identify them when the race may have occurred when cleaning the ring.
> On these descriptors, it looks ahead and if the next one is complete then
> hardware must have skipped the current one. Logic is added to prevent two
> packets in a row being marked while the receiver is running to avoid running
> in lockstep with the hardware and thereby limiting the required lookahead.
>
> This is a patch for 2.6.23-rc4.
>
> Signed-off-by: David Acker <dacker@roinet.com>
thanks David,
I'm going to try to give this some decent testing on x86 in the next two weeks,
I'll let everyone know how this is going and take a look at the patch a bit
later in -depth.
Auke
>
> ---
>
> --- linux-2.6.23-rc4/drivers/net/e100.c.orig 2007-08-30 13:32:10.000000000 -0400
> +++ linux-2.6.23-rc4/drivers/net/e100.c 2007-08-30 15:42:07.000000000 -0400
> @@ -106,6 +106,13 @@
> * the RFD, the RFD must be dma_sync'ed to maintain a consistent
> * view from software and hardware.
> *
> + * In order to keep updates to the RFD link field from colliding with
> + * hardware writes to mark packets complete, we use the feature that
> + * hardware will not write to a size 0 descriptor and mark the previous
> + * packet as end-of-list (EL). After updating the link, we remove EL
> + * and only then restore the size such that hardware may use the
> + * previous-to-end RFD.
> + *
> * Under typical operation, the receive unit (RU) is start once,
> * and the controller happily fills RFDs as frames arrive. If
> * replacement RFDs cannot be allocated, or the RU goes non-active,
> @@ -281,14 +288,14 @@ struct csr {
> };
>
> enum scb_status {
> + rus_no_res = 0x08,
> rus_ready = 0x10,
> rus_mask = 0x3C,
> };
>
> enum ru_state {
> - RU_SUSPENDED = 0,
> - RU_RUNNING = 1,
> - RU_UNINITIALIZED = -1,
> + ru_stopped = 0,
> + ru_running = 1,
> };
>
> enum scb_stat_ack {
> @@ -401,10 +408,16 @@ struct rfd {
> u16 size;
> };
>
> +enum rx_flags {
> + rx_el = 0x01,
> + rx_s0 = 0x02,
> +};
> +
> struct rx {
> struct rx *next, *prev;
> struct sk_buff *skb;
> dma_addr_t dma_addr;
> + u8 flags;
> };
>
> #if defined(__BIG_ENDIAN_BITFIELD)
> @@ -952,7 +965,7 @@ static void e100_get_defaults(struct nic
> ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
>
> /* Template for a freshly allocated RFD */
> - nic->blank_rfd.command = cpu_to_le16(cb_el);
> + nic->blank_rfd.command = 0;
> nic->blank_rfd.rbd = 0xFFFFFFFF;
> nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>
> @@ -1753,18 +1766,48 @@ static int e100_alloc_cbs(struct nic *ni
> return 0;
> }
>
> -static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
> +static void e100_find_mark_el(struct nic *nic, struct rx *marked_rx, int is_running)
> {
> - if(!nic->rxs) return;
> - if(RU_SUSPENDED != nic->ru_running) return;
> + struct rx *rx = nic->rx_to_use->prev->prev;
> + struct rfd *rfd;
> +
> + if (marked_rx == rx)
> + return;
> +
> + rfd = (struct rfd *) rx->skb->data;
> + rfd->command |= cpu_to_le16(cb_el);
> + rfd->size = 0;
> + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
> + sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
> + rx->flags |= (rx_el | rx_s0);
> +
> + if (!marked_rx)
> + return;
> +
> + rfd = (struct rfd *) marked_rx->skb->data;
> + rfd->command &= ~cpu_to_le16(cb_el);
> + pci_dma_sync_single_for_device(nic->pdev, marked_rx->dma_addr,
> + sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
> +
> + rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
> + pci_dma_sync_single_for_device(nic->pdev, marked_rx->dma_addr,
> + sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
>
> - /* handle init time starts */
> - if(!rx) rx = nic->rxs;
> + if (is_running)
> + marked_rx->flags &= ~rx_el;
> + else
> + marked_rx->flags &= ~(rx_el | rx_s0);
> +}
> +
> +static inline void e100_start_receiver(struct nic *nic)
> +{
> + if(!nic->rxs) return;
> + if (ru_stopped != nic->ru_running) return;
>
> /* (Re)start RU if suspended or idle and RFA is non-NULL */
> - if(rx->skb) {
> - e100_exec_cmd(nic, ruc_start, rx->dma_addr);
> - nic->ru_running = RU_RUNNING;
> + if (nic->rx_to_clean->skb) {
> + e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
> + nic->ru_running = ru_running;
> }
> }
>
> @@ -1793,8 +1836,6 @@ static int e100_rx_alloc_skb(struct nic
> struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
> put_unaligned(cpu_to_le32(rx->dma_addr),
> (u32 *)&prev_rfd->link);
> - wmb();
> - prev_rfd->command &= ~cpu_to_le16(cb_el);
> pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
> sizeof(struct rfd), PCI_DMA_TODEVICE);
> }
> @@ -1808,6 +1849,7 @@ static int e100_rx_indicate(struct nic *
> struct sk_buff *skb = rx->skb;
> struct rfd *rfd = (struct rfd *)skb->data;
> u16 rfd_status, actual_size;
> + u8 status;
>
> if(unlikely(work_done && *work_done >= work_to_do))
> return -EAGAIN;
> @@ -1819,9 +1861,47 @@ static int e100_rx_indicate(struct nic *
>
> DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
>
> - /* If data isn't ready, nothing to indicate */
> - if(unlikely(!(rfd_status & cb_complete)))
> + /*
> + * If data isn't ready, nothing to indicate
> + * If both the el and s0 rx flags are set, we have hit the marked
> + * buffer but we don't know if hardware has seen it so we check
> + * the status.
> + * If only the s0 flag is set, we check the next buffer.
> + * If it is complete, we know that hardware saw the rfd el bit
> + * get cleared but did not see the rfd size get set so it
> + * skipped this buffer. We just return 0 and look at the
> + * next buffer.
> + * If only the s0 flag is set but the next buffer is
> + * not complete, we cleared the el flag as hardware
> + * hit this buffer.
> + */
> + if (unlikely(!(rfd_status & cb_complete))) {
> + u8 maskedFlags = rx->flags & (rx_el | rx_s0);
> + if (maskedFlags == (rx_el | rx_s0)) {
> + status = readb(&nic->csr->scb.status);
> + if (status & rus_no_res)
> + nic->ru_running = ru_stopped;
> + } else if (maskedFlags == rx_s0) {
> + struct rx *next_rx = rx->next;
> + struct rfd *next_rfd = (struct rfd *)next_rx->skb->data;
> + pci_dma_sync_single_for_cpu(nic->pdev,
> + next_rx->dma_addr, sizeof(struct rfd),
> + PCI_DMA_FROMDEVICE);
> + if (next_rfd->status & cpu_to_le16(cb_complete)) {
> + pci_unmap_single(nic->pdev, rx->dma_addr,
> + RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
> + dev_kfree_skb_any(skb);
> + rx->skb = NULL;
> + rx->flags &= ~rx_s0;
> + return 0;
> + } else {
> + status = readb(&nic->csr->scb.status);
> + if (status & rus_no_res)
> + nic->ru_running = ru_stopped;
> + }
> + }
> return -ENODATA;
> + }
>
> /* Get actual data size */
> actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
> @@ -1832,9 +1912,15 @@ static int e100_rx_indicate(struct nic *
> pci_unmap_single(nic->pdev, rx->dma_addr,
> RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
>
> - /* this allows for a fast restart without re-enabling interrupts */
> - if(le16_to_cpu(rfd->command) & cb_el)
> - nic->ru_running = RU_SUSPENDED;
> + /*
> + * This happens when hardward sees the rfd el flag set
> + * but then sees the rfd size set as well
> + */
> + if (le16_to_cpu(rfd->command) & cb_el) {
> + status = readb(&nic->csr->scb.status);
> + if (status & rus_no_res)
> + nic->ru_running = ru_stopped;
> + }
>
> /* Pull off the RFD and put the actual data (minus eth hdr) */
> skb_reserve(skb, sizeof(struct rfd));
> @@ -1865,32 +1951,34 @@ static int e100_rx_indicate(struct nic *
> static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
> unsigned int work_to_do)
> {
> - struct rx *rx;
> + struct rx *rx, *marked_rx;
> int restart_required = 0;
> - struct rx *rx_to_start = NULL;
> -
> - /* are we already rnr? then pay attention!!! this ensures that
> - * the state machine progression never allows a start with a
> - * partially cleaned list, avoiding a race between hardware
> - * and rx_to_clean when in NAPI mode */
> - if(RU_SUSPENDED == nic->ru_running)
> - restart_required = 1;
> + int err = 0;
>
> /* Indicate newly arrived packets */
> for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
> - int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> - if(-EAGAIN == err) {
> - /* hit quota so have more work to do, restart once
> - * cleanup is complete */
> - restart_required = 0;
> + err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> + /* Hit quota or no more to clean */
> + if(-EAGAIN == err || -ENODATA == err)
> break;
> - } else if(-ENODATA == err)
> - break; /* No more to clean */
> }
>
> - /* save our starting point as the place we'll restart the receiver */
> - if(restart_required)
> - rx_to_start = nic->rx_to_clean;
> + /*
> + * On EAGAIN, hit quota so have more work to do, restart once
> + * cleanup is complete.
> + * Else, are we already rnr? then pay attention!!! this ensures that
> + * the state machine progression never allows a start with a
> + * partially cleaned list, avoiding a race between hardware
> + * and rx_to_clean when in NAPI mode
> + */
> + if(-EAGAIN != err && ru_stopped == nic->ru_running)
> + restart_required = 1;
> +
> + marked_rx = nic->rx_to_use->prev->prev;
> + if (!(marked_rx->flags & rx_el)) {
> + marked_rx = marked_rx->prev;
> + BUG_ON(!marked_rx->flags & rx_el);
> + }
>
> /* Alloc new skbs to refill list */
> for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
> @@ -1898,10 +1986,12 @@ static void e100_rx_clean(struct nic *ni
> break; /* Better luck next time (see watchdog) */
> }
>
> + e100_find_mark_el(nic, marked_rx, !restart_required);
> +
> if(restart_required) {
> // ack the rnr?
> writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
> - e100_start_receiver(nic, rx_to_start);
> + e100_start_receiver(nic);
> if(work_done)
> (*work_done)++;
> }
> @@ -1912,8 +2002,6 @@ static void e100_rx_clean_list(struct ni
> struct rx *rx;
> unsigned int i, count = nic->params.rfds.count;
>
> - nic->ru_running = RU_UNINITIALIZED;
> -
> if(nic->rxs) {
> for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
> if(rx->skb) {
> @@ -1935,7 +2023,6 @@ static int e100_rx_alloc_list(struct nic
> unsigned int i, count = nic->params.rfds.count;
>
> nic->rx_to_use = nic->rx_to_clean = NULL;
> - nic->ru_running = RU_UNINITIALIZED;
>
> if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
> return -ENOMEM;
> @@ -1950,7 +2037,9 @@ static int e100_rx_alloc_list(struct nic
> }
>
> nic->rx_to_use = nic->rx_to_clean = nic->rxs;
> - nic->ru_running = RU_SUSPENDED;
> + nic->ru_running = ru_stopped;
> +
> + e100_find_mark_el(nic, NULL, 0);
>
> return 0;
> }
> @@ -1971,8 +2060,8 @@ static irqreturn_t e100_intr(int irq, vo
> iowrite8(stat_ack, &nic->csr->scb.stat_ack);
>
> /* We hit Receive No Resource (RNR); restart RU after cleaning */
> - if(stat_ack & stat_ack_rnr)
> - nic->ru_running = RU_SUSPENDED;
> + if (stat_ack & stat_ack_rnr)
> + nic->ru_running = ru_stopped;
>
> if(likely(netif_rx_schedule_prep(netdev))) {
> e100_disable_irq(nic);
> @@ -2065,7 +2154,7 @@ static int e100_up(struct nic *nic)
> if((err = e100_hw_init(nic)))
> goto err_clean_cbs;
> e100_set_multicast_list(nic->netdev);
> - e100_start_receiver(nic, NULL);
> + e100_start_receiver(nic);
> mod_timer(&nic->watchdog, jiffies);
> if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
> nic->netdev->name, nic->netdev)))
> @@ -2146,7 +2235,7 @@ static int e100_loopback_test(struct nic
> mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
> BMCR_LOOPBACK);
>
> - e100_start_receiver(nic, NULL);
> + e100_start_receiver(nic);
>
> if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
> err = -ENOMEM;
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-08-31 20:54 [PATCH] Fix e100 on systems that have cache incoherent DMA David Acker
2007-09-04 17:02 ` Kok, Auke
@ 2007-09-07 16:31 ` Kok, Auke
2007-09-07 20:41 ` David Acker
1 sibling, 1 reply; 28+ messages in thread
From: Kok, Auke @ 2007-09-07 16:31 UTC (permalink / raw)
To: David Acker
Cc: John Ronciak, Jesse Brandeburg, Jeff Kirsher, Milton Miller,
Jeff Garzik, netdev, e1000-devel, Scott Feldman
David Acker wrote:
> On the systems that have cache incoherent DMA, including ARM, there is a
> race condition between software allocating a new receive buffer and hardware
> writing into a buffer. The two race on touching the last Receive Frame
> Descriptor (RFD). It has its el-bit set and its next link equal to 0.
> When hardware encounters this buffer it attempts to write data to it and
> then update Status Word bits and Actual Count in the RFD. At the same time
> software may try to clear the el-bit and set the link address to a new buffer.
>
> Since the entire RFD is once cache-line, the two write operations can collide.
> This can lead to the receive unit stalling or interpreting random memory as
> its receive area.
>
> The fix is to set the el-bit on and the size to 0 on the next to last buffer
> in the chain. When the hardware encounters this buffer it stops and does not
> write to it at all. The hardware issues an RNR interrupt with the receive
> unit in the No Resources state. Software can write to the tail of the list
> because it knows hardware will stop on the previous descriptor that was
> marked as the end of list.
>
> Once it has a new next to last buffer prepared, it can clear the el-bit and
> set the size on the previous one. The race on this buffer is safe since
> the link already points to a valid next buffer and the software can handle
> the race setting the size (assuming aligned 16 bit writes are atomic with
> respect to the DMA read). If the hardware sees the el-bit cleared without
> the size set, it will move on to the next buffer and skip this one. If it
> sees the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.
>
> Flags are kept in the software descriptor to note if the el bit is set and if
> the size was 0. When software clears the RFD's el bit and set its size, it
> also clears the el flag but leaves the size was 0 bit set. This way software
> can identify them when the race may have occurred when cleaning the ring.
> On these descriptors, it looks ahead and if the next one is complete then
> hardware must have skipped the current one. Logic is added to prevent two
> packets in a row being marked while the receiver is running to avoid running
> in lockstep with the hardware and thereby limiting the required lookahead.
>
> This is a patch for 2.6.23-rc4.
>
> Signed-off-by: David Acker <dacker@roinet.com>
first impressions are not good: pings are erratic and shoot up to 3 seconds. In
an overnight stress test, the receive unit went offline and never came back up
(TX still working).
it sounds like something in the logic is suspending the ru too much, but I
haven't had time to look deeply into the code yet.
Auke
>
> ---
>
> --- linux-2.6.23-rc4/drivers/net/e100.c.orig 2007-08-30 13:32:10.000000000 -0400
> +++ linux-2.6.23-rc4/drivers/net/e100.c 2007-08-30 15:42:07.000000000 -0400
> @@ -106,6 +106,13 @@
> * the RFD, the RFD must be dma_sync'ed to maintain a consistent
> * view from software and hardware.
> *
> + * In order to keep updates to the RFD link field from colliding with
> + * hardware writes to mark packets complete, we use the feature that
> + * hardware will not write to a size 0 descriptor and mark the previous
> + * packet as end-of-list (EL). After updating the link, we remove EL
> + * and only then restore the size such that hardware may use the
> + * previous-to-end RFD.
> + *
> * Under typical operation, the receive unit (RU) is start once,
> * and the controller happily fills RFDs as frames arrive. If
> * replacement RFDs cannot be allocated, or the RU goes non-active,
> @@ -281,14 +288,14 @@ struct csr {
> };
>
> enum scb_status {
> + rus_no_res = 0x08,
> rus_ready = 0x10,
> rus_mask = 0x3C,
> };
>
> enum ru_state {
> - RU_SUSPENDED = 0,
> - RU_RUNNING = 1,
> - RU_UNINITIALIZED = -1,
> + ru_stopped = 0,
> + ru_running = 1,
> };
>
> enum scb_stat_ack {
> @@ -401,10 +408,16 @@ struct rfd {
> u16 size;
> };
>
> +enum rx_flags {
> + rx_el = 0x01,
> + rx_s0 = 0x02,
> +};
> +
> struct rx {
> struct rx *next, *prev;
> struct sk_buff *skb;
> dma_addr_t dma_addr;
> + u8 flags;
> };
>
> #if defined(__BIG_ENDIAN_BITFIELD)
> @@ -952,7 +965,7 @@ static void e100_get_defaults(struct nic
> ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
>
> /* Template for a freshly allocated RFD */
> - nic->blank_rfd.command = cpu_to_le16(cb_el);
> + nic->blank_rfd.command = 0;
> nic->blank_rfd.rbd = 0xFFFFFFFF;
> nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>
> @@ -1753,18 +1766,48 @@ static int e100_alloc_cbs(struct nic *ni
> return 0;
> }
>
> -static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
> +static void e100_find_mark_el(struct nic *nic, struct rx *marked_rx, int is_running)
> {
> - if(!nic->rxs) return;
> - if(RU_SUSPENDED != nic->ru_running) return;
> + struct rx *rx = nic->rx_to_use->prev->prev;
> + struct rfd *rfd;
> +
> + if (marked_rx == rx)
> + return;
> +
> + rfd = (struct rfd *) rx->skb->data;
> + rfd->command |= cpu_to_le16(cb_el);
> + rfd->size = 0;
> + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
> + sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
> + rx->flags |= (rx_el | rx_s0);
> +
> + if (!marked_rx)
> + return;
> +
> + rfd = (struct rfd *) marked_rx->skb->data;
> + rfd->command &= ~cpu_to_le16(cb_el);
> + pci_dma_sync_single_for_device(nic->pdev, marked_rx->dma_addr,
> + sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
> +
> + rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
> + pci_dma_sync_single_for_device(nic->pdev, marked_rx->dma_addr,
> + sizeof(struct rfd), PCI_DMA_BIDIRECTIONAL);
>
> - /* handle init time starts */
> - if(!rx) rx = nic->rxs;
> + if (is_running)
> + marked_rx->flags &= ~rx_el;
> + else
> + marked_rx->flags &= ~(rx_el | rx_s0);
> +}
> +
> +static inline void e100_start_receiver(struct nic *nic)
> +{
> + if(!nic->rxs) return;
> + if (ru_stopped != nic->ru_running) return;
>
> /* (Re)start RU if suspended or idle and RFA is non-NULL */
> - if(rx->skb) {
> - e100_exec_cmd(nic, ruc_start, rx->dma_addr);
> - nic->ru_running = RU_RUNNING;
> + if (nic->rx_to_clean->skb) {
> + e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
> + nic->ru_running = ru_running;
> }
> }
>
> @@ -1793,8 +1836,6 @@ static int e100_rx_alloc_skb(struct nic
> struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
> put_unaligned(cpu_to_le32(rx->dma_addr),
> (u32 *)&prev_rfd->link);
> - wmb();
> - prev_rfd->command &= ~cpu_to_le16(cb_el);
> pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
> sizeof(struct rfd), PCI_DMA_TODEVICE);
> }
> @@ -1808,6 +1849,7 @@ static int e100_rx_indicate(struct nic *
> struct sk_buff *skb = rx->skb;
> struct rfd *rfd = (struct rfd *)skb->data;
> u16 rfd_status, actual_size;
> + u8 status;
>
> if(unlikely(work_done && *work_done >= work_to_do))
> return -EAGAIN;
> @@ -1819,9 +1861,47 @@ static int e100_rx_indicate(struct nic *
>
> DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
>
> - /* If data isn't ready, nothing to indicate */
> - if(unlikely(!(rfd_status & cb_complete)))
> + /*
> + * If data isn't ready, nothing to indicate
> + * If both the el and s0 rx flags are set, we have hit the marked
> + * buffer but we don't know if hardware has seen it so we check
> + * the status.
> + * If only the s0 flag is set, we check the next buffer.
> + * If it is complete, we know that hardware saw the rfd el bit
> + * get cleared but did not see the rfd size get set so it
> + * skipped this buffer. We just return 0 and look at the
> + * next buffer.
> + * If only the s0 flag is set but the next buffer is
> + * not complete, we cleared the el flag as hardware
> + * hit this buffer.
> + */
> + if (unlikely(!(rfd_status & cb_complete))) {
> + u8 maskedFlags = rx->flags & (rx_el | rx_s0);
> + if (maskedFlags == (rx_el | rx_s0)) {
> + status = readb(&nic->csr->scb.status);
> + if (status & rus_no_res)
> + nic->ru_running = ru_stopped;
> + } else if (maskedFlags == rx_s0) {
> + struct rx *next_rx = rx->next;
> + struct rfd *next_rfd = (struct rfd *)next_rx->skb->data;
> + pci_dma_sync_single_for_cpu(nic->pdev,
> + next_rx->dma_addr, sizeof(struct rfd),
> + PCI_DMA_FROMDEVICE);
> + if (next_rfd->status & cpu_to_le16(cb_complete)) {
> + pci_unmap_single(nic->pdev, rx->dma_addr,
> + RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
> + dev_kfree_skb_any(skb);
> + rx->skb = NULL;
> + rx->flags &= ~rx_s0;
> + return 0;
> + } else {
> + status = readb(&nic->csr->scb.status);
> + if (status & rus_no_res)
> + nic->ru_running = ru_stopped;
> + }
> + }
> return -ENODATA;
> + }
>
> /* Get actual data size */
> actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
> @@ -1832,9 +1912,15 @@ static int e100_rx_indicate(struct nic *
> pci_unmap_single(nic->pdev, rx->dma_addr,
> RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
>
> - /* this allows for a fast restart without re-enabling interrupts */
> - if(le16_to_cpu(rfd->command) & cb_el)
> - nic->ru_running = RU_SUSPENDED;
> + /*
> + * This happens when hardward sees the rfd el flag set
> + * but then sees the rfd size set as well
> + */
> + if (le16_to_cpu(rfd->command) & cb_el) {
> + status = readb(&nic->csr->scb.status);
> + if (status & rus_no_res)
> + nic->ru_running = ru_stopped;
> + }
>
> /* Pull off the RFD and put the actual data (minus eth hdr) */
> skb_reserve(skb, sizeof(struct rfd));
> @@ -1865,32 +1951,34 @@ static int e100_rx_indicate(struct nic *
> static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
> unsigned int work_to_do)
> {
> - struct rx *rx;
> + struct rx *rx, *marked_rx;
> int restart_required = 0;
> - struct rx *rx_to_start = NULL;
> -
> - /* are we already rnr? then pay attention!!! this ensures that
> - * the state machine progression never allows a start with a
> - * partially cleaned list, avoiding a race between hardware
> - * and rx_to_clean when in NAPI mode */
> - if(RU_SUSPENDED == nic->ru_running)
> - restart_required = 1;
> + int err = 0;
>
> /* Indicate newly arrived packets */
> for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
> - int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> - if(-EAGAIN == err) {
> - /* hit quota so have more work to do, restart once
> - * cleanup is complete */
> - restart_required = 0;
> + err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> + /* Hit quota or no more to clean */
> + if(-EAGAIN == err || -ENODATA == err)
> break;
> - } else if(-ENODATA == err)
> - break; /* No more to clean */
> }
>
> - /* save our starting point as the place we'll restart the receiver */
> - if(restart_required)
> - rx_to_start = nic->rx_to_clean;
> + /*
> + * On EAGAIN, hit quota so have more work to do, restart once
> + * cleanup is complete.
> + * Else, are we already rnr? then pay attention!!! this ensures that
> + * the state machine progression never allows a start with a
> + * partially cleaned list, avoiding a race between hardware
> + * and rx_to_clean when in NAPI mode
> + */
> + if(-EAGAIN != err && ru_stopped == nic->ru_running)
> + restart_required = 1;
> +
> + marked_rx = nic->rx_to_use->prev->prev;
> + if (!(marked_rx->flags & rx_el)) {
> + marked_rx = marked_rx->prev;
> + BUG_ON(!marked_rx->flags & rx_el);
> + }
>
> /* Alloc new skbs to refill list */
> for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
> @@ -1898,10 +1986,12 @@ static void e100_rx_clean(struct nic *ni
> break; /* Better luck next time (see watchdog) */
> }
>
> + e100_find_mark_el(nic, marked_rx, !restart_required);
> +
> if(restart_required) {
> // ack the rnr?
> writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
> - e100_start_receiver(nic, rx_to_start);
> + e100_start_receiver(nic);
> if(work_done)
> (*work_done)++;
> }
> @@ -1912,8 +2002,6 @@ static void e100_rx_clean_list(struct ni
> struct rx *rx;
> unsigned int i, count = nic->params.rfds.count;
>
> - nic->ru_running = RU_UNINITIALIZED;
> -
> if(nic->rxs) {
> for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
> if(rx->skb) {
> @@ -1935,7 +2023,6 @@ static int e100_rx_alloc_list(struct nic
> unsigned int i, count = nic->params.rfds.count;
>
> nic->rx_to_use = nic->rx_to_clean = NULL;
> - nic->ru_running = RU_UNINITIALIZED;
>
> if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
> return -ENOMEM;
> @@ -1950,7 +2037,9 @@ static int e100_rx_alloc_list(struct nic
> }
>
> nic->rx_to_use = nic->rx_to_clean = nic->rxs;
> - nic->ru_running = RU_SUSPENDED;
> + nic->ru_running = ru_stopped;
> +
> + e100_find_mark_el(nic, NULL, 0);
>
> return 0;
> }
> @@ -1971,8 +2060,8 @@ static irqreturn_t e100_intr(int irq, vo
> iowrite8(stat_ack, &nic->csr->scb.stat_ack);
>
> /* We hit Receive No Resource (RNR); restart RU after cleaning */
> - if(stat_ack & stat_ack_rnr)
> - nic->ru_running = RU_SUSPENDED;
> + if (stat_ack & stat_ack_rnr)
> + nic->ru_running = ru_stopped;
>
> if(likely(netif_rx_schedule_prep(netdev))) {
> e100_disable_irq(nic);
> @@ -2065,7 +2154,7 @@ static int e100_up(struct nic *nic)
> if((err = e100_hw_init(nic)))
> goto err_clean_cbs;
> e100_set_multicast_list(nic->netdev);
> - e100_start_receiver(nic, NULL);
> + e100_start_receiver(nic);
> mod_timer(&nic->watchdog, jiffies);
> if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
> nic->netdev->name, nic->netdev)))
> @@ -2146,7 +2235,7 @@ static int e100_loopback_test(struct nic
> mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
> BMCR_LOOPBACK);
>
> - e100_start_receiver(nic, NULL);
> + e100_start_receiver(nic);
>
> if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
> err = -ENOMEM;
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-09-07 16:31 ` Kok, Auke
@ 2007-09-07 20:41 ` David Acker
2007-09-07 21:03 ` Kok, Auke
2007-09-07 23:24 ` Jeff Garzik
0 siblings, 2 replies; 28+ messages in thread
From: David Acker @ 2007-09-07 20:41 UTC (permalink / raw)
To: Kok, Auke
Cc: John Ronciak, Jesse Brandeburg, Jeff Kirsher, Milton Miller,
Jeff Garzik, netdev, e1000-devel, Scott Feldman
Kok, Auke wrote:
> first impressions are not good: pings are erratic and shoot up to 3
> seconds. In an overnight stress test, the receive unit went offline and
> never came back up (TX still working).
>
> it sounds like something in the logic is suspending the ru too much, but
> I haven't had time to look deeply into the code yet.
I don't have an e100 enabled x86 box handy but I will look into getting one setup.
I just applied this patch to my PXA255 based system http://www.compulab.co.il/x255/html/x255-cm-datasheet.htm .
It is running 2.6.18.4 plus compulab patches plus some hostap patches plus the e100 patch. I get:
pings going from the embedded system to a desktop machine.
100 packets transmitted, 100 received, 0% packet loss, time 98996ms
rtt min/avg/max/mdev = 0.239/0.728/1.512/0.571 ms
Pings going the from the desktop machine to the embedded system
100 packets transmitted, 100 received, 0% packet loss, time 99217ms
rtt min/avg/max/mdev = 0.206/0.876/1.473/0.575 ms
iperf tcp from embedded to desktop gets:
[ 5] 0.0-100.0 sec 1007 MBytes 84.4 Mbits/sec
iperf udp from the embedded to the desktop gets (embedded told to send at 100mbps):
[ 5] Server Report:
[ 5] 0.0-100.0 sec 947 MBytes 79.4 Mbits/sec 0.068 ms 16/675645 (0.0024%)
[ 5] 0.0-100.0 sec 1 datagrams received out-of-order
iperf tcp from the desktop to the embedded gets:
[ 6] 0.0-100.0 sec 1.01 GBytes 86.4 Mbits/sec
iperf udp from the desktop to the embedded gets the following when the desktop sent at 100 mbps
[ 5] 0.0-100.0 sec 964 MBytes 80.8 Mbits/sec 0.359 ms 126467/813760 (16%)
[ 5] 0.0-100.0 sec 1 datagrams received out-of-order
Boot messages for my e100 are:
e100: Intel(R) PRO/100 Network Driver, 3.5.10-k2-NAPI
e100: Copyright(c) 1999-2005 Intel Corporation
PCI: enabling device 0000:00:09.0 (0000 -> 0003)
PCI: Setting latency timer of device 0000:00:09.0 to 64
e100: eth0: e100_probe: addr 0x10131000, irq 111, MAC addr 00:09:30:FF:F2:F6
cat /sys/bus/pci/drivers/e100/0000\:00\:09.0/{device,vendor,subsystem_device,subsystem_vendor}
0x1209
0x8086
0x0000
0x0000
It's on its own interrupt line:
cm-debian:~# cat /proc/interrupts |grep eth0
111: 402428 - eth0
lspci shows:
00:09.0 Ethernet controller: Intel Corporation 8255xER/82551IT Fast Ethernet Controller (rev 09)
Let me know if there is any other information I can provide you. I will look through the code to see what could be
going on with your machine. I will also look into reproducing these results with a newer kernel. This may be tricky
since compulab's patches are pretty stale and don't always apply easily.
-Ack
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-09-07 20:41 ` David Acker
@ 2007-09-07 21:03 ` Kok, Auke
2007-09-07 21:18 ` Kok, Auke
2007-09-07 23:24 ` Jeff Garzik
1 sibling, 1 reply; 28+ messages in thread
From: Kok, Auke @ 2007-09-07 21:03 UTC (permalink / raw)
To: David Acker
Cc: Kok, Auke, e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik
David Acker wrote:
> Kok, Auke wrote:
>> first impressions are not good: pings are erratic and shoot up to 3
>> seconds. In an overnight stress test, the receive unit went offline and
>> never came back up (TX still working).
>>
>> it sounds like something in the logic is suspending the ru too much, but
>> I haven't had time to look deeply into the code yet.
>
> I don't have an e100 enabled x86 box handy but I will look into getting one setup.
>
> I just applied this patch to my PXA255 based system http://www.compulab.co.il/x255/html/x255-cm-datasheet.htm .
> It is running 2.6.18.4 plus compulab patches plus some hostap patches plus the e100 patch. I get:
>
> pings going from the embedded system to a desktop machine.
> 100 packets transmitted, 100 received, 0% packet loss, time 98996ms
> rtt min/avg/max/mdev = 0.239/0.728/1.512/0.571 ms
>
> Pings going the from the desktop machine to the embedded system
> 100 packets transmitted, 100 received, 0% packet loss, time 99217ms
> rtt min/avg/max/mdev = 0.206/0.876/1.473/0.575 ms
ok, I just got a note from our lab saying that that particular system has the
freak ping times even without your patch applied 8)
ignoring the ping issue, we still have the ru offline, but that could have
possibly been caused by whatever is causing this ping issue... More testing is
needed, and I'll try to find a system without the ping issue here first.
Auke
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-09-07 21:03 ` Kok, Auke
@ 2007-09-07 21:18 ` Kok, Auke
0 siblings, 0 replies; 28+ messages in thread
From: Kok, Auke @ 2007-09-07 21:18 UTC (permalink / raw)
To: Kok, Auke
Cc: David Acker, John Ronciak, Jesse Brandeburg, Jeff Kirsher,
Milton Miller, Jeff Garzik, netdev, e1000-devel, Scott Feldman
Kok, Auke wrote:
> David Acker wrote:
>> Kok, Auke wrote:
>>> first impressions are not good: pings are erratic and shoot up to 3
>>> seconds. In an overnight stress test, the receive unit went offline and
>>> never came back up (TX still working).
>>>
>>> it sounds like something in the logic is suspending the ru too much, but
>>> I haven't had time to look deeply into the code yet.
>> I don't have an e100 enabled x86 box handy but I will look into getting one setup.
>>
>> I just applied this patch to my PXA255 based system http://www.compulab.co.il/x255/html/x255-cm-datasheet.htm .
>> It is running 2.6.18.4 plus compulab patches plus some hostap patches plus the e100 patch. I get:
>>
>> pings going from the embedded system to a desktop machine.
>> 100 packets transmitted, 100 received, 0% packet loss, time 98996ms
>> rtt min/avg/max/mdev = 0.239/0.728/1.512/0.571 ms
>>
>> Pings going the from the desktop machine to the embedded system
>> 100 packets transmitted, 100 received, 0% packet loss, time 99217ms
>> rtt min/avg/max/mdev = 0.206/0.876/1.473/0.575 ms
>
> ok, I just got a note from our lab saying that that particular system has the
> freak ping times even without your patch applied 8)
>
> ignoring the ping issue, we still have the ru offline, but that could have
> possibly been caused by whatever is causing this ping issue... More testing is
> needed, and I'll try to find a system without the ping issue here first.
update: Emil reports that the unit with the RU hang did not have bad ping times
to begin with, pointing to a problem with the patch for sure now...
Auke
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-09-07 20:41 ` David Acker
2007-09-07 21:03 ` Kok, Auke
@ 2007-09-07 23:24 ` Jeff Garzik
2007-09-11 20:54 ` David Acker
1 sibling, 1 reply; 28+ messages in thread
From: Jeff Garzik @ 2007-09-07 23:24 UTC (permalink / raw)
To: David Acker
Cc: Kok, Auke, John Ronciak, Jesse Brandeburg, Jeff Kirsher,
Milton Miller, netdev, e1000-devel, Scott Feldman
David Acker wrote:
> Let me know if there is any other information I can provide you. I will
> look through the code to see what could be going on with your machine.
> I will also look into reproducing these results with a newer kernel.
> This may be tricky since compulab's patches are pretty stale and don't
> always apply easily.
pktgen outputs for the various cases modified/unmodified[/others?] would
be nice, if you have a spot of time.
Jeff
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-09-07 23:24 ` Jeff Garzik
@ 2007-09-11 20:54 ` David Acker
2007-09-12 11:30 ` James Chapman
0 siblings, 1 reply; 28+ messages in thread
From: David Acker @ 2007-09-11 20:54 UTC (permalink / raw)
To: Jeff Garzik
Cc: Kok, Auke, John Ronciak, Jesse Brandeburg, Jeff Kirsher,
Milton Miller, netdev, e1000-devel, Scott Feldman
Jeff Garzik wrote:
> David Acker wrote:
>> Let me know if there is any other information I can provide you. I
>> will look through the code to see what could be going on with your
>> machine. I will also look into reproducing these results with a newer
>> kernel. This may be tricky since compulab's patches are pretty stale
>> and don't always apply easily.
>
>
> pktgen outputs for the various cases modified/unmodified[/others?] would
> be nice, if you have a spot of time.
>
> Jeff
I am not familiar with pktgen but I seem to have it working for a simple test.
I edited the 1-1 example from ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/ . The results with
and without the patch are below. Let me know if you want any other tests run. I obtained a PCI to miniPCI adapter that
will let me test my miniPCI based e100s on my main dev box. I will work on reproducing Intel's results tomorrow as time
permits. I am also still trying to get time to work on a recent kernel merge. Not a lot changed in the 100 since
2.6.18.4 except a flip from readX/writeX to ioreadX/iowriteX.
-Ack
cm-debian:/tmp# ./pktgen.conf-1-1
Removing all devices
Adding eth0
Setting max_before_softirq 10000
Configuring /proc/net/pktgen/eth0
Running... ctrl^C to stop
Done
Here are the results on 2.6.18.4 with the patch I submitted on my embedded system.
cm-debian:/tmp# cat /proc/net/pktgen/eth0
Params: count 10000000 min_pkt_size: 60 max_pkt_size: 60
frags: 0 delay: 0 clone_skb: 1000000 ifname: eth0
flows: 0 flowlen: 0
dst_min: 192.168.1.40 dst_max:
src_min: src_max:
src_mac: 00:09:30:FF:F2:F6 dst_mac: 00:19:B9:0B:45:8E
udp_src_min: 9 udp_src_max: 9 udp_dst_min: 9 udp_dst_max: 9
src_mac_count: 0 dst_mac_count: 0
Flags:
Current:
pkts-sofar: 10000000 errors: 0
started: 14852031282115us stopped: 14852181463422us idle: 918us
seq_num: 10000011 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
cur_saddr: 0x2901a8c0 cur_daddr: 0x2801a8c0
cur_udp_dst: 9 cur_udp_src: 9
flows: 0
Result: OK: 150181307(c150180389+d918) usec, 10000000 (60byte,0frags)
66586pps 31Mb/sec (31961280bps) errors: 0
and here are the results without the patch:
cm-debian:/tmp# cat /proc/net/pktgen/eth0
Params: count 10000000 min_pkt_size: 60 max_pkt_size: 60
frags: 0 delay: 0 clone_skb: 1000000 ifname: eth0
flows: 0 flowlen: 0
dst_min: 192.168.1.40 dst_max:
src_min: src_max:
src_mac: 00:09:30:FF:F2:F6 dst_mac: 00:19:B9:0B:45:8E
udp_src_min: 9 udp_src_max: 9 udp_dst_min: 9 udp_dst_max: 9
src_mac_count: 0 dst_mac_count: 0
Flags:
Current:
pkts-sofar: 10000000 errors: 0
started: 14864204332576us stopped: 14864355451225us idle: 1431us
seq_num: 10000011 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
cur_saddr: 0x2901a8c0 cur_daddr: 0x2801a8c0
cur_udp_dst: 9 cur_udp_src: 9
flows: 0
Result: OK: 151118649(c151117218+d1431) usec, 10000000 (60byte,0frags)
66173pps 31Mb/sec (31763040bps) errors: 0
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-09-11 20:54 ` David Acker
@ 2007-09-12 11:30 ` James Chapman
2007-09-12 20:11 ` David Acker
0 siblings, 1 reply; 28+ messages in thread
From: James Chapman @ 2007-09-12 11:30 UTC (permalink / raw)
To: David Acker
Cc: Jeff Garzik, Kok, Auke, John Ronciak, Jesse Brandeburg,
Jeff Kirsher, Milton Miller, netdev, e1000-devel, Scott Feldman
David Acker wrote:
> Jeff Garzik wrote:
>> David Acker wrote:
>>> Let me know if there is any other information I can provide you. I
>>> will look through the code to see what could be going on with your
>>> machine. I will also look into reproducing these results with a
>>> newer kernel. This may be tricky since compulab's patches are pretty
>>> stale and don't always apply easily.
>>
>>
>> pktgen outputs for the various cases modified/unmodified[/others?]
>> would be nice, if you have a spot of time.
>>
>> Jeff
>
> I am not familiar with pktgen but I seem to have it working for a simple
> test.
> I edited the 1-1 example from
> ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/ .
> The results with and without the patch are below.
It looks like you ran pktgen on the embedded system, which exercised
only the transmit path. Auke indicated that the lockup was in the RU.
Have you run pktgen on a test system to fire packets at the embedded
system at max rate? Also test what happens when you fire packets in both
directions simultaneously.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-09-12 11:30 ` James Chapman
@ 2007-09-12 20:11 ` David Acker
0 siblings, 0 replies; 28+ messages in thread
From: David Acker @ 2007-09-12 20:11 UTC (permalink / raw)
To: James Chapman
Cc: Jeff Garzik, Kok, Auke, John Ronciak, Jesse Brandeburg,
Jeff Kirsher, Milton Miller, netdev, e1000-devel, Scott Feldman
James Chapman wrote:
> David Acker wrote:
>> Jeff Garzik wrote:
>>> pktgen outputs for the various cases modified/unmodified[/others?]
>>> would be nice, if you have a spot of time.
>>>
>>> Jeff
>>
>> I am not familiar with pktgen but I seem to have it working for a
>> simple test.
>> I edited the 1-1 example from
>> ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
>> . The results with and without the patch are below.
>
> It looks like you ran pktgen on the embedded system, which exercised
> only the transmit path. Auke indicated that the lockup was in the RU.
> Have you run pktgen on a test system to fire packets at the embedded
> system at max rate? Also test what happens when you fire packets in both
> directions simultaneously.
>
It appears that running pktgen between my pc and the embedded device at the same time eventually hangs the RU. I will
be looking into this failure in detail and report back when I know more. Thanks for bearing with me on this!
-Ack
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] Fix e100 on systems that have cache incoherent DMA
@ 2007-11-02 13:27 David Acker
2007-11-02 16:05 ` Kok, Auke
2007-11-06 17:01 ` Kok, Auke
0 siblings, 2 replies; 28+ messages in thread
From: David Acker @ 2007-11-02 13:27 UTC (permalink / raw)
To: Auke Kok
Cc: e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
Scott Feldman, John Ronciak, Jeff Kirsher, Jeff Garzik
On the systems that have cache incoherent DMA, including ARM, there is a
race condition between software allocating a new receive buffer and hardware
writing into a buffer. The two race on touching the last Receive Frame
Descriptor (RFD). It has its el-bit set and its next link equal to 0.
When hardware encounters this buffer it attempts to write data to it and
then update Status Word bits and Actual Count in the RFD. At the same time
software may try to clear the el-bit and set the link address to a new buffer.
Since the entire RFD is once cache-line, the two write operations can collide.
This can lead to the receive unit stalling or interpreting random memory as
its receive area.
The fix is to set the el-bit on and the size to 0 on the next to last buffer
in the chain. When the hardware encounters this buffer it stops and does not
write to it at all. The hardware issues an RNR interrupt with the receive
unit in the No Resources state. Software can write to the tail of the list
because it knows hardware will stop on the previous descriptor that was
marked as the end of list.
Once it has a new next to last buffer prepared, it can clear the el-bit and
set the size on the previous one. The race on this buffer is safe since
the link already points to a valid next buffer and the software can handle
the race setting the size (assuming aligned 16 bit writes are atomic with
respect to the DMA read). If the hardware sees the el-bit cleared without
the size set, it will move on to the next buffer and skip this one. If it
sees the size set but the el-bit still set, it will complete that buffer
and then RNR interrupt and wait.
This is a patch for 2.6.24-rc1.
Signed-off-by: David Acker <dacker@roinet.com>
---
This version is based on the simpler patch I did in May. The algorithm I tried
after that never worked correctly under load. It would hang the RU and the
transmit unit sometimes and if the card was restarted it would often crash the
system with memory corruption. This patch was tested on my embedded system
using pktgen. I had it sending while a PC sent at it. I also ran it as
wireless access point with a 12-hour bidirectional 20 mbps UDP going between an
ethernet host on the e100 and a wireless client.
--- linux-2.6.24-rc1/drivers/net/e100.c.orig 2007-11-01 11:42:35.000000000 -0400
+++ linux-2.6.24-rc1/drivers/net/e100.c 2007-11-02 09:09:47.000000000 -0400
@@ -106,6 +106,13 @@
* the RFD, the RFD must be dma_sync'ed to maintain a consistent
* view from software and hardware.
*
+ * In order to keep updates to the RFD link field from colliding with
+ * hardware writes to mark packets complete, we use the feature that
+ * hardware will not write to a size 0 descriptor and mark the previous
+ * packet as end-of-list (EL). After updating the link, we remove EL
+ * and only then restore the size such that hardware may use the
+ * previous-to-end RFD.
+ *
* Under typical operation, the receive unit (RU) is start once,
* and the controller happily fills RFDs as frames arrive. If
* replacement RFDs cannot be allocated, or the RU goes non-active,
@@ -281,14 +288,15 @@ struct csr {
};
enum scb_status {
+ rus_no_res = 0x08,
rus_ready = 0x10,
rus_mask = 0x3C,
};
enum ru_state {
- RU_SUSPENDED = 0,
- RU_RUNNING = 1,
- RU_UNINITIALIZED = -1,
+ ru_stopped = 0,
+ ru_running = 1,
+ ru_uninitialized = -1,
};
enum scb_stat_ack {
@@ -952,7 +960,7 @@ static void e100_get_defaults(struct nic
((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
/* Template for a freshly allocated RFD */
- nic->blank_rfd.command = cpu_to_le16(cb_el);
+ nic->blank_rfd.command = 0;
nic->blank_rfd.rbd = 0xFFFFFFFF;
nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
@@ -1759,7 +1767,7 @@ static int e100_alloc_cbs(struct nic *ni
static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
{
if(!nic->rxs) return;
- if(RU_SUSPENDED != nic->ru_running) return;
+ if (ru_stopped != nic->ru_running) return;
/* handle init time starts */
if(!rx) rx = nic->rxs;
@@ -1767,7 +1775,7 @@ static inline void e100_start_receiver(s
/* (Re)start RU if suspended or idle and RFA is non-NULL */
if(rx->skb) {
e100_exec_cmd(nic, ruc_start, rx->dma_addr);
- nic->ru_running = RU_RUNNING;
+ nic->ru_running = ru_running;
}
}
@@ -1791,15 +1799,12 @@ static int e100_rx_alloc_skb(struct nic
}
/* Link the RFD to end of RFA by linking previous RFD to
- * this one, and clearing EL bit of previous. */
+ * this one. We are safe to touch the previous RFD because
+ * it is protected by the before last buffer's el bit being set */
if(rx->prev->skb) {
struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
put_unaligned(cpu_to_le32(rx->dma_addr),
(u32 *)&prev_rfd->link);
- wmb();
- prev_rfd->command &= ~cpu_to_le16(cb_el);
- pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
- sizeof(struct rfd), PCI_DMA_TODEVICE);
}
return 0;
@@ -1824,8 +1829,20 @@ static int e100_rx_indicate(struct nic *
DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
/* If data isn't ready, nothing to indicate */
- if(unlikely(!(rfd_status & cb_complete)))
+ if (unlikely(!(rfd_status & cb_complete))) {
+ /* If the next buffer has the el bit, but we think the receiver
+ * is still running, check to see if it really stopped while
+ * we had interrupts off.
+ * This allows for a fast restart without re-enabling
+ * interrupts */
+ if ((le16_to_cpu(rfd->command) & cb_el) &&
+ (ru_running == nic->ru_running)) {
+
+ if (readb(&nic->csr->scb.status) & rus_no_res)
+ nic->ru_running = ru_stopped;
+ }
return -ENODATA;
+ }
/* Get actual data size */
actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
@@ -1836,9 +1853,18 @@ static int e100_rx_indicate(struct nic *
pci_unmap_single(nic->pdev, rx->dma_addr,
RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
- /* this allows for a fast restart without re-enabling interrupts */
- if(le16_to_cpu(rfd->command) & cb_el)
- nic->ru_running = RU_SUSPENDED;
+ /* If this buffer has the el bit, but we think the receiver
+ * is still running, check to see if it really stopped while
+ * we had interrupts off.
+ * This allows for a fast restart without re-enabling interrupts.
+ * This can happen when the RU sees the size change but also sees
+ * the el bit set. */
+ if ((le16_to_cpu(rfd->command) & cb_el) &&
+ (ru_running == nic->ru_running)) {
+
+ if (readb(&nic->csr->scb.status) & rus_no_res)
+ nic->ru_running = ru_stopped;
+ }
/* Pull off the RFD and put the actual data (minus eth hdr) */
skb_reserve(skb, sizeof(struct rfd));
@@ -1870,31 +1896,30 @@ static void e100_rx_clean(struct nic *ni
unsigned int work_to_do)
{
struct rx *rx;
- int restart_required = 0;
- struct rx *rx_to_start = NULL;
-
- /* are we already rnr? then pay attention!!! this ensures that
- * the state machine progression never allows a start with a
- * partially cleaned list, avoiding a race between hardware
- * and rx_to_clean when in NAPI mode */
- if(RU_SUSPENDED == nic->ru_running)
- restart_required = 1;
+ int restart_required = 0, err = 0;
+ struct rx *old_before_last_rx, *new_before_last_rx;
+ struct rfd *old_before_last_rfd, *new_before_last_rfd;
/* Indicate newly arrived packets */
for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
- int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
- if(-EAGAIN == err) {
- /* hit quota so have more work to do, restart once
- * cleanup is complete */
- restart_required = 0;
+ err = e100_rx_indicate(nic, rx, work_done, work_to_do);
+ /* Hit quota or no more to clean */
+ if (-EAGAIN == err || -ENODATA == err)
break;
- } else if(-ENODATA == err)
- break; /* No more to clean */
}
- /* save our starting point as the place we'll restart the receiver */
- if(restart_required)
- rx_to_start = nic->rx_to_clean;
+
+ /* On EAGAIN, hit quota so have more work to do, restart once
+ * cleanup is complete.
+ * Else, are we already rnr? then pay attention!!! this ensures that
+ * the state machine progression never allows a start with a
+ * partially cleaned list, avoiding a race between hardware
+ * and rx_to_clean when in NAPI mode */
+ if (-EAGAIN != err && ru_stopped == nic->ru_running)
+ restart_required = 1;
+
+ old_before_last_rx = nic->rx_to_use->prev->prev;
+ old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
/* Alloc new skbs to refill list */
for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
@@ -1902,10 +1927,42 @@ static void e100_rx_clean(struct nic *ni
break; /* Better luck next time (see watchdog) */
}
+ new_before_last_rx = nic->rx_to_use->prev->prev;
+ if (new_before_last_rx != old_before_last_rx) {
+ /* Set the el-bit on the buffer that is before the last buffer.
+ * This lets us update the next pointer on the last buffer
+ * without worrying about hardware touching it.
+ * We set the size to 0 to prevent hardware from touching this
+ * buffer.
+ * When the hardware hits the before last buffer with el-bit
+ * and size of 0, it will RNR interrupt, the RUS will go into
+ * the No Resources state. It will not complete nor write to
+ * this buffer. */
+ new_before_last_rfd =
+ (struct rfd *)new_before_last_rx->skb->data;
+ new_before_last_rfd->size = 0;
+ new_before_last_rfd->command |= cpu_to_le16(cb_el);
+ pci_dma_sync_single_for_device(nic->pdev,
+ new_before_last_rx->dma_addr, sizeof(struct rfd),
+ PCI_DMA_TODEVICE);
+
+ /* Now that we have a new stopping point, we can clear the old
+ * stopping point. We must sync twice to get the proper
+ * ordering on the hardware side of things. */
+ old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
+ pci_dma_sync_single_for_device(nic->pdev,
+ old_before_last_rx->dma_addr, sizeof(struct rfd),
+ PCI_DMA_TODEVICE);
+ old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+ pci_dma_sync_single_for_device(nic->pdev,
+ old_before_last_rx->dma_addr, sizeof(struct rfd),
+ PCI_DMA_TODEVICE);
+ }
+
if(restart_required) {
// ack the rnr?
writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
- e100_start_receiver(nic, rx_to_start);
+ e100_start_receiver(nic, nic->rx_to_clean);
if(work_done)
(*work_done)++;
}
@@ -1916,7 +1973,7 @@ static void e100_rx_clean_list(struct ni
struct rx *rx;
unsigned int i, count = nic->params.rfds.count;
- nic->ru_running = RU_UNINITIALIZED;
+ nic->ru_running = ru_uninitialized;
if(nic->rxs) {
for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
@@ -1937,9 +1994,10 @@ static int e100_rx_alloc_list(struct nic
{
struct rx *rx;
unsigned int i, count = nic->params.rfds.count;
+ struct rfd *before_last;
nic->rx_to_use = nic->rx_to_clean = NULL;
- nic->ru_running = RU_UNINITIALIZED;
+ nic->ru_running = ru_uninitialized;
if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
return -ENOMEM;
@@ -1952,9 +2010,22 @@ static int e100_rx_alloc_list(struct nic
return -ENOMEM;
}
}
+ /* Set the el-bit on the buffer that is before the last buffer.
+ * This lets us update the next pointer on the last buffer without
+ * worrying about hardware touching it.
+ * We set the size to 0 to prevent hardware from touching this buffer.
+ * When the hardware hits the before last buffer with el-bit and size
+ * of 0, it will RNR interrupt, the RU will go into the No Resources
+ * state. It will not complete nor write to this buffer. */
+ rx = nic->rxs->prev->prev;
+ before_last = (struct rfd *)rx->skb->data;
+ before_last->command |= cpu_to_le16(cb_el);
+ before_last->size = 0;
+ pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
+ sizeof(struct rfd), PCI_DMA_TODEVICE);
nic->rx_to_use = nic->rx_to_clean = nic->rxs;
- nic->ru_running = RU_SUSPENDED;
+ nic->ru_running = ru_stopped;
return 0;
}
@@ -1976,7 +2047,7 @@ static irqreturn_t e100_intr(int irq, vo
/* We hit Receive No Resource (RNR); restart RU after cleaning */
if(stat_ack & stat_ack_rnr)
- nic->ru_running = RU_SUSPENDED;
+ nic->ru_running = ru_stopped;
if(likely(netif_rx_schedule_prep(netdev, &nic->napi))) {
e100_disable_irq(nic);
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-11-02 13:27 David Acker
@ 2007-11-02 16:05 ` Kok, Auke
2007-11-02 16:11 ` Jeff Garzik
2007-11-06 17:01 ` Kok, Auke
1 sibling, 1 reply; 28+ messages in thread
From: Kok, Auke @ 2007-11-02 16:05 UTC (permalink / raw)
To: David Acker
Cc: Auke Kok, John Ronciak, Jesse Brandeburg, Jeff Kirsher,
Milton Miller, Jeff Garzik, netdev, e1000-devel, Scott Feldman
David Acker wrote:
> On the systems that have cache incoherent DMA, including ARM, there is a
> race condition between software allocating a new receive buffer and hardware
> writing into a buffer. The two race on touching the last Receive Frame
> Descriptor (RFD). It has its el-bit set and its next link equal to 0.
> When hardware encounters this buffer it attempts to write data to it and
> then update Status Word bits and Actual Count in the RFD. At the same time
> software may try to clear the el-bit and set the link address to a new buffer.
>
> Since the entire RFD is once cache-line, the two write operations can collide.
> This can lead to the receive unit stalling or interpreting random memory as
> its receive area.
>
> The fix is to set the el-bit on and the size to 0 on the next to last buffer
> in the chain. When the hardware encounters this buffer it stops and does not
> write to it at all. The hardware issues an RNR interrupt with the receive
> unit in the No Resources state. Software can write to the tail of the list
> because it knows hardware will stop on the previous descriptor that was
> marked as the end of list.
>
> Once it has a new next to last buffer prepared, it can clear the el-bit and
> set the size on the previous one. The race on this buffer is safe since
> the link already points to a valid next buffer and the software can handle
> the race setting the size (assuming aligned 16 bit writes are atomic with
> respect to the DMA read). If the hardware sees the el-bit cleared without
> the size set, it will move on to the next buffer and skip this one. If it
> sees the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.
>
>
> This is a patch for 2.6.24-rc1.
>
> Signed-off-by: David Acker <dacker@roinet.com>
>
> ---
>
> This version is based on the simpler patch I did in May. The algorithm I tried
> after that never worked correctly under load. It would hang the RU and the
> transmit unit sometimes and if the card was restarted it would often crash the
> system with memory corruption. This patch was tested on my embedded system
> using pktgen. I had it sending while a PC sent at it. I also ran it as
> wireless access point with a 12-hour bidirectional 20 mbps UDP going between an
> ethernet host on the e100 and a wireless client.
looks much simpler to me too, which I like.
It's good to see something coming from you! I'm going to make sure this gets on
the test bench today and will keep you posted on the progress. We'll take a few
days to make sure that this doesn't break early.
Thanks!!!
Auke
>
> --- linux-2.6.24-rc1/drivers/net/e100.c.orig 2007-11-01 11:42:35.000000000 -0400
> +++ linux-2.6.24-rc1/drivers/net/e100.c 2007-11-02 09:09:47.000000000 -0400
> @@ -106,6 +106,13 @@
> * the RFD, the RFD must be dma_sync'ed to maintain a consistent
> * view from software and hardware.
> *
> + * In order to keep updates to the RFD link field from colliding with
> + * hardware writes to mark packets complete, we use the feature that
> + * hardware will not write to a size 0 descriptor and mark the previous
> + * packet as end-of-list (EL). After updating the link, we remove EL
> + * and only then restore the size such that hardware may use the
> + * previous-to-end RFD.
> + *
> * Under typical operation, the receive unit (RU) is start once,
> * and the controller happily fills RFDs as frames arrive. If
> * replacement RFDs cannot be allocated, or the RU goes non-active,
> @@ -281,14 +288,15 @@ struct csr {
> };
>
> enum scb_status {
> + rus_no_res = 0x08,
> rus_ready = 0x10,
> rus_mask = 0x3C,
> };
>
> enum ru_state {
> - RU_SUSPENDED = 0,
> - RU_RUNNING = 1,
> - RU_UNINITIALIZED = -1,
> + ru_stopped = 0,
> + ru_running = 1,
> + ru_uninitialized = -1,
> };
>
> enum scb_stat_ack {
> @@ -952,7 +960,7 @@ static void e100_get_defaults(struct nic
> ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
>
> /* Template for a freshly allocated RFD */
> - nic->blank_rfd.command = cpu_to_le16(cb_el);
> + nic->blank_rfd.command = 0;
> nic->blank_rfd.rbd = 0xFFFFFFFF;
> nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>
> @@ -1759,7 +1767,7 @@ static int e100_alloc_cbs(struct nic *ni
> static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
> {
> if(!nic->rxs) return;
> - if(RU_SUSPENDED != nic->ru_running) return;
> + if (ru_stopped != nic->ru_running) return;
>
> /* handle init time starts */
> if(!rx) rx = nic->rxs;
> @@ -1767,7 +1775,7 @@ static inline void e100_start_receiver(s
> /* (Re)start RU if suspended or idle and RFA is non-NULL */
> if(rx->skb) {
> e100_exec_cmd(nic, ruc_start, rx->dma_addr);
> - nic->ru_running = RU_RUNNING;
> + nic->ru_running = ru_running;
> }
> }
>
> @@ -1791,15 +1799,12 @@ static int e100_rx_alloc_skb(struct nic
> }
>
> /* Link the RFD to end of RFA by linking previous RFD to
> - * this one, and clearing EL bit of previous. */
> + * this one. We are safe to touch the previous RFD because
> + * it is protected by the before last buffer's el bit being set */
> if(rx->prev->skb) {
> struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
> put_unaligned(cpu_to_le32(rx->dma_addr),
> (u32 *)&prev_rfd->link);
> - wmb();
> - prev_rfd->command &= ~cpu_to_le16(cb_el);
> - pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
> - sizeof(struct rfd), PCI_DMA_TODEVICE);
> }
>
> return 0;
> @@ -1824,8 +1829,20 @@ static int e100_rx_indicate(struct nic *
> DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
>
> /* If data isn't ready, nothing to indicate */
> - if(unlikely(!(rfd_status & cb_complete)))
> + if (unlikely(!(rfd_status & cb_complete))) {
> + /* If the next buffer has the el bit, but we think the receiver
> + * is still running, check to see if it really stopped while
> + * we had interrupts off.
> + * This allows for a fast restart without re-enabling
> + * interrupts */
> + if ((le16_to_cpu(rfd->command) & cb_el) &&
> + (ru_running == nic->ru_running)) {
> +
> + if (readb(&nic->csr->scb.status) & rus_no_res)
> + nic->ru_running = ru_stopped;
> + }
> return -ENODATA;
> + }
>
> /* Get actual data size */
> actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
> @@ -1836,9 +1853,18 @@ static int e100_rx_indicate(struct nic *
> pci_unmap_single(nic->pdev, rx->dma_addr,
> RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
>
> - /* this allows for a fast restart without re-enabling interrupts */
> - if(le16_to_cpu(rfd->command) & cb_el)
> - nic->ru_running = RU_SUSPENDED;
> + /* If this buffer has the el bit, but we think the receiver
> + * is still running, check to see if it really stopped while
> + * we had interrupts off.
> + * This allows for a fast restart without re-enabling interrupts.
> + * This can happen when the RU sees the size change but also sees
> + * the el bit set. */
> + if ((le16_to_cpu(rfd->command) & cb_el) &&
> + (ru_running == nic->ru_running)) {
> +
> + if (readb(&nic->csr->scb.status) & rus_no_res)
> + nic->ru_running = ru_stopped;
> + }
>
> /* Pull off the RFD and put the actual data (minus eth hdr) */
> skb_reserve(skb, sizeof(struct rfd));
> @@ -1870,31 +1896,30 @@ static void e100_rx_clean(struct nic *ni
> unsigned int work_to_do)
> {
> struct rx *rx;
> - int restart_required = 0;
> - struct rx *rx_to_start = NULL;
> -
> - /* are we already rnr? then pay attention!!! this ensures that
> - * the state machine progression never allows a start with a
> - * partially cleaned list, avoiding a race between hardware
> - * and rx_to_clean when in NAPI mode */
> - if(RU_SUSPENDED == nic->ru_running)
> - restart_required = 1;
> + int restart_required = 0, err = 0;
> + struct rx *old_before_last_rx, *new_before_last_rx;
> + struct rfd *old_before_last_rfd, *new_before_last_rfd;
>
> /* Indicate newly arrived packets */
> for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
> - int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> - if(-EAGAIN == err) {
> - /* hit quota so have more work to do, restart once
> - * cleanup is complete */
> - restart_required = 0;
> + err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> + /* Hit quota or no more to clean */
> + if (-EAGAIN == err || -ENODATA == err)
> break;
> - } else if(-ENODATA == err)
> - break; /* No more to clean */
> }
>
> - /* save our starting point as the place we'll restart the receiver */
> - if(restart_required)
> - rx_to_start = nic->rx_to_clean;
> +
> + /* On EAGAIN, hit quota so have more work to do, restart once
> + * cleanup is complete.
> + * Else, are we already rnr? then pay attention!!! this ensures that
> + * the state machine progression never allows a start with a
> + * partially cleaned list, avoiding a race between hardware
> + * and rx_to_clean when in NAPI mode */
> + if (-EAGAIN != err && ru_stopped == nic->ru_running)
> + restart_required = 1;
> +
> + old_before_last_rx = nic->rx_to_use->prev->prev;
> + old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
>
> /* Alloc new skbs to refill list */
> for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
> @@ -1902,10 +1927,42 @@ static void e100_rx_clean(struct nic *ni
> break; /* Better luck next time (see watchdog) */
> }
>
> + new_before_last_rx = nic->rx_to_use->prev->prev;
> + if (new_before_last_rx != old_before_last_rx) {
> + /* Set the el-bit on the buffer that is before the last buffer.
> + * This lets us update the next pointer on the last buffer
> + * without worrying about hardware touching it.
> + * We set the size to 0 to prevent hardware from touching this
> + * buffer.
> + * When the hardware hits the before last buffer with el-bit
> + * and size of 0, it will RNR interrupt, the RUS will go into
> + * the No Resources state. It will not complete nor write to
> + * this buffer. */
> + new_before_last_rfd =
> + (struct rfd *)new_before_last_rx->skb->data;
> + new_before_last_rfd->size = 0;
> + new_before_last_rfd->command |= cpu_to_le16(cb_el);
> + pci_dma_sync_single_for_device(nic->pdev,
> + new_before_last_rx->dma_addr, sizeof(struct rfd),
> + PCI_DMA_TODEVICE);
> +
> + /* Now that we have a new stopping point, we can clear the old
> + * stopping point. We must sync twice to get the proper
> + * ordering on the hardware side of things. */
> + old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
> + pci_dma_sync_single_for_device(nic->pdev,
> + old_before_last_rx->dma_addr, sizeof(struct rfd),
> + PCI_DMA_TODEVICE);
> + old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
> + pci_dma_sync_single_for_device(nic->pdev,
> + old_before_last_rx->dma_addr, sizeof(struct rfd),
> + PCI_DMA_TODEVICE);
> + }
> +
> if(restart_required) {
> // ack the rnr?
> writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
> - e100_start_receiver(nic, rx_to_start);
> + e100_start_receiver(nic, nic->rx_to_clean);
> if(work_done)
> (*work_done)++;
> }
> @@ -1916,7 +1973,7 @@ static void e100_rx_clean_list(struct ni
> struct rx *rx;
> unsigned int i, count = nic->params.rfds.count;
>
> - nic->ru_running = RU_UNINITIALIZED;
> + nic->ru_running = ru_uninitialized;
>
> if(nic->rxs) {
> for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
> @@ -1937,9 +1994,10 @@ static int e100_rx_alloc_list(struct nic
> {
> struct rx *rx;
> unsigned int i, count = nic->params.rfds.count;
> + struct rfd *before_last;
>
> nic->rx_to_use = nic->rx_to_clean = NULL;
> - nic->ru_running = RU_UNINITIALIZED;
> + nic->ru_running = ru_uninitialized;
>
> if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
> return -ENOMEM;
> @@ -1952,9 +2010,22 @@ static int e100_rx_alloc_list(struct nic
> return -ENOMEM;
> }
> }
> + /* Set the el-bit on the buffer that is before the last buffer.
> + * This lets us update the next pointer on the last buffer without
> + * worrying about hardware touching it.
> + * We set the size to 0 to prevent hardware from touching this buffer.
> + * When the hardware hits the before last buffer with el-bit and size
> + * of 0, it will RNR interrupt, the RU will go into the No Resources
> + * state. It will not complete nor write to this buffer. */
> + rx = nic->rxs->prev->prev;
> + before_last = (struct rfd *)rx->skb->data;
> + before_last->command |= cpu_to_le16(cb_el);
> + before_last->size = 0;
> + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
> + sizeof(struct rfd), PCI_DMA_TODEVICE);
>
> nic->rx_to_use = nic->rx_to_clean = nic->rxs;
> - nic->ru_running = RU_SUSPENDED;
> + nic->ru_running = ru_stopped;
>
> return 0;
> }
> @@ -1976,7 +2047,7 @@ static irqreturn_t e100_intr(int irq, vo
>
> /* We hit Receive No Resource (RNR); restart RU after cleaning */
> if(stat_ack & stat_ack_rnr)
> - nic->ru_running = RU_SUSPENDED;
> + nic->ru_running = ru_stopped;
>
> if(likely(netif_rx_schedule_prep(netdev, &nic->napi))) {
> e100_disable_irq(nic);
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-11-02 16:05 ` Kok, Auke
@ 2007-11-02 16:11 ` Jeff Garzik
0 siblings, 0 replies; 28+ messages in thread
From: Jeff Garzik @ 2007-11-02 16:11 UTC (permalink / raw)
To: Kok, Auke
Cc: David Acker, John Ronciak, Jesse Brandeburg, Jeff Kirsher,
Milton Miller, netdev, e1000-devel, Scott Feldman
Kok, Auke wrote:
> David Acker wrote:
>> On the systems that have cache incoherent DMA, including ARM, there is a
>> race condition between software allocating a new receive buffer and hardware
>> writing into a buffer. The two race on touching the last Receive Frame
>> Descriptor (RFD). It has its el-bit set and its next link equal to 0.
>> When hardware encounters this buffer it attempts to write data to it and
>> then update Status Word bits and Actual Count in the RFD. At the same time
>> software may try to clear the el-bit and set the link address to a new buffer.
>>
>> Since the entire RFD is once cache-line, the two write operations can collide.
>> This can lead to the receive unit stalling or interpreting random memory as
>> its receive area.
>>
>> The fix is to set the el-bit on and the size to 0 on the next to last buffer
>> in the chain. When the hardware encounters this buffer it stops and does not
>> write to it at all. The hardware issues an RNR interrupt with the receive
>> unit in the No Resources state. Software can write to the tail of the list
>> because it knows hardware will stop on the previous descriptor that was
>> marked as the end of list.
>>
>> Once it has a new next to last buffer prepared, it can clear the el-bit and
>> set the size on the previous one. The race on this buffer is safe since
>> the link already points to a valid next buffer and the software can handle
>> the race setting the size (assuming aligned 16 bit writes are atomic with
>> respect to the DMA read). If the hardware sees the el-bit cleared without
>> the size set, it will move on to the next buffer and skip this one. If it
>> sees the size set but the el-bit still set, it will complete that buffer
>> and then RNR interrupt and wait.
>>
>>
>> This is a patch for 2.6.24-rc1.
>>
>> Signed-off-by: David Acker <dacker@roinet.com>
>>
>> ---
>>
>> This version is based on the simpler patch I did in May. The algorithm I tried
>> after that never worked correctly under load. It would hang the RU and the
>> transmit unit sometimes and if the card was restarted it would often crash the
>> system with memory corruption. This patch was tested on my embedded system
>> using pktgen. I had it sending while a PC sent at it. I also ran it as
>> wireless access point with a 12-hour bidirectional 20 mbps UDP going between an
>> ethernet host on the e100 and a wireless client.
>
> looks much simpler to me too, which I like.
>
> It's good to see something coming from you! I'm going to make sure this gets on
> the test bench today and will keep you posted on the progress. We'll take a few
> days to make sure that this doesn't break early.
>
> Thanks!!!
Agreed, I _really_ appreciate this effort being kept alive.
Jeff
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-11-02 13:27 David Acker
2007-11-02 16:05 ` Kok, Auke
@ 2007-11-06 17:01 ` Kok, Auke
1 sibling, 0 replies; 28+ messages in thread
From: Kok, Auke @ 2007-11-06 17:01 UTC (permalink / raw)
To: David Acker, Jeff Garzik
Cc: Auke Kok, e1000-devel, netdev, Jesse Brandeburg, Milton Miller,
Scott Feldman, John Ronciak, Jeff Kirsher
David Acker wrote:
> On the systems that have cache incoherent DMA, including ARM, there is a
> race condition between software allocating a new receive buffer and hardware
> writing into a buffer. The two race on touching the last Receive Frame
> Descriptor (RFD). It has its el-bit set and its next link equal to 0.
> When hardware encounters this buffer it attempts to write data to it and
> then update Status Word bits and Actual Count in the RFD. At the same time
> software may try to clear the el-bit and set the link address to a new buffer.
>
> Since the entire RFD is once cache-line, the two write operations can collide.
> This can lead to the receive unit stalling or interpreting random memory as
> its receive area.
>
> The fix is to set the el-bit on and the size to 0 on the next to last buffer
> in the chain. When the hardware encounters this buffer it stops and does not
> write to it at all. The hardware issues an RNR interrupt with the receive
> unit in the No Resources state. Software can write to the tail of the list
> because it knows hardware will stop on the previous descriptor that was
> marked as the end of list.
>
> Once it has a new next to last buffer prepared, it can clear the el-bit and
> set the size on the previous one. The race on this buffer is safe since
> the link already points to a valid next buffer and the software can handle
> the race setting the size (assuming aligned 16 bit writes are atomic with
> respect to the DMA read). If the hardware sees the el-bit cleared without
> the size set, it will move on to the next buffer and skip this one. If it
> sees the size set but the el-bit still set, it will complete that buffer
> and then RNR interrupt and wait.
>
>
> This is a patch for 2.6.24-rc1.
>
> Signed-off-by: David Acker <dacker@roinet.com>
OK, there are still a few small tests running but I think it's good. It ran over
the weekend in various tests and we didn't see any receive unit hangs, which we
would have spotted by now. That means that the patch as far as I can assess is
good to go and I will send it to Jeff later on.
There is a minute impact to small packet receive performance. the driver uses
about 0.5% more CPU at 64byte packets wire speed but it's still in the 3% total
range, so I think we're fine there.
Thanks for all the work David!
Auke
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] Fix e100 on systems that have cache incoherent DMA
@ 2007-11-08 18:17 Auke Kok
2007-11-28 19:12 ` David Acker
0 siblings, 1 reply; 28+ messages in thread
From: Auke Kok @ 2007-11-08 18:17 UTC (permalink / raw)
To: jeff, akpm
Cc: netdev, dacker, auke-jan.h.kok, jesse.brandeburg, miltonm,
e1000-devel
From: David Acker <dacker@roinet.com>
On the systems that have cache incoherent DMA, including ARM, there
is a race condition between software allocating a new receive buffer
and hardware writing into a buffer. The two race on touching the last
Receive Frame Descriptor (RFD). It has its el-bit set and its next
link equal to 0. When hardware encounters this buffer it attempts to
write data to it and then update Status Word bits and Actual Count in
the RFD. At the same time software may try to clear the el-bit and
set the link address to a new buffer.
Since the entire RFD is once cache-line, the two write operations can
collide. This can lead to the receive unit stalling or interpreting
random memory as its receive area.
The fix is to set the el-bit on and the size to 0 on the next to last
buffer in the chain. When the hardware encounters this buffer it stops
and does not write to it at all. The hardware issues an RNR interrupt
with the receive unit in the No Resources state. Software can write
to the tail of the list because it knows hardware will stop on the
previous descriptor that was marked as the end of list.
Once it has a new next to last buffer prepared, it can clear the el-bit
and set the size on the previous one. The race on this buffer is safe
since the link already points to a valid next buffer and the software
can handle the race setting the size (assuming aligned 16 bit writes
are atomic with respect to the DMA read). If the hardware sees the
el-bit cleared without the size set, it will move on to the next buffer
and skip this one. If it sees the size set but the el-bit still set,
it will complete that buffer and then RNR interrupt and wait.
Signed-off-by: David Acker <dacker@roinet.com>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---
drivers/net/e100.c | 128 ++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 99 insertions(+), 29 deletions(-)
diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 3dbaec6..2153058 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -106,6 +106,13 @@
* the RFD, the RFD must be dma_sync'ed to maintain a consistent
* view from software and hardware.
*
+ * In order to keep updates to the RFD link field from colliding with
+ * hardware writes to mark packets complete, we use the feature that
+ * hardware will not write to a size 0 descriptor and mark the previous
+ * packet as end-of-list (EL). After updating the link, we remove EL
+ * and only then restore the size such that hardware may use the
+ * previous-to-end RFD.
+ *
* Under typical operation, the receive unit (RU) is start once,
* and the controller happily fills RFDs as frames arrive. If
* replacement RFDs cannot be allocated, or the RU goes non-active,
@@ -281,6 +288,7 @@ struct csr {
};
enum scb_status {
+ rus_no_res = 0x08,
rus_ready = 0x10,
rus_mask = 0x3C,
};
@@ -952,7 +960,7 @@ static void e100_get_defaults(struct nic *nic)
((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
/* Template for a freshly allocated RFD */
- nic->blank_rfd.command = cpu_to_le16(cb_el);
+ nic->blank_rfd.command = 0;
nic->blank_rfd.rbd = 0xFFFFFFFF;
nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
@@ -1791,15 +1799,12 @@ static int e100_rx_alloc_skb(struct nic *nic, struct rx *rx)
}
/* Link the RFD to end of RFA by linking previous RFD to
- * this one, and clearing EL bit of previous. */
+ * this one. We are safe to touch the previous RFD because
+ * it is protected by the before last buffer's el bit being set */
if(rx->prev->skb) {
struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
put_unaligned(cpu_to_le32(rx->dma_addr),
(u32 *)&prev_rfd->link);
- wmb();
- prev_rfd->command &= ~cpu_to_le16(cb_el);
- pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
- sizeof(struct rfd), PCI_DMA_TODEVICE);
}
return 0;
@@ -1824,8 +1829,19 @@ static int e100_rx_indicate(struct nic *nic, struct rx *rx,
DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
/* If data isn't ready, nothing to indicate */
- if(unlikely(!(rfd_status & cb_complete)))
+ if (unlikely(!(rfd_status & cb_complete))) {
+ /* If the next buffer has the el bit, but we think the receiver
+ * is still running, check to see if it really stopped while
+ * we had interrupts off.
+ * This allows for a fast restart without re-enabling
+ * interrupts */
+ if ((le16_to_cpu(rfd->command) & cb_el) &&
+ (RU_RUNNING == nic->ru_running))
+
+ if (readb(&nic->csr->scb.status) & rus_no_res)
+ nic->ru_running = RU_SUSPENDED;
return -ENODATA;
+ }
/* Get actual data size */
actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
@@ -1836,9 +1852,18 @@ static int e100_rx_indicate(struct nic *nic, struct rx *rx,
pci_unmap_single(nic->pdev, rx->dma_addr,
RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
- /* this allows for a fast restart without re-enabling interrupts */
- if(le16_to_cpu(rfd->command) & cb_el)
+ /* If this buffer has the el bit, but we think the receiver
+ * is still running, check to see if it really stopped while
+ * we had interrupts off.
+ * This allows for a fast restart without re-enabling interrupts.
+ * This can happen when the RU sees the size change but also sees
+ * the el bit set. */
+ if ((le16_to_cpu(rfd->command) & cb_el) &&
+ (RU_RUNNING == nic->ru_running)) {
+
+ if (readb(&nic->csr->scb.status) & rus_no_res)
nic->ru_running = RU_SUSPENDED;
+ }
/* Pull off the RFD and put the actual data (minus eth hdr) */
skb_reserve(skb, sizeof(struct rfd));
@@ -1870,31 +1895,30 @@ static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
unsigned int work_to_do)
{
struct rx *rx;
- int restart_required = 0;
- struct rx *rx_to_start = NULL;
-
- /* are we already rnr? then pay attention!!! this ensures that
- * the state machine progression never allows a start with a
- * partially cleaned list, avoiding a race between hardware
- * and rx_to_clean when in NAPI mode */
- if(RU_SUSPENDED == nic->ru_running)
- restart_required = 1;
+ int restart_required = 0, err = 0;
+ struct rx *old_before_last_rx, *new_before_last_rx;
+ struct rfd *old_before_last_rfd, *new_before_last_rfd;
/* Indicate newly arrived packets */
for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
- int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
- if(-EAGAIN == err) {
- /* hit quota so have more work to do, restart once
- * cleanup is complete */
- restart_required = 0;
+ err = e100_rx_indicate(nic, rx, work_done, work_to_do);
+ /* Hit quota or no more to clean */
+ if (-EAGAIN == err || -ENODATA == err)
break;
- } else if(-ENODATA == err)
- break; /* No more to clean */
}
- /* save our starting point as the place we'll restart the receiver */
- if(restart_required)
- rx_to_start = nic->rx_to_clean;
+
+ /* On EAGAIN, hit quota so have more work to do, restart once
+ * cleanup is complete.
+ * Else, are we already rnr? then pay attention!!! this ensures that
+ * the state machine progression never allows a start with a
+ * partially cleaned list, avoiding a race between hardware
+ * and rx_to_clean when in NAPI mode */
+ if (-EAGAIN != err && RU_SUSPENDED == nic->ru_running)
+ restart_required = 1;
+
+ old_before_last_rx = nic->rx_to_use->prev->prev;
+ old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
/* Alloc new skbs to refill list */
for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
@@ -1902,10 +1926,42 @@ static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
break; /* Better luck next time (see watchdog) */
}
+ new_before_last_rx = nic->rx_to_use->prev->prev;
+ if (new_before_last_rx != old_before_last_rx) {
+ /* Set the el-bit on the buffer that is before the last buffer.
+ * This lets us update the next pointer on the last buffer
+ * without worrying about hardware touching it.
+ * We set the size to 0 to prevent hardware from touching this
+ * buffer.
+ * When the hardware hits the before last buffer with el-bit
+ * and size of 0, it will RNR interrupt, the RUS will go into
+ * the No Resources state. It will not complete nor write to
+ * this buffer. */
+ new_before_last_rfd =
+ (struct rfd *)new_before_last_rx->skb->data;
+ new_before_last_rfd->size = 0;
+ new_before_last_rfd->command |= cpu_to_le16(cb_el);
+ pci_dma_sync_single_for_device(nic->pdev,
+ new_before_last_rx->dma_addr, sizeof(struct rfd),
+ PCI_DMA_TODEVICE);
+
+ /* Now that we have a new stopping point, we can clear the old
+ * stopping point. We must sync twice to get the proper
+ * ordering on the hardware side of things. */
+ old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
+ pci_dma_sync_single_for_device(nic->pdev,
+ old_before_last_rx->dma_addr, sizeof(struct rfd),
+ PCI_DMA_TODEVICE);
+ old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+ pci_dma_sync_single_for_device(nic->pdev,
+ old_before_last_rx->dma_addr, sizeof(struct rfd),
+ PCI_DMA_TODEVICE);
+ }
+
if(restart_required) {
// ack the rnr?
writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
- e100_start_receiver(nic, rx_to_start);
+ e100_start_receiver(nic, nic->rx_to_clean);
if(work_done)
(*work_done)++;
}
@@ -1937,6 +1993,7 @@ static int e100_rx_alloc_list(struct nic *nic)
{
struct rx *rx;
unsigned int i, count = nic->params.rfds.count;
+ struct rfd *before_last;
nic->rx_to_use = nic->rx_to_clean = NULL;
nic->ru_running = RU_UNINITIALIZED;
@@ -1952,6 +2009,19 @@ static int e100_rx_alloc_list(struct nic *nic)
return -ENOMEM;
}
}
+ /* Set the el-bit on the buffer that is before the last buffer.
+ * This lets us update the next pointer on the last buffer without
+ * worrying about hardware touching it.
+ * We set the size to 0 to prevent hardware from touching this buffer.
+ * When the hardware hits the before last buffer with el-bit and size
+ * of 0, it will RNR interrupt, the RU will go into the No Resources
+ * state. It will not complete nor write to this buffer. */
+ rx = nic->rxs->prev->prev;
+ before_last = (struct rfd *)rx->skb->data;
+ before_last->command |= cpu_to_le16(cb_el);
+ before_last->size = 0;
+ pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
+ sizeof(struct rfd), PCI_DMA_TODEVICE);
nic->rx_to_use = nic->rx_to_clean = nic->rxs;
nic->ru_running = RU_SUSPENDED;
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-11-08 18:17 Auke Kok
@ 2007-11-28 19:12 ` David Acker
2007-11-28 19:21 ` Kok, Auke
0 siblings, 1 reply; 28+ messages in thread
From: David Acker @ 2007-11-28 19:12 UTC (permalink / raw)
To: Auke Kok; +Cc: jeff, akpm, netdev, jesse.brandeburg, miltonm, e1000-devel
What is the status of this patch? Is there anything folks would like me to do in order to move it forward? As an FYI,
my company has been using this patch since I posted it and so far we have not had any problems with it.
-Ack
Auke Kok wrote:
> From: David Acker <dacker@roinet.com>
>
> On the systems that have cache incoherent DMA, including ARM, there
> is a race condition between software allocating a new receive buffer
> and hardware writing into a buffer. The two race on touching the last
> Receive Frame Descriptor (RFD). It has its el-bit set and its next
> link equal to 0. When hardware encounters this buffer it attempts to
> write data to it and then update Status Word bits and Actual Count in
> the RFD. At the same time software may try to clear the el-bit and
> set the link address to a new buffer.
>
> Since the entire RFD is once cache-line, the two write operations can
> collide. This can lead to the receive unit stalling or interpreting
> random memory as its receive area.
>
> The fix is to set the el-bit on and the size to 0 on the next to last
> buffer in the chain. When the hardware encounters this buffer it stops
> and does not write to it at all. The hardware issues an RNR interrupt
> with the receive unit in the No Resources state. Software can write
> to the tail of the list because it knows hardware will stop on the
> previous descriptor that was marked as the end of list.
>
> Once it has a new next to last buffer prepared, it can clear the el-bit
> and set the size on the previous one. The race on this buffer is safe
> since the link already points to a valid next buffer and the software
> can handle the race setting the size (assuming aligned 16 bit writes
> are atomic with respect to the DMA read). If the hardware sees the
> el-bit cleared without the size set, it will move on to the next buffer
> and skip this one. If it sees the size set but the el-bit still set,
> it will complete that buffer and then RNR interrupt and wait.
>
> Signed-off-by: David Acker <dacker@roinet.com>
> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
> ---
>
> drivers/net/e100.c | 128 ++++++++++++++++++++++++++++++++++++++++------------
> 1 files changed, 99 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/net/e100.c b/drivers/net/e100.c
> index 3dbaec6..2153058 100644
> --- a/drivers/net/e100.c
> +++ b/drivers/net/e100.c
> @@ -106,6 +106,13 @@
> * the RFD, the RFD must be dma_sync'ed to maintain a consistent
> * view from software and hardware.
> *
> + * In order to keep updates to the RFD link field from colliding with
> + * hardware writes to mark packets complete, we use the feature that
> + * hardware will not write to a size 0 descriptor and mark the previous
> + * packet as end-of-list (EL). After updating the link, we remove EL
> + * and only then restore the size such that hardware may use the
> + * previous-to-end RFD.
> + *
> * Under typical operation, the receive unit (RU) is start once,
> * and the controller happily fills RFDs as frames arrive. If
> * replacement RFDs cannot be allocated, or the RU goes non-active,
> @@ -281,6 +288,7 @@ struct csr {
> };
>
> enum scb_status {
> + rus_no_res = 0x08,
> rus_ready = 0x10,
> rus_mask = 0x3C,
> };
> @@ -952,7 +960,7 @@ static void e100_get_defaults(struct nic *nic)
> ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
>
> /* Template for a freshly allocated RFD */
> - nic->blank_rfd.command = cpu_to_le16(cb_el);
> + nic->blank_rfd.command = 0;
> nic->blank_rfd.rbd = 0xFFFFFFFF;
> nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>
> @@ -1791,15 +1799,12 @@ static int e100_rx_alloc_skb(struct nic *nic, struct rx *rx)
> }
>
> /* Link the RFD to end of RFA by linking previous RFD to
> - * this one, and clearing EL bit of previous. */
> + * this one. We are safe to touch the previous RFD because
> + * it is protected by the before last buffer's el bit being set */
> if(rx->prev->skb) {
> struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
> put_unaligned(cpu_to_le32(rx->dma_addr),
> (u32 *)&prev_rfd->link);
> - wmb();
> - prev_rfd->command &= ~cpu_to_le16(cb_el);
> - pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
> - sizeof(struct rfd), PCI_DMA_TODEVICE);
> }
>
> return 0;
> @@ -1824,8 +1829,19 @@ static int e100_rx_indicate(struct nic *nic, struct rx *rx,
> DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
>
> /* If data isn't ready, nothing to indicate */
> - if(unlikely(!(rfd_status & cb_complete)))
> + if (unlikely(!(rfd_status & cb_complete))) {
> + /* If the next buffer has the el bit, but we think the receiver
> + * is still running, check to see if it really stopped while
> + * we had interrupts off.
> + * This allows for a fast restart without re-enabling
> + * interrupts */
> + if ((le16_to_cpu(rfd->command) & cb_el) &&
> + (RU_RUNNING == nic->ru_running))
> +
> + if (readb(&nic->csr->scb.status) & rus_no_res)
> + nic->ru_running = RU_SUSPENDED;
> return -ENODATA;
> + }
>
> /* Get actual data size */
> actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
> @@ -1836,9 +1852,18 @@ static int e100_rx_indicate(struct nic *nic, struct rx *rx,
> pci_unmap_single(nic->pdev, rx->dma_addr,
> RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
>
> - /* this allows for a fast restart without re-enabling interrupts */
> - if(le16_to_cpu(rfd->command) & cb_el)
> + /* If this buffer has the el bit, but we think the receiver
> + * is still running, check to see if it really stopped while
> + * we had interrupts off.
> + * This allows for a fast restart without re-enabling interrupts.
> + * This can happen when the RU sees the size change but also sees
> + * the el bit set. */
> + if ((le16_to_cpu(rfd->command) & cb_el) &&
> + (RU_RUNNING == nic->ru_running)) {
> +
> + if (readb(&nic->csr->scb.status) & rus_no_res)
> nic->ru_running = RU_SUSPENDED;
> + }
>
> /* Pull off the RFD and put the actual data (minus eth hdr) */
> skb_reserve(skb, sizeof(struct rfd));
> @@ -1870,31 +1895,30 @@ static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
> unsigned int work_to_do)
> {
> struct rx *rx;
> - int restart_required = 0;
> - struct rx *rx_to_start = NULL;
> -
> - /* are we already rnr? then pay attention!!! this ensures that
> - * the state machine progression never allows a start with a
> - * partially cleaned list, avoiding a race between hardware
> - * and rx_to_clean when in NAPI mode */
> - if(RU_SUSPENDED == nic->ru_running)
> - restart_required = 1;
> + int restart_required = 0, err = 0;
> + struct rx *old_before_last_rx, *new_before_last_rx;
> + struct rfd *old_before_last_rfd, *new_before_last_rfd;
>
> /* Indicate newly arrived packets */
> for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean = rx->next) {
> - int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> - if(-EAGAIN == err) {
> - /* hit quota so have more work to do, restart once
> - * cleanup is complete */
> - restart_required = 0;
> + err = e100_rx_indicate(nic, rx, work_done, work_to_do);
> + /* Hit quota or no more to clean */
> + if (-EAGAIN == err || -ENODATA == err)
> break;
> - } else if(-ENODATA == err)
> - break; /* No more to clean */
> }
>
> - /* save our starting point as the place we'll restart the receiver */
> - if(restart_required)
> - rx_to_start = nic->rx_to_clean;
> +
> + /* On EAGAIN, hit quota so have more work to do, restart once
> + * cleanup is complete.
> + * Else, are we already rnr? then pay attention!!! this ensures that
> + * the state machine progression never allows a start with a
> + * partially cleaned list, avoiding a race between hardware
> + * and rx_to_clean when in NAPI mode */
> + if (-EAGAIN != err && RU_SUSPENDED == nic->ru_running)
> + restart_required = 1;
> +
> + old_before_last_rx = nic->rx_to_use->prev->prev;
> + old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
>
> /* Alloc new skbs to refill list */
> for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
> @@ -1902,10 +1926,42 @@ static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
> break; /* Better luck next time (see watchdog) */
> }
>
> + new_before_last_rx = nic->rx_to_use->prev->prev;
> + if (new_before_last_rx != old_before_last_rx) {
> + /* Set the el-bit on the buffer that is before the last buffer.
> + * This lets us update the next pointer on the last buffer
> + * without worrying about hardware touching it.
> + * We set the size to 0 to prevent hardware from touching this
> + * buffer.
> + * When the hardware hits the before last buffer with el-bit
> + * and size of 0, it will RNR interrupt, the RUS will go into
> + * the No Resources state. It will not complete nor write to
> + * this buffer. */
> + new_before_last_rfd =
> + (struct rfd *)new_before_last_rx->skb->data;
> + new_before_last_rfd->size = 0;
> + new_before_last_rfd->command |= cpu_to_le16(cb_el);
> + pci_dma_sync_single_for_device(nic->pdev,
> + new_before_last_rx->dma_addr, sizeof(struct rfd),
> + PCI_DMA_TODEVICE);
> +
> + /* Now that we have a new stopping point, we can clear the old
> + * stopping point. We must sync twice to get the proper
> + * ordering on the hardware side of things. */
> + old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
> + pci_dma_sync_single_for_device(nic->pdev,
> + old_before_last_rx->dma_addr, sizeof(struct rfd),
> + PCI_DMA_TODEVICE);
> + old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
> + pci_dma_sync_single_for_device(nic->pdev,
> + old_before_last_rx->dma_addr, sizeof(struct rfd),
> + PCI_DMA_TODEVICE);
> + }
> +
> if(restart_required) {
> // ack the rnr?
> writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
> - e100_start_receiver(nic, rx_to_start);
> + e100_start_receiver(nic, nic->rx_to_clean);
> if(work_done)
> (*work_done)++;
> }
> @@ -1937,6 +1993,7 @@ static int e100_rx_alloc_list(struct nic *nic)
> {
> struct rx *rx;
> unsigned int i, count = nic->params.rfds.count;
> + struct rfd *before_last;
>
> nic->rx_to_use = nic->rx_to_clean = NULL;
> nic->ru_running = RU_UNINITIALIZED;
> @@ -1952,6 +2009,19 @@ static int e100_rx_alloc_list(struct nic *nic)
> return -ENOMEM;
> }
> }
> + /* Set the el-bit on the buffer that is before the last buffer.
> + * This lets us update the next pointer on the last buffer without
> + * worrying about hardware touching it.
> + * We set the size to 0 to prevent hardware from touching this buffer.
> + * When the hardware hits the before last buffer with el-bit and size
> + * of 0, it will RNR interrupt, the RU will go into the No Resources
> + * state. It will not complete nor write to this buffer. */
> + rx = nic->rxs->prev->prev;
> + before_last = (struct rfd *)rx->skb->data;
> + before_last->command |= cpu_to_le16(cb_el);
> + before_last->size = 0;
> + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
> + sizeof(struct rfd), PCI_DMA_TODEVICE);
>
> nic->rx_to_use = nic->rx_to_clean = nic->rxs;
> nic->ru_running = RU_SUSPENDED;
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-11-28 19:12 ` David Acker
@ 2007-11-28 19:21 ` Kok, Auke
2007-11-28 19:26 ` Jeff Garzik
0 siblings, 1 reply; 28+ messages in thread
From: Kok, Auke @ 2007-11-28 19:21 UTC (permalink / raw)
To: David Acker; +Cc: jeff, akpm, netdev, jesse.brandeburg, miltonm, e1000-devel
David Acker wrote:
> What is the status of this patch? Is there anything folks would like me
> to do in order to move it forward? As an FYI, my company has been using
> this patch since I posted it and so far we have not had any problems
> with it.
> -Ack
Jeff merged it in netdev-2.6#upstream so it is queued for 2.6.25.
However, we could consider pushing this patch to 2.6.24 since it's a valid fix.
Andrew has also been carrying this patch in -mm for a while and so have we over
here without issues.
Jeff, what do you think?
Auke
>
> Auke Kok wrote:
>> From: David Acker <dacker@roinet.com>
>>
>> On the systems that have cache incoherent DMA, including ARM, there
>> is a race condition between software allocating a new receive buffer
>> and hardware writing into a buffer. The two race on touching the last
>> Receive Frame Descriptor (RFD). It has its el-bit set and its next
>> link equal to 0. When hardware encounters this buffer it attempts to
>> write data to it and then update Status Word bits and Actual Count in
>> the RFD. At the same time software may try to clear the el-bit and
>> set the link address to a new buffer.
>>
>> Since the entire RFD is once cache-line, the two write operations can
>> collide. This can lead to the receive unit stalling or interpreting
>> random memory as its receive area.
>>
>> The fix is to set the el-bit on and the size to 0 on the next to last
>> buffer in the chain. When the hardware encounters this buffer it stops
>> and does not write to it at all. The hardware issues an RNR interrupt
>> with the receive unit in the No Resources state. Software can write
>> to the tail of the list because it knows hardware will stop on the
>> previous descriptor that was marked as the end of list.
>>
>> Once it has a new next to last buffer prepared, it can clear the el-bit
>> and set the size on the previous one. The race on this buffer is safe
>> since the link already points to a valid next buffer and the software
>> can handle the race setting the size (assuming aligned 16 bit writes
>> are atomic with respect to the DMA read). If the hardware sees the
>> el-bit cleared without the size set, it will move on to the next buffer
>> and skip this one. If it sees the size set but the el-bit still set,
>> it will complete that buffer and then RNR interrupt and wait.
>>
>> Signed-off-by: David Acker <dacker@roinet.com>
>> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
>> ---
>>
>> drivers/net/e100.c | 128
>> ++++++++++++++++++++++++++++++++++++++++------------
>> 1 files changed, 99 insertions(+), 29 deletions(-)
>>
>> diff --git a/drivers/net/e100.c b/drivers/net/e100.c
>> index 3dbaec6..2153058 100644
>> --- a/drivers/net/e100.c
>> +++ b/drivers/net/e100.c
>> @@ -106,6 +106,13 @@
>> * the RFD, the RFD must be dma_sync'ed to maintain a consistent
>> * view from software and hardware.
>> *
>> + * In order to keep updates to the RFD link field from colliding with
>> + * hardware writes to mark packets complete, we use the feature that
>> + * hardware will not write to a size 0 descriptor and mark the
>> previous
>> + * packet as end-of-list (EL). After updating the link, we
>> remove EL
>> + * and only then restore the size such that hardware may use the
>> + * previous-to-end RFD.
>> + *
>> * Under typical operation, the receive unit (RU) is start once,
>> * and the controller happily fills RFDs as frames arrive. If
>> * replacement RFDs cannot be allocated, or the RU goes non-active,
>> @@ -281,6 +288,7 @@ struct csr {
>> };
>>
>> enum scb_status {
>> + rus_no_res = 0x08,
>> rus_ready = 0x10,
>> rus_mask = 0x3C,
>> };
>> @@ -952,7 +960,7 @@ static void e100_get_defaults(struct nic *nic)
>> ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
>>
>> /* Template for a freshly allocated RFD */
>> - nic->blank_rfd.command = cpu_to_le16(cb_el);
>> + nic->blank_rfd.command = 0;
>> nic->blank_rfd.rbd = 0xFFFFFFFF;
>> nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>>
>> @@ -1791,15 +1799,12 @@ static int e100_rx_alloc_skb(struct nic *nic,
>> struct rx *rx)
>> }
>>
>> /* Link the RFD to end of RFA by linking previous RFD to
>> - * this one, and clearing EL bit of previous. */
>> + * this one. We are safe to touch the previous RFD because
>> + * it is protected by the before last buffer's el bit being set */
>> if(rx->prev->skb) {
>> struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
>> put_unaligned(cpu_to_le32(rx->dma_addr),
>> (u32 *)&prev_rfd->link);
>> - wmb();
>> - prev_rfd->command &= ~cpu_to_le16(cb_el);
>> - pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
>> - sizeof(struct rfd), PCI_DMA_TODEVICE);
>> }
>>
>> return 0;
>> @@ -1824,8 +1829,19 @@ static int e100_rx_indicate(struct nic *nic,
>> struct rx *rx,
>> DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
>>
>> /* If data isn't ready, nothing to indicate */
>> - if(unlikely(!(rfd_status & cb_complete)))
>> + if (unlikely(!(rfd_status & cb_complete))) {
>> + /* If the next buffer has the el bit, but we think the receiver
>> + * is still running, check to see if it really stopped while
>> + * we had interrupts off.
>> + * This allows for a fast restart without re-enabling
>> + * interrupts */
>> + if ((le16_to_cpu(rfd->command) & cb_el) &&
>> + (RU_RUNNING == nic->ru_running))
>> +
>> + if (readb(&nic->csr->scb.status) & rus_no_res)
>> + nic->ru_running = RU_SUSPENDED;
>> return -ENODATA;
>> + }
>>
>> /* Get actual data size */
>> actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
>> @@ -1836,9 +1852,18 @@ static int e100_rx_indicate(struct nic *nic,
>> struct rx *rx,
>> pci_unmap_single(nic->pdev, rx->dma_addr,
>> RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
>>
>> - /* this allows for a fast restart without re-enabling interrupts */
>> - if(le16_to_cpu(rfd->command) & cb_el)
>> + /* If this buffer has the el bit, but we think the receiver
>> + * is still running, check to see if it really stopped while
>> + * we had interrupts off.
>> + * This allows for a fast restart without re-enabling interrupts.
>> + * This can happen when the RU sees the size change but also sees
>> + * the el bit set. */
>> + if ((le16_to_cpu(rfd->command) & cb_el) &&
>> + (RU_RUNNING == nic->ru_running)) {
>> +
>> + if (readb(&nic->csr->scb.status) & rus_no_res)
>> nic->ru_running = RU_SUSPENDED;
>> + }
>>
>> /* Pull off the RFD and put the actual data (minus eth hdr) */
>> skb_reserve(skb, sizeof(struct rfd));
>> @@ -1870,31 +1895,30 @@ static void e100_rx_clean(struct nic *nic,
>> unsigned int *work_done,
>> unsigned int work_to_do)
>> {
>> struct rx *rx;
>> - int restart_required = 0;
>> - struct rx *rx_to_start = NULL;
>> -
>> - /* are we already rnr? then pay attention!!! this ensures that
>> - * the state machine progression never allows a start with a
>> - * partially cleaned list, avoiding a race between hardware
>> - * and rx_to_clean when in NAPI mode */
>> - if(RU_SUSPENDED == nic->ru_running)
>> - restart_required = 1;
>> + int restart_required = 0, err = 0;
>> + struct rx *old_before_last_rx, *new_before_last_rx;
>> + struct rfd *old_before_last_rfd, *new_before_last_rfd;
>>
>> /* Indicate newly arrived packets */
>> for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean =
>> rx->next) {
>> - int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
>> - if(-EAGAIN == err) {
>> - /* hit quota so have more work to do, restart once
>> - * cleanup is complete */
>> - restart_required = 0;
>> + err = e100_rx_indicate(nic, rx, work_done, work_to_do);
>> + /* Hit quota or no more to clean */
>> + if (-EAGAIN == err || -ENODATA == err)
>> break;
>> - } else if(-ENODATA == err)
>> - break; /* No more to clean */
>> }
>>
>> - /* save our starting point as the place we'll restart the
>> receiver */
>> - if(restart_required)
>> - rx_to_start = nic->rx_to_clean;
>> +
>> + /* On EAGAIN, hit quota so have more work to do, restart once
>> + * cleanup is complete.
>> + * Else, are we already rnr? then pay attention!!! this ensures that
>> + * the state machine progression never allows a start with a
>> + * partially cleaned list, avoiding a race between hardware
>> + * and rx_to_clean when in NAPI mode */
>> + if (-EAGAIN != err && RU_SUSPENDED == nic->ru_running)
>> + restart_required = 1;
>> +
>> + old_before_last_rx = nic->rx_to_use->prev->prev;
>> + old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
>>
>> /* Alloc new skbs to refill list */
>> for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
>> @@ -1902,10 +1926,42 @@ static void e100_rx_clean(struct nic *nic,
>> unsigned int *work_done,
>> break; /* Better luck next time (see watchdog) */
>> }
>>
>> + new_before_last_rx = nic->rx_to_use->prev->prev;
>> + if (new_before_last_rx != old_before_last_rx) {
>> + /* Set the el-bit on the buffer that is before the last buffer.
>> + * This lets us update the next pointer on the last buffer
>> + * without worrying about hardware touching it.
>> + * We set the size to 0 to prevent hardware from touching this
>> + * buffer.
>> + * When the hardware hits the before last buffer with el-bit
>> + * and size of 0, it will RNR interrupt, the RUS will go into
>> + * the No Resources state. It will not complete nor write to
>> + * this buffer. */
>> + new_before_last_rfd =
>> + (struct rfd *)new_before_last_rx->skb->data;
>> + new_before_last_rfd->size = 0;
>> + new_before_last_rfd->command |= cpu_to_le16(cb_el);
>> + pci_dma_sync_single_for_device(nic->pdev,
>> + new_before_last_rx->dma_addr, sizeof(struct rfd),
>> + PCI_DMA_TODEVICE);
>> +
>> + /* Now that we have a new stopping point, we can clear the old
>> + * stopping point. We must sync twice to get the proper
>> + * ordering on the hardware side of things. */
>> + old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
>> + pci_dma_sync_single_for_device(nic->pdev,
>> + old_before_last_rx->dma_addr, sizeof(struct rfd),
>> + PCI_DMA_TODEVICE);
>> + old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>> + pci_dma_sync_single_for_device(nic->pdev,
>> + old_before_last_rx->dma_addr, sizeof(struct rfd),
>> + PCI_DMA_TODEVICE);
>> + }
>> +
>> if(restart_required) {
>> // ack the rnr?
>> writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
>> - e100_start_receiver(nic, rx_to_start);
>> + e100_start_receiver(nic, nic->rx_to_clean);
>> if(work_done)
>> (*work_done)++;
>> }
>> @@ -1937,6 +1993,7 @@ static int e100_rx_alloc_list(struct nic *nic)
>> {
>> struct rx *rx;
>> unsigned int i, count = nic->params.rfds.count;
>> + struct rfd *before_last;
>>
>> nic->rx_to_use = nic->rx_to_clean = NULL;
>> nic->ru_running = RU_UNINITIALIZED;
>> @@ -1952,6 +2009,19 @@ static int e100_rx_alloc_list(struct nic *nic)
>> return -ENOMEM;
>> }
>> }
>> + /* Set the el-bit on the buffer that is before the last buffer.
>> + * This lets us update the next pointer on the last buffer without
>> + * worrying about hardware touching it.
>> + * We set the size to 0 to prevent hardware from touching this
>> buffer.
>> + * When the hardware hits the before last buffer with el-bit and
>> size
>> + * of 0, it will RNR interrupt, the RU will go into the No Resources
>> + * state. It will not complete nor write to this buffer. */
>> + rx = nic->rxs->prev->prev;
>> + before_last = (struct rfd *)rx->skb->data;
>> + before_last->command |= cpu_to_le16(cb_el);
>> + before_last->size = 0;
>> + pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
>> + sizeof(struct rfd), PCI_DMA_TODEVICE);
>>
>> nic->rx_to_use = nic->rx_to_clean = nic->rxs;
>> nic->ru_running = RU_SUSPENDED;
>> -
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-11-28 19:21 ` Kok, Auke
@ 2007-11-28 19:26 ` Jeff Garzik
2007-11-28 19:50 ` David Acker
0 siblings, 1 reply; 28+ messages in thread
From: Jeff Garzik @ 2007-11-28 19:26 UTC (permalink / raw)
To: Kok, Auke
Cc: David Acker, akpm, netdev, jesse.brandeburg, miltonm, e1000-devel
Kok, Auke wrote:
> David Acker wrote:
>> What is the status of this patch? Is there anything folks would like me
>> to do in order to move it forward? As an FYI, my company has been using
>> this patch since I posted it and so far we have not had any problems
>> with it.
>> -Ack
>
>
> Jeff merged it in netdev-2.6#upstream so it is queued for 2.6.25.
>
> However, we could consider pushing this patch to 2.6.24 since it's a valid fix.
> Andrew has also been carrying this patch in -mm for a while and so have we over
> here without issues.
>
> Jeff, what do you think?
IMO it's too late into 2.6.24-rc to push it. It needs the additional
exposure that 2.6.25 merge window will give it. Since we don't have
people actively complaining today, there is no need to stir the pot for
2.6.24-rc.
Jeff
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-11-28 19:26 ` Jeff Garzik
@ 2007-11-28 19:50 ` David Acker
2008-06-18 18:54 ` Anders Grafström
0 siblings, 1 reply; 28+ messages in thread
From: David Acker @ 2007-11-28 19:50 UTC (permalink / raw)
To: Jeff Garzik
Cc: Kok, Auke, akpm, netdev, jesse.brandeburg, miltonm, e1000-devel
Jeff Garzik wrote:
> Kok, Auke wrote:
>> David Acker wrote:
>>> What is the status of this patch?
>>
>> Jeff merged it in netdev-2.6#upstream so it is queued for 2.6.25.
>>
>> However, we could consider pushing this patch to 2.6.24 since it's a
>> valid fix.
>> Andrew has also been carrying this patch in -mm for a while and so
>> have we over
>> here without issues.
>>
>> Jeff, what do you think?
>
> IMO it's too late into 2.6.24-rc to push it. It needs the additional
> exposure that 2.6.25 merge window will give it. Since we don't have
> people actively complaining today, there is no need to stir the pot for
> 2.6.24-rc.
This sounds good to me. Thanks!
-Ack
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2007-11-28 19:50 ` David Acker
@ 2008-06-18 18:54 ` Anders Grafström
2008-06-18 19:16 ` David Acker
2008-07-01 21:35 ` David Acker
0 siblings, 2 replies; 28+ messages in thread
From: Anders Grafström @ 2008-06-18 18:54 UTC (permalink / raw)
To: David Acker
Cc: Kok, Auke, Jeff Garzik, e1000-devel, netdev, jesse.brandeburg,
miltonm, akpm
>>> David Acker wrote:
>>>> What is the status of this patch?
>>>
>>> Jeff merged it in netdev-2.6#upstream so it is queued for 2.6.25.
The e100 driver broke in 2.6.25 on the ixp4xx based platform I'm using.
This patch seems to be the cause.
It appears to work again with pci_dma_sync_single_for_device() restored.
So I'm wondering if the patch below would be valid and correct?
Anders
diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index f3cba5e..1037b13 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -1803,6 +1803,8 @@ static int e100_rx_alloc_skb(struct nic *nic, struct rx *rx)
if (rx->prev->skb) {
struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
put_unaligned_le32(rx->dma_addr, &prev_rfd->link);
+ pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
+ sizeof(struct rfd), PCI_DMA_TODEVICE);
}
return 0;
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2008-06-18 18:54 ` Anders Grafström
@ 2008-06-18 19:16 ` David Acker
2008-06-19 12:38 ` Anders Grafström
2008-07-01 21:35 ` David Acker
1 sibling, 1 reply; 28+ messages in thread
From: David Acker @ 2008-06-18 19:16 UTC (permalink / raw)
To: Anders Grafström
Cc: Kok, Auke, Jeff Garzik, e1000-devel, netdev, jesse.brandeburg,
miltonm, akpm
Anders Grafström wrote:
>>>> David Acker wrote:
>>>>> What is the status of this patch?
>>>>
>>>> Jeff merged it in netdev-2.6#upstream so it is queued for 2.6.25.
>
> The e100 driver broke in 2.6.25 on the ixp4xx based platform I'm using.
> This patch seems to be the cause.
>
> It appears to work again with pci_dma_sync_single_for_device() restored.
> So I'm wondering if the patch below would be valid and correct?
>
> Anders
>
> diff --git a/drivers/net/e100.c b/drivers/net/e100.c
> index f3cba5e..1037b13 100644
> --- a/drivers/net/e100.c
> +++ b/drivers/net/e100.c
> @@ -1803,6 +1803,8 @@ static int e100_rx_alloc_skb(struct nic *nic,
> struct rx *rx)
> if (rx->prev->skb) {
> struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
> put_unaligned_le32(rx->dma_addr, &prev_rfd->link);
> + pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
> + sizeof(struct rfd), PCI_DMA_TODEVICE);
> }
>
> return 0;
Interesting. That e100 patch really is the patch that never ends! :-)
I didn't think I needed the pci_dma_sync_single_for_device because I
wasn't changing the prev_rfd->command anymore. Now I look again and I
see that the code is still setting prev_rfd->link which may require a
sync. Although I did the bulk of my work with a compulab CM-X255,
http://www.compulab.co.il/x255/html/x255-cm-datasheet.htm , I happen to
have some IXP425 based Gateworks Avilas. They are GW2348-4,
http://www.gateworks.com/products/avila/gw2348-4.php .
May I ask what actual board you are using and how the e100 is connected
to it? I plan to test with miniPCI based e100 cards. Also, can you say
more about it being broke? Does it crash immediately, fail to move any
data, move some data and stop, etc. ?
-Ack
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2008-06-18 19:16 ` David Acker
@ 2008-06-19 12:38 ` Anders Grafström
2008-07-01 8:26 ` Andrew Morton
0 siblings, 1 reply; 28+ messages in thread
From: Anders Grafström @ 2008-06-19 12:38 UTC (permalink / raw)
To: David Acker
Cc: Kok, Auke, Jeff Garzik, e1000-devel, netdev, jesse.brandeburg,
miltonm, akpm
David Acker wrote:
> May I ask what actual board you are using and how the e100 is connected
> to it? I plan to test with miniPCI based e100 cards. Also, can you say
> more about it being broke? Does it crash immediately, fail to move any
> data, move some data and stop, etc. ?
It's a custom board based on the IXDP425 reference design.
IXP420 with 82551ER integrated on the same board.
Both RX and TX appears to stop shortly after the driver has been initialized.
It never recovers after the "exec cuc_dump_reset failed" messages start to show.
[42949402.380000] e100: eth0: e100_hw_init: e100_hw_init
[42949402.460000] e100: eth0: e100_phy_init: phy_addr = 1
[42949402.470000] e100: eth0: e100_phy_init: phy ID = 0x015402A8
[42949402.490000] e100: eth0: e100_configure: [00-07]=20:08:00:01:00:00:26:07
[42949402.490000] e100: eth0: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
[42949402.490000] e100: eth0: e100_configure: [16-23]=00:40:FA:86:3F:05:01:00
[42949402.490000] e100: eth0: e100_set_multicast_list: mc_count=0, flags=0x5002
[42949402.490000] e100: eth0: e100_configure: [00-07]=20:08:00:01:00:00:26:07
[42949402.490000] e100: eth0: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
[42949402.490000] e100: eth0: e100_configure: [16-23]=00:40:FA:86:3F:05:01:00
[42949402.490000] e100: eth0: e100_intr: stat_ack = 0x20
[42949402.490000] e100: eth0: e100_rx_indicate: status=0x0000
[42949402.490000] e100: eth0: e100_tx_clean: cb[0]->status = 0xA000
[42949402.490000] e100: eth0: e100_tx_clean: cb[1]->status = 0xA000
[42949402.490000] e100: eth0: e100_tx_clean: cb[2]->status = 0xA000
[42949402.490000] e100: eth0: e100_tx_clean: cb[3]->status = 0xA000
[42949402.490000] e100: eth0: e100_tx_clean: cb[4]->status = 0xA000
[42949402.490000] e100: eth0: e100_set_multicast_list: mc_count=0, flags=0x5003
[42949402.490000] e100: eth0: e100_configure: [00-07]=20:08:00:01:00:00:26:07
[42949402.490000] e100: eth0: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
[42949402.490000] e100: eth0: e100_configure: [16-23]=00:40:FA:86:3F:05:01:00
[42949402.490000] e100: eth0: e100_intr: stat_ack = 0x20
[42949402.490000] e100: eth0: e100_rx_indicate: status=0x0000
[42949402.490000] e100: eth0: e100_tx_clean: cb[5]->status = 0xA000
[42949402.490000] e100: eth0: e100_tx_clean: cb[6]->status = 0xA000
[42949402.490000] e100: eth0: e100_set_multicast_list: mc_count=1, flags=0x5003
[42949402.490000] e100: eth0: e100_configure: [00-07]=20:08:00:01:00:00:26:07
[42949402.490000] e100: eth0: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
[42949402.490000] e100: eth0: e100_configure: [16-23]=00:40:FA:86:3F:05:01:00
[42949402.490000] e100: eth0: e100_intr: stat_ack = 0x20
[42949402.490000] e100: eth0: e100_rx_indicate: status=0x0000
[42949402.490000] e100: eth0: e100_tx_clean: cb[7]->status = 0xA000
[42949402.490000] e100: eth0: e100_tx_clean: cb[8]->status = 0xA000
[42949402.500000] e100: eth0: e100_watchdog: right now = -27046
[42949402.500000] e100: eth0: e100_watchdog: link up, 100Mbps, half-duplex
[42949402.510000] e100: eth0: e100_intr: stat_ack = 0x04
[42949402.510000] e100: eth0: e100_rx_indicate: status=0x0000
[42949402.510000] e100: eth0: e100_set_multicast_list: mc_count=1, flags=0x5003
[42949402.510000] e100: eth0: e100_configure: [00-07]=20:08:00:01:00:00:26:07
[42949402.510000] e100: eth0: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
[42949402.510000] e100: eth0: e100_configure: [16-23]=00:40:FA:86:3F:05:01:00
[42949402.510000] e100: eth0: e100_intr: stat_ack = 0x20
[42949402.510000] e100: eth0: e100_rx_indicate: status=0x0000
[42949402.510000] e100: eth0: e100_tx_clean: cb[9]->status = 0xA000
[42949402.510000] e100: eth0: e100_tx_clean: cb[10]->status = 0xA000
[42949402.550000] e100: eth0: e100_configure: [00-07]=20:08:00:01:00:00:26:07
[42949402.550000] e100: eth0: e100_configure: [08-15]=01:00:2E:00:60:00:F2:48
[42949402.550000] e100: eth0: e100_configure: [16-23]=00:40:FA:86:3F:05:01:00
[42949402.550000] e100: eth0: e100_intr: stat_ack = 0x20
[42949402.550000] e100: eth0: e100_rx_indicate: status=0x0000
[42949402.550000] e100: eth0: e100_tx_clean: cb[11]->status = 0xA000
[42949402.550000] e100: eth0: e100_intr: stat_ack = 0x20
[42949402.550000] e100: eth0: e100_rx_indicate: status=0x0000
[42949402.550000] e100: eth0: e100_tx_clean: cb[12]->status = 0xA000
[42949402.550000] e100: eth0: e100_intr: stat_ack = 0x20
[42949402.550000] e100: eth0: e100_rx_indicate: status=0x0000
[42949402.550000] e100: eth0: e100_tx_clean: cb[13]->status = 0xA000
[42949403.630000] e100: eth0: e100_intr: stat_ack = 0x20
[42949403.630000] e100: eth0: e100_rx_indicate: status=0x0000
[42949403.630000] e100: eth0: e100_tx_clean: cb[14]->status = 0xA000
[42949404.630000] e100: eth0: e100_intr: stat_ack = 0x20
[42949404.630000] e100: eth0: e100_rx_indicate: status=0xA020
[42949404.630000] e100: eth0: e100_rx_indicate: status=0x0000
[42949404.630000] e100: eth0: e100_tx_clean: cb[15]->status = 0xA000
[42949404.630000] e100: eth0: e100_intr: stat_ack = 0x40
[42949404.630000] e100: eth0: e100_rx_indicate: status=0x0000
[42949405.000000] e100: eth0: e100_watchdog: right now = -26796
[42949405.000000] e100: eth0: e100_intr: stat_ack = 0x04
[42949405.010000] e100: eth0: e100_update_stats: exec cuc_dump_reset failed
[42949405.010000] e100: eth0: e100_rx_indicate: status=0x0000
[42949407.000000] e100: eth0: e100_watchdog: right now = -26596
[42949407.000000] e100: eth0: e100_intr: stat_ack = 0x04
[42949407.010000] e100: eth0: e100_update_stats: exec cuc_dump_reset failed
[42949407.010000] e100: eth0: e100_rx_indicate: status=0x0000
# ethtool -S eth0
NIC statistics:
rx_packets: 1
tx_packets: 4
rx_bytes: 60
tx_bytes: 168
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 0
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_deferred: 0
tx_single_collisions: 0
tx_multi_collisions: 0
tx_flow_control_pause: 0
rx_flow_control_pause: 0
rx_flow_control_unsupported: 0
tx_tco_packets: 0
rx_tco_packets: 0
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2008-06-19 12:38 ` Anders Grafström
@ 2008-07-01 8:26 ` Andrew Morton
2008-07-01 9:49 ` Jeff Garzik
0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2008-07-01 8:26 UTC (permalink / raw)
To: Anders Grafström
Cc: Kok, Auke, Jeff Garzik, e1000-devel, netdev, jesse.brandeburg,
miltonm, David Acker
On Thu, 19 Jun 2008 14:38:33 +0200 Anders Grafstr__m <grfstrm@users.sourceforge.net> wrote:
> David Acker wrote:
> > May I ask what actual board you are using and how the e100 is connected
> > to it? I plan to test with miniPCI based e100 cards. Also, can you say
> > more about it being broke? Does it crash immediately, fail to move any
> > data, move some data and stop, etc. ?
>
> It's a custom board based on the IXDP425 reference design.
> IXP420 with 82551ER integrated on the same board.
> Both RX and TX appears to stop shortly after the driver has been initialized.
> It never recovers after the "exec cuc_dump_reset failed" messages start to show.
So.. where are we with this? e100 is a pretty popular device and
having it obscurely busted on kooky architectures is likely to be
unpopular.
Did this:
> --- a/drivers/net/e100.c
> +++ b/drivers/net/e100.c
> @@ -1803,6 +1803,8 @@ static int e100_rx_alloc_skb(struct nic *nic, struct rx *rx)
> if (rx->prev->skb) {
> struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
> put_unaligned_le32(rx->dma_addr, &prev_rfd->link);
> + pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
> + sizeof(struct rfd), PCI_DMA_TODEVICE);
> }
>
> return 0;
actually fix it?
If so, it's probably better than nothing.
Thanks.
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2008-07-01 8:26 ` Andrew Morton
@ 2008-07-01 9:49 ` Jeff Garzik
2008-07-01 18:07 ` Andrew Morton
0 siblings, 1 reply; 28+ messages in thread
From: Jeff Garzik @ 2008-07-01 9:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Anders Grafström, David Acker, Kok, Auke, netdev,
jesse.brandeburg, miltonm, e1000-devel
Andrew Morton wrote:
> On Thu, 19 Jun 2008 14:38:33 +0200 Anders Grafstr__m <grfstrm@users.sourceforge.net> wrote:
>
>> David Acker wrote:
>>> May I ask what actual board you are using and how the e100 is connected
>>> to it? I plan to test with miniPCI based e100 cards. Also, can you say
>>> more about it being broke? Does it crash immediately, fail to move any
>>> data, move some data and stop, etc. ?
>> It's a custom board based on the IXDP425 reference design.
>> IXP420 with 82551ER integrated on the same board.
>> Both RX and TX appears to stop shortly after the driver has been initialized.
>> It never recovers after the "exec cuc_dump_reset failed" messages start to show.
>
> So.. where are we with this?
It's in net-2.6, davem's 2.6.26-rc queue.
Jeff
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2008-07-01 9:49 ` Jeff Garzik
@ 2008-07-01 18:07 ` Andrew Morton
2008-07-02 17:36 ` Anders Grafström
0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2008-07-01 18:07 UTC (permalink / raw)
To: Jeff Garzik
Cc: Kok, Auke, e1000-devel, netdev, jesse.brandeburg, miltonm,
Anders Grafström, David Acker
On Tue, 01 Jul 2008 05:49:47 -0400 Jeff Garzik <jeff@garzik.org> wrote:
> Andrew Morton wrote:
> > On Thu, 19 Jun 2008 14:38:33 +0200 Anders Grafstr__m <grfstrm@users.sourceforge.net> wrote:
> >
> >> David Acker wrote:
> >>> May I ask what actual board you are using and how the e100 is connected
> >>> to it? I plan to test with miniPCI based e100 cards. Also, can you say
> >>> more about it being broke? Does it crash immediately, fail to move any
> >>> data, move some data and stop, etc. ?
> >> It's a custom board based on the IXDP425 reference design.
> >> IXP420 with 82551ER integrated on the same board.
> >> Both RX and TX appears to stop shortly after the driver has been initialized.
> >> It never recovers after the "exec cuc_dump_reset failed" messages start to show.
> >
> > So.. where are we with this?
>
> It's in net-2.6, davem's 2.6.26-rc queue.
>
ah, thanks. Is it needed in 2.6.25.x?
-------------------------------------------------------------------------
Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
Studies have shown that voting for your favorite open source project,
along with a healthy diet, reduces your potential for chronic lameness
and boredom. Vote Now at http://www.sourceforge.net/community/cca08
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2008-07-01 18:07 ` Andrew Morton
@ 2008-07-02 17:36 ` Anders Grafström
2008-07-02 17:45 ` Andrew Morton
0 siblings, 1 reply; 28+ messages in thread
From: Anders Grafström @ 2008-07-02 17:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Jeff Garzik, Anders Grafström, David Acker, Kok, Auke,
netdev, jesse.brandeburg, miltonm, e1000-devel
Andrew Morton wrote:
> On Tue, 01 Jul 2008 05:49:47 -0400 Jeff Garzik <jeff@garzik.org> wrote:
>
>> Andrew Morton wrote:
>>> On Thu, 19 Jun 2008 14:38:33 +0200 Anders Grafstr__m <grfstrm@users.sourceforge.net> wrote:
>>>
>>>> David Acker wrote:
>>>>> May I ask what actual board you are using and how the e100 is connected
>>>>> to it? I plan to test with miniPCI based e100 cards. Also, can you say
>>>>> more about it being broke? Does it crash immediately, fail to move any
>>>>> data, move some data and stop, etc. ?
>>>> It's a custom board based on the IXDP425 reference design.
>>>> IXP420 with 82551ER integrated on the same board.
>>>> Both RX and TX appears to stop shortly after the driver has been initialized.
>>>> It never recovers after the "exec cuc_dump_reset failed" messages start to show.
>>> So.. where are we with this?
>> It's in net-2.6, davem's 2.6.26-rc queue.
>>
>
> ah, thanks. Is it needed in 2.6.25.x?
I think so.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2008-07-02 17:36 ` Anders Grafström
@ 2008-07-02 17:45 ` Andrew Morton
0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2008-07-02 17:45 UTC (permalink / raw)
To: Anders Grafström
Cc: Kok, Auke, Jeff Garzik, e1000-devel, netdev, jesse.brandeburg,
miltonm, David Acker, stable
On Wed, 02 Jul 2008 19:36:16 +0200 Anders Grafstr__m <grfstrm@users.sourceforge.net> wrote:
> Andrew Morton wrote:
> > On Tue, 01 Jul 2008 05:49:47 -0400 Jeff Garzik <jeff@garzik.org> wrote:
> >
> >> Andrew Morton wrote:
> >>> On Thu, 19 Jun 2008 14:38:33 +0200 Anders Grafstr__m <grfstrm@users.sourceforge.net> wrote:
> >>>
> >>>> David Acker wrote:
> >>>>> May I ask what actual board you are using and how the e100 is connected
> >>>>> to it? I plan to test with miniPCI based e100 cards. Also, can you say
> >>>>> more about it being broke? Does it crash immediately, fail to move any
> >>>>> data, move some data and stop, etc. ?
> >>>> It's a custom board based on the IXDP425 reference design.
> >>>> IXP420 with 82551ER integrated on the same board.
> >>>> Both RX and TX appears to stop shortly after the driver has been initialized.
> >>>> It never recovers after the "exec cuc_dump_reset failed" messages start to show.
> >>> So.. where are we with this?
> >> It's in net-2.6, davem's 2.6.26-rc queue.
> >>
> >
> > ah, thanks. Is it needed in 2.6.25.x?
>
> I think so.
OK. It's in Jeff's hands now, so I'm out of the loop.
-------------------------------------------------------------------------
Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
Studies have shown that voting for your favorite open source project,
along with a healthy diet, reduces your potential for chronic lameness
and boredom. Vote Now at http://www.sourceforge.net/community/cca08
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Fix e100 on systems that have cache incoherent DMA
2008-06-18 18:54 ` Anders Grafström
2008-06-18 19:16 ` David Acker
@ 2008-07-01 21:35 ` David Acker
1 sibling, 0 replies; 28+ messages in thread
From: David Acker @ 2008-07-01 21:35 UTC (permalink / raw)
To: Anders Grafström
Cc: Jeff Garzik, Kok, Auke, akpm, netdev, jesse.brandeburg, miltonm,
e1000-devel, kexin.hao
Anders Grafström wrote:
>>>> David Acker wrote:
>>>>> What is the status of this patch?
>>>>
>>>> Jeff merged it in netdev-2.6#upstream so it is queued for 2.6.25.
>
> The e100 driver broke in 2.6.25 on the ixp4xx based platform I'm using.
> This patch seems to be the cause.
>
> It appears to work again with pci_dma_sync_single_for_device() restored.
> So I'm wondering if the patch below would be valid and correct?
>
> Anders
>
> diff --git a/drivers/net/e100.c b/drivers/net/e100.c
> index f3cba5e..1037b13 100644
> --- a/drivers/net/e100.c
> +++ b/drivers/net/e100.c
> @@ -1803,6 +1803,8 @@ static int e100_rx_alloc_skb(struct nic *nic,
> struct rx *rx)
> if (rx->prev->skb) {
> struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
> put_unaligned_le32(rx->dma_addr, &prev_rfd->link);
> + pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
> + sizeof(struct rfd), PCI_DMA_TODEVICE);
> }
>
> return 0;
> --
Acked-by: David Acker <dacker@roinet.com>
I got a chance to test this patch on both a Compulab CM-X255 and a
Gateworks Avila GW2348-4. It seems to work fine with little to no
performance hit. I will give it some longer tests over this week. If I
run into any problems I will report them here. Thanks to Anders and
Kevin Hao who posted a similar patch.
-Ack
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2008-07-02 17:45 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-08-31 20:54 [PATCH] Fix e100 on systems that have cache incoherent DMA David Acker
2007-09-04 17:02 ` Kok, Auke
2007-09-07 16:31 ` Kok, Auke
2007-09-07 20:41 ` David Acker
2007-09-07 21:03 ` Kok, Auke
2007-09-07 21:18 ` Kok, Auke
2007-09-07 23:24 ` Jeff Garzik
2007-09-11 20:54 ` David Acker
2007-09-12 11:30 ` James Chapman
2007-09-12 20:11 ` David Acker
-- strict thread matches above, loose matches on Subject: below --
2007-11-02 13:27 David Acker
2007-11-02 16:05 ` Kok, Auke
2007-11-02 16:11 ` Jeff Garzik
2007-11-06 17:01 ` Kok, Auke
2007-11-08 18:17 Auke Kok
2007-11-28 19:12 ` David Acker
2007-11-28 19:21 ` Kok, Auke
2007-11-28 19:26 ` Jeff Garzik
2007-11-28 19:50 ` David Acker
2008-06-18 18:54 ` Anders Grafström
2008-06-18 19:16 ` David Acker
2008-06-19 12:38 ` Anders Grafström
2008-07-01 8:26 ` Andrew Morton
2008-07-01 9:49 ` Jeff Garzik
2008-07-01 18:07 ` Andrew Morton
2008-07-02 17:36 ` Anders Grafström
2008-07-02 17:45 ` Andrew Morton
2008-07-01 21:35 ` David Acker
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).