* via-rhine interrupts
From: Jakub Ružička @ 2010-07-29 11:03 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 735 bytes --]
Hello,
cards driven by the via-rhine driver generate a very large number of
interrupts, almost one per packet (11429 interrupts for 8210 incoming
and 2475 outgoing packets per second at full 100 Mbps load). This is
observed on multiple different machines (embedded and desktop) and
kernels (2.6.25 with and without NAPI, 2.6.30, 2.6.32 and 2.6.33). Do
you have any idea why polling isn't used, or what I can try in order
to find out what's wrong?
I have tested sending to/from the machines with nc and scp, and measured
interrupts and load with atop. A few of these measurements from an embedded
device (where the interrupt handling is a problem) are attached.
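As a rough check of the "almost one per packet" figure, the ratio can be
computed from the per-second numbers quoted above. This is only a minimal
userspace sketch; the function name irq_per_packet() is made up for
illustration and is not part of any driver.

```c
#include <assert.h>

/*
 * Interrupts per handled packet (RX + TX), from per-second counters.
 * Returns 0.0 when no packets were seen, to avoid dividing by zero.
 */
static double irq_per_packet(unsigned long irqs, unsigned long rx_pkts,
                             unsigned long tx_pkts)
{
    unsigned long total = rx_pkts + tx_pkts;

    return total ? (double)irqs / (double)total : 0.0;
}
```

With the numbers from this report, irq_per_packet(11429, 8210, 2475)
comes out slightly above one interrupt per packet, which is what one would
expect if interrupt mitigation is effectively not happening.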
I'm not a subscriber, please Cc me if needed.
Thanks,
Jakub Ružička
[-- Attachment #2: atop_on_2.6.25.20_no_NAPI --]
[-- Type: application/octet-stream, Size: 1013 bytes --]
PRC | sys 0.06s | user 0.00s | #proc 44 | #zombie 0 | #exit 0/s
CPU | sys 0% | user 0% | irq 45% | idle 54% | wait 0% |
CPL | avg1 0.00 | avg5 0.00 | avg15 0.00 | csw 13/s | intr 10757/s
MEM | tot 247.4M | free 183.1M | cache 38.4M | buff 4.8M | slab 6.2M
SWP | tot 0.0M | free 0.0M | | vmcom 46.0M | vmlim 123.7M
PAG | scan 0/s | stall 0/s | | swin 0/s | swout 0/s
DSK | sda | busy 0% | read 0/s | write 0/s | avio 0 ms
NET | transport | tcpi 1/s | tcpo 1/s | udpi 0/s | udpo 0/s
NET | network | ipi 1/s | ipo 1/s | ipfrw 0/s | deliv 1/s
NET | eth2 98% | pcki 8112/s | pcko 0/s | si 98 Mbps | so 0 Kbps
NET | eth1 2% | pcki 3838/s | pcko 0/s | si 2032 Kbps | so 0 Kbps
NET | eth0 0% | pcki 7/s | pcko 1/s | si 3 Kbps | so 1 Kbps
NET | lo ---- | pcki 0/s | pcko 0/s | si 0 Kbps | so 0 Kbps
[-- Attachment #3: atop_on_2.6.25.20_with_NAPI --]
[-- Type: application/octet-stream, Size: 1014 bytes --]
PRC | sys 0.08s | user 0.00s | #proc 42 | #zombie 0 | #exit 0/s
CPU | sys 1% | user 2% | irq 47% | idle 50% | wait 0% |
CPL | avg1 0.06 | avg5 0.23 | avg15 0.12 | csw 12/s | intr 11902/s
MEM | tot 247.4M | free 189.1M | cache 35.5M | buff 2.8M | slab 6.2M
SWP | tot 0.0M | free 0.0M | | vmcom 42.0M | vmlim 123.7M
PAG | scan 0/s | stall 0/s | | swin 0/s | swout 0/s
DSK | sda | busy 0% | read 0/s | write 2/s | avio 2 ms
NET | transport | tcpi 1/s | tcpo 1/s | udpi 0/s | udpo 0/s
NET | network | ipi 1/s | ipo 1/s | ipfrw 0/s | deliv 1/s
NET | eth2 98% | pcki 8133/s | pcko 0/s | si 98 Mbps | so 0 Kbps
NET | eth1 2% | pcki 3983/s | pcko 0/s | si 2104 Kbps | so 0 Kbps
NET | eth0 0% | pcki 3/s | pcko 1/s | si 1 Kbps | so 1 Kbps
NET | lo ---- | pcki 0/s | pcko 0/s | si 0 Kbps | so 0 Kbps
[-- Attachment #4: atop_on_2.6.32 --]
[-- Type: application/octet-stream, Size: 1014 bytes --]
PRC | sys 0.16s | user 0.01s | #proc 50 | #zombie 0 | #exit 0/s
CPU | sys 3% | user 0% | irq 69% | idle 28% | wait 0%
CPL | avg1 0.03 | avg5 0.26 | avg15 0.15 | csw 11/s | intr 11342/s
MEM | tot 243.2M | free 180.1M | cache 36.8M | buff 2.8M | slab 6.8M
SWP | tot 0.0M | free 0.0M | | vmcom 45.8M | vmlim 121.6M
PAG | scan 0/s | stall 0/s | | swin 0/s | swout 0/s
DSK | sda | busy 0% | read 0/s | write 1/s | avio 8 ms
NET | transport | tcpi 1/s | tcpo 1/s | udpi 0/s | udpo 0/s
NET | network | ipi 1/s | ipo 1/s | ipfrw 0/s | deliv 1/s
NET | eth2 98% | pcki 8126/s | pcko 0/s | si 98 Mbps | so 0 Kbps
NET | eth1 2% | pcki 3887/s | pcko 0/s | si 2057 Kbps | so 0 Kbps
NET | eth0 0% | pcki 1/s | pcko 1/s | si 0 Kbps | so 1 Kbps
NET | lo ---- | pcki 0/s | pcko 0/s | si 0 Kbps | so 0 Kbps
* [RFC PATCH v8 16/16] An example how to alloc user buffer based on napi_gro_frags() interface.
From: xiaohui.xin @ 2010-07-29 11:14 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1280402088-5849-16-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
This example is based on the ixgbe driver, which uses napi_gro_frags().
It can get buffers directly from the guest side using netdev_alloc_page()
and release guest buffers using netdev_free_page().
---
drivers/net/ixgbe/ixgbe_main.c | 19 +++++++++++++++----
1 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index cfe6853..de5c6d0 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -691,7 +691,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
struct net_device *dev)
{
- return true;
+ return dev_is_mpassthru(dev);
+}
+
+static u32 get_page_skb_offset(struct net_device *dev)
+{
+ if (!dev_is_mpassthru(dev))
+ return 0;
+ return dev->mp_port->vnet_hlen;
}
/**
@@ -764,7 +771,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
adapter->alloc_rx_page_failed++;
goto no_buffers;
}
- bi->page_skb_offset = 0;
+ bi->page_skb_offset =
+ get_page_skb_offset(adapter->netdev);
bi->dma = pci_map_page(pdev, bi->page_skb,
bi->page_skb_offset,
(PAGE_SIZE / 2),
@@ -899,8 +907,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
len = le16_to_cpu(rx_desc->wb.upper.length);
}
- if (is_no_buffer(rx_buffer_info))
+ if (is_no_buffer(rx_buffer_info)) {
+ printk("no buffers\n");
break;
+ }
cleaned = true;
@@ -962,7 +972,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
upper_len);
if ((rx_ring->rx_buf_len > (PAGE_SIZE / 2)) ||
- (page_count(rx_buffer_info->page) != 1))
+ (page_count(rx_buffer_info->page) != 1) ||
+ dev_is_mpassthru(netdev))
rx_buffer_info->page = NULL;
else
get_page(rx_buffer_info->page);
--
1.5.4.4
* [RFC PATCH v8 15/16] An example how to modify NIC driver to use napi_gro_frags() interface
From: xiaohui.xin @ 2010-07-29 11:14 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1280402088-5849-15-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
This example is based on the ixgbe driver.
It provides the API is_rx_buffer_mapped_as_page() to indicate
whether the driver uses the napi_gro_frags() interface or not.
The example allocates 2 pages for DMA for one ring descriptor
using netdev_alloc_page(). When packets arrive, it uses
napi_gro_frags() to allocate the skb and to receive the packets.
---
drivers/net/ixgbe/ixgbe.h | 3 +
drivers/net/ixgbe/ixgbe_main.c | 138 +++++++++++++++++++++++++++++++--------
2 files changed, 112 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..fceffc5 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
struct page *page;
dma_addr_t page_dma;
unsigned int page_offset;
+ u16 mapped_as_page;
+ struct page *page_skb;
+ unsigned int page_skb_offset;
};
struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6c00ee4..cfe6853 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -688,6 +688,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
}
+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+ struct net_device *dev)
+{
+ return true;
+}
+
/**
* ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
* @adapter: address of board private structure
@@ -704,13 +710,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
i = rx_ring->next_to_use;
bi = &rx_ring->rx_buffer_info[i];
+
while (cleaned_count--) {
rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
+ bi->mapped_as_page =
+ is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
if (!bi->page_dma &&
(rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
if (!bi->page) {
- bi->page = alloc_page(GFP_ATOMIC);
+ bi->page = netdev_alloc_page(adapter->netdev);
if (!bi->page) {
adapter->alloc_rx_page_failed++;
goto no_buffers;
@@ -727,7 +737,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
PCI_DMA_FROMDEVICE);
}
- if (!bi->skb) {
+ if (!bi->mapped_as_page && !bi->skb) {
struct sk_buff *skb;
/* netdev_alloc_skb reserves 32 bytes up front!! */
uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -747,6 +757,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
rx_ring->rx_buf_len,
PCI_DMA_FROMDEVICE);
}
+
+ if (bi->mapped_as_page && !bi->page_skb) {
+ bi->page_skb = netdev_alloc_page(adapter->netdev);
+ if (!bi->page_skb) {
+ adapter->alloc_rx_page_failed++;
+ goto no_buffers;
+ }
+ bi->page_skb_offset = 0;
+ bi->dma = pci_map_page(pdev, bi->page_skb,
+ bi->page_skb_offset,
+ (PAGE_SIZE / 2),
+ PCI_DMA_FROMDEVICE);
+ }
/* Refresh the desc even if buffer_addrs didn't change because
* each write-back erases this info. */
if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -823,6 +846,13 @@ struct ixgbe_rsc_cb {
dma_addr_t dma;
};
+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+ return ((!rx_buffer_info->skb ||
+ !rx_buffer_info->page_skb) &&
+ !rx_buffer_info->page);
+}
+
#define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -832,6 +862,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
struct ixgbe_adapter *adapter = q_vector->adapter;
struct net_device *netdev = adapter->netdev;
struct pci_dev *pdev = adapter->pdev;
+ struct napi_struct *napi = &q_vector->napi;
union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
struct sk_buff *skb;
@@ -868,29 +899,57 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
len = le16_to_cpu(rx_desc->wb.upper.length);
}
+ if (is_no_buffer(rx_buffer_info))
+ break;
+
cleaned = true;
- skb = rx_buffer_info->skb;
- prefetch(skb->data);
- rx_buffer_info->skb = NULL;
- if (rx_buffer_info->dma) {
- if ((adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) &&
- (!(staterr & IXGBE_RXD_STAT_EOP)) &&
- (!(skb->prev)))
- /*
- * When HWRSC is enabled, delay unmapping
- * of the first packet. It carries the
- * header information, HW may still
- * access the header after the writeback.
- * Only unmap it when EOP is reached
- */
- IXGBE_RSC_CB(skb)->dma = rx_buffer_info->dma;
- else
- pci_unmap_single(pdev, rx_buffer_info->dma,
- rx_ring->rx_buf_len,
- PCI_DMA_FROMDEVICE);
- rx_buffer_info->dma = 0;
- skb_put(skb, len);
+ if (!rx_buffer_info->mapped_as_page) {
+ skb = rx_buffer_info->skb;
+ prefetch(skb->data);
+ rx_buffer_info->skb = NULL;
+
+ if (rx_buffer_info->dma) {
+ if ((adapter->flags2 &
+ IXGBE_FLAG2_RSC_ENABLED) &&
+ (!(staterr & IXGBE_RXD_STAT_EOP)) &&
+ (!(skb->prev)))
+ /*
+ * When HWRSC is enabled, delay unmapping
+ * of the first packet. It carries the
+ * header information, HW may still
+ * access the header after the writeback.
+ * Only unmap it when EOP is reached
+ */
+ IXGBE_RSC_CB(skb)->dma =
+ rx_buffer_info->dma;
+ else
+ pci_unmap_single(pdev,
+ rx_buffer_info->dma,
+ rx_ring->rx_buf_len,
+ PCI_DMA_FROMDEVICE);
+ rx_buffer_info->dma = 0;
+ skb_put(skb, len);
+ }
+ } else {
+ skb = napi_get_frags(napi);
+ prefetch(rx_buffer_info->page_skb_offset);
+ rx_buffer_info->skb = NULL;
+ if (rx_buffer_info->dma) {
+ pci_unmap_page(pdev, rx_buffer_info->dma,
+ PAGE_SIZE / 2,
+ PCI_DMA_FROMDEVICE);
+ rx_buffer_info->dma = 0;
+ skb_fill_page_desc(skb,
+ skb_shinfo(skb)->nr_frags,
+ rx_buffer_info->page_skb,
+ rx_buffer_info->page_skb_offset,
+ len);
+ rx_buffer_info->page_skb = NULL;
+ skb->len += len;
+ skb->data_len += len;
+ skb->truesize += len;
+ }
}
if (upper_len) {
@@ -956,6 +1015,12 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
rx_buffer_info->dma = next_buffer->dma;
next_buffer->skb = skb;
next_buffer->dma = 0;
+ if (rx_buffer_info->mapped_as_page) {
+ rx_buffer_info->page_skb =
+ next_buffer->page_skb;
+ next_buffer->page_skb = NULL;
+ next_buffer->skb = NULL;
+ }
} else {
skb->next = next_buffer->skb;
skb->next->prev = skb;
@@ -975,7 +1040,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
total_rx_bytes += skb->len;
total_rx_packets++;
- skb->protocol = eth_type_trans(skb, adapter->netdev);
+ if (!rx_buffer_info->mapped_as_page)
+ skb->protocol = eth_type_trans(skb, adapter->netdev);
#ifdef IXGBE_FCOE
/* if ddp, not passing to ULD unless for FCP_RSP or error */
if (adapter->flags & IXGBE_FLAG_FCOE_ENABLED) {
@@ -984,7 +1050,14 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
goto next_desc;
}
#endif /* IXGBE_FCOE */
- ixgbe_receive_skb(q_vector, skb, staterr, rx_ring, rx_desc);
+
+ if (!rx_buffer_info->mapped_as_page)
+ ixgbe_receive_skb(q_vector, skb, staterr,
+ rx_ring, rx_desc);
+ else {
+ skb_record_rx_queue(skb, rx_ring->queue_index);
+ napi_gro_frags(napi);
+ }
next_desc:
rx_desc->wb.upper.status_error = 0;
@@ -3131,9 +3204,16 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
rx_buffer_info = &rx_ring->rx_buffer_info[i];
if (rx_buffer_info->dma) {
- pci_unmap_single(pdev, rx_buffer_info->dma,
- rx_ring->rx_buf_len,
- PCI_DMA_FROMDEVICE);
+ if (!rx_buffer_info->mapped_as_page) {
+ pci_unmap_single(pdev, rx_buffer_info->dma,
+ rx_ring->rx_buf_len,
+ PCI_DMA_FROMDEVICE);
+ } else {
+ pci_unmap_page(pdev, rx_buffer_info->dma,
+ PAGE_SIZE / 2,
+ PCI_DMA_FROMDEVICE);
+ rx_buffer_info->page_skb = NULL;
+ }
rx_buffer_info->dma = 0;
}
if (rx_buffer_info->skb) {
@@ -3158,7 +3238,7 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
PAGE_SIZE / 2, PCI_DMA_FROMDEVICE);
rx_buffer_info->page_dma = 0;
}
- put_page(rx_buffer_info->page);
+ netdev_free_page(adapter->netdev, rx_buffer_info->page);
rx_buffer_info->page = NULL;
rx_buffer_info->page_offset = 0;
}
--
1.5.4.4
* [RFC PATCH v8 13/16] Add a kconfig entry and make entry for mp device.
From: xiaohui.xin @ 2010-07-29 11:14 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1280402088-5849-12-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
drivers/vhost/Kconfig | 10 ++++++++++
drivers/vhost/Makefile | 2 ++
2 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
To compile this driver as a module, choose M here: the module will
be called vhost_net.
+config MEDIATE_PASSTHRU
+ tristate "mediate passthru network driver (EXPERIMENTAL)"
+ depends on VHOST_NET
+ ---help---
+ zerocopy network I/O support; we call it mediate passthru to
+ distinguish it from hardware passthru.
+
+ To compile this driver as a module, choose M here: the module will
+ be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
obj-$(CONFIG_VHOST_NET) += vhost_net.o
vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
--
1.5.4.4
* [RFC PATCH v8 10/16] Add a hook to intercept external buffers from NIC driver.
From: xiaohui.xin @ 2010-07-29 11:14 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1280402088-5849-10-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
The hook is called in netif_receive_skb().
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
net/core/dev.c | 35 +++++++++++++++++++++++++++++++++++
1 files changed, 35 insertions(+), 0 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 636f11b..4b379b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2517,6 +2517,37 @@ err:
EXPORT_SYMBOL(netdev_mp_port_prep);
#endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+ struct packet_type **pt_prev,
+ int *ret,
+ struct net_device *orig_dev)
+{
+ struct mpassthru_port *mp_port = NULL;
+ struct sock *sk = NULL;
+
+ if (!dev_is_mpassthru(skb->dev))
+ return skb;
+ mp_port = skb->dev->mp_port;
+
+ if (*pt_prev) {
+ *ret = deliver_skb(skb, *pt_prev, orig_dev);
+ *pt_prev = NULL;
+ }
+
+ sk = mp_port->sock->sk;
+ skb_queue_tail(&sk->sk_receive_queue, skb);
+ sk->sk_state_change(sk);
+
+ return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
/**
* netif_receive_skb - process receive buffer from network
* @skb: buffer to process
@@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
ncls:
#endif
+ /* To intercept mediate passthru(zero-copy) packets here */
+ skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+ if (!skb)
+ goto out;
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
goto out;
--
1.5.4.4
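The hook above follows the same convention as handle_bridge(): a handler
returns the skb to let processing continue, or NULL once it has consumed
the packet. A minimal userspace sketch of that chaining convention follows;
struct pkt, handle_mp(), handle_bridge_stub() and receive() are
hypothetical stand-ins, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: only what the sketch needs. */
struct pkt {
    int from_mp_dev;   /* models dev_is_mpassthru(skb->dev) */
    int queued;        /* set when the packet was put on the port's queue */
};

/* Returns NULL when the packet is consumed, the packet itself otherwise. */
static struct pkt *handle_mp(struct pkt *p)
{
    if (!p->from_mp_dev)
        return p;
    p->queued = 1;     /* models skb_queue_tail() on the socket queue */
    return NULL;
}

/* Placeholder for the next hook in the chain; never consumes here. */
static struct pkt *handle_bridge_stub(struct pkt *p)
{
    return p;
}

/* Returns 1 if the packet reaches normal protocol delivery, 0 if a
 * hook consumed it first. */
static int receive(struct pkt *p)
{
    assert(p != NULL);

    p = handle_mp(p);
    if (!p)
        return 0;
    p = handle_bridge_stub(p);
    if (!p)
        return 0;
    return 1;
}
```

Placing the mediate passthru hook first means a consumed packet never
reaches the bridge hook or the protocol handlers in this model.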
* [RFC PATCH v8 09/16] Don't do skb recycle, if device use external buffer.
From: xiaohui.xin @ 2010-07-29 11:14 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1280402088-5849-9-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
net/core/skbuff.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bbf4707..9b156bb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -565,6 +565,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
if (skb_shared(skb) || skb_cloned(skb))
return 0;
+ /* if the device wants to do mediate passthru, the skb may
+ * get external buffer, so don't recycle
+ */
+ if (dev_is_mpassthru(skb->dev))
+ return 0;
+
skb_release_head_state(skb);
shinfo = skb_shinfo(skb);
atomic_set(&shinfo->dataref, 1);
--
1.5.4.4
* [RFC PATCH v8 07/16] Modify netdev_alloc_page() to get external buffer
From: xiaohui.xin @ 2010-07-29 11:14 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1280402088-5849-7-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
Currently, it can get external buffers from mp device.
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
net/core/skbuff.c | 27 +++++++++++++++++++++++++++
1 files changed, 27 insertions(+), 0 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 117d82b..1a61e2b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -269,11 +269,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
}
EXPORT_SYMBOL(__netdev_alloc_skb);
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+ struct mpassthru_port *port;
+ struct skb_ext_page *ext_page = NULL;
+
+ port = dev->mp_port;
+ if (!port)
+ goto out;
+ ext_page = port->ctor(port, NULL, npages);
+ if (ext_page)
+ return ext_page->page;
+out:
+ return NULL;
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+ return netdev_alloc_ext_pages(dev, 1);
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
{
int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
struct page *page;
+ if (dev_is_mpassthru(dev))
+ return netdev_alloc_ext_page(dev);
+
page = alloc_pages_node(node, gfp_mask, 0);
return page;
}
--
1.5.4.4
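The dispatch in __netdev_alloc_page() can be pictured as: if the device has
a mediate passthru port, ask the port's constructor for a page, otherwise
fall back to the normal allocator. The following is a userspace sketch
only. All type and function names are hypothetical, and it assumes that
dev_is_mpassthru() amounts to checking for an attached port, which this
patch does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical models of struct page, mpassthru_port and net_device. */
struct page_model { int external; };

struct port_model {
    struct page_model *(*ctor)(struct port_model *port, int npages);
};

struct dev_model { struct port_model *mp_port; };

/* Example constructor: hands out one "guest" page per call. */
static struct page_model *guest_ctor(struct port_model *port, int npages)
{
    struct page_model *pg;

    (void)port;
    if (npages != 1)
        return NULL;
    pg = calloc(1, sizeof(*pg));
    if (pg)
        pg->external = 1;
    return pg;
}

/* Models __netdev_alloc_page(): external buffer if a port is attached
 * (assumed meaning of dev_is_mpassthru()), host memory otherwise. */
static struct page_model *alloc_page_model(struct dev_model *dev)
{
    assert(dev != NULL);

    if (dev->mp_port)
        return dev->mp_port->ctor(dev->mp_port, 1);
    return calloc(1, sizeof(struct page_model));
}
```

The driver-facing call stays the same in both cases; only the source of
the page changes.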
* [RFC PATCH v8 04/16] Add a function make external buffer owner to query capability.
From: xiaohui.xin @ 2010-07-29 11:14 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1280402088-5849-4-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
The external buffer owner can use the functions to get
the capability of the underlying NIC driver.
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhaonew@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
include/linux/netdevice.h | 2 +
net/core/dev.c | 49 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+), 0 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aba0308..5f192de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t napi_frags_finish(struct napi_struct *napi,
gro_result_t ret);
extern struct sk_buff * napi_frags_skb(struct napi_struct *napi);
extern gro_result_t napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+ struct mpassthru_port *port);
static inline void napi_free_frags(struct napi_struct *napi)
{
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..636f11b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
rcu_read_unlock();
}
+/* To support mediate passthru(zero-copy) with NIC driver,
+ * we'd better query NIC driver for the capability it can
+ * provide, especially for packet split mode, now we only
+ * query for the header size, and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value, currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+ struct mpassthru_port *port)
+{
+ int rc;
+ int npages, data_len;
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ if (ops->ndo_mp_port_prep) {
+ rc = ops->ndo_mp_port_prep(dev, port);
+ if (rc)
+ return rc;
+ } else {
+ /* If the NIC driver did not report this,
+ * then we try to use default value.
+ */
+ port->hdr_len = 128;
+ port->data_len = 2048;
+ port->npages = 1;
+ }
+
+ if (port->hdr_len <= 0)
+ goto err;
+
+ npages = port->npages;
+ data_len = port->data_len;
+ if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+ (data_len < PAGE_SIZE * (npages - 1) ||
+ data_len > PAGE_SIZE * npages))
+ goto err;
+
+ return 0;
+err:
+ dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
/**
* netif_receive_skb - process receive buffer from network
* @skb: buffer to process
--
1.5.4.4
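The sanity checks in netdev_mp_port_prep() can be tried out in isolation.
Below is a userspace sketch of the same range checks. The struct and
function names are hypothetical, and MODEL_PAGE_SIZE and MODEL_MAX_FRAGS
are assumed values (the real PAGE_SIZE and MAX_SKB_FRAGS depend on the
architecture and kernel configuration).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096  /* assumption: 4 KiB pages */
#define MODEL_MAX_FRAGS 18    /* assumption: stand-in for MAX_SKB_FRAGS */

/* Hypothetical mirror of the three fields the patch validates. */
struct port_caps {
    int hdr_len;
    int npages;
    int data_len;
};

/* Returns 0 if the reported capabilities are consistent, -EINVAL if not. */
static int check_port_caps(const struct port_caps *c)
{
    assert(c != NULL);

    if (c->hdr_len <= 0)
        return -EINVAL;
    if (c->npages <= 0 || c->npages > MODEL_MAX_FRAGS)
        return -EINVAL;
    /* data_len must fit in npages pages, but need more than npages - 1 */
    if (c->data_len < MODEL_PAGE_SIZE * (c->npages - 1) ||
        c->data_len > MODEL_PAGE_SIZE * c->npages)
        return -EINVAL;
    return 0;
}
```

Under these assumptions the defaults from the patch (hdr_len 128, data_len
2048, npages 1) pass, while the same data_len with npages 2 is rejected
because 2048 bytes would not need a second page.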
* [RFC PATCH v8 03/16] Add a ndo_mp_port_prep func to net_device_ops.
From: xiaohui.xin @ 2010-07-29 11:14 UTC (permalink / raw)
To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1280402088-5849-3-git-send-email-xiaohui.xin@intel.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
If the driver wants to allocate external buffers,
it can export its capabilities, such as the skb
buffer header length and the page length that can be DMA'd.
The owner of the external buffers may make use of this.
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
include/linux/netdevice.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ba582e1..aba0308 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -710,6 +710,10 @@ struct net_device_ops {
int (*ndo_fcoe_get_wwn)(struct net_device *dev,
u64 *wwn, int type);
#endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+ int (*ndo_mp_port_prep)(struct net_device *dev,
+ struct mpassthru_port *port);
+#endif
};
/*
--
1.5.4.4
* bridge: Allow multicast snooping to be disabled before ifup
From: Herbert Xu @ 2010-07-29 10:45 UTC (permalink / raw)
To: David S. Miller, netdev
Hi:
bridge: Allow multicast snooping to be disabled before ifup
Currently you cannot disable multicast snooping while a device is
down. There is no good reason for this restriction and this patch
removes it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 27ae946..585ce6e 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1728,13 +1728,9 @@ unlock:
int br_multicast_toggle(struct net_bridge *br, unsigned long val)
{
struct net_bridge_port *port;
- int err = -ENOENT;
+ int err = 0;
spin_lock(&br->multicast_lock);
- if (!netif_running(br->dev))
- goto unlock;
-
- err = 0;
if (br->multicast_disabled == !val)
goto unlock;
@@ -1742,6 +1738,9 @@ int br_multicast_toggle(struct net_bridge *br, unsigned long val)
if (br->multicast_disabled)
goto unlock;
+ if (!netif_running(br->dev))
+ goto unlock;
+
if (br->mdb) {
if (br->mdb->old) {
err = -EEXIST;
Thanks,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
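The effect of moving the netif_running() check can be sketched in userspace
as follows. The names are hypothetical, the rehash and port-enable work is
reduced to a flag, and the assignment to multicast_disabled between the two
hunks is not visible in the diff and is assumed here.

```c
#include <assert.h>

/* Hypothetical model of the fields br_multicast_toggle() looks at. */
struct bridge_model {
    int running;             /* models netif_running(br->dev) */
    int multicast_disabled;
    int rehashed;            /* set when the "device is up" work ran */
};

/* Models the control flow after the patch: the setting is always
 * recorded, and only the follow-up work depends on the device being up. */
static int toggle_model(struct bridge_model *br, unsigned long val)
{
    assert(br != 0);

    if (br->multicast_disabled == !val)
        return 0;                 /* nothing to change */

    br->multicast_disabled = !val; /* assumed; not shown in the diff */
    if (br->multicast_disabled)
        return 0;                 /* snooping switched off */

    if (!br->running)
        return 0;                 /* remembered for a later ifup */

    br->rehashed = 1;             /* stands in for rehash + port enable */
    return 0;
}
```

In this model a bridge that is down accepts the change and returns 0,
where the old ordering would have bailed out with -ENOENT first.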
* Re: nfs client hang
From: Andy Chittenden @ 2010-07-29 10:10 UTC (permalink / raw)
To: Chuck Lever
Cc: Eric Dumazet,
Linux Kernel Mailing List (linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Trond Myklebust, netdev, Linux NFS Mailing List
In-Reply-To: <4C506AD0.4070608-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
On 2010-07-28 18:37, Chuck Lever wrote:
> On 07/28/10 03:24 AM, Andy Chittenden wrote:
>> resending as it seems to have been corrupted on LKML!
>>
>>> The RPC client marks the socket closed, and the linger timeout is
>>> cancelled. At this point, sk_shutdown should be set to zero, correct?
>>> I don't see an xs_error_report() call here, which would confirm that the
>>> socket took a trip through tcp_disconnect().
>> From my reading of tcp_disconnect(), it calls sk->sk_error_report(sk)
>> unconditionally so as there's no xs_error_report(), that surely means
>> the exact opposite: tcp_disconnect() wasn't called. If it's not
>> called, sk_shutdown is not cleared. And my revised tracing confirmed
>> that it was set to SEND_SHUTDOWN.
> Sorry, that's what I meant above.
>
> An xs_error_report() debugging message at that point in the log would
> confirm that the socket took a trip through tcp_disconnect(). But I
> don't see such a message.
I don't see how tcp_disconnect() gets called if the application does a
shutdown when the state is TCP_ESTABLISHED (or a myriad of other
states). It just seems to send a FIN. Should tcp_disconnect() be called?
If so, how? Alternatively, I wonder whether my patch that set
sk_shutdown to 0 in tcp_connect_init() is the correct fix after all.
--
Andy, BlueArc Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: slub numa: Fix rare allocation from unexpected node
From: Pekka Enberg @ 2010-07-29 10:00 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, jamal, netdev, linux-kernel
In-Reply-To: <alpine.DEB.2.00.1007261040430.5438@router.home>
Christoph Lameter wrote:
> Subject: slub numa: Fix rare allocation from unexpected node
>
> The network developers have seen sporadic allocations resulting in objects
> coming from unexpected NUMA nodes despite asking for objects from a
> specific node.
>
> This is due to get_partial() calling get_any_partial() if partial
> slabs are exhausted for a node even if a node was specified and therefore
> one would expect allocations only from the specified node.
>
> get_any_partial() sporadically may return a slab from a foreign
> node to gradually reduce the size of partial lists on remote nodes
> and thereby reduce total memory use for a slab cache.
>
> The behavior is controlled by the remote_defrag_ratio of each cache.
>
> Strictly speaking this is permitted behavior since __GFP_THISNODE was
> not specified for the allocation but it is certainly surprising.
>
> This patch makes sure that the remote defrag behavior only occurs
> if no node was specified.
>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
> mm/slub.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2010-07-23 09:24:11.000000000 -0500
> +++ linux-2.6/mm/slub.c 2010-07-23 09:25:15.000000000 -0500
> @@ -1390,7 +1390,7 @@ static struct page *get_partial(struct k
> int searchnode = (node == -1) ? numa_node_id() : node;
>
> page = get_partial_node(get_node(s, searchnode));
> - if (page || (flags & __GFP_THISNODE))
> + if (page || node != -1)
> return page;
>
> return get_any_partial(s, flags);
Applied, thanks!
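The one-line change can be compared against the old condition with a small
truth-table style model. This is a userspace sketch with hypothetical
function names; it only models whether get_partial() goes on to call
get_any_partial(), not the allocator itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Old condition: fall back to any node unless a local partial slab was
 * found or __GFP_THISNODE was passed. */
static bool falls_back_old(bool got_local, bool gfp_thisnode)
{
    return !(got_local || gfp_thisnode);
}

/* New condition: fall back only when no node was requested (node == -1)
 * and no partial slab was found on the searched node. */
static bool falls_back_new(bool got_local, int node)
{
    return !(got_local || node != -1);
}
```

The surprising case from the description is a request for a specific node
without __GFP_THISNODE: falls_back_old(false, false) is true, whereas
falls_back_new(false, 2) is false.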
* Re: can: expected receive behavior broken
From: Oliver Hartkopp @ 2010-07-29 9:36 UTC (permalink / raw)
To: Matthias Fuchs
Cc: Socketcan-core-0fE9KPoRgkgATYTw5x5z8w, Linux Netdev List,
Wolfgang Grandegger
In-Reply-To: <201007281023.25039.matthias.fuchs-iOnpLzIbIdM@public.gmane.org>
On 28.07.2010 10:23, Matthias Fuchs wrote:
> plx_pci/sja1000 + esd_usb2
>
Hi Matthias,
I added a test program to the SVN repository that checks whether the
CAN_RAW_LOOPBACK and CAN_RAW_RECV_OWN_MSGS socket options work properly
(in can.ko, can-raw.ko and vcan.ko).
So far I was only able to test it on vcan0, as I'm on a business trip and
don't have any real CAN hardware with me.
I'll enhance it to require the CAN netdev to be given on the command line.
Regarding your request, I was able to see the bad behaviour in the latest
net-next-2.6. You need to run
modprobe vcan echo=1
before creating vcan devices to test the loopback at driver level!
Invoking tst-rcv-own-msgs produces this output, which is far from the
correct (wanted) output seen in the commit message below.
sockopt default
s : 0
t : 0
timeout
sockopt - -
timeout
sockopt - R
timeout
sockopt L -
s : 3
t : 3
timeout
sockopt L R
s : 4
t : 4
timeout
done.
I'll check that with the latest linux-2.6 (after rebooting :-)
Thanks for the hint! I'll run the tst-rcv-own-msgs test tool on the upcoming
net-next-2.6's and also put it into LTP later on.
Regards,
Oliver
--- snip! ---
Added:
trunk/test/tst-rcv-own-msgs.c
Modified:
trunk/test/Makefile
Log:
Added test program to check the correct functionality of
CAN_RAW_LOOPBACK and CAN_RAW_RECV_OWN_MSGS socket options.
It needs a vcan0 virtual CAN network interface and should produce an output
like this, when invoked:
sockopt default
t : 0
timeout
sockopt - -
timeout
sockopt - R
timeout
sockopt L -
t : 3
timeout
sockopt L R
s : 4
t : 4
timeout
done.
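Read as a table, the wanted output above says: the other socket (t) sees
the frame whenever loopback is on, and the sending socket (s) sees it only
when both loopback and receive-own-messages are on. A minimal sketch of
that rule follows. The names are hypothetical, and the reading of "s" as
the sending socket and "t" as a second socket on the same interface is
inferred from the output, not taken from the test program's source.

```c
#include <assert.h>
#include <stdbool.h>

/* Which local sockets are expected to receive a frame sent on vcan0. */
struct delivery {
    bool sender;  /* "s" in the test output (assumed: sending socket) */
    bool other;   /* "t" in the test output (assumed: second socket) */
};

/* loopback models CAN_RAW_LOOPBACK, recv_own models
 * CAN_RAW_RECV_OWN_MSGS, both as set on the sending socket. */
static struct delivery expected_delivery(bool loopback, bool recv_own)
{
    struct delivery d;

    d.other = loopback;
    d.sender = loopback && recv_own;
    return d;
}
```

The "sockopt default" case corresponds to expected_delivery(true, false),
matching the documented defaults of loopback enabled and
receive-own-messages disabled.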
>
> On Wednesday 28 July 2010 10:17, Wolfgang Grandegger wrote:
>> On 07/28/2010 09:56 AM, Matthias Fuchs wrote:
>>> Hi,
>>>
>>> I just noticed that the receive behavior of CAN sockets is broken
>>> in current net-next-2.6.
>>> I wrote some simple code that receives messages and echos them back to
>>> the bus. When I now trigger one single message on the bus, I get
>>> this message received and echoed back in an endless loop.
>>>
>>> I do not touch the sockopts CAN_RAW_LOOPBACK or CAN_RAW_RECV_OWN_MSGS in my code.
>>> Only (!) setting CAN_RAW_LOOPBACK to 0 helps at the moment. But this behavior
>>> actually has nothing to do with LOOPBACK but more with RECV_OWN_MSGS.
>>
>> Sounds weird! What driver are you using?
>>
>> Wolfgang.
>>
>>
>>
^ permalink raw reply
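The wanted output in the commit log above can be read as a small truth table: the other socket ("t") sees the frame whenever CAN_RAW_LOOPBACK is on, and the sending socket ("s") sees it only when CAN_RAW_RECV_OWN_MSGS is on as well. Below is a minimal userspace sketch of that table. It is only a model derived from the log, not the kernel's code, and the names can_rx_result and can_raw_expected_rx are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: which sockets should see a frame sent on socket "s". */
struct can_rx_result {
    bool sender_receives;  /* the "s" line in the tool's output */
    bool other_receives;   /* the "t" line in the tool's output */
};

/*
 * Derived from the wanted output in the commit log:
 *   loopback off                -> nobody receives
 *   loopback on, recv_own off   -> only the other socket receives
 *   loopback on, recv_own on    -> both sockets receive
 */
static struct can_rx_result can_raw_expected_rx(bool loopback, bool recv_own_msgs)
{
    struct can_rx_result r;

    r.other_receives = loopback;
    r.sender_receives = loopback && recv_own_msgs;
    return r;
}
```

With the defaults implied by the "sockopt default" case (loopback on, receive-own-messages off), the model gives "t" only. The broken net-next-2.6 output above shows "s" as well in that case.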
* [patch] dnet: fixup error handling in initialization
From: Dan Carpenter @ 2010-07-29 8:27 UTC (permalink / raw)
To: netdev
Cc: Stephen Hemminger, Eric Dumazet, Frans Pop, Richard Cochran,
David S. Miller, kernel-janitors
There were two problems here. We returned success if dnet_mii_init()
failed and there was a release_mem_region() missing.
Signed-off-by: Dan Carpenter <error27@gmail.com>
diff --git a/drivers/net/dnet.c b/drivers/net/dnet.c
index 4ea7141..7c07575 100644
--- a/drivers/net/dnet.c
+++ b/drivers/net/dnet.c
@@ -854,7 +854,7 @@ static int __devinit dnet_probe(struct platform_device *pdev)
dev = alloc_etherdev(sizeof(*bp));
if (!dev) {
dev_err(&pdev->dev, "etherdev alloc failed, aborting.\n");
- goto err_out;
+ goto err_out_release_mem;
}
/* TODO: Actually, we have some interesting features... */
@@ -911,7 +911,8 @@ static int __devinit dnet_probe(struct platform_device *pdev)
if (err)
dev_warn(&pdev->dev, "Cannot register PHY board fixup.\n");
- if (dnet_mii_init(bp) != 0)
+ err = dnet_mii_init(bp);
+ if (err)
goto err_out_unregister_netdev;
dev_info(&pdev->dev, "Dave DNET at 0x%p (0x%08x) irq %d %pM\n",
@@ -936,6 +937,8 @@ err_out_iounmap:
iounmap(bp->regs);
err_out_free_dev:
free_netdev(dev);
+err_out_release_mem:
+ release_mem_region(mem_base, mem_size);
err_out:
return err;
}
^ permalink raw reply related
* Re: linux-next: manual merge of the net tree with the net-current tree
From: Jiri Pirko @ 2010-07-29 5:51 UTC (permalink / raw)
To: Stephen Rothwell
Cc: David Miller, netdev, linux-next, linux-kernel, stephen hemminger
In-Reply-To: <20100729110504.df0b5791.sfr@canb.auug.org.au>
Looks good to me.
Thu, Jul 29, 2010 at 03:05:04AM CEST, sfr@canb.auug.org.au wrote:
>Hi all,
>
>Today's linux-next merge of the net tree got a conflict in
>net/bridge/br_input.c between commit
>eeaf61d8891f9c9ed12c1a667e72bf83f0857954 ("bridge: add rcu_read_lock on
>transmit") from the net-current tree and commit
>ab95bfe01f9872459c8678572ccadbf646badad0 ("net: replace hooks in
>__netif_receive_skb V5") from the net tree.
>
>Just overlapping changes in a comment. I fixed it up (see below) and can
>carry the fix for a while.
>--
>Cheers,
>Stephen Rothwell sfr@canb.auug.org.au
>
>diff --cc net/bridge/br_input.c
>index 114365c,5fc1c5b..0000000
>--- a/net/bridge/br_input.c
>+++ b/net/bridge/br_input.c
>@@@ -108,13 -110,12 +110,12 @@@ drop
> goto out;
> }
>
> -/* note: already called with rcu_read_lock (preempt_disabled) */
> +/* note: already called with rcu_read_lock */
> static int br_handle_local_finish(struct sk_buff *skb)
> {
>- struct net_bridge_port *p = rcu_dereference(skb->dev->br_port);
>+ struct net_bridge_port *p = br_port_get_rcu(skb->dev);
>
>- if (p)
>- br_fdb_update(p->br, p, eth_hdr(skb)->h_source);
>+ br_fdb_update(p->br, p, eth_hdr(skb)->h_source);
> return 0; /* process further */
> }
>
>@@@ -131,12 -132,13 +132,12 @@@ static inline int is_link_local(const u
> }
>
> /*
>- * Called via br_handle_frame_hook.
> * Return NULL if skb is handled
>- * note: already called with rcu_read_lock
> - * note: already called with rcu_read_lock (preempt_disabled) from
> - * netif_receive_skb
>++ * note: already called with rcu_read_lock from netif_receive_skb
> */
>- struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
>+ struct sk_buff *br_handle_frame(struct sk_buff *skb)
> {
>+ struct net_bridge_port *p;
> const unsigned char *dest = eth_hdr(skb)->h_dest;
> int (*rhook)(struct sk_buff *skb);
>
^ permalink raw reply
* Re: linux-next: build failure after merge of the final tree (net tree related)
From: David Miller @ 2010-07-29 5:21 UTC (permalink / raw)
To: sfr; +Cc: netdev, linux-next, linux-kernel, dmitry, eilong
In-Reply-To: <20100729141306.f27fad7a.sfr@canb.auug.org.au>
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 29 Jul 2010 14:13:06 +1000
> net: bnx2x_cmn.c needs net/ip6_checksum.h for csum_ipv6_magic
>
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Applied, thanks Stephen.
^ permalink raw reply
* linux-next: build failure after merge of the final tree (net tree related)
From: Stephen Rothwell @ 2010-07-29 4:13 UTC (permalink / raw)
To: David Miller, netdev
Cc: linux-next, linux-kernel, Dmitry Kravkov, Eilon Greenstein
Hi Dave,
After merging the final tree, today's linux-next build (powerpc
allyesconfig) failed like this:
drivers/net/bnx2x/bnx2x_cmn.c: In function 'bnx2x_start_xmit':
drivers/net/bnx2x/bnx2x_cmn.c:2015: error: implicit declaration of function 'csum_ipv6_magic'
Caused by commit 9f6c925889ad9204c7d1f5ca116d2e5fd6036c72 ("bnx2x: Create
bnx2x_cmn.* files"). See Rule 1 in Documentation/SubmitChecklist. :-)
I applied the following patch for today:
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 29 Jul 2010 14:07:49 +1000
Subject: [PATCH] net: bnx2x_cmn.c needs net/ip6_checksum.h for csum_ipv6_magic
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
drivers/net/bnx2x/bnx2x_cmn.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/drivers/net/bnx2x/bnx2x_cmn.c b/drivers/net/bnx2x/bnx2x_cmn.c
index 30d20c7..02bf710 100644
--- a/drivers/net/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/bnx2x/bnx2x_cmn.c
@@ -19,6 +19,7 @@
#include <linux/etherdevice.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
+#include <net/ip6_checksum.h>
#include "bnx2x_cmn.h"
#ifdef BCM_VLAN
--
1.7.1
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
^ permalink raw reply related
* TAHI CN-6-4-1 failed on Linux 2.6.32 kernel
From: Steve Chen @ 2010-07-29 3:20 UTC (permalink / raw)
To: usagi-users-ctl, netdev
Hello,
The TAHI correspondent node test CN-6-4-1 (Processing in upper layer
- Echo Checksum) failed for me in the 2.6.32 kernel. It appears that
the Linux kernel is replying to the ICMP echo request in
icmpv6_echo_reply without much checking. Is this an intentional
non-conformance to RFC3775 section 9.3.1?
Thanks,
Steve
^ permalink raw reply
* Re: linux-next: manual merge of the net tree with the net-current tree
From: Jeff Kirsher @ 2010-07-29 1:26 UTC (permalink / raw)
To: Stephen Rothwell
Cc: David Miller, netdev, linux-next, linux-kernel, Bruce Allan
In-Reply-To: <20100729111904.90b148a6.sfr@canb.auug.org.au>
On Wed, Jul 28, 2010 at 18:19, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi all,
>
> Today's linux-next merge of the net tree got a conflict in
> drivers/net/e1000e/hw.h between commit
> ff847ac2d3e90edd94674c28bade25ae1e6a2e49 ("") from the net-current tree
> and commit d3738bb8203acf8552c3ec8b3447133fc0938ddd ("e1000e: initial
> support for 82579 LOMs") from the net tree.
>
> Just context changes. I fixed it up (see below) and can carry the fix for
> a while.
> --
> Cheers,
> Stephen Rothwell sfr@canb.auug.org.au
>
> diff --cc drivers/net/e1000e/hw.h
> index 664ed58,0cd569a..0000000
> --- a/drivers/net/e1000e/hw.h
> +++ b/drivers/net/e1000e/hw.h
> @@@ -308,8 -312,8 +312,8 @@@ enum e1e_registers
> #define E1000_KMRNCTRLSTA_INBAND_PARAM 0x9 /* Kumeran InBand Parameters */
> #define E1000_KMRNCTRLSTA_DIAG_NELPBK 0x1000 /* Nearend Loopback mode */
> #define E1000_KMRNCTRLSTA_K1_CONFIG 0x7
> -#define E1000_KMRNCTRLSTA_K1_ENABLE 0x140E
> +#define E1000_KMRNCTRLSTA_K1_ENABLE 0x0002
> - #define E1000_KMRNCTRLSTA_K1_DISABLE 0x1400
> + #define E1000_KMRNCTRLSTA_HD_CTRL 0x0002
>
> #define IFE_PHY_EXTENDED_STATUS_CONTROL 0x10
> #define IFE_PHY_SPECIAL_CONTROL 0x11 /* 100BaseTx PHY Special Control */
> --
Yes, that is fine and expected. When Dave syncs up his net-next-2.6
tree with the net-2.6 tree, this issue will be fixed up.
--
Cheers,
Jeff
^ permalink raw reply
* linux-next: manual merge of the net tree with the net-current tree
From: Stephen Rothwell @ 2010-07-29 1:19 UTC (permalink / raw)
To: David Miller, netdev; +Cc: linux-next, linux-kernel, Bruce Allan, Jeff Kirsher
Hi all,
Today's linux-next merge of the net tree got a conflict in
drivers/net/e1000e/hw.h between commit
ff847ac2d3e90edd94674c28bade25ae1e6a2e49 ("") from the net-current tree
and commit d3738bb8203acf8552c3ec8b3447133fc0938ddd ("e1000e: initial
support for 82579 LOMs") from the net tree.
Just context changes. I fixed it up (see below) and can carry the fix for
a while.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
diff --cc drivers/net/e1000e/hw.h
index 664ed58,0cd569a..0000000
--- a/drivers/net/e1000e/hw.h
+++ b/drivers/net/e1000e/hw.h
@@@ -308,8 -312,8 +312,8 @@@ enum e1e_registers
#define E1000_KMRNCTRLSTA_INBAND_PARAM 0x9 /* Kumeran InBand Parameters */
#define E1000_KMRNCTRLSTA_DIAG_NELPBK 0x1000 /* Nearend Loopback mode */
#define E1000_KMRNCTRLSTA_K1_CONFIG 0x7
-#define E1000_KMRNCTRLSTA_K1_ENABLE 0x140E
+#define E1000_KMRNCTRLSTA_K1_ENABLE 0x0002
- #define E1000_KMRNCTRLSTA_K1_DISABLE 0x1400
+ #define E1000_KMRNCTRLSTA_HD_CTRL 0x0002
#define IFE_PHY_EXTENDED_STATUS_CONTROL 0x10
#define IFE_PHY_SPECIAL_CONTROL 0x11 /* 100BaseTx PHY Special Control */
^ permalink raw reply
* Re: [PATCH net-next] bonding: take rtnl in bond_loadbalance_arp_mon
From: Andy Gospodarek @ 2010-07-29 1:13 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev
In-Reply-To: <29111.1280353189@death>
On Wed, Jul 28, 2010 at 02:39:49PM -0700, Jay Vosburgh wrote:
> Andy Gospodarek <andy@greyhouse.net> wrote:
>
> >With the latest code in net-next-2.6 the following (and similar) are
> >spewed when using arp monitoring and balance-alb.
>
> Does the ARP monitor function correctly for balance-alb? My
> recollection is that the ARP monitor probes interfere with the tailored
> ARP messages that balance-alb sends.
It seems to work fine here on a few tries (I only use sysfs for
configuration anymore), but it might be blind luck that the addresses
chosen are hashing out correctly to make arp monitoring work.
> The bond_check_params function
> disallows setting arp_interval (it forces miimon on). I suspect this
> nuance was missed when setting up the sysfs code, but if it does work,
> then perhaps it is too strict.
You are correct, it does. It is clear that some checks should be added
to the sysfs code and it also seems like some work should be done to
more clearly define what modes support which form of link monitoring (it
doesn't seem to me like balance-rr should support can monitoring in its
current implementation, but there is no explicit code to check for it in
the sysfs-layer or bond_check_params).
> As I recall, I had deliberately left acquiring rtnl out of the
> loadbalance_arp_mon function, since none of the modes that used it
> required rtnl for failover.
Understood.
Based on your comments, at least something like the following should
probably be done.
[PATCH net-next] bonding: prevent sysfs from allowing arp monitoring with alb/tlb
When using module options, arp monitoring and balance-alb/balance-tlb
are mutually exclusive. Any time balance-alb/balance-tlb is enabled,
mii monitoring is forced to 100ms if not set. When configuring
via sysfs, no checking is currently done.
Handling these cases with sysfs has to be done a bit differently because
we do not have all configuration information available at once. This
patch will not allow a mode change to balance-alb/balance-tlb if
arp_interval is already non-zero. It will also not allow the user to
set a non-zero arp_interval value if the mode is already set to
balance-alb/balance-tlb. They are still mutually exclusive on a
first-come, first-served basis.
Tested with initscripts on Fedora and manual setting via sysfs.
Signed-off-by: Andy Gospodarek <gospo@redhat.com>
---
drivers/net/bonding/bond_sysfs.c | 37 +++++++++++++++++++++++++------------
1 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 1a99764..c311aed 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -313,19 +313,26 @@ static ssize_t bonding_store_mode(struct device *d,
bond->dev->name, (int)strlen(buf) - 1, buf);
ret = -EINVAL;
goto out;
- } else {
- if (bond->params.mode == BOND_MODE_8023AD)
- bond_unset_master_3ad_flags(bond);
+ }
+ if ((new_value == BOND_MODE_ALB ||
+ new_value == BOND_MODE_TLB) &&
+ bond->params.arp_interval) {
+ pr_err("%s: %s mode is incompatible with arp monitoring.\n",
+ bond->dev->name, bond_mode_tbl[new_value].modename);
+ ret = -EINVAL;
+ goto out;
+ }
+ if (bond->params.mode == BOND_MODE_8023AD)
+ bond_unset_master_3ad_flags(bond);
- if (bond->params.mode == BOND_MODE_ALB)
- bond_unset_master_alb_flags(bond);
+ if (bond->params.mode == BOND_MODE_ALB)
+ bond_unset_master_alb_flags(bond);
- bond->params.mode = new_value;
- bond_set_mode_ops(bond, bond->params.mode);
- pr_info("%s: setting mode to %s (%d).\n",
- bond->dev->name, bond_mode_tbl[new_value].modename,
- new_value);
- }
+ bond->params.mode = new_value;
+ bond_set_mode_ops(bond, bond->params.mode);
+ pr_info("%s: setting mode to %s (%d).\n",
+ bond->dev->name, bond_mode_tbl[new_value].modename,
+ new_value);
out:
return ret;
}
@@ -510,7 +517,13 @@ static ssize_t bonding_store_arp_interval(struct device *d,
ret = -EINVAL;
goto out;
}
-
+ if (bond->params.mode == BOND_MODE_ALB ||
+ bond->params.mode == BOND_MODE_TLB) {
+ pr_info("%s: ARP monitoring cannot be used with ALB/TLB. Only MII monitoring is supported on %s.\n",
+ bond->dev->name, bond->dev->name);
+ ret = -EINVAL;
+ goto out;
+ }
pr_info("%s: Setting ARP monitoring interval to %d.\n",
bond->dev->name, new_value);
bond->params.arp_interval = new_value;
--
1.7.0.1
^ permalink raw reply related
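The first-come, first-served rule described in the patch above can be sketched as two setters that each check the other setting before accepting a change. This is a simplified userspace model and not the sysfs code itself: the names model_bond, model_store_mode and model_store_arp_interval are hypothetical, and the sketch follows the commit message's wording in rejecting only a non-zero arp_interval.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical subset of bonding modes, enough to show the rule. */
enum model_mode {
    MODEL_MODE_ROUNDROBIN,
    MODEL_MODE_ACTIVEBACKUP,
    MODEL_MODE_TLB,
    MODEL_MODE_ALB
};

struct model_bond {
    enum model_mode mode;
    int arp_interval;  /* 0 means ARP monitoring is off */
};

static int mode_is_alb_or_tlb(enum model_mode m)
{
    return m == MODEL_MODE_ALB || m == MODEL_MODE_TLB;
}

/* Refuse a switch to ALB/TLB while ARP monitoring is already configured. */
static int model_store_mode(struct model_bond *b, enum model_mode new_mode)
{
    if (mode_is_alb_or_tlb(new_mode) && b->arp_interval)
        return -EINVAL;
    b->mode = new_mode;
    return 0;
}

/* Refuse to enable ARP monitoring while the mode is already ALB/TLB. */
static int model_store_arp_interval(struct model_bond *b, int new_value)
{
    if (new_value < 0)
        return -EINVAL;
    if (new_value && mode_is_alb_or_tlb(b->mode))
        return -EINVAL;
    b->arp_interval = new_value;
    return 0;
}
```

Whichever setting is written first wins, and the second write fails with -EINVAL and leaves the existing configuration untouched.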
* Re: [REGRESSION] e1000e stopped working [MANUALLY BISECTED]
From: Jeff Kirsher @ 2010-07-29 1:10 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Tantilov, Emil S, netdev@vger.kernel.org, Allan, Bruce W,
Pieper, Jeffrey E
In-Reply-To: <1280300683.8250.2.camel@maxim-laptop>
On Wed, Jul 28, 2010 at 00:04, Maxim Levitsky <maximlevitsky@gmail.com> wrote:
> On Mon, 2010-07-26 at 03:25 +0300, Maxim Levitsky wrote:
>>
>> This commit, present in net-next, solves the problem:
>>
>> commit 1286950690f0f82ffa504e1e149ee3fdb4c51478
>> Author: Bruce Allan <bruce.w.allan@intel.com>
>> Date: Mon Jul 26 03:19:38 2010 +0300
>>
>> e1000e: cleanup e1000_sw_lcd_config_ich8lan()
>>
>> Do not acquire and release the PHY unnecessarily for parts that return
>> from this workaround without actually accessing the PHY registers.
>>
>> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
>> Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
>> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>>
>>
>>
>> Also, the above patch is part of a whole series of patches with scary descriptions (that is, these fix bugs).
>> If I were you I would send them to Linus for 2.6.35 inclusion too.
>>
>> Best regards,
>> Maxim Levitsky
>>
>>
>>
> ping
>
Sorry for the delayed response. I am working on the issue. Here is
the problem I am having: the patch that fixes the issue you are seeing
is fairly large and is a cleanup to the ich8 function, which, as it
stands now, would not be accepted into the net-2.6 tree this late in the
-rc cycle. So what I am looking at is what specifically in that patch
fixed the issue you are seeing, so that I can come up with a
smaller (acceptable) patch to submit to net-2.6 now to resolve
your issue.
I have dedicated most of this evening to finding a resolution to your
issue that will be acceptable for the net-2.6 tree. As you noted,
there were several patches before this particular commit that may play
some part in the resolution as well, and that is what I will be
looking into. I greatly appreciate the hard work you have done to
help us resolve this issue, and will make sure you get credit for any
solution I put together to resolve this issue.
--
Cheers,
Jeff
^ permalink raw reply
* linux-next: manual merge of the net tree with the net-current tree
From: Stephen Rothwell @ 2010-07-29 1:05 UTC (permalink / raw)
To: David Miller, netdev
Cc: linux-next, linux-kernel, stephen hemminger, Jiri Pirko
Hi all,
Today's linux-next merge of the net tree got a conflict in
net/bridge/br_input.c between commit
eeaf61d8891f9c9ed12c1a667e72bf83f0857954 ("bridge: add rcu_read_lock on
transmit") from the net-current tree and commit
ab95bfe01f9872459c8678572ccadbf646badad0 ("net: replace hooks in
__netif_receive_skb V5") from the net tree.
Just overlapping changes in a comment. I fixed it up (see below) and can
carry the fix for a while.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
diff --cc net/bridge/br_input.c
index 114365c,5fc1c5b..0000000
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@@ -108,13 -110,12 +110,12 @@@ drop
goto out;
}
-/* note: already called with rcu_read_lock (preempt_disabled) */
+/* note: already called with rcu_read_lock */
static int br_handle_local_finish(struct sk_buff *skb)
{
- struct net_bridge_port *p = rcu_dereference(skb->dev->br_port);
+ struct net_bridge_port *p = br_port_get_rcu(skb->dev);
- if (p)
- br_fdb_update(p->br, p, eth_hdr(skb)->h_source);
+ br_fdb_update(p->br, p, eth_hdr(skb)->h_source);
return 0; /* process further */
}
@@@ -131,12 -132,13 +132,12 @@@ static inline int is_link_local(const u
}
/*
- * Called via br_handle_frame_hook.
* Return NULL if skb is handled
- * note: already called with rcu_read_lock
- * note: already called with rcu_read_lock (preempt_disabled) from
- * netif_receive_skb
++ * note: already called with rcu_read_lock from netif_receive_skb
*/
- struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
+ struct sk_buff *br_handle_frame(struct sk_buff *skb)
{
+ struct net_bridge_port *p;
const unsigned char *dest = eth_hdr(skb)->h_dest;
int (*rhook)(struct sk_buff *skb);
^ permalink raw reply
* Re: noqueue on bonding devices
From: Simon Horman @ 2010-07-28 23:42 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev
In-Reply-To: <16360.1280338676@death>
On Wed, Jul 28, 2010 at 10:37:56AM -0700, Jay Vosburgh wrote:
> Simon Horman <horms@verge.net.au> wrote:
>
> >Hi Jay, Hi All,
> >
> >I would just like to wonder out loud if it is intentional that bonding
> >devices default to noqueue, whereas for instance ethernet devices
> >default to a pfifo_fast with qlen 1000.
>
> Yes, it is.
>
> >The reason that I ask, is that when setting up some bandwidth
> >control using tc I encountered some strange behaviour which
> >I eventually tracked down to the queue-length of the qdiscs being 1p -
> >inherited from noqueue, as opposed to 1000p which would occur
> >on an ethernet device.
> >
> >It's trivial to work around, by either altering the txqueuelen on
> >the bonding device before adding the qdisc or by manually setting
> >the qlen of the qdisc. But it did take us a while to determine the
> >cause of the problem we were seeing. And as it seems inconsistent
> >I'm interested to know why this is the case.
>
> Software-only virtual devices (loopback, bonding, bridge, vlan,
> etc) typically have no transmit queue because, well, the device does no
> queueing. Meaning that there is no flow control infrastructure in the
> software device; bonding, et al, won't ever flow control (call
> netif_stop_queue to temporarily suspend transmit) or accumulate packets
> on a transmit queue.
>
> Hardware ethernet devices set a queue length because it is
> meaningful for them to do so. When their hardware transmit ring fills
> up, they will assert flow control, and stop accepting new packets for
> transmit. Packets then accumulate in the software transmit queue, and
> when the device unblocks, those packets are ready to go. When under
> continuous load, hardware network devices typically free up ring entries
> in blocks (not one at a time), so the software transmit queue helps to
> smooth out the chunkiness of the hardware driver's processing, minimize
> dropped packets, etc.
>
> It's certainly possible to add a queue and qdisc to a bonding
> device, and is reasonable to do if you want to do packet scheduling with
> tc and friends. In this case, the queue is really just for the tc
> actions to connect to; the queue won't accumulate packets on account of
> the driver (but could if the scheduler, e.g., rate limits).
Thanks for the detailed explanation, much appreciated.
^ permalink raw reply
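The effect Simon describes in the thread above, a qdisc on a bonding device ending up with a queue length of 1p, can be illustrated with a small model. It assumes that a FIFO-style qdisc takes its packet limit from the device's txqueuelen and falls back to one packet when txqueuelen is 0; that fallback is inferred from the "1p" he reports and is not confirmed by the thread. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed rule: the limit follows txqueuelen, with a one-packet fallback for 0. */
static unsigned int model_fifo_limit(unsigned int txqueuelen)
{
    return txqueuelen ? txqueuelen : 1;
}

/*
 * Packets dropped when `burst` packets arrive while the scheduler
 * (e.g. a rate limiter) is not draining the queue at all.
 */
static unsigned int model_burst_drops(unsigned int txqueuelen, unsigned int burst)
{
    unsigned int limit = model_fifo_limit(txqueuelen);

    return burst > limit ? burst - limit : 0;
}
```

Under this model a burst of 10 packets loses 9 of them on a device with txqueuelen 0 and none with txqueuelen 1000, which matches the workaround of raising txqueuelen or setting the qdisc limit by hand before shaping traffic.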
* Re: [PATCH net-next] bonding: take rtnl in bond_loadbalance_arp_mon
From: Jay Vosburgh @ 2010-07-28 21:39 UTC (permalink / raw)
To: Andy Gospodarek; +Cc: netdev
In-Reply-To: <1280351076-19973-1-git-send-email-andy@greyhouse.net>
Andy Gospodarek <andy@greyhouse.net> wrote:
>With the latest code in net-next-2.6 the following (and similar) are
>spewed when using arp monitoring and balance-alb.
Does the ARP monitor function correctly for balance-alb? My
recollection is that the ARP monitor probes interfere with the tailored
ARP messages that balance-alb sends. The bond_check_params function
disallows setting arp_interval (it forces miimon on). I suspect this
nuance was missed when setting up the sysfs code, but if it does work,
then perhaps it is too strict.
As I recall, I had deliberately left acquiring rtnl out of the
loadbalance_arp_mon function, since none of the modes that used it
required rtnl for failover.
-J
>RTNL: assertion failed at drivers/net/bonding/bond_alb.c (1663)
>Pid: 1653, comm: bond0 Tainted: G W 2.6.35-rc1-net-next #9
>Call Trace:
> [<ffffffffa0385bb3>] bond_alb_handle_active_change+0x10e/0x17f [bonding]
> [<ffffffffa037f2e4>] bond_change_active_slave+0x20c/0x42f [bonding]
> [<ffffffffa0380062>] ? bond_loadbalance_arp_mon+0x1d8/0x222 [bonding]
> [<ffffffffa037f95a>] bond_select_active_slave+0xe0/0x10e [bonding]
> [<ffffffffa038006a>] bond_loadbalance_arp_mon+0x1e0/0x222 [bonding]
> [<ffffffff81065f7d>] worker_thread+0x26a/0x363
> [<ffffffff81065f25>] ? worker_thread+0x212/0x363
> [<ffffffff81048b19>] ? finish_task_switch+0x70/0xe4
> [<ffffffff81048aa9>] ? finish_task_switch+0x0/0xe4
> [<ffffffffa037fe8a>] ? bond_loadbalance_arp_mon+0x0/0x222 [bonding]
> [<ffffffff8106a45a>] ? autoremove_wake_function+0x0/0x39
> [<ffffffff81065d13>] ? worker_thread+0x0/0x363
> [<ffffffff81069f98>] kthread+0x9a/0xa2
> [<ffffffff8107b4f7>] ? trace_hardirqs_on_caller+0x111/0x135
> [<ffffffff8100aa64>] kernel_thread_helper+0x4/0x10
> [<ffffffff81475e50>] ? restore_args+0x0/0x30
> [<ffffffff81069efe>] ? kthread+0x0/0xa2
> [<ffffffff8100aa60>] ? kernel_thread_helper+0x0/0x10
>
>This is essentially the same thing done in bond_activebackup_arp_mon to
>address not holding rtnl when needed.
>
>Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
>
>---
> drivers/net/bonding/bond_main.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 2cc4cfc..d624cf9 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2857,11 +2857,17 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
> }
>
> if (do_failover) {
>+ read_unlock(&bond->lock);
>+ rtnl_lock();
>+ read_lock(&bond->lock);
> write_lock_bh(&bond->curr_slave_lock);
>
> bond_select_active_slave(bond);
>
> write_unlock_bh(&bond->curr_slave_lock);
>+ read_unlock(&bond->lock);
>+ rtnl_unlock();
>+ read_lock(&bond->lock);
> }
>
> re_arm:
>--
>1.7.0.1
>
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
^ permalink raw reply