Netdev List
* [PATCH net-next v1 4/9] forcedeth: stats for rx_packets based on hardware registers
From: David Decotigny @ 2011-11-09 22:09 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, David Decotigny
In-Reply-To: <cover.1320875577.git.david.decotigny@google.com>

Use the hardware registers instead of a software implementation to
account for the number of RX packets.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index cd7f83d..0071d5c 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -1712,6 +1712,7 @@ static struct net_device_stats *nv_get_stats(struct net_device *dev)
 		nv_get_hw_stats(dev);
 
 		/* copy to net_device stats */
+		dev->stats.rx_packets = np->estats.rx_packets;
 		dev->stats.tx_packets = np->estats.tx_packets;
 		dev->stats.rx_bytes = np->estats.rx_bytes;
 		dev->stats.tx_bytes = np->estats.tx_bytes;
@@ -2690,7 +2691,6 @@ static int nv_rx_process(struct net_device *dev, int limit)
 		skb_put(skb, len);
 		skb->protocol = eth_type_trans(skb, dev);
 		napi_gro_receive(&np->napi, skb);
-		dev->stats.rx_packets++;
 next_pkt:
 		if (unlikely(np->get_rx.orig++ == np->last_rx.orig))
 			np->get_rx.orig = np->first_rx.orig;
@@ -2773,7 +2773,6 @@ static int nv_rx_process_optimized(struct net_device *dev, int limit)
 				__vlan_hwaccel_put_tag(skb, vid);
 			}
 			napi_gro_receive(&np->napi, skb);
-			dev->stats.rx_packets++;
 		} else {
 			dev_kfree_skb(skb);
 		}
-- 
1.7.3.1


* [PATCH net-next v1 3/9] forcedeth: expose module parameters in /sys/module
From: David Decotigny @ 2011-11-09 22:09 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, David Decotigny
In-Reply-To: <cover.1320875577.git.david.decotigny@google.com>

In particular, debug_tx_timeout can be updated at runtime.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 763863d..cd7f83d 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -5981,23 +5981,23 @@ static void __exit exit_nic(void)
 	pci_unregister_driver(&driver);
 }
 
-module_param(max_interrupt_work, int, 0);
+module_param(max_interrupt_work, int, S_IRUGO);
 MODULE_PARM_DESC(max_interrupt_work, "forcedeth maximum events handled per interrupt");
-module_param(optimization_mode, int, 0);
+module_param(optimization_mode, int, S_IRUGO);
 MODULE_PARM_DESC(optimization_mode, "In throughput mode (0), every tx & rx packet will generate an interrupt. In CPU mode (1), interrupts are controlled by a timer. In dynamic mode (2), the mode toggles between throughput and CPU mode based on network load.");
-module_param(poll_interval, int, 0);
+module_param(poll_interval, int, S_IRUGO);
 MODULE_PARM_DESC(poll_interval, "Interval determines how frequent timer interrupt is generated by [(time_in_micro_secs * 100) / (2^10)]. Min is 0 and Max is 65535.");
-module_param(msi, int, 0);
+module_param(msi, int, S_IRUGO);
 MODULE_PARM_DESC(msi, "MSI interrupts are enabled by setting to 1 and disabled by setting to 0.");
-module_param(msix, int, 0);
+module_param(msix, int, S_IRUGO);
 MODULE_PARM_DESC(msix, "MSIX interrupts are enabled by setting to 1 and disabled by setting to 0.");
-module_param(dma_64bit, int, 0);
+module_param(dma_64bit, int, S_IRUGO);
 MODULE_PARM_DESC(dma_64bit, "High DMA is enabled by setting to 1 and disabled by setting to 0.");
-module_param(phy_cross, int, 0);
+module_param(phy_cross, int, S_IRUGO);
 MODULE_PARM_DESC(phy_cross, "Phy crossover detection for Realtek 8201 phy is enabled by setting to 1 and disabled by setting to 0.");
-module_param(phy_power_down, int, 0);
+module_param(phy_power_down, int, S_IRUGO);
 MODULE_PARM_DESC(phy_power_down, "Power down phy and disable link when interface is down (1), or leave phy powered up (0).");
-module_param(debug_tx_timeout, bool, 0);
+module_param(debug_tx_timeout, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(debug_tx_timeout,
 		 "Dump tx related registers and ring when tx_timeout happens");
 
-- 
1.7.3.1


* [PATCH net-next v1 2/9] forcedeth: allow to silence "TX timeout" debug messages
From: David Decotigny @ 2011-11-09 22:09 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Sameer Nanda, David Decotigny
In-Reply-To: <cover.1320875577.git.david.decotigny@google.com>

From: Sameer Nanda <snanda@google.com>

This adds a new module parameter "debug_tx_timeout" to silence most
debug messages in case of TX timeout. These messages don't provide a
signal/noise ratio high enough for production systems and, with ~30kB
logged each time, they tend to add to a cascade effect if the system
is already under stress (memory pressure, disk, etc.).

By default, the parameter is clear, meaning that only a single warning
will be reported (the other more detailed debug messages are not
displayed).



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |   98 ++++++++++++++++++-------------
 1 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 39781c1b..763863d 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -892,6 +892,11 @@ enum {
 static int dma_64bit = NV_DMA_64BIT_ENABLED;
 
 /*
+ * Debug output control for tx_timeout
+ */
+static bool debug_tx_timeout = false;
+
+/*
  * Crossover Detection
  * Realtek 8201 phy + some OEM boards do not work properly.
  */
@@ -2461,56 +2466,64 @@ static void nv_tx_timeout(struct net_device *dev)
 	u32 status;
 	union ring_type put_tx;
 	int saved_tx_limit;
-	int i;
 
 	if (np->msi_flags & NV_MSI_X_ENABLED)
 		status = readl(base + NvRegMSIXIrqStatus) & NVREG_IRQSTAT_MASK;
 	else
 		status = readl(base + NvRegIrqStatus) & NVREG_IRQSTAT_MASK;
 
-	netdev_info(dev, "Got tx_timeout. irq: %08x\n", status);
+	netdev_warn(dev, "Got tx_timeout. irq status: %08x\n", status);
 
-	netdev_info(dev, "Ring at %lx\n", (unsigned long)np->ring_addr);
-	netdev_info(dev, "Dumping tx registers\n");
-	for (i = 0; i <= np->register_size; i += 32) {
-		netdev_info(dev,
-			    "%3x: %08x %08x %08x %08x %08x %08x %08x %08x\n",
-			    i,
-			    readl(base + i + 0), readl(base + i + 4),
-			    readl(base + i + 8), readl(base + i + 12),
-			    readl(base + i + 16), readl(base + i + 20),
-			    readl(base + i + 24), readl(base + i + 28));
-	}
-	netdev_info(dev, "Dumping tx ring\n");
-	for (i = 0; i < np->tx_ring_size; i += 4) {
-		if (!nv_optimized(np)) {
-			netdev_info(dev,
-				    "%03x: %08x %08x // %08x %08x // %08x %08x // %08x %08x\n",
-				    i,
-				    le32_to_cpu(np->tx_ring.orig[i].buf),
-				    le32_to_cpu(np->tx_ring.orig[i].flaglen),
-				    le32_to_cpu(np->tx_ring.orig[i+1].buf),
-				    le32_to_cpu(np->tx_ring.orig[i+1].flaglen),
-				    le32_to_cpu(np->tx_ring.orig[i+2].buf),
-				    le32_to_cpu(np->tx_ring.orig[i+2].flaglen),
-				    le32_to_cpu(np->tx_ring.orig[i+3].buf),
-				    le32_to_cpu(np->tx_ring.orig[i+3].flaglen));
-		} else {
+	if (unlikely(debug_tx_timeout)) {
+		int i;
+
+		netdev_info(dev, "Ring at %lx\n", (unsigned long)np->ring_addr);
+		netdev_info(dev, "Dumping tx registers\n");
+		for (i = 0; i <= np->register_size; i += 32) {
 			netdev_info(dev,
-				    "%03x: %08x %08x %08x // %08x %08x %08x // %08x %08x %08x // %08x %08x %08x\n",
+				    "%3x: %08x %08x %08x %08x "
+				    "%08x %08x %08x %08x\n",
 				    i,
-				    le32_to_cpu(np->tx_ring.ex[i].bufhigh),
-				    le32_to_cpu(np->tx_ring.ex[i].buflow),
-				    le32_to_cpu(np->tx_ring.ex[i].flaglen),
-				    le32_to_cpu(np->tx_ring.ex[i+1].bufhigh),
-				    le32_to_cpu(np->tx_ring.ex[i+1].buflow),
-				    le32_to_cpu(np->tx_ring.ex[i+1].flaglen),
-				    le32_to_cpu(np->tx_ring.ex[i+2].bufhigh),
-				    le32_to_cpu(np->tx_ring.ex[i+2].buflow),
-				    le32_to_cpu(np->tx_ring.ex[i+2].flaglen),
-				    le32_to_cpu(np->tx_ring.ex[i+3].bufhigh),
-				    le32_to_cpu(np->tx_ring.ex[i+3].buflow),
-				    le32_to_cpu(np->tx_ring.ex[i+3].flaglen));
+				    readl(base + i + 0), readl(base + i + 4),
+				    readl(base + i + 8), readl(base + i + 12),
+				    readl(base + i + 16), readl(base + i + 20),
+				    readl(base + i + 24), readl(base + i + 28));
+		}
+		netdev_info(dev, "Dumping tx ring\n");
+		for (i = 0; i < np->tx_ring_size; i += 4) {
+			if (!nv_optimized(np)) {
+				netdev_info(dev,
+					    "%03x: %08x %08x // %08x %08x "
+					    "// %08x %08x // %08x %08x\n",
+					    i,
+					    le32_to_cpu(np->tx_ring.orig[i].buf),
+					    le32_to_cpu(np->tx_ring.orig[i].flaglen),
+					    le32_to_cpu(np->tx_ring.orig[i+1].buf),
+					    le32_to_cpu(np->tx_ring.orig[i+1].flaglen),
+					    le32_to_cpu(np->tx_ring.orig[i+2].buf),
+					    le32_to_cpu(np->tx_ring.orig[i+2].flaglen),
+					    le32_to_cpu(np->tx_ring.orig[i+3].buf),
+					    le32_to_cpu(np->tx_ring.orig[i+3].flaglen));
+			} else {
+				netdev_info(dev,
+					    "%03x: %08x %08x %08x "
+					    "// %08x %08x %08x "
+					    "// %08x %08x %08x "
+					    "// %08x %08x %08x\n",
+					    i,
+					    le32_to_cpu(np->tx_ring.ex[i].bufhigh),
+					    le32_to_cpu(np->tx_ring.ex[i].buflow),
+					    le32_to_cpu(np->tx_ring.ex[i].flaglen),
+					    le32_to_cpu(np->tx_ring.ex[i+1].bufhigh),
+					    le32_to_cpu(np->tx_ring.ex[i+1].buflow),
+					    le32_to_cpu(np->tx_ring.ex[i+1].flaglen),
+					    le32_to_cpu(np->tx_ring.ex[i+2].bufhigh),
+					    le32_to_cpu(np->tx_ring.ex[i+2].buflow),
+					    le32_to_cpu(np->tx_ring.ex[i+2].flaglen),
+					    le32_to_cpu(np->tx_ring.ex[i+3].bufhigh),
+					    le32_to_cpu(np->tx_ring.ex[i+3].buflow),
+					    le32_to_cpu(np->tx_ring.ex[i+3].flaglen));
+			}
 		}
 	}
 
@@ -5984,6 +5997,9 @@ module_param(phy_cross, int, 0);
 MODULE_PARM_DESC(phy_cross, "Phy crossover detection for Realtek 8201 phy is enabled by setting to 1 and disabled by setting to 0.");
 module_param(phy_power_down, int, 0);
 MODULE_PARM_DESC(phy_power_down, "Power down phy and disable link when interface is down (1), or leave phy powered up (0).");
+module_param(debug_tx_timeout, bool, 0);
+MODULE_PARM_DESC(debug_tx_timeout,
+		 "Dump tx related registers and ring when tx_timeout happens");
 
 MODULE_AUTHOR("Manfred Spraul <manfred@colorfullife.com>");
 MODULE_DESCRIPTION("Reverse Engineered nForce ethernet driver");
-- 
1.7.3.1


* [PATCH net-next v1 1/9] forcedeth: Add messages to indicate using MSI or MSI-X
From: David Decotigny @ 2011-11-09 22:09 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Mike Ditto, David Decotigny
In-Reply-To: <cover.1320875577.git.david.decotigny@google.com>

From: Mike Ditto <mditto@google.com>

This adds a few debug messages to indicate whether PCIe interrupts are
signaled with MSI or MSI-X.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index d24c45b..39781c1b 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -3711,6 +3711,7 @@ static int nv_request_irq(struct net_device *dev, int intr_test)
 				writel(0, base + NvRegMSIXMap0);
 				writel(0, base + NvRegMSIXMap1);
 			}
+			netdev_info(dev, "MSI-X enabled\n");
 		}
 	}
 	if (ret != 0 && np->msi_flags & NV_MSI_CAPABLE) {
@@ -3732,6 +3733,7 @@ static int nv_request_irq(struct net_device *dev, int intr_test)
 			writel(0, base + NvRegMSIMap1);
 			/* enable msi vector 0 */
 			writel(NVREG_MSI_VECTOR_0_ENABLED, base + NvRegMSIIrqMask);
+			netdev_info(dev, "MSI enabled\n");
 		}
 	}
 	if (ret != 0) {
-- 
1.7.3.1


* [PATCH net-next v1 0/9] forcedeth: stats & debug enhancements
From: David Decotigny @ 2011-11-09 22:09 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, David Decotigny

These changes implement the ndo_get_stats64 API and add a few more
stats and debugging features for forcedeth. They also ensure that
stats updates are correct on SMP systems, both 32-bit and 64-bit.

Regarding the "implement ndo_get_stats64() API" patch, I'm not sure
I'm using the right way to protect the 64b stats. Ideally, I would
like them to be non-blocking (u64_stats_sync.h), but as there are
several sources for updates, I don't think I can do without locking or
per-CPU stats. Would per-CPU stats be better here (note: I expect the
contention on netdev_priv(dev)->stats_lock to be _VERY_ low)?

Tested:
  ~150Mbps incoming TCP, ethtool -S in a loop, x86_64 16-way:
     tx_bytes: 1413863329
     rx_packets: 38918872
     tx_packets: 19828148
     rx_bytes: 57818685991

############################################
# Patch Set Summary:

David Decotigny (6):
  forcedeth: expose module parameters in /sys/module
  forcedeth: stats for rx_packets based on hardware registers
  forcedeth: implement ndo_get_stats64() API
  forcedeth: account for dropped RX frames
  forcedeth: stats updated with a deferrable timer
  forcedeth: whitespace/indentation fixes

Mike Ditto (1):
  forcedeth: Add messages to indicate using MSI or MSI-X

Sameer Nanda (2):
  forcedeth: allow to silence "TX timeout" debug messages
  forcedeth: new ethtool stat counter for TX timeouts

 drivers/net/ethernet/nvidia/forcedeth.c |  271 +++++++++++++++++++++----------
 1 files changed, 184 insertions(+), 87 deletions(-)

-- 
1.7.3.1


* [PATCH V5 net-next] neigh: new unresolved queue limits
From: Eric Dumazet @ 2011-11-09 22:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20111109.162137.808999062815992591.davem@davemloft.net>

On Wednesday 09 November 2011 at 16:21 -0500, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
> 
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 09 Nov 2011 12:14:09 +0100
> > 
> >> unres_qlen is the number of frames we are able to queue per unresolved
> >> neighbour. Its default value (3) was never changed and is responsible
> >> for strange drops, especially if IP fragments are used, or multiple
> >> sessions start in parallel. Even a single tcp flow can hit this limit.
> >  ...
> > 
> > Ok, I've applied this, let's see what happens :-)
> 
> Early answer, build fails.
> 
> Please test build this patch with DECNET enabled and resubmit.  The
> decnet neigh layer still refers to the removed ->queue_len member.
> 
> Thanks.

Ouch, this was fixed on one machine yesterday, but not the other one I
used this morning, sorry.

[PATCH V5 net-next] neigh: new unresolved queue limits

unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. Even a single tcp flow can hit this limit.

$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms

--- 192.168.20.108 ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.322/0.322/0.322/0.000 ms

Increasing unres_qlen can be dangerous, since an attacker might try to
fill many queues with many packets and consume all memory.

Switch to a byte limit (bounding the total truesize of queued skbs), and
allow a default limit of 64 Kbytes per unresolved neighbour. This new
limit seems big, but as a single packet can consume 64 Kbytes, the old
3-packet limit could already pin about 192 Kbytes per neighbour, so the
byte limit reduces the memory window offered to attackers.
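
The eviction policy this implies (drop the oldest queued packets until
the new one fits in the byte budget) can be sketched in plain userspace
C. This is a simplified model, not the kernel code: struct byte_queue,
bq_init() and bq_enqueue() are hypothetical names, a fixed-size ring
stands in for the skb list, and plain integers stand in for
skb->truesize.

```c
#include <assert.h>
#include <stddef.h>

#define QCAP 64 /* capacity of the illustrative ring, in packets */

struct byte_queue {
	unsigned int truesize[QCAP]; /* per-packet memory cost */
	size_t head;                 /* index of the oldest packet */
	size_t count;                /* packets currently queued */
	unsigned int len_bytes;      /* sum of queued truesize values */
	unsigned int limit_bytes;    /* plays the role of queue_len_bytes */
	unsigned int discards;       /* plays the role of unres_discards */
};

static void bq_init(struct byte_queue *q, unsigned int limit_bytes)
{
	assert(q != NULL);
	q->head = 0;
	q->count = 0;
	q->len_bytes = 0;
	q->limit_bytes = limit_bytes;
	q->discards = 0;
}

/* Queue one packet; returns how many old packets were evicted for it. */
static unsigned int bq_enqueue(struct byte_queue *q, unsigned int truesize)
{
	unsigned int dropped = 0;

	assert(q != NULL);
	/* Evict from the head while the newcomer would exceed the budget
	 * (or the ring itself is full, a limit specific to this model). */
	while (q->len_bytes + truesize > q->limit_bytes || q->count == QCAP) {
		if (q->count == 0)
			break; /* nothing left to drop: queue it anyway */
		q->len_bytes -= q->truesize[q->head];
		q->head = (q->head + 1) % QCAP;
		q->count--;
		dropped++;
	}
	q->truesize[(q->head + q->count) % QCAP] = truesize;
	q->count++;
	q->len_bytes += truesize;
	q->discards += dropped;
	return dropped;
}
```

As in the patch, a packet larger than the whole budget is still queued
once the queue has been emptied, so a single oversized packet is never
rejected outright.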

unres_qlen is kept for compatibility, but internally converted to/from
the byte limit.

# cd /proc/sys/net/ipv4/neigh/default/
# grep . unres_qlen*
unres_qlen:31
unres_qlen_bytes:65536
# echo 10 >unres_qlen
# grep . unres_qlen*
unres_qlen:10
unres_qlen_bytes:21540
# echo 30000 >unres_qlen_bytes
# grep . unres_qlen*
unres_qlen:14
unres_qlen_bytes:30000
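
The numbers in this transcript are consistent with a per-frame cost of
2154 bytes (21540 / 10), which is presumably what
SKB_TRUESIZE(ETH_FRAME_LEN) evaluated to on the test machine; the real
value depends on architecture and kernel configuration. Under that
assumption, the conversion can be reproduced in userspace (the function
names are hypothetical):

```c
#include <assert.h>

/* Assumed stand-in for SKB_TRUESIZE(ETH_FRAME_LEN), inferred from the
 * transcript above (21540 bytes for 10 packets). Not a fixed constant. */
#define FRAME_TRUESIZE 2154

/* Integer division rounding up, like the kernel's DIV_ROUND_UP(). */
static int div_round_up(int n, int d)
{
	assert(n >= 0 && d > 0);
	return (n + d - 1) / d;
}

/* What reading unres_qlen reports for a given byte limit. */
static int qlen_bytes_to_packets(int bytes)
{
	return div_round_up(bytes, FRAME_TRUESIZE);
}

/* What writing unres_qlen stores as the byte limit. */
static int qlen_packets_to_bytes(int packets)
{
	assert(packets >= 0);
	return packets * FRAME_TRUESIZE;
}
```

Because the packet count is rounded up, reading unres_qlen after writing
unres_qlen_bytes gives an approximation: 30000 bytes is reported as 14
packets even though 14 full frames would need 30156 bytes.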

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
V5: decnet compile error fix

 Documentation/networking/ip-sysctl.txt |   10 +
 include/linux/neighbour.h              |    1 
 include/net/neighbour.h                |    3 
 net/atm/clip.c                         |    2 
 net/core/neighbour.c                   |  162 +++++++++++++++--------
 net/decnet/dn_neigh.c                  |    2 
 net/ipv4/arp.c                         |    2 
 net/ipv6/ndisc.c                       |    2 
 8 files changed, 128 insertions(+), 56 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index f049a1c..b886706 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -31,6 +31,16 @@ neigh/default/gc_thresh3 - INTEGER
 	when using large numbers of interfaces and when communicating
 	with large numbers of directly-connected peers.
 
+neigh/default/unres_qlen_bytes - INTEGER
+	The maximum number of bytes which may be used by packets
+	queued for each	unresolved address by other network layers.
+	(added in linux 3.3)
+
+neigh/default/unres_qlen - INTEGER
+	The maximum number of packets which may be queued for each
+	unresolved address by other network layers.
+	(deprecated in linux 3.3) : use unres_qlen_bytes instead.
+
 mtu_expires - INTEGER
 	Time, in seconds, that cached PMTU information is kept.
 
diff --git a/include/linux/neighbour.h b/include/linux/neighbour.h
index a7003b7..b188f68 100644
--- a/include/linux/neighbour.h
+++ b/include/linux/neighbour.h
@@ -116,6 +116,7 @@ enum {
 	NDTPA_PROXY_DELAY,		/* u64, msecs */
 	NDTPA_PROXY_QLEN,		/* u32 */
 	NDTPA_LOCKTIME,			/* u64, msecs */
+	NDTPA_QUEUE_LENBYTES,		/* u32 */
 	__NDTPA_MAX
 };
 #define NDTPA_MAX (__NDTPA_MAX - 1)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 2720884..7ae5acf 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -59,7 +59,7 @@ struct neigh_parms {
 	int	reachable_time;
 	int	delay_probe_time;
 
-	int	queue_len;
+	int	queue_len_bytes;
 	int	ucast_probes;
 	int	app_probes;
 	int	mcast_probes;
@@ -99,6 +99,7 @@ struct neighbour {
 	rwlock_t		lock;
 	atomic_t		refcnt;
 	struct sk_buff_head	arp_queue;
+	unsigned int		arp_queue_len_bytes;
 	struct timer_list	timer;
 	unsigned long		used;
 	atomic_t		probes;
diff --git a/net/atm/clip.c b/net/atm/clip.c
index 8523940..32c41b8 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -329,7 +329,7 @@ static struct neigh_table clip_tbl = {
 		.gc_staletime 		= 60 * HZ,
 		.reachable_time 	= 30 * HZ,
 		.delay_probe_time 	= 5 * HZ,
-		.queue_len 		= 3,
+		.queue_len_bytes 	= 64 * 1024,
 		.ucast_probes 		= 3,
 		.mcast_probes 		= 3,
 		.anycast_delay 		= 1 * HZ,
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 039d51e..2684794 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -238,6 +238,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
 				   it to safe state.
 				 */
 				skb_queue_purge(&n->arp_queue);
+				n->arp_queue_len_bytes = 0;
 				n->output = neigh_blackhole;
 				if (n->nud_state & NUD_VALID)
 					n->nud_state = NUD_NOARP;
@@ -702,6 +703,7 @@ void neigh_destroy(struct neighbour *neigh)
 		printk(KERN_WARNING "Impossible event.\n");
 
 	skb_queue_purge(&neigh->arp_queue);
+	neigh->arp_queue_len_bytes = 0;
 
 	dev_put(neigh->dev);
 	neigh_parms_put(neigh->parms);
@@ -842,6 +844,7 @@ static void neigh_invalidate(struct neighbour *neigh)
 		write_lock(&neigh->lock);
 	}
 	skb_queue_purge(&neigh->arp_queue);
+	neigh->arp_queue_len_bytes = 0;
 }
 
 static void neigh_probe(struct neighbour *neigh)
@@ -980,15 +983,20 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 
 	if (neigh->nud_state == NUD_INCOMPLETE) {
 		if (skb) {
-			if (skb_queue_len(&neigh->arp_queue) >=
-			    neigh->parms->queue_len) {
+			while (neigh->arp_queue_len_bytes + skb->truesize >
+			       neigh->parms->queue_len_bytes) {
 				struct sk_buff *buff;
+
 				buff = __skb_dequeue(&neigh->arp_queue);
+				if (!buff)
+					break;
+				neigh->arp_queue_len_bytes -= buff->truesize;
 				kfree_skb(buff);
 				NEIGH_CACHE_STAT_INC(neigh->tbl, unres_discards);
 			}
 			skb_dst_force(skb);
 			__skb_queue_tail(&neigh->arp_queue, skb);
+			neigh->arp_queue_len_bytes += skb->truesize;
 		}
 		rc = 1;
 	}
@@ -1175,6 +1183,7 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
 			write_lock_bh(&neigh->lock);
 		}
 		skb_queue_purge(&neigh->arp_queue);
+		neigh->arp_queue_len_bytes = 0;
 	}
 out:
 	if (update_isrouter) {
@@ -1747,7 +1756,11 @@ static int neightbl_fill_parms(struct sk_buff *skb, struct neigh_parms *parms)
 		NLA_PUT_U32(skb, NDTPA_IFINDEX, parms->dev->ifindex);
 
 	NLA_PUT_U32(skb, NDTPA_REFCNT, atomic_read(&parms->refcnt));
-	NLA_PUT_U32(skb, NDTPA_QUEUE_LEN, parms->queue_len);
+	NLA_PUT_U32(skb, NDTPA_QUEUE_LENBYTES, parms->queue_len_bytes);
+	/* approximative value for deprecated QUEUE_LEN (in packets) */
+	NLA_PUT_U32(skb, NDTPA_QUEUE_LEN,
+		    DIV_ROUND_UP(parms->queue_len_bytes,
+				 SKB_TRUESIZE(ETH_FRAME_LEN)));
 	NLA_PUT_U32(skb, NDTPA_PROXY_QLEN, parms->proxy_qlen);
 	NLA_PUT_U32(skb, NDTPA_APP_PROBES, parms->app_probes);
 	NLA_PUT_U32(skb, NDTPA_UCAST_PROBES, parms->ucast_probes);
@@ -1974,7 +1987,11 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 
 			switch (i) {
 			case NDTPA_QUEUE_LEN:
-				p->queue_len = nla_get_u32(tbp[i]);
+				p->queue_len_bytes = nla_get_u32(tbp[i]) *
+						     SKB_TRUESIZE(ETH_FRAME_LEN);
+				break;
+			case NDTPA_QUEUE_LENBYTES:
+				p->queue_len_bytes = nla_get_u32(tbp[i]);
 				break;
 			case NDTPA_PROXY_QLEN:
 				p->proxy_qlen = nla_get_u32(tbp[i]);
@@ -2635,117 +2652,158 @@ EXPORT_SYMBOL(neigh_app_ns);
 
 #ifdef CONFIG_SYSCTL
 
-#define NEIGH_VARS_MAX 19
+static int proc_unres_qlen(ctl_table *ctl, int write, void __user *buffer,
+			   size_t *lenp, loff_t *ppos)
+{
+	int size, ret;
+	ctl_table tmp = *ctl;
+
+	tmp.data = &size;
+	size = DIV_ROUND_UP(*(int *)ctl->data, SKB_TRUESIZE(ETH_FRAME_LEN));
+	ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+	if (write && !ret)
+		*(int *)ctl->data = size * SKB_TRUESIZE(ETH_FRAME_LEN);
+	return ret;
+}
+
+enum {
+	NEIGH_VAR_MCAST_PROBE,
+	NEIGH_VAR_UCAST_PROBE,
+	NEIGH_VAR_APP_PROBE,
+	NEIGH_VAR_RETRANS_TIME,
+	NEIGH_VAR_BASE_REACHABLE_TIME,
+	NEIGH_VAR_DELAY_PROBE_TIME,
+	NEIGH_VAR_GC_STALETIME,
+	NEIGH_VAR_QUEUE_LEN,
+	NEIGH_VAR_QUEUE_LEN_BYTES,
+	NEIGH_VAR_PROXY_QLEN,
+	NEIGH_VAR_ANYCAST_DELAY,
+	NEIGH_VAR_PROXY_DELAY,
+	NEIGH_VAR_LOCKTIME,
+	NEIGH_VAR_RETRANS_TIME_MS,
+	NEIGH_VAR_BASE_REACHABLE_TIME_MS,
+	NEIGH_VAR_GC_INTERVAL,
+	NEIGH_VAR_GC_THRESH1,
+	NEIGH_VAR_GC_THRESH2,
+	NEIGH_VAR_GC_THRESH3,
+	NEIGH_VAR_MAX
+};
 
 static struct neigh_sysctl_table {
 	struct ctl_table_header *sysctl_header;
-	struct ctl_table neigh_vars[NEIGH_VARS_MAX];
+	struct ctl_table neigh_vars[NEIGH_VAR_MAX + 1];
 	char *dev_name;
 } neigh_sysctl_template __read_mostly = {
 	.neigh_vars = {
-		{
+		[NEIGH_VAR_MCAST_PROBE] = {
 			.procname	= "mcast_solicit",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec,
 		},
-		{
+		[NEIGH_VAR_UCAST_PROBE] = {
 			.procname	= "ucast_solicit",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec,
 		},
-		{
+		[NEIGH_VAR_APP_PROBE] = {
 			.procname	= "app_solicit",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec,
 		},
-		{
+		[NEIGH_VAR_RETRANS_TIME] = {
 			.procname	= "retrans_time",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_userhz_jiffies,
 		},
-		{
+		[NEIGH_VAR_BASE_REACHABLE_TIME] = {
 			.procname	= "base_reachable_time",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_jiffies,
 		},
-		{
+		[NEIGH_VAR_DELAY_PROBE_TIME] = {
 			.procname	= "delay_first_probe_time",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_jiffies,
 		},
-		{
+		[NEIGH_VAR_GC_STALETIME] = {
 			.procname	= "gc_stale_time",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_jiffies,
 		},
-		{
+		[NEIGH_VAR_QUEUE_LEN] = {
 			.procname	= "unres_qlen",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
+			.proc_handler	= proc_unres_qlen,
+		},
+		[NEIGH_VAR_QUEUE_LEN_BYTES] = {
+			.procname	= "unres_qlen_bytes",
+			.maxlen		= sizeof(int),
+			.mode		= 0644,
 			.proc_handler	= proc_dointvec,
 		},
-		{
+		[NEIGH_VAR_PROXY_QLEN] = {
 			.procname	= "proxy_qlen",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec,
 		},
-		{
+		[NEIGH_VAR_ANYCAST_DELAY] = {
 			.procname	= "anycast_delay",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_userhz_jiffies,
 		},
-		{
+		[NEIGH_VAR_PROXY_DELAY] = {
 			.procname	= "proxy_delay",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_userhz_jiffies,
 		},
-		{
+		[NEIGH_VAR_LOCKTIME] = {
 			.procname	= "locktime",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_userhz_jiffies,
 		},
-		{
+		[NEIGH_VAR_RETRANS_TIME_MS] = {
 			.procname	= "retrans_time_ms",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_ms_jiffies,
 		},
-		{
+		[NEIGH_VAR_BASE_REACHABLE_TIME_MS] = {
 			.procname	= "base_reachable_time_ms",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_ms_jiffies,
 		},
-		{
+		[NEIGH_VAR_GC_INTERVAL] = {
 			.procname	= "gc_interval",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec_jiffies,
 		},
-		{
+		[NEIGH_VAR_GC_THRESH1] = {
 			.procname	= "gc_thresh1",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec,
 		},
-		{
+		[NEIGH_VAR_GC_THRESH2] = {
 			.procname	= "gc_thresh2",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
 			.proc_handler	= proc_dointvec,
 		},
-		{
+		[NEIGH_VAR_GC_THRESH3] = {
 			.procname	= "gc_thresh3",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
@@ -2778,47 +2836,49 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p,
 	if (!t)
 		goto err;
 
-	t->neigh_vars[0].data  = &p->mcast_probes;
-	t->neigh_vars[1].data  = &p->ucast_probes;
-	t->neigh_vars[2].data  = &p->app_probes;
-	t->neigh_vars[3].data  = &p->retrans_time;
-	t->neigh_vars[4].data  = &p->base_reachable_time;
-	t->neigh_vars[5].data  = &p->delay_probe_time;
-	t->neigh_vars[6].data  = &p->gc_staletime;
-	t->neigh_vars[7].data  = &p->queue_len;
-	t->neigh_vars[8].data  = &p->proxy_qlen;
-	t->neigh_vars[9].data  = &p->anycast_delay;
-	t->neigh_vars[10].data = &p->proxy_delay;
-	t->neigh_vars[11].data = &p->locktime;
-	t->neigh_vars[12].data  = &p->retrans_time;
-	t->neigh_vars[13].data  = &p->base_reachable_time;
+	t->neigh_vars[NEIGH_VAR_MCAST_PROBE].data  = &p->mcast_probes;
+	t->neigh_vars[NEIGH_VAR_UCAST_PROBE].data  = &p->ucast_probes;
+	t->neigh_vars[NEIGH_VAR_APP_PROBE].data  = &p->app_probes;
+	t->neigh_vars[NEIGH_VAR_RETRANS_TIME].data  = &p->retrans_time;
+	t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME].data  = &p->base_reachable_time;
+	t->neigh_vars[NEIGH_VAR_DELAY_PROBE_TIME].data  = &p->delay_probe_time;
+	t->neigh_vars[NEIGH_VAR_GC_STALETIME].data  = &p->gc_staletime;
+	t->neigh_vars[NEIGH_VAR_QUEUE_LEN].data  = &p->queue_len_bytes;
+	t->neigh_vars[NEIGH_VAR_QUEUE_LEN_BYTES].data  = &p->queue_len_bytes;
+	t->neigh_vars[NEIGH_VAR_PROXY_QLEN].data  = &p->proxy_qlen;
+	t->neigh_vars[NEIGH_VAR_ANYCAST_DELAY].data  = &p->anycast_delay;
+	t->neigh_vars[NEIGH_VAR_PROXY_DELAY].data = &p->proxy_delay;
+	t->neigh_vars[NEIGH_VAR_LOCKTIME].data = &p->locktime;
+	t->neigh_vars[NEIGH_VAR_RETRANS_TIME_MS].data  = &p->retrans_time;
+	t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME_MS].data  = &p->base_reachable_time;
 
 	if (dev) {
 		dev_name_source = dev->name;
 		/* Terminate the table early */
-		memset(&t->neigh_vars[14], 0, sizeof(t->neigh_vars[14]));
+		memset(&t->neigh_vars[NEIGH_VAR_GC_INTERVAL], 0,
+		       sizeof(t->neigh_vars[NEIGH_VAR_GC_INTERVAL]));
 	} else {
 		dev_name_source = neigh_path[NEIGH_CTL_PATH_DEV].procname;
-		t->neigh_vars[14].data = (int *)(p + 1);
-		t->neigh_vars[15].data = (int *)(p + 1) + 1;
-		t->neigh_vars[16].data = (int *)(p + 1) + 2;
-		t->neigh_vars[17].data = (int *)(p + 1) + 3;
+		t->neigh_vars[NEIGH_VAR_GC_INTERVAL].data = (int *)(p + 1);
+		t->neigh_vars[NEIGH_VAR_GC_THRESH1].data = (int *)(p + 1) + 1;
+		t->neigh_vars[NEIGH_VAR_GC_THRESH2].data = (int *)(p + 1) + 2;
+		t->neigh_vars[NEIGH_VAR_GC_THRESH3].data = (int *)(p + 1) + 3;
 	}
 
 
 	if (handler) {
 		/* RetransTime */
-		t->neigh_vars[3].proc_handler = handler;
-		t->neigh_vars[3].extra1 = dev;
+		t->neigh_vars[NEIGH_VAR_RETRANS_TIME].proc_handler = handler;
+		t->neigh_vars[NEIGH_VAR_RETRANS_TIME].extra1 = dev;
 		/* ReachableTime */
-		t->neigh_vars[4].proc_handler = handler;
-		t->neigh_vars[4].extra1 = dev;
+		t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME].proc_handler = handler;
+		t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME].extra1 = dev;
 		/* RetransTime (in milliseconds)*/
-		t->neigh_vars[12].proc_handler = handler;
-		t->neigh_vars[12].extra1 = dev;
+		t->neigh_vars[NEIGH_VAR_RETRANS_TIME_MS].proc_handler = handler;
+		t->neigh_vars[NEIGH_VAR_RETRANS_TIME_MS].extra1 = dev;
 		/* ReachableTime (in milliseconds) */
-		t->neigh_vars[13].proc_handler = handler;
-		t->neigh_vars[13].extra1 = dev;
+		t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME_MS].proc_handler = handler;
+		t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME_MS].extra1 = dev;
 	}
 
 	t->dev_name = kstrdup(dev_name_source, GFP_KERNEL);
diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
index 7f0eb08..9e73aa1 100644
--- a/net/decnet/dn_neigh.c
+++ b/net/decnet/dn_neigh.c
@@ -107,7 +107,7 @@ struct neigh_table dn_neigh_table = {
 		.gc_staletime =	60 * HZ,
 		.reachable_time =		30 * HZ,
 		.delay_probe_time =	5 * HZ,
-		.queue_len =		3,
+		.queue_len_bytes =	64*1024,
 		.ucast_probes =	0,
 		.app_probes =		0,
 		.mcast_probes =	0,
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 96a164a..d732827 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -177,7 +177,7 @@ struct neigh_table arp_tbl = {
 		.gc_staletime		= 60 * HZ,
 		.reachable_time		= 30 * HZ,
 		.delay_probe_time	= 5 * HZ,
-		.queue_len		= 3,
+		.queue_len_bytes	= 64*1024,
 		.ucast_probes		= 3,
 		.mcast_probes		= 3,
 		.anycast_delay		= 1 * HZ,
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 44e5b7f..4a20982 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -141,7 +141,7 @@ struct neigh_table nd_tbl = {
 		.gc_staletime		= 60 * HZ,
 		.reachable_time		= ND_REACHABLE_TIME,
 		.delay_probe_time	= 5 * HZ,
-		.queue_len		= 3,
+		.queue_len_bytes	= 64*1024,
 		.ucast_probes		= 3,
 		.mcast_probes		= 3,
 		.anycast_delay		= 1 * HZ,


* Re: [PATCH] libteam: fix function names to include 'bond'
From: Jiri Pirko @ 2011-11-09 22:04 UTC (permalink / raw)
  To: Flavio Leitner
  Cc: netdev, davem, eric.dumazet, bhutchings, shemminger, fubar, andy,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, benjamin.poirier,
	jzupka
In-Reply-To: <1320862846-6000-1-git-send-email-fbl@redhat.com>


Hi Flavio.

Thomas included these 2 functions in latest libnl upstream. Bond
versions wouldn't work because of "bond" type check.

Jirka

Wed, Nov 09, 2011 at 07:20:46PM CET, fbl@redhat.com wrote:
>Signed-off-by: Flavio Leitner <fbl@redhat.com>
>---
>
> I found those while trying to test V6 patch using latest
> libteam (commit 5e9790816606a6dd4e7f6f32c0bb0c45e5d13b31)
> and libnl-3.2.2 (last stable).
> thanks,
> fbl
>
> lib/libteam.c |    4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/lib/libteam.c b/lib/libteam.c
>index feb13b6..e7ae6b0 100644
>--- a/lib/libteam.c
>+++ b/lib/libteam.c
>@@ -1331,7 +1331,7 @@ int team_port_add(struct team_handle *th, uint32_t port_ifindex)
> {
> 	int err;
> 
>-	err = rtnl_link_enslave_ifindex(th->nl_cli.sock, th->ifindex,
>+	err = rtnl_link_bond_enslave_ifindex(th->nl_cli.sock, th->ifindex,
> 					port_ifindex);
> 	return -nl2syserr(err);
> }
>@@ -1350,6 +1350,6 @@ int team_port_remove(struct team_handle *th, uint32_t port_ifindex)
> {
> 	int err;
> 
>-	err = rtnl_link_release_ifindex(th->nl_cli.sock, port_ifindex);
>+	err = rtnl_link_bond_release_ifindex(th->nl_cli.sock, port_ifindex);
> 	return -nl2syserr(err);
> }
>-- 
>1.7.6
>


* Re: [PATCH net-next] ipv4: PKTINFO doesnt need dst reference
From: Eric Dumazet @ 2011-11-09 22:03 UTC (permalink / raw)
  To: David Miller; +Cc: bhutchings, pstaszewski, netdev
In-Reply-To: <20111109.163708.2156133928191684256.davem@davemloft.net>

Le mercredi 09 novembre 2011 à 16:37 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 09 Nov 2011 18:24:35 +0100
> 
> > [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
> > 
> > When a socket uses IP_PKTINFO notifications, we currently force a dst
> > reference for each received skb. Reader has to access dst to get needed
> > information (rt_iif & rt_spec_dst) and must release dst reference.
> > 
> > We also forced a dst reference if skb was put in socket backlog, even
> > without IP_PKTINFO handling. This happens under stress/load.
> > 
> > We can instead store the needed information in skb->cb[], so that only
> > softirq handler really access dst, improving cache hit ratios.
> > 
> > This removes two atomic operations per packet, and false sharing as
> > well.
> > 
> > On a benchmark using a mono threaded receiver (doing only recvmsg()
> > calls), I can reach 720.000 pps instead of 570.000 pps.
> > 
> > IP_PKTINFO is typically used by DNS servers, and any multihomed aware
> > UDP application.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Looks good, if it compiles I'll push it out to net-next :-)

Arg :(  I cross my fingers :)

BTW, on my bnx2x adapter, even small UDP frames use more than PAGE_SIZE
bytes :

skb->truesize=4352 len=26 (payload only)

Truesize being now more precise, we hit badly the shared
udp_memory_allocated, even with single frames.

I wonder if we shouldn't increase SK_MEM_QUANTUM a bit to avoid
ping/pong...

-#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
+#define SK_MEM_QUANTUM ((int)PAGE_SIZE * 2)
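
The rounding behind those numbers can be sketched as below. This is only
an illustration, assuming PAGE_SIZE is 4096 and that memory is charged in
whole quanta, rounded up; mem_quanta() and mem_slack() are hypothetical
helper names, and the sketch does not model sk_forward_alloc or reclaim.

```c
#include <stddef.h>

/* Number of accounting quanta needed to cover `amt` bytes, rounded up. */
static size_t mem_quanta(size_t amt, size_t quantum)
{
	return (amt + quantum - 1) / quantum;
}

/* Bytes charged beyond what the frame itself occupies. */
static size_t mem_slack(size_t amt, size_t quantum)
{
	return mem_quanta(amt, quantum) * quantum - amt;
}
```

With truesize=4352, a 4096-byte quantum needs two quanta for a single
frame, while an 8192-byte quantum needs one; in both cases the leftover
(3840 bytes) is smaller than the next frame of the same size.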


* [PATCH][RESEND] net/usb: Misc. fixes for the LG-VL600 LTE USB modem
From: Mark Kamichoff @ 2011-11-09 21:48 UTC (permalink / raw)
  To: oliver, gregkh; +Cc: linux-usb, netdev, linux-kernel, Mark Kamichoff

Add checking for valid magic values (needed for stability in the event
corrupted packets are received) and remove some other unneeded checks.
Also, fix flagging device as WWAN (Bugzilla bug #39952).

Signed-off-by: Mark Kamichoff <prox@prolixium.com>
---
 drivers/net/usb/cdc_ether.c |    2 +-
 drivers/net/usb/lg-vl600.c  |   25 +++++++++++--------------
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index c924ea2..99ed6eb 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -567,7 +567,7 @@ static const struct usb_device_id	products [] = {
 {
 	USB_DEVICE_AND_INTERFACE_INFO(0x1004, 0x61aa, USB_CLASS_COMM,
 			USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
-	.driver_info = (unsigned long)&wwan_info,
+	.driver_info = 0,
 },
 
 /*
diff --git a/drivers/net/usb/lg-vl600.c b/drivers/net/usb/lg-vl600.c
index d43db32..9c26c63 100644
--- a/drivers/net/usb/lg-vl600.c
+++ b/drivers/net/usb/lg-vl600.c
@@ -144,10 +144,11 @@ static int vl600_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 	}
 
 	frame = (struct vl600_frame_hdr *) buf->data;
-	/* NOTE: Should check that frame->magic == 0x53544448?
-	 * Otherwise if we receive garbage at the beginning of the frame
-	 * we may end up allocating a huge buffer and saving all the
-	 * future incoming data into it.  */
+	/* Yes, check that frame->magic == 0x53544448 (or 0x44544d48),
+	 * otherwise we may run out of memory w/a bad packet */
+	if (ntohl(frame->magic) != 0x53544448 &&
+			ntohl(frame->magic) != 0x44544d48)
+		goto error;
 
 	if (buf->len < sizeof(*frame) ||
 			buf->len != le32_to_cpup(&frame->len)) {
@@ -296,6 +297,11 @@ encapsulate:
 	 * overwrite the remaining fields.
 	 */
 	packet = (struct vl600_pkt_hdr *) skb->data;
+	/* The VL600 wants IPv6 packets to have an IPv4 ethertype
+	 * Since this modem only supports IPv4 and IPv6, just set all
+	 * frames to 0x0800 (ETH_P_IP)
+	 */
+	packet->h_proto = htons(ETH_P_IP);
 	memset(&packet->dummy, 0, sizeof(packet->dummy));
 	packet->len = cpu_to_le32(orig_len);
 
@@ -308,21 +314,12 @@ encapsulate:
 	if (skb->len < full_len) /* Pad */
 		skb_put(skb, full_len - skb->len);
 
-	/* The VL600 wants IPv6 packets to have an IPv4 ethertype
-	 * Check if this is an IPv6 packet, and set the ethertype
-	 * to 0x800
-	 */
-	if ((skb->data[sizeof(struct vl600_pkt_hdr *) + 0x22] & 0xf0) == 0x60) {
-		skb->data[sizeof(struct vl600_pkt_hdr *) + 0x20] = 0x08;
-		skb->data[sizeof(struct vl600_pkt_hdr *) + 0x21] = 0;
-	}
-
 	return skb;
 }
 
 static const struct driver_info	vl600_info = {
 	.description	= "LG VL600 modem",
-	.flags		= FLAG_ETHER | FLAG_RX_ASSEMBLE,
+	.flags		= FLAG_RX_ASSEMBLE | FLAG_WWAN,
 	.bind		= vl600_bind,
 	.unbind		= vl600_unbind,
 	.status		= usbnet_cdc_status,
-- 
1.7.5.4
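
For readers who want to see the magic check in isolation, here is a
minimal userspace sketch. It assumes the caller passes a pointer to the
four magic bytes as they appear on the wire (big-endian); magic_ok() and
load_be32() are hypothetical names, not functions from the driver, and
the position of the magic field inside the frame header is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The two values accepted by the patch above. */
#define MAGIC_STDH 0x53544448u	/* bytes 'S' 'T' 'D' 'H' */
#define MAGIC_DTMH 0x44544d48u	/* bytes 'D' 'T' 'M' 'H' */

/* Read a big-endian 32-bit value without relying on host byte order. */
static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return true only if at least 4 bytes are present and they match. */
static bool magic_ok(const uint8_t *p, size_t len)
{
	uint32_t magic;

	if (len < 4)
		return false;	/* too short to contain a magic value */

	magic = load_be32(p);
	return magic == MAGIC_STDH || magic == MAGIC_DTMH;
}
```

The sketch checks the length before reading the magic bytes, so a
truncated buffer is rejected without touching memory past its end.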


* Re: [PATCH net-next] ipv4: PKTINFO doesnt need dst reference
From: David Miller @ 2011-11-09 21:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: bhutchings, pstaszewski, netdev
In-Reply-To: <1320859475.3916.21.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 09 Nov 2011 18:24:35 +0100

> [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
> 
> When a socket uses IP_PKTINFO notifications, we currently force a dst
> reference for each received skb. Reader has to access dst to get needed
> information (rt_iif & rt_spec_dst) and must release dst reference.
> 
> We also forced a dst reference if skb was put in socket backlog, even
> without IP_PKTINFO handling. This happens under stress/load.
> 
> We can instead store the needed information in skb->cb[], so that only
> softirq handler really access dst, improving cache hit ratios.
> 
> This removes two atomic operations per packet, and false sharing as
> well.
> 
> On a benchmark using a mono threaded receiver (doing only recvmsg()
> calls), I can reach 720.000 pps instead of 570.000 pps.
> 
> IP_PKTINFO is typically used by DNS servers, and any multihomed aware
> UDP application.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, if it compiles I'll push it out to net-next :-)


* Re: pull request: wireless 2011-11-09
From: David Miller @ 2011-11-09 21:35 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20111109212505.GC10712@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Wed, 9 Nov 2011 16:25:05 -0500

> On Wed, Nov 09, 2011 at 04:20:15PM -0500, David Miller wrote:
>> From: "John W. Linville" <linville@tuxdriver.com>
>> Date: Wed, 9 Nov 2011 14:35:04 -0500
>> 
>> > Regarding the Bluetooth fixes, Gustavo says this:
>> > 
>> > Please let me know if there are problems!
>> 
>> Gustavo says what? :-)
> 
> Hmmm...obviously not my best day...
> 
> Gustavo says:
> 
> "3 more fixes to linux 3.2. One is a USB device id addition and the other two
> patches combined fix a connection issue. The first one from Arek Lichwa
> reverts the wrong fix and a second commit from Andrzej Kaczmarek fixes the issue
> properly."
> 
> Hth! :-)

That's better :-)

Pulled, thanks a lot John!


* [065/264] MAINTANERS: update Qualcomm Atheros addresses
From: Greg KH @ 2011-11-09 21:31 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: torvalds, akpm, alan, stable, netdev, jouni, yangjie, vthiagar,
	senthilb, Luis R. Rodriguez, John W. Linville
In-Reply-To: <20111109213508.GA3476@kroah.com>

3.1-stable review patch.  If anyone has any objections, please let me know.

------------------

From: "Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>

commit fe8e084455f273b32cc57a5fbaf6c22ef984d657 upstream.

Qualcomm ate up Atheros, all of the old e-mail addresses
no longer work and e-mails sent to it will bounce. Update
the addresses to the new shiny Qualcomm Atheros (QCA) ones.

Cc: stable@kernel.org
Cc: netdev@vger.kernel.org
Cc: jouni@qca.qualcomm.com
Cc: yangjie@qca.qualcomm.com
Cc: vthiagar@qca.qualcomm.com
Cc: senthilb@qca.qualcomm.com
Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 MAINTAINERS |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1230,7 +1230,7 @@ F:	Documentation/aoe/
 F:	drivers/block/aoe/
 
 ATHEROS ATH GENERIC UTILITIES
-M:	"Luis R. Rodriguez" <lrodriguez@atheros.com>
+M:	"Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>
 L:	linux-wireless@vger.kernel.org
 S:	Supported
 F:	drivers/net/wireless/ath/*
@@ -1238,7 +1238,7 @@ F:	drivers/net/wireless/ath/*
 ATHEROS ATH5K WIRELESS DRIVER
 M:	Jiri Slaby <jirislaby@gmail.com>
 M:	Nick Kossifidis <mickflemm@gmail.com>
-M:	"Luis R. Rodriguez" <lrodriguez@atheros.com>
+M:	"Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>
 M:	Bob Copeland <me@bobcopeland.com>
 L:	linux-wireless@vger.kernel.org
 L:	ath5k-devel@lists.ath5k.org
@@ -1247,10 +1247,10 @@ S:	Maintained
 F:	drivers/net/wireless/ath/ath5k/
 
 ATHEROS ATH9K WIRELESS DRIVER
-M:	"Luis R. Rodriguez" <lrodriguez@atheros.com>
-M:	Jouni Malinen <jmalinen@atheros.com>
-M:	Vasanthakumar Thiagarajan <vasanth@atheros.com>
-M:	Senthil Balasubramanian <senthilkumar@atheros.com>
+M:	"Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>
+M:	Jouni Malinen <jouni@qca.qualcomm.com>
+M:	Vasanthakumar Thiagarajan <vthiagar@qca.qualcomm.com>
+M:	Senthil Balasubramanian <senthilb@qca.qualcomm.com>
 L:	linux-wireless@vger.kernel.org
 L:	ath9k-devel@lists.ath9k.org
 W:	http://wireless.kernel.org/en/users/Drivers/ath9k


* [057/262] MAINTANERS: update Qualcomm Atheros addresses
From: Greg KH @ 2011-11-09 21:26 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: torvalds, akpm, alan, stable, netdev, jouni, yangjie, vthiagar,
	senthilb, Luis R. Rodriguez, John W. Linville
In-Reply-To: <20111109212847.GA20838@kroah.com>

3.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: "Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>

commit fe8e084455f273b32cc57a5fbaf6c22ef984d657 upstream.

Qualcomm ate up Atheros, all of the old e-mail addresses
no longer work and e-mails sent to it will bounce. Update
the addresses to the new shiny Qualcomm Atheros (QCA) ones.

Cc: stable@kernel.org
Cc: netdev@vger.kernel.org
Cc: jouni@qca.qualcomm.com
Cc: yangjie@qca.qualcomm.com
Cc: vthiagar@qca.qualcomm.com
Cc: senthilb@qca.qualcomm.com
Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 MAINTAINERS |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1221,7 +1221,7 @@ F:	Documentation/aoe/
 F:	drivers/block/aoe/
 
 ATHEROS ATH GENERIC UTILITIES
-M:	"Luis R. Rodriguez" <lrodriguez@atheros.com>
+M:	"Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>
 L:	linux-wireless@vger.kernel.org
 S:	Supported
 F:	drivers/net/wireless/ath/*
@@ -1229,7 +1229,7 @@ F:	drivers/net/wireless/ath/*
 ATHEROS ATH5K WIRELESS DRIVER
 M:	Jiri Slaby <jirislaby@gmail.com>
 M:	Nick Kossifidis <mickflemm@gmail.com>
-M:	"Luis R. Rodriguez" <lrodriguez@atheros.com>
+M:	"Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>
 M:	Bob Copeland <me@bobcopeland.com>
 L:	linux-wireless@vger.kernel.org
 L:	ath5k-devel@lists.ath5k.org
@@ -1238,10 +1238,10 @@ S:	Maintained
 F:	drivers/net/wireless/ath/ath5k/
 
 ATHEROS ATH9K WIRELESS DRIVER
-M:	"Luis R. Rodriguez" <lrodriguez@atheros.com>
-M:	Jouni Malinen <jmalinen@atheros.com>
-M:	Vasanthakumar Thiagarajan <vasanth@atheros.com>
-M:	Senthil Balasubramanian <senthilkumar@atheros.com>
+M:	"Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>
+M:	Jouni Malinen <jouni@qca.qualcomm.com>
+M:	Vasanthakumar Thiagarajan <vthiagar@qca.qualcomm.com>
+M:	Senthil Balasubramanian <senthilb@qca.qualcomm.com>
 L:	linux-wireless@vger.kernel.org
 L:	ath9k-devel@lists.ath9k.org
 W:	http://wireless.kernel.org/en/users/Drivers/ath9k
@@ -1269,7 +1269,7 @@ F:	drivers/input/misc/ati_remote2.c
 ATLX ETHERNET DRIVERS
 M:	Jay Cliburn <jcliburn@gmail.com>
 M:	Chris Snook <chris.snook@gmail.com>
-M:	Jie Yang <jie.yang@atheros.com>
+M:	Jie Yang <yangjie@qca.qualcomm.com>
 L:	netdev@vger.kernel.org
 W:	http://sourceforge.net/projects/atl1
 W:	http://atl1.sourceforge.net


* Re: pull request: wireless 2011-11-09
From: John W. Linville @ 2011-11-09 21:25 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20111109.162015.2261725491621555303.davem@davemloft.net>


On Wed, Nov 09, 2011 at 04:20:15PM -0500, David Miller wrote:
> From: "John W. Linville" <linville@tuxdriver.com>
> Date: Wed, 9 Nov 2011 14:35:04 -0500
> 
> > Regarding the Bluetooth fixes, Gustavo says this:
> > 
> > Please let me know if there are problems!
> 
> Gustavo says what? :-)

Hmmm...obviously not my best day...

Gustavo says:

"3 more fixes to linux 3.2. One is a USB device id addition and the other two
patches combined fix a connection issue. The first one from Arek Lichwa
reverts the wrong fix and a second commit from Andrzej Kaczmarek fixes the issue
properly."

Hth! :-)

John

P.S.  Pull request head is commit e29ec6247053ad60bd0b36f155b647364a615097.
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.



* Re: [PATCH V4 net-next] neigh: new unresolved queue limits
From: David Miller @ 2011-11-09 21:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <20111109.161644.505896539772671525.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 09 Nov 2011 12:14:09 +0100
> 
>> unres_qlen is the number of frames we are able to queue per unresolved
>> neighbour. Its default value (3) was never changed and is responsible
>> for strange drops, especially if IP fragments are used, or multiple
>> sessions start in parallel. Even a single tcp flow can hit this limit.
>  ...
> 
> Ok, I've applied this, let's see what happens :-)

Early answer, build fails.

Please test build this patch with DECNET enabled and resubmit.  The
decnet neigh layer still refers to the removed ->queue_len member.

Thanks.


* Re: pull request: wireless 2011-11-09
From: David Miller @ 2011-11-09 21:20 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20111109193504.GA32400@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Wed, 9 Nov 2011 14:35:04 -0500

> Regarding the Bluetooth fixes, Gustavo says this:
> 
> Please let me know if there are problems!

Gustavo says what? :-)


* Re: [PATCH V4 net-next] neigh: new unresolved queue limits
From: David Miller @ 2011-11-09 21:16 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1320837249.2315.26.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 09 Nov 2011 12:14:09 +0100

> unres_qlen is the number of frames we are able to queue per unresolved
> neighbour. Its default value (3) was never changed and is responsible
> for strange drops, especially if IP fragments are used, or multiple
> sessions start in parallel. Even a single tcp flow can hit this limit.
 ...

Ok, I've applied this, let's see what happens :-)

Thanks!
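
To make the difference between a frame-count limit and a byte limit
concrete, here is a small userspace sketch. It assumes each queued packet
is charged by its truesize and that the oldest packets are dropped until
the new one fits; struct unres_queue, unres_enqueue() and the MAX_PKTS
ring capacity are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

#define QUEUE_LEN_BYTES	(64 * 1024)	/* default from the patch */
#define MAX_PKTS	64		/* hypothetical ring capacity */

struct unres_queue {
	size_t truesize[MAX_PKTS];	/* per-packet cost, oldest at head */
	size_t head;			/* index of the oldest packet */
	size_t count;			/* packets currently queued */
	size_t bytes;			/* sum of queued truesize values */
};

/*
 * Queue one packet. While the byte budget (or the sketch-only ring
 * capacity) would be exceeded, drop the oldest packet first.
 * Returns the number of packets dropped to make room.
 */
static unsigned int unres_enqueue(struct unres_queue *q, size_t truesize)
{
	unsigned int dropped = 0;

	while (q->count > 0 &&
	       (q->bytes + truesize > QUEUE_LEN_BYTES ||
		q->count == MAX_PKTS)) {
		q->bytes -= q->truesize[q->head];
		q->head = (q->head + 1) % MAX_PKTS;
		q->count--;
		dropped++;
	}

	q->truesize[(q->head + q->count) % MAX_PKTS] = truesize;
	q->count++;
	q->bytes += truesize;
	return dropped;
}
```

Under these assumptions, packets with a truesize of 4352 bytes fill the
64 KB budget after 15 entries, compared with the old fixed limit of 3
frames.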


* Re: [PATCH] net: drivers/net/hippi/Kconfig should be sourced
From: David Miller @ 2011-11-09 21:15 UTC (permalink / raw)
  To: pebolle; +Cc: netdev, linux-kernel, jeffrey.t.kirsher
In-Reply-To: <1320872916.27598.49.camel@x61.thuisdomein>

From: Paul Bolle <pebolle@tiscali.nl>
Date: Wed, 09 Nov 2011 22:08:36 +0100

> Would it be better if I hadn't submitted this as a patch

Yes, because eventually someone who actually cared about the
situation would submit a properly tested patch.

If nobody else notices the problem, that's fine too, because it means
nobody else cares about whether HIPPI is missing from the build or
not.


* Re: [PATCH] iMX28 Ethernet driver fix
From: David Miller @ 2011-11-09 21:11 UTC (permalink / raw)
  To: phorton; +Cc: netdev, linux-arm-kernel
In-Reply-To: <20111109124411.GA31046@axolotl.localnet>

From: Peter Horton <phorton@bitbox.co.uk>
Date: Wed, 9 Nov 2011 12:44:11 +0000

> -	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
> +	if ((((unsigned long) bufaddr) & FEC_ALIGNMENT) ||
> +		((id_entry->driver_data & FEC_QUIRK_SWAP_FRAME) &&
> +		skb_cloned(skb)))
> +	{

Please format this condition properly:

	if (A ||
	    (B &&
             C)) {
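
Applied to the quoted hunk, that layout might look like the sketch
below. It is a standalone illustration only: needs_bounce() is a
hypothetical wrapper, the two macro values are assumptions, and the
skb_cloned(skb) call is replaced by a plain boolean so the snippet
compiles outside the kernel.

```c
#include <stdbool.h>

#define FEC_ALIGNMENT		0xf		/* assumed value */
#define FEC_QUIRK_SWAP_FRAME	(1 << 1)	/* assumed value */

static bool needs_bounce(unsigned long bufaddr, unsigned long driver_data,
			 bool cloned)
{
	/* Continuation lines align under the operand they belong to,
	 * and the opening brace stays on the last line of the condition.
	 */
	if ((bufaddr & FEC_ALIGNMENT) ||
	    ((driver_data & FEC_QUIRK_SWAP_FRAME) &&
	     cloned)) {
		return true;
	}

	return false;
}
```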


* Re: net: Add network priority cgroup
From: John Fastabend @ 2011-11-09 21:10 UTC (permalink / raw)
  To: Dave Taht
  Cc: Neil Horman, netdev@vger.kernel.org, Love, Robert W,
	David S. Miller
In-Reply-To: <CAA93jw7G90kGBu2JnaEdWv3J0OSPO8eg55TMrZZCWPD-pdRf_g@mail.gmail.com>

On 11/9/2011 12:27 PM, Dave Taht wrote:
> On Wed, Nov 9, 2011 at 8:57 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
>>
>> Data Center Bridging environments are currently somewhat limited in their
>> ability to provide a general mechanism for controlling traffic priority.
> 
> 
> 
>>
>> Specifically they are unable to administratively control the priority at which
>> various types of network traffic are sent.
>>
>> Currently, the only ways to set the priority of a network buffer are:
>>
>> 1) Through the use of the SO_PRIORITY socket option
>> 2) By using low level hooks, like a tc action
>>
> 2), above is a little vague.
> 
> There are dozens of ways to control the relative priorities of network
> streams in addition to priority notably diffserv, various forms of
> fair queuing, and active queue management techniques like RED, Blue,
> etc.
> 

Maybe dozens of ways to control traffic using various combinations of
qdiscs, but I think for classification we have a small set of reasonably
defined mechanisms.

 - tc filter/action
 - netfilter infrastructure, think CLASSIFY (iptables/ebtables)
 - socket options SO_PRIORITY and IP_TOS

By the way, setting the TOS bits also sets the sk->priority. What other
classifications did I miss?

> The priority field within the Linux skb is used for multiple purposes
> - in addition to SO_PRIORITY it is also used for queue selection
> within tc for a variety of queuing disciplines. Certain bands are
> reserved for vlan and wireless queueing, (these features are rarely
> used)
> 
> Twiddling with it on one level or creating a controller for it can and
> will still be messed up by attempts to sanely use it elsewhere in the
> stack.
> 

The skb->priority is used by some qdiscs and also with vlan egress_maps.

Without knowing the wireless situation it seems you can either not manage
priority over wireless links if this is a problem or perhaps we can clean
up the wireless queueing and integrate it with the appropriate qdisc.

Could the wireless skb->priority usage be tied into mqprio?

>>
>> (1) is difficult from an administrative perspective because it requires that the
>> application to be coded to not just assume the default priority is sufficient,
>> and must expose an administrative interface to allow priority adjustment.  Such
>> a solution is not scalable in a DCB environment
>>
> 
> Nor any other complex environment. Or even a simple one.
> 
>>
>> (2) is also difficult, as it requires constant administrative oversight of
>> applications so as to build appropriate rules to match traffic belonging to
> 
> Yes, your description of option 2, as simplified above, is difficult.
> 
> However certain algorithms are intended to improve fairness between
> flows that do not require as much oversight and classification.
> 
> However, even when RED or a newer queue management algorithm such as
> QFQ or DRR is applied, classes of traffic exist that benefit from more
> specialized diffserv or diffserv-like behavior.
> 
> However, the evidence for something more complex in server
> environments than simple priority management is compelling at this
> point.
> 
>> various classes, so that priority can be appropriately set. It is further
>> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
>> only run after a root qdisc has been selected (DCB enabled hardware may reserve
>> hw queues for various traffic classes and needs the priority to be set prior to
>> selecting the root qdisc)
>>
> 
> Multiple applications (somewhat) rightly set priorities according to
> their view of the world.
> 
> background traffic and immediate traffic often set the appropriate
> diffserv bits, other traffic can do the same, and at least a few apps
> set the priority field also in the hope that that will do some good,
> and perhaps more should.

These patches do not overwrite existing priorities. So applications
that manage the priority can continue to do this.

> 
> 
>>
>> I've discussed various solutions with John Fastabend, and we saw a cgroup as
>> being a good general solution to this problem.  The network priority cgroup
> 
> Not if you are wanting to apply queue management further down the stack!
> 

I don't follow. Here you're saying that you have a queue management that the
QOS layer is unaware of? OK, so any qdisc or priority mechanism is going to
interfere with 'further down the stack'.

>>
>> allows for a per-interface priority map to be built per cgroup.  Any traffic
>> originating from an application in a cgroup, that does not explicitly set its
>> priority with SO_PRIORITY will have its priority assigned to the value
>> designated for that group on that interface.
> 
>> This allows a user space daemon,
>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>> based on the APP_TLV value received and administratively assign applications to
>> that priority using the existing cgroup utility infrastructure.
> 
> I would like it if the many uses of the priority field were reduced to
> one use per semantic grouping.
> 
> You are adding a controller to something that is already
> ill-controlled and ill-defined, overly overloaded and both under and
> over used, to be managed in userspace by code to be designed later, and
> then re-mapped once it exits a vm into another host or hardware queue
> management system which may or may not share similar assumptions.
> 

I don't think it's ill-defined or ill-controlled. The priority can be
set by well defined mechanisms. We provide another mechanism to set
the priority without having to modify existing applications and a
mechanism for administrators/tools to set it dynamically.

Overloaded, perhaps; the egress_map is a bit of an overloading of this.
But it's existed for a long time.

IMHO hardware queue management systems should be integrated into the
qdisc layer if possible. DCB enabled hardware had similar problems
trying to do hardware queue management without involving the OS and
had to add hacks into select_queue() or hard coded traffic types
into the base drivers to work around this. 'mqprio' and dev support
for traffic classes was my take at a generic mechanism to expose this
to the OS.


> Don't get me wrong, I LIKE the controller idea, but think the priority
> field needs to be un-overloaded first to avoid ill-effects elsewhere
> in the users of the down-stream subsystems.
> 

But doesn't this help the down-stream subsystems as well? The priority
will eventually be pushed down the stack.

>> Tested by John and myself, with good results
> 
> With what?
> 

I tested this with mqprio using the net_prio cgroups to set the priority
and using mqprio to bind hardware queue sets to each priority. Then
I used netperf, ping, and the cg* tools to test I/O.

As a side note I expect you could also use this in conjunction with
the vlan egress_map to push applications onto 802.1Q priorities.

>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>> CC: John Fastabend <john.r.fastabend@intel.com>
>> CC: Robert Love <robert.w.love@intel.com>
>> CC: "David S. Miller" <davem@davemloft.net>
> 
> 
> 
> --
> Dave Täht
> SKYPE: davetaht
> 
> http://www.bufferbloat.net


* Re: net: Add network priority cgroup
From: Neil Horman @ 2011-11-09 21:09 UTC (permalink / raw)
  To: Dave Taht; +Cc: netdev, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <CAA93jw7G90kGBu2JnaEdWv3J0OSPO8eg55TMrZZCWPD-pdRf_g@mail.gmail.com>

On Wed, Nov 09, 2011 at 09:27:08PM +0100, Dave Taht wrote:
> On Wed, Nov 9, 2011 at 8:57 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > Data Center Bridging environments are currently somewhat limited in their
> > ability to provide a general mechanism for controlling traffic priority.
> 
> 
> 
> >
> > Specifically they are unable to administratively control the priority at which
> > various types of network traffic are sent.
> >
> > Currently, the only ways to set the priority of a network buffer are:
> >
> > 1) Through the use of the SO_PRIORITY socket option
> > 2) By using low level hooks, like a tc action
> >
> 2), above is a little vague.
> 
> There are dozens of ways to control the relative priorities of network
> streams in addition to priority notably diffserv, various forms of
> fair queuing, and active queue management techniques like RED, Blue,
> etc.
> 
I'm referring explicitly to skb->priority here.  Sorry if I wasn't clear.

> The priority field within the Linux skb is used for multiple purposes
> - in addition to SO_PRIORITY it is also used for queue selection
> within tc for a variety of queuing disciplines. Certain bands are
> reserved for vlan and wireless queueing, (these features are rarely
> used)
> 
Yes.

> Twiddling with it on one level or creating a controller for it can and
> will still be messed up by attempts to sanely use it elsewhere in the
> stack.
> 
Why?  It's not like it can't already be twiddled with via SO_PRIORITY.  This does
exactly the same thing, it just lets us do it via an administrative interface
rather than a programmatic one.  I don't disagree that the use of skb->priority
is complex, but this doesn't add any complexity that isn't already there.  It
just gives us a general way to assign priorities for those that know how to use
it consistently, in a way that doesn't require application modification.  That's
something that DCB needs.

> >
> > (1) is difficult from an administrative perspective because it requires that the
> > application to be coded to not just assume the default priority is sufficient,
> > and must expose an administrative interface to allow priority adjustment.  Such
> > a solution is not scalable in a DCB environment
> >
> 
> Nor any other complex environment. Or even a simple one.
Yes.

> 
> >
> > (2) is also difficult, as it requires constant administrative oversight of
> > applications so as to build appropriate rules to match traffic belonging to
> 
> Yes, your description of option 2, as simplified above, is difficult.
> 
> However certain algorithms are intended to improve fairness between
> flows that do not require as much oversight and classification.
> 
Yes, but DCB is orthogonal to software traffic control.  It's hardware queueing
based on the priority value of an skb.  As such, when a DCB enabled multiqueue
adapter selects the output queues in dev_pick_tx, it needs to have the
skb->priority value set properly.  Since we don't run any of the tc filters or
classifiers until after that's complete, we can't use those to adjust the skb
priority, as the root qdisc is already selected.

> However, even when RED or a newer queue management algorithm such as
> QFQ or DRR is applied, classes of traffic exist that benefit from more
> specialized diffserv or diffserv-like behavior.
> 
I understand, but again, DCB is orthogonal to that.  DCB is a hardware based
solution that steers traffic to various output queues in the NIC based on the
skb->priority value.  Take a look at ixgbe_select_queue for an example.

> However, the evidence for something more complex in server
> environments than simple priority management is compelling at this
> point.
> 
> > various classes, so that priority can be appropriately set. It is further
> > limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> > only run after a root qdisc has been selected (DCB enabled hardware may reserve
> > hw queues for various traffic classes and needs the priority to be set prior to
> > selecting the root qdisc)
> >
> 
> Multiple applications (somewhat) rightly set priorities according to
> their view of the world.
> 
> background traffic and immediate traffic often set the appropriate
> diffserv bits, other traffic can do the same, and at least a few apps
> set the priority field also in the hope that that will do some good,
> and perhaps more should.
> 
Agreed, and this patch respects that.  It only sets the priority of an skb that
doesn't already have its priority set.  See skb_update_prio.

> 
> >
> > I've discussed various solutions with John Fastabend, and we saw a cgroup as
> > being a good general solution to this problem.  The network priority cgroup
> 
> Not if you are wanting to apply queue management further down the stack!
> 
I'm not saying you can use the two together! I understand that this solution
interferes with the use of skb->priority in various queuing disciplines (just
like a program using SO_PRIORITY would), but the way those disciplines work is
incompatible with DCB at the moment.  You wouldn't use them all at the same
time.  I'd be happy to add some documentation to my patch to reflect that if you
like.

> >
> > allows for a per-interface priority map to be built per cgroup.  Any traffic
> > originating from an application in a cgroup, that does not explicitly set its
> > priority with SO_PRIORITY will have its priority assigned to the value
> > designated for that group on that interface.
> 
> > This allows a user space daemon,
> > when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> > based on the APP_TLV value received and administratively assign applications to
> > that priority using the existing cgroup utility infrastructure.
> 
> I would like it if the many uses of the priority field were reduced to
> one use per semantic grouping.
> 
> You are adding a controller to something that is already
> ill-controlled and ill-defined, overly overloaded and both under and
> over used, to be managed in userspace by code to be designed later, and
> then re-mapped once it exits a vm into another host or hardware queue
> management system which may or may not share similar assumptions.
> 
> Don't get me wrong, I LIKE the controller idea, but think the priority
> field needs to be un-overloaded first to avoid ill-effects elsewhere
> in the users of the down-stream subsystems.
> 
We can certainly discuss the idea of separating the various semantic uses of
skb->priority out, but I don't think this patch is the place to do it. The
DCB use case for priority already exists (it specifically uses the prio_tc_map
as indexed by skb->priority in __skb_tx_hash).  I'm just adding a means of
controlling it more easily and reliably. 
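
To make that dependency concrete, here is a rough user-space model of how a
priority ends up selecting a hardware queue.  Only prio_tc_map and
__skb_tx_hash are names from the discussion above; struct dev_model,
struct txq_range, pick_tx_queue() and the table sizes are assumptions made for
this sketch, and the real hash spreading may differ.

```c
#include <stdint.h>

#define PRIO_SLOTS 16 /* assumed size of the priority -> class table */
#define MAX_TC     8  /* assumed number of traffic classes */

/* Hypothetical description of the queues reserved for one class. */
struct txq_range {
	uint16_t offset; /* first hardware queue of the class */
	uint16_t count;  /* number of queues reserved for it */
};

/* Hypothetical stand-in for the per-device DCB state. */
struct dev_model {
	uint8_t prio_tc_map[PRIO_SLOTS]; /* priority -> traffic class */
	struct txq_range tc_to_txq[MAX_TC];
};

/* The priority selects the class; the flow hash then spreads packets
 * over the queues reserved for that class. */
static uint16_t pick_tx_queue(const struct dev_model *dev,
			      uint32_t priority, uint32_t hash)
{
	uint8_t tc = dev->prio_tc_map[priority % PRIO_SLOTS];
	const struct txq_range *r = &dev->tc_to_txq[tc % MAX_TC];

	if (r->count == 0)
		return r->offset;
	return (uint16_t)(r->offset + hash % r->count);
}
```

In this model a packet that reaches the queue selection with the wrong
priority lands in the wrong class's queue range, which is why the priority
has to be set before the queue is picked.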

> > Tested by John and myself, with good results
> 
> With what?
> 
What else?  An ixgbe adapter and ping.  I created a test netprio cgroup,
assigned a priority value to it, and did a
cgexec -g net_prio:test ping www.yahoo.com
with a printk in the ixgbe tx method to validate that the proper queue mapping
was selected.

Neil

> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: John Fastabend <john.r.fastabend@intel.com>
> > CC: Robert Love <robert.w.love@intel.com>
> > CC: "David S. Miller" <davem@davemloft.net>
> 
> 
> 
> --
> Dave Täht
> SKYPE: davetaht
> 
> http://www.bufferbloat.net
> 

^ permalink raw reply

* Re: [PATCH] net: drivers/net/hippi/Kconfig should be sourced
From: Paul Bolle @ 2011-11-09 21:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20111109.154643.2008804116058722848.davem@davemloft.net>

On Wed, 2011-11-09 at 15:46 -0500, David Miller wrote:
> Please at least type "make oldconfig" with CONFIG_HIPPI enabled or
> similar before submitting patches like this.
> 
> There is nothing architecture or platform specific about getting
> the option enabled enough for you to see this:
> 
> drivers/net/hippi/Kconfig:40: syntax error
> drivers/net/hippi/Kconfig:20: missing end statement for this entry
> drivers/net/Kconfig:28: missing end statement for this entry
> drivers/Kconfig:1: missing end statement for this entry
> drivers/net/hippi/Kconfig:39: invalid statement
> drivers/net/Kconfig:341: unexpected end statement
> drivers/Kconfig:139: unexpected end statement
> make[1]: *** [oldconfig] Error 1
> make: *** [oldconfig] Error 2
> 
> I've fixed this up but if you can't be bothered to type "make" I
> seriously can't be bothered to even look at your patch submissions.

Would it be better if I hadn't submitted this as a patch (with a
warning, which you perhaps missed, that I didn't build test it) but as a
simple message to notify the people who wrote the patch that started all
this, netdev and you, that that commit was incomplete? If so, I'd be
glad to only do that in the future.


Paul Bolle

^ permalink raw reply

* Re: [PATCH] net/usb: Misc. fixes for the LG-VL600 LTE USB modem
From: David Miller @ 2011-11-09 21:06 UTC (permalink / raw)
  To: prox; +Cc: dcbw, oliver, gregkh, netdev, linux-kernel
In-Reply-To: <20111109185714.GA15884@prolixium.com>

From: Mark Kamichoff <prox@prolixium.com>
Date: Wed, 9 Nov 2011 13:57:14 -0500

> For (a), it's my understanding that __constant_htons() should be used
> only for initializers and htons() used in other cases, since it handles
> checking for constants.  I suppose you're right and this is a little
> gratuitous, but I wanted to keep things clean.
> 
> As far as (b), sorry!  That's an error on my part.  I must have been
> practicing another coding style at the time.  The braces certainly
> shouldn't be there, let me know if I should resubmit.

Please get rid of the gratuitous htons() etc. changes and keep this
patch purely to the bug fixes and resubmit.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next] ipv4: reduce percpu needs for icmpmsg mibs
From: David Miller @ 2011-11-09 21:04 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1320793483.26025.29.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 09 Nov 2011 00:04:43 +0100

> Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
> (can be ~88000 us).
> 
> This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
> for all possible cpus can read 16 Mbytes of memory.
> 
> ICMP messages are not considered as fast path on a typical server, and
> eventually few cpus handle them anyway. We can afford an atomic
> operation instead of using percpu data.
> 
> This saves 4096 bytes per cpu and per network namespace.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> If this patch is accepted, I'll submit the IPv6 part as well.

Looks good, applied, thanks Eric.
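
The trade-off in the quoted changelog can be sketched in user-space C.  This is
a model under stated assumptions, not the patch itself: it assumes 512 counters
of 8 bytes each (which gives the 4096 bytes per cpu quoted above), uses C11
atomics in place of the kernel's atomic types, and all names are hypothetical.
With those numbers, 16 Mbytes would correspond to 4096 possible cpus.

```c
#include <stdatomic.h>
#include <stddef.h>

#define ICMPMSG_SLOTS 512 /* assumed: 4096 bytes / 8-byte counters */

/* Bytes a reader touches when folding every per-cpu copy of the table. */
static size_t percpu_fold_bytes(size_t possible_cpus)
{
	return possible_cpus * ICMPMSG_SLOTS * 8;
}

/* Shared table: one atomic counter per slot, no per-cpu copies. */
struct icmpmsg_counters {
	atomic_ulong slot[ICMPMSG_SLOTS];
};

/* Writers pay for an atomic add instead of a plain per-cpu increment. */
static void icmpmsg_inc(struct icmpmsg_counters *c, unsigned int idx)
{
	atomic_fetch_add_explicit(&c->slot[idx % ICMPMSG_SLOTS], 1,
				  memory_order_relaxed);
}

/* Readers load a single value; there is nothing to fold across cpus. */
static unsigned long icmpmsg_read(struct icmpmsg_counters *c,
				  unsigned int idx)
{
	return atomic_load_explicit(&c->slot[idx % ICMPMSG_SLOTS],
				    memory_order_relaxed);
}
```

The read cost of the shared table stays constant in the number of cpus, while
the per-cpu layout grows linearly with it; the price is an atomic operation on
a path that is not considered hot.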

^ permalink raw reply

* Re: [PATCH] ipv4: fix for ip_options_rcv_srr() daddr update.
From: David Miller @ 2011-11-09 20:59 UTC (permalink / raw)
  To: lw; +Cc: netdev
In-Reply-To: <4EBA2E30.8050102@cn.fujitsu.com>

From: Li Wei <lw@cn.fujitsu.com>
Date: Wed, 09 Nov 2011 15:39:28 +0800

> When opt->srr_is_hit is set, skb_rtable(skb) has been updated for
> 'nexthop', and iph->daddr should always equal skb_rtable->rt_dst,
> so we need to update iph->daddr as well.
> 
> Signed-off-by: Li Wei <lw@cn.fujitsu.com>

Applied, thank you.

^ permalink raw reply

