* Re: [PATCH 1/4] ipv4: Fix pmtu propagating
From: David Miller @ 2011-11-08 19:19 UTC (permalink / raw)
To: gaofeng; +Cc: steffen.klassert, netdev
In-Reply-To: <20111019.153208.1743995025598236810.davem@davemloft.net>
Steffen I look at this specific patch again.
The peer should be found and loaded up, and the PMTU propagated, when
the route cache entry is created. Specifically rt_init_metrics() will
find any existing peer, and do the whole check_peer_pmtu() sequence
that ipv4_dst_check() does.
If we have some issue with UDP or RAW caching a route past the first
use after the route lookup, yes we have to introduce a dst_check()
call somewhere.
But unilaterally doing this on every CORK setup seems at least very
excessive. Because CORK setup happens even for single sends. One
such example is ip_make_skb().
ip_make_skb() is used by, f.e., udp_sendmsg() when corkreq is false.
And in this case 'rt' is used immediately after being looked up
in udp_sendmsg().
And, as far as I can tell, routines like udp_sendmsg() in fact already
handle the "cached route across multiple sendmsg() calls" case too.
Specifically, udp_sendmsg() does this:
if (connected)
rt = (struct rtable *)sk_dst_check(sk, 0);
otherwise it makes a completely fresh route lookup.
RAW sendmsg unconditionally makes a fresh route lookup on every
sendmsg call. So it should be OK too.
So I really fail to see the problematic case. Therefore, if it exists
you'll have to give me an exact sequence of events that leads to the
problem.
I suspect that your real problem has nothing to do with UDP or RAW,
but rather the issue is that entries already in the routing cache
with a NULL peer need to be refreshed with peer information created
in another context.
^ permalink raw reply
* [PATCH] net: min_pmtu default is 552
From: Eric Dumazet @ 2011-11-08 19:10 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Small fix in Documentation, since min_pmtu is 512 + 20 + 20 = 552
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
Documentation/networking/ip-sysctl.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index cb7f314..f049a1c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -20,7 +20,7 @@ ip_no_pmtu_disc - BOOLEAN
default FALSE
min_pmtu - INTEGER
- default 562 - minimum discovered Path MTU
+ default 552 - minimum discovered Path MTU
route/max_size - INTEGER
Maximum number of routes allowed in the kernel. Increase
^ permalink raw reply related
* Re: [PATCH net-next] W5300: Add WIZnet W5300 Ethernet driver
From: Ben Hutchings @ 2011-11-08 19:09 UTC (permalink / raw)
To: Taehun Kim; +Cc: David S. Miller, netdev, linux-kernel, suhwan, bongbong
In-Reply-To: <1320676661-19138-1-git-send-email-kth3321@gmail.com>
On Mon, 2011-11-07 at 23:37 +0900, Taehun Kim wrote:
> I have modified W5300 driver by applying the David Miller's feedback
> (http://www.spinics.net/lists/netdev/msg177862.html).
>
> Please review this driver and apply it if do not have any problems.
[...]
> diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
> index 6dff5a0..6325d85 100644
> --- a/drivers/net/ethernet/Kconfig
> +++ b/drivers/net/ethernet/Kconfig
> @@ -173,5 +173,6 @@ source "drivers/net/ethernet/tundra/Kconfig"
> source "drivers/net/ethernet/via/Kconfig"
> source "drivers/net/ethernet/xilinx/Kconfig"
> source "drivers/net/ethernet/xircom/Kconfig"
> +source "drivers/net/ethernet/wiznet/Kconfig"
>
> endif # ETHERNET
> diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
> index c53ad3a..7bd5211 100644
> --- a/drivers/net/ethernet/Makefile
> +++ b/drivers/net/ethernet/Makefile
> @@ -72,3 +72,4 @@ obj-$(CONFIG_NET_VENDOR_TUNDRA) += tundra/
> obj-$(CONFIG_NET_VENDOR_VIA) += via/
> obj-$(CONFIG_NET_VENDOR_XILINX) += xilinx/
> obj-$(CONFIG_NET_VENDOR_XIRCOM) += xircom/
> +obj-$(CONFIG_NET_VENDOR_WIZNET) += wiznet/
These two files currently include their subdirectories in (mostly)
alphabetical order, so please maintain that.
> diff --git a/drivers/net/ethernet/wiznet/Kconfig b/drivers/net/ethernet/wiznet/Kconfig
> new file mode 100644
> index 0000000..b5925bd
> --- /dev/null
> +++ b/drivers/net/ethernet/wiznet/Kconfig
> @@ -0,0 +1,32 @@
> +#
> +# WIZnet device configuration
> +#
> +
> +config NET_VENDOR_WIZNET
> + bool "WIZnet devices"
> + default y
I don't think this should have 'default y'. The other vendor Kconfig
files have 'default y' for compatibility with old kernel config files.
It should possibly have 'depends on ARM'.
> + ---help---
> + If you have a network (Ethernet) card belonging to this class, say Y
> + and read the Ethernet-HOWTO, available from
> + <http://www.tldp.org/docs.html#howto>.
> +
> + Note that the answer to this question doesn't directly affect the
> + kernel: saying N will just cause the configurator to skip all
> + the questions about WIZnet devices. If you say Y, you will be asked for
> + your specific card in the following questions.
> +
> +if NET_VENDOR_WIZNET
> +
> +config W5300
> + tristate "WIZnet W5300 Ethernet support"
> + depends on ARM
> + ---help---
> + This driver supports the Ethernet in the WIZnet W5300 chips.
> + W5300 supports hardwired TCP/IP stack. But this driver is limited to
> + the Ethernet function. To use hardwired TCP/IP stack, need to modify
> + the TCP/IP stack in linux kerenl.
Don't mention features that don't exist in this driver. Just the first
line is enough.
> + To compile this driver as a module, choose M here: the module
> + will be called w5300.
> +
> +endif # NET_VENDOR_WIZNET
> diff --git a/drivers/net/ethernet/wiznet/Makefile b/drivers/net/ethernet/wiznet/Makefile
> new file mode 100644
> index 0000000..53120bc
> --- /dev/null
> +++ b/drivers/net/ethernet/wiznet/Makefile
> @@ -0,0 +1,5 @@
> +#
> +# Makefile for the WIZnet device drivers.
> +#
> +
> +obj-$(CONFIG_W5300) += w5300.o
> diff --git a/drivers/net/ethernet/wiznet/w5300.c b/drivers/net/ethernet/wiznet/w5300.c
> new file mode 100644
> index 0000000..6fe3b57
> --- /dev/null
> +++ b/drivers/net/ethernet/wiznet/w5300.c
> @@ -0,0 +1,697 @@
> +/* w5300.c: A Linux Ethernet driver for the WIZnet W5300 chip. */
> +/*
> + Copyright (C) 2011 Taehun Kim <kth3321@gmail.com>
> +
> + This software may be used and distributed according to the terms of
> + the GNU General Public License (GPL), incorporated herein by reference.
Which version.
> + Drivers based on or derived from this code fall under the GPL and must
> + retain the authorship, copyright and license notice. This file is not
> + a complete program and may only be used when the entire operating
> + system is licensed under the GPL.
I think that condition is problematic. 'The operating system' may be
taken to include more than just the kernel, but kernel developers have
not attempted to put any restriction on licencing of user-space.
> +*/
[...]
> +/* Transmit timeout, default 5 seconds. */
> +static int watchdog = 5000;
> +module_param(watchdog, int, 0400);
> +MODULE_PARM_DESC(watchdog, "transmit timeout in milliseconds");
Is it really necessary for this to be configurable?
> +static int debug = -1;
> +module_param(debug, int, 0);
> +MODULE_PARM_DESC(debug, "W5300: bitmapped message enable number");
> +
> +/*
> + * This is W5300 information structure.
> + * Additional information is included in struct net_device.
> + */
> +struct wiz_private {
> + void __iomem *base;
> + struct net_device *dev;
> + u8 rxbuf_conf[MAX_SOCK_NUM];
> + u8 txbuf_conf[MAX_SOCK_NUM];
> + struct napi_struct napi;
> + spinlock_t lock;
> + u32 msg_enable;
> +};
> +
> +/* Default MAC address. */
> +static __initdata u8 w5300_defmac[6] = {0x00, 0x08, 0xDC, 0xA0, 0x00, 0x01};
This is not suitable as a default MAC address.
> +/* Default RX/TX buffer size(KByte). */
> +static u8 w5300_rxbuf_conf[MAX_SOCK_NUM] __initdata = {
> + 64, 0, 0, 0, 0, 0, 0, 0
> +};
> +
> +static u8 w5300_txbuf_conf[MAX_SOCK_NUM] __initdata = {
> + 64, 0, 0, 0, 0, 0, 0, 0
> +};
These are used in device initialisation so they should be __devinitdata,
not __initdata. But also, they aren't changed anywhere so they should
be declared const and __devinitconst.
Though really I can't see why you need these arrays at all when you are
only using one channel.
[...]
> +/* Opening channels of W5300 */
> +static int w5300_channel_open(struct wiz_private *wp, u32 type)
> +{
> + int timeout = 1000;
> +
> + /* Which type will be used for open? */
> + switch (type) {
> + case Sn_MR_MACRAW:
> + case Sn_MR_MACRAW_MF:
> + w5300_write(wp, Sn_MR(0), type);
> + break;
> + default:
> + netif_err(wp, ifup, wp->dev,
> + "Unknown socket type (%d)\n", type);
Wouldn't this case indicate a bug in the driver? In which case you
should use BUG().
> + return -EFAULT;
EFAULT means that user-space passed an invalid memory address.
> + }
> +
> + w5300_write(wp, Sn_CR(0), Sn_CR_OPEN);
> +
> + while (timeout--) {
> + if (!w5300_read(wp, Sn_CR(0)))
> + return 0;
> + udelay(1);
> + }
> +
> + return -EBUSY;
> +}
[...]
> +/* W5300 initialization function */
> +static int w5300_reset(struct net_device *dev)
> +{
[...]
> + /* Setting the size of Rx/Tx FIFO */
> + for (i = 0; i < MAX_SOCK_NUM; ++i) {
> + if (wp->rxbuf_conf[i] > 64) {
> + netif_err(wp, drv, wp->dev,
> + "Illegal Channel(%d) RX memory size.\n", i);
> +
> + return -EINVAL;
> + }
> + if (wp->txbuf_conf[i] > 64) {
> + netif_err(wp, drv, wp->dev,
> + "Illegal Channel(%d) TX memory size.\n", i);
> +
> + return -EINVAL;
> + }
> + txbuf_total += wp->txbuf_conf[i];
> + }
> +
> + if (txbuf_total % 8) {
> + netif_err(wp, drv, wp->dev,
> + "Illegal memory size register setting.\n");
> +
> + return -EINVAL;
> + }
Again, an invalid buffer configuration would be a bug in the driver and
not a run-time configuration error, so use BUG_ON() instead.
[...]
> +/* Interrupt Handler(ISR) */
> +static irqreturn_t wiz_interrupt(int irq, void *dev_instance)
> +{
> + struct net_device *dev = dev_instance;
> + struct wiz_private *wp = netdev_priv(dev);
> + int timeout = 100;
> + u16 isr, ssr;
> + int s;
> +
> + isr = w5300_read(wp, IR);
> +
> + /* Completing all interrupts at a time. */
> + while (isr && timeout--) {
Why would you need to repeat this? You disable the interrupt
> + w5300_write(wp, IR, isr);
> +
> + /* Finding the channel to create the interrupt */
> + s = find_first_bit((ulong *)&isr, sizeof(u16));
find_first_bit() is unnecessary in this case; __ffs(isr) should be more
efficient and doesn't require the ugly cast.
Can multiple bits be set in isr, or does the hardware ensure that you
only see one bit set?
> + ssr = w5300_read(wp, Sn_IR(s));
> + /* socket interrupt is cleared. */
> + w5300_write(wp, Sn_IR(s), ssr);
> +
> + netif_dbg(wp, intr, wp->dev,
> + "ISR = %X, SSR = %X, s = %X\n",
> + isr, ssr, s);
> +
> + if (likely(!s)) {
> + if (ssr & Sn_IR_RECV) {
> + /* Interrupt disable. */
> + w5300_interrupt_disable(wp);
> + /* Receiving by polling method */
> + napi_schedule(&wp->napi);
These comments are redundant.
> + }
> + }
> +
> + /* Is there any interrupt to be processed? */
> + isr = w5300_read(wp, IR);
> + }
> +
> + return IRQ_HANDLED;
> +}
> +
> +static int wiz_open(struct net_device *dev)
> +{
> + struct wiz_private *wp = netdev_priv(dev);
> + int ret;
> +
> + napi_enable(&wp->napi);
> +
> + ret = request_irq(dev->irq, wiz_interrupt, IRQF_SHARED,
> + dev->name, dev);
> + if (ret < 0) {
> + netif_err(wp, ifup, wp->dev, "request_irq() error!\n");
> + return ret;
> + }
> +
> + w5300_interrupt_enable(wp);
> +
> + /* Sending OPEN command to use channel 0 as MACRAW mode. */
> + ret = w5300_channel_open(wp, Sn_MR_MACRAW_MF);
> + if (ret < 0) {
> + netif_err(wp, ifup, wp->dev, "w5300 channel open fail!\n");
> + return ret;
You need to tear down the interrupt in this case.
> + }
> +
> + netif_carrier_on(dev);
Either implement link detection or don't. But don't fake it by calling
netif_carrier_{on,off}() in control functions.
> + netif_start_queue(dev);
> +
> + return 0;
> +}
> +
> +static int wiz_close(struct net_device *dev)
> +{
> + struct wiz_private *wp = netdev_priv(dev);
> + int timeout = 1000;
> +
> + napi_disable(&wp->napi);
> + netif_carrier_off(dev);
> +
> + /* Interrupt masking of all channels */
> + w5300_write(wp, IMR, 0x0000);
> + w5300_write(wp, Sn_CR(0), Sn_CR_CLOSE);
> +
> + while (timeout--) {
> + if (!w5300_read(wp, Sn_CR(0)))
> + break;
> + udelay(1);
> + }
You need to call synchronize_irq() between masking the interrupt sources
and calling free_irq(), otherwise this can race with the interrupt
handler on an SMP system.
> + free_irq(dev->irq, dev);
> +
> + return 0;
> +}
[...]
> +/* Function to transmit data at the MACRAW mode */
> +static int wiz_start_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct wiz_private *wp = netdev_priv(dev);
> + int ret;
> +
> + ret = w5300_send_data(wp, skb->data, skb->len);
> +
> + /* Statistical Process */
> + if (ret < 0) {
> + dev->stats.tx_dropped++;
Will the hardware generate an interrupt when there is space in the TX
FIFO? If so then you should use netif_{stop,wake}_queue() to avoid
dropping packets.
> + } else {
> + dev->stats.tx_bytes += skb->len;
> + dev->stats.tx_packets++;
> + dev->trans_start = jiffies;
Don't update trans_start; that's not the driver's responsibility any more.
> + netif_dbg(wp, tx_done, wp->dev,
> + "tx done, packet size = %d\n", skb->len);
> + }
> + dev_kfree_skb(skb);
> +
> + return NETDEV_TX_OK;
> +}
[...]
> +/* It is called when multi-cast list or flag is changed. */
> +static void wiz_set_multicast(struct net_device *dev)
> +{
> + struct wiz_private *wp = netdev_priv(dev);
> + int ret;
> + u32 type = dev->flags & IFF_PROMISC ? Sn_MR_MACRAW : Sn_MR_MACRAW_MF;
> +
> + ret = w5300_channel_open(wp, type);
> + if (ret < 0) {
> + netif_err(wp, ifup, wp->dev,
> + "w5300 channel open fail!\n");
> + }
> +}
You don't seem to be adjusting multicast filtering.
> +static int wiz_set_mac_address(struct net_device *dev, void *addr)
> +{
> + struct wiz_private *wp = netdev_priv(dev);
> + struct sockaddr *sock_addr = addr;
> +
> + netif_dbg(wp, drv, wp->dev, "set mac address");
> +
> + spin_lock(&wp->lock);
> + w5300_set_macaddr(wp, sock_addr->sa_data);
> + memcpy(dev->dev_addr, sock_addr->sa_data, dev->addr_len);
> + spin_unlock(&wp->lock);
> +
> + return 0;
> +}
You're supposed to validate the address with is_valid_ether_addr().
This is called in process context so it needs to use
spin_{,un}lock_bh(). Though I'm not sure what you think this lock is
protecting, as it's not used in any sort of consistent way.
> +static void wiz_tx_timeout(struct net_device *dev)
> +{
> + struct wiz_private *wp = netdev_priv(dev);
> + unsigned long flags;
> +
> + netif_dbg(wp, timer, wp->dev, "Transmit timeout");
> +
> + spin_lock_irqsave(&wp->lock, flags);
> +
> + /* Initializing W5300 chip. */
> + if (w5300_reset(dev) < 0) {
> + netif_err(wp, timer, wp->dev, "w5300 reset fail!\n");
> + return;
> + }
> +
> + /* Waking up network interface */
> + netif_wake_queue(dev);
> + spin_unlock_irqrestore(&wp->lock, flags);
> +}
This can never be called because you never stop the TX queue.
And since you do not implement link detection, it is impossible for the
TX watchdog to work properly.
> +/*
> + * Polling Function to process only receiving at the MACRAW mode.
> + * De-activating the interrupt when recv interrupt occurs,
> + * and processing the RECEIVE with this Function
> + * Activating the interrupt after completing RECEIVE process
> + * As recv interrupt often occurs at short intervals,
> + * there will system load in case that interrupt handler process the RECEIVE.
> + */
> +static int wiz_rx_poll(struct napi_struct *napi, int budget)
> +{
> + struct wiz_private *wp = container_of(napi, struct wiz_private, napi);
> + struct net_device *dev = wp->dev;
> + int npackets = 0;
> +
> + /* Processing the RECEIVE during Rx FIFO is containing any packet */
> + while (w5300_get_rxsize(wp, 0) > 0) {
> + struct sk_buff *skb;
> + u16 rxbuf_len, pktlen;
> + u32 crc;
> +
> + /* The first 2byte is the information about packet lenth. */
> + w5300_recv_data(wp, 0, (u8 *)&pktlen, 2);
> + pktlen = be16_to_cpu(pktlen);
> +
> + netif_dbg(wp, rx_err, wp->dev, "pktlen = %d\n", pktlen);
> +
> + /*
> + * Allotting the socket buffer in which packet will be contained
> + * Ethernet packet is of 14byte.
> + * In order to make it multiplied by 2, the buffer allocation
You mean, in order to align the network header on a multiple of *4*.
> + * should be 2bytes bigger than the packet.
> + */
> + skb = netdev_alloc_skb_ip_align(dev, pktlen);
> + if (!skb) {
> + u8 temp[pktlen + 4];
No, you mustn't allocate arbitary amounts of stack space like this!
Either pre-allocate a discard buffer when the device is opened, or
change w5300_recv_data() so that it can work without a buffer to write
to.
> + dev->stats.rx_dropped++;
> + w5300_recv_data(wp, 0, temp, pktlen + 4);
> + continue;
> + }
> +
> + /* Initializing the socket buffer */
> + skb->dev = dev;
> + skb_reserve(skb, 2);
netdev_alloc_skb_ip_align() already reserves padding for alignment.
> + skb_put(skb, pktlen);
> +
> + /* Reading packets from W5300 Rx FIFO into socket buffer. */
> + w5300_recv_data(wp, 0, (u8 *)skb->data, pktlen);
> +
> + /* Reading and discarding 4byte CRC. */
> + w5300_recv_data(wp, 0, (u8 *)&crc, 4);
> + crc = be32_to_cpu(crc);
Does the MAC discard packets with a bad CRC? If not, you have to do
that here.
> + /* The packet type is Ethernet. */
> + skb->protocol = eth_type_trans(skb, dev);
> +
> + /* Passing packets to uppder stack (kernel). */
> + netif_receive_skb(skb);
> +
> + /* Processing statistical information */
> + dev->stats.rx_packets++;
> + dev->stats.rx_bytes += pktlen;
> + dev->last_rx = jiffies;
> + rxbuf_len -= pktlen;
> + npackets++;
> +
> + if (npackets >= budget)
> + break;
> + }
> +
> + /* If packet number is smaller than budget when getting out of loopback,
> + * the RECEIVE process is completed. */
> + if (npackets < budget) {
> + unsigned long flags;
> +
> + spin_lock_irqsave(&wp->lock, flags);
> + w5300_interrupt_enable(wp);
> + __napi_complete(napi);
You must enable interrupts only after marking the NAPI context complete,
otherwise there is a race condition where the NAPI context will not be
rescheduled by an interrupt.
> + spin_unlock_irqrestore(&wp->lock, flags);
This is always called in tasklet ('bottom half') context, so just use
spin_{,un}lock().
Again, I'm not sure what you think you're protecting here. You probably
do need locking to serialise calls to
w5300_interrupt_{disable,enable}(), but you're not doing so
consistently.
[...]
> +/* Initialize W5300 driver. */
> +static int __devinit w5300_drv_probe(struct platform_device *pdev)
> +{
[...]
> + wp->msg_enable = (debug < 0 ? W5300_DEF_MSG_ENABLE : debug);
Why not just initialise debug to W5300_DEF_MSG_ENABLE and set
wp->msg_enable = debug here?
[...]
> + ret = w5300_reset(dev);
> + if (ret < 0)
> + goto release_both;
At this point you need to free the MMIO mapping on error.
> + ret = register_netdev(dev);
> + if (ret != 0) {
> + platform_set_drvdata(pdev, NULL);
> + iounmap(addr);
> +release_both:
> + free_netdev(dev);
> +release_region:
> + release_mem_region(res->start, resource_size(res));
> + }
> +out:
> + return ret;
> +}
[...]
> +#ifdef CONFIG_PM
> +
> +static int w5300_drv_suspend(struct platform_device *pdev, pm_message_t state)
> +{
> + struct net_device *dev = platform_get_drvdata(pdev);
> +
> + if (dev) {
> + struct wiz_private *wp = netdev_priv(dev);
> +
> + if (netif_running(dev)) {
> + int timeout = 1000;
> +
> + netif_carrier_off(dev);
> + netif_device_detach(dev);
Even if you do implement link detection, there is no need to call
netif_carrier_off() here.
[...]
> --- /dev/null
> +++ b/drivers/net/ethernet/wiznet/w5300.h
> @@ -0,0 +1,121 @@
> +#ifndef _W5300_H_
> +#define _W5300_H_
> +
> +/* Maximum socket number. W5300 supports max 8 channels. */
> +#define MAX_SOCK_NUM 8
> +
> +/* socket register */
> +#define CH_BASE (0x200)
Please refer consistently to channels, not sockets. The in-tree driver
will never be using the TCP offload functionality of the device.
[...]
> +/* W5300 Register READ/WRITE funtions(Just 16 bit interface). */
> +#define w5300_write(wp, addr, val) writew(val, (wp->base + addr))
> +#define w5300_read(wp, addr) readw((wp->base + addr))
This is not how you write safe macro definitions. You generally need to
put parentheses directly around references to the parameters, not the
expressions that use them:
#define w5300_write(wp, addr, val) writew(val, (wp)->base + (addr))
#define w5300_read(wp, addr) readw((wp)->base + (addr))
Ben.
> +#endif /* _W5300_H_ */
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH] tcp: Fix comments for Nagle algorithm
From: David Miller @ 2011-11-08 19:03 UTC (permalink / raw)
To: kinwin2008; +Cc: netdev, linux-kernel
In-Reply-To: <1320503003-5169-1-git-send-email-kinwin2008@gmail.com>
From: Feng King <kinwin2008@gmail.com>
Date: Sat, 5 Nov 2011 22:23:23 +0800
> TCP_NODELAY is weaker than TCP_CORK, when TCP_CORK was set, small
> segments will always pass Nagle test regardless of TCP_NODELAY option.
>
> Signed-off-by: Feng King <kinwin2008@gmail.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH] sunhme: Allow usage on SBI based SBus systems
From: David Miller @ 2011-11-08 19:01 UTC (permalink / raw)
To: oftedal; +Cc: netdev, sparclinux
In-Reply-To: <Pine.LNX.4.64.1111072217540.4069@oizys.tordivel.org>
From: Kjetil Oftedal <oftedal@gmail.com>
Date: Mon, 7 Nov 2011 22:47:53 +0100 (CET)
> To prevent the SBus driver for Sun Happy Meal cards from being loaded for
> PCI cards utilizing the same chipset, a filter was added to the probe
> function in commit 0b492fce3d72d982a7981905f85484a1e1ba7fde.
>
> The filter was implemented by checking the name of the parent node in
> the OF tree. This patch extends this filter, so that the driver will
> load on SBus systems that are based upon SBI SBus Bridges.
>
> Signed-off-by: Kjetil Oftedal <oftedal@gmail.com>
Applied, thanks.
^ permalink raw reply
* Re: dst->obsolete has become pointless
From: David Miller @ 2011-11-08 18:59 UTC (permalink / raw)
To: steffen.klassert; +Cc: netdev, timo.teras
In-Reply-To: <20111108.122020.1080743546477280623.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Tue, 08 Nov 2011 12:20:20 -0500 (EST)
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Date: Tue, 8 Nov 2011 10:34:24 +0100
>
>> I don't know what to do with DecNET, but we probaply need to decide
>> about the future of dst->obsolete before we can fix the ipv4 PMTU
>> problems. Possible fixes might depend on whether ->dst_check() is
>> always invoked or not.
>
> Simplest thing to do is to move dst->obsolete check into DecNET's
> ->dst_check() handler, then call ->dst_check() unconditionally.
>
> Then we can just set dst->obsolete to zero for all route types,
> and kill the "initial_obsolete" argument to dst_alloc() and
> associated logic.
What I meant is something like the following patch. It needs more
work because it turns out some cases in ipv6 and xfrm_policy really
want dst_check to be avoided for special routes.
--------------------
net: Kill pointless and misleading checks on dst->obsolete.
dst->obsolete is set to -1 for all ipv4 and ipv6 routes. Therefore
the check guarding the invocation dst_ops->dst_check() is completely
pointless and misleads the reader into thinking we elide the call
in the common case.
What this value really means these days is that dst_free() has been
called on the route.
Therefore rename it to dst->freed, and make it take on only the values
"0" and "1".
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/net/dst.h b/include/net/dst.h
index 4fb6c43..7bd2fdb 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -22,7 +22,7 @@
/* Each dst_entry has reference count and sits in some parent list(s).
* When it is removed from parent list, it is "freed" (dst_free).
- * After this it enters dead state (dst->obsolete > 0) and if its refcnt
+ * After this it enters dead state (dst->freed == 1) and if its refcnt
* is zero, it can be destroyed immediately, otherwise it is added
* to gc list and garbage collector periodically checks the refcnt.
*/
@@ -55,7 +55,7 @@ struct dst_entry {
#define DST_NOCOUNT 0x0020
short error;
- short obsolete;
+ unsigned short freed;
unsigned short header_len; /* more space at head required */
unsigned short trailer_len; /* space to reserve at tail */
#ifdef CONFIG_IP_ROUTE_CLASSID
@@ -369,13 +369,13 @@ static inline struct dst_entry *skb_dst_pop(struct sk_buff *skb)
extern int dst_discard(struct sk_buff *skb);
extern void *dst_alloc(struct dst_ops * ops, struct net_device *dev,
- int initial_ref, int initial_obsolete, int flags);
+ int initial_ref, int flags);
extern void __dst_free(struct dst_entry * dst);
extern struct dst_entry *dst_destroy(struct dst_entry * dst);
static inline void dst_free(struct dst_entry * dst)
{
- if (dst->obsolete > 1)
+ if (dst->freed)
return;
if (!atomic_read(&dst->__refcnt)) {
dst = dst_destroy(dst);
@@ -440,9 +440,7 @@ static inline int dst_input(struct sk_buff *skb)
static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
{
- if (dst->obsolete)
- dst = dst->ops->check(dst, cookie);
- return dst;
+ return dst->ops->check(dst, cookie);
}
extern void dst_init(void);
diff --git a/net/core/dst.c b/net/core/dst.c
index d5e2c4c..6cb504a 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -90,11 +90,11 @@ loop:
* on gc list, invalidate it and add to gc list.
*
* Note: this is temporary. Actually, NOHASH dst's
- * must be obsoleted when parent is obsoleted.
- * But we do not have state "obsoleted, but
+ * must be freed when parent is freed.
+ * But we do not have state "freed, but
* referenced by parent", so it is right.
*/
- if (dst->obsolete > 1)
+ if (dst->freed)
continue;
___dst_free(dst);
@@ -152,7 +152,7 @@ EXPORT_SYMBOL(dst_discard);
const u32 dst_default_metrics[RTAX_MAX];
void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
- int initial_ref, int initial_obsolete, int flags)
+ int initial_ref, int flags)
{
struct dst_entry *dst;
@@ -178,7 +178,7 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
dst->input = dst_discard;
dst->output = dst_discard;
dst->error = 0;
- dst->obsolete = initial_obsolete;
+ dst->freed = 0;
dst->header_len = 0;
dst->trailer_len = 0;
#ifdef CONFIG_IP_ROUTE_CLASSID
@@ -202,7 +202,7 @@ static void ___dst_free(struct dst_entry *dst)
*/
if (dst->dev == NULL || !(dst->dev->flags&IFF_UP))
dst->input = dst->output = dst_discard;
- dst->obsolete = 2;
+ dst->freed = 1;
}
void __dst_free(struct dst_entry *dst)
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ed7b1d..7b84f24 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -385,7 +385,7 @@ struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
{
struct dst_entry *dst = __sk_dst_get(sk);
- if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+ if (dst && dst->ops->check(dst, cookie) == NULL) {
sk_tx_queue_clear(sk);
RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
dst_release(dst);
@@ -400,7 +400,7 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
{
struct dst_entry *dst = sk_dst_get(sk);
- if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+ if (dst && dst->ops->check(dst, cookie) == NULL) {
sk_dst_reset(sk);
dst_release(dst);
return NULL;
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index a77d161..fc91774 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -269,11 +269,10 @@ static void dn_dst_update_pmtu(struct dst_entry *dst, u32 mtu)
}
}
-/*
- * When a route has been marked obsolete. (e.g. routing cache flush)
- */
static struct dst_entry *dn_dst_check(struct dst_entry *dst, __u32 cookie)
{
+ if (!dst->freed)
+ return dst;
return NULL;
}
@@ -1142,7 +1141,7 @@ make_route:
if (dev_out->flags & IFF_LOOPBACK)
flags |= RTCF_LOCAL;
- rt = dst_alloc(&dn_dst_ops, dev_out, 1, 0, DST_HOST);
+ rt = dst_alloc(&dn_dst_ops, dev_out, 1, DST_HOST);
if (rt == NULL)
goto e_nobufs;
@@ -1412,7 +1411,7 @@ static int dn_route_input_slow(struct sk_buff *skb)
}
make_route:
- rt = dst_alloc(&dn_dst_ops, out_dev, 0, 0, DST_HOST);
+ rt = dst_alloc(&dn_dst_ops, out_dev, 0, DST_HOST);
if (rt == NULL)
goto e_nobufs;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 155138d..6fe264f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1401,7 +1401,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
struct dst_entry *ret = dst;
if (rt) {
- if (dst->obsolete > 0) {
+ if (dst->freed) {
ip_rt_put(rt);
ret = NULL;
} else if (rt->rt_flags & RTCF_REDIRECTED) {
@@ -1890,7 +1890,7 @@ static void rt_set_nexthop(struct rtable *rt, const struct flowi4 *fl4,
static struct rtable *rt_dst_alloc(struct net_device *dev,
bool nopolicy, bool noxfrm)
{
- return dst_alloc(&ipv4_dst_ops, dev, 1, -1,
+ return dst_alloc(&ipv4_dst_ops, dev, 1,
DST_HOST |
(nopolicy ? DST_NOPOLICY : 0) |
(noxfrm ? DST_NOXFRM : 0));
@@ -2776,7 +2776,7 @@ static struct dst_ops ipv4_dst_blackhole_ops = {
struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_orig)
{
- struct rtable *rt = dst_alloc(&ipv4_dst_blackhole_ops, NULL, 1, 0, 0);
+ struct rtable *rt = dst_alloc(&ipv4_dst_blackhole_ops, NULL, 1, 0);
struct rtable *ort = (struct rtable *) dst_orig;
if (rt) {
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 93718f3..35bab4e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1166,7 +1166,7 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
struct rt6_info **rtp;
#if RT6_DEBUG >= 2
- if (rt->dst.obsolete>0) {
+ if (rt->dst.freed) {
WARN_ON(fn != NULL);
return -ENOENT;
}
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index bdc15c9..b38ff32 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -123,8 +123,7 @@ static inline struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
{
struct dst_entry *dst = t->dst_cache;
- if (dst && dst->obsolete &&
- dst->ops->check(dst, t->dst_cookie) == NULL) {
+ if (dst && dst->ops->check(dst, t->dst_cookie) == NULL) {
t->dst_cache = NULL;
dst_release(dst);
return NULL;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8473016..dfad749 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -190,7 +190,6 @@ static struct rt6_info ip6_null_entry_template = {
.dst = {
.__refcnt = ATOMIC_INIT(1),
.__use = 1,
- .obsolete = -1,
.error = -ENETUNREACH,
.input = ip6_pkt_discard,
.output = ip6_pkt_discard_out,
@@ -210,7 +209,6 @@ static struct rt6_info ip6_prohibit_entry_template = {
.dst = {
.__refcnt = ATOMIC_INIT(1),
.__use = 1,
- .obsolete = -1,
.error = -EACCES,
.input = ip6_pkt_prohibit,
.output = ip6_pkt_prohibit_out,
@@ -225,7 +223,6 @@ static struct rt6_info ip6_blk_hole_entry_template = {
.dst = {
.__refcnt = ATOMIC_INIT(1),
.__use = 1,
- .obsolete = -1,
.error = -EINVAL,
.input = dst_discard,
.output = dst_discard,
@@ -243,7 +240,7 @@ static inline struct rt6_info *ip6_dst_alloc(struct dst_ops *ops,
struct net_device *dev,
int flags)
{
- struct rt6_info *rt = dst_alloc(ops, dev, 0, 0, flags);
+ struct rt6_info *rt = dst_alloc(ops, dev, 0, flags);
if (rt != NULL)
memset(&rt->rt6i_table, 0,
@@ -913,7 +910,7 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori
struct rt6_info *rt, *ort = (struct rt6_info *) dst_orig;
struct dst_entry *new = NULL;
- rt = dst_alloc(&ip6_dst_blackhole_ops, ort->dst.dev, 1, 0, 0);
+ rt = dst_alloc(&ip6_dst_blackhole_ops, ort->dst.dev, 1, 0);
if (rt) {
memset(&rt->rt6i_table, 0, sizeof(*rt) - sizeof(struct dst_entry));
@@ -1243,7 +1240,6 @@ int ip6_route_add(struct fib6_config *cfg)
goto out;
}
- rt->dst.obsolete = -1;
rt->rt6i_expires = (cfg->fc_flags & RTF_EXPIRES) ?
jiffies + clock_t_to_jiffies(cfg->fc_expires) :
0;
@@ -2058,7 +2054,6 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
rt->dst.input = ip6_input;
rt->dst.output = ip6_output;
rt->rt6i_idev = idev;
- rt->dst.obsolete = -1;
rt->rt6i_flags = RTF_UP | RTF_NONEXTHOP;
if (anycast)
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index aa2d720..9934629 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -74,8 +74,7 @@ __ip_vs_dst_check(struct ip_vs_dest *dest, u32 rtos)
if (!dst)
return NULL;
- if ((dst->obsolete || rtos != dest->dst_rtos) &&
- dst->ops->check(dst, dest->dst_cookie) == NULL) {
+ if (dst->ops->check(dst, dest->dst_cookie) == NULL) {
dest->dst_cache = NULL;
dst_release(dst);
return NULL;
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 08b3cea..23bff5b 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -377,8 +377,7 @@ int sctp_packet_transmit(struct sctp_packet *packet)
*/
skb_set_owner_w(nskb, sk);
- /* The 'obsolete' field of dst is set to 2 when a dst is freed. */
- if (!dst || (dst->obsolete > 1)) {
+ if (!dst || dst->freed) {
dst_release(dst);
sctp_transport_route(tp, NULL, sctp_sk(sk));
if (asoc && (asoc->param_flags & SPP_PMTUD_ENABLE)) {
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index 394c57c..90a1a60 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -214,7 +214,7 @@ void sctp_transport_set_owner(struct sctp_transport *transport,
void sctp_transport_pmtu(struct sctp_transport *transport, struct sock *sk)
{
/* If we don't have a fresh route, look one up */
- if (!transport->dst || transport->dst->obsolete > 1) {
+ if (!transport->dst || transport->dst->freed) {
dst_release(transport->dst);
transport->af_specific->get_dst(transport, &transport->saddr,
&transport->fl, sk);
@@ -234,7 +234,7 @@ static struct dst_entry *sctp_transport_dst_check(struct sctp_transport *t)
{
struct dst_entry *dst = t->dst;
- if (dst && dst->obsolete && dst->ops->check(dst, 0) == NULL) {
+ if (dst && dst->ops->check(dst, 0) == NULL) {
dst_release(t->dst);
t->dst = NULL;
return NULL;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 552df27..3ae2903 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1348,7 +1348,7 @@ static inline struct xfrm_dst *xfrm_alloc_dst(struct net *net, int family)
default:
BUG();
}
- xdst = dst_alloc(dst_ops, NULL, 0, 0, 0);
+ xdst = dst_alloc(dst_ops, NULL, 0, 0);
if (likely(xdst)) {
memset(&xdst->u.rt6.rt6i_table, 0,
@@ -1474,7 +1474,6 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
dst1->xfrm = xfrm[i];
xdst->xfrm_genid = xfrm[i]->genid;
- dst1->obsolete = -1;
dst1->flags |= DST_HOST;
dst1->lastuse = now;
@@ -2215,30 +2214,12 @@ EXPORT_SYMBOL(__xfrm_route_forward);
static struct dst_entry *xfrm_dst_check(struct dst_entry *dst, u32 cookie)
{
- /* Code (such as __xfrm4_bundle_create()) sets dst->obsolete
- * to "-1" to force all XFRM destinations to get validated by
- * dst_ops->check on every use. We do this because when a
- * normal route referenced by an XFRM dst is obsoleted we do
- * not go looking around for all parent referencing XFRM dsts
- * so that we can invalidate them. It is just too much work.
- * Instead we make the checks here on every use. For example:
- *
- * XFRM dst A --> IPv4 dst X
- *
- * X is the "xdst->route" of A (X is also the "dst->path" of A
- * in this example). If X is marked obsolete, "A" will not
- * notice. That's what we are validating here via the
- * stale_bundle() check.
- *
- * When a policy's bundle is pruned, we dst_free() the XFRM
- * dst which causes it's ->obsolete field to be set to a
- * positive non-zero integer. If an XFRM dst has been pruned
- * like this, we want to force a new route lookup.
+ /* All destinations in the kernel are validated unconditionally
+ * by calling dst_ops->dst_check() on every use.
*/
- if (dst->obsolete < 0 && !stale_bundle(dst))
- return dst;
-
- return NULL;
+ if (dst->freed || stale_bundle(dst))
+ return NULL;
+ return dst;
}
static int stale_bundle(struct dst_entry *dst)
@@ -2263,11 +2244,9 @@ static void xfrm_link_failure(struct sk_buff *skb)
static struct dst_entry *xfrm_negative_advice(struct dst_entry *dst)
{
- if (dst) {
- if (dst->obsolete) {
- dst_release(dst);
- dst = NULL;
- }
+ if (dst && dst->freed) {
+ dst_release(dst);
+ dst = NULL;
}
return dst;
}
^ permalink raw reply related
* Re: PROBLEM: pppol2tp over pppoe NULL pointer dereference
From: David Miller @ 2011-11-08 19:00 UTC (permalink / raw)
To: eric.dumazet; +Cc: spiked.yar, netdev
In-Reply-To: <1320478829.16609.15.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 05 Nov 2011 08:40:29 +0100
> Le vendredi 04 novembre 2011 à 22:28 -0400, David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 02 Nov 2011 00:58:13 +0100
>>
>> > Please try following patch, thanks !
>> >
>> > [PATCH] l2tp: handle fragmented skbs in receive path
>> >
>> > Modern drivers provide skb with fragments, and L2TP doesnt properly
>> > handles them.
>> >
>> > Some bad frames can also trigger panics because of insufficent checks.
>> >
>> > Reported-by: Misha Labjuk <spiked.yar@gmail.com>
>> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>>
>> I'm still waiting for testing results of this patch.
>
> Of course.
>
> If you prefer, I can submit a smaller patch for the obvious bug first,
> and I can respin the thing when net-next reopens.
>
> [PATCH] l2tp: fix l2tp_udp_recv_core()
>
> pskb_may_pull() can change skb->data, so we have to load ptr/optr at the
> right place.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Yes, this is easier to digest right now.
Applied.
^ permalink raw reply
* RE: [net-next-2.6 PATCH 0/8 RFC v2] macvlan: MAC Address filtering support for passthru mode
From: Rose, Gregory V @ 2011-11-08 18:31 UTC (permalink / raw)
To: Roopa Prabhu, netdev@vger.kernel.org
Cc: sri@us.ibm.com, dragos.tatulea@gmail.com, arnd@arndb.de,
kvm@vger.kernel.org, mst@redhat.com, davem@davemloft.net,
mchan@broadcom.com, dwang2@cisco.com, shemminger@vyatta.com,
eric.dumazet@gmail.com, kaber@trash.net, benve@cisco.com
In-Reply-To: <20111019062543.7242.3969.stgit@savbu-pc100.cisco.com>
> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Roopa Prabhu
> Sent: Tuesday, October 18, 2011 11:26 PM
> To: netdev@vger.kernel.org
> Cc: sri@us.ibm.com; dragos.tatulea@gmail.com; arnd@arndb.de;
> kvm@vger.kernel.org; mst@redhat.com; davem@davemloft.net;
> mchan@broadcom.com; dwang2@cisco.com; shemminger@vyatta.com;
> eric.dumazet@gmail.com; kaber@trash.net; benve@cisco.com
> Subject: [net-next-2.6 PATCH 0/8 RFC v2] macvlan: MAC Address filtering
> support for passthru mode
>
> v1 version of this RFC patch was posted at
> http://www.spinics.net/lists/netdev/msg174245.html
>
> Today macvtap used in virtualized environment does not have support to
> propagate MAC, VLAN and interface flags from guest to lowerdev.
> Which means to be able to register additional VLANs, unicast and multicast
> addresses or change pkt filter flags in the guest, the lowerdev has to be
> put in promisocous mode. Today the only macvlan mode that supports this is
> the PASSTHRU mode and it puts the lower dev in promiscous mode.
>
> PASSTHRU mode was added primarily for the SRIOV usecase. In PASSTHRU mode
> there is a 1-1 mapping between macvtap and physical NIC or VF.
>
> There are two problems with putting the lowerdev in promiscous mode (ie
> SRIOV
> VF's):
> - Some SRIOV cards dont support promiscous mode today (Thread on
> Intel
> driver indicates that http://lists.openwall.net/netdev/2011/09/27/6)
> - For the SRIOV NICs that support it, Putting the lowerdev in
> promiscous mode leads to additional traffic being sent up to the
> guest virtio-net to filter result in extra overheads.
>
> Both the above problems can be solved by offloading filtering to the
> lowerdev hw. ie lowerdev does not need to be in promiscous mode as
> long as the guest filters are passed down to the lowerdev.
>
> This patch basically adds the infrastructure to set and get MAC and VLAN
> filters on an interface via rtnetlink. And adds support in macvlan and
> macvtap
> to allow set and get filter operations.
>
> Earlier version of this patch provided the TUNSETTXFILTER macvtap
> interface
> for setting address filtering. In response to feedback, This version
> introduces a netlink interface for the same.
>
> Response to some of the questions raised during v1:
>
> - Netlink interface:
> This patch provides the following netlink interface to set mac and
> vlan
> filters :
> [IFLA_RX_FILTER] = {
> [IFLA_ADDR_FILTER] = {
> [IFLA_ADDR_FILTER_FLAGS]
> [IFLA_ADDR_FILTER_UC_LIST] = {
> [IFLA_ADDR_LIST_ENTRY]
> }
> [IFLA_ADDR_FILTER_MC_LIST] = {
> [IFLA_ADDR_LIST_ENTRY]
> }
> }
> [IFLA_VLAN_FILTER] = {
> [IFLA_VLAN_BITMAP]
> }
> }
>
> Note: The IFLA_VLAN_FILTER is a nested attribute and contains only
> IFLA_VLAN_BITMAP today. The idea is that the IFLA_VLAN_FILTER can
> be extended tomorrow to use a vlan list option if some
> implementations
> prefer a list instead.
>
> And it provides the following rtnl_link_ops to set/get MAC/VLAN
> filters:
>
> int (*set_rx_addr_filter)(struct net_device
> *dev,
> struct nlattr *tb[]);
> int (*set_rx_vlan_filter)(struct net_device
> *dev,
> struct nlattr *tb[]);
> size_t (*get_rx_addr_filter_size)(const struct
> net_device *dev);
> size_t (*get_rx_vlan_filter_size)(const struct
> net_device *dev);
> int (*fill_rx_addr_filter)(struct sk_buff *skb,
> const struct net_device
> *dev);
> int (*fill_rx_vlan_filter)(struct sk_buff *skb,
> const struct net_device
> *dev);
>
>
> Note: The choice of rtnl_link_ops was because I saw the use case for
> this in virtual devices that need to do filtering in sw like
> macvlan
> and tun. Hw devices usually have filtering in hw with netdev->uc and
> mc lists to indicate active filters. But I can move from
> rtnl_link_ops
> to netdev_ops if that is the preferred way to go and if there is a
> need to support this interface on all kinds of interfaces.
> Please suggest.
>
> - Protection against address spoofing:
> - This patch adds filtering support only for macvtap PASSTHRU
> Mode. PASSTHRU mode is used mainly with SRIOV VF's. And SRIOV VF's
> come with anti mac/vlan spoofing support. (Recently added
> IFLA_VF_SPOOFCHK). In 802.1Qbh case the port profile has a knob to
> enable/disable anti spoof check. Lowerdevice drivers also enforce
> limits
> on the number of address registrations allowed.
>
> - Support for multiqueue devices: Enable filtering on individual queues
> (?):
> AFAIK, there is no netdev interface to install per queue hw
> filters for a multi queue interface. And also I dont know of any hw
> that provides an interface to set hw filters on a per queue basis.
> A multi queue device appears as a single lowerdev (ie netdev) and
> uses the same uc and mc lists to setup unicast and multicast hw
> filters.
> So i dont see a huge problem with this patch coming in the way for
> multi queue devices.
>
> - Support for non-PASSTHRU mode:
> I started implementing this. But there are a couple of problems.
> - The lowerdev may not be a SRIOV VF and may not have
> anti spoof capability
> - Today, in non-PASSTHRU cases macvlan_handle_frame assumes that
> every macvlan device on top of the lowerdev has a single unique mac.
> And the macvlans are hashed on that single mac address.
> To support filtering for non-PASSTHRU mode in addition to this
> patch the following needs to be done:
> - non-passthru mode with a single macvlan over a lower dev
> can be treated as PASSTHRU case
> - For non-PASSTHRU mode with multiple macvlans over a single
> lower dev:
> - Multiple unicast mac's now need to be hashed to the
> same macvlan device. The macvlan hash needs to change
> for lookup based on any one of the multiple unicast
> addresses a macvlan is interested in
> - We need to consider vlans during the lookup too
> - So the macvlan device hash needs to hash on both mac
> and vlan
> - But the support for filtering in non-PASSTHRU mode can be
> built on this patch
>
> This patch series implements the following
> 01/8 rtnetlink: Netlink interface for setting MAC and VLAN filters
> 02/8 rtnetlink: Add rtnl link operations for MAC address and VLAN
> filtering
> 03/8 rtnetlink: Add support to set MAC/VLAN filters
> 04/8 rtnetlink: Add support to get MAC/VLAN filters
> 05/8 macvlan: Add support to set MAC/VLAN filter rtnl link operations
> 06/8 macvlan: Add support to get MAC/VLAN filter rtnl link operations
> 07/8 macvtap: Add support to set MAC/VLAN filter rtnl link operations
> 08/8 macvtap: Add support to get MAC/VLAN filter rtnl link operations
>
> Please comment. Thanks.
I have finished my preliminary evaluation and testing of these RFC patches and find that the features that we would require are present and functional. I haven't fully fleshed out the 'get' operations in my own driver yet but the 'set' routines are all doing what we need them to do.
Thanks for your work on this Roopa!
- Greg
>
> Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
> Signed-off-by: Christian Benvenuti <benve@cisco.com>
> Signed-off-by: David Wang <dwang2@cisco.com>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: DMA-API check_sync errors with 3.2
From: Neil Horman @ 2011-11-08 18:16 UTC (permalink / raw)
To: Josh Boyer; +Cc: Joerg Roedel, Dan Williams, netdev, linux-kernel, kernel-team
In-Reply-To: <20111108173153.GE14216@zod.bos.redhat.com>
On Tue, Nov 08, 2011 at 12:31:53PM -0500, Josh Boyer wrote:
> Hi All,
>
> We have a few reports coming in on 3.2 git snapshots and 3.2-rc1
> where the DMA-API reports an error about a driver trying to sync
> memory it has not allocated. I've seen a couple reports of this for
> sky2 and one for tg3, but I'm not sure if it's a driver problem or
> something a bit more generic. An example trace is below:
>
> backtrace:
> :WARNING: at lib/dma-debug.c:965 check_sync+0x2a8/0x530()
> :Hardware name: P5K-E
> :sky2 0000:02:00.0: DMA-API: device driver tries to sync DMA memory it has not
> allocated [device address=0x0000000105258040] [size=60 bytes]
> :Modules linked in: fuse lp parport ebtable_nat ebtables ipt_MASQUERADE
> iptable_nat nf_nat xt_CHECKSUM iptable_mangle tun bridge lockd stp llc
> ip6t_REJECT nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv6
> nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter xt_state
> nf_conntrack ip6_tables raid1 uvcvideo videodev snd_usb_audio
> snd_hda_codec_hdmi media v4l2_compat_ioctl32 snd_usbmidi_lib joydev snd_rawmidi
> snd_seq_device snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep
> snd_pcm snd_timer snd microcode i2c_i801 iTCO_wdt iTCO_vendor_support serio_raw
> sky2 asus_atk0110 soundcore snd_page_alloc configfs virtio_net kvm_intel kvm
> uinput sunrpc raid10 btrfs zlib_deflate libcrc32c ata_generic pata_acpi
> firewire_ohci firewire_core crc_itu_t pata_jmicron radeon ttm drm_kms_helper
> drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
> :Pid: 2520, comm: boinc Not tainted 3.2.0-0.rc0.git6.0.fc17.x86_64 #1
> :Call Trace:
> : <IRQ> [<ffffffff8107ce9f>] warn_slowpath_common+0x7f/0xc0
> : [<ffffffff8107cf96>] warn_slowpath_fmt+0x46/0x50
> : [<ffffffff81325658>] check_sync+0x2a8/0x530
> : [<ffffffff81311c8e>] ? random32+0x2e/0x40
> : [<ffffffff81325b62>] debug_dma_sync_single_for_cpu+0x42/0x50
> : [<ffffffff81192cac>] ? ksize+0x1c/0xc0
> : [<ffffffff813217cc>] ? is_swiotlb_buffer+0x3c/0x50
> : [<ffffffff81321fe8>] ? swiotlb_sync_single+0x38/0x80
> : [<ffffffff8132212c>] ? swiotlb_sync_single_for_cpu+0xc/0x10
> : [<ffffffffa0331873>] sky2_poll+0x573/0xd90 [sky2]
> : [<ffffffff815454e1>] ? net_rx_action+0xa1/0x460
> : [<ffffffff815455a9>] net_rx_action+0x169/0x460
> : [<ffffffff81020c89>] ? sched_clock+0x9/0x10
> : [<ffffffff810ab9b5>] ? sched_clock_local+0x25/0x90
> : [<ffffffff810858f8>] __do_softirq+0xc8/0x3a0
> : [<ffffffff810ab9b5>] ? sched_clock_local+0x25/0x90
> : [<ffffffff81685efc>] call_softirq+0x1c/0x30
> : [<ffffffff8101b385>] do_softirq+0xa5/0xe0
> : [<ffffffff81085f2e>] irq_exit+0xbe/0xf0
> : [<ffffffff816867d3>] do_IRQ+0x63/0xe0
> : [<ffffffff8167b673>] common_interrupt+0x73/0x73
> : <EOI> [<ffffffff8167b719>] ? retint_swapgs+0x13/0x1b
>
> From what I can tell, net_rx_action is calling dma_issue_pending_all at
> the end of the function and this is forcing the flush and check (though
> I really haven't figured out why), and it's being attributed to the driver.
>
> Originally there was a suggestion that c6a21d0b8d (dma-debug:
> hash_bucket_find needs to allow for offsets within an entry) would solve
> the issue, but that seems to have not proven true. We still see this on
> kernels that have that commit included. I've linked the bug reports below.
>
> Any ideas?
>
> josh
>
> https://bugzilla.redhat.com/show_bug.cgi?id=751005
> https://bugzilla.redhat.com/show_bug.cgi?id=751797
> https://bugzilla.redhat.com/show_bug.cgi?id=752113
>
>
The trace above looks to me like sky2_poll is calling receive_copy, which does
the pci_dma_sync_single_for_cpu, which resolves into the
swiotlb_sync_single_for_cpu call, and that triggers the warning.
Looking at the bugs, its hard to say exactly whats going on. The sky2 case
looks like it should have easily found the hash bucket it needed (Juding by the
nice page aligned address for the dma rangs 0x0000000105258040), but the tg3
case looks like the dma address is just bogus (from bz 7451005, the device
address is 0x00000000d1668988, which doesn't look like anything a dma engine or
iotlb should return as a dma address. It might mean that tg3 is breaking up
operations on its allocated dma ranges, which would cause this kind of warning,
but I cant' see where thats happening.
My first suggestion would be to, if you are able, instrument (via printk or
stap), get_hash_bucket and hash_bucket_find, so as to dump out the entire hash
table to the console when the failure occurs. I know that will cause lots of
performance issues, but if this can be reproduced in a non-production
environment, it would let us at least see if we're just not finding the right
entry, or if the right entry doesn't exist.
Neil
^ permalink raw reply
* [PATCH] tcp: fixes for DSACK-based undo of cwnd reduction during fast recovery
From: Neal Cardwell @ 2011-11-08 18:07 UTC (permalink / raw)
To: David Miller
Cc: netdev, ilpo.jarvinen, Nandita Dukkipati, Yuchung Cheng,
Tom Herbert, Neal Cardwell
Fixes for some issues that prevent DSACKs from allowing TCP senders to
undo cwnd reductions made during fast recovery.
There were a few related bugs/issues:
1) Senders ignored DSACKs after recovery when there were no
outstanding packets (a common scenario for HTTP servers).
2) When the ACK field is below snd_una (which can happen when ACKs are
reordered), senders ignored DSACKs (preventing undo) and passed up
chances to send out more packets based on any newly-SACKed packets.
3) Senders were overriding cwnd values picked during an undo by
calling tcp_moderate_cwnd() in tcp_try_to_open().
The fixes:
(1) When there are no outstanding packets (the "no_queue" goto label),
use DSACKs to undo congestion window reductions.
(2) When the ACK field is below snd_una (the "old_ack" goto label),
process any DSACKs and try to send out more packets based on
newly-SACKed packets.
(3) Don't moderate cwnd in tcp_try_to_open() if we're in TCP_CA_Open,
since doing so is generally unnecessary and specifically would
override a DSACK-based undo of a cwnd reduction made in fast recovery.
(4) Simplify the congestion avoidance state machine by removing the
behavior where SACK-enabled connections hung around in the
TCP_CA_Disorder state just waiting for DSACKs. Instead, when snd_una
advances to high_seq or beyond we typically move to TCP_CA_Open
immediately and allow an undo in either TCP_CA_Open or TCP_CA_Disorder
if we later receive enough DSACKs. Previously, SACK-enabled
connections hung around in TCP_CA_Disorder state while
snd_una==high_seq, just waiting to accumulate DSACKs and hopefully
undo a cwnd reduction. This could and did lead to the following
unfortunate scenario: if some incoming ACKs advance snd_una beyond
high_seq then we were setting undo_marker to 0 and moving to
TCP_CA_Open, so if (due to reordering in the ACK return path) we
shortly thereafter received a DSACK then we were no longer able to
undo the cwnd reduction.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
net/ipv4/tcp_input.c | 44 +++++++++++++++++++++++---------------------
1 files changed, 23 insertions(+), 21 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 52b5c2d..78dd38c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2858,7 +2858,7 @@ static void tcp_try_keep_open(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
int state = TCP_CA_Open;
- if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker)
+ if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
state = TCP_CA_Disorder;
if (inet_csk(sk)->icsk_ca_state != state) {
@@ -2881,7 +2881,8 @@ static void tcp_try_to_open(struct sock *sk, int flag)
if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
tcp_try_keep_open(sk);
- tcp_moderate_cwnd(tp);
+ if (inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
+ tcp_moderate_cwnd(tp);
} else {
tcp_cwnd_down(sk, flag);
}
@@ -3009,11 +3010,11 @@ static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
* tcp_xmit_retransmit_queue().
*/
static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
- int newly_acked_sacked, int flag)
+ int newly_acked_sacked, bool is_dupack,
+ int flag)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
- int is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
(tcp_fackets_out(tp) > tp->reordering));
int fast_rexmit = 0, mib_idx;
@@ -3066,17 +3067,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
}
break;
- case TCP_CA_Disorder:
- tcp_try_undo_dsack(sk);
- if (!tp->undo_marker ||
- /* For SACK case do not Open to allow to undo
- * catching for all duplicate ACKs. */
- tcp_is_reno(tp) || tp->snd_una != tp->high_seq) {
- tp->undo_marker = 0;
- tcp_set_ca_state(sk, TCP_CA_Open);
- }
- break;
-
case TCP_CA_Recovery:
if (tcp_is_reno(tp))
tcp_reset_reno_sack(tp);
@@ -3117,7 +3107,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
tcp_add_reno_sack(sk);
}
- if (icsk->icsk_ca_state == TCP_CA_Disorder)
+ if (icsk->icsk_ca_state <= TCP_CA_Disorder)
tcp_try_undo_dsack(sk);
if (!tcp_time_to_recover(sk)) {
@@ -3681,10 +3671,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
u32 prior_snd_una = tp->snd_una;
u32 ack_seq = TCP_SKB_CB(skb)->seq;
u32 ack = TCP_SKB_CB(skb)->ack_seq;
+ bool is_dupack = false;
u32 prior_in_flight;
u32 prior_fackets;
int prior_packets;
int prior_sacked = tp->sacked_out;
+ int pkts_acked = 0;
int newly_acked_sacked = 0;
int frto_cwnd = 0;
@@ -3757,6 +3749,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
/* See if we can take anything off of the retransmit queue. */
flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
+ pkts_acked = prior_packets - tp->packets_out;
newly_acked_sacked = (prior_packets - prior_sacked) -
(tp->packets_out - tp->sacked_out);
@@ -3771,8 +3764,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
tcp_may_raise_cwnd(sk, flag))
tcp_cong_avoid(sk, ack, prior_in_flight);
- tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
- newly_acked_sacked, flag);
+ is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
+ tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked,
+ is_dupack, flag);
} else {
if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
tcp_cong_avoid(sk, ack, prior_in_flight);
@@ -3784,6 +3778,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
return 1;
no_queue:
+ /* If data was DSACKed, see if we can undo a cwnd reduction. */
+ if (flag & FLAG_DSACKING_ACK)
+ tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked,
+ is_dupack, flag);
/* If this ack opens up a zero window, clear backoff. It was
* being used to time the probes, and is probably far higher than
* it needs to be for normal retransmission.
@@ -3797,10 +3795,14 @@ invalid_ack:
return -1;
old_ack:
+ /* If data was SACKed, tag it and see if we should send more data.
+ * If data was DSACKed, see if we can undo a cwnd reduction.
+ */
if (TCP_SKB_CB(skb)->sacked) {
- tcp_sacktag_write_queue(sk, skb, prior_snd_una);
- if (icsk->icsk_ca_state == TCP_CA_Open)
- tcp_try_keep_open(sk);
+ flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una);
+ newly_acked_sacked = tp->sacked_out - prior_sacked;
+ tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked,
+ is_dupack, flag);
}
SOCK_DEBUG(sk, "Ack %u before %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
--
1.7.3.1
^ permalink raw reply related
* Re: [PATCH 1/1] [SMSC75xx] Fix incorrect usage of NET_IP_ALIGN
From: David Miller @ 2011-11-08 17:37 UTC (permalink / raw)
To: ne; +Cc: netdev, phil, stable, steve.glendinning
In-Reply-To: <1320773440-30122-1-git-send-email-ne@erfurth.eu>
From: Nico Erfurth <ne@erfurth.eu>
Date: Tue, 8 Nov 2011 18:30:40 +0100
> The driver used NET_IP_ALIGN to remove some additional padding inside of
> the rx_fixup function. On many architectures NET_IP_ALIGN defaults to 2
> which removed the correct amount of bytes.
>
> On MCORE2-machines commit ea812ca1b06113597adcd8e70c0f84a413d97544
> introduces a change which sets NET_IP_ALIGN to 0 by default. Which
> triggered the bug on these machines.
>
> This fix introduces a new RXW_PADDING define and uses this instead of
> NET_IP_ALIGN. The name was taken from the original SMSC7500 driver which
> is provided by SMSC.
>
> Signed-off-by: Nico Erfurth <ne@erfurth.eu>
> Tested-by: Phil Sutter <phil@nwl.cc>
Applied.
^ permalink raw reply
* Re: [PATCH] r8169: increase the delay parameter of pm_schedule_suspend
From: David Miller @ 2011-11-08 17:37 UTC (permalink / raw)
To: romieu; +Cc: hayeswang, netdev, linux-kernel, rjw
In-Reply-To: <20111108090716.GA4885@electric-eye.fr.zoreil.com>
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Tue, 8 Nov 2011 10:07:16 +0100
> Hayes Wang <hayeswang@realtek.com> :
>> The link down would occur when reseting PHY. And it would take about 2 ~ 5 seconds
>> from link down to link up. If the delay of pm_schedule_suspend is not long enough,
>> the device would enter runtime_suspend before link up. After link up, the device
>> would wake up and reset PHY again. Then, you would find the driver keep in a loop
>> of runtime_suspend and rumtime_resume.
>
> Acked-by: Francois Romieu <romieu@fr.zoreil.com>
Applied.
^ permalink raw reply
* Re: [PATCH] ipv6: drop packets when source address is multicast
From: David Miller @ 2011-11-08 17:37 UTC (permalink / raw)
To: brian.haley; +Cc: netdev, divinekumar
In-Reply-To: <4EB93FA6.3090709@hp.com>
From: Brian Haley <brian.haley@hp.com>
Date: Tue, 08 Nov 2011 09:41:42 -0500
> RFC 4291 Section 2.7 says Multicast addresses must not be used as source
> addresses in IPv6 packets - drop them on input so we don't process the
> packet further.
>
> Signed-off-by: Brian Haley <brian.haley@hp.com>
> Reported-and-Tested-by: Kumar Sanghvi <divinekumar@gmail.com>
Applied.
^ permalink raw reply
* DMA-API check_sync errors with 3.2
From: Josh Boyer @ 2011-11-08 17:31 UTC (permalink / raw)
To: Neil Horman, Joerg Roedel, Dan Williams; +Cc: netdev, linux-kernel, kernel-team
Hi All,
We have a few reports coming in on 3.2 git snapshots and 3.2-rc1
where the DMA-API reports an error about a driver trying to sync
memory it has not allocated. I've seen a couple reports of this for
sky2 and one for tg3, but I'm not sure if it's a driver problem or
something a bit more generic. An example trace is below:
backtrace:
:WARNING: at lib/dma-debug.c:965 check_sync+0x2a8/0x530()
:Hardware name: P5K-E
:sky2 0000:02:00.0: DMA-API: device driver tries to sync DMA memory it has not
allocated [device address=0x0000000105258040] [size=60 bytes]
:Modules linked in: fuse lp parport ebtable_nat ebtables ipt_MASQUERADE
iptable_nat nf_nat xt_CHECKSUM iptable_mangle tun bridge lockd stp llc
ip6t_REJECT nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv6
nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter xt_state
nf_conntrack ip6_tables raid1 uvcvideo videodev snd_usb_audio
snd_hda_codec_hdmi media v4l2_compat_ioctl32 snd_usbmidi_lib joydev snd_rawmidi
snd_seq_device snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep
snd_pcm snd_timer snd microcode i2c_i801 iTCO_wdt iTCO_vendor_support serio_raw
sky2 asus_atk0110 soundcore snd_page_alloc configfs virtio_net kvm_intel kvm
uinput sunrpc raid10 btrfs zlib_deflate libcrc32c ata_generic pata_acpi
firewire_ohci firewire_core crc_itu_t pata_jmicron radeon ttm drm_kms_helper
drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
:Pid: 2520, comm: boinc Not tainted 3.2.0-0.rc0.git6.0.fc17.x86_64 #1
:Call Trace:
: <IRQ> [<ffffffff8107ce9f>] warn_slowpath_common+0x7f/0xc0
: [<ffffffff8107cf96>] warn_slowpath_fmt+0x46/0x50
: [<ffffffff81325658>] check_sync+0x2a8/0x530
: [<ffffffff81311c8e>] ? random32+0x2e/0x40
: [<ffffffff81325b62>] debug_dma_sync_single_for_cpu+0x42/0x50
: [<ffffffff81192cac>] ? ksize+0x1c/0xc0
: [<ffffffff813217cc>] ? is_swiotlb_buffer+0x3c/0x50
: [<ffffffff81321fe8>] ? swiotlb_sync_single+0x38/0x80
: [<ffffffff8132212c>] ? swiotlb_sync_single_for_cpu+0xc/0x10
: [<ffffffffa0331873>] sky2_poll+0x573/0xd90 [sky2]
: [<ffffffff815454e1>] ? net_rx_action+0xa1/0x460
: [<ffffffff815455a9>] net_rx_action+0x169/0x460
: [<ffffffff81020c89>] ? sched_clock+0x9/0x10
: [<ffffffff810ab9b5>] ? sched_clock_local+0x25/0x90
: [<ffffffff810858f8>] __do_softirq+0xc8/0x3a0
: [<ffffffff810ab9b5>] ? sched_clock_local+0x25/0x90
: [<ffffffff81685efc>] call_softirq+0x1c/0x30
: [<ffffffff8101b385>] do_softirq+0xa5/0xe0
: [<ffffffff81085f2e>] irq_exit+0xbe/0xf0
: [<ffffffff816867d3>] do_IRQ+0x63/0xe0
: [<ffffffff8167b673>] common_interrupt+0x73/0x73
: <EOI> [<ffffffff8167b719>] ? retint_swapgs+0x13/0x1b
>From what I can tell, net_rx_action is calling dma_issue_pending_all at
the end of the function and this is forcing the flush and check (though
I really haven't figured out why), and it's being attributed to the driver.
Originally there was a suggestion that c6a21d0b8d (dma-debug:
hash_bucket_find needs to allow for offsets within an entry) would solve
the issue, but that seems to have not proven true. We still see this on
kernels that have that commit included. I've linked the bug reports below.
Any ideas?
josh
https://bugzilla.redhat.com/show_bug.cgi?id=751005
https://bugzilla.redhat.com/show_bug.cgi?id=751797
https://bugzilla.redhat.com/show_bug.cgi?id=752113
^ permalink raw reply
* Re: dst->obsolete has become pointless
From: David Miller @ 2011-11-08 17:20 UTC (permalink / raw)
To: steffen.klassert; +Cc: netdev, timo.teras
In-Reply-To: <20111108093424.GG22141@secunet.com>
From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Tue, 8 Nov 2011 10:34:24 +0100
> I don't know what to do with DecNET, but we probaply need to decide
> about the future of dst->obsolete before we can fix the ipv4 PMTU
> problems. Possible fixes might depend on whether ->dst_check() is
> always invoked or not.
Simplest thing to do is to move dst->obsolete check into DecNET's
->dst_check() handler, then call ->dst_check() unconditionally.
Then we can just set dst->obsolete to zero for all route types,
and kill the "initial_obsolete" argument to dst_alloc() and
associated logic.
As things are currently implemented, because of how we elide the full
table scan and flush, ipv4 and ipv6 really need to do the serial
number check every time so we have to keep things as they are.
Things get more interesting with the routing cache removed, of course,
but that is still a long ways off. :-)
^ permalink raw reply
* Re: Add IPSec IP Range in Linux kernel
From: David Miller @ 2011-11-08 17:16 UTC (permalink / raw)
To: adobriyan
Cc: peter.p.waskiewicz.jr, danila.st, linux-kernel, netdev,
linux-crypto, linux-security-module
In-Reply-To: <CACVxJT887ATwt97iw7mPEkH+KXgy1WbsP_aUB_X-8uLQ+6YsLQ@mail.gmail.com>
From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Tue, 8 Nov 2011 14:08:24 +0200
> changing addr_match() is trivial for ipv4 and easy for ipv6. :-)
No, this is not happening. This added complexity screws up all the hash table
and lookup optimizations we have in the XFRM layer.
^ permalink raw reply
* Re: Add IPSec IP Range in Linux kernel
From: David Miller @ 2011-11-08 17:15 UTC (permalink / raw)
To: danila.st
Cc: peter.p.waskiewicz.jr, linux-kernel, netdev, linux-crypto,
linux-security-module
In-Reply-To: <E1RNhE5-0005rf-00.danila-st-mail-ru@f105.mail.ru>
From: Daniil Stolnikov <danila.st@mail.ru>
Date: Tue, 08 Nov 2011 12:40:13 +0400
> I turned to you, the developers, but rather to urge you to implement
> this feature using IP range.
This won't be implemented, the keys used for IPSEC rule lookups supported by
the kernel are already way too complex.
Ranges can be synthesized by userspace, and that's the way it has to
be supported.
^ permalink raw reply
* Re: [PATCH] ipv4: fix a bug in strict route gateway comparation.
From: David Miller @ 2011-11-08 17:07 UTC (permalink / raw)
To: lw; +Cc: netdev
In-Reply-To: <4EB8EBF8.1080508@cn.fujitsu.com>
From: Li Wei <lw@cn.fujitsu.com>
Date: Tue, 08 Nov 2011 16:44:40 +0800
> Since commit def57687 (ipv4: Elide use of rt->rt_dst in ip_forward())
> we use iph->daddr for strict route gateway comparation, Unfortunately
> skb_rtable(skb) has been updated in ip_options_rcv_srr() for the
> nexthop in SRR option but iph->daddr *not* updated, So rt->rt_dst is
> not equals to iph->daddr, We should use the updated rt->rt_dst instead.
>
> Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Same comments as the other patch, add whatever logic is necessary to
compute the destination address such that we don't need to add back
an rt->rt_dst reference here.
^ permalink raw reply
* Re: [PATCH] ipv4: fix a bug in SRR option matching.
From: David Miller @ 2011-11-08 17:06 UTC (permalink / raw)
To: lw; +Cc: netdev
In-Reply-To: <4EB8E0B8.40500@cn.fujitsu.com>
From: Li Wei <lw@cn.fujitsu.com>
Date: Tue, 08 Nov 2011 15:56:40 +0800
> Since commit 7be799a7 (ipv4: Remove rt->rt_dst reference from
> ip_forward_options()) and commit 0374d9ce (ipv4: Kill spurious
> write to iph->daddr in ip_forward_options()) we use iph->daddr
> for SRR option matching and assume iph->daddr equals to rt->rt_dst,
> Unfortunately skb_rtable(skb) has been updated in ip_options_rcv_srr()
> for the nexthop in SRR option but iph->daddr *not* updated,
> We should use the updated rt->rt_dst for SRR option matching
> and update iph->daddr here.
>
> Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Please replace this by whatever logic ip_options_rcv_srr() uses to
determine the destination address.
I would strongly encourage you, when fixing bugs like this, to use
as a hint the intentions of the commit which introduced the bug. And
try as hard as possible to retain the goals of the guilty commit.
In this case, that means not introducing references to rt->rt_dst
back into the code.
Thank you.
^ permalink raw reply
* Re: [stable] net: Handle different key sizes between address families in flow cache
From: Greg KH @ 2011-11-08 16:53 UTC (permalink / raw)
To: Kim Phillips
Cc: David Miller, stable, eric.dumazet, zheng.z.yan, netdev,
David Ward
In-Reply-To: <20111104212958.5d54152b38050e2a0917d575@freescale.com>
On Fri, Nov 04, 2011 at 09:29:58PM -0500, Kim Phillips wrote:
> On Fri, 4 Nov 2011 14:24:39 -0700
> Greg KH <greg@kroah.com> wrote:
>
> > On Fri, Nov 04, 2011 at 02:46:59PM -0500, Kim Phillips wrote:
> > > commit aa1c366e4febc7f5c2b84958a2dd7cd70e28f9d0 upstream.
> > >
> > > With the conversion of struct flowi to a union of AF-specific structs, some
> > > operations on the flow cache need to account for the exact size of the key.
> > >
> > > Signed-off-by: David Ward <david.ward@ll.mit.edu>
> > > Signed-off-by: David S. Miller <davem@davemloft.net>
> > > Cc: stable@vger.kernel.org # v3.0.x
> > > ---
> > > This patch is the result of a clean cherry-pick onto v3.0.8.
> > > It restores IPSec fwding performance from ~4kpps back to ~44kpps on
> > > a P2020DS board.
> >
> > Too bad you forgot to build this patch after you applied it:
> > CC init/main.o
> > In file included from include/linux/security.h:39:0,
> > from init/main.c:32:
> > include/net/flow.h: In function 'flow_key_size':
> > include/net/flow.h:174:3: error: size of unnamed array is negative
> > include/net/flow.h:177:3: error: size of unnamed array is negative
> >
> > Please be kind to your poor over-worked stable kernel maintainer and do
> > the decent thing and TEST YOUR PATCH before you ask him to accept it.
> >
> > bah, I think it's time for the weekend to start a bit earlier than normal...
>
> so sorry I hadn't tested 64-bit.
>
> I found the problem - upstream commit aa1c366 depends on
> upstream commit 728871b:
>
> commit 728871bc05afc8ff310b17dba3e57a2472792b13
> Author: David Ward <david.ward@ll.mit.edu>
> Date: Mon Sep 5 16:47:23 2011 +0000
>
> net: Align AF-specific flowi structs to long
>
> AF-specific flowi structs are now passed to flow_key_compare, which must
> also be aligned to a long.
>
> Signed-off-by: David Ward <david.ward@ll.mit.edu>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> so, one more time, in stable sign-off area format:
>
> Cc: <stable@vger.kernel.org> # v3.0.x: 728871b: net: Align AF-specific flowi structs to long
> Cc: <stable@vger.kernel.org> # v3.0.x
>
> build tested on 32- and 64-bit powerpc, and ARCH=x86
> {i386,x86_64}_defconfig.
Can you please resend the original patch I needed to apply here as well?
It's long-gone in my deleted inbox.
thanks,
greg k-h
^ permalink raw reply
* [PATCH 1/1] [SMSC75xx] Fix incorrect usage of NET_IP_ALIGN
From: Nico Erfurth @ 2011-11-08 17:30 UTC (permalink / raw)
To: netdev; +Cc: Phil Sutter, stable, Steve Glendinning, Nico Erfurth
The driver used NET_IP_ALIGN to remove some additional padding inside of
the rx_fixup function. On many architectures NET_IP_ALIGN defaults to 2
which removed the correct amount of bytes.
On MCORE2-machines commit ea812ca1b06113597adcd8e70c0f84a413d97544
introduces a change which sets NET_IP_ALIGN to 0 by default. Which
triggered the bug on these machines.
This fix introduces a new RXW_PADDING define and uses this instead of
NET_IP_ALIGN. The name was taken from the original SMSC7500 driver which
is provided by SMSC.
Signed-off-by: Nico Erfurth <ne@erfurth.eu>
Tested-by: Phil Sutter <phil@nwl.cc>
---
drivers/net/usb/smsc75xx.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/net/usb/smsc75xx.c b/drivers/net/usb/smsc75xx.c
index 22a7cf9..a5b9b12 100644
--- a/drivers/net/usb/smsc75xx.c
+++ b/drivers/net/usb/smsc75xx.c
@@ -51,6 +51,7 @@
#define USB_VENDOR_ID_SMSC (0x0424)
#define USB_PRODUCT_ID_LAN7500 (0x7500)
#define USB_PRODUCT_ID_LAN7505 (0x7505)
+#define RXW_PADDING 2
#define check_warn(ret, fmt, args...) \
({ if (ret < 0) netdev_warn(dev->net, fmt, ##args); })
@@ -1088,13 +1089,13 @@ static int smsc75xx_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
memcpy(&rx_cmd_b, skb->data, sizeof(rx_cmd_b));
le32_to_cpus(&rx_cmd_b);
- skb_pull(skb, 4 + NET_IP_ALIGN);
+ skb_pull(skb, 4 + RXW_PADDING);
packet = skb->data;
/* get the packet length */
- size = (rx_cmd_a & RX_CMD_A_LEN) - NET_IP_ALIGN;
- align_count = (4 - ((size + NET_IP_ALIGN) % 4)) % 4;
+ size = (rx_cmd_a & RX_CMD_A_LEN) - RXW_PADDING;
+ align_count = (4 - ((size + RXW_PADDING) % 4)) % 4;
if (unlikely(rx_cmd_a & RX_CMD_A_RED)) {
netif_dbg(dev, rx_err, dev->net,
--
1.7.3.4
^ permalink raw reply related
* [PATCH 0/2] sctp: Patches for fasthandoff
From: Michio Honda @ 2011-11-08 16:23 UTC (permalink / raw)
To: David Miller, Wei Yongjun; +Cc: netdev
>From 0c0019a3d219bd900bedbadfa6a3b3733d9ca49c Mon Sep 17 00:00:00 2001
From: Michio Honda <micchie@sfc.wide.ad.jp>
Date: Wed, 9 Nov 2011 00:54:52 +0900
Subject: [PATCH 0/2] sctp: Patches for fasthandoff
Series of 2 patches for retransmission immediately after the address reconfiguration including disconnected period.
Michio Honda (2):
sctp: fasthandoff with ASCONF at mobile-node
sctp: fasthandoff with ASCONF at server-node
include/net/sctp/structs.h | 1 +
net/sctp/sm_make_chunk.c | 4 +++-
net/sctp/sm_sideeffect.c | 8 +++++++-
net/sctp/transport.c | 16 ++++++++++++++++
4 files changed, 27 insertions(+), 2 deletions(-)
--
1.7.3.2
^ permalink raw reply
* [PATCH 1/2] sctp: fasthandoff with ASCONF at mobile-node
From: Michio Honda @ 2011-11-08 16:23 UTC (permalink / raw)
To: David Miller, Wei Yongjun; +Cc: netdev
>From d42e880c330693ceb0e03a48d1e4347b190d0cce Mon Sep 17 00:00:00 2001
From: Michio Honda <micchie@sfc.wide.ad.jp>
Date: Fri, 17 Jun 2011 11:03:23 +0900
Subject: [PATCH 1/2] sctp: fasthandoff with ASCONF at mobile-node
Fast retransmission after changing the last address
with ASCONF negotiation
Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
---
include/net/sctp/structs.h | 1 +
net/sctp/sm_make_chunk.c | 4 +++-
net/sctp/transport.c | 16 ++++++++++++++++
3 files changed, 20 insertions(+), 1 deletions(-)
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index e90e7a9..3382615 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1085,6 +1085,7 @@ void sctp_transport_burst_reset(struct sctp_transport *);
unsigned long sctp_transport_timeout(struct sctp_transport *);
void sctp_transport_reset(struct sctp_transport *);
void sctp_transport_update_pmtu(struct sctp_transport *, u32);
+void sctp_transport_immediate_rtx(struct sctp_transport *);
/* This is the structure we use to queue packets as they come into
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 0121e0a..a85eeeb 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -3400,8 +3400,10 @@ int sctp_process_asconf_ack(struct sctp_association *asoc,
asconf_len -= length;
}
- if (no_err && asoc->src_out_of_asoc_ok)
+ if (no_err && asoc->src_out_of_asoc_ok) {
asoc->src_out_of_asoc_ok = 0;
+ sctp_transport_immediate_rtx(asoc->peer.primary_path);
+ }
/* Free the cached last sent asconf chunk. */
list_del_init(&asconf->transmitted_list);
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index 394c57c..3889330 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -641,3 +641,19 @@ void sctp_transport_reset(struct sctp_transport *t)
t->cacc.next_tsn_at_change = 0;
t->cacc.cacc_saw_newack = 0;
}
+
+/* Schedule retransmission on the given transport */
+void sctp_transport_immediate_rtx(struct sctp_transport *t)
+{
+ /* Stop pending T3_rtx_timer */
+ if (timer_pending(&t->T3_rtx_timer)) {
+ (void)del_timer(&t->T3_rtx_timer);
+ sctp_transport_put(t);
+ }
+ sctp_retransmit(&t->asoc->outqueue, t, SCTP_RTXR_T3_RTX);
+ if (!timer_pending(&t->T3_rtx_timer)) {
+ if (!mod_timer(&t->T3_rtx_timer, jiffies + t->rto))
+ sctp_transport_hold(t);
+ }
+ return;
+}
--
1.7.3.2
^ permalink raw reply related
* [PATCH 2/2] sctp: fasthandoff with ASCONF at server-node
From: Michio Honda @ 2011-11-08 16:23 UTC (permalink / raw)
To: David Miller, Wei Yongjun; +Cc: netdev
>From 0c0019a3d219bd900bedbadfa6a3b3733d9ca49c Mon Sep 17 00:00:00 2001
From: Michio Honda <micchie@sfc.wide.ad.jp>
Date: Fri, 17 Jun 2011 11:22:35 +0900
Subject: [PATCH 2/2] sctp: fasthandoff with ASCONF at server-node
Retransmit chunks to newly confirmed destination when ASCONF and
HEARTBEAT negotiation has success with a single-homed peer.
Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
---
net/sctp/sm_sideeffect.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index 76388b0..1ff51c9 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -666,6 +666,7 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
struct sctp_chunk *chunk)
{
sctp_sender_hb_info_t *hbinfo;
+ int was_unconfirmed = 0;
/* 8.3 Upon the receipt of the HEARTBEAT ACK, the sender of the
* HEARTBEAT should clear the error counter of the destination
@@ -692,9 +693,11 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
/* Mark the destination transport address as active if it is not so
* marked.
*/
- if ((t->state == SCTP_INACTIVE) || (t->state == SCTP_UNCONFIRMED))
+ if ((t->state == SCTP_INACTIVE) || (t->state == SCTP_UNCONFIRMED)) {
+ was_unconfirmed = 1;
sctp_assoc_control_transport(asoc, t, SCTP_TRANSPORT_UP,
SCTP_HEARTBEAT_SUCCESS);
+ }
/* The receiver of the HEARTBEAT ACK should also perform an
* RTT measurement for that destination transport address
@@ -712,6 +715,9 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
/* Update the heartbeat timer. */
if (!mod_timer(&t->hb_timer, sctp_transport_timeout(t)))
sctp_transport_hold(t);
+
+ if (was_unconfirmed && asoc->peer.transport_count == 1)
+ sctp_transport_immediate_rtx(t);
}
--
1.7.3.2
^ permalink raw reply related
* Re[2]: [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw
From: Hans Schillstrom @ 2011-11-08 15:12 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Hans Schillstrom, kaber, jengelh, netfilter-devel, netdev
>
>On Tue, Nov 08, 2011 at 12:29:53AM +0100, Hans Schillstrom wrote:
>> >We prefer skb_header_pointer instead. If conntrack is enabled, we can
>> >benefit from defragmention.
>>
>> In our case conntrack will not be there
>
>Yes, but if conntrack is there, we benefit from fragment reassembly if
>you use skb_header_pointer.
>
>> >Please, replace all pskb_may_pull by skb_header_pointer in this code.
>> >
>> >We can assume that the IP header is linear (not fragmented).
>>
>> I ran in to this issue in IPv6 testing so I got a little bit "paranoid".
>> Are you sure that the embedded IP and L4 header in the ICMP msg also is unfragmented.
>> Is this true for both IPv6 & IPv4 ?
>
>No sorry. I was refering to normal IP header in one packet.
>
>> From what I remember when I was testing IPv6 icmp and digged into the original header (on a 2.6.32 kernel)
>> pskb_may_pull was needed.
>
>Yes, it is indeed needed.
>
>> [snip]
[snip]
>
>Welcome, let's see if we can get this into 3.3 since we cannot make it
>for 3.2.
>
>BTW, do you have some number of this running with and without
>conntrack? It would be interesting to have.
I didn't save them, but I can make a new benchmark later on.
Regards
Hans
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox