Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] e100: Fix the TX workqueue race
From: Jeff Garzik @ 2010-04-23 16:20 UTC (permalink / raw)
  To: Alan Cox; +Cc: e1000-devel, netdev
In-Reply-To: <20100423143356.7092.45260.stgit@localhost.localdomain>

On 04/23/2010 10:34 AM, Alan Cox wrote:
> I'd assumed someone would have picked up on this and fixed it using rtnl_lock
> as was suggested but it seems to have fallen through the cracks ?
>
> Anyway this is I assume what was meant ?
>
> ---
>
> Nothing stops the workqueue being left to run in parallel with close or a
> few other operations. This causes double unmaps and the like.
>
> See kerneloops.org #1041230 for an example
>
> Signed-off-by: Alan Cox<alan@linux.intel.com>

Acked-by: Jeff Garzik <jgarzik@redhat.com>

Glad someone finally fixed this, it has bugged me for years...



------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: [PATCH] NIU support for skb->rxhash
From: Tom Herbert @ 2010-04-23 15:32 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20100423.011456.48472321.davem@davemloft.net>

> I looked into implementing this and it doesn't work.  The
> problem is GRO want's to look into the packet very early
> and we want to batch GRO a set of packets into a big packet
> before shooting them over to a remote cpu.
>

Can you reconsider? :-)  The majority of our servers see packet loads
which don't allow for much batching (a lot of small RPC messages), so
for those GRO is mostly unnecessary overhead and mechanisms that
improve unbatched packet performance are compelling.  Also, if a
device already does LRO, I don't see that GRO could add a lot of value
anyway.

Tom

> This reminds me that we can start using ->rxhash as a quick
> mismatch check in the GRO flow matcher.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH 1/7] Topcliff GbE: Add The Main code
From: Arnd Bergmann @ 2010-04-23 15:27 UTC (permalink / raw)
  To: Masayuki Ohtake; +Cc: NETDEV, Wang, Yong Y, Wang, Qi, Intel OTC, Andrew

On Friday 23 April 2010, Masayuki Ohtake wrote:
> From: Masayuki Ohtake <masa-korg@dsn.okisemi.com>
> 
> This patch adds the Main code of GbE driver for Topcliff.
> The GbE driver needs all patch[1/7 to 7/7].
> 
> Signed-off-by: Masayuki Ohtake <masa-korg@dsn.okisemi.com>

I already commented on the "Topcliff PHUB: Add The Packet Hub driver"
submission. Many of my comments there apply here as well, but
there are a few more things that you may want to address in
future submissions:

> +static int
> +pch_gbe_probe(struct pci_dev *pdev, const struct pci_device_id *pci_id);
> +static void pch_gbe_remove(struct pci_dev *pdev);
> +static int pch_gbe_suspend(struct pci_dev *pdev, pm_message_t state);
> +static int pch_gbe_resume(struct pci_dev *pdev);

Ideally, static functions are ordered such that the caller is
last, so you can drop all of the forward declarations like these.

> +/*!
> + * @ingroup PCI driver Layer
> + * @struct  pch_gbe_pcidev_id
> + * @brief   PCI Device ID Table
> + * @remarks
> + *  This is an instance of pci_device_id structure defined in linux/pci.h,
> + *  and holds information of the PCI devices that are supported by this
> driver.
> + */
> +static const struct pci_device_id pch_gbe_pcidev_id[3] = {
> + {.vendor = PCI_VENDOR_ID_INTEL,
> +  .device = PCI_DEVICE_ID_INTEL_IOH1_GBE,
> +  .subvendor = PCI_ANY_ID,
> +  .subdevice = PCI_ANY_ID,
> +  .class = (PCI_CLASS_NETWORK_ETHERNET << 8),
> +  .class_mask = (0xFFFF00)
> +  },
> + /* required last entry */
> + {0}
> +};

Your array size above is three, but you only define two members.
Better may the array automatically sized. Also, it's clearer to
use the PCI_DEVICE_CLASS() helper macro, e.g.

static const struct pci_device_id pch_gbe_pcidev_id[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IOH1_GBE) },
	{ 0 },
};

	Arnd

^ permalink raw reply

* Re: [PATCH 1/7] Topcliff GbE: Add The Main code [1/3]
From: Stephen Hemminger @ 2010-04-23 15:26 UTC (permalink / raw)
  To: Masayuki Ohtake; +Cc: NETDEV, Wang, Yong Y, Wang, Qi, Intel OTC, Andrew
In-Reply-To: <002d01cae2dd$1f6e42d0$66f8800a@maildom.okisemi.com>

On Fri, 23 Apr 2010 20:56:25 +0900
"Masayuki Ohtake" <masa-korg@dsn.okisemi.com> wrote:

Even though the patch was sent as an attachment, long lines were
wrapped.

Do you want this to go directly to kernel, or do you want help
fixing coding issues by submitting to staging tree?

The code uses a comment style that is kind of like the existing
docbook comment style; why not convert it to use the official
docbook style for examples look at other kernel code:
/**
 *	dev_alloc_skb - allocate an skbuff for receiving
 *	@length: length to allocate
 *
 *	Allocate a new &sk_buff and assign it a usage count of one. The
 *	buffer has unspecified headroom built in. Users should allocate
 *	the headroom they think they need without accounting for the
 *	built in space. The built in space is used for optimisations.
 *
 *	%NULL is returned if there is no free memory. Although this function
 *	allocates memory it can be called from an interrupt.
 */

The code is also indented with a non-standard indentation format.
Please read Documentation/CodingStyle.  Indentation is supposed
to be 4 characters (and using tabs of 8 characters).


The PCI device table could be changed to:

static DEFINE_PCI_DEVICE_TABLE(pch_gbe_pcidev_id) = {
    { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IOH1_GBE) },
    { 0 }
};
Also PCI_DEVICE_ID_INTEL_IOH1_GBE is not defined anywhere I can see.

^ permalink raw reply

* Re: [PATCH 1/4] Fix acquiring socket lock before reading RTNETLINK response
From: Dan Smith @ 2010-04-23 15:24 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1272034539-19899-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

DS> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

I went too deep with my send-email command line.  This patch isn't
related to the others, so please disregard.

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply

* [PATCH] e100: Fix the TX workqueue race
From: Alan Cox @ 2010-04-23 14:34 UTC (permalink / raw)
  To: netdev, e1000-devel

I'd assumed someone would have picked up on this and fixed it using rtnl_lock
as was suggested but it seems to have fallen through the cracks ?

Anyway this is I assume what was meant ?

---

Nothing stops the workqueue being left to run in parallel with close or a
few other operations. This causes double unmaps and the like.

See kerneloops.org #1041230 for an example

Signed-off-by: Alan Cox <alan@linux.intel.com>
---

 drivers/net/e100.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)


diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 3e8d000..859e833 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -2280,8 +2280,13 @@ static void e100_tx_timeout_task(struct work_struct *work)
 
 	netif_printk(nic, tx_err, KERN_DEBUG, nic->netdev,
 		     "scb.status=0x%02X\n", ioread8(&nic->csr->scb.status));
-	e100_down(netdev_priv(netdev));
-	e100_up(netdev_priv(netdev));
+
+	rtnl_lock();
+	if (netif_running(dev)) {
+		e100_down(netdev_priv(netdev));
+		e100_up(netdev_priv(netdev));
+	}
+	rtnl_unlock();
 }
 
 static int e100_loopback_test(struct nic *nic, enum loopback loopback_mode)


^ permalink raw reply related

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Herbert Xu @ 2010-04-23 15:05 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100423021000.GA21777@gondor.apana.org.au>

On Fri, Apr 23, 2010 at 10:10:00AM +0800, Herbert Xu wrote:
> 
> I will post an updated patch later today to deal with that.

Just got back from a business trip.

This stuff is more broken than I thought.  For example, we perform
a number of actions when DAD succeeds, e.g., joining an anycast
group.  However, this is not synchronised with respect to address
deletion at all, so if DAD succeeds just as someone deletes the
address, you can easily get stuck on that anycast group.

I will try to untangle this mess tomorrow.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] Topcliff PHUB: Add The Packet Hub driver [1/2]
From: Arnd Bergmann @ 2010-04-23 14:57 UTC (permalink / raw)
  To: Masayuki Ohtake; +Cc: NETDEV, Wang, Yong Y, Wang, Qi, andrew.chih.howe.khor
In-Reply-To: <000501cae2eb$de97e950$66f8800a@maildom.okisemi.com>

On Friday 23 April 2010, Masayuki Ohtake wrote:
>From: Masayuki Ohtake <masa-korg@dsn.okisemi.com>
>
>This patch adds the Packet Hub driver for Topcliff.
>Patch created against 2.6.33.1

Thank you for your submission. Since this is the first time
you posted this patch, there is naturally a lot that needs
to be improved on before the driver can be merged. I'll
go through this in detail so you can improve it for the
next submission.

First of all, you need to decide where you want the code
to go. The quickest path is to get it into drivers/staging,
which can accept code that does not yet follow the kernel
coding style or may still need changes to its user interface,
see http://www.kroah.com/log/linux/linux-staging-update.html

If you want to go straight into the regular supported driver
directory, that is also ok but requires more upfront work on
coding style and addressing review comments like the ones
I make below. Addressing the comments typically means that
you either change your code as suggested or that you argue
that your version is actually correct.

The patch description above needs to tell the reader what
the driver actually does and what the hardware is for.

>Signed-off-by: Masayuki Ohtake <masa-korg@dsn.okisemi.com>
>---
> drivers/char/Kconfig                                 |    7++
> drivers/char/Makefile                                |    2
> drivers/char/pch_phub/Makefile                       |    9
> drivers/char/pch_phub/pch_common.h                   |    147
> drivers/char/pch_phub/pch_debug.h                    |    58
> drivers/char/pch_phub/pch_phub.c                     |    375
> drivers/char/pch_phub/pch_phub.h                     |    195
> drivers/char/pch_phub/pch_phub_hal.c                 |    544
> drivers/char/pch_phub/pch_phub_hal.h                 |    124
> drivers/char/pch_phub/pch_phub_pci.c                 |    499
>+++++++++++++++++++++++++++++++ 10 files changed, 1960 insertions(+)

You have submitted this driver to the netdev list, but put
the code into drivers/char. If this is really a network driver,
it should probably go into drivers/net, otherwise it needs to be
reviewed on the main linux-kernel mailing list.

If you want it to be applied as a staging driver first, change
the code location to drivers/staging first a

>diff -urN linux-2.6.33.1/drivers/char/Kconfig
>topcliff-2.6.33.1/drivers/char/Kconfig
>--- linux-2.6.33.1/drivers/char/Kconfig 2010-03-16 01:09:39.000000000 +0900
>+++ topcliff-2.6.33.1/drivers/char/Kconfig 2010-04-14 18:19:10.000000000
>+0900
>@@ -4,6 +4,13 @@

Evidently, your email client breaks line-wrapping. This means that it's
not possibly to apply the patch. Please see Documentation/email-clients.txt
on how to fix this.

To make sure this does not happen again, you can send the patch to yourself
and try to apply it from the email as a test before you send it to a
mailing list.

> menu "Character devices"
>
>+config PCH_PHUB
>+        tristate "PCH PHUB"
>+        depends on PCI
>+        help
>+          If you say yes to this option, support will be included for the
>+          PCH Packet Hub Host controller.
>+
> config VT
>  bool "Virtual terminal" if EMBEDDED
>  depends on !S390

This description also should be more helpful. This could be the same
text that you will put in the patch description above.

>diff -urN linux-2.6.33.1/drivers/char/pch_phub/pch_common.h
>topcliff-2.6.33.1/drivers/char/pch_phub/pch_common.h
>--- linux-2.6.33.1/drivers/char/pch_phub/pch_common.h 1970-01-01
>09:00:00.000000000 +0900
>+++ topcliff-2.6.33.1/drivers/char/pch_phub/pch_common.h 2010-04-14
>15:29:48.000000000 +0900
>@@ -0,0 +1,147 @@
>+/*!
>+ * @file pch_common.h
>+ * @brief Provides the macro definitions used by all files.
>+ * @version 1.0.0.0
>+ * @section

You seem to use a tool for processing the comments into documentation,
which is good. However, the syntax you use is incompatible with the
one used in the kernel and should be changed to the 'kerneldoc'
format, see Documentation/kernel-doc-nano-HOWTO.txt.

>+/*! @ingroup Global
>+@def      PCH_WRITE8
>+@brief   Macro for writing 8 bit data to an io/mem address
>+*/
>+#define PCH_WRITE8(val, addr)   iowrite8((val), (void __iomem *)(addr))
>+/*! @ingroup Global
>+@def      PCH_LOG
>+@brief   Macro for writing 16 bit data to an io/mem address
>+*/
>+#define PCH_WRITE16(val, addr)  iowrite16((val), (void __iomem *)(addr))
>+/*! @ingroup Global
>+@def      PCH_LOG
>+@brief   Macro for writing 32 bit data to an io/mem address

In general, wrapping kernel functions in a driver specific macro that
does not do anything else is discouraged. It's best to delete these
macros and change the code to use the underlying interfaces directly.

>+#ifndef __PCH_DEBUG_H__
>+#define __PCH_DEBUG_H__
>+
>+#ifdef MODULE
>+#define PCH_LOG(level, fmt, args...) printk(level "%s:" fmt "\n",\
>+       THIS_MODULE->name, ##args)
>+#else
>+#define PCH_LOG(level, fmt, args...) printk(level "%s:" fmt "\n" ,\
>+        __FILE__, ##args)
>+#endif
>+
>+
>+#ifdef DEBUG
>+ #define PCH_DEBUG(fmt, args...) PCH_LOG(KERN_DEBUG, fmt, ##args)
>+#else
>+ #define PCH_DEBUG(fmt, args...)
>+#endif
>+
>+#ifdef PCH_TRACE_ENABLED
>+ #define PCH_TRACE PCH_DEBUG
>+#else
>+ #define PCH_TRACE(fmt, args...)
>+#endif
>+
>+#define PCH_TRACE_ENTER PCH_TRACE("Enter %s", __func__)
>+#define PCH_TRACE_EXIT  PCH_TRACE("Exit %s", __func__)

For these macros, we already have existing interfaces in the kernel,
you should remove yours and use dev_dbg, dev_info, pr_debug etc.

The tracing functions can probably just be removed here. If you
feel that you need tracing, please take a look at the kernel
tracing subsystem. There is an excellent series of articles
about tracing at http://lwn.net/Articles/383362/.

>+/**
>+ * file_operations structure initialization
>+ */
>+const struct file_operations pch_phub_fops = {
>+ .owner = THIS_MODULE,
>+ .open = pch_phub_open,
>+ .release = pch_phub_release,
>+ .ioctl = pch_phub_ioctl,
>+};

New code should use the 'unlocked_ioctl' callback instead of 'ioctl'.

The whitespace in this patch is damaged: indentation of code should
be done with tabs instead of spaces. It's not clear if the code is
written like this, or it was damaged by your email client.

If the problem is part of your actual code, have a look at
Documentation/CodingStyle for how it should look like.

>+/*function implementations*/
>+
>+/*! @ingroup PHUB_InterfaceLayerAPI
>+  @fn  int pch_phub_open( struct inode *inode,struct file *file)
>+  @remarks  Implements the Initializing and opening of the Packet Hub
>module.
>+  @param  inode  [@ref INOUT] Contains the reference of the inode
>+         structure
>+  @param  file  [@ref INOUT] Contains the reference of the file structure
>+  @retval returnvalue [@ref OUT] contains the result for the concerned
>+         attempt.
>+  The result would generally comprise of success code
>+  or failure code. The failure code will indicate reason for
>+  failure.
>+  @see
>+  EBUSY
>+  */
>+int pch_phub_open(struct inode *inode, struct file *file)
>+{

Since this is not an exported interface, it the function should
probably be marked 'static'.

>+ int ret_value = PCH_PHUB_SUCCESS;
>+ struct pch_phub_reqt *p_pch_phub_reqt;
>+ unsigned long addr_offset;
>+ unsigned long data;
>+ unsigned long mask;
>+
>+ do {
>+  if (pch_phub_suspended == true) {
>+   PCH_LOG(KERN_ERR, "pch_phub_ioctl : "
>+    "suspend initiated returning =%d\n",
>+    PCH_PHUB_FAIL);
>+   ret_value = PCH_PHUB_FAIL;
>+   break;
>+  }

Using the do { ... } while (0) construct in this way is not wrong,
but unconventional. The way this is normally done in Linux is to
have a goto target at the end.

The macros PCH_PHUB_SUCCESS and PCH_PHUB_FAIL should probably
be removed, because they are not standard error codes. By convention,
you should return 0 for success in ioctl or one of the errno.h
values for failure, as you do below.

>+  p_pch_phub_reqt = (struct pch_phub_reqt *)arg;
>+  ret_value =
>+   copy_from_user((void *)&addr_offset,
>+    (void *)&p_pch_phub_reqt->addr_offset,
>+    sizeof(addr_offset));
>+  if (ret_value) {
>+   PCH_LOG(KERN_ERR, "pch_phub_ioctl : "
>+    "copy_from_user fail returning =%d\n",
>+    -EFAULT);
>+   ret_value = -EFAULT;
>+   break;
>+  }

It is much easier to use get_user() here than copy_from_user.

The definition of struct pch_phub_reqt is problematic because
it contains members of type 'unsigned long'. This means that
a 32 bit user process uses a different data structure than a 
64 bit kernel.

Ideally, you only pass a single integer as an ioctl argument.
There are cases where it needs to be a data structure, but
if that happens, you should only use members types like __u32
or __u64, not long or pointer.

>+  switch (cmd) {
>+  case IOCTL_PHUB_READ_REG:
>+   {
>+
>+    pch_phub_read_reg(addr_offset, &data);
>+    PCH_DEBUG("pch_phub_ioctl  : Invoked "
>+     "pch_phub_read_reg successfully\n");
>+
>+    ret_value =
>+        copy_to_user((void *)&p_pch_phub_reqt->
>+       data, (void *)&data,
>+       sizeof(data));
>+    if (ret_value) {
>+     PCH_LOG(KERN_ERR, "pch_phub_ioctl : "
>+     "copy_to_user fail returning =%d\n",
>+      -EFAULT);
>+     ret_value = -EFAULT;
>+     break;
>+    }
>+    break;
>+   }
>+
>+  case IOCTL_PHUB_WRITE_REG:
>+   {
>+
>+    ret_value =
>+        copy_from_user((void *)&data,
>+         (void *)&p_pch_phub_reqt->
>+         data, sizeof(data));
>+    if (ret_value) {
>+     PCH_LOG(KERN_ERR, "pch_phub_ioctl : "
>+     "copy_from_user fail returning =%d\n",
>+      -EFAULT);
>+     ret_value = -EFAULT;
>+     break;
>+    }
>+    pch_phub_write_reg(addr_offset, data);
>+    PCH_DEBUG("pch_phub_ioctl  : Invoked "
>+     "pch_phub_write_reg successfully\n");
>+    break;
>+   }

My feeling is that this ioctl interface is too
low-level in general. You only export access to specific
registers, not to functionality exposed by them.
The best kernel interfaces are defined in a way that
is independent of the underlying hardware and
convert generic commands into device specific commands.

If you really want to allow direct register access,
a simpler way would be to map the memory into the user
address space using the mmap() operation and not
provide any ioctl commands.

I don't see any range cheching on the addr_offset
argument, which means that a malicious user can use this
function to access not only your device, but any data
in the kernel address space.

Note that your open count does not protect the
hardware from concurrent access, because a file
descriptor can be shared by multiple user threads.
You can probably safely drop that count.

>+/*! @ingroup PHUB_InterfaceLayer
>+  @def IOCTL_PHUB_READ_REG
>+  @brief Outlines the read register function signature.
>+  */
>+#define IOCTL_PHUB_READ_REG (_IOW(PHUB_IOCTL_MAGIC, 1, unsigned long))

If I read your code correctly, you actually pass a struct pch_phub_reqt
argument, not an unsigned long argument, so this definition should be
changed accordingly. The same applies to the other ioctl commands.


Your patch continues in a second email, which is not how you normally
split submissions. When there is a logical separation between parts
of the driver, make multiple patches, each with a separate description
of what the patch does.

In case of this driver, that does not seem necessary. In fact, it seems
to me like it is simple enough to become a single source file, which
would simplify it even more because you no longer need header files
defining the interface between parts of the driver.

	Arnd

^ permalink raw reply

* [PATCH 4/4] C/R: inet4 and inet6 unicast routes
From: Dan Smith @ 2010-04-23 14:55 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1272034539-19899-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

This patch adds support for checkpointing and restoring route information.
It keeps enough information to restore basic routes at the level of detail
of /proc/net/route.  It uses RTNETLINK to extract the information during
checkpoint and also to insert it back during restore.  This gives us a
nice layer of isolation between us and the various "fib" implementations.

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 include/linux/checkpoint_hdr.h |   31 +++
 net/checkpoint_dev.c           |  412 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 442 insertions(+), 1 deletions(-)

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 633c9b0..187d706 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -23,6 +23,7 @@
 #include <sys/un.h>
 #include <netinet/in.h>
 #endif
+#include <linux/if.h>
 
 /*
  * /usr/include/linux/security.h is not exported to userspace, so
@@ -783,6 +784,7 @@ struct ckpt_hdr_file_socket {
 struct ckpt_hdr_netns {
 	struct ckpt_hdr h;
 	__s32 this_ref;
+	__u32 routes;
 } __attribute__((aligned(8)));
 
 enum ckpt_netdev_types {
@@ -837,6 +839,35 @@ struct ckpt_netdev_addr {
 	} __attribute__((aligned(8)));
 } __attribute__((aligned(8)));
 
+enum ckpt_route_types {
+	CKPT_ROUTE_IPV4,
+	CKPT_ROUTE_IPV6,
+	CKPT_ROUTE_MAX
+};
+
+#define CKPT_ROUTE_FLAG_GW 1
+
+struct ckpt_route {
+	__u16 type;
+	__u16 flags;
+
+	union {
+		struct {
+			__be32 inet4_len;          /* mask length (bits) */
+			__u32  inet4_met;          /* metric             */
+			__be32 inet4_dst;          /* route address      */
+			__be32 inet4_gwy;          /* gateway address    */
+		};
+		struct {
+			__u32 inet6_len;           /* mask length (bits) */
+			__u32 inet6_met;           /* metric             */
+			struct in6_addr inet6_dst; /* route address      */
+			struct in6_addr inet6_gwy; /* gateway address    */
+		};
+	} __attribute__((aligned(8)));
+	char dev[IFNAMSIZ+1];
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_eventpoll_items {
 	struct ckpt_hdr h;
 	__s32  epfile_objref;
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
index df8b16a..b34d1f2 100644
--- a/net/checkpoint_dev.c
+++ b/net/checkpoint_dev.c
@@ -17,9 +17,11 @@
 #include <linux/checkpoint_hdr.h>
 #include <linux/deferqueue.h>
 #include <linux/module.h>
+#include <linux/fib_rules.h>
 
 #include <net/net_namespace.h>
 #include <net/sch_generic.h>
+#include <net/ipv6.h>
 
 struct veth_newlink {
 	char *peer;
@@ -107,6 +109,22 @@ static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
 	return ret;
 }
 
+static void debug_route(struct ckpt_route *route)
+{
+	if (route->type == CKPT_ROUTE_IPV4)
+		ckpt_debug("inet4 route %pI4/%i gw %pI4 metric %i dev %s\n",
+			   &route->inet4_dst, route->inet4_len,
+			   &route->inet4_gwy, route->inet4_met,
+			   route->dev);
+	else if (route->type == CKPT_ROUTE_IPV6)
+		ckpt_debug("inet6 route %pI6/%i gw %pI6 metric %i dev %s\n",
+			   &route->inet6_dst, route->inet6_len,
+			   &route->inet6_gwy, route->inet6_met,
+			   route->dev);
+	else
+		ckpt_debug("unknown route type %i\n", route->type);
+}
+
 static struct socket *rtnl_open(struct net *net)
 {
 	struct socket *sock;
@@ -313,11 +331,236 @@ int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
 	return ret;
 }
 
+static int rtnl_dump_routes(struct socket *rtnl, int family)
+{
+	struct sk_buff *skb;
+	struct rtmsg *rtm;
+	int flags = NLM_F_ROOT | NLM_F_REQUEST;
+	struct msghdr msg;
+	struct kvec kvec;
+	struct nlmsghdr *nlh;
+	int ret = -ENOMEM;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_GETROUTE, sizeof(*rtm), flags);
+	if (!nlh)
+		goto out;
+
+	rtm = nlmsg_data(nlh);
+	memset(rtm, 0, sizeof(*rtm));
+	rtm->rtm_family = family;
+
+	nlmsg_end(skb, nlh);
+
+	memset(&msg, 0, sizeof(msg));
+	kvec.iov_len = skb->len;
+	kvec.iov_base = skb->head;
+
+	ret = kernel_sendmsg(rtnl, &msg, &kvec, 1, kvec.iov_len);
+	if ((ret >= 0) && (ret != skb->len))
+		ret = -EIO;
+ out:
+	kfree_skb(skb);
+	return ret;
+}
+
+static int rtnl_process_inet4_route(struct net *net,
+				    struct rtmsg *rtm,
+				    struct nlattr **tb,
+				    struct ckpt_route *route)
+{
+	if (rtm->rtm_type != RTN_UNICAST)
+		return 0; /* skip non-unicast routes */
+
+	route->type = CKPT_ROUTE_IPV4;
+	route->inet4_len = rtm->rtm_dst_len;
+
+	if (tb[RTA_DST])
+		route->inet4_dst = htonl(nla_get_u32(tb[RTA_DST]));
+	if (tb[RTA_GATEWAY]) {
+		route->flags |= CKPT_ROUTE_FLAG_GW;
+		route->inet4_gwy = htonl(nla_get_u32(tb[RTA_GATEWAY]));
+	}
+	if (tb[RTA_PRIORITY])
+		route->inet4_met = nla_get_u32(tb[RTA_PRIORITY]);
+
+	if (tb[RTA_OIF]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_index(net, nla_get_u32(tb[RTA_OIF]));
+		if (dev) {
+			strncpy(route->dev, dev->name, IFNAMSIZ);
+			dev_put(dev);
+		}
+	}
+
+	debug_route(route);
+
+	return 1; /* save this route */
+}
+
+static int rtnl_process_inet6_route(struct net *net,
+				    struct rtmsg *rtm,
+				    struct nlattr **tb,
+				    struct ckpt_route *route)
+{
+	if (rtm->rtm_type != RTN_UNICAST)
+		return 0; /* skip non-unicast routes */
+
+	route->type = CKPT_ROUTE_IPV6;
+	route->inet6_len = rtm->rtm_dst_len;
+
+	if (tb[RTA_DST])
+		ipv6_addr_copy(&route->inet6_dst, nla_data(tb[RTA_DST]));
+	if (tb[RTA_GATEWAY]) {
+		route->flags |= CKPT_ROUTE_FLAG_GW;
+		ipv6_addr_copy(&route->inet6_gwy, nla_data(tb[RTA_GATEWAY]));
+	}
+	if (tb[RTA_PRIORITY])
+		route->inet6_met = nla_get_u32(tb[RTA_PRIORITY]);
+
+	if (tb[RTA_OIF]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_index(net, nla_get_u32(tb[RTA_OIF]));
+		if (dev) {
+			strncpy(route->dev, dev->name, IFNAMSIZ);
+			dev_put(dev);
+		}
+	}
+
+	debug_route(route);
+
+	return 1;
+}
+
+static int rtnl_process_routes(struct net *net,
+			       struct nlmsghdr *nlh, int len,
+			       struct ckpt_route *routes,
+			       int idx, int max)
+{
+	struct nlmsghdr *i;
+
+	for (i = nlh; NLMSG_OK(i, len); i = NLMSG_NEXT(i, len)) {
+		struct ckpt_route *route = &routes[idx];
+		struct rtmsg *rtm = NLMSG_DATA(i);
+		struct nlattr *tb[FRA_MAX+1];
+		int ret;
+
+		if (idx >= max)
+			return -E2BIG;
+
+		if (i->nlmsg_type == NLMSG_DONE)
+			break;
+		else if (nlh->nlmsg_type != RTM_NEWROUTE) {
+			struct nlmsgerr *errmsg = nlmsg_data(nlh);
+			return errmsg->error;
+		}
+
+		ret = nlmsg_parse(i, sizeof(*rtm), tb, FRA_MAX, NULL);
+		if (ret < 0)
+			return ret;
+
+		memset(route, 0, sizeof(*route));
+
+		if (rtm->rtm_family == AF_INET)
+			ret = rtnl_process_inet4_route(net, rtm, tb, route);
+		else if (rtm->rtm_family == AF_INET6)
+			ret = rtnl_process_inet6_route(net, rtm, tb, route);
+		else
+			ret = 0; /* skip */
+		if (ret < 0)
+			return ret;
+		else if (ret)
+			idx += 1;
+	}
+
+	return idx;
+}
+
+static int rtnl_get_routes(struct net *net, int family,
+			   struct ckpt_route *routes, int idx, int max)
+{
+	int ret;
+	long timeo = MAX_SCHEDULE_TIMEOUT;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb = NULL;
+	struct socket *rtnl = NULL;
+
+	rtnl = rtnl_open(net);
+	if (IS_ERR(rtnl))
+		return PTR_ERR(rtnl);
+
+	ret = rtnl_dump_routes(rtnl, family);
+	if (ret < 0)
+		goto out;
+
+	lock_sock(rtnl->sk);
+	ret = sk_wait_data(rtnl->sk, &timeo);
+	if (ret)
+		skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
+	release_sock(rtnl->sk);
+	if (!skb) {
+		ret = -EIO;
+		goto out;
+	}
+
+	nlh = nlmsg_hdr(skb);
+	if (!nlh) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = rtnl_process_routes(net, nlh, skb->len, routes, idx, max);
+ out:
+	rtnl_close(rtnl);
+	kfree_skb(skb);
+	return ret;
+}
+
+int checkpoint_netns_routes(struct ckpt_ctx *ctx, struct net *net,
+			    struct ckpt_route **_routes)
+{
+	struct ckpt_route *routes = NULL;
+	int max = 32;
+	int idx;
+	int families[] = {AF_INET, AF_INET6, 0};
+	int family;
+ retry:
+	idx = 0;
+	kfree(routes);
+	routes = kmalloc(max * sizeof(*routes), GFP_KERNEL);
+	if (!routes)
+		return -ENOMEM;
+
+	for (family = 0; families[family]; family++) {
+		idx = rtnl_get_routes(net, families[family], routes, idx, max);
+		if (idx == -E2BIG) {
+			max *= 2;
+			goto retry;
+		} else if (idx < 0)
+			break;
+	}
+
+	if (idx < 0) {
+		kfree(routes);
+		routes = NULL;
+		ckpt_err(ctx, idx, "error saving routes\n");
+	}
+	*_routes = routes;
+
+	return idx;
+}
+
 int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 {
 	struct net *net = ptr;
 	struct net_device *dev;
 	struct ckpt_hdr_netns *h;
+	struct ckpt_route *routes = NULL;
 	int ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
@@ -327,10 +570,19 @@ int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
 	BUG_ON(h->this_ref <= 0);
 
+	ret = checkpoint_netns_routes(ctx, net, &routes);
+	if (ret < 0)
+		goto out;
+	h->routes = ret;
+
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 	if (ret < 0)
 		goto out;
 
+	ret = ckpt_write_buffer(ctx, routes, h->routes * sizeof(*routes));
+	if (ret < 0)
+		goto out;
+
 	for_each_netdev(net, dev) {
 		if (dev->netdev_ops->ndo_checkpoint)
 			ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
@@ -347,6 +599,7 @@ int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
 	}
  out:
 	ckpt_hdr_put(ctx, h);
+	kfree(routes);
 
 	return ret;
 }
@@ -862,10 +1115,145 @@ void *restore_netdev(struct ckpt_ctx *ctx)
 	return dev;
 }
 
+static int rtnl_restore_route(struct net *net, struct ckpt_route *route)
+{
+	struct sk_buff *skb;
+	struct rtmsg *rtm;
+	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
+	struct nlmsghdr *nlh;
+	int ret = 0;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_NEWROUTE, sizeof(*rtm), flags);
+	if (!nlh) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	rtm = nlmsg_data(nlh);
+	memset(rtm, 0, sizeof(*rtm));
+
+	rtm->rtm_table = RT_TABLE_MAIN;
+	rtm->rtm_protocol = RTPROT_BOOT;
+	rtm->rtm_scope = RT_SCOPE_UNIVERSE;
+	rtm->rtm_type = RTN_UNICAST;
+
+	if (route->dev[0]) {
+		struct net_device *dev;
+
+		dev = dev_get_by_name(net, route->dev);
+		if (!dev) {
+			ckpt_debug("unable to find dev %s for route\n",
+				   route->dev);
+			ret = -EINVAL;
+			goto out;
+		}
+		nla_put_u32(skb, RTA_OIF, dev->ifindex);
+		dev_put(dev);
+	}
+
+	if (route->type == CKPT_ROUTE_IPV4) {
+		rtm->rtm_family = AF_INET;
+		rtm->rtm_dst_len = route->inet4_len;
+
+		nla_put_u32(skb, RTA_DST, route->inet4_dst);
+		if (route->flags & CKPT_ROUTE_FLAG_GW)
+			nla_put_u32(skb, RTA_GATEWAY, route->inet4_gwy);
+		nla_put_u32(skb, RTA_PRIORITY, route->inet4_met);
+	} else if (route->type == CKPT_ROUTE_IPV6) {
+		int len = sizeof(route->inet6_dst);
+
+		if (ipv6_addr_scope(&route->inet6_dst))
+			goto out; /* Skip non-global scope routes */
+
+		rtm->rtm_family = AF_INET6;
+		rtm->rtm_dst_len = route->inet6_len;
+
+		nla_put(skb, RTA_DST, len, &route->inet6_dst);
+		if (route->flags & CKPT_ROUTE_FLAG_GW)
+			nla_put(skb, RTA_GATEWAY, len, &route->inet6_gwy);
+		nla_put_u32(skb, RTA_PRIORITY, route->inet6_met);
+	} else {
+		ckpt_debug("unsupported route type %i\n", route->type);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	nlmsg_end(skb, nlh);
+
+	debug_route(route);
+
+	ret = rtnl_do(net, skb);
+ out:
+	kfree_skb(skb);
+	return ret;
+}
+
+static int restore_routes(struct net *net, struct ckpt_route *routes, int count)
+{
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < count; i++) {
+		struct ckpt_route *route = &routes[i];
+
+		ret = rtnl_restore_route(net, route);
+		if (ret == -EEXIST)
+			/* Some routes have been implied by device addresses */
+			continue;
+		else if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
+struct dq_routes {
+	struct ckpt_ctx *ctx;
+	struct net *net;
+	struct ckpt_route *routes;
+	int count;
+};
+
+static int deferred_restore_routes(void *data)
+{
+	struct dq_routes *dq = data;
+	int ret;
+
+	ret = restore_routes(dq->net, dq->routes, dq->count);
+	if (ret < 0)
+		ckpt_err(dq->ctx, ret, "failed to restore routes\n");
+
+	kfree(dq->routes);
+
+	return ret;
+}
+
+static int defer_restore_routes(struct ckpt_ctx *ctx,
+				struct net *net,
+				struct ckpt_route *routes,
+				int count)
+{
+	struct dq_routes dq;
+
+	dq.ctx = ctx;
+	dq.net = net;
+	dq.routes = routes;
+	dq.count = count;
+
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      deferred_restore_routes, NULL);
+}
+
 void *restore_netns(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_netns *h;
 	struct net *net;
+	struct ckpt_route *routes = NULL;
+	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
 	if (IS_ERR(h)) {
@@ -873,12 +1261,34 @@ void *restore_netns(struct ckpt_ctx *ctx)
 		return h;
 	}
 
+	ret = ckpt_read_payload(ctx, (void **)&routes,
+				h->routes * sizeof(*routes), CKPT_HDR_BUFFER);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "Unable to read routes buffer\n");
+		net = ERR_PTR(ret);
+		goto out;
+	}
+
 	if (h->this_ref != 0) {
 		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
 		if (IS_ERR(net))
 			goto out;
-	} else
+
+		ret = defer_restore_routes(ctx, net, routes, h->routes);
+		if (ret < 0) {
+			kfree(routes);
+			put_net(net);
+			net = ERR_PTR(ret);
+		}
+	} else {
+		if (h->routes) {
+			net = ERR_PTR(-EINVAL);
+			ckpt_err(ctx, -EINVAL,
+				 "Parent netns claims to have routes\n");
+			goto out;
+		}
 		net = current->nsproxy->net_ns;
+	}
  out:
 	ckpt_hdr_put(ctx, h);
 
-- 
1.6.2.5

^ permalink raw reply related

* [PATCH 3/4] C/R: Make rtnl_open() and rtnl_do() take and pass a netns pointer
From: Dan Smith @ 2010-04-23 14:55 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1272034539-19899-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

(also make rtnl_do() return negative or 0, not the message length)

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 net/checkpoint_dev.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
index 2787892..df8b16a 100644
--- a/net/checkpoint_dev.c
+++ b/net/checkpoint_dev.c
@@ -107,12 +107,13 @@ static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
 	return ret;
 }
 
-static struct socket *rtnl_open(void)
+static struct socket *rtnl_open(struct net *net)
 {
 	struct socket *sock;
 	int ret;
 
-	ret = sock_create(AF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE, &sock);
+	ret = sock_create_kern_net(AF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE,
+				   net, &sock);
 	if (ret < 0)
 		return ERR_PTR(ret);
 
@@ -538,7 +539,7 @@ static struct sk_buff *del_link_msg(char *name)
 	return skb;
 }
 
-static int rtnl_do(struct sk_buff *skb)
+static int rtnl_do(struct net *net, struct sk_buff *skb)
 {
 	int ret = -ENOMEM;
 	struct socket *rtnl = NULL;
@@ -551,7 +552,7 @@ static int rtnl_do(struct sk_buff *skb)
 	kvec.iov_len = skb->len;
 	kvec.iov_base = skb->head;
 
-	rtnl = rtnl_open();
+	rtnl = rtnl_open(net);
 	if (IS_ERR(rtnl)) {
 		ret = PTR_ERR(rtnl);
 		ckpt_debug("Unable to open rtnetlink socket: %i\n", ret);
@@ -570,7 +571,8 @@ static int rtnl_do(struct sk_buff *skb)
 	if (IS_ERR(nlh)) {
 		ret = PTR_ERR(nlh);
 		ckpt_debug("RTNETLINK said: %i\n", ret);
-	}
+	} else
+		ret = 0;
  out:
 	rtnl_close(rtnl);
 	kfree_skb(rskb);
@@ -590,7 +592,7 @@ static struct net_device *rtnl_newlink(new_link_fn fn, void *data, char *name)
 		return ERR_PTR(PTR_ERR(skb));
 	}
 
-	ret = rtnl_do(skb);
+	ret = rtnl_do(current->nsproxy->net_ns, skb);
 	kfree_skb(skb);
 	if (ret < 0)
 		return ERR_PTR(ret);
@@ -610,7 +612,7 @@ static int rtnl_dellink(char *name)
 		return PTR_ERR(skb);
 	}
 
-	ret = rtnl_do(skb);
+	ret = rtnl_do(current->nsproxy->net_ns, skb);
 	kfree_skb(skb);
 
 	return ret;
-- 
1.6.2.5

^ permalink raw reply related

* [PATCH 2/4] [RFC] Add sock_create_kern_net()
From: Dan Smith @ 2010-04-23 14:55 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1272034539-19899-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

This helper allows kernel routines to create a socket in a given netns,
instead of forcing it to the initial or current one.

I know this seems like it's violating the netns boundary.  The intended
use (as in the following patches) is specifically when talking to RTNETLINK
in another netns for the purposes of creating or examining resources there.
It is expected that this will be used for that sort of transient socket
creation only.  In other words:

  s = sock_create_kern_net(AF_NETLINK, ..., other_netns, ...);
  rtnl_talk(s);
  close(s);

If this is acceptable, I will actually be able to clean up and simplify
other bits of the net checkpoint code to make better use of RTNL for
examining and restoring resources.

Perhaps we should assert that family == AF_NETLINK (or maybe just
printk(KERN_WARN) if it is not) to prevent abuse of this call?

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 include/linux/net.h |    2 ++
 net/socket.c        |    6 ++++++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 9548e45..9cfc899 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -235,6 +235,8 @@ extern int	     sock_create(int family, int type, int proto,
 				 struct socket **res);
 extern int	     sock_create_kern(int family, int type, int proto,
 				      struct socket **res);
+extern int	     sock_create_kern_net(int family, int type, int protocol,
+				      struct net *net, struct socket **res);
 extern int	     sock_create_lite(int family, int type, int proto,
 				      struct socket **res); 
 extern void	     sock_release(struct socket *sock);
diff --git a/net/socket.c b/net/socket.c
index 3253c04..95c94a7 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1294,6 +1294,12 @@ int sock_create_kern(int family, int type, int protocol, struct socket **res)
 	return __sock_create(&init_net, family, type, protocol, res, 1);
 }
 
+int sock_create_kern_net(int family, int type, int protocol,
+			 struct net *net, struct socket **res)
+{
+	return __sock_create(net, family, type, protocol, res, 1);
+}
+
 SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 {
 	int retval;
-- 
1.6.2.5

^ permalink raw reply related

* [PATCH 1/4] Fix acquiring socket lock before reading RTNETLINK response
From: Dan Smith @ 2010-04-23 14:55 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1272034539-19899-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 net/checkpoint_dev.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
index 7ccb899..2787892 100644
--- a/net/checkpoint_dev.c
+++ b/net/checkpoint_dev.c
@@ -136,11 +136,12 @@ static struct nlmsghdr *rtnl_get_response(struct socket *rtnl,
 
 	*skb = NULL;
 
+	lock_sock(rtnl->sk);
 	ret = sk_wait_data(rtnl->sk, &timeo);
-	if (!ret)
-		return ERR_PTR(-EPIPE);
+	if (ret)
+		*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
+	release_sock(rtnl->sk);
 
-	*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
 	if (!*skb)
 		return ERR_PTR(-EPIPE);
 
-- 
1.6.2.5

^ permalink raw reply related

* Checkpoint and Restart of INET routing information
From: Dan Smith @ 2010-04-23 14:55 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA

This set extends the existing network socket, device, and namespace support
in the checkpoint tree to cover routing information.  It does so by making
heavy use of RTNETLINK to dump and insert routes much like userspace would.
Because the task doing the checkpointing or restarting needs to examine
or setup resources for tasks in network namespaces other than its own, an
additional kernel socket setup call is added.  It provides us the ability
to talk to RTNETLINK in a foreign netns.

The support added in this set allows me to configure various inet4 and inet6
routes in a container and have them saved and restored successfully during
a checkpoint/restart process.

^ permalink raw reply

* Re: [patch] sctp: cleanup: remove unneeded null check
From: Vlad Yasevich @ 2010-04-23 14:28 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Sridhar Samudrala, David S. Miller, Wei Yongjun, Chris Dischino,
	linux-sctp, netdev, kernel-janitors
In-Reply-To: <20100423115906.GE29093@bicker>



Dan Carpenter wrote:
> "chunk" can never be null here.  We dereferenced it earlier in the
> function and also at the start of the function we passed it to 
> sctp_pack_cookie() which dereferences it.
> 
> This code has been around since the dawn of git history so if "chunk"
> were ever null someone would have complained about it.
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 17cb400..52352fc 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -470,8 +470,7 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
>  	 *
>  	 * [INIT ACK back to where the INIT came from.]
>  	 */
> -	if (chunk)
> -		retval->transport = chunk->transport;
> +	retval->transport = chunk->transport;
>  

Actually, this code can be completely removed as we already make this assignment
earlier:

        /* Per the advice in RFC 2960 6.4, send this reply to
         * the source of the INIT packet.
         */
        retval->transport = chunk->transport;
        retval->subh.init_hdr =
                sctp_addto_chunk(retval, sizeof(initack), &initack);

-vlad

>  nomem_chunk:
>  	kfree(cookie);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-04-23 13:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney
In-Reply-To: <Pine.LNX.4.64.1004231440190.840@ask.diku.dk>

Jesper Dangaard Brouer wrote:
> On Fri, 23 Apr 2010, Patrick McHardy wrote:
> 
>> That sounds like a good idea. But lets what for Jesper's test results
>> before we start fixing this problem :)
> 
> I will first have time to perform the tests Monday or Tuesday.
> 
> BUT I have just noticed there seems to be a corrolation between
> conntrack early_drop and searches.  I have upload a new graph:
> 
>  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_early_drop002.png

I guess that's somewhat expected when your conntrack table is full
and all you're seeing is new connection setup attempts.

First you have a search for an existing conntrack, then it attempts
to create a new one and tries to early_drop and old one.

^ permalink raw reply

* [PATCH] Topcliff PHUB: Add The Packet Hub driver [2/2]
From: Masayuki Ohtake @ 2010-04-23 13:49 UTC (permalink / raw)
  To: NETDEV; +Cc: Wang, Yong Y, Wang, Qi, NETDEV, andrew.chih.howe.khor

[-- Attachment #1: Type: message/partial, Size: 27156 bytes --]

^ permalink raw reply

* [PATCH] Topcliff PHUB: Add The Packet Hub driver [1/2]
From: Masayuki Ohtake @ 2010-04-23 13:49 UTC (permalink / raw)
  To: NETDEV; +Cc: Wang, Yong Y, Wang, Qi, NETDEV, andrew.chih.howe.khor

[-- Attachment #1: Type: message/partial, Size: 39163 bytes --]

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Miles Lane @ 2010-04-23 12:50 UTC (permalink / raw)
  To: paulmck
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100422160144.GC2524@linux.vnet.ibm.com>

Hi Paul,
There has been a bit of back and forth, and I am not sure what patches
I should test now.
Could you send me a bundle of whatever needs testing now?

I currently have a build of 2.6.34-rc5-git3 with the same patch I
tested before applied.
I notice a few minor differences in the warnings given.  I suspect
these do not indicate
new issues, since the trace from <IRQ> through <EOI> is the same as before.

[   60.174809] [ INFO: suspicious rcu_dereference_check() usage. ]
[   60.174812] ---------------------------------------------------
[   60.174816] net/mac80211/sta_info.c:886 invoked
rcu_dereference_check() without protection!
[   60.174820]
[   60.174821] other info that might help us debug this:
[   60.174822]
[   60.174825]
[   60.174826] rcu_scheduler_active = 1, debug_locks = 1
[   60.174829] no locks held by wpa_supplicant/3973.
[   60.174832]
[   60.174833] stack backtrace:
[   60.174838] Pid: 3973, comm: wpa_supplicant Not tainted 2.6.34-rc5-git3 #19
[   60.174841] Call Trace:
[   60.174844]  <IRQ>  [<ffffffff81067faa>] lockdep_rcu_dereference+0x9d/0xa5
[   60.174873]  [<ffffffffa014e9ae>]
ieee80211_find_sta_by_hw+0x46/0x10f [mac80211]
[   60.174886]  [<ffffffffa014ea8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
[   60.174902]  [<ffffffffa01a60f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
[   60.174909]  [<ffffffff81068417>] ? mark_lock+0x2d/0x235
[   60.174920]  [<ffffffffa01d5f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
[   60.174927]  [<ffffffff8120a2d3>] ? is_swiotlb_buffer+0x2e/0x3b
[   60.174936]  [<ffffffffa01cebf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
[   60.174943]  [<ffffffff810688f0>] ? trace_hardirqs_on_caller+0xfa/0x13f
[   60.174952]  [<ffffffffa01cf3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
[   60.174959]  [<ffffffff810411df>] tasklet_action+0xa7/0x10f
[   60.174965]  [<ffffffff810421f1>] __do_softirq+0x144/0x252
[   60.174972]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
[   60.174977]  [<ffffffff810050e4>] do_softirq+0x38/0x80
[   60.174982]  [<ffffffff81041cbe>] irq_exit+0x45/0x94
[   60.174987]  [<ffffffff81004829>] do_IRQ+0xad/0xc4
[   60.174994]  [<ffffffff813cfb13>] ret_from_intr+0x0/0xf
[   60.174997]  <EOI>  [<ffffffff810e5114>] ? kmem_cache_alloc+0xa9/0x15f
[   60.175010]  [<ffffffff81342182>] ? __alloc_skb+0x3d/0x155
[   60.175016]  [<ffffffff81342182>] __alloc_skb+0x3d/0x155
[   60.175023]  [<ffffffff8133d237>] sock_alloc_send_pskb+0xc0/0x2e5
[   60.175030]  [<ffffffff8133d46c>] sock_alloc_send_skb+0x10/0x12
[   60.175036]  [<ffffffff813b1ab5>] unix_stream_sendmsg+0x117/0x2e2
[   60.175044]  [<ffffffff811bdca8>] ? avc_has_perm+0x57/0x69
[   60.175050]  [<ffffffff8133b892>] ? sock_aio_write+0x0/0xcf
[   60.175056]  [<ffffffff813392c2>] __sock_sendmsg+0x59/0x64
[   60.175062]  [<ffffffff8133b94d>] sock_aio_write+0xbb/0xcf
[   60.175069]  [<ffffffff810e98b1>] do_sync_readv_writev+0xbc/0xfb
[   60.175077]  [<ffffffff811c1726>] ? selinux_file_permission+0xa2/0xaf
[   60.175082]  [<ffffffff810e9638>] ? copy_from_user+0x2a/0x2c
[   60.175089]  [<ffffffff811baf85>] ? security_file_permission+0x11/0x13
[   60.175095]  [<ffffffff810ea64e>] do_readv_writev+0xa2/0x122
[   60.175101]  [<ffffffff810ead3b>] ? fcheck_files+0x8f/0xc9
[   60.175107]  [<ffffffff810ea70c>] vfs_writev+0x3e/0x49
[   60.175113]  [<ffffffff810ea7f2>] sys_writev+0x45/0x8e
[   60.175119]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

[   60.223213] [ INFO: suspicious rcu_dereference_check() usage. ]
[   60.223216] ---------------------------------------------------
[   60.223221] net/mac80211/sta_info.c:886 invoked
rcu_dereference_check() without protection!
[   60.223224]
[   60.223225] other info that might help us debug this:
[   60.223227]
[   60.223230]
[   60.223230] rcu_scheduler_active = 1, debug_locks = 1
[   60.223234] no locks held by udisks-daemon/4398.
[   60.223236]
[   60.223237] stack backtrace:
[   60.223242] Pid: 4398, comm: udisks-daemon Not tainted 2.6.34-rc5-git3 #19
[   60.223245] Call Trace:
[   60.223249]  <IRQ>  [<ffffffff81067faa>] lockdep_rcu_dereference+0x9d/0xa5
[   60.223275]  [<ffffffffa014e9fe>]
ieee80211_find_sta_by_hw+0x96/0x10f [mac80211]
[   60.223288]  [<ffffffffa014ea8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
[   60.223304]  [<ffffffffa01a60f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
[   60.223310]  [<ffffffff81068417>] ? mark_lock+0x2d/0x235
[   60.223321]  [<ffffffffa01d5f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
[   60.223329]  [<ffffffff8120a2d3>] ? is_swiotlb_buffer+0x2e/0x3b
[   60.223338]  [<ffffffffa01cebf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
[   60.223344]  [<ffffffff810688f0>] ? trace_hardirqs_on_caller+0xfa/0x13f
[   60.223353]  [<ffffffffa01cf3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
[   60.223360]  [<ffffffff810411df>] tasklet_action+0xa7/0x10f
[   60.223367]  [<ffffffff810421f1>] __do_softirq+0x144/0x252
[   60.223374]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
[   60.223379]  [<ffffffff810050e4>] do_softirq+0x38/0x80
[   60.223384]  [<ffffffff81041cbe>] irq_exit+0x45/0x94
[   60.223389]  [<ffffffff81004829>] do_IRQ+0xad/0xc4
[   60.223396]  [<ffffffff813cfb13>] ret_from_intr+0x0/0xf
[   60.223399]  <EOI>  [<ffffffff810e34f1>] ? kmem_cache_free+0xb0/0x134
[   60.223412]  [<ffffffff810f391a>] ? putname+0x2d/0x36
[   60.223417]  [<ffffffff810f391a>] putname+0x2d/0x36
[   60.223423]  [<ffffffff810f5536>] user_path_at+0x5f/0x8e
[   60.223429]  [<ffffffff81068671>] ? mark_held_locks+0x52/0x70
[   60.223435]  [<ffffffff810e34ee>] ? kmem_cache_free+0xad/0x134
[   60.223441]  [<ffffffff8106890a>] ? trace_hardirqs_on_caller+0x114/0x13f
[   60.223447]  [<ffffffff81068942>] ? trace_hardirqs_on+0xd/0xf
[   60.223454]  [<ffffffff810ed93f>] vfs_fstatat+0x32/0x5d
[   60.223460]  [<ffffffff810ed9bb>] vfs_lstat+0x19/0x1b
[   60.223465]  [<ffffffff810ed9d7>] sys_newlstat+0x1a/0x38
[   60.223471]  [<ffffffff8106890a>] ? trace_hardirqs_on_caller+0x114/0x13f
[   60.223477]  [<ffffffff813cec00>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[   60.223485]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Jesper Dangaard Brouer @ 2010-04-23 12:45 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Eric Dumazet, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney
In-Reply-To: <4BD17CF9.4020502@trash.net>

On Fri, 23 Apr 2010, Patrick McHardy wrote:

> That sounds like a good idea. But lets what for Jesper's test results
> before we start fixing this problem :)

I will first have time to perform the tests Monday or Tuesday.

BUT I have just noticed there seems to be a corrolation between conntrack 
early_drop and searches.  I have upload a new graph:

  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_early_drop002.png

I have not had time to checkout the code path yet...

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply

* Re: eSwitch management
From: Arnd Bergmann @ 2010-04-23 12:42 UTC (permalink / raw)
  To: Anirban Chakraborty
  Cc: Scott Feldman, David Miller, netdev@vger.kernel.org,
	chrisw@redhat.com, Ameen Rahman, Amit Salecha, Rajesh Borundia
In-Reply-To: <DD92D5A8-1ECC-4440-BE81-ABDCC6847021@qlogic.com>

On Friday 23 April 2010, Anirban Chakraborty wrote:
> On Apr 22, 2010, at 6:29 PM, Scott Feldman wrote:
> > On 4/22/10 5:47 PM, "Scott Feldman" <scofeldm@cisco.com> wrote:
> >> 
> >> Are any of these settings covered in DCB?  (net/dcb/dcbnl.c).  Maybe you can
> >> get a start there?  Not sure not knowing your device requirements.
> > 
> > Or maybe the RTM_SETLINK IFLA_VF_* ops in include/linux/if_link.h?  Those
> > seem like what you're looking for.  I'm looking at moving iovnl here as well
> > for port-profile.
> 
> It looks like ifla_vf_info does contain most of the data set. But if I use it, what
> NETLINK protocol family should I use in my driver to receive netlink messages? Do I
> need to create a private protocol family?

Your driver should implement the ndo_set_vf_*/ndo_get_vf_* callbacks, not
implement the netlink protocol itself. If there is anything missing in the
existing callbacks that you require for the operation of your driver, you
should send patches to extend the implementation in net/core/rtnetlink.c.

	Arnd

^ permalink raw reply

* [PATCH 6/7] Topcliff GbE: Add The common header files [1/2]
From: Masayuki Ohtake @ 2010-04-23 12:01 UTC (permalink / raw)
  To: NETDEV; +Cc: Wang, Yong Y, Wang, Qi, Intel OTC, Andrew

[-- Attachment #1: Type: message/partial, Size: 39242 bytes --]

^ permalink raw reply

* [PATCH 7/7] Topcliff GbE: Change The Kconfig and Makefile
From: Masayuki Ohtake @ 2010-04-23 12:01 UTC (permalink / raw)
  To: NETDEV; +Cc: Wang, Yong Y, Wang, Qi, Intel OTC, Andrew

From: Masayuki Ohtake <masa-korg@dsn.okisemi.com>

This patch change the Kconfig and Makefile of GbE driver for Topcliff.
The GbE driver needs all patch[1/7 to 7/7].

Signed-off-by: Masayuki Ohtake <masa-korg@dsn.okisemi.com>
---
 drivers/net/Kconfig                        | 5 ++
 drivers/net/Makefile                       | 1
 drivers/net/pch_gbe/Makefile               | 6
+++++++++++++++++++++++++++++++ 3 files changed, 12 insertions(+)
diff -urN linux-2.6.33.1/drivers/net/Kconfig
topcliff-2.6.33.1/drivers/net/Kconfig
--- linux-2.6.33.1/drivers/net/Kconfig 2010-03-16 01:09:39.000000000 +0900
+++ topcliff-2.6.33.1/drivers/net/Kconfig 2010-04-13 00:55:22.000000000
+0900
@@ -1977,6 +1977,11 @@
    If you say N, all options in this submenu will be skipped and disabled.

 if NETDEV_1000
+config PCH_GBE
+        tristate "PCH Gigabit Ethernet"
+        ---help---
+          This is an gigabit ethernet driver for PCH.
+          resources.

 config ACENIC
  tristate "Alteon AceNIC/3Com 3C985/NetGear GA620 Gigabit support"
diff -urN linux-2.6.33.1/drivers/net/Makefile
topcliff-2.6.33.1/drivers/net/Makefile
--- linux-2.6.33.1/drivers/net/Makefile 2010-03-16 01:09:39.000000000 +0900
+++ topcliff-2.6.33.1/drivers/net/Makefile 2010-04-13 00:55:22.000000000
+0900
@@ -287,3 +287,4 @@
 obj-$(CONFIG_WIMAX) += wimax/

 obj-$(CONFIG_OCTEON_MGMT_ETHERNET) += octeon/
+obj-$(CONFIG_PCH_GBE) += pch_gbe/
diff -urN linux-2.6.33.1/drivers/net/pch_gbe/Makefile
topcliff-2.6.33.1/drivers/net/pch_gbe/Makefile
--- linux-2.6.33.1/drivers/net/pch_gbe/Makefile 1970-01-01
09:00:00.000000000 +0900
+++ topcliff-2.6.33.1/drivers/net/pch_gbe/Makefile 2010-04-13
00:55:22.000000000 +0900
@@ -0,0 +1,6 @@
+ifeq ($(CONFIG_PCH_GBE_DEBUG_CORE),y)
+EXTRA_CFLAGS += -DDEBUG
+endif
+
+obj-$(CONFIG_PCH_GBE) += pch_gbe.o
+pch_gbe-objs := pch_gbe_mac.o pch_gbe_phy.o pch_gbe_nvm.o pch_gbe_ethtool.o
pch_gbe_plat.o pch_gbe_param.o pch_gbe_api.o pch_gbe_main.o


^ permalink raw reply

* [PATCH 6/7] Topcliff GbE: Add The common header files [2/2]
From: Masayuki Ohtake @ 2010-04-23 12:01 UTC (permalink / raw)
  To: NETDEV; +Cc: Wang, Yong Y, Wang, Qi, Intel OTC, Andrew

[-- Attachment #1: Type: message/partial, Size: 15960 bytes --]

^ permalink raw reply

* [PATCH 5/7] Topcliff GbE: Add The Hardware layer codes [1/2]
From: Masayuki Ohtake @ 2010-04-23 12:00 UTC (permalink / raw)
  To: NETDEV; +Cc: Wang, Yong Y, Wang, Qi, Intel OTC, Andrew

[-- Attachment #1: Type: message/partial, Size: 39112 bytes --]

^ permalink raw reply

* [PATCH 5/7] Topcliff GbE: Add The Hardware layer codes [2/2]
From: Masayuki Ohtake @ 2010-04-23 12:00 UTC (permalink / raw)
  To: NETDEV; +Cc: Wang, Yong Y, Wang, Qi, Intel OTC, Andrew

[-- Attachment #1: Type: message/partial, Size: 9627 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox