Netdev List


* Re: ep93xx_eth stopps receiving packages
From: Lennert Buytenhek @ 2010-05-02 10:43 UTC (permalink / raw)
  To: Stefan Agner; +Cc: netdev
In-Reply-To: <20100419173813.7750395f4fkkmrk0@limpopo.deheime.ch>

On Mon, Apr 19, 2010 at 05:38:13PM +0200, Stefan Agner wrote:

> I'm using Linux 2.6.32.9 on a Technologic Systems TS-7250 SBC board, with
> the ep93xx_eth driver for networking. On three identical but independent
> systems I noticed that the system becomes unreachable after a while. On a
> serial terminal I noticed that only the TX counter counts onward; RX stays
> where it is, no matter if I try to ping from or to the system. Wireshark
> tells me exactly that too: I see helpless ARP requests which get answered,
> but no ICMP. The system doesn't receive the ARP requests, and just sends
> another one.

(So does the board or does it not respond to ARP requests for its IP?)


> With a simple program which sends small packets at a fast pace I can
> reproduce the problem after several seconds (additional CPU load seems to
> provoke the problem even more). Removing and replugging the network cable
> doesn't solve the problem, but ifup/down does. I don't see any messages in
> dmesg, and memory is still available.

Do you see interrupts increasing in /proc/interrupts when this happens?
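
One way to answer this is to read the driver's line from /proc/interrupts twice, a second or so apart, and compare the totals. The sketch below only does the parsing step: it sums the per-CPU counters on one line. The helper name irq_line_total() and the sample lines in the comments are made up for illustration; the actual IRQ number and label depend on the board and kernel.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts line, e.g.
 *   " 39:      12345   ep93xx_eth"      (hypothetical sample)
 * Returns -1 if the line has no "label:" prefix.
 */
long irq_line_total(const char *line)
{
    const char *p = strchr(line, ':');
    long total = 0;

    if (!p)
        return -1;
    p++;

    for (;;) {
        char *end;
        long v;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;      /* reached the controller or device name */

        v = strtol(p, &end, 10);
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;      /* token merely starts with digits: it is a name */

        total += v;
        p = end;
    }
    return total;
}
```

If the total does not move between two samples while packets are arriving on the wire, the RX interrupt is probably not being raised (or not being delivered); if it does move, the problem is more likely further up in the receive path.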

^ permalink raw reply

* Re: Re: Kernel panic in fib_rules_lookup (kernel 2.6.32)
From: "Oleg A. Arkhangelsky" @ 2010-05-02 10:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1268154568.3113.19.camel@edumazet-laptop>

Hello,

09.03.10, 17:09, "Eric Dumazet" <eric.dumazet@gmail.com>:

> Le mardi 09 mars 2010 à 10:44 +0300, "Oleg A. Arkhangelsky" a écrit :
>  > Hello,
>  > 
>  > Got this kernel panic today. This PC is a rather heavily loaded router with a BGP full view (> 300K routes).
>  > We're using FIB_TRIE. Last time we got similar panic about 1 month ago. Please, let me know if you
>  > need additional information to debug (e.g. objdump). Thanks!
>  > 
>  > Mar  9 10:08:55 bras-1 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
>  > Mar  9 10:08:55 bras-1 kernel: IP: [] fib_rules_lookup+0xa2/0xd0
>  > Mar  9 10:08:55 bras-1 kernel: *pde = 00000000
>  > Mar  9 10:08:55 bras-1 kernel: Thread overran stack, or stack corrupted
>  
>  Hmm...

Got the same panic, at the same place (fib_rules_lookup+0xa2/0xd0). It looks
like the NULL dereference is somewhere in the list_for_each_entry_rcu macro,
but I don't understand how this can happen.

Any thoughts? :(

-- 
wbr, Oleg.
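
For reference, a userspace sketch of what a list_for_each_entry_rcu()-style walk does, minus the RCU accessors: follow ->next from the head, convert each node back to its containing structure, and stop when the walk returns to the head. Every name here (list_node, toy_rule, toy_container_of, toy_count_rules) is invented for the illustration. It only shows where a NULL ->next would be dereferenced and says nothing about why one appeared in this crash.

```c
#include <stddef.h>

struct list_node {
    struct list_node *next;
};

/* Toy stand-in for a rule; the real struct fib_rule has many more fields. */
struct toy_rule {
    int pref;
    struct list_node list;
};

#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Walk a circular list until we are back at the head.
 * Returns the number of rules seen, or -1 if a NULL ->next is found.
 */
int toy_count_rules(struct list_node *head)
{
    int n = 0;

    for (struct list_node *p = head->next; p != head; p = p->next) {
        if (!p)
            return -1;  /* the kernel macro has no such check and would fault */

        struct toy_rule *r = toy_container_of(p, struct toy_rule, list);
        (void)r->pref;  /* this is where the loop body would use the entry */
        n++;
    }
    return n;
}
```

In a correctly maintained circular list ->next is never NULL, so a fault inside the walk usually points at an entry that was zeroed, freed or overwritten while a reader was still traversing it. That is a general observation, not a diagnosis of this report.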

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02 10:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <20100502092020.GA9655@gargoyle.fritz.box>

Le dimanche 02 mai 2010 à 11:20 +0200, Andi Kleen a écrit :
> > I tried it on the right spot (since my bench was only doing recvmsg()
> > calls, I had to patch wait_for_packet() in net/core/datagram.c):
> > 
> > udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
> > schedule_timeout
> > 
> > Unfortunately, using io_schedule_timeout() did not solve the problem.
> 
> Hmm, too bad. Weird.
> 
> > 
> > Tell me if you need some traces or something.
> 
> I'll try to reproduce it and see what I can do.
> 

Here is the perf report from the latest test. I confirm I am using
io_schedule_timeout() in this kernel.

In this test, all 16 queues of one BCM57711E NIC (1Gb link) deliver
packets at about 1.300.000 pps to 16 cpus (one cpu per queue), and these
packets are then redistributed by RPS to the same 16 cpus, generating about
650.000 IPI per second.

top says :
Cpu(s):  3.0%us, 17.3%sy,  0.0%ni, 22.4%id, 28.2%wa,  0.0%hi, 29.1%si,  0.0%st


# Samples: 321362570767
#
# Overhead         Command                 Shared Object  Symbol
# ........  ..............  ............................  ......
#
    25.08%            init  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
                      |
                      --- _raw_spin_lock_irqsave
                         |          
                         |--93.47%-- clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--4.70%-- tick_broadcast_oneshot_control
                         |          tick_notify
                         |          notifier_call_chain
                         |          __raw_notifier_call_chain
                         |          raw_notifier_call_chain
                         |          clockevents_do_notify
                         |          clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--0.64%-- generic_exec_single
                         |          __smp_call_function_single
                         |          net_rps_action_and_irq_enable
...
     9.72%            init  [kernel.kallsyms]             [k] acpi_os_read_port
                      |
                      --- acpi_os_read_port
                         |          
                         |--99.45%-- acpi_hw_read_port
                         |          acpi_hw_read
                         |          acpi_hw_read_multiple
                         |          acpi_hw_register_read
                         |          acpi_read_bit_register
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                          --0.55%-- acpi_hw_read
                                    acpi_hw_read_multiple

powertop says :
     PowerTOP version 1.11      (C) 2007 Intel Corporation

Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
                                        1.60 Ghz    38.2%

Wakeups-from-idle per second : 45177.8  interval: 5.0s
no ACPI power usage estimate available

Top causes for wakeups:
   9.9% (40863.0)       <interrupt> : eth1-fp-7 
   9.9% (40861.0)       <interrupt> : eth1-fp-8 
   9.9% (40858.0)       <interrupt> : eth1-fp-5 
   9.9% (40855.2)       <interrupt> : eth1-fp-10 
   9.9% (40847.6)       <interrupt> : eth1-fp-14 
   9.9% (40847.2)       <interrupt> : eth1-fp-12 
   9.9% (40835.0)       <interrupt> : eth1-fp-1 
   9.9% (40834.2)       <interrupt> : eth1-fp-3 
   9.9% (40834.0)       <interrupt> : eth1-fp-6 
   9.9% (40829.6)       <interrupt> : eth1-fp-4 
   1.0% (4002.0)     <kernel core> : hrtimer_start_range_ns (tick_sched_timer) 
   0.4% (1725.6)       <interrupt> : extra timer interrupt 
   0.0% (  4.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
   0.0% (  2.0)     <kernel core> : clocksource_watchdog (clocksource_watchdog)
   0.0% (  2.0)             snmpd : hrtimer_start_range_ns (hrtimer_wakeup)



^ permalink raw reply

* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Wolfgang Grandegger @ 2010-05-02 10:50 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100429091936.GA6703@riccoc20.at.omicron.at>

Hi Richard,

Richard Cochran wrote:
> This patch adds an infrastructure for hardware clocks that implement
> IEEE 1588, the Precision Time Protocol (PTP). A class driver offers a
> registration method to particular hardware clock drivers. Each clock is
> exposed to user space as a character device with ioctls that allow tuning
> of the PTP clock.
> 
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> ---
...
> diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
> new file mode 100644
> index 0000000..a5acac4
> --- /dev/null
> +++ b/drivers/ptp/ptp_clock.c
...
> +static int ptp_open(struct inode *inode, struct file *fp)
> +{
> +	struct ptp_clock *ptp;
> +	ptp = container_of(inode->i_cdev, struct ptp_clock, cdev);
> +
> +	if (mutex_lock_interruptible(&ptp->mux))
> +		return -ERESTARTSYS;
> +
> +	fp->private_data = ptp;
> +
> +	return 0;
> +}
...
> +static int ptp_release(struct inode *inode, struct file *fp)
> +{
> +	struct ptp_clock *ptp = fp->private_data;
> +	mutex_unlock(&ptp->mux);
> +	return 0;
> +}

As long as the device is in use by one application, no other application can
access it, because the mutex stays locked from open() until release(). Other
applications may want to read the PTP clock time while ptpd is running, though.

Wolfgang.
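
A minimal single-threaded model of the difference, assuming a toy try-lock in place of the kernel mutex. All names (toy_clock, open_exclusive, open_shared, gettime_shared) are hypothetical and not taken from the patch. In the real driver a second opener would sleep in mutex_lock_interruptible() rather than get -EBUSY; the model returns an error only so that the effect is visible without threads.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the driver state; not the structure from the patch. */
struct toy_clock {
    bool locked;        /* models ptp->mux as a try-lock */
    long long now_ns;   /* models the hardware time */
    int users;          /* number of open file handles */
};

static int toy_trylock(struct toy_clock *c)
{
    if (c->locked)
        return -EBUSY;
    c->locked = true;
    return 0;
}

static void toy_unlock(struct toy_clock *c)
{
    c->locked = false;
}

/* Scheme in the patch: lock taken in open(), dropped in release(). */
int open_exclusive(struct toy_clock *c)
{
    return toy_trylock(c);
}

void release_exclusive(struct toy_clock *c)
{
    toy_unlock(c);
}

/* Alternative: open() only counts users ... */
int open_shared(struct toy_clock *c)
{
    c->users++;
    return 0;
}

/* ... and each operation holds the lock just for its own duration. */
int gettime_shared(struct toy_clock *c, long long *ns)
{
    int err = toy_trylock(c);

    if (err)
        return err;
    *ns = c->now_ns;
    toy_unlock(c);
    return 0;
}
```

With the per-operation scheme, several readers can keep the device open at once and only contend for the short time a single register access takes.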

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
From: David Miller @ 2010-05-02 10:03 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, therbert, hadi
In-Reply-To: <1272783032.2173.8.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 02 May 2010 08:50:32 +0200

> Excellent !

Great, here is the commit message I will use:

--------------------
net: Inline skb_pull() in eth_type_trans().

In commit 6be8ac2f ("[NET]: uninline skb_pull, de-bloats a lot")
we uninlined skb_pull.

But in some critical paths it makes sense to inline this thing
and it helps performance significantly.

Create an skb_pull_inline() so that we can do this in a way that
serves also as annotation.

Based upon a patch by Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
--------------------

> Could we assume all eth_type_trans() must call it with initial
> skb->len >= (46 + 12) or not ?  (According to ethernet specs, all
> frames should have a minimum payload of 46 bytes)
>
> If not sure, maybe we should issue a WARN_ON_ONCE()
> 
> If yes, tests could be removed and we could gain two cycles ;)

Isn't the minimum ETH_ZLEN?

But yes, regardless of whether the minimum ethernet frame is 58 or 60
bytes, we should require it's at least that big, and use that test
consistently throughout.

Anything smaller is a runt packet and should be tossed or marked as an
error in some other way by the hardware.  They should never make it to
eth_type_trans().
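
The two candidate numbers differ only by the 2-byte type/length field: 46 + 12 counts the payload plus the two MAC addresses, while ETH_ZLEN also includes the type field. A small sketch of the arithmetic, using the values from <linux/if_ether.h>; MIN_PAYLOAD and frame_is_runt() are names invented here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same values as the kernel's <linux/if_ether.h>. */
#ifndef ETH_ZLEN
#define ETH_ALEN 6      /* octets in one MAC address */
#define ETH_HLEN 14     /* destination + source + 2-byte type */
#define ETH_ZLEN 60     /* minimum frame length, FCS not counted */
#endif

enum { MIN_PAYLOAD = 46 };  /* minimum Ethernet payload in octets */

/* True if a received frame (FCS already stripped) is below the minimum. */
bool frame_is_runt(size_t len)
{
    return len < ETH_ZLEN;
}
```

So 2 * ETH_ALEN + 2 = ETH_HLEN = 14, and ETH_HLEN + 46 = ETH_ZLEN = 60; a 58-byte frame would still count as a runt under this check.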

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-02  9:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <1272783366.2173.13.camel@edumazet-laptop>

> I tried it on the right spot (since my bench was only doing recvmsg()
> calls, I had to patch wait_for_packet() in net/core/datagram.c):
> 
> udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
> schedule_timeout
> 
> Unfortunately, using io_schedule_timeout() did not solve the problem.

Hmm, too bad. Weird.

> 
> Tell me if you need some traces or something.

I'll try to reproduce it and see what I can do.

-Andi


^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02  6:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <20100501110000.GB9434@gargoyle.fritz.box>

Le samedi 01 mai 2010 à 13:00 +0200, Andi Kleen a écrit :
> On Fri, Apr 30, 2010 at 04:38:57PM -0700, David Miller wrote:
> > From: Andi Kleen <ak@gargoyle.fritz.box>
> > Date: Thu, 29 Apr 2010 23:41:44 +0200
> > 
> > >     Use io_schedule() in network stack to tell cpuidle governor to guarantee lower latencies
> > > 
> > >     XXX: probably too aggressive, some of these sleeps are not under high load.
> > > 
> > >     Based on a bug report from Eric Dumazet.
> > >     
> > >     Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > 
> > I like this, except that we probably don't want the delayacct_blkio_*() calls
> > these things do.
> 
> Yes.
> 
> It needs more work, please don't apply it yet, to handle the "long sleep" case.
> 
> Still curious if it fixes Eric's test case.
> 

I tried it on the right spot (since my bench was only doing recvmsg()
calls, I had to patch wait_for_packet() in net/core/datagram.c):

udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
schedule_timeout

Unfortunately, using io_schedule_timeout() did not solve the problem.

Tell me if you need some traces or something.

Thanks !

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 95b851f..051fd5b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -113,7 +113,7 @@ static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
 		goto interrupted;
 
 	error = 0;
-	*timeo_p = schedule_timeout(*timeo_p);
+	*timeo_p = io_schedule_timeout(*timeo_p);
 out:
 	finish_wait(sk_sleep(sk), &wait);
 	return error;



^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
From: Eric Dumazet @ 2010-05-02  6:50 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, therbert, hadi
In-Reply-To: <20100501.181558.141243424.davem@davemloft.net>

Le samedi 01 mai 2010 à 18:15 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 01 May 2010 08:42:25 +0200
> 
> > [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
> > 
> > With RPS, this patch can give a 5 % boost in performance.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Awesome, but let's do this in a way that allows us to easily annotate
> where inlining makes sense in other places, not just here.
> 
> Something like this, ok?
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 82f5116..746a652 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1128,6 +1128,11 @@ static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
>  	return skb->data += len;
>  }
>  
> +static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
> +{
> +	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
> +}
> +
>  extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta);
>  
>  static inline unsigned char *__pskb_pull(struct sk_buff *skb, unsigned int len)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4218ff4..8b9c109 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1051,7 +1051,7 @@ EXPORT_SYMBOL(skb_push);
>   */
>  unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
>  {
> -	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
> +	return skb_pull_inline(skb, len);
>  }
>  EXPORT_SYMBOL(skb_pull);
>  
> diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
> index 0c0d272..61ec032 100644
> --- a/net/ethernet/eth.c
> +++ b/net/ethernet/eth.c
> @@ -162,7 +162,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
>  
>  	skb->dev = dev;
>  	skb_reset_mac_header(skb);
> -	skb_pull(skb, ETH_HLEN);
> +	skb_pull_inline(skb, ETH_HLEN);
>  	eth = eth_hdr(skb);
>  
>  	if (unlikely(is_multicast_ether_addr(eth->h_dest))) {

Excellent !

Changli privately asked me why we were ignoring cases where skb->len <
ETH_HLEN.
I replied that the minimum frame size was 46+12; then he asked me why we
test the length again further down:

if (skb->len >= 2 && *(unsigned short *)rawp == 0xFFFF)
	return htons(ETH_P_802_3);


Could we assume that every caller of eth_type_trans() passes an skb with
initial skb->len >= (46 + 12), or not?
(According to the Ethernet specs, all frames should have a minimum payload
of 46 bytes.)

If not sure, maybe we should issue a WARN_ON_ONCE()

If yes, tests could be removed and we could gain two cycles ;)




^ permalink raw reply

* RE: [PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
From: Elina Pasheva @ 2010-05-02  5:53 UTC (permalink / raw)
  To: David Miller
  Cc: dbrownell@users.sourceforge.net, Rory Filer,
	linux-usb@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20100501.180829.139101312.davem@davemloft.net>


> On Saturday, May 01, 2010 6:08 PM David Miller wrote:

>>From: Elina Pasheva <epasheva@sierrawireless.com>
>>Date: Wed, 28 Apr 2010 16:28:24 -0700

>> Subject: [PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
>> From: Elina Pasheva <epasheva@sierrawireless.com>
>>
>> The following patch adds the initiation of the sync sequence to
>> "sierra_net_bind()". If this step is omitted, the modem will never sync up
>> with the host and it will not be possible to establish a data connection.
>> This is a high priority patch.
>>
>> This patch has been checked against net-2.6 tree.
>> Signed-off-by: Elina Pasheva <epasheva@sierrawireless.com>
>> Signed-off-by: Rory Filer <rfiler@sierrawireless.com>
>> Tested-by: Elina Pasheva <epasheva@sierrawireless.com>

>Applied.

Thank you very much, David!
Elina



^ permalink raw reply

* Re: [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
From: David Miller @ 2010-05-02  1:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, therbert, hadi
In-Reply-To: <1272696145.2230.101.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 01 May 2010 08:42:25 +0200

> [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
> 
> With RPS, this patch can give a 5 % boost in performance.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Awesome, but let's do this in a way that allows us to easily annotate
where inlining makes sense in other places, not just here.

Something like this, ok?

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 82f5116..746a652 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1128,6 +1128,11 @@ static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
 	return skb->data += len;
 }
 
+static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
+{
+	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
+}
+
 extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta);
 
 static inline unsigned char *__pskb_pull(struct sk_buff *skb, unsigned int len)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4218ff4..8b9c109 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1051,7 +1051,7 @@ EXPORT_SYMBOL(skb_push);
  */
 unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
 {
-	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
+	return skb_pull_inline(skb, len);
 }
 EXPORT_SYMBOL(skb_pull);
 
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0c0d272..61ec032 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -162,7 +162,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
 
 	skb->dev = dev;
 	skb_reset_mac_header(skb);
-	skb_pull(skb, ETH_HLEN);
+	skb_pull_inline(skb, ETH_HLEN);
 	eth = eth_hdr(skb);
 
 	if (unlikely(is_multicast_ether_addr(eth->h_dest))) {
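
The shape of the change, reduced to a userspace sketch: one checked helper that is static inline, plus an ordinary out-of-line function that simply calls it. Hot paths call the inline helper directly; everything else keeps calling the shared copy. struct toy_buf and the toy_pull* names are stand-ins invented for this illustration and are not kernel API.

```c
#include <stddef.h>

/* Toy buffer; stands in for struct sk_buff, which has many more fields. */
struct toy_buf {
    unsigned char *data;
    unsigned int len;
};

/* Unchecked pull: the caller guarantees len <= b->len. */
static inline unsigned char *toy_pull_unchecked(struct toy_buf *b,
                                                unsigned int len)
{
    b->len -= len;
    return b->data += len;
}

/* Checked pull, meant to be inlined into the few hot callers. */
static inline unsigned char *toy_pull_inline(struct toy_buf *b,
                                             unsigned int len)
{
    return len > b->len ? NULL : toy_pull_unchecked(b, len);
}

/* Out-of-line version: one shared copy for all other callers. */
unsigned char *toy_pull(struct toy_buf *b, unsigned int len)
{
    return toy_pull_inline(b, len);
}
```

The separate _inline name doubles as an annotation: a grep for it lists exactly the call sites where someone decided the extra code size was worth it.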

^ permalink raw reply related

* Re: [PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
From: David Miller @ 2010-05-02  1:08 UTC (permalink / raw)
  To: epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8
  Cc: dbrownell-Rn4VEauK+AKRv+LV9MX5uipxlwaOVQ5f,
	rfiler-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1272497304.8835.2.camel@Linuxdev4-laptop>

From: Elina Pasheva <epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>
Date: Wed, 28 Apr 2010 16:28:24 -0700

> Subject: [PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
> From: Elina Pasheva <epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>
> 
> The following patch adds the initiation of the sync sequence to
> "sierra_net_bind()". If this step is omitted, the modem will never sync up
> with the host and it will not be possible to establish a data connection.
> This is a high priority patch.
> 
> This patch has been checked against net-2.6 tree.
> Signed-off-by: Elina Pasheva <epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>
> Signed-off-by: Rory Filer <rfiler-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>
> Tested-by: Elina Pasheva <epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>

Applied.

^ permalink raw reply

* Re: OFT - reserving CPU's for networking
From: Ben Hutchings @ 2010-05-01 23:44 UTC (permalink / raw)
  To: David Miller; +Cc: andi, tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100501.150338.93457735.davem@davemloft.net>

On Sat, 2010-05-01 at 15:03 -0700, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Sat, 1 May 2010 12:53:04 +0200
> 
> >> And we don't want it to, because the decision mechanisms for steering
> >> that we are using now are starting to get into the stateful territory and
> >> that's verboten for NIC offload as far as we're concerned.
> > 
> > Huh? I thought full TCP offload was forbidden?[1] Stateful as in NIC 
> > (or someone else like netfilter) tracking flows is quite common and very far 
> > from full offload. AFAIK it doesn't have near all the problems full
> > offload has.
> 
> We're tracking flow cpu location state at the socket operations, like
> recvmsg() and sendmsg(), where it belongs.
> 
> Would you like us to call into the card drivers and firmware at these
> spots instead?

I'm interested in experimenting with this at some point, since our
hardware supports a fairly large number of filters that could be used
for it.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: OFT - reserving CPU's for networking
From: David Miller @ 2010-05-01 23:29 UTC (permalink / raw)
  To: andi; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100501225815.GA8074@gargoyle.fritz.box>

From: Andi Kleen <andi@firstfloor.org>
Date: Sun, 2 May 2010 00:58:15 +0200

>> We're tracking flow cpu location state at the socket operations, like
>> recvmsg() and sendmsg(), where it belongs.
>> 
>> Would you like us to call into the card drivers and firmware at these
>> spots instead?
> 
> No, that's not needed for lazy flow tracking like in netfilter or 
> some NICs, it doesn't need exact updates. It just works with seen network 
> packets. 

Well what we need is exact flow updates so that we steer packets
to where the applications actually are.

Andi, this discussion is going in circles, can I just say "yeah you're
right Andi" and this will satisfy your desire to be correct and we can
be done with this?

Thanks.

^ permalink raw reply

* Re: OFT - reserving CPU's for networking
From: Andi Kleen @ 2010-05-01 22:58 UTC (permalink / raw)
  To: David Miller; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100501.150338.93457735.davem@davemloft.net>

> We're tracking flow cpu location state at the socket operations, like
> recvmsg() and sendmsg(), where it belongs.
> 
> Would you like us to call into the card drivers and firmware at these
> spots instead?

No, that's not needed for lazy flow tracking like in netfilter or 
some NICs, it doesn't need exact updates. It just works with seen network 
packets. 

-Andi

^ permalink raw reply

* Re: OFT - reserving CPU's for networking
From: David Miller @ 2010-05-01 22:13 UTC (permalink / raw)
  To: gandalf; +Cc: tglx, shemminger, eric.dumazet, ak, netdev, andi, peterz
In-Reply-To: <Pine.LNX.4.62.1005012222320.24624@wlug.westbo.se>

From: Martin Josefsson <gandalf@mjufs.se>
Date: Sat, 1 May 2010 22:31:05 +0200 (CEST)

> On Fri, 30 Apr 2010, David Miller wrote:
> 
>> Then we can do cool tricks like having the cpu spin on a mwait() on
>> the
>> network device's status descriptor in memory.
> 
> Can you have mwait monitor multiple cachelines for stores?

The idea is that if you have hundreds of cpu threads (several of my
machines do, and it's not too long before these kinds of boxes will be
common) in your machine you can spare one for each NIC.

^ permalink raw reply

* Re: OFT - reserving CPU's for networking
From: David Miller @ 2010-05-01 22:03 UTC (permalink / raw)
  To: andi; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100501105304.GA9434@gargoyle.fritz.box>

From: Andi Kleen <andi@firstfloor.org>
Date: Sat, 1 May 2010 12:53:04 +0200

>> And we don't want it to, because the decision mechanisms for steering
>> that we are using now are starting to get into the stateful territory and
>> that's verboten for NIC offload as far as we're concerned.
> 
> Huh? I thought full TCP offload was forbidden?[1] Stateful as in NIC 
> (or someone else like netfilter) tracking flows is quite common and very far 
> from full offload. AFAIK it doesn't have near all the problems full
> offload has.

We're tracking flow cpu location state at the socket operations, like
recvmsg() and sendmsg(), where it belongs.

Would you like us to call into the card drivers and firmware at these
spots instead?
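
A rough sketch of the idea of tracking flow location at the socket operations, under stated assumptions: a small table indexed by flow hash, written from the socket side and read from the receive path. The table size, the NO_CPU marker and the function names are all hypothetical; this is not the kernel's implementation.

```c
#include <stdint.h>

#define FLOW_TABLE_SIZE 256u    /* power of two; size chosen arbitrarily */
#define NO_CPU 0xffffu          /* marks a slot nobody has recorded yet */

static uint16_t flow_cpu[FLOW_TABLE_SIZE];

void flow_table_init(void)
{
    for (unsigned int i = 0; i < FLOW_TABLE_SIZE; i++)
        flow_cpu[i] = NO_CPU;
}

/* Socket side (think recvmsg()/sendmsg()): remember where the consumer runs. */
void flow_record(uint32_t flow_hash, uint16_t cpu)
{
    flow_cpu[flow_hash & (FLOW_TABLE_SIZE - 1)] = cpu;
}

/* Receive path: choose the CPU for an incoming packet of this flow. */
uint16_t flow_steer(uint32_t flow_hash, uint16_t fallback_cpu)
{
    uint16_t cpu = flow_cpu[flow_hash & (FLOW_TABLE_SIZE - 1)];

    return cpu == NO_CPU ? fallback_cpu : cpu;
}
```

Because the table is updated at the moment the application touches the socket, it follows the application when the scheduler moves it. Doing the same in NIC hardware would require a call into the driver or firmware at each of those updates.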

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: David Miller @ 2010-05-01 22:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: hadi, xiaosuo, therbert, shemminger, netdev, eilong, bmb
In-Reply-To: <1272701011.2230.134.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 01 May 2010 10:03:31 +0200

> David, I also need this RCU thing in order to be able to group all
> wakeups at the end of net_rx_action().
> 
> Plan was to use RCU, so that I don't need to increase sk_refcnt when
> queueing a "wakeup" (and decrease sk_refcnt a long time after)
> 
> Previous attempt was a bit hacky,
> http://patchwork.ozlabs.org/patch/24179/
> 
> I expect 2010 one will be cleaner :)

Fair enough, I'm convinced now, applied thanks!

^ permalink raw reply

* Re: OFT - reserving CPU's for networking
From: Martin Josefsson @ 2010-05-01 20:31 UTC (permalink / raw)
  To: David Miller; +Cc: tglx, shemminger, eric.dumazet, ak, netdev, andi, peterz
In-Reply-To: <20100430.115715.216750975.davem@davemloft.net>

On Fri, 30 Apr 2010, David Miller wrote:

> Then we can do cool tricks like having the cpu spin on a mwait() on the
> network device's status descriptor in memory.

Can you have mwait monitor multiple cachelines for stores? If not then it 
might be hard to do that when you have multiple nics and you actually 
need to use the status descriptors, otherwise you could possibly have them 
all written to the same cacheline. 
Or if the nic doesn't support updating a status descriptor in memory.

If you just want to wake up quickly without using interrupts it might be 
possible to abuse MSI to wake up without actually using interrupts, set 
the address to the cacheline that is being monitored.

/Martin

^ permalink raw reply

* Re: [PATCH 3/3] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Kumar Gala @ 2010-05-01 16:36 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Netdev, linuxppc-dev, devicetree-discuss
In-Reply-To: <20100429092005.GA6727@riccoc20.at.omicron.at>


On Apr 29, 2010, at 4:20 AM, Richard Cochran wrote:

> The eTSEC includes a PTP clock with quite a few features. This patch adds
> support for the basic clock adjustment functions.
> 
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> ---
> arch/powerpc/boot/dts/mpc8313erdb.dts |   14 ++
> arch/powerpc/boot/dts/p2020ds.dts     |   13 ++
> arch/powerpc/boot/dts/p2020rdb.dts    |   14 ++
> drivers/net/Makefile                  |    1 +
> drivers/net/gianfar_ptp.c             |  308 +++++++++++++++++++++++++++++++++
> drivers/net/gianfar_ptp_reg.h         |  107 ++++++++++++
> drivers/ptp/Kconfig                   |   13 ++
> 7 files changed, 470 insertions(+), 0 deletions(-)
> create mode 100644 drivers/net/gianfar_ptp.c
> create mode 100644 drivers/net/gianfar_ptp_reg.h
> 
> diff --git a/arch/powerpc/boot/dts/mpc8313erdb.dts b/arch/powerpc/boot/dts/mpc8313erdb.dts
> index 183f2aa..b760aee 100644
> --- a/arch/powerpc/boot/dts/mpc8313erdb.dts
> +++ b/arch/powerpc/boot/dts/mpc8313erdb.dts
> @@ -208,6 +208,20 @@
> 			sleep = <&pmc 0x00300000>;
> 		};
> 
> +		ptp_clock@24E00 {
> +			device_type = "ptp_clock";
> +			model = "eTSEC";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <0x0C 2 0x0D 2>;
> +			interrupt-parent = < &ipic >;
> +			tclk_period = <10>;
> +			tmr_prsc    = <100>;
> +			tmr_add     = <0x999999A4>;
> +			cksel       = <0x1>;
> +			tmr_fiper1  = <0x3B9AC9F6>;
> +			tmr_fiper2  = <0x00018696>;
> +		};
> +
> 		enet0: ethernet@24000 {
> 			#address-cells = <1>;
> 			#size-cells = <1>;

Is there a binding document that describes this node you are adding?

- k

^ permalink raw reply

* Re: [PATCH 0/3] [RFC] [v2] ptp: IEEE 1588 clock support
From: Kumar Gala @ 2010-05-01 16:32 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Netdev
In-Reply-To: <20100429091903.GA6691@riccoc20.at.omicron.at>


On Apr 29, 2010, at 4:19 AM, Richard Cochran wrote:

> * Patch ChangeLog
> ** v2
>   - Changed clock list from a static array into a dynamic list. Also,
>     use a bitmap to manage the clock's minor numbers.
>   - Replaced character device semaphore with a mutex.
>   - Drop .ko from module names in Kbuild help.
>   - Replace deprecated unifdef-y with header-y for user space header file.
>   - Gianfar driver now gets parameters from device tree.
>   - Added API documentation to Documentation/ptp/ptp.txt, with links
>     to both of the ptpd patches on sourceforge.
> 
> * Preface
> 
> Now and again there has been some talk on this list of adding PTP
> support into Linux. One part of the picture is already in place, the
> SO_TIMESTAMPING API for hardware time stamping. It has been pointed
> out that this API is not perfect, however, it is good enough for many
> real world uses of IEEE 1588. The second needed part has not, AFAICT,
> ever been addressed.
> 
> Here I offer an early draft of an idea how to bring the missing
> functionality into Linux. I don't yet have all of the features
> implemented, as described below. Still I would like to get your
> feedback concerning this idea before getting too far into it. I do
> have all of the hardware mentioned at hand, so I have a good idea that
> the proposed API covers the features of those clocks.
> 
> Thanks in advance for your comments,
> 
> Richard
> 
> 
> Richard Cochran (3):
>  ptp: Added a brand new class driver for ptp clocks.
>  ptp: Added a clock that uses the Linux system time.
>  ptp: Added a clock that uses the eTSEC found on the MPC85xx.
> 
> Documentation/ptp/ptp.txt             |   78 +++++++++
> Documentation/ptp/testptp.c           |  130 ++++++++++++++
> Documentation/ptp/testptp.mk          |   33 ++++
> arch/powerpc/boot/dts/mpc8313erdb.dts |   14 ++
> arch/powerpc/boot/dts/p2020ds.dts     |   13 ++
> arch/powerpc/boot/dts/p2020rdb.dts    |   14 ++
> drivers/Kconfig                       |    2 +
> drivers/Makefile                      |    1 +
> drivers/net/Makefile                  |    1 +
> drivers/net/gianfar_ptp.c             |  308 +++++++++++++++++++++++++++++++++
> drivers/net/gianfar_ptp_reg.h         |  107 ++++++++++++
> drivers/ptp/Kconfig                   |   51 ++++++
> drivers/ptp/Makefile                  |    6 +
> drivers/ptp/ptp_clock.c               |  302 ++++++++++++++++++++++++++++++++
> drivers/ptp/ptp_linux.c               |  122 +++++++++++++
> include/linux/Kbuild                  |    1 +
> include/linux/ptp_clock.h             |   37 ++++
> include/linux/ptp_clock_kernel.h      |  134 ++++++++++++++
> kernel/time/ntp.c                     |    2 +
> 19 files changed, 1356 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/ptp/ptp.txt
> create mode 100644 Documentation/ptp/testptp.c
> create mode 100644 Documentation/ptp/testptp.mk
> create mode 100644 drivers/net/gianfar_ptp.c
> create mode 100644 drivers/net/gianfar_ptp_reg.h
> create mode 100644 drivers/ptp/Kconfig
> create mode 100644 drivers/ptp/Makefile
> create mode 100644 drivers/ptp/ptp_clock.c
> create mode 100644 drivers/ptp/ptp_linux.c
> create mode 100644 include/linux/ptp_clock.h
> create mode 100644 include/linux/ptp_clock_kernel.h

In the future please CC linuxppc-dev@lists.ozlabs.org and devicetree-discuss@lists.ozlabs.org since you are adding device tree bindings.

- k

^ permalink raw reply

* Re: [patch v2.2 3/4] [PATCH v2.1 3/4] IPVS: make FTP work with full NAT support
From: Patrick McHardy @ 2010-05-01 16:26 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, linux-kernel, netfilter, Wensong Zhang,
	Julius Volz, David S. Miller, Hannes Eder,
	Netfilter Development Mailinglist
In-Reply-To: <20100501032120.998807955@vergenet.net>

Simon Horman wrote:

> +#define FMT_TUPLE	"%u.%u.%u.%u:%u->%u.%u.%u.%u:%u/%u"
> +#define ARG_TUPLE(T)	NIPQUAD((T)->src.u3.ip), ntohs((T)->src.u.all), \
> +			NIPQUAD((T)->dst.u3.ip), ntohs((T)->dst.u.all), \
> +			(T)->dst.protonum
> +
> +#define FMT_CONN	"%u.%u.%u.%u:%u->%u.%u.%u.%u:%u->%u.%u.%u.%u:%u/%u:%u"
> +#define ARG_CONN(C)	NIPQUAD((C)->caddr), ntohs((C)->cport), \
> +			NIPQUAD((C)->vaddr), ntohs((C)->vport), \
> +			NIPQUAD((C)->daddr), ntohs((C)->dport), \
> +			(C)->protocol, (C)->state
>  

Please use the appropriate format string (%pI4) instead of NIPQUAD.

> +		buf_len = sprintf(buf, "%u,%u,%u,%u,%u,%u", NIPQUAD(from.ip),
> +				  (ntohs(port)>>8)&255, ntohs(port)&255);
> +
> +		ct = nf_ct_get(skb, &ctinfo);
> +		ret = nf_nat_mangle_tcp_packet(skb,
> +					       ct,
> +					       ctinfo,
> +					       start-data,
> +					       end-start,
> +					       buf,
> +					       buf_len);
> +
> +		if (ct && ct != &nf_conntrack_untracked)

ct is non-NULL, otherwise we'll crash in nf_nat_mangle_tcp_packet().
Are you sure you want to mangle untracked packets above? That doesn't
work when there are size changes.
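
For readers following the quoted sprintf(): it produces the comma-separated
h1,h2,h3,h4,p1,p2 form that FTP uses for addresses (RFC 959), with the port
split into its high and low bytes. A minimal userspace sketch of that
encoding follows. The helper name format_ftp_port() is hypothetical and not
part of the patch, and the address is passed as four bytes instead of going
through NIPQUAD:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of the patch: format an IPv4 address
 * (four bytes, most significant first) and a host-order port as
 * "h1,h2,h3,h4,p1,p2".  Returns whatever snprintf() returns.
 */
int format_ftp_port(char *buf, size_t len, const uint8_t ip[4], uint16_t port)
{
	return snprintf(buf, len, "%u,%u,%u,%u,%u,%u",
			(unsigned int)ip[0], (unsigned int)ip[1],
			(unsigned int)ip[2], (unsigned int)ip[3],
			(unsigned int)((port >> 8) & 255),	/* p1: high byte */
			(unsigned int)(port & 255));		/* p2: low byte */
}
```

Because the textual length depends on the address and port values, rewriting
this string can change the payload size, which is the case the size-change
remark above is concerned with.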

^ permalink raw reply

* Re: [patch v2.2 2/4] [PATCH v2.1 2/4] IPVS: make friends with nf_conntrack
From: Patrick McHardy @ 2010-05-01 16:19 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, linux-kernel, netfilter, Wensong Zhang,
	Julius Volz, David S. Miller, Hannes Eder,
	Netfilter Development Mailinglist
In-Reply-To: <20100501032120.644762316@vergenet.net>

Looks good to me.

^ permalink raw reply

* Re: [patch v2.2 1/4] [PATCH v2.1 1/4] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Patrick McHardy @ 2010-05-01 16:18 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, linux-kernel, netfilter, Wensong Zhang,
	Julius Volz, David S. Miller, Hannes Eder,
	Netfilter Development Mailinglist
In-Reply-To: <20100501032120.298829234@vergenet.net>

Simon Horman wrote:

> @@ -0,0 +1,25 @@
> +#ifndef _XT_IPVS_H
> +#define _XT_IPVS_H 1

You don't need to define a value.
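
A small sketch of why the value is unnecessary: #ifndef only tests whether
the name is defined, so an empty definition guards the header just as well.
The macro, struct and function names below are made up for illustration:

```c
#ifndef XT_IPVS_SKETCH_H
#define XT_IPVS_SKETCH_H	/* no value: #ifndef only checks "defined or not" */

struct xt_ipvs_sketch {
	unsigned char bitmask;
};

#endif /* XT_IPVS_SKETCH_H */

/* Simulated second inclusion: the guard is defined, so this is skipped. */
#ifndef XT_IPVS_SKETCH_H
#error "include guard did not take effect"
#endif

/* Report whether the (valueless) guard macro is visible at this point. */
int sketch_guard_defined(void)
{
#ifdef XT_IPVS_SKETCH_H
	return 1;
#else
	return 0;
#endif
}
```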

> +config NETFILTER_XT_MATCH_IPVS
> +	tristate '"ipvs" match support'
> +	depends on IP_VS
> +	depends on NETFILTER_ADVANCED
> +	help
> +	  This option allows you to match against IPVS properties of a packet.
> +
> +	  If unsure, say N.

You're using conntrack symbols, so this seems to need a dependency
on NF_CONNTRACK.

> +static bool ipvs_mt_check(const struct xt_mtchk_param *par)

We've changed the signature to "int" in nf-next to be able to
return errno codes. Please rebase your patches onto nf-next-2.6.git.

Please also CC netfilter-devel at least for those parts that affect
non-IPVS netfilter.

> +{
> +	if (par->family != NFPROTO_IPV4
> +#ifdef CONFIG_IP_VS_IPV6
> +	    && par->family != NFPROTO_IPV6
> +#endif
> +		) {
> +		pr_info("protocol family %u not supported\n", par->family);
> +		return false;
> +	}
> +
> +	return true;
> +}
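
As a rough illustration of the int-returning convention mentioned above
(0 on success, a negative errno value on failure), here is a userspace
analogue of the quoted check. Everything prefixed with sketch_ or SKETCH_ is
a stand-in invented for this example, and the choice of -EINVAL is an
assumption rather than something stated in this thread:

```c
#include <errno.h>
#include <stdio.h>

/* Stand-ins for the kernel's NFPROTO_* constants; assumptions for this sketch. */
enum sketch_family {
	SKETCH_NFPROTO_IPV4 = 2,
	SKETCH_NFPROTO_IPV6 = 10,
};

/*
 * Hypothetical analogue of ipvs_mt_check() with an int return type:
 * 0 means the family is acceptable, a negative errno means it is not.
 */
int sketch_mt_check(unsigned int family)
{
	if (family != SKETCH_NFPROTO_IPV4 && family != SKETCH_NFPROTO_IPV6) {
		fprintf(stderr, "protocol family %u not supported\n", family);
		return -EINVAL;	/* assumed errno, see note above */
	}

	return 0;
}
```

A caller can then tell an unsupported family apart from other failures by
inspecting the returned code, which a bool return cannot express.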


^ permalink raw reply

* [PATCH v21 094/100] c/r: Basic support for network namespaces and devices (v6)
From: Oren Laadan @ 2010-05-01 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

When checkpointing a task tree with network namespaces, we hook into
do_checkpoint_ns() along with the others.  Any devices in a given namespace
are checkpointed (including their peer, in the case of veth) sequentially.
Each network device stores a list of protocol addresses, as well as other
information, such as hardware address.

This patch supports veth pairs, as well as the loopback adapter.  The
loopback support is there to make sure that any additional addresses and
state (such as up/down) are copied to the loopback adapter that we are
given in the new network namespace.

On restart, we instantiate new network namespaces and veth pairs as
necessary.  Any device we encounter that isn't in a network namespace
that was checkpointed as part of a task is left in the namespace of the
restarting process.  This will be the case for a veth half that exists
in the init netns to provide network access to a container.

Still to do are:

  1. Routes
  2. Netfilter rules
  3. IPv6 addresses
  4. Other virtual device types (e.g. bridges)
  5. Multicast
  6. Device config info (ipv4_devconf)
  7. Additional ipv4 address attributes

Changelog[v21]:
 - Do not include checkpoint_hdr.h explicitly
 - Fix acquiring socket lock before reading RTNETLINK response
 - Skip down interfaces (v2)
 - Export net checkpoint fns
 - Add CHECKPOINT_NETNS flag
 - Rename CONFIG_CHECKPOINT_NETNS -> CONFIG_NETNS_CHECKPOINT
 - Netdev restore function dispatching from a table
 - Added a comment about the controversial determination of "initial netns"
 - Simplify the E2BIG error handling
 - Remove a redundant check for checkpoint support per-device
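
The "restore function dispatching from a table" item can be pictured with
the following userspace sketch. It only illustrates the general idea of
indexing an array of function pointers by device type. The enum mirrors
ckpt_netdev_types from this patch, but the sketch_ names, the string return
values and the table layout are assumptions, not the code in
net/checkpoint_dev.c:

```c
#include <stddef.h>

/* Mirrors enum ckpt_netdev_types from the patch. */
enum sketch_netdev_type {
	SKETCH_NETDEV_LO,
	SKETCH_NETDEV_VETH,
	SKETCH_NETDEV_SIT,
	SKETCH_NETDEV_MACVLAN,
	SKETCH_NETDEV_MAX,
};

typedef const char *(*sketch_restore_fn)(void);

/* Placeholder handlers: each just names the device kind it would restore. */
static const char *sketch_restore_lo(void)      { return "lo"; }
static const char *sketch_restore_veth(void)    { return "veth"; }
static const char *sketch_restore_sit(void)     { return "sit"; }
static const char *sketch_restore_macvlan(void) { return "macvlan"; }

static const sketch_restore_fn sketch_restore_table[SKETCH_NETDEV_MAX] = {
	[SKETCH_NETDEV_LO]      = sketch_restore_lo,
	[SKETCH_NETDEV_VETH]    = sketch_restore_veth,
	[SKETCH_NETDEV_SIT]     = sketch_restore_sit,
	[SKETCH_NETDEV_MACVLAN] = sketch_restore_macvlan,
};

/* Look up the handler by type; NULL for out-of-range or missing entries. */
const char *sketch_restore_netdev(unsigned int type)
{
	if (type >= SKETCH_NETDEV_MAX || !sketch_restore_table[type])
		return NULL;

	return sketch_restore_table[type]();
}
```

Validating the type before indexing matters here because the value comes
from the checkpoint image and cannot be trusted.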

Changes in v6:
 - Store addresses in network byte order, per Dave's recommendation

Changes in v5:
 - Rebase
 - Remove checkpoint_container() noise
 - Factor out some common bits of the RTNL newlink operations
 - Add macvlan support

Changes in v4:
 - Fix allocation under lock in ckpt_netdev_inet_addrs()
 - Add comment for case where there is no netns info in checkpoint image
 - Fix inner structure alignment in netdev_addr header
 - Fix instances of kfree(skb)
 - Remove init_netns_ref from container header and checkpoint context
 - Add 'extern' to checkpoint.h prototypes
 - Swizzle do_restore_netns() to handle netns more like the others
 - Return E2BIG for failure case when collecting inet addrs
 - Report case where device doesn't support checkpoint
 - Remove nested netns check from may_checkpoint_task()
 - Move veth-specific netdev attributes into unioned struct to set an
   example for specific attributes of additional device types
 - Add 'sit' device restore path that doesn't really do anything
 - Fail instead of skip when encountering a device with no checkpoint
   support

Changes in v3:
 - Use dev->checkpoint() for per-device checkpoint operation
 - Use RTNL for veth pair creation on restart
 - Export some of the functions that will be needed by dev->ndo_checkpoint()

Changes in v2:
 - Add CONFIG_CHECKPOINT_NETNS that is dependent on NET, NET_NS, and
   CHECKPOINT.  Conditionally compile the checkpoint_dev code based on it.
 - Updated comment on should_checkpoint_netdev()
 - Updated checkpoint_netdev() to explicitly check for "veth" in name
 - Changed checkpoint_netns() to use BUG() for impossible condition
 - Fixed a bug on restart with all devices in the init netns
 - Lock the dev_base_lock while traversing interface addresses
 - Collect all addresses for an interface before writing out in one
   single pass
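
The last two v2 items, together with the v4 "allocation under lock" fix,
describe a collect-then-write pattern: allocate outside the critical
section, walk the list while it is protected, and grow the buffer and retry
if it fills up. A simplified userspace sketch of that pattern follows. The
names are hypothetical, the lock is represented only by comments, and unlike
ckpt_netdev_inet_addrs() in this patch the sketch restarts the walk from the
head after growing:

```c
#include <stdlib.h>

struct sketch_addr {
	unsigned int address;
	struct sketch_addr *next;
};

/*
 * Copy every address on the list into a heap array.  Returns the number
 * of entries, or -1 if allocation fails.  On success the caller owns *out
 * and must free() it.
 */
int sketch_collect_addrs(const struct sketch_addr *head, unsigned int **out)
{
	unsigned int *buf = NULL;
	size_t max = 2;		/* deliberately small so the retry path is exercised */

	for (;;) {
		const struct sketch_addr *a;
		size_t n = 0;
		/* Allocate (or grow) before entering the critical section. */
		unsigned int *tmp = realloc(buf, max * sizeof(*buf));

		if (!tmp) {
			free(buf);
			*out = NULL;
			return -1;
		}
		buf = tmp;

		/* The list lock would be taken here. */
		for (a = head; a && n < max; a = a->next)
			buf[n++] = a->address;
		/* The list lock would be released here. */

		if (!a) {	/* the whole list fit */
			*out = buf;
			return (int)n;
		}

		max *= 2;	/* ran out of room: grow and start over */
	}
}
```

Once the array is complete it can be written out in one go, with no lock
held during the write.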

Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Documentation/checkpoint/usage.txt |    1 +
 include/linux/checkpoint.h         |   29 ++-
 include/linux/checkpoint_hdr.h     |   58 +++
 kernel/checkpoint/checkpoint.c     |    5 -
 kernel/nsproxy.c                   |   24 +-
 net/Kconfig                        |    4 +
 net/Makefile                       |    1 +
 net/checkpoint.c                   |   63 +++-
 net/checkpoint_dev.c               |  818 ++++++++++++++++++++++++++++++++++++
 9 files changed, 995 insertions(+), 8 deletions(-)
 create mode 100644 net/checkpoint_dev.c

diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
index d697ed1..5700448 100644
--- a/Documentation/checkpoint/usage.txt
+++ b/Documentation/checkpoint/usage.txt
@@ -15,6 +15,7 @@ The API consists of three new system calls:
  an open file to which error and debug messages are written. @flags
  may be one or more of:
    - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
+   - CHECKPOINT_NETNS : include network namespaces and devices
  (other value are not allowed).
 
  Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 43d67ce..84bb7a9 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -14,6 +14,7 @@
 
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE	0x1
+#define CHECKPOINT_NETNS	0x2
 
 /* restart user flags */
 #define RESTART_TASKSELF	0x1
@@ -35,6 +36,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <linux/inetdevice.h>
 #include <net/sock.h>
 
 /* sycall helpers */
@@ -55,7 +57,10 @@ extern long do_sys_restart(pid_t pid, int fd,
 #define CKPT_CTX_ERROR		(1 << CKPT_CTX_ERROR_BIT)
 
 /* ckpt_ctx: uflags */
-#define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
+#define CHECKPOINT_USER_FLAGS \
+	(CHECKPOINT_SUBTREE | \
+	 CHECKPOINT_NETNS)
+
 #define RESTART_USER_FLAGS  \
 	(RESTART_TASKSELF | \
 	 RESTART_FROZEN | \
@@ -119,6 +124,28 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 extern struct sk_buff *sock_restore_skb(struct ckpt_ctx *ctx, struct sock *sk);
 extern void sock_listening_list_free(struct list_head *head);
 
+#ifdef CONFIG_NETNS_CHECKPOINT
+extern int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_netns(struct ckpt_ctx *ctx);
+extern int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_netdev(struct ckpt_ctx *ctx);
+
+extern int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx,
+				     struct net_device *dev);
+extern int ckpt_netdev_inet_addrs(struct in_device *indev,
+				  struct ckpt_netdev_addr *list[]);
+extern int ckpt_netdev_hwaddr(struct net_device *dev,
+			      struct ckpt_hdr_netdev *h);
+extern struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
+					struct net_device *dev,
+					struct ckpt_netdev_addr *addrs[]);
+#else
+# define checkpoint_netns NULL
+# define restore_netns NULL
+# define checkpoint_netdev NULL
+# define restore_netdev NULL
+#endif
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1564726..eb5e1b4 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -189,6 +189,12 @@ enum {
 #define CKPT_HDR_SOCKET_UNIX CKPT_HDR_SOCKET_UNIX
 	CKPT_HDR_SOCKET_INET,
 #define CKPT_HDR_SOCKET_INET CKPT_HDR_SOCKET_INET
+	CKPT_HDR_NET_NS,
+#define CKPT_HDR_NET_NS CKPT_HDR_NET_NS
+	CKPT_HDR_NETDEV,
+#define CKPT_HDR_NETDEV CKPT_HDR_NETDEV
+	CKPT_HDR_NETDEV_ADDR,
+#define CKPT_HDR_NETDEV_ADDR CKPT_HDR_NETDEV_ADDR
 
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
@@ -261,6 +267,10 @@ enum obj_type {
 #define CKPT_OBJ_SECURITY_PTR CKPT_OBJ_SECURITY_PTR
 	CKPT_OBJ_SECURITY,
 #define CKPT_OBJ_SECURITY CKPT_OBJ_SECURITY
+	CKPT_OBJ_NET_NS,
+#define CKPT_OBJ_NET_NS CKPT_OBJ_NET_NS
+	CKPT_OBJ_NETDEV,
+#define CKPT_OBJ_NETDEV CKPT_OBJ_NETDEV
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -444,6 +454,7 @@ struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__s32 uts_objref;
 	__s32 ipc_objref;
+	__s32 net_objref;
 } __attribute__((aligned(8)));
 
 /* cannot include <linux/tty.h> from userspace, so define: */
@@ -768,6 +779,53 @@ struct ckpt_hdr_file_socket {
 	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_netns {
+	struct ckpt_hdr h;
+	__s32 this_ref;
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_types {
+	CKPT_NETDEV_LO,
+	CKPT_NETDEV_VETH,
+	CKPT_NETDEV_SIT,
+	CKPT_NETDEV_MACVLAN,
+	CKPT_NETDEV_MAX,
+};
+
+struct ckpt_hdr_netdev {
+	struct ckpt_hdr h;
+	__s32 netns_ref;
+	union {
+		struct {
+			__s32 this_ref;
+			__s32 peer_ref;
+		} veth;
+		struct {
+			__u32 mode;
+		} macvlan;
+	};
+	__u32 inet_addrs;
+	__u16 type;
+	__u16 flags;
+	__u8 hwaddr[6];
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_addr_types {
+	CKPT_NETDEV_ADDR_IPV4,
+};
+
+struct ckpt_netdev_addr {
+	__u16 type;
+	union {
+		struct {
+			__be32 inet4_local;
+			__be32 inet4_address;
+			__be32 inet4_mask;
+			__be32 inet4_broadcast;
+		};
+	} __attribute__((aligned(8)));
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_eventpoll_items {
 	struct ckpt_hdr h;
 	__s32  epfile_objref;
diff --git a/kernel/checkpoint/checkpoint.c b/kernel/checkpoint/checkpoint.c
index 7a4f1ce..4059c28 100644
--- a/kernel/checkpoint/checkpoint.c
+++ b/kernel/checkpoint/checkpoint.c
@@ -291,11 +291,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		_ckpt_err(ctx, -EPERM, "%(T)Nested mnt_ns unsupported\n");
 		ret = -EPERM;
 	}
-	/* no support for >1 private netns */
-	if (nsproxy->net_ns != ctx->root_nsproxy->net_ns) {
-		_ckpt_err(ctx, -EPERM, "%(T)Nested net_ns unsupported\n");
-		ret = -EPERM;
-	}
 	/* pidns must be descendent of root_nsproxy */
 	pidns = nsproxy->pid_ns;
 	while (pidns != ctx->root_nsproxy->pid_ns) {
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 7fb3cea..d4af91d 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -260,6 +260,12 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
 	if (ret < 0)
 		goto out;
+#ifdef CONFIG_NETNS_CHECKPOINT
+	if (ctx->uflags & CHECKPOINT_NETNS)
+		ret = ckpt_obj_collect(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		goto out;
+#endif
 #ifdef CONFIG_IPC_NS
 	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
 	if (ret < 0)
@@ -308,6 +314,15 @@ static int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr)
 #endif	/* CONFIG_IPC_NS */
 	h->ipc_objref = ret;
 
+#ifdef CONFIG_NETNS_CHECKPOINT
+	if (ctx->uflags & CHECKPOINT_NETNS)
+		ret = checkpoint_obj(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	else
+		ret = 0;
+	if (ret < 0)
+		goto out;
+	h->net_objref = ret;
+#endif
 	/* FIXME: for now, only marked visited to pacify leaks */
 	ret = ckpt_obj_visit(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
 	if (ret < 0)
@@ -341,6 +356,14 @@ static void *restore_ns(struct ckpt_ctx *ctx)
 		ret = PTR_ERR(uts_ns);
 		goto out;
 	}
+	if (h->net_objref == 0)
+		net_ns = current->nsproxy->net_ns;
+	else
+		net_ns = ckpt_obj_fetch(ctx, h->net_objref, CKPT_OBJ_NET_NS);
+	if (IS_ERR(net_ns)) {
+		ret = PTR_ERR(net_ns);
+		goto out;
+	}
 
 	if (h->ipc_objref == 0)
 		ipc_ns = ctx->root_nsproxy->ipc_ns;
@@ -356,7 +379,6 @@ static void *restore_ns(struct ckpt_ctx *ctx)
 	}
 
 	mnt_ns = ctx->root_nsproxy->mnt_ns;
-	net_ns = ctx->root_nsproxy->net_ns;
 
 	if (uts_ns == current->nsproxy->uts_ns &&
 	    ipc_ns == current->nsproxy->ipc_ns &&
diff --git a/net/Kconfig b/net/Kconfig
index 041c35e..c1cb774 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -276,4 +276,8 @@ source "net/wimax/Kconfig"
 source "net/rfkill/Kconfig"
 source "net/9p/Kconfig"
 
+config NETNS_CHECKPOINT
+       bool
+       default y if NET && NET_NS && CHECKPOINT
+
 endif   # if NET
diff --git a/net/Makefile b/net/Makefile
index 74b038f..b7d78f4 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -67,3 +67,4 @@ endif
 obj-$(CONFIG_WIMAX)		+= wimax/
 
 obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
+obj-$(CONFIG_NETNS_CHECKPOINT)	+= checkpoint_dev.o
diff --git a/net/checkpoint.c b/net/checkpoint.c
index 03c1224..b1f56bf 100644
--- a/net/checkpoint.c
+++ b/net/checkpoint.c
@@ -986,6 +986,56 @@ struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
  * sock-related checkpoint objects
  */
 
+static int netns_grab(void *ptr)
+{
+	struct net *net = ptr;
+
+	get_net(net);
+	return 0;
+}
+
+static void netns_drop(void *ptr, int lastref)
+{
+	struct net *net = ptr;
+
+	put_net(net);
+}
+
+/* netns object */
+static const struct ckpt_obj_ops ckpt_obj_netns_ops = {
+	.obj_name = "NET_NS",
+	.obj_type = CKPT_OBJ_NET_NS,
+	.ref_grab = netns_grab,
+	.ref_drop = netns_drop,
+	.checkpoint = checkpoint_netns,
+	.restore = restore_netns,
+};
+
+static int netdev_grab(void *ptr)
+{
+	struct net_device *dev = ptr;
+
+	dev_hold(dev);
+	return 0;
+}
+
+static void netdev_drop(void *ptr, int lastref)
+{
+	struct net_device *dev = ptr;
+
+	dev_put(dev);
+}
+
+/* netdev object */
+static const struct ckpt_obj_ops ckpt_obj_netdev_ops = {
+	.obj_name = "NET_DEV",
+	.obj_type = CKPT_OBJ_NETDEV,
+	.ref_grab = netdev_grab,
+	.ref_drop = netdev_drop,
+	.checkpoint = checkpoint_netdev,
+	.restore = restore_netdev,
+};
+
 static int obj_sock_grab(void *ptr)
 {
 	sock_hold((struct sock *) ptr);
@@ -1033,6 +1083,17 @@ static const struct ckpt_obj_ops ckpt_obj_sock_ops = {
 
 static int __init checkpoint_register_sock(void)
 {
-	return register_checkpoint_obj(&ckpt_obj_sock_ops);
+	int ret;
+
+	ret = register_checkpoint_obj(&ckpt_obj_sock_ops);
+	if (ret < 0)
+		return ret;
+	ret = register_checkpoint_obj(&ckpt_obj_netns_ops);
+	if (ret < 0)
+		return ret;
+	ret = register_checkpoint_obj(&ckpt_obj_netdev_ops);
+	if (ret < 0)
+		return ret;
+	return 0;
 }
 module_init(checkpoint_register_sock);
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
new file mode 100644
index 0000000..34a6bdb
--- /dev/null
+++ b/net/checkpoint_dev.c
@@ -0,0 +1,818 @@
+/*
+ *  Copyright 2010 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms@us.ibm.com>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/sched.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/inetdevice.h>
+#include <linux/veth.h>
+#include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
+
+#include <net/net_namespace.h>
+#include <net/sch_generic.h>
+
+struct dq_netdev {
+	struct net_device *dev;
+	struct ckpt_ctx *ctx;
+};
+
+struct veth_newlink {
+	char *peer;
+};
+
+struct mvl_newlink {
+	char this[IFNAMSIZ+1];
+	char base[IFNAMSIZ+1];
+	int mode;
+	__u8 *hwaddr;
+};
+
+typedef int (*new_link_fn)(struct sk_buff *, void *);
+
+static int __kern_devinet_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = devinet_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = dev_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static struct socket *rtnl_open(void)
+{
+	struct socket *sock;
+	int ret;
+
+	ret = sock_create(AF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE, &sock);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	return sock;
+}
+
+static int rtnl_close(struct socket *rtnl)
+{
+	if (rtnl)
+		return kernel_sock_shutdown(rtnl, SHUT_RDWR);
+	else
+		return 0;
+}
+
+static struct nlmsghdr *rtnl_get_response(struct socket *rtnl,
+					  struct sk_buff **skb)
+{
+	int ret;
+	long timeo = MAX_SCHEDULE_TIMEOUT;
+	struct nlmsghdr *nlh;
+
+	*skb = NULL;
+
+	lock_sock(rtnl->sk);
+	ret = sk_wait_data(rtnl->sk, &timeo);
+	if (ret)
+		*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
+	release_sock(rtnl->sk);
+
+	if (!*skb)
+		return ERR_PTR(-EPIPE);
+
+	ret = -EINVAL;
+	nlh = nlmsg_hdr(*skb);
+	if (!nlh)
+		goto err;
+
+	if (nlh->nlmsg_type == NLMSG_ERROR) {
+		struct nlmsgerr *errmsg = nlmsg_data(nlh);
+		ret = errmsg->error;
+		goto err;
+	}
+
+	return nlh;
+ err:
+	kfree_skb(*skb);
+	*skb = NULL;
+
+	return ERR_PTR(ret);
+}
+
+int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	/*
+	 * Currently, we treat the "initial network namespace" as that
+	 * of the process doing the checkpoint.  This gives us a
+	 * consistent view of the container and its layout from the
+	 * perspective of the "agent" doing the checkpoint and
+	 * restore.
+	 */
+	return dev->nd_net == current->nsproxy->net_ns;
+}
+EXPORT_SYMBOL_GPL(ckpt_netdev_in_init_netns);
+
+int ckpt_netdev_hwaddr(struct net_device *dev, struct ckpt_hdr_netdev *h)
+{
+	struct net *net = dev->nd_net;
+	struct ifreq req;
+	int ret;
+
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	ret = __kern_dev_ioctl(net, SIOCGIFFLAGS, &req);
+	if (ret < 0)
+		return ret;
+	h->flags = req.ifr_flags;
+
+	ret = __kern_dev_ioctl(net, SIOCGIFHWADDR, &req);
+	if (ret < 0)
+		return ret;
+
+	memcpy(h->hwaddr, req.ifr_hwaddr.sa_data, sizeof(h->hwaddr));
+
+	return 0;
+}
+
+int ckpt_netdev_inet_addrs(struct in_device *indev,
+			   struct ckpt_netdev_addr *_abuf[])
+{
+	struct ckpt_netdev_addr *abuf = NULL;
+	struct in_ifaddr *addr = indev->ifa_list;
+	int addrs = 0;
+	int max = 32;
+
+ retry:
+	*_abuf = krealloc(abuf, max * sizeof(*abuf), GFP_KERNEL);
+	if (*_abuf == NULL) {
+		addrs = -ENOMEM;
+		goto out;
+	}
+	abuf = *_abuf;
+
+	read_lock(&dev_base_lock);
+
+	while (addr) {
+		abuf[addrs].type = CKPT_NETDEV_ADDR_IPV4; /* Only IPv4 now */
+		abuf[addrs].inet4_local = htonl(addr->ifa_local);
+		abuf[addrs].inet4_address = htonl(addr->ifa_address);
+		abuf[addrs].inet4_mask = htonl(addr->ifa_mask);
+		abuf[addrs].inet4_broadcast = htonl(addr->ifa_broadcast);
+
+		addr = addr->ifa_next;
+		if (++addrs >= max) {
+			read_unlock(&dev_base_lock);
+			max *= 2;
+			goto retry;
+		}
+	}
+
+	read_unlock(&dev_base_lock);
+ out:
+	if (addrs < 0) {
+		kfree(abuf);
+		*_abuf = NULL;
+	}
+
+	return addrs;
+}
+
+struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
+					 struct net_device *dev,
+					 struct ckpt_netdev_addr *addrs[])
+{
+	struct ckpt_hdr_netdev *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	ret = ckpt_netdev_hwaddr(dev, h);
+	if (ret < 0)
+		goto out;
+
+	*addrs = NULL;
+	ret = h->inet_addrs = ckpt_netdev_inet_addrs(dev->ip_ptr, addrs);
+	if (ret < 0)
+		goto out;
+
+	if (ckpt_netdev_in_init_netns(ctx, dev))
+		ret = h->netns_ref = 0;
+	else
+		ret = h->netns_ref = checkpoint_obj(ctx, dev->nd_net,
+						    CKPT_OBJ_NET_NS);
+ out:
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+		kfree(*addrs);
+	}
+
+	return h;
+}
+EXPORT_SYMBOL_GPL(ckpt_netdev_base);
+
+int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net_device *dev = (struct net_device *)ptr;
+	int ret;
+
+	if (!dev->netdev_ops->ndo_checkpoint) {
+		ckpt_err(ctx, -ENOSYS,
+			 "Device %s does not support checkpoint\n", dev->name);
+		return -ENOSYS;
+	}
+
+	ckpt_debug("checkpointing netdev %s\n", dev->name);
+
+	ret = dev->netdev_ops->ndo_checkpoint(ctx, dev);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "Failed to checkpoint netdev %s: %i\n",
+			 dev->name, ret);
+
+	return ret;
+}
+
+int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net *net = ptr;
+	struct net_device *dev;
+	struct ckpt_hdr_netns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (!h)
+		return -ENOMEM;
+
+	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
+	BUG_ON(h->this_ref <= 0);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	for_each_netdev(net, dev) {
+		if (dev->netdev_ops->ndo_checkpoint)
+			ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
+		else if (dev->flags & IFF_UP)
+			ret = -ENOSYS;
+		else
+			/* TODO: There should be a flag to attempt a
+			 * checkpoint of downed interfaces, regardless
+			 * of whether they support checkpoint or not.
+			 */
+			ret = 0;
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+static int restore_in_addrs(struct ckpt_ctx *ctx,
+			    __u32 naddrs,
+			    struct net *net,
+			    struct net_device *dev)
+{
+	__u32 i;
+	int ret = 0;
+	int len = naddrs * sizeof(struct ckpt_netdev_addr);
+	struct ckpt_netdev_addr *addrs = NULL;
+
+	ret = ckpt_read_payload(ctx, (void **)&addrs, len, CKPT_HDR_BUFFER);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < naddrs; i++) {
+		struct ckpt_netdev_addr *addr = &addrs[i];
+		struct ifreq req;
+		struct sockaddr_in *inaddr;
+
+		if (addr->type != CKPT_NETDEV_ADDR_IPV4) {
+			ret = -EINVAL;
+			ckpt_err(ctx, ret, "Unsupported netdev addr type %i\n",
+				 addr->type);
+			break;
+		}
+
+		ckpt_debug("restoring %s: %x/%x/%x\n", dev->name,
+			   addr->inet4_address,
+			   addr->inet4_mask,
+			   addr->inet4_broadcast);
+
+		memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = ntohl(addr->inet4_address);
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set address\n");
+			break;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = ntohl(addr->inet4_mask);
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFNETMASK, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set netmask\n");
+			break;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = ntohl(addr->inet4_broadcast);
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFBRDADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set broadcast\n");
+			break;
+		}
+	}
+
+ out:
+	kfree(addrs);
+
+	return ret;
+}
+
+static int veth_new_link_msg(struct sk_buff *skb, void *data)
+{
+	struct nlattr *linkinfo;
+	struct nlattr *linkdata;
+	struct ifinfomsg ifm;
+	int ret = -ENOMEM;
+	struct veth_newlink *d = data;
+
+	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
+	if (!linkinfo)
+		goto out;
+
+	ret = nla_put_string(skb, IFLA_INFO_KIND, "veth");
+	if (ret)
+		goto out;
+
+	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
+	if (!linkdata) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = nla_put(skb, VETH_INFO_PEER, sizeof(ifm), &ifm);
+	if (!ret)
+		ret = nla_put_string(skb, IFLA_IFNAME, d->peer);
+
+	nla_nest_end(skb, linkdata);
+ out:
+	nla_nest_end(skb, linkinfo);
+
+	return ret;
+}
+
+static int mvl_new_link_msg(struct sk_buff *skb, void *data)
+{
+	struct mvl_newlink *d = data;
+	struct nlattr *linkinfo;
+	struct nlattr *linkdata;
+	struct net_device *lowerdev;
+	int ret;
+
+	lowerdev = dev_get_by_name(current->nsproxy->net_ns, d->base);
+	if (!lowerdev)
+		return -ENOENT;
+
+	ret = nla_put(skb, IFLA_ADDRESS, ETH_ALEN, d->hwaddr);
+	if (ret)
+		goto out_put;
+
+	ret = nla_put_u32(skb, IFLA_LINK, lowerdev->ifindex);
+	if (ret)
+		goto out_put;
+
+	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
+	if (!linkinfo) {
+		ret = -ENOMEM;
+		goto out_put;
+	}
+
+	ret = nla_put_string(skb, IFLA_INFO_KIND, "macvlan");
+	if (ret)
+		goto out;
+
+	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
+	if (!linkdata) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = nla_put_u32(skb, IFLA_MACVLAN_MODE, d->mode);
+	nla_nest_end(skb, linkdata);
+ out:
+	nla_nest_end(skb, linkinfo);
+ out_put:
+	dev_put(lowerdev);
+
+	return ret;
+}
+
+static struct sk_buff *new_link_msg(new_link_fn fn, void *data, char *name)
+{
+	int ret = -ENOMEM;
+	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb;
+	struct ifinfomsg *ifm;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		goto out;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_NEWLINK, sizeof(*ifm), flags);
+	if (!nlh)
+		goto out;
+
+	ifm = nlmsg_data(nlh);
+	memset(ifm, 0, sizeof(*ifm));
+
+	ret = nla_put_string(skb, IFLA_IFNAME, name);
+	if (ret)
+		goto out;
+
+	ret = fn(skb, data);
+
+	nlmsg_end(skb, nlh);
+
+ out:
+	if (ret < 0) {
+		kfree_skb(skb);
+		skb = ERR_PTR(ret);
+	}
+
+	return skb;
+}
+
+static struct net_device *rtnl_newlink(new_link_fn fn, void *data, char *name)
+{
+	int ret = -ENOMEM;
+	struct socket *rtnl = NULL;
+	struct sk_buff *skb = NULL;
+	struct nlmsghdr *nlh;
+	struct msghdr msg;
+	struct kvec kvec;
+
+	skb = new_link_msg(fn, data, name);
+	if (IS_ERR(skb)) {
+		ckpt_debug("failed to create new link message: %li\n",
+			   PTR_ERR(skb));
+		return ERR_PTR(PTR_ERR(skb));
+	}
+
+	memset(&msg, 0, sizeof(msg));
+	kvec.iov_len = skb->len;
+	kvec.iov_base = skb->head;
+
+	rtnl = rtnl_open();
+	if (IS_ERR(rtnl)) {
+		ret = PTR_ERR(rtnl);
+		ckpt_debug("Unable to open rtnetlink socket: %i\n", ret);
+		goto out_noclose;
+	}
+
+	ret = kernel_sendmsg(rtnl, &msg, &kvec, 1, kvec.iov_len);
+	if (ret < 0)
+		goto out;
+	else if (ret != skb->len) {
+		ret = -EIO;
+		goto out;
+	}
+
+	/* Free the send skb to make room for the receive skb */
+	kfree_skb(skb);
+
+	nlh = rtnl_get_response(rtnl, &skb);
+	if (IS_ERR(nlh)) {
+		ret = PTR_ERR(nlh);
+		ckpt_debug("RTNETLINK said: %i\n", ret);
+	}
+ out:
+	rtnl_close(rtnl);
+ out_noclose:
+	kfree_skb(skb);
+
+	if (ret < 0)
+		return ERR_PTR(ret);
+	else
+		return dev_get_by_name(current->nsproxy->net_ns, name);
+}
+
+static int netdev_noop(void *data)
+{
+	return 0;
+}
+
+static int netdev_cleanup(void *data)
+{
+	struct dq_netdev *dq = data;
+
+	dev_put(dq->dev);
+
+	if (dq->ctx->errno) {
+		ckpt_debug("Unregistering netdev %s\n", dq->dev->name);
+		unregister_netdev(dq->dev);
+	}
+
+	return 0;
+}
+
+static struct net_device *restore_veth(struct ckpt_ctx *ctx,
+				       struct ckpt_hdr_netdev *h,
+				       struct net *net)
+{
+	int ret;
+	char this_name[IFNAMSIZ];
+	char peer_name[IFNAMSIZ];
+	struct net_device *dev;
+	struct net_device *peer;
+	struct ifreq req;
+	struct dq_netdev dq;
+
+	dq.ctx = ctx;
+
+	ret = _ckpt_read_buffer(ctx, this_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ret = _ckpt_read_buffer(ctx, peer_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ckpt_debug("restored veth netdev %s:%s\n", this_name, peer_name);
+
+	peer = ckpt_obj_try_fetch(ctx, h->veth.peer_ref, CKPT_OBJ_NETDEV);
+	if (IS_ERR(peer)) {
+		struct veth_newlink veth = {
+			.peer = peer_name,
+		};
+
+		dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
+		if (IS_ERR(dev))
+			return dev;
+
+		peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
+		if (!peer) {
+			ret = -EINVAL;
+			goto err_dev;
+		}
+
+		dq.dev = peer;
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     netdev_noop, netdev_cleanup);
+		if (ret)
+			goto err_peer;
+
+		ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
+				      CKPT_OBJ_NETDEV);
+		if (ret < 0)
+			/* Can't recall peer dq, so let it cleanup peer */
+			goto err_dev;
+		dev_put(peer);
+
+		dq.dev = dev;
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     netdev_noop, netdev_cleanup);
+		if (ret)
+			/* Can't recall peer dq, so let it cleanup peer */
+			goto err_dev;
+
+	} else {
+		/* We're second: get our dev from the hash */
+		dev = ckpt_obj_fetch(ctx, h->veth.this_ref, CKPT_OBJ_NETDEV);
+		if (IS_ERR(dev))
+			return dev;
+	}
+
+	/* Move to our new netns */
+	rtnl_lock();
+	ret = dev_change_net_namespace(dev, net, dev->name);
+	rtnl_unlock();
+	if (ret < 0)
+		goto out;
+
+	/* Restore MAC address */
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	memcpy(req.ifr_hwaddr.sa_data, h->hwaddr, sizeof(h->hwaddr));
+	req.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+	ret = __kern_dev_ioctl(net, SIOCSIFHWADDR, &req);
+ out:
+	if (ret)
+		dev = ERR_PTR(ret);
+
+	return dev;
+
+ err_peer:
+	dev_put(peer);
+	unregister_netdev(peer);
+ err_dev:
+	dev_put(dev);
+	unregister_netdev(dev);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_lo(struct ckpt_ctx *ctx,
+				     struct ckpt_hdr_netdev *h,
+				     struct net *net)
+{
+	struct net_device *dev;
+	char name[IFNAMSIZ+1] = { 0 };	/* keep name NUL-terminated */
+	int ret;
+
+	dev = dev_get_by_name(net, "lo");
+	if (!dev)
+		return ERR_PTR(-EINVAL);
+
+	ret = _ckpt_read_buffer(ctx, name, IFNAMSIZ);
+	if (ret < 0)
+		goto err;
+
+	if (strncmp(dev->name, name, IFNAMSIZ) != 0) {
+		ret = dev_change_name(dev, name);
+		if (ret < 0)
+			goto err;
+	}
+
+	return dev;
+ err:
+	dev_put(dev);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_sit(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_netdev *h,
+				      struct net *net)
+{
+	/* Don't actually do anything for SIT devices yet */
+	return dev_get_by_name(net, "sit0") ?: ERR_PTR(-ENODEV);
+}
+
+static struct net_device *restore_macvlan(struct ckpt_ctx *ctx,
+					  struct ckpt_hdr_netdev *h,
+					  struct net *net)
+{
+	struct net_device *dev;
+	struct mvl_newlink mvl = {
+		.mode = h->macvlan.mode,
+		.hwaddr = h->hwaddr,
+	};
+	int ret;
+
+	ret = _ckpt_read_buffer(ctx, mvl.this, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ret = _ckpt_read_buffer(ctx, mvl.base, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	dev = rtnl_newlink(mvl_new_link_msg, &mvl, mvl.this);
+	if (IS_ERR(dev)) {
+		ckpt_err(ctx, PTR_ERR(dev),
+			 "Failed to create macvlan device %s:%s\n",
+			 mvl.this, mvl.base);
+		goto out;
+	}
+
+	rtnl_lock();
+	ret = dev_change_net_namespace(dev, net, dev->name);
+	rtnl_unlock();
+
+	if (ret) {
+		ckpt_err(ctx, ret, "Failed to change netns of %s:%s\n",
+			 mvl.this, mvl.base);
+		dev_put(dev);
+		unregister_netdev(dev);
+		dev = ERR_PTR(ret);
+	}
+ out:
+	return dev;
+}
+
+typedef struct net_device *(*restore_netdev_fn)(struct ckpt_ctx *,
+						struct ckpt_hdr_netdev *,
+						struct net *);
+
+restore_netdev_fn restore_netdev_functions[] = {
+	restore_lo,		/* CKPT_NETDEV_LO */
+	restore_veth,		/* CKPT_NETDEV_VETH */
+	restore_sit,		/* CKPT_NETDEV_SIT */
+	restore_macvlan,	/* CKPT_NETDEV_MACVLAN */
+};
+
+void *restore_netdev(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netdev *h;
+	struct net_device *dev = NULL;
+	struct ifreq req;
+	struct net *net;
+	int ret;
+	restore_netdev_fn restore_fn = NULL;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->netns_ref != 0) {
+		net = ckpt_obj_try_fetch(ctx, h->netns_ref, CKPT_OBJ_NET_NS);
+		if (IS_ERR(net)) {
+			ckpt_debug("failed to get net for %i\n", h->netns_ref);
+			ret = PTR_ERR(net);
+			goto out;
+		}
+	} else
+		net = current->nsproxy->net_ns;
+
+	if (h->type >= CKPT_NETDEV_MAX) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret, "Invalid netdev type %i\n", h->type);
+		goto out;
+	}
+
+	restore_fn = restore_netdev_functions[h->type];
+
+	dev = restore_fn(ctx, h, net);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		ckpt_err(ctx, ret, "Netdev type %i restore failed\n", h->type);
+		goto out;
+	}
+
+	/* Restore flags (which will likely bring the interface up) */
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	req.ifr_flags = h->flags;
+	ret = __kern_dev_ioctl(net, SIOCSIFFLAGS, &req);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0)
+		ret = restore_in_addrs(ctx, h->inet_addrs, net, dev);
+ out:
+	if (ret) {
+		ckpt_err(ctx, ret, "Failed to restore netdevice\n");
+		if ((h->type == CKPT_NETDEV_VETH) && dev && !IS_ERR(dev))
+			dev_put(dev);
+		dev = ERR_PTR(ret);
+	} else
+		ckpt_debug("restored netdev %s\n", dev->name);
+
+	ckpt_hdr_put(ctx, h);
+
+	return dev;
+}
+
+void *restore_netns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netns *h;
+	struct net *net;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read netns\n");
+		return h;
+	}
+
+	if (h->this_ref != 0) {
+		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
+		if (IS_ERR(net))
+			goto out;
+	} else
+		net = current->nsproxy->net_ns;
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return net;
+}
-- 
1.6.3.3



* [PATCH v21 098/100] c/r: Add a checkpoint handler to the 'sit' device
From: Oren Laadan @ 2010-05-01 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

This handler does little to checkpoint the device beyond the minimum
required to support the restart process.  When we add IPv6 support,
we can fill this out.

This allows us to avoid skipping unsupported interfaces on a normal
system.

Changelog[v21]:
  - Do not include checkpoint_hdr.h explicitly
  - Unbreak compiling with CONFIG_CHECKPOINT=n or CONFIG_NET_NS=n

Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 net/ipv6/sit.c |   34 ++++++++++++++++++++++++++++++++++
 1 files changed, 34 insertions(+), 0 deletions(-)
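
For readers following the restart side: the type tag written by this
handler (CKPT_NETDEV_SIT) is what the restore code uses to index its
table of restore functions, after checking it against CKPT_NETDEV_MAX.
Below is a minimal user-space sketch of that bounds-checked dispatch.
The names netdev_type, restore_table and restore_by_type are
hypothetical stand-ins and not part of the patch, and the handlers just
return a label rather than a struct net_device.

```c
#include <stddef.h>

/*
 * Hypothetical stand-ins for the CKPT_NETDEV_* values; the order
 * follows the comments in the restore function table.
 */
enum netdev_type {
	NETDEV_LO,
	NETDEV_VETH,
	NETDEV_SIT,
	NETDEV_MACVLAN,
	NETDEV_MAX,
};

typedef const char *(*restore_fn)(void);

/* Toy handlers: each returns a label instead of a device pointer. */
static const char *restore_lo(void)      { return "lo"; }
static const char *restore_veth(void)    { return "veth"; }
static const char *restore_sit(void)     { return "sit0"; }
static const char *restore_macvlan(void) { return "macvlan"; }

/* One handler per type, indexed directly by the type tag. */
static const restore_fn restore_table[NETDEV_MAX] = {
	[NETDEV_LO]      = restore_lo,
	[NETDEV_VETH]    = restore_veth,
	[NETDEV_SIT]     = restore_sit,
	[NETDEV_MACVLAN] = restore_macvlan,
};

/*
 * Reject out-of-range tags before indexing the table, since the tag
 * comes from the checkpoint image and cannot be trusted.
 * Returns NULL for an invalid type.
 */
static const char *restore_by_type(unsigned int type)
{
	if (type >= NETDEV_MAX)
		return NULL;

	return restore_table[type]();
}
```

In this sketch, restore_by_type(NETDEV_SIT) calls the SIT handler, while
any tag at or above NETDEV_MAX is refused without touching the table.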

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 5abae10..5ecbe56 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1084,11 +1084,45 @@ static int ipip6_tunnel_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+#include <linux/checkpoint.h>
+
+#ifdef CONFIG_NETNS_CHECKPOINT
+static int ipip6_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_SIT;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
+
 static const struct net_device_ops ipip6_netdev_ops = {
 	.ndo_uninit	= ipip6_tunnel_uninit,
 	.ndo_start_xmit	= ipip6_tunnel_xmit,
 	.ndo_do_ioctl	= ipip6_tunnel_ioctl,
 	.ndo_change_mtu	= ipip6_tunnel_change_mtu,
+#ifdef CONFIG_NETNS_CHECKPOINT
+	.ndo_checkpoint	= ipip6_checkpoint,
+#endif
 };
 
 static void ipip6_tunnel_setup(struct net_device *dev)
-- 
1.6.3.3

