Netdev List
* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-02  9:20 UTC
  To: Eric Dumazet
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <1272783366.2173.13.camel@edumazet-laptop>

> I tried it on the right spot (since my bench was only doing recvmsg()
> calls, I had to patch wait_for_packet() in net/core/datagram.c):
> 
> udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
> schedule_timeout
> 
> Unfortunately, using io_schedule_timeout() did not solve the problem.

Hmm, too bad. Weird.

> 
> Tell me if you need some traces or something.

I'll try to reproduce it and see what I can do.

-Andi



* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02  6:56 UTC
  To: Andi Kleen
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <20100501110000.GB9434@gargoyle.fritz.box>

On Saturday 01 May 2010 at 13:00 +0200, Andi Kleen wrote:
> On Fri, Apr 30, 2010 at 04:38:57PM -0700, David Miller wrote:
> > From: Andi Kleen <ak@gargoyle.fritz.box>
> > Date: Thu, 29 Apr 2010 23:41:44 +0200
> > 
> > >     Use io_schedule() in network stack to tell cpuidle governor to guarantee lower latencies
> > > 
> > >     XXX: probably too aggressive, some of these sleeps are not under high load.
> > > 
> > >     Based on a bug report from Eric Dumazet.
> > >     
> > >     Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > 
> > I like this, except that we probably don't want the delayacct_blkio_*() calls
> > these things do.
> 
> Yes.
> 
> It needs more work, please don't apply it yet, to handle the "long sleep" case.
> 
> Still curious if it fixes Eric's test case.
> 

I tried it on the right spot (since my bench was only doing recvmsg()
calls, I had to patch wait_for_packet() in net/core/datagram.c):

udp_recvmsg -> __skb_recv_datagram -> wait_for_packet ->
schedule_timeout

Unfortunately, using io_schedule_timeout() did not solve the problem.

Tell me if you need some traces or something.

Thanks !

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 95b851f..051fd5b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -113,7 +113,7 @@ static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
 		goto interrupted;
 
 	error = 0;
-	*timeo_p = schedule_timeout(*timeo_p);
+	*timeo_p = io_schedule_timeout(*timeo_p);
 out:
 	finish_wait(sk_sleep(sk), &wait);
 	return error;




* Re: [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
From: Eric Dumazet @ 2010-05-02  6:50 UTC
  To: David Miller; +Cc: netdev, therbert, hadi
In-Reply-To: <20100501.181558.141243424.davem@davemloft.net>

On Saturday 01 May 2010 at 18:15 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 01 May 2010 08:42:25 +0200
> 
> > [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
> > 
> > With RPS, this patch can give a 5 % boost in performance.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Awesome, but let's do this in a way that allows us to easily annotate
> where inlining makes sense in other places, not just here.
> 
> Something like this, ok?
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 82f5116..746a652 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1128,6 +1128,11 @@ static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
>  	return skb->data += len;
>  }
>  
> +static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
> +{
> +	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
> +}
> +
>  extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta);
>  
>  static inline unsigned char *__pskb_pull(struct sk_buff *skb, unsigned int len)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4218ff4..8b9c109 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1051,7 +1051,7 @@ EXPORT_SYMBOL(skb_push);
>   */
>  unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
>  {
> -	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
> +	return skb_pull_inline(skb, len);
>  }
>  EXPORT_SYMBOL(skb_pull);
>  
> diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
> index 0c0d272..61ec032 100644
> --- a/net/ethernet/eth.c
> +++ b/net/ethernet/eth.c
> @@ -162,7 +162,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
>  
>  	skb->dev = dev;
>  	skb_reset_mac_header(skb);
> -	skb_pull(skb, ETH_HLEN);
> +	skb_pull_inline(skb, ETH_HLEN);
>  	eth = eth_hdr(skb);
>  
>  	if (unlikely(is_multicast_ether_addr(eth->h_dest))) {

Excellent !

Changli privately asked me why we were ignoring cases where skb->len <
ETH_HLEN.
I replied that the minimum frame size was 46+12; he then asked why we
test the length a second time:

if (skb->len >= 2 && *(unsigned short *)rawp == 0xFFFF)
	return htons(ETH_P_802_3);


Can we assume that every caller of eth_type_trans() passes an skb whose
initial skb->len is >= (46 + 12), or not?
(According to the Ethernet specs, all frames should have a minimum payload
of 46 bytes.)

If we are not sure, maybe we should issue a WARN_ON_ONCE().

If we are sure, the tests could be removed and we could gain two cycles ;)
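
The length question above can be illustrated outside the kernel. What
follows is a minimal user-space sketch, not the kernel implementation:
struct toy_skb, toy_pull() and toy_pull_unchecked() are hypothetical
stand-ins modelled on the skb_pull_inline() / __skb_pull() pair from the
quoted patch. It only shows what the length check buys: a buffer shorter
than the 14-byte header is rejected instead of having its length
underflow.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14	/* Ethernet header: 6 + 6 + 2 bytes */

/* Hypothetical stand-in for struct sk_buff: only the two fields used here. */
struct toy_skb {
	unsigned char *data;	/* start of the bytes not yet parsed */
	unsigned int len;	/* number of bytes not yet parsed */
};

/* Unchecked pull, modelled on __skb_pull(): caller guarantees len <= skb->len. */
static inline unsigned char *toy_pull_unchecked(struct toy_skb *skb,
						unsigned int len)
{
	assert(len <= skb->len);	/* precondition, checked in debug builds only */
	skb->len -= len;
	return skb->data += len;
}

/* Checked pull, modelled on skb_pull_inline(): refuse to pull past the end. */
static inline unsigned char *toy_pull(struct toy_skb *skb, unsigned int len)
{
	return len > skb->len ? NULL : toy_pull_unchecked(skb, len);
}
```

With a 60-byte buffer, toy_pull(&skb, TOY_ETH_HLEN) advances data by 14
and leaves len at 46; with a 10-byte buffer it returns NULL and leaves the
buffer untouched. eth_type_trans() in the quoted patch ignores the return
value, which is why the minimum-length assumption matters.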





* RE: [PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
From: Elina Pasheva @ 2010-05-02  5:53 UTC
  To: David Miller
  Cc: dbrownell@users.sourceforge.net, Rory Filer,
	linux-usb@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20100501.180829.139101312.davem@davemloft.net>


> On Saturday, May 01, 2010 6:08 PM David Miller wrote:

>>From: Elina Pasheva <epasheva@sierrawireless.com>
>>Date: Wed, 28 Apr 2010 16:28:24 -0700

>> Subject: [PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
>> From: Elina Pasheva <epasheva@sierrawireless.com>
>>
>> The following patch adds the initiation of the sync sequence to
>> "sierra_net_bind()". If this step is omitted, the modem will never sync up
>> with the host and it will not be possible to establish a data connection.
>> This is a high priority patch.
>>
>> This patch has been checked against net-2.6 tree.
>> Signed-off-by: Elina Pasheva <epasheva@sierrawireless.com>
>> Signed-off-by: Rory Filer <rfiler@sierrawireless.com>
>> Tested-by: Elina Pasheva <epasheva@sierrawireless.com>

>Applied.

Thank you very much, David!
Elina




* Re: [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
From: David Miller @ 2010-05-02  1:15 UTC
  To: eric.dumazet; +Cc: netdev, therbert, hadi
In-Reply-To: <1272696145.2230.101.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 01 May 2010 08:42:25 +0200

> [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
> 
> With RPS, this patch can give a 5 % boost in performance.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Awesome, but let's do this in a way that allows us to easily annotate
where inlining makes sense in other places, not just here.

Something like this, ok?

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 82f5116..746a652 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1128,6 +1128,11 @@ static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
 	return skb->data += len;
 }
 
+static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
+{
+	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
+}
+
 extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta);
 
 static inline unsigned char *__pskb_pull(struct sk_buff *skb, unsigned int len)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4218ff4..8b9c109 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1051,7 +1051,7 @@ EXPORT_SYMBOL(skb_push);
  */
 unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
 {
-	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
+	return skb_pull_inline(skb, len);
 }
 EXPORT_SYMBOL(skb_pull);
 
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0c0d272..61ec032 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -162,7 +162,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
 
 	skb->dev = dev;
 	skb_reset_mac_header(skb);
-	skb_pull(skb, ETH_HLEN);
+	skb_pull_inline(skb, ETH_HLEN);
 	eth = eth_hdr(skb);
 
 	if (unlikely(is_multicast_ether_addr(eth->h_dest))) {


* Re: [PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
From: David Miller @ 2010-05-02  1:08 UTC
  To: epasheva
  Cc: dbrownell, rfiler, linux-usb, netdev
In-Reply-To: <1272497304.8835.2.camel@Linuxdev4-laptop>

From: Elina Pasheva <epasheva@sierrawireless.com>
Date: Wed, 28 Apr 2010 16:28:24 -0700

> Subject: [PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
> From: Elina Pasheva <epasheva@sierrawireless.com>
> 
> The following patch adds the initiation of the sync sequence to
> "sierra_net_bind()". If this step is omitted, the modem will never sync up
> with the host and it will not be possible to establish a data connection.
> This is a high priority patch.
> 
> This patch has been checked against net-2.6 tree.
> Signed-off-by: Elina Pasheva <epasheva@sierrawireless.com>
> Signed-off-by: Rory Filer <rfiler@sierrawireless.com>
> Tested-by: Elina Pasheva <epasheva@sierrawireless.com>

Applied.


* Re: OFT - reserving CPU's for networking
From: Ben Hutchings @ 2010-05-01 23:44 UTC
  To: David Miller; +Cc: andi, tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100501.150338.93457735.davem@davemloft.net>

On Sat, 2010-05-01 at 15:03 -0700, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Sat, 1 May 2010 12:53:04 +0200
> 
> >> And we don't want it to, because the decision mechanisms for steering
> >> that we are using now are starting to get into the stateful territory and
> >> that's verboten for NIC offload as far as we're concerned.
> > 
> > Huh? I thought full TCP offload was forbidden?[1] Stateful as in NIC
> > (or someone else like netfilter) tracking flows is quite common and very far
> > from full offload. AFAIK it doesn't have nearly all the problems full
> > offload has.
> 
> We're tracking flow cpu location state at the socket operations, like
> recvmsg() and sendmsg(), where it belongs.
> 
> Would you like us to call into the card drivers and firmware at these
> spots instead?

I'm interested in experimenting with this at some point, since our
hardware supports a fairly large number of filters that could be used
for it.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: OFT - reserving CPU's for networking
From: David Miller @ 2010-05-01 23:29 UTC
  To: andi; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100501225815.GA8074@gargoyle.fritz.box>

From: Andi Kleen <andi@firstfloor.org>
Date: Sun, 2 May 2010 00:58:15 +0200

>> We're tracking flow cpu location state at the socket operations, like
>> recvmsg() and sendmsg(), where it belongs.
>> 
>> Would you like us to call into the card drivers and firmware at these
>> spots instead?
> 
> No, that's not needed for lazy flow tracking like in netfilter or 
> some NICs, it doesn't need exact updates. It just works with seen network 
> packets. 

Well what we need is exact flow updates so that we steer packets
to where the applications actually are.

Andi, this discussion is going in circles, can I just say "yeah you're
right Andi" and this will satisfy your desire to be correct and we can
be done with this?

Thanks.


* Re: OFT - reserving CPU's for networking
From: Andi Kleen @ 2010-05-01 22:58 UTC
  To: David Miller; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100501.150338.93457735.davem@davemloft.net>

> We're tracking flow cpu location state at the socket operations, like
> recvmsg() and sendmsg(), where it belongs.
> 
> Would you like us to call into the card drivers and firmware at these
> spots instead?

No, that's not needed for lazy flow tracking like in netfilter or 
some NICs, it doesn't need exact updates. It just works with seen network 
packets. 

-Andi


* Re: OFT - reserving CPU's for networking
From: David Miller @ 2010-05-01 22:13 UTC
  To: gandalf; +Cc: tglx, shemminger, eric.dumazet, ak, netdev, andi, peterz
In-Reply-To: <Pine.LNX.4.62.1005012222320.24624@wlug.westbo.se>

From: Martin Josefsson <gandalf@mjufs.se>
Date: Sat, 1 May 2010 22:31:05 +0200 (CEST)

> On Fri, 30 Apr 2010, David Miller wrote:
> 
>> Then we can do cool tricks like having the cpu spin on a mwait() on the
>> network device's status descriptor in memory.
> 
> Can you have mwait monitor multiple cachelines for stores?

The idea is that if you have hundreds of cpu threads in your machine
(several of my machines do, and it's not too long before these kinds of
boxes will be common) you can spare one for each NIC.


* Re: OFT - reserving CPU's for networking
From: David Miller @ 2010-05-01 22:03 UTC
  To: andi; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100501105304.GA9434@gargoyle.fritz.box>

From: Andi Kleen <andi@firstfloor.org>
Date: Sat, 1 May 2010 12:53:04 +0200

>> And we don't want it to, because the decision mechanisms for steering
>> that we are using now are starting to get into the stateful territory and
>> that's verboten for NIC offload as far as we're concerned.
> 
> Huh? I thought full TCP offload was forbidden?[1] Stateful as in NIC
> (or someone else like netfilter) tracking flows is quite common and very far
> from full offload. AFAIK it doesn't have nearly all the problems full
> offload has.

We're tracking flow cpu location state at the socket operations, like
recvmsg() and sendmsg(), where it belongs.

Would you like us to call into the card drivers and firmware at these
spots instead?


* Re: [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: David Miller @ 2010-05-01 22:00 UTC
  To: eric.dumazet; +Cc: hadi, xiaosuo, therbert, shemminger, netdev, eilong, bmb
In-Reply-To: <1272701011.2230.134.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 01 May 2010 10:03:31 +0200

> David, I also need this RCU thing in order to be able to group all
> wakeups at the end of net_rx_action().
> 
> Plan was to use RCU, so that I don't need to increase sk_refcnt when
> queueing a "wakeup" (and decrease sk_refcnt a long time after)
> 
> Previous attempt was a bit hacky,
> http://patchwork.ozlabs.org/patch/24179/
> 
> I expect 2010 one will be cleaner :)

Fair enough, I'm convinced now, applied thanks!


* Re: OFT - reserving CPU's for networking
From: Martin Josefsson @ 2010-05-01 20:31 UTC
  To: David Miller; +Cc: tglx, shemminger, eric.dumazet, ak, netdev, andi, peterz
In-Reply-To: <20100430.115715.216750975.davem@davemloft.net>

On Fri, 30 Apr 2010, David Miller wrote:

> Then we can do cool tricks like having the cpu spin on a mwait() on the
> network device's status descriptor in memory.

Can you have mwait monitor multiple cachelines for stores? If not, it
might be hard to do that when you have multiple NICs and you actually
need to use the status descriptors; if you don't need them, you could
possibly have them all written to the same cacheline.
It would also be hard if the NIC doesn't support updating a status
descriptor in memory.

If you just want to wake up quickly without using interrupts, it might be
possible to abuse MSI for that: set the MSI address to the cacheline
that is being monitored.

/Martin
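
The "all written to the same cacheline" idea can be sketched in user
space. This is an illustration under stated assumptions, not a driver: it
does not execute MONITOR/MWAIT at all, it assumes a 64-byte cache line,
and struct toy_status_line and toy_changed_nics() are hypothetical names.
It only shows the layout (one status word per NIC packed into a single
aligned line) and how a waiter that has been woken by a store to that line
could work out which NICs changed.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CACHELINE	64	/* assumed cache-line size in bytes */
#define TOY_MAX_NICS	(TOY_CACHELINE / sizeof(uint32_t))

/* One status word per NIC, all packed into a single aligned cache line. */
struct toy_status_line {
	_Alignas(TOY_CACHELINE) volatile uint32_t seq[TOY_MAX_NICS];
};

static_assert(sizeof(struct toy_status_line) == TOY_CACHELINE,
	      "all status words must share exactly one cache line");

/*
 * Compare the shared line against the values last seen and return a
 * bitmask of the NICs whose word changed; `seen` is updated in place.
 */
static unsigned int toy_changed_nics(const struct toy_status_line *line,
				     uint32_t seen[TOY_MAX_NICS])
{
	unsigned int mask = 0;

	for (unsigned int i = 0; i < TOY_MAX_NICS; i++) {
		uint32_t now = line->seq[i];

		if (now != seen[i]) {
			seen[i] = now;
			mask |= 1u << i;
		}
	}
	return mask;
}
```

The trade-off is the one raised in the message: a single line can be
watched with one monitor, but the words can then only act as doorbells,
not as full per-NIC status descriptors.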


* Re: [PATCH 3/3] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Kumar Gala @ 2010-05-01 16:36 UTC
  To: Richard Cochran; +Cc: Netdev, linuxppc-dev, devicetree-discuss
In-Reply-To: <20100429092005.GA6727@riccoc20.at.omicron.at>


On Apr 29, 2010, at 4:20 AM, Richard Cochran wrote:

> The eTSEC includes a PTP clock with quite a few features. This patch adds
> support for the basic clock adjustment functions.
> 
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> ---
> arch/powerpc/boot/dts/mpc8313erdb.dts |   14 ++
> arch/powerpc/boot/dts/p2020ds.dts     |   13 ++
> arch/powerpc/boot/dts/p2020rdb.dts    |   14 ++
> drivers/net/Makefile                  |    1 +
> drivers/net/gianfar_ptp.c             |  308 +++++++++++++++++++++++++++++++++
> drivers/net/gianfar_ptp_reg.h         |  107 ++++++++++++
> drivers/ptp/Kconfig                   |   13 ++
> 7 files changed, 470 insertions(+), 0 deletions(-)
> create mode 100644 drivers/net/gianfar_ptp.c
> create mode 100644 drivers/net/gianfar_ptp_reg.h
> 
> diff --git a/arch/powerpc/boot/dts/mpc8313erdb.dts b/arch/powerpc/boot/dts/mpc8313erdb.dts
> index 183f2aa..b760aee 100644
> --- a/arch/powerpc/boot/dts/mpc8313erdb.dts
> +++ b/arch/powerpc/boot/dts/mpc8313erdb.dts
> @@ -208,6 +208,20 @@
> 			sleep = <&pmc 0x00300000>;
> 		};
> 
> +		ptp_clock@24E00 {
> +			device_type = "ptp_clock";
> +			model = "eTSEC";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <0x0C 2 0x0D 2>;
> +			interrupt-parent = < &ipic >;
> +			tclk_period = <10>;
> +			tmr_prsc    = <100>;
> +			tmr_add     = <0x999999A4>;
> +			cksel       = <0x1>;
> +			tmr_fiper1  = <0x3B9AC9F6>;
> +			tmr_fiper2  = <0x00018696>;
> +		};
> +
> 		enet0: ethernet@24000 {
> 			#address-cells = <1>;
> 			#size-cells = <1>;

Is there a binding document that describes this node you are adding?

- k


* Re: [PATCH 0/3] [RFC] [v2] ptp: IEEE 1588 clock support
From: Kumar Gala @ 2010-05-01 16:32 UTC
  To: Richard Cochran; +Cc: Netdev
In-Reply-To: <20100429091903.GA6691@riccoc20.at.omicron.at>


On Apr 29, 2010, at 4:19 AM, Richard Cochran wrote:

> * Patch ChangeLog
> ** v2
>   - Changed clock list from a static array into a dynamic list. Also,
>     use a bitmap to manage the clock's minor numbers.
>   - Replaced character device semaphore with a mutex.
>   - Drop .ko from module names in Kbuild help.
>   - Replace deprecated unifdef-y with header-y for user space header file.
>   - Gianfar driver now gets parameters from device tree.
>   - Added API documentation to Documentation/ptp/ptp.txt, with links
>     to both of the ptpd patches on sourceforge.
> 
> * Preface
> 
> Now and again there has been some talk on this list of adding PTP
> support into Linux. One part of the picture is already in place, the
> SO_TIMESTAMPING API for hardware time stamping. It has been pointed
> out that this API is not perfect, however, it is good enough for many
> real world uses of IEEE 1588. The second needed part has not, AFAICT,
> ever been addressed.
> 
> Here I offer an early draft of an idea how to bring the missing
> functionality into Linux. I don't yet have all of the features
> implemented, as described below. Still I would like to get your
> feedback concerning this idea before getting too far into it. I do
> have all of the hardware mentioned at hand, so I have a good idea that
> the proposed API covers the features of those clocks.
> 
> Thanks in advance for your comments,
> 
> Richard
> 
> 
> Richard Cochran (3):
>  ptp: Added a brand new class driver for ptp clocks.
>  ptp: Added a clock that uses the Linux system time.
>  ptp: Added a clock that uses the eTSEC found on the MPC85xx.
> 
> Documentation/ptp/ptp.txt             |   78 +++++++++
> Documentation/ptp/testptp.c           |  130 ++++++++++++++
> Documentation/ptp/testptp.mk          |   33 ++++
> arch/powerpc/boot/dts/mpc8313erdb.dts |   14 ++
> arch/powerpc/boot/dts/p2020ds.dts     |   13 ++
> arch/powerpc/boot/dts/p2020rdb.dts    |   14 ++
> drivers/Kconfig                       |    2 +
> drivers/Makefile                      |    1 +
> drivers/net/Makefile                  |    1 +
> drivers/net/gianfar_ptp.c             |  308 +++++++++++++++++++++++++++++++++
> drivers/net/gianfar_ptp_reg.h         |  107 ++++++++++++
> drivers/ptp/Kconfig                   |   51 ++++++
> drivers/ptp/Makefile                  |    6 +
> drivers/ptp/ptp_clock.c               |  302 ++++++++++++++++++++++++++++++++
> drivers/ptp/ptp_linux.c               |  122 +++++++++++++
> include/linux/Kbuild                  |    1 +
> include/linux/ptp_clock.h             |   37 ++++
> include/linux/ptp_clock_kernel.h      |  134 ++++++++++++++
> kernel/time/ntp.c                     |    2 +
> 19 files changed, 1356 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/ptp/ptp.txt
> create mode 100644 Documentation/ptp/testptp.c
> create mode 100644 Documentation/ptp/testptp.mk
> create mode 100644 drivers/net/gianfar_ptp.c
> create mode 100644 drivers/net/gianfar_ptp_reg.h
> create mode 100644 drivers/ptp/Kconfig
> create mode 100644 drivers/ptp/Makefile
> create mode 100644 drivers/ptp/ptp_clock.c
> create mode 100644 drivers/ptp/ptp_linux.c
> create mode 100644 include/linux/ptp_clock.h
> create mode 100644 include/linux/ptp_clock_kernel.h

In the future please CC linuxppc-dev@lists.ozlabs.org and devicetree-discuss@lists.ozlabs.org since you are adding device tree bindings.

- k


* Re: [patch v2.2 3/4] [PATCH v2.1 3/4] IPVS: make FTP work with full NAT support
From: Patrick McHardy @ 2010-05-01 16:26 UTC
  To: Simon Horman
  Cc: lvs-devel, netdev, linux-kernel, netfilter, Wensong Zhang,
	Julius Volz, David S. Miller, Hannes Eder,
	Netfilter Development Mailinglist
In-Reply-To: <20100501032120.998807955@vergenet.net>

Simon Horman wrote:

> +#define FMT_TUPLE	"%u.%u.%u.%u:%u->%u.%u.%u.%u:%u/%u"
> +#define ARG_TUPLE(T)	NIPQUAD((T)->src.u3.ip), ntohs((T)->src.u.all), \
> +			NIPQUAD((T)->dst.u3.ip), ntohs((T)->dst.u.all), \
> +			(T)->dst.protonum
> +
> +#define FMT_CONN	"%u.%u.%u.%u:%u->%u.%u.%u.%u:%u->%u.%u.%u.%u:%u/%u:%u"
> +#define ARG_CONN(C)	NIPQUAD((C)->caddr), ntohs((C)->cport), \
> +			NIPQUAD((C)->vaddr), ntohs((C)->vport), \
> +			NIPQUAD((C)->daddr), ntohs((C)->dport), \
> +			(C)->protocol, (C)->state
>  

Please use the appropriate format string (%pI4) instead of NIPQUAD.

> +		buf_len = sprintf(buf, "%u,%u,%u,%u,%u,%u", NIPQUAD(from.ip),
> +				  (ntohs(port)>>8)&255, ntohs(port)&255);
> +
> +		ct = nf_ct_get(skb, &ctinfo);
> +		ret = nf_nat_mangle_tcp_packet(skb,
> +					       ct,
> +					       ctinfo,
> +					       start-data,
> +					       end-start,
> +					       buf,
> +					       buf_len);
> +
> +		if (ct && ct != &nf_conntrack_untracked)

ct must be non-NULL here, otherwise we would already have crashed in
nf_nat_mangle_tcp_packet().
Are you sure you want to mangle untracked packets above? That doesn't
work when there are size changes.
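
For reference, the quoted sprintf() builds the host-port argument used by
the FTP PORT command: four address bytes and two port bytes, separated by
commas. The kernel's %pI4 specifier is a printk extension and is not
available in user space, so the sketch below passes the address bytes
explicitly. It is a stand-alone illustration, not the IPVS code;
toy_ftp_hostport() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format an IPv4 address and TCP port as "h1,h2,h3,h4,p1,p2".
 * `addr` holds the four address bytes in network order; `port` is in
 * host byte order. Returns the string length, or -1 if `buf` is too small.
 */
static int toy_ftp_hostport(char *buf, size_t size,
			    const uint8_t addr[4], uint16_t port)
{
	int n;

	assert(buf != NULL && addr != NULL);
	n = snprintf(buf, size, "%u,%u,%u,%u,%u,%u",
		     (unsigned int)addr[0], (unsigned int)addr[1],
		     (unsigned int)addr[2], (unsigned int)addr[3],
		     (unsigned int)(port >> 8) & 255u,
		     (unsigned int)port & 255u);
	return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

For 192.168.1.2 and port 5001 this yields "192,168,1,2,19,137", since
5001 = 19 * 256 + 137. A rewritten string can differ in length from the
one it replaces, which is the size-change case the review comment is
concerned about.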


* Re: [patch v2.2 2/4] [PATCH v2.1 2/4] IPVS: make friends with nf_conntrack
From: Patrick McHardy @ 2010-05-01 16:19 UTC
  To: Simon Horman
  Cc: lvs-devel, netdev, linux-kernel, netfilter, Wensong Zhang,
	Julius Volz, David S. Miller, Hannes Eder,
	Netfilter Development Mailinglist
In-Reply-To: <20100501032120.644762316@vergenet.net>

Looks good to me.


* Re: [patch v2.2 1/4] [PATCH v2.1 1/4] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Patrick McHardy @ 2010-05-01 16:18 UTC
  To: Simon Horman
  Cc: lvs-devel, netdev, linux-kernel, netfilter, Wensong Zhang,
	Julius Volz, David S. Miller, Hannes Eder,
	Netfilter Development Mailinglist
In-Reply-To: <20100501032120.298829234@vergenet.net>

Simon Horman wrote:

> @@ -0,0 +1,25 @@
> +#ifndef _XT_IPVS_H
> +#define _XT_IPVS_H 1

You don't need to define a value.

> +config NETFILTER_XT_MATCH_IPVS
> +	tristate '"ipvs" match support'
> +	depends on IP_VS
> +	depends on NETFILTER_ADVANCED
> +	help
> +	  This option allows you to match against IPVS properties of a packet.
> +
> +	  If unsure, say N.

You're using conntrack symbols, so this seems to need a dependency
on NF_CONNTRACK.

> +static bool ipvs_mt_check(const struct xt_mtchk_param *par)

We've changed the signature to "int" in nf-next to be able to
return errno codes. Please rebase your patches onto nf-next-2.6.git.

Please also CC netfilter-devel at least for those parts that affect
non-IPVS netfilter.

> +{
> +	if (par->family != NFPROTO_IPV4
> +#ifdef CONFIG_IP_VS_IPV6
> +	    && par->family != NFPROTO_IPV6
> +#endif
> +		) {
> +		pr_info("protocol family %u not supported\n", par->family);
> +		return false;
> +	}
> +
> +	return true;
> +}
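
The signature change mentioned above (bool to int, so that errno codes can
be returned) might look roughly like the following. This is a user-space
model under stated assumptions, not the nf-next code: struct
toy_mtchk_param, the TOY_NFPROTO_* values and the choice of -EINVAL are
all hypothetical, and the IPv6 case is accepted unconditionally rather
than depending on CONFIG_IP_VS_IPV6.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's NFPROTO_* constants. */
enum toy_family { TOY_NFPROTO_IPV4 = 2, TOY_NFPROTO_IPV6 = 10 };

/* Hypothetical stand-in for struct xt_mtchk_param: only the family field. */
struct toy_mtchk_param {
	unsigned int family;
};

/*
 * int-returning variant of the quoted check: 0 on success, a negative
 * errno value on failure. Using -EINVAL here is an assumption.
 */
static int toy_ipvs_mt_check(const struct toy_mtchk_param *par)
{
	assert(par != NULL);
	if (par->family != TOY_NFPROTO_IPV4 &&
	    par->family != TOY_NFPROTO_IPV6) {
		fprintf(stderr, "protocol family %u not supported\n",
			par->family);
		return -EINVAL;
	}
	return 0;
}
```

Compared with the bool version, the caller can now tell why a rule was
rejected instead of only learning that it was.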



* [PATCH v21 094/100] c/r: Basic support for network namespaces and devices (v6)
From: Oren Laadan @ 2010-05-01 14:16 UTC
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

When checkpointing a task tree with network namespaces, we hook into
do_checkpoint_ns() along with the others.  Any devices in a given namespace
are checkpointed (including their peer, in the case of veth) sequentially.
Each network device stores a list of protocol addresses, as well as other
information, such as hardware address.

This patch supports veth pairs, as well as the loopback adapter.  The
loopback support is there to make sure that any additional addresses and
state (such as up/down) are copied to the loopback adapter that we are
given in the new network namespace.

On restart, we instantiate new network namespaces and veth pairs as
necessary.  Any device we encounter that isn't in a network namespace
that was checkpointed as part of a task is left in the namespace of the
restarting process.  This will be the case for a veth half that exists
in the init netns to provide network access to a container.

Still to do are:

  1. Routes
  2. Netfilter rules
  3. IPv6 addresses
  4. Other virtual device types (e.g. bridges)
  5. Multicast
  6. Device config info (ipv4_devconf)
  7. Additional ipv4 address attributes

Changelog[v21]:
 - Do not include checkpoint_hdr.h explicitly
 - Fix acquiring socket lock before reading RTNETLINK response
 - Skip down interfaces (v2)
 - Export net checkpoint fns
 - Add CHECKPOINT_NETNS flag
 - Rename CONFIG_CHECKPOINT_NETNS -> CONFIG_NETNS_CHECKPOINT
 - Netdev restore function dispatching from a table
 - Added a comment about the controversial determination of "initial netns"
 - Simplify the E2BIG error handling
 - Remove a redundant check for checkpoint support per-device

Changes in v6:
 - Store addresses in network byte order, per Dave's recommendation

Changes in v5:
 - Rebase
 - Remove checkpoint_container() noise
 - Factor out some common bits of the RTNL newlink operations
 - Add macvlan support

Changes in v4:
 - Fix allocation under lock in ckpt_netdev_inet_addrs()
 - Add comment for case where there is no netns info in checkpoint image
 - Fix inner structure alignment in netdev_addr header
 - Fix instances of kfree(skb)
 - Remove init_netns_ref from container header and checkpoint context
 - Add 'extern' to checkpoint.h prototypes
 - Swizzle do_restore_netns() to handle netns more like the others
 - Return E2BIG for failure case when collecting inet addrs
 - Report case where device doesn't support checkpoint
 - Remove nested netns check from may_checkpoint_task()
 - Move veth-specific netdev attributes into unioned struct to set an
   example for specific attributes of additional device types
 - Add 'sit' device restore path that doesn't really do anything
 - Fail instead of skip when encountering a device with no checkpoint
   support

Changes in v3:
 - Use dev->checkpoint() for per-device checkpoint operation
 - Use RTNL for veth pair creation on restart
 - Export some of the functions that will be needed by dev->ndo_checkpoint()

Changes in v2:
 - Add CONFIG_CHECKPOINT_NETNS that is dependent on NET, NET_NS, and
   CHECKPOINT.  Conditionally compile the checkpoint_dev code based on it.
 - Updated comment on should_checkpoint_netdev()
 - Updated checkpoint_netdev() to explicitly check for "veth" in name
 - Changed checkpoint_netns() to use BUG() for impossible condition
 - Fixed a bug on restart with all devices in the init netns
 - Lock the dev_base_lock while traversing interface addresses
 - Collect all addresses for an interface before writing out in one
   single pass

Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Documentation/checkpoint/usage.txt |    1 +
 include/linux/checkpoint.h         |   29 ++-
 include/linux/checkpoint_hdr.h     |   58 +++
 kernel/checkpoint/checkpoint.c     |    5 -
 kernel/nsproxy.c                   |   24 +-
 net/Kconfig                        |    4 +
 net/Makefile                       |    1 +
 net/checkpoint.c                   |   63 +++-
 net/checkpoint_dev.c               |  818 ++++++++++++++++++++++++++++++++++++
 9 files changed, 995 insertions(+), 8 deletions(-)
 create mode 100644 net/checkpoint_dev.c

diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
index d697ed1..5700448 100644
--- a/Documentation/checkpoint/usage.txt
+++ b/Documentation/checkpoint/usage.txt
@@ -15,6 +15,7 @@ The API consists of three new system calls:
  an open file to which error and debug messages are written. @flags
  may be one or more of:
    - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
+   - CHECKPOINT_NETNS : include network namespaces and devices
  (other value are not allowed).
 
  Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 43d67ce..84bb7a9 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -14,6 +14,7 @@
 
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE	0x1
+#define CHECKPOINT_NETNS	0x2
 
 /* restart user flags */
 #define RESTART_TASKSELF	0x1
@@ -35,6 +36,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <linux/inetdevice.h>
 #include <net/sock.h>
 
 /* sycall helpers */
@@ -55,7 +57,10 @@ extern long do_sys_restart(pid_t pid, int fd,
 #define CKPT_CTX_ERROR		(1 << CKPT_CTX_ERROR_BIT)
 
 /* ckpt_ctx: uflags */
-#define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
+#define CHECKPOINT_USER_FLAGS \
+	(CHECKPOINT_SUBTREE | \
+	 CHECKPOINT_NETNS)
+
 #define RESTART_USER_FLAGS  \
 	(RESTART_TASKSELF | \
 	 RESTART_FROZEN | \
@@ -119,6 +124,28 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 extern struct sk_buff *sock_restore_skb(struct ckpt_ctx *ctx, struct sock *sk);
 extern void sock_listening_list_free(struct list_head *head);
 
+#ifdef CONFIG_NETNS_CHECKPOINT
+extern int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_netns(struct ckpt_ctx *ctx);
+extern int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_netdev(struct ckpt_ctx *ctx);
+
+extern int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx,
+				     struct net_device *dev);
+extern int ckpt_netdev_inet_addrs(struct in_device *indev,
+				  struct ckpt_netdev_addr *list[]);
+extern int ckpt_netdev_hwaddr(struct net_device *dev,
+			      struct ckpt_hdr_netdev *h);
+extern struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
+					struct net_device *dev,
+					struct ckpt_netdev_addr *addrs[]);
+#else
+# define checkpoint_netns NULL
+# define restore_netns NULL
+# define checkpoint_netdev NULL
+# define restore_netdev NULL
+#endif
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1564726..eb5e1b4 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -189,6 +189,12 @@ enum {
 #define CKPT_HDR_SOCKET_UNIX CKPT_HDR_SOCKET_UNIX
 	CKPT_HDR_SOCKET_INET,
 #define CKPT_HDR_SOCKET_INET CKPT_HDR_SOCKET_INET
+	CKPT_HDR_NET_NS,
+#define CKPT_HDR_NET_NS CKPT_HDR_NET_NS
+	CKPT_HDR_NETDEV,
+#define CKPT_HDR_NETDEV CKPT_HDR_NETDEV
+	CKPT_HDR_NETDEV_ADDR,
+#define CKPT_HDR_NETDEV_ADDR CKPT_HDR_NETDEV_ADDR
 
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
@@ -261,6 +267,10 @@ enum obj_type {
 #define CKPT_OBJ_SECURITY_PTR CKPT_OBJ_SECURITY_PTR
 	CKPT_OBJ_SECURITY,
 #define CKPT_OBJ_SECURITY CKPT_OBJ_SECURITY
+	CKPT_OBJ_NET_NS,
+#define CKPT_OBJ_NET_NS CKPT_OBJ_NET_NS
+	CKPT_OBJ_NETDEV,
+#define CKPT_OBJ_NETDEV CKPT_OBJ_NETDEV
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -444,6 +454,7 @@ struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__s32 uts_objref;
 	__s32 ipc_objref;
+	__s32 net_objref;
 } __attribute__((aligned(8)));
 
 /* cannot include <linux/tty.h> from userspace, so define: */
@@ -768,6 +779,53 @@ struct ckpt_hdr_file_socket {
 	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_netns {
+	struct ckpt_hdr h;
+	__s32 this_ref;
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_types {
+	CKPT_NETDEV_LO,
+	CKPT_NETDEV_VETH,
+	CKPT_NETDEV_SIT,
+	CKPT_NETDEV_MACVLAN,
+	CKPT_NETDEV_MAX,
+};
+
+struct ckpt_hdr_netdev {
+	struct ckpt_hdr h;
+	__s32 netns_ref;
+	union {
+		struct {
+			__s32 this_ref;
+			__s32 peer_ref;
+		} veth;
+		struct {
+			__u32 mode;
+		} macvlan;
+	};
+	__u32 inet_addrs;
+	__u16 type;
+	__u16 flags;
+	__u8 hwaddr[6];
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_addr_types {
+	CKPT_NETDEV_ADDR_IPV4,
+};
+
+struct ckpt_netdev_addr {
+	__u16 type;
+	union {
+		struct {
+			__be32 inet4_local;
+			__be32 inet4_address;
+			__be32 inet4_mask;
+			__be32 inet4_broadcast;
+		};
+	} __attribute__((aligned(8)));
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_eventpoll_items {
 	struct ckpt_hdr h;
 	__s32  epfile_objref;
diff --git a/kernel/checkpoint/checkpoint.c b/kernel/checkpoint/checkpoint.c
index 7a4f1ce..4059c28 100644
--- a/kernel/checkpoint/checkpoint.c
+++ b/kernel/checkpoint/checkpoint.c
@@ -291,11 +291,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		_ckpt_err(ctx, -EPERM, "%(T)Nested mnt_ns unsupported\n");
 		ret = -EPERM;
 	}
-	/* no support for >1 private netns */
-	if (nsproxy->net_ns != ctx->root_nsproxy->net_ns) {
-		_ckpt_err(ctx, -EPERM, "%(T)Nested net_ns unsupported\n");
-		ret = -EPERM;
-	}
 	/* pidns must be descendent of root_nsproxy */
 	pidns = nsproxy->pid_ns;
 	while (pidns != ctx->root_nsproxy->pid_ns) {
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 7fb3cea..d4af91d 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -260,6 +260,12 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
 	if (ret < 0)
 		goto out;
+#ifdef CONFIG_NETNS_CHECKPOINT
+	if (ctx->uflags & CHECKPOINT_NETNS)
+		ret = ckpt_obj_collect(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		goto out;
+#endif
 #ifdef CONFIG_IPC_NS
 	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
 	if (ret < 0)
@@ -308,6 +314,15 @@ static int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr)
 #endif	/* CONFIG_IPC_NS */
 	h->ipc_objref = ret;
 
+#ifdef CONFIG_NETNS_CHECKPOINT
+	if (ctx->uflags & CHECKPOINT_NETNS)
+		ret = checkpoint_obj(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	else
+		ret = 0;
+	if (ret < 0)
+		goto out;
+	h->net_objref = ret;
+#endif
 	/* FIXME: for now, only marked visited to pacify leaks */
 	ret = ckpt_obj_visit(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
 	if (ret < 0)
@@ -341,6 +356,14 @@ static void *restore_ns(struct ckpt_ctx *ctx)
 		ret = PTR_ERR(uts_ns);
 		goto out;
 	}
+	if (h->net_objref == 0)
+		net_ns = current->nsproxy->net_ns;
+	else
+		net_ns = ckpt_obj_fetch(ctx, h->net_objref, CKPT_OBJ_NET_NS);
+	if (IS_ERR(net_ns)) {
+		ret = PTR_ERR(net_ns);
+		goto out;
+	}
 
 	if (h->ipc_objref == 0)
 		ipc_ns = ctx->root_nsproxy->ipc_ns;
@@ -356,7 +379,6 @@ static void *restore_ns(struct ckpt_ctx *ctx)
 	}
 
 	mnt_ns = ctx->root_nsproxy->mnt_ns;
-	net_ns = ctx->root_nsproxy->net_ns;
 
 	if (uts_ns == current->nsproxy->uts_ns &&
 	    ipc_ns == current->nsproxy->ipc_ns &&
diff --git a/net/Kconfig b/net/Kconfig
index 041c35e..c1cb774 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -276,4 +276,8 @@ source "net/wimax/Kconfig"
 source "net/rfkill/Kconfig"
 source "net/9p/Kconfig"
 
+config NETNS_CHECKPOINT
+       bool
+       default y if NET && NET_NS && CHECKPOINT
+
 endif   # if NET
diff --git a/net/Makefile b/net/Makefile
index 74b038f..b7d78f4 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -67,3 +67,4 @@ endif
 obj-$(CONFIG_WIMAX)		+= wimax/
 
 obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
+obj-$(CONFIG_NETNS_CHECKPOINT)	+= checkpoint_dev.o
diff --git a/net/checkpoint.c b/net/checkpoint.c
index 03c1224..b1f56bf 100644
--- a/net/checkpoint.c
+++ b/net/checkpoint.c
@@ -986,6 +986,56 @@ struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
  * sock-related checkpoint objects
  */
 
+static int netns_grab(void *ptr)
+{
+	struct net *net = ptr;
+
+	get_net(net);
+	return 0;
+}
+
+static void netns_drop(void *ptr, int lastref)
+{
+	struct net *net = ptr;
+
+	put_net(net);
+}
+
+/* netns object */
+static const struct ckpt_obj_ops ckpt_obj_netns_ops = {
+	.obj_name = "NET_NS",
+	.obj_type = CKPT_OBJ_NET_NS,
+	.ref_grab = netns_grab,
+	.ref_drop = netns_drop,
+	.checkpoint = checkpoint_netns,
+	.restore = restore_netns,
+};
+
+static int netdev_grab(void *ptr)
+{
+	struct net_device *dev = ptr;
+
+	dev_hold(dev);
+	return 0;
+}
+
+static void netdev_drop(void *ptr, int lastref)
+{
+	struct net_device *dev = ptr;
+
+	dev_put(dev);
+}
+
+/* netdev object */
+static const struct ckpt_obj_ops ckpt_obj_netdev_ops = {
+	.obj_name = "NET_DEV",
+	.obj_type = CKPT_OBJ_NETDEV,
+	.ref_grab = netdev_grab,
+	.ref_drop = netdev_drop,
+	.checkpoint = checkpoint_netdev,
+	.restore = restore_netdev,
+};
+
 static int obj_sock_grab(void *ptr)
 {
 	sock_hold((struct sock *) ptr);
@@ -1033,6 +1083,17 @@ static const struct ckpt_obj_ops ckpt_obj_sock_ops = {
 
 static int __init checkpoint_register_sock(void)
 {
-	return register_checkpoint_obj(&ckpt_obj_sock_ops);
+	int ret;
+
+	ret = register_checkpoint_obj(&ckpt_obj_sock_ops);
+	if (ret < 0)
+		return ret;
+	ret = register_checkpoint_obj(&ckpt_obj_netns_ops);
+	if (ret < 0)
+		return ret;
+	ret = register_checkpoint_obj(&ckpt_obj_netdev_ops);
+	if (ret < 0)
+		return ret;
+	return 0;
 }
 module_init(checkpoint_register_sock);
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
new file mode 100644
index 0000000..34a6bdb
--- /dev/null
+++ b/net/checkpoint_dev.c
@@ -0,0 +1,818 @@
+/*
+ *  Copyright 2010 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms@us.ibm.com>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/sched.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/inetdevice.h>
+#include <linux/veth.h>
+#include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
+
+#include <net/net_namespace.h>
+#include <net/sch_generic.h>
+
+struct dq_netdev {
+	struct net_device *dev;
+	struct ckpt_ctx *ctx;
+};
+
+struct veth_newlink {
+	char *peer;
+};
+
+struct mvl_newlink {
+	char this[IFNAMSIZ+1];
+	char base[IFNAMSIZ+1];
+	int mode;
+	__u8 *hwaddr;
+};
+
+typedef int (*new_link_fn)(struct sk_buff *, void *);
+
+static int __kern_devinet_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = devinet_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = dev_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static struct socket *rtnl_open(void)
+{
+	struct socket *sock;
+	int ret;
+
+	ret = sock_create(AF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE, &sock);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	return sock;
+}
+
+static int rtnl_close(struct socket *rtnl)
+{
+	if (rtnl)
+		return kernel_sock_shutdown(rtnl, SHUT_RDWR);
+	else
+		return 0;
+}
+
+static struct nlmsghdr *rtnl_get_response(struct socket *rtnl,
+					  struct sk_buff **skb)
+{
+	int ret;
+	long timeo = MAX_SCHEDULE_TIMEOUT;
+	struct nlmsghdr *nlh;
+
+	*skb = NULL;
+
+	lock_sock(rtnl->sk);
+	ret = sk_wait_data(rtnl->sk, &timeo);
+	if (ret)
+		*skb = skb_dequeue(&rtnl->sk->sk_receive_queue);
+	release_sock(rtnl->sk);
+
+	if (!*skb)
+		return ERR_PTR(-EPIPE);
+
+	ret = -EINVAL;
+	nlh = nlmsg_hdr(*skb);
+	if (!nlh)
+		goto err;
+
+	if (nlh->nlmsg_type == NLMSG_ERROR) {
+		struct nlmsgerr *errmsg = nlmsg_data(nlh);
+		ret = errmsg->error;
+		goto err;
+	}
+
+	return nlh;
+ err:
+	kfree_skb(*skb);
+	*skb = NULL;
+
+	return ERR_PTR(ret);
+}
+
+int ckpt_netdev_in_init_netns(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	/*
+	 * Currently, we treat the "initial network namespace" as that
+	 * of the process doing the checkpoint.  This gives us a
+	 * consistent view of the container and its layout from the
+	 * perspective of the "agent" doing the checkpoint and
+	 * restore.
+	 */
+	return dev->nd_net == current->nsproxy->net_ns;
+}
+EXPORT_SYMBOL_GPL(ckpt_netdev_in_init_netns);
+
+int ckpt_netdev_hwaddr(struct net_device *dev, struct ckpt_hdr_netdev *h)
+{
+	struct net *net = dev->nd_net;
+	struct ifreq req;
+	int ret;
+
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	ret = __kern_dev_ioctl(net, SIOCGIFFLAGS, &req);
+	if (ret < 0)
+		return ret;
+	h->flags = req.ifr_flags;
+
+	ret = __kern_dev_ioctl(net, SIOCGIFHWADDR, &req);
+	if (ret < 0)
+		return ret;
+
+	memcpy(h->hwaddr, req.ifr_hwaddr.sa_data, sizeof(h->hwaddr));
+
+	return 0;
+}
+
+int ckpt_netdev_inet_addrs(struct in_device *indev,
+			   struct ckpt_netdev_addr *_abuf[])
+{
+	struct ckpt_netdev_addr *abuf = NULL;
+	struct in_ifaddr *addr = indev->ifa_list;
+	int addrs = 0;
+	int max = 32;
+
+ retry:
+	*_abuf = krealloc(abuf, max * sizeof(*abuf), GFP_KERNEL);
+	if (*_abuf == NULL) {
+		addrs = -ENOMEM;
+		goto out;
+	}
+	abuf = *_abuf;
+
+	read_lock(&dev_base_lock);
+
+	while (addr) {
+		abuf[addrs].type = CKPT_NETDEV_ADDR_IPV4; /* Only IPv4 now */
+		abuf[addrs].inet4_local = htonl(addr->ifa_local);
+		abuf[addrs].inet4_address = htonl(addr->ifa_address);
+		abuf[addrs].inet4_mask = htonl(addr->ifa_mask);
+		abuf[addrs].inet4_broadcast = htonl(addr->ifa_broadcast);
+
+		addr = addr->ifa_next;
+		if (++addrs >= max) {
+			read_unlock(&dev_base_lock);
+			max *= 2;
+			goto retry;
+		}
+	}
+
+	read_unlock(&dev_base_lock);
+ out:
+	if (addrs < 0) {
+		kfree(abuf);
+		*_abuf = NULL;
+	}
+
+	return addrs;
+}
+
+struct ckpt_hdr_netdev *ckpt_netdev_base(struct ckpt_ctx *ctx,
+					 struct net_device *dev,
+					 struct ckpt_netdev_addr *addrs[])
+{
+	struct ckpt_hdr_netdev *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	ret = ckpt_netdev_hwaddr(dev, h);
+	if (ret < 0)
+		goto out;
+
+	*addrs = NULL;
+	ret = h->inet_addrs = ckpt_netdev_inet_addrs(dev->ip_ptr, addrs);
+	if (ret < 0)
+		goto out;
+
+	if (ckpt_netdev_in_init_netns(ctx, dev))
+		ret = h->netns_ref = 0;
+	else
+		ret = h->netns_ref = checkpoint_obj(ctx, dev->nd_net,
+						    CKPT_OBJ_NET_NS);
+ out:
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+		kfree(*addrs);
+	}
+
+	return h;
+}
+EXPORT_SYMBOL_GPL(ckpt_netdev_base);
+
+int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net_device *dev = (struct net_device *)ptr;
+	int ret;
+
+	if (!dev->netdev_ops->ndo_checkpoint) {
+		ckpt_err(ctx, -ENOSYS,
+			 "Device %s does not support checkpoint\n", dev->name);
+		return -ENOSYS;
+	}
+
+	ckpt_debug("checkpointing netdev %s\n", dev->name);
+
+	ret = dev->netdev_ops->ndo_checkpoint(ctx, dev);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "Failed to checkpoint netdev %s: %i\n",
+			 dev->name, ret);
+
+	return ret;
+}
+
+int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net *net = ptr;
+	struct net_device *dev;
+	struct ckpt_hdr_netns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (!h)
+		return -ENOMEM;
+
+	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
+	BUG_ON(h->this_ref <= 0);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	for_each_netdev(net, dev) {
+		if (dev->netdev_ops->ndo_checkpoint)
+			ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
+		else if (dev->flags & IFF_UP)
+			ret = -ENOSYS;
+		else
+			/* TODO: There should be a flag to attempt a
+			 * checkpoint of downed interfaces, regardless
+			 * of whether they support checkpoint or not.
+			 */
+			ret = 0;
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+static int restore_in_addrs(struct ckpt_ctx *ctx,
+			    __u32 naddrs,
+			    struct net *net,
+			    struct net_device *dev)
+{
+	__u32 i;
+	int ret = 0;
+	int len = naddrs * sizeof(struct ckpt_netdev_addr);
+	struct ckpt_netdev_addr *addrs = NULL;
+
+	ret = ckpt_read_payload(ctx, (void **)&addrs, len, CKPT_HDR_BUFFER);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < naddrs; i++) {
+		struct ckpt_netdev_addr *addr = &addrs[i];
+		struct ifreq req;
+		struct sockaddr_in *inaddr;
+
+		if (addr->type != CKPT_NETDEV_ADDR_IPV4) {
+			ret = -EINVAL;
+			ckpt_err(ctx, ret, "Unsupported netdev addr type %i\n",
+				 addr->type);
+			break;
+		}
+
+		ckpt_debug("restoring %s: %x/%x/%x\n", dev->name,
+			   addr->inet4_address,
+			   addr->inet4_mask,
+			   addr->inet4_broadcast);
+
+		memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = ntohl(addr->inet4_address);
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set address\n");
+			break;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = ntohl(addr->inet4_mask);
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFNETMASK, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set netmask\n");
+			break;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = ntohl(addr->inet4_broadcast);
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFBRDADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set broadcast\n");
+			break;
+		}
+	}
+
+ out:
+	kfree(addrs);
+
+	return ret;
+}
+
+static int veth_new_link_msg(struct sk_buff *skb, void *data)
+{
+	struct nlattr *linkinfo;
+	struct nlattr *linkdata;
+	struct ifinfomsg ifm;
+	int ret = -ENOMEM;
+	struct veth_newlink *d = data;
+
+	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
+	if (!linkinfo)
+		goto out;
+
+	ret = nla_put_string(skb, IFLA_INFO_KIND, "veth");
+	if (ret)
+		goto out;
+
+	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
+	if (!linkdata) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = nla_put(skb, VETH_INFO_PEER, sizeof(ifm), &ifm);
+	if (!ret)
+		ret = nla_put_string(skb, IFLA_IFNAME, d->peer);
+
+	nla_nest_end(skb, linkdata);
+ out:
+	nla_nest_end(skb, linkinfo);
+
+	return ret;
+}
+
+static int mvl_new_link_msg(struct sk_buff *skb, void *data)
+{
+	struct mvl_newlink *d = data;
+	struct nlattr *linkinfo;
+	struct nlattr *linkdata;
+	struct net_device *lowerdev;
+	int ret;
+
+	lowerdev = dev_get_by_name(current->nsproxy->net_ns, d->base);
+	if (!lowerdev)
+		return -ENOENT;
+
+	ret = nla_put(skb, IFLA_ADDRESS, ETH_ALEN, d->hwaddr);
+	if (ret)
+		goto out_put;
+
+	ret = nla_put_u32(skb, IFLA_LINK, lowerdev->ifindex);
+	if (ret)
+		goto out_put;
+
+	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
+	if (!linkinfo) {
+		ret = -ENOMEM;
+		goto out_put;
+	}
+
+	ret = nla_put_string(skb, IFLA_INFO_KIND, "macvlan");
+	if (ret)
+		goto out;
+
+	linkdata = nla_nest_start(skb, IFLA_INFO_DATA);
+	if (!linkdata) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = nla_put_u32(skb, IFLA_MACVLAN_MODE, d->mode);
+	nla_nest_end(skb, linkdata);
+ out:
+	nla_nest_end(skb, linkinfo);
+ out_put:
+	dev_put(lowerdev);
+
+	return ret;
+}
+
+static struct sk_buff *new_link_msg(new_link_fn fn, void *data, char *name)
+{
+	int ret = -ENOMEM;
+	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb;
+	struct ifinfomsg *ifm;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		goto out;
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_NEWLINK, sizeof(*ifm), flags);
+	if (!nlh)
+		goto out;
+
+	ifm = nlmsg_data(nlh);
+	memset(ifm, 0, sizeof(*ifm));
+
+	ret = nla_put_string(skb, IFLA_IFNAME, name);
+	if (ret)
+		goto out;
+
+	ret = fn(skb, data);
+
+	nlmsg_end(skb, nlh);
+
+ out:
+	if (ret < 0) {
+		kfree_skb(skb);
+		skb = ERR_PTR(ret);
+	}
+
+	return skb;
+}
+
+static struct net_device *rtnl_newlink(new_link_fn fn, void *data, char *name)
+{
+	int ret = -ENOMEM;
+	struct socket *rtnl = NULL;
+	struct sk_buff *skb = NULL;
+	struct nlmsghdr *nlh;
+	struct msghdr msg;
+	struct kvec kvec;
+
+	skb = new_link_msg(fn, data, name);
+	if (IS_ERR(skb)) {
+		ckpt_debug("failed to create new link message: %li\n",
+			   PTR_ERR(skb));
+		return ERR_PTR(PTR_ERR(skb));
+	}
+
+	memset(&msg, 0, sizeof(msg));
+	kvec.iov_len = skb->len;
+	kvec.iov_base = skb->head;
+
+	rtnl = rtnl_open();
+	if (IS_ERR(rtnl)) {
+		ret = PTR_ERR(rtnl);
+		ckpt_debug("Unable to open rtnetlink socket: %i\n", ret);
+		goto out_noclose;
+	}
+
+	ret = kernel_sendmsg(rtnl, &msg, &kvec, 1, kvec.iov_len);
+	if (ret < 0)
+		goto out;
+	else if (ret != skb->len) {
+		ret = -EIO;
+		goto out;
+	}
+
+	/* Free the send skb to make room for the receive skb */
+	kfree_skb(skb);
+
+	nlh = rtnl_get_response(rtnl, &skb);
+	if (IS_ERR(nlh)) {
+		ret = PTR_ERR(nlh);
+		ckpt_debug("RTNETLINK said: %i\n", ret);
+	}
+ out:
+	rtnl_close(rtnl);
+ out_noclose:
+	kfree_skb(skb);
+
+	if (ret < 0)
+		return ERR_PTR(ret);
+	else
+		return dev_get_by_name(current->nsproxy->net_ns, name);
+}
+
+static int netdev_noop(void *data)
+{
+	return 0;
+}
+
+static int netdev_cleanup(void *data)
+{
+	struct dq_netdev *dq = data;
+
+	dev_put(dq->dev);
+
+	if (dq->ctx->errno) {
+		ckpt_debug("Unregistering netdev %s\n", dq->dev->name);
+		unregister_netdev(dq->dev);
+	}
+
+	return 0;
+}
+
+static struct net_device *restore_veth(struct ckpt_ctx *ctx,
+				       struct ckpt_hdr_netdev *h,
+				       struct net *net)
+{
+	int ret;
+	char this_name[IFNAMSIZ];
+	char peer_name[IFNAMSIZ];
+	struct net_device *dev;
+	struct net_device *peer;
+	struct ifreq req;
+	struct dq_netdev dq;
+
+	dq.ctx = ctx;
+
+	ret = _ckpt_read_buffer(ctx, this_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ret = _ckpt_read_buffer(ctx, peer_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ckpt_debug("restored veth netdev %s:%s\n", this_name, peer_name);
+
+	peer = ckpt_obj_try_fetch(ctx, h->veth.peer_ref, CKPT_OBJ_NETDEV);
+	if (IS_ERR(peer)) {
+		struct veth_newlink veth = {
+			.peer = peer_name,
+		};
+
+		dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
+		if (IS_ERR(dev))
+			return dev;
+
+		peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
+		if (!peer) {
+			ret = -EINVAL;
+			goto err_dev;
+		}
+
+		dq.dev = peer;
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     netdev_noop, netdev_cleanup);
+		if (ret)
+			goto err_peer;
+
+		ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
+				      CKPT_OBJ_NETDEV);
+		if (ret < 0)
+			/* Can't recall peer dq, so let it cleanup peer */
+			goto err_dev;
+		dev_put(peer);
+
+		dq.dev = dev;
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     netdev_noop, netdev_cleanup);
+		if (ret)
+			/* Can't recall peer dq, so let it cleanup peer */
+			goto err_dev;
+
+	} else {
+		/* We're second: get our dev from the hash */
+		dev = ckpt_obj_fetch(ctx, h->veth.this_ref, CKPT_OBJ_NETDEV);
+		if (IS_ERR(dev))
+			return dev;
+	}
+
+	/* Move to our new netns */
+	rtnl_lock();
+	ret = dev_change_net_namespace(dev, net, dev->name);
+	rtnl_unlock();
+	if (ret < 0)
+		goto out;
+
+	/* Restore MAC address */
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	memcpy(req.ifr_hwaddr.sa_data, h->hwaddr, sizeof(h->hwaddr));
+	req.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+	ret = __kern_dev_ioctl(net, SIOCSIFHWADDR, &req);
+ out:
+	if (ret)
+		dev = ERR_PTR(ret);
+
+	return dev;
+
+ err_peer:
+	dev_put(peer);
+	unregister_netdev(peer);
+ err_dev:
+	dev_put(dev);
+	unregister_netdev(dev);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_lo(struct ckpt_ctx *ctx,
+				     struct ckpt_hdr_netdev *h,
+				     struct net *net)
+{
+	struct net_device *dev;
+	char name[IFNAMSIZ+1];
+	int ret;
+
+	dev = dev_get_by_name(net, "lo");
+	if (!dev)
+		return ERR_PTR(-EINVAL);
+
+	ret = _ckpt_read_buffer(ctx, name, IFNAMSIZ);
+	if (ret < 0)
+		goto err;
+
+	if (strncmp(dev->name, name, IFNAMSIZ) != 0) {
+		ret = dev_change_name(dev, name);
+		if (ret < 0)
+			goto err;
+	}
+
+	return dev;
+ err:
+	dev_put(dev);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_sit(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_netdev *h,
+				      struct net *net)
+{
+	/* Don't actually do anything for SIT devices yet */
+	return dev_get_by_name(net, "sit0");
+}
+
+static struct net_device *restore_macvlan(struct ckpt_ctx *ctx,
+					  struct ckpt_hdr_netdev *h,
+					  struct net *net)
+{
+	struct net_device *dev;
+	struct mvl_newlink mvl = {
+		.mode = h->macvlan.mode,
+		.hwaddr = h->hwaddr,
+	};
+	int ret;
+
+	ret = _ckpt_read_buffer(ctx, mvl.this, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ret = _ckpt_read_buffer(ctx, mvl.base, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	dev = rtnl_newlink(mvl_new_link_msg, &mvl, mvl.this);
+	if (IS_ERR(dev)) {
+		ckpt_err(ctx, PTR_ERR(dev),
+			 "Failed to create macvlan device %s:%s",
+			 mvl.this, mvl.base);
+		goto out;
+	}
+
+	rtnl_lock();
+	ret = dev_change_net_namespace(dev, net, dev->name);
+	rtnl_unlock();
+
+	if (ret) {
+		ckpt_err(ctx, ret, "Failed to change netns of %s:%s\n",
+			 mvl.this, mvl.base);
+		dev_put(dev);
+		unregister_netdev(dev);
+		dev = ERR_PTR(ret);
+	}
+ out:
+	return dev;
+}
+
+typedef struct net_device *(*restore_netdev_fn)(struct ckpt_ctx *,
+						struct ckpt_hdr_netdev *,
+						struct net *);
+
+restore_netdev_fn restore_netdev_functions[] = {
+	restore_lo,		/* CKPT_NETDEV_LO */
+	restore_veth,		/* CKPT_NETDEV_VETH */
+	restore_sit,		/* CKPT_NETDEV_SIT */
+	restore_macvlan,	/* CKPT_NETDEV_MACVLAN */
+};
+
+void *restore_netdev(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netdev *h;
+	struct net_device *dev = NULL;
+	struct ifreq req;
+	struct net *net;
+	int ret;
+	restore_netdev_fn restore_fn = NULL;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->netns_ref != 0) {
+		net = ckpt_obj_try_fetch(ctx, h->netns_ref, CKPT_OBJ_NET_NS);
+		if (IS_ERR(net)) {
+			ckpt_debug("failed to get net for %i\n", h->netns_ref);
+			ret = PTR_ERR(net);
+			goto out;
+		}
+	} else
+		net = current->nsproxy->net_ns;
+
+	if (h->type >= CKPT_NETDEV_MAX) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret, "Invalid netdev type %i\n", h->type);
+		goto out;
+	}
+
+	restore_fn = restore_netdev_functions[h->type];
+
+	dev = restore_fn(ctx, h, net);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		ckpt_err(ctx, ret, "Netdev type %i not supported\n", h->type);
+		goto out;
+	}
+
+	/* Restore flags (which will likely bring the interface up) */
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	req.ifr_flags = h->flags;
+	ret = __kern_dev_ioctl(net, SIOCSIFFLAGS, &req);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0)
+		ret = restore_in_addrs(ctx, h->inet_addrs, net, dev);
+ out:
+	if (ret) {
+		ckpt_err(ctx, ret, "Failed to restore netdevice\n");
+		if ((h->type == CKPT_NETDEV_VETH) && !IS_ERR(dev))
+			dev_put(dev);
+		dev = ERR_PTR(ret);
+	} else
+		ckpt_debug("restored netdev %s\n", dev->name);
+
+	ckpt_hdr_put(ctx, h);
+
+	return dev;
+}
+
+void *restore_netns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netns *h;
+	struct net *net;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read netns\n");
+		return h;
+	}
+
+	if (h->this_ref != 0) {
+		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
+		if (IS_ERR(net))
+			goto out;
+	} else
+		net = current->nsproxy->net_ns;
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return net;
+}
-- 
1.6.3.3



* [PATCH v21 098/100] c/r: Add a checkpoint handler to the 'sit' device
From: Oren Laadan @ 2010-05-01 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

This handler does little to checkpoint the device beyond the minimum
required to support the restart process.  Once IPv6 support is added,
the handler can be filled out.

With a handler in place, 'sit' interfaces no longer have to be skipped
as unsupported on a normal system.
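
For readers following along, here is a rough userspace sketch of the
record layout the handler produces: a fixed header followed by
inet_addrs address entries, with the payload length computed as
sizeof(entry) * count.  The sketch_* names and the flat buffer are
assumptions made for illustration; the real code uses
ckpt_write_obj() and ckpt_write_buffer() on the checkpoint context,
and the real structures carry more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for ckpt_netdev_addr / ckpt_hdr_netdev. */
struct sketch_addr {
	uint32_t local, address, mask, broadcast;
};

struct sketch_hdr {
	uint16_t type;
	uint32_t inet_addrs;	/* number of address entries that follow */
};

/*
 * Write the header, then the address array, into buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 * (Multiplication overflow is ignored here to keep the sketch short.)
 */
static size_t sketch_write_netdev(unsigned char *buf, size_t cap,
				  const struct sketch_hdr *h,
				  const struct sketch_addr *addrs)
{
	size_t alen, total;

	assert(buf != NULL && h != NULL);

	alen = sizeof(*addrs) * h->inet_addrs;
	total = sizeof(*h) + alen;
	if (total > cap)
		return 0;

	memcpy(buf, h, sizeof(*h));
	/* Only touch the address array when there is at least one entry. */
	if (h->inet_addrs > 0)
		memcpy(buf + sizeof(*h), addrs, alen);

	return total;
}
```

A reader of such a record would consume the header first and use its
inet_addrs count to size the read of the address payload, which is the
order restore_in_addrs() relies on.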

Changelog[v21]:
  - Do not include checkpoint_hdr.h explicitly
  - Unbreak compiling with CONFIG_CHECKPOINT=n or CONFIG_NET_NS=n

Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 net/ipv6/sit.c |   34 ++++++++++++++++++++++++++++++++++
 1 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 5abae10..5ecbe56 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1084,11 +1084,45 @@ static int ipip6_tunnel_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+#include <linux/checkpoint.h>
+
+#ifdef CONFIG_NETNS_CHECKPOINT
+static int ipip6_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_SIT;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
+
 static const struct net_device_ops ipip6_netdev_ops = {
 	.ndo_uninit	= ipip6_tunnel_uninit,
 	.ndo_start_xmit	= ipip6_tunnel_xmit,
 	.ndo_do_ioctl	= ipip6_tunnel_ioctl,
 	.ndo_change_mtu	= ipip6_tunnel_change_mtu,
+#ifdef CONFIG_NETNS_CHECKPOINT
+	.ndo_checkpoint	= ipip6_checkpoint,
+#endif
 };
 
 static void ipip6_tunnel_setup(struct net_device *dev)
-- 
1.6.3.3



* [PATCH v21 096/100] c/r: Add checkpoint support for veth devices (v2)
From: Oren Laadan @ 2010-05-01 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

Adds an ndo_checkpoint() handler for veth devices to checkpoint themselves.
It writes out the pairing information and addresses, and initiates a
checkpoint of the peer if the peer won't be reached from another netns.
It throws an error if our peer's netns isn't already in the hash (i.e., a
tree leak).
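
The peer handling described above boils down to a three-way decision.
The following is only a userspace sketch of that decision, not the
kernel code: struct sketch_netns, the tracked[] array (standing in for
the checkpoint object hash) and veth_peer_action() are hypothetical
names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net; identity is by pointer. */
struct sketch_netns {
	int id;
};

enum peer_action {
	PEER_CHECKPOINT_NOW,	/* peer is in the checkpointer's own netns */
	PEER_REACHED_LATER,	/* peer's netns is tracked; it will be visited */
	PEER_LEAK_ERROR,	/* peer's netns is unknown: tree leak */
};

/* Linear search standing in for a lookup in the object hash. */
static bool netns_tracked(const struct sketch_netns *const *tracked,
			  size_t n, const struct sketch_netns *ns)
{
	for (size_t i = 0; i < n; i++)
		if (tracked[i] == ns)
			return true;
	return false;
}

static enum peer_action veth_peer_action(const struct sketch_netns *init_ns,
					 const struct sketch_netns *peer_ns,
					 const struct sketch_netns *const *tracked,
					 size_t n)
{
	assert(init_ns != NULL && peer_ns != NULL);

	/* No task will lead us to the peer, so checkpoint it right away. */
	if (peer_ns == init_ns)
		return PEER_CHECKPOINT_NOW;

	/* Another task's netns will reach the peer on its own. */
	if (netns_tracked(tracked, n, peer_ns))
		return PEER_REACHED_LATER;

	/* The link leaves the container for a netns we never collected. */
	return PEER_LEAK_ERROR;
}
```

In the patch itself the first case maps to checkpoint_obj() on the
peer, and the last to returning -EINVAL after ckpt_obj_lookup() on
peer->nd_net finds nothing.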

Changelog[v21]
 - Unbreak compiling with CONFIG_CHECKPOINT=n or CONFIG_NET_NS=n
 - Clean up the error path in restore_veth()

Changes in v2:
 - Fix check detecting if peer is in the init netns

Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 drivers/net/veth.c   |   76 +++++++++++++++++++++++++++++++++++++++++++
 net/checkpoint_dev.c |   87 +++++++++++++++++--------------------------------
 2 files changed, 106 insertions(+), 57 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f9f0730..d76b5e0 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -285,6 +285,79 @@ static void veth_dev_free(struct net_device *dev)
 	free_netdev(dev);
 }
 
+#ifdef CONFIG_NETNS_CHECKPOINT
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+static int veth_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer = priv->peer;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+	int n;
+
+	if (!peer) {
+		ckpt_err(ctx, -EINVAL, "veth device has no peer!\n");
+		return -EINVAL;
+	}
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_VETH;
+
+	ret = h->veth.this_ref = ckpt_obj_lookup_add(ctx, dev,
+						     CKPT_OBJ_NETDEV, &n);
+	if (ret < 0)
+		goto out;
+
+	ret = h->veth.peer_ref = ckpt_obj_lookup_add(ctx, peer,
+						     CKPT_OBJ_NETDEV, &n);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, peer->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+		if (ret)
+			goto out;
+	}
+
+	/* Only checkpoint peer if we're not going to arrive at it
+	 * via another task's netns.  Fail if the pipe exits
+	 * our container to a netns not already in the hash
+	 */
+	if (ckpt_netdev_in_init_netns(ctx, peer))
+		ret = checkpoint_obj(ctx, peer, CKPT_OBJ_NETDEV);
+	else if (!ckpt_obj_lookup(ctx, peer->nd_net, CKPT_OBJ_NET_NS)) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret,
+			 "Peer %s of %s not in checkpointed namespaces\n",
+			 peer->name, dev->name);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -293,6 +366,9 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_change_mtu      = veth_change_mtu,
 	.ndo_get_stats       = veth_get_stats,
 	.ndo_set_mac_address = eth_mac_addr,
+#ifdef CONFIG_NETNS_CHECKPOINT
+	.ndo_checkpoint      = veth_checkpoint,
+#endif
 };
 
 static void veth_setup(struct net_device *dev)
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
index 5097011..a8e3341 100644
--- a/net/checkpoint_dev.c
+++ b/net/checkpoint_dev.c
@@ -20,11 +20,6 @@
 #include <net/net_namespace.h>
 #include <net/sch_generic.h>
 
-struct dq_netdev {
-	struct net_device *dev;
-	struct ckpt_ctx *ctx;
-};
-
 struct veth_newlink {
 	char *peer;
 };
@@ -587,25 +582,6 @@ static int rtnl_dellink(char *name)
 	return ret;
 }
 
-static int netdev_noop(void *data)
-{
-	return 0;
-}
-
-static int netdev_cleanup(void *data)
-{
-	struct dq_netdev *dq = data;
-
-	dev_put(dq->dev);
-
-	if (dq->ctx->errno) {
-		ckpt_debug("Unregistering netdev %s\n", dq->dev->name);
-		unregister_netdev(dq->dev);
-	}
-
-	return 0;
-}
-
 static struct net_device *restore_veth(struct ckpt_ctx *ctx,
 				       struct ckpt_hdr_netdev *h,
 				       struct net *net)
@@ -616,9 +592,6 @@ static struct net_device *restore_veth(struct ckpt_ctx *ctx,
 	struct net_device *dev;
 	struct net_device *peer;
 	struct ifreq req;
-	struct dq_netdev dq;
-
-	dq.ctx = ctx;
 
 	ret = _ckpt_read_buffer(ctx, this_name, IFNAMSIZ);
 	if (ret < 0)
@@ -640,37 +613,31 @@ static struct net_device *restore_veth(struct ckpt_ctx *ctx,
 		if (IS_ERR(dev))
 			return dev;
 
+		ret = ckpt_obj_insert(ctx, dev, h->veth.this_ref,
+				      CKPT_OBJ_NETDEV);
+		dev_put(dev);
+		if (ret < 0)
+			goto err;
+
 		peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
 		if (!peer) {
 			ret = -EINVAL;
-			goto err_dev;
+			goto err;
 		}
 
-		dq.dev = peer;
-		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
-				     netdev_noop, netdev_cleanup);
-		if (ret)
-			goto err_peer;
-
 		ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
 				      CKPT_OBJ_NETDEV);
-		if (ret < 0)
-			/* Can't recall peer dq, so let it cleanup peer */
-			goto err_dev;
 		dev_put(peer);
-
-		dq.dev = dev;
-		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
-				     netdev_noop, netdev_cleanup);
-		if (ret)
-			/* Can't recall peer dq, so let it cleanup peer */
-			goto err_dev;
+		if (ret < 0)
+			goto err;
 
 	} else {
 		/* We're second: get our dev from the hash */
 		dev = ckpt_obj_fetch(ctx, h->veth.this_ref, CKPT_OBJ_NETDEV);
-		if (IS_ERR(dev))
-			return dev;
+		if (IS_ERR(dev)) {
+			ret = PTR_ERR(dev);
+			goto err;
+		}
 	}
 
 	/* Move to our new netns */
@@ -678,25 +645,31 @@ static struct net_device *restore_veth(struct ckpt_ctx *ctx,
 	ret = dev_change_net_namespace(dev, net, dev->name);
 	rtnl_unlock();
 	if (ret < 0)
-		goto out;
+		goto err;
 
 	/* Restore MAC address */
 	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
 	memcpy(req.ifr_hwaddr.sa_data, h->hwaddr, sizeof(h->hwaddr));
 	req.ifr_hwaddr.sa_family = ARPHRD_ETHER;
 	ret = __kern_dev_ioctl(net, SIOCSIFHWADDR, &req);
- out:
-	if (ret)
-		dev = ERR_PTR(ret);
+	if (ret < 0)
+		goto err;
 
 	return dev;
-
- err_peer:
-	dev_put(peer);
-	unregister_netdev(peer);
- err_dev:
-	dev_put(dev);
-	unregister_netdev(dev);
+ err:
+	/* Delete from hash to drop reference */
+	ckpt_obj_delete(ctx, h->veth.this_ref, CKPT_OBJ_NETDEV);
+	ckpt_obj_delete(ctx, h->veth.peer_ref, CKPT_OBJ_NETDEV);
+
+	/* This will fail to delete the interface if we get here
+	 * because of a failed attempt at setting the hardware
+	 * address, since the device has been moved to another netns.
+	 * This is not a problem, however, because the death of that
+	 * netns will take the device (and its peer) down with it
+	 * cleanly.
+	 */
+	if (rtnl_dellink(this_name) < 0)
+		ckpt_debug("failed to delete interfaces on error\n");
 
 	return ERR_PTR(ret);
 }
-- 
1.6.3.3



* [PATCH v21 095/100] c/r: Add rtnl_dellink() helper
From: Oren Laadan @ 2010-05-01 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

This is the kernel equivalent of "ip link del $name" and matches the
existing rtnl_newlink() equivalent of "ip link add $name".  It factors
out the message creation and dispatch code a little further into
rtnl_do() before adding the new function.
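
For readers less familiar with rtnetlink, the request that "ip link del $name"
boils down to can be sketched from userspace roughly as follows.  This is only
an illustration, not the code in this patch: struct dellink_req and
build_dellink() are made-up names, the buffer is laid out by hand instead of
with nlmsg_put()/nla_put_string(), and only NLM_F_REQUEST | NLM_F_ACK is set.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: netlink header, ifinfomsg, one attribute. */
struct dellink_req {
	struct nlmsghdr nlh;
	struct ifinfomsg ifm;
	char attrs[64];
};

/*
 * Fill @req with an RTM_DELLINK request that names the interface via
 * IFLA_IFNAME.  Returns the total message length, or -1 if the name
 * does not fit in the attribute area.
 */
static int build_dellink(struct dellink_req *req, const char *name)
{
	struct rtattr *rta;
	size_t name_len = strlen(name) + 1;	/* include the trailing NUL */

	if (RTA_LENGTH(name_len) > sizeof(req->attrs))
		return -1;

	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req->ifm));
	req->nlh.nlmsg_type = RTM_DELLINK;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req->ifm.ifi_family = AF_UNSPEC;

	/* The attribute starts right after the aligned fixed part. */
	rta = (struct rtattr *)((char *)req + NLMSG_ALIGN(req->nlh.nlmsg_len));
	rta->rta_type = IFLA_IFNAME;
	rta->rta_len = RTA_LENGTH(name_len);
	memcpy(RTA_DATA(rta), name, name_len);

	req->nlh.nlmsg_len = NLMSG_ALIGN(req->nlh.nlmsg_len) +
			     RTA_ALIGN(rta->rta_len);
	return (int)req->nlh.nlmsg_len;
}
```

A caller would send the first nlmsg_len bytes of the request over a
NETLINK_ROUTE socket and then read back the ACK, which is the part that
rtnl_do() handles on the kernel side.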

Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 net/checkpoint_dev.c |   86 +++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 72 insertions(+), 14 deletions(-)

diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
index 34a6bdb..5097011 100644
--- a/net/checkpoint_dev.c
+++ b/net/checkpoint_dev.c
@@ -475,22 +475,49 @@ static struct sk_buff *new_link_msg(new_link_fn fn, void *data, char *name)
 	return skb;
 }
 
-static struct net_device *rtnl_newlink(new_link_fn fn, void *data, char *name)
+static struct sk_buff *del_link_msg(char *name)
+{
+	int ret = -ENOMEM;
+	int flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;
+	struct nlmsghdr *nlh;
+	struct sk_buff *skb;
+	struct ifinfomsg *ifm;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	nlh = nlmsg_put(skb, 0, 0, RTM_DELLINK, sizeof(*ifm), flags);
+	if (!nlh)
+		goto out;
+
+	ifm = nlmsg_data(nlh);
+	memset(ifm, 0, sizeof(*ifm));
+
+	ret = nla_put_string(skb, IFLA_IFNAME, name);
+	if (ret)
+		goto out;
+
+	nlmsg_end(skb, nlh);
+
+ out:
+	if (ret < 0) {
+		kfree_skb(skb);
+		skb = ERR_PTR(ret);
+	}
+
+	return skb;
+}
+
+static int rtnl_do(struct sk_buff *skb)
 {
 	int ret = -ENOMEM;
 	struct socket *rtnl = NULL;
-	struct sk_buff *skb = NULL;
+	struct sk_buff *rskb = NULL;
 	struct nlmsghdr *nlh;
 	struct msghdr msg;
 	struct kvec kvec;
 
-	skb = new_link_msg(fn, data, name);
-	if (IS_ERR(skb)) {
-		ckpt_debug("failed to create new link message: %li\n",
-			   PTR_ERR(skb));
-		return ERR_PTR(PTR_ERR(skb));
-	}
-
 	memset(&msg, 0, sizeof(msg));
 	kvec.iov_len = skb->len;
 	kvec.iov_base = skb->head;
@@ -510,25 +537,56 @@ static struct net_device *rtnl_newlink(new_link_fn fn, void *data, char *name)
 		goto out;
 	}
 
-	/* Free the send skb to make room for the receive skb */
-	kfree_skb(skb);
-
-	nlh = rtnl_get_response(rtnl, &skb);
+	nlh = rtnl_get_response(rtnl, &rskb);
 	if (IS_ERR(nlh)) {
 		ret = PTR_ERR(nlh);
 		ckpt_debug("RTNETLINK said: %i\n", ret);
 	}
  out:
 	rtnl_close(rtnl);
+	kfree_skb(rskb);
  out_noclose:
-	kfree_skb(skb);
+	return ret;
+}
 
+static struct net_device *rtnl_newlink(new_link_fn fn, void *data, char *name)
+{
+	struct sk_buff *skb;
+	int ret;
+
+	skb = new_link_msg(fn, data, name);
+	if (IS_ERR(skb)) {
+		ckpt_debug("failed to create new link message: %li\n",
+			   PTR_ERR(skb));
+		return ERR_PTR(PTR_ERR(skb));
+	}
+
+	ret = rtnl_do(skb);
+	kfree_skb(skb);
 	if (ret < 0)
 		return ERR_PTR(ret);
 	else
 		return dev_get_by_name(current->nsproxy->net_ns, name);
 }
 
+static int rtnl_dellink(char *name)
+{
+	struct sk_buff *skb;
+	int ret;
+
+	skb = del_link_msg(name);
+	if (IS_ERR(skb)) {
+		ckpt_debug("failed to create del link message: %li\n",
+			   PTR_ERR(skb));
+		return PTR_ERR(skb);
+	}
+
+	ret = rtnl_do(skb);
+	kfree_skb(skb);
+
+	return ret;
+}
+
 static int netdev_noop(void *data)
 {
 	return 0;
-- 
1.6.3.3



* [PATCH v21 097/100] c/r: Add loopback checkpoint support (v2)
From: Oren Laadan @ 2010-05-01 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

Adds a small ndo_checkpoint() handler for loopback devices to write the
name and addresses like other interfaces.

Changelog[v21]:
 - Unbreak compiling with CONFIG_CHECKPOINT=n or CONFIG_NET_NS=n

Changes in v2:
 - Add CONFIG_CHECKPOINT around the handler

Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 drivers/net/loopback.c |   45 ++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 72b7949..9a958a8 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -155,10 +155,49 @@ static void loopback_dev_free(struct net_device *dev)
 	free_netdev(dev);
 }
 
+#ifdef CONFIG_NETNS_CHECKPOINT
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+static int loopback_checkpoint(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct ckpt_hdr_netdev *h;
+	struct ckpt_netdev_addr *addrs;
+	int ret;
+
+	h = ckpt_netdev_base(ctx, dev, &addrs);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	h->type = CKPT_NETDEV_LO;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+	if (ret < 0)
+		goto out;
+
+	if (h->inet_addrs > 0) {
+		int len = (sizeof(struct ckpt_netdev_addr) * h->inet_addrs);
+		ret = ckpt_write_buffer(ctx, addrs, len);
+	}
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(addrs);
+
+	return ret;
+}
+#endif
 static const struct net_device_ops loopback_ops = {
-	.ndo_init      = loopback_dev_init,
-	.ndo_start_xmit= loopback_xmit,
-	.ndo_get_stats = loopback_get_stats,
+	.ndo_init       = loopback_dev_init,
+	.ndo_start_xmit = loopback_xmit,
+	.ndo_get_stats  = loopback_get_stats,
+#ifdef CONFIG_NETNS_CHECKPOINT
+	.ndo_checkpoint = loopback_checkpoint,
+#endif
 };
 
 /*
-- 
1.6.3.3



* [PATCH v21 093/100] c/r: Add checkpoint and collect hooks to net_device_ops
From: Oren Laadan @ 2010-05-01 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

These will be implemented per-driver by those that support such
operations.
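
Since the hooks are optional, a caller has to cope with drivers that leave
them NULL.  A minimal sketch of that dispatch pattern is below.  The names
sketch_netdev_ops, sketch_checkpoint_netdev() and sketch_always_ok() are
invented for the example, and returning -EOPNOTSUPP for a missing hook is an
assumption, not necessarily what the c/r core does.

```c
#include <errno.h>
#include <stddef.h>

struct ckpt_ctx;	/* opaque for the purposes of this sketch */
struct net_device;	/* likewise */

/* Hypothetical stand-in for the relevant part of net_device_ops. */
struct sketch_netdev_ops {
	int (*ndo_checkpoint)(struct ckpt_ctx *ctx, struct net_device *dev);
};

/* Example driver handler that always succeeds. */
static int sketch_always_ok(struct ckpt_ctx *ctx, struct net_device *dev)
{
	(void)ctx;
	(void)dev;
	return 0;
}

/*
 * Call the driver's checkpoint hook if it provides one.  The error
 * code for "driver does not support checkpoint" is an assumption.
 */
static int sketch_checkpoint_netdev(const struct sketch_netdev_ops *ops,
				    struct ckpt_ctx *ctx,
				    struct net_device *dev)
{
	if (!ops || !ops->ndo_checkpoint)
		return -EOPNOTSUPP;

	return ops->ndo_checkpoint(ctx, dev);
}
```

The point is only that a device whose driver does not opt in makes the
checkpoint fail cleanly rather than being skipped silently.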

Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/netdevice.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..9f6de34 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -691,6 +691,12 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int			(*ndo_collect)(struct ckpt_ctx *ctx,
+					       struct net_device *dev);
+	int			(*ndo_checkpoint)(struct ckpt_ctx *ctx,
+						  struct net_device *dev);
+#endif
 };
 
 /*
-- 
1.6.3.3



* [PATCH v21 073/100] c/r: Add AF_UNIX support (v12)
From: Oren Laadan @ 2010-05-01 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, Alexey Dobriyan, netdev, Oren Laadan
In-Reply-To: <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>

From: Dan Smith <danms@us.ibm.com>

This patch adds basic checkpoint/restart support for AF_UNIX sockets.  It
has been tested with single and multiple processes, and with data in flight
at the time of checkpoint.  It supports socketpair()s as well as path-based
and abstract sockets.
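
As a reminder of how these three kinds of AF_UNIX sockets differ at the
address level, here is a small userspace sketch that follows the conventions
documented in unix(7).  The enum and classify_unix_addr() are made up for
illustration and are not part of this patch.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

enum unix_addr_kind {
	UNIX_ADDR_UNNAMED,	/* e.g. socketpair() or never bound */
	UNIX_ADDR_ABSTRACT,	/* sun_path starts with a NUL byte */
	UNIX_ADDR_PATH,		/* bound to a filesystem path */
};

/*
 * Classify an address as returned by getsockname().  @len is the
 * address length reported by the kernel, which is what tells an
 * unnamed socket apart from an abstract one.
 */
static enum unix_addr_kind classify_unix_addr(const struct sockaddr_un *sun,
					      socklen_t len)
{
	if (len <= offsetof(struct sockaddr_un, sun_path))
		return UNIX_ADDR_UNNAMED;
	if (sun->sun_path[0] == '\0')
		return UNIX_ADDR_ABSTRACT;
	return UNIX_ADDR_PATH;
}
```

Only the path-based kind involves the filesystem, which is why the cwd
handling and unlink-before-bind notes in the changelog apply to it alone.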

Changes in ckpt-v21:
  - Do not include checkpoint_hdr.h explicitly
  - [Dan Smith] Disable softirqs when taking the socket queue lock
Changes in ckpt-v19:
  - [Serge Hallyn] skb->tail can be offset
Changes in ckpt-v19-rc3:
  - Rebase to kernel 2.6.33: export and leverage sock_alloc_file()
  - [Nathan Lynch] Fix net/checkpoint.c for 64-bit
Changes in ckpt-v19-rc2:
  - Change select uses of ckpt_debug() to ckpt_err() in net c/r
  - [Dan Smith] Unify skb read/write functions and handle fragmented buffers
  - [Dan Smith] Update buffer restore code to match the new format
Changes in ckpt-v19-rc1:
  - [Dan Smith] Fix compile issue with CONFIG_CHECKPOINT=n
  - [Dan Smith] Remove an unnecessary check on socket restart
  - [Matt Helsley] Add cpp definitions for enums
  - [Dan Smith] Pass the stored sock->protocol into sock_create() on restore
Changes in v12:
  - Collect sockets for leak-detection
  - Adjust socket reference count during leak detection phase
Changes in v11:
  - Create a struct socket for orphan socket during checkpoint
  - Make sockets proper objhash objects and use checkpoint_obj() on them
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - Remove struct timeval from socket header
  - Save and restore UNIX socket peer credentials
  - Set socket flags on restore using sock_setsockopt() where possible
  - Fail on the TIMESTAMPING_* flags for the moment (with a TODO)
  - Remove other explicit flag checks that are no longer copied blindly
  - Changed functions/variables names to follow existing conventions
  - Use proto_ops->{checkpoint,restart} methods for af_unix
  - Cleanup sock_file_restore()/sock_file_checkpoint()
  - Make ckpt_hdr_socket be part of ckpt_hdr_file_socket
  - Fold do_sock_file_checkpoint() into sock_file_checkpoint()
  - Fold do_sock_file_restore() into sock_file_restore()
  - Move sock_file_{checkpoint,restore} to net/checkpoint.c
  - Properly define sock_file_{checkpoint,restore} in header file
  - sock_file_restore() now calls restore_file_common()
Changes in v10:
  - Moved header structure definitions back to checkpoint_hdr.h
  - Moved AF_UNIX checkpoint/restart code to net/unix/checkpoint.c
  - Make sock_unix_*() functions only compile if CONFIG_UNIX=y
  - Add TODO for CONFIG_UNIX=m case
Changes in v9:
  - Fix double-free of skb's in the list and target holding queue in the
    error path of sock_copy_buffers()
  - Adjust use of ckpt_read_string() to match new signature
Changes in v8:
  - Fix stale dev_alloc_skb() from before the conversion to skb_clone()
  - Fix a couple of broken error paths
  - Fix memory leak of kvec.iov_base on successful return from sendmsg()
  - Fix condition for deciding when to run sock_cptrst_verify()
  - Fix buffer queue copy algorithm to hold the lock during walk(s)
  - Log the errno when either getname() or getpeer() fails
  - Add comments about ancillary messages in the UNIX queue
  - Add TODO comments for credential restore and flags via setsockopt()
  - Add TODO comment about strangely-connected dgram sockets and the use
    of sendmsg(peer)
Changes in v7:
  - Fix failure to free iov_base in error path of sock_read_buffer()
  - Change sock_read_buffer() to use _ckpt_read_obj_type() to get the
    header length and then use ckpt_kread() directly to read the payload
  - Change sock_read_buffers() to sock_unix_read_buffers() and break out
    some common functionality to better accommodate the subsequent INET
    patch
  - Generalize sock_unix_getnames() into sock_getnames() so INET can use it
  - Change skb_morph() to skb_clone() which uses the more common path and
    still avoids the copy
  - Add check to validate the socket type before creating socket
    on restore
  - Comment the CAP_NET_ADMIN override in sock_read_buffer_hdr
  - Strengthen the comment about priming the buffer limits
  - Change the objhash functions to deny direct checkpoint of sockets and
    remove the reference counting function
  - Change SOCKET_BUFFERS to SOCKET_QUEUE
  - Change this,peer objrefs to signed integers
  - Remove names from internal socket structures
  - Fix handling of sock_copy_buffers() result
  - Use ckpt_fill_fname() instead of d_path() for writing CWD
  - Use sock_getname() and sock_getpeer() for proper security hookage
  - Return -ENOSYS for unsupported socket families in checkpoint and restart
  - Use sock_setsockopt() and sock_getsockopt() where possible to save and
    restore socket option values
  - Check for SOCK_DESTROY flag in the global verify function because none
    of our supported socket types use it
  - Check for SOCK_USE_WRITE_QUEUE in AF_UNIX restore function because
    that flag should not be used on such a socket
  - Check socket state in UNIX restart path to validate the subset of valid
    values
Changes in v6:
  - Moved the socket addresses to the per-type header
  - Eliminated the HASCWD flag
  - Remove use of ckpt_write_err() in restart paths
  - Change the order in which buffers are read so that we can set the
    socket's limit equal to the size of the image's buffers (if appropriate)
    and then restore the original values afterwards.
  - Use the ckpt_validate_errno() helper
  - Add a check to make sure that we didn't restore a (UNIX) socket with
    any skb's in the send buffer
  - Fix up sock_unix_join() to not leave addr uninitialized for socketpair
  - Remove inclusion of checkpoint_hdr.h in the socket files
  - Make sock_unix_write_cwd() use ckpt_write_string() and use the new
    ckpt_read_string() for reading the cwd
  - Use the restored realcred credentials in sock_unix_join()
  - Fix error path of the chdir_and_bind
  - Change the algorithm for reloading the socket buffers to use sendmsg()
    on the socket's peer for better accounting
  - For DGRAM sockets, check the backlog value against the system max
    to avoid letting a restart bypass the overloaded queue length
  - Use sock_bind() instead of sock->ops->bind() to gain the security hook
  - Change "restart" to "restore" in some of the function names
Changes in v5:
  - Change laddr and raddr buffers in socket header to be long enough
    for INET6 addresses
  - Place socket.c and sock.h function definitions inside #ifdef
    CONFIG_CHECKPOINT
  - Add explicit check in sock_unix_makeaddr() to refuse if the
    checkpoint image specifies an addr length of 0
  - Split sock_unix_restart() into a few pieces to facilitate:
  - Changed behavior of the unix restore code so that unlinked LISTEN
    sockets don't do a bind()...unlink()
  - Save the base path of a bound socket's path so that we can chdir()
    to the base before bind() if it is a relative path
  - Call bind() for any socket that is not established but has a
    non-zero-length local address
  - Enforce the current sysctl limit on socket buffer size during restart
    unless the user holds CAP_NET_ADMIN
  - Unlink a path-based socket before calling bind()
Changes in v4:
  - Changed the signedness of rcvlowat, rcvtimeo, sndtimeo, and backlog
    to match their struct sock definitions.  This should avoid issues
    with sign extension.
  - Add a sock_cptrst_verify() function to be run at restore time to
    validate several of the values in the checkpoint image against
    limits, flag masks, etc.
  - Write an error string with ckpt_write_err() in the obscure cases
  - Don't write socket buffers for listen sockets
  - Sanity check address lengths before we agree to allocate memory
  - Check the result of inserting the peer object in the objhash on
    restart
  - Check return value of sock_cptrst() on restart
  - Change logic in remote getname() phase of checkpoint to not fail for
    closed (et al) sockets
  - Eliminate the memory copy while reading socket buffers on restart
Changes in v3:
  - Move sock_file_checkpoint() above sock_file_restore()
  - Change __sock_file_*() functions to do_sock_file_*()
  - Adjust some of the struct cr_hdr_socket alignment
  - Improve the sock_copy_buffers() algorithm to avoid locking the source
    queue for the entire operation
  - Fix alignment in the socket header struct(s)
  - Move the per-protocol structure (ckpt_hdr_socket_un) out of the
    common socket header and read/write it separately
  - Fix missing call to sock_cptrst() in restore path
  - Break out the socket joining into another function
  - Fix failure to restore the socket address thus fixing getname()
  - Check the state values on restart
  - Fix case of state being TCP_CLOSE, which allows dgram sockets to be
    properly connected (if appropriate) to their peer and maintain the
    sockaddr for getname() operation
  - Fix restoring a listening socket that has been unlink()'d
  - Fix checkpointing sockets with an in-flight FD-passing SKB.  Fail
    with EBUSY.
  - Fix checkpointing listening sockets with an unaccepted connection.
    Fail with EBUSY.
  - Changed 'un' to 'unix' in function and structure names
Changes in v2:
  - Change GFP_KERNEL to GFP_ATOMIC in sock_copy_buffers() (this seems
    to be rather common in other uses of skb_copy())
  - Move the ckpt_hdr_socket structure definition to linux/socket.h
  - Fix whitespace issue
  - Move sock_file_checkpoint() to net/socket.c for symmetry
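
To illustrate the sign-extension concern behind the v4 signedness change
noted above: if a negative value is stored in an unsigned 32-bit field and
later widened, it comes back as a large positive number.  The helper names
below are invented for the example, which is plain C and unrelated to the
actual header layout.

```c
#include <stdint.h>

/* Widen a 32-bit value after it has passed through an unsigned field. */
static int64_t widen_via_unsigned(int32_t v)
{
	return (int64_t)(uint32_t)v;	/* zero-extends: sign is lost */
}

/* Widen a 32-bit value that was kept in a signed field. */
static int64_t widen_via_signed(int32_t v)
{
	return (int64_t)v;		/* sign-extends: value is preserved */
}
```

Keeping the image fields at the same signedness as the struct sock members
avoids having to reason about this on every save and restore.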

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: netdev@vger.kernel.org
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 fs/checkpoint.c                |    7 +
 include/linux/checkpoint.h     |    8 +
 include/linux/checkpoint_hdr.h |  119 +++++-
 include/linux/net.h            |    2 +
 include/net/af_unix.h          |   15 +
 include/net/sock.h             |   10 +
 kernel/checkpoint/objhash.c    |   26 +-
 net/Makefile                   |    2 +
 net/checkpoint.c               | 1028 ++++++++++++++++++++++++++++++++++++++++
 net/socket.c                   |    6 +-
 net/unix/Makefile              |    1 +
 net/unix/af_unix.c             |    9 +
 net/unix/checkpoint.c          |  646 +++++++++++++++++++++++++
 13 files changed, 1872 insertions(+), 7 deletions(-)
 create mode 100644 net/checkpoint.c
 create mode 100644 net/unix/checkpoint.c

diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index 783c920..23ec4de 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
+#include <net/sock.h>
 
 /**************************************************************************
  * Checkpoint
@@ -619,6 +620,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_FIFO,
 		.restore = fifo_file_restore,
 	},
+	/* socket */
+	{
+		.file_name = "SOCKET",
+		.file_type = CKPT_FILE_SOCKET,
+		.restore = sock_file_restore,
+	},
 };
 
 static void *restore_file(struct ckpt_ctx *ctx)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 549f133..25275af 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -33,6 +33,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <net/sock.h>
 
 /* sycall helpers */
 extern long do_sys_checkpoint(pid_t pid, int fd,
@@ -97,6 +98,13 @@ extern int restore_read_page(struct ckpt_ctx *ctx, struct page *page);
 /* pids */
 extern pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid);
 
+/* socket functions */
+extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
+			      struct socket *socket,
+			      struct sockaddr *loc, unsigned *loc_len,
+			      struct sockaddr *rem, unsigned *rem_len);
+extern struct sk_buff *sock_restore_skb(struct ckpt_ctx *ctx);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index e706636..2be2d2c 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -10,18 +10,22 @@
  *  distribution for more details.
  */
 
-#ifndef __KERNEL__
-#include <sys/types.h>
-#include <linux/types.h>
-#endif
-
 #ifdef __KERNEL__
 #include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/un.h>
 
 #ifndef CONFIG_CHECKPOINT
 #error linux/checkpoint_hdr.h included directly (without CONFIG_CHECKPOINT)
 #endif
 
+#else /* __KERNEL__ */
+
+#include <sys/types.h>
+#include <linux/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+
 #endif
 
 /*
@@ -145,6 +149,17 @@ enum {
 	CKPT_HDR_SIGPENDING,
 #define CKPT_HDR_SIGPENDING CKPT_HDR_SIGPENDING
 
+	CKPT_HDR_SOCKET = 701,
+#define CKPT_HDR_SOCKET CKPT_HDR_SOCKET
+	CKPT_HDR_SOCKET_QUEUE,
+#define CKPT_HDR_SOCKET_QUEUE CKPT_HDR_SOCKET_QUEUE
+	CKPT_HDR_SOCKET_BUFFER,
+#define CKPT_HDR_SOCKET_BUFFER CKPT_HDR_SOCKET_BUFFER
+	CKPT_HDR_SOCKET_FRAG,
+#define CKPT_HDR_SOCKET_FRAG CKPT_HDR_SOCKET_FRAG
+	CKPT_HDR_SOCKET_UNIX,
+#define CKPT_HDR_SOCKET_UNIX CKPT_HDR_SOCKET_UNIX
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -200,6 +215,8 @@ enum obj_type {
 #define CKPT_OBJ_USER CKPT_OBJ_USER
 	CKPT_OBJ_GROUPINFO,
 #define CKPT_OBJ_GROUPINFO CKPT_OBJ_GROUPINFO
+	CKPT_OBJ_SOCK,
+#define CKPT_OBJ_SOCK CKPT_OBJ_SOCK
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -449,6 +466,8 @@ enum file_type {
 #define CKPT_FILE_PIPE CKPT_FILE_PIPE
 	CKPT_FILE_FIFO,
 #define CKPT_FILE_FIFO CKPT_FILE_FIFO
+	CKPT_FILE_SOCKET,
+#define CKPT_FILE_SOCKET CKPT_FILE_SOCKET
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -473,6 +492,96 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));
 
+/* socket */
+struct ckpt_hdr_socket {
+	struct ckpt_hdr h;
+
+	struct { /* struct socket */
+		__u64 flags;
+		__u8 state;
+	} socket __attribute__ ((aligned(8)));
+
+	struct { /* struct sock_common */
+		__u32 bound_dev_if;
+		__u32 reuse;
+		__u16 family;
+		__u8 state;
+	} sock_common __attribute__ ((aligned(8)));
+
+	struct { /* struct sock */
+		__s64 rcvlowat;
+		__u64 flags;
+
+		__s64 rcvtimeo;
+		__s64 sndtimeo;
+
+		__u32 err;
+		__u32 err_soft;
+		__u32 priority;
+		__s32 rcvbuf;
+		__s32 sndbuf;
+		__u16 type;
+		__s16 backlog;
+
+		__u8 protocol;
+		__u8 state;
+		__u8 shutdown;
+		__u8 userlocks;
+		__u8 no_check;
+
+		struct linger linger;
+	} sock __attribute__ ((aligned(8)));
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_queue {
+	struct ckpt_hdr h;
+	__u32 skb_count;
+	__u32 total_bytes;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_buffer {
+	struct ckpt_hdr h;
+	__u32 transport_header;
+	__u32 network_header;
+	__u32 mac_header;
+	__u32 lin_len; /* Length of linear data */
+	__u32 frg_len; /* Length of fragment data */
+	__u32 skb_len; /* Length of skb (adjusted) */
+	__u32 hdr_len; /* Length of skipped header */
+	__u32 mac_len;
+	__u32 data_offset; /* Offset of data pointer from head */
+	__s32 sk_objref;
+	__s32 pr_objref;
+	__u16 protocol;
+	__u16 nr_frags;
+	__u8 cb[48];
+};
+
+struct ckpt_hdr_socket_buffer_frag {
+	struct ckpt_hdr h;
+	__u32 size;
+	__u32 offset;
+};
+
+#define CKPT_UNIX_LINKED 1
+struct ckpt_hdr_socket_unix {
+	struct ckpt_hdr h;
+	__s32 this;
+	__s32 peer;
+	__u32 peercred_uid;
+	__u32 peercred_gid;
+	__u32 flags;
+	__u32 laddr_len;
+	__u32 raddr_len;
+	struct sockaddr_un laddr;
+	struct sockaddr_un raddr;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_file_socket {
+	struct ckpt_hdr_file common;
+	__s32 sock_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/net.h b/include/linux/net.h
index 1f32c70..6ffe827 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -246,6 +246,8 @@ extern int   	     sock_sendmsg(struct socket *sock, struct msghdr *msg,
 				  size_t len);
 extern int	     sock_recvmsg(struct socket *sock, struct msghdr *msg,
 				  size_t size, int flags);
+extern int	     sock_alloc_file(struct socket *sock, struct file **f,
+				     int flags);
 extern int 	     sock_map_fd(struct socket *sock, int flags);
 extern struct socket *sockfd_lookup(int fd, int *err);
 #define		     sockfd_put(sock) fput(sock->file)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 1614d78..ee423d1 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -68,4 +68,19 @@ static inline int unix_sysctl_register(struct net *net) { return 0; }
 static inline void unix_sysctl_unregister(struct net *net) {}
 #endif
 #endif
+
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_socket;
+extern int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+extern int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+			struct ckpt_hdr_socket *h);
+extern int unix_collect(struct ckpt_ctx *ctx, struct socket *sock);
+
+#else
+#define unix_checkpoint NULL
+#define unix_restore NULL
+#define unix_collect NULL
+#endif /* CONFIG_CHECKPOINT */
+
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index 3cf7de4..1c7665a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1713,4 +1713,14 @@ extern int sysctl_optmem_max;
 extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;
 
+#ifdef CONFIG_CHECKPOINT
+/* Checkpoint/Restart Functions */
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+extern struct file *sock_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *h);
+extern int sock_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#endif
+
 #endif	/* _SOCK_H */
diff --git a/kernel/checkpoint/objhash.c b/kernel/checkpoint/objhash.c
index 0fe741b..4960c25 100644
--- a/kernel/checkpoint/objhash.c
+++ b/kernel/checkpoint/objhash.c
@@ -29,7 +29,7 @@ struct ckpt_obj {
 	struct hlist_node next;
 };
 
-/* object internal flags */
+/* object internal flags */
 #define CKPT_OBJ_CHECKPOINTED		0x1   /* object already checkpointed */
 #define CKPT_OBJ_VISITED		0x2   /* object already visited */
 
@@ -481,6 +481,26 @@ static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
  */
 
 /**
+ * obj_sock_adjust_users - remove implicit reference on DEAD sockets
+ * @obj: CKPT_OBJ_SOCK object to adjust
+ *
+ * Sockets that have been disconnected from their struct file have
+ * a reference count one less than normal sockets.  The objhash's
+ * assumption of such a reference is therefore incorrect, so we correct
+ * it here.
+ */
+static inline void obj_sock_adjust_users(struct ckpt_obj *obj)
+{
+	struct sock *sk = (struct sock *)obj->ptr;
+
+	if (sock_flag(sk, SOCK_DEAD)) {
+		obj->users--;
+		ckpt_debug("Adjusting SOCK %i count to %i\n",
+			   obj->objref, obj->users);
+	}
+}
+
+/**
  * ckpt_obj_contained - test if shared objects are contained in checkpoint
  * @ctx: checkpoint context
  *
@@ -505,6 +525,10 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)
 	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
 		if (!obj->ops->ref_users)
 			continue;
+
+		if (obj->ops->obj_type == CKPT_OBJ_SOCK)
+			obj_sock_adjust_users(obj);
+
 		if (obj->ops->ref_users(obj->ptr) != obj->users) {
 			ckpt_err(ctx, -EBUSY,
 				 "%(O)%(P)%(S)Usage leak (%d != %d)\n",
diff --git a/net/Makefile b/net/Makefile
index 1542e72..74b038f 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -65,3 +65,5 @@ ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)		+= sysctl_net.o
 endif
 obj-$(CONFIG_WIMAX)		+= wimax/
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/net/checkpoint.c b/net/checkpoint.c
new file mode 100644
index 0000000..9116d7a
--- /dev/null
+++ b/net/checkpoint.c
@@ -0,0 +1,1028 @@
+/*
+ *  Copyright 2009 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms@us.ibm.com>
+ *             Oren Laadan <orenl@cs.columbia.edu>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/socket.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include <linux/fs_struct.h>
+#include <linux/highmem.h>
+
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+#include <net/tcp.h>
+
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	struct sock *sk;
+};
+
+static int sock_copy_buffers(struct sk_buff_head *from,
+			     struct sk_buff_head *to,
+			     uint32_t *total_bytes)
+{
+	int count1 = 0;
+	int count2 = 0;
+	int i;
+	struct sk_buff *skb;
+	struct sk_buff **skbs;
+
+	*total_bytes = 0;
+
+	spin_lock_bh(&from->lock);
+	skb_queue_walk(from, skb)
+		count1++;
+	spin_unlock_bh(&from->lock);
+
+	skbs = kcalloc(count1, sizeof(*skbs), GFP_KERNEL);
+	if (!skbs)
+		return -ENOMEM;
+
+	for (i = 0; i < count1;  i++) {
+		skbs[i] = dev_alloc_skb(0);
+		if (!skbs[i])
+			goto err;
+	}
+
+	i = 0;
+	spin_lock_bh(&from->lock);
+	skb_queue_walk(from, skb) {
+		if (++count2 > count1)
+			break; /* The queue changed as we read it */
+
+		skb_morph(skbs[i], skb);
+		skbs[i]->sk = skb->sk;
+		skb_queue_tail(to, skbs[i]);
+
+		*total_bytes += skb->len;
+		i++;
+	}
+	spin_unlock_bh(&from->lock);
+
+	if (count1 != count2)
+		goto err;
+
+	kfree(skbs);
+
+	return count1;
+ err:
+	while (skb_dequeue(to))
+		; /* Pull all the buffers out of the queue */
+	for (i = 0; i < count1; i++)
+		kfree_skb(skbs[i]);
+	kfree(skbs);
+
+	return -EAGAIN;
+}
+
+static void sock_record_header_info(struct sk_buff *skb,
+				    struct ckpt_hdr_socket_buffer *h)
+{
+
+	h->mac_len = skb->mac_len;
+	h->skb_len = skb->len;
+	h->hdr_len = skb->data - skb->head;
+	h->frg_len = skb->data_len;
+	h->data_offset = (skb->data - skb->head);
+
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	h->transport_header = skb->transport_header;
+	h->network_header = skb->network_header;
+	h->mac_header = skb->mac_header;
+	h->lin_len = (unsigned long) skb->tail;
+#else
+	h->transport_header = skb->transport_header - skb->head;
+	h->network_header = skb->network_header - skb->head;
+	h->mac_header = skb->mac_header - skb->head;
+	h->lin_len = ((unsigned long) skb->tail - (unsigned long) skb->head);
+#endif
+
+	memcpy(h->cb, skb->cb, sizeof(skb->cb));
+	h->nr_frags = skb_shinfo(skb)->nr_frags;
+}
+
+int sock_restore_header_info(struct ckpt_ctx *ctx,
+			     struct sk_buff *skb,
+			     struct ckpt_hdr_socket_buffer *h)
+{
+	if (h->mac_header + h->mac_len != h->network_header) {
+		ckpt_err(ctx, -EINVAL,
+			 "skb mac_header %u+%u != network header %u\n",
+			 h->mac_header, h->mac_len, h->network_header);
+		return -EINVAL;
+	}
+
+	if (h->network_header > h->lin_len) {
+		ckpt_err(ctx, -EINVAL,
+			 "skb network header %u > linear length %u\n",
+			 h->network_header, h->lin_len);
+		return -EINVAL;
+	}
+
+	if (h->transport_header > h->lin_len) {
+		ckpt_err(ctx, -EINVAL,
+			 "skb transport header %u > linear length %u\n",
+			 h->transport_header, h->lin_len);
+		return -EINVAL;
+	}
+
+	if (h->data_offset > h->lin_len) {
+		ckpt_err(ctx, -EINVAL,
+			 "skb data offset %u > linear length %u\n",
+			 h->data_offset, h->lin_len);
+		return -EINVAL;
+	}
+
+	if (h->skb_len > SKB_MAX_ALLOC) {
+		ckpt_err(ctx, -EINVAL,
+			 "skb total length %u larger than max of %lu\n",
+			 h->skb_len, SKB_MAX_ALLOC);
+		return -EINVAL;
+	}
+
+	skb_set_transport_header(skb, h->transport_header);
+	skb_set_network_header(skb, h->network_header);
+	skb_set_mac_header(skb, h->mac_header);
+	skb->mac_len = h->mac_len;
+
+	/* FIXME: This should probably be sanitized per-protocol to
+	 * make sure nothing bad happens if it is hijacked.  For the
+	 * current set of protocols that we restore this way, the data
+	 * contained within is not very risky (flags and sequence
+	 * numbers) but could still be evaluated from a
+	 * could-the-user-have-set-these-flags point of view.
+	 */
+	memcpy(skb->cb, h->cb, sizeof(skb->cb));
+
+	skb->data = skb->head + h->data_offset;
+	skb->len = h->skb_len;
+
+	return 0;
+}
+
+static int sock_restore_skb_frag(struct ckpt_ctx *ctx,
+				 struct sk_buff *skb,
+				 int frag_idx)
+{
+	struct ckpt_hdr_socket_buffer_frag *h;
+	struct page *page;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_FRAG);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read buffer object\n");
+		return PTR_ERR(h);
+	}
+
+	if ((h->size > PAGE_SIZE) || (h->offset >= PAGE_SIZE)) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret, "skb frag size=%i,offset=%i > PAGE_SIZE\n",
+			 h->size, h->offset);
+		goto out;
+	}
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = restore_read_page(ctx, page);
+	if (ret) {
+		ckpt_err(ctx, ret, "failed to read fragment: %i\n", ret);
+		__free_page(page);
+	} else {
+		ckpt_debug("read %i+%i for fragment %i\n",
+			   h->offset, h->size, frag_idx);
+		skb_add_rx_frag(skb, frag_idx, page, h->offset, h->size);
+		ret = h->size;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+struct sk_buff *sock_restore_skb(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	struct sk_buff *skb = NULL;
+	int i, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+	if (IS_ERR(h))
+		return (struct sk_buff *)h;
+
+	ret = -ENOSPC;
+	if (h->lin_len > SKB_MAX_ALLOC) {
+		ckpt_err(ctx, ret, "socket linear buffer too big (%u > %lu)\n",
+			 h->lin_len, SKB_MAX_ALLOC);
+		goto out;
+	} else if (h->frg_len > SKB_MAX_ALLOC) {
+		ckpt_err(ctx, ret, "socket frag size too big (%u > %lu)\n",
+			 h->frg_len, SKB_MAX_ALLOC);
+		goto out;
+	} else if (h->nr_frags >= MAX_SKB_FRAGS) {
+		ckpt_err(ctx, ret, "socket frag count too big (%u > %lu)\n",
+			 h->nr_frags, MAX_SKB_FRAGS);
+		goto out;
+	}
+
+	skb = alloc_skb(h->lin_len, GFP_KERNEL);
+	if (!skb) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = _ckpt_read_obj_type(ctx, skb_put(skb, h->lin_len),
+				  h->lin_len, CKPT_HDR_BUFFER);
+	ckpt_debug("read linear skb length %u: %i\n", h->lin_len, ret);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < h->nr_frags; i++) {
+		ret = sock_restore_skb_frag(ctx, skb, i);
+		ckpt_debug("read skb frag %i/%i: %i\n",
+			   i + 1, h->nr_frags, ret);
+		if (ret < 0)
+			goto out;
+		h->frg_len -= ret;
+	}
+
+	if (h->frg_len != 0) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret, "length %u remaining after reading frags\n",
+			 h->frg_len);
+		goto out;
+	}
+
+	ret = sock_restore_header_info(ctx, skb, h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0) {
+		kfree_skb(skb);
+		skb = ERR_PTR(ret);
+	}
+
+	return skb;
+}
+
+static int __sock_write_skb_frag(struct ckpt_ctx *ctx, skb_frag_t *frag)
+{
+	struct ckpt_hdr_socket_buffer_frag *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_FRAG);
+	if (!h)
+		return -ENOMEM;
+
+	h->size = frag->size;
+	h->offset = frag->page_offset;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *)h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ret = checkpoint_dump_page(ctx, frag->page);
+	ckpt_debug("writing frag page: %i\n", ret);
+	return ret;
+}
+
+static int __sock_write_skb(struct ckpt_ctx *ctx,
+			    struct sk_buff *skb,
+			    int dst_objref)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	int ret = 0;
+	int i;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+	if (!h)
+		return -ENOMEM;
+
+	if (dst_objref > 0) {
+		BUG_ON(!skb->sk);
+		ret = checkpoint_obj(ctx, skb->sk, CKPT_OBJ_SOCK);
+		if (ret < 0)
+			goto out;
+		h->sk_objref = ret;
+		h->pr_objref = dst_objref;
+	}
+
+	sock_record_header_info(skb, h);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj_type(ctx, skb->head, h->lin_len, CKPT_HDR_BUFFER);
+	ckpt_debug("writing skb linear region %u: %i\n", h->lin_len, ret);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+		ret = __sock_write_skb_frag(ctx, frag);
+		ckpt_debug("writing buffer fragment %i/%i (%i)\n",
+			   i + 1, h->nr_frags, ret);
+		if (ret < 0)
+			goto out;
+		h->frg_len -= frag->size;
+	}
+
+	WARN_ON(h->frg_len != 0);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int __sock_write_buffers(struct ckpt_ctx *ctx,
+				struct sk_buff_head *queue,
+				uint16_t family,
+				int dst_objref)
+{
+	struct sk_buff *skb;
+
+	skb_queue_walk(queue, skb) {
+		int ret = 0;
+
+		if (UNIXCB(skb).fp) {
+			ckpt_err(ctx, -EBUSY, "%(T)af_unix: pass fd\n");
+			return -EBUSY;
+		}
+
+		/* Unlike descriptors, the other UNIX ancillary
+		 * messages are always present.  Even though we can't
+		 * detect them and fail the checkpoint, we're not at
+		 * risk because we don't restore the control
+		 * information in the UNIX code.
+		 */
+
+		ret = __sock_write_skb(ctx, skb, dst_objref);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int sock_write_buffers(struct ckpt_ctx *ctx,
+			      struct sk_buff_head *queue,
+			      uint16_t family,
+			      int dst_objref)
+{
+	struct ckpt_hdr_socket_queue *h;
+	struct sk_buff_head tmpq;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (!h)
+		return -ENOMEM;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &h->total_bytes);
+	if (ret < 0)
+		goto out;
+
+	h->skb_count = ret;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (!ret)
+		ret = __sock_write_buffers(ctx, &tmpq, family, dst_objref);
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_deferred_write_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	int ret;
+	int dst_objref;
+
+	dst_objref = ckpt_obj_lookup(ctx, dq->sk, CKPT_OBJ_SOCK);
+	if (dst_objref < 0) {
+		ckpt_err(ctx, dst_objref, "%(T)socket: owner gone?\n");
+		return dst_objref;
+	}
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_receive_queue,
+				 dq->sk->sk_family, dst_objref);
+	ckpt_debug("write recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_write_queue,
+				 dq->sk->sk_family, dst_objref);
+	ckpt_debug("write send buffers: %i\n", ret);
+
+	return ret;
+}
+
+int sock_defer_write_buffers(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk = sk;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      sock_deferred_write_buffers, NULL);
+}
+
+int ckpt_sock_getnames(struct ckpt_ctx *ctx, struct socket *sock,
+		       struct sockaddr *loc, unsigned *loc_len,
+		       struct sockaddr *rem, unsigned *rem_len)
+{
+	int ret;
+
+	ret = sock_getname(sock, loc, loc_len);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)%(P)socket: getname local\n", sock);
+		return -EINVAL;
+	}
+
+	ret = sock_getpeer(sock, rem, rem_len);
+	if (ret) {
+		if ((sock->sk->sk_type != SOCK_DGRAM) &&
+		    (sock->sk->sk_state == TCP_ESTABLISHED)) {
+			ckpt_err(ctx, ret, "%(T)%(P)socket: getname peer\n",
+				       sock);
+			return -EINVAL;
+		}
+		*rem_len = 0;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst_verify(struct ckpt_hdr_socket *h)
+{
+	uint8_t userlocks_mask =
+		SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK |
+		SOCK_BINDADDR_LOCK | SOCK_BINDPORT_LOCK;
+
+	if (h->sock.shutdown & ~SHUTDOWN_MASK)
+		return -EINVAL;
+	if (h->sock.userlocks & ~userlocks_mask)
+		return -EINVAL;
+	if (!ckpt_validate_errno(h->sock.err))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sock_cptrst_opt(int op, struct socket *sock,
+			   int optname, char *opt, int len)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (op == CKPT_CPT)
+		ret = sock_getsockopt(sock, SOL_SOCKET, optname, opt, &len);
+	else
+		ret = sock_setsockopt(sock, SOL_SOCKET, optname, opt, len);
+
+	set_fs(fs);
+
+	return ret;
+}
+
+#define CKPT_COPY_SOPT(op, sk, name, opt) \
+	sock_cptrst_opt(op, sk->sk_socket, name, (char *)opt, sizeof(*opt))
+
+static int sock_cptrst_bufopts(int op, struct sock *sk,
+			       struct ckpt_hdr_socket *h)
+{
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVBUF, &h->sock.rcvbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_RCVBUFFORCE, &h->sock.rcvbuf)) {
+			ckpt_debug("Failed to set SO_RCVBUF\n");
+			return -EINVAL;
+		}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_SNDBUF, &h->sock.sndbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_SNDBUFFORCE, &h->sock.sndbuf)) {
+			ckpt_debug("Failed to set SO_SNDBUF\n");
+			return -EINVAL;
+		}
+
+	/* It's silly that we have to fight ourselves here, but
+	 * sock_setsockopt() doubles the initial value, so divide here
+	 * to store the user's value and avoid doubling on restart
+	 */
+	if ((op == CKPT_CPT) && (h->sock.rcvbuf != SOCK_MIN_RCVBUF))
+		h->sock.rcvbuf >>= 1;
+
+	if ((op == CKPT_CPT) && (h->sock.sndbuf != SOCK_MIN_SNDBUF))
+		h->sock.sndbuf >>= 1;
+
+	return 0;
+}
+
+struct sock_flag_mapping {
+	int opt;
+	int flag;
+};
+
+struct sock_flag_mapping sk_flag_map[] = {
+	{SO_OOBINLINE, SOCK_URGINLINE},
+	{SO_KEEPALIVE, SOCK_KEEPOPEN},
+	{SO_BROADCAST, SOCK_BROADCAST},
+	{SO_TIMESTAMP, SOCK_RCVTSTAMP},
+	{SO_TIMESTAMPNS, SOCK_RCVTSTAMPNS},
+	{SO_DEBUG, SOCK_DBG},
+	{SO_DONTROUTE, SOCK_LOCALROUTE},
+};
+
+struct sock_flag_mapping sock_flag_map[] = {
+	{SO_PASSCRED, SOCK_PASSCRED},
+};
+
+static int sock_restore_flag(struct socket *sock,
+			     unsigned long *flags,
+			     int flag,
+			     int option)
+{
+	int v = 1;
+	int ret = 0;
+
+	if (test_and_clear_bit(flag, flags))
+		ret = sock_setsockopt(sock, SOL_SOCKET, option,
+				      (char *)&v, sizeof(v));
+
+	return ret;
+}
+
+
+static int sock_restore_flags(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	unsigned long sk_flags = h->sock.flags;
+	unsigned long sock_flags = h->socket.flags;
+	int ret;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(sk_flag_map); i++) {
+		int opt = sk_flag_map[i].opt;
+		int flag = sk_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sk_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set skopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(sock_flag_map); i++) {
+		int opt = sock_flag_map[i].opt;
+		int flag = sock_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sock_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set sockopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	/* TODO: Handle SOCK_TIMESTAMPING_* flags */
+	if (test_bit(SOCK_TIMESTAMPING_TX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_TX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RAW_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SYS_HARDWARE, &sk_flags)) {
+		ckpt_debug("SOF_TIMESTAMPING_* flags are not supported\n");
+		return -ENOSYS;
+	}
+
+	if (test_and_clear_bit(SOCK_DEAD, &sk_flags))
+		sock_set_flag(sock->sk, SOCK_DEAD);
+
+
+	/* Anything that is still set in the flags that isn't part of
+	 * our protocol's default set, indicates an error
+	 */
+	if (sk_flags & ~sock->sk->sk_flags) {
+		ckpt_debug("Unhandled sock flags: %lx\n", sk_flags);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_copy_timeval(int op, struct sock *sk,
+			     int sockopt, __s64 *saved)
+{
+	struct timeval tv;
+
+	if (op == CKPT_CPT) {
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+		*saved = timeval_to_ns(&tv);
+	} else {
+		tv = ns_to_timeval(*saved);
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst(struct ckpt_ctx *ctx, struct sock *sk,
+		       struct ckpt_hdr_socket *h, int op)
+{
+	if (sk->sk_socket)
+		CKPT_COPY(op, h->socket.state, sk->sk_socket->state);
+
+	CKPT_COPY(op, h->sock_common.bound_dev_if, sk->sk_bound_dev_if);
+	CKPT_COPY(op, h->sock_common.family, sk->sk_family);
+
+	CKPT_COPY(op, h->sock.shutdown, sk->sk_shutdown);
+	CKPT_COPY(op, h->sock.userlocks, sk->sk_userlocks);
+	CKPT_COPY(op, h->sock.no_check, sk->sk_no_check);
+	CKPT_COPY(op, h->sock.protocol, sk->sk_protocol);
+	CKPT_COPY(op, h->sock.err, sk->sk_err);
+	CKPT_COPY(op, h->sock.err_soft, sk->sk_err_soft);
+	CKPT_COPY(op, h->sock.type, sk->sk_type);
+	CKPT_COPY(op, h->sock.state, sk->sk_state);
+	CKPT_COPY(op, h->sock.backlog, sk->sk_max_ack_backlog);
+
+	if (sock_cptrst_bufopts(op, sk, h))
+		return -EINVAL;
+
+	if (CKPT_COPY_SOPT(op, sk, SO_REUSEADDR, &h->sock_common.reuse)) {
+		ckpt_err(ctx, -EINVAL, "Failed to set SO_REUSEADDR");
+
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_PRIORITY, &h->sock.priority)) {
+		ckpt_err(ctx, -EINVAL, "Failed to set SO_PRIORITY");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVLOWAT, &h->sock.rcvlowat)) {
+		ckpt_err(ctx, -EINVAL, "Failed to set SO_RCVLOWAT");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_LINGER, &h->sock.linger)) {
+		ckpt_err(ctx, -EINVAL, "Failed to set SO_LINGER");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_SNDTIMEO, &h->sock.sndtimeo)) {
+		ckpt_err(ctx, -EINVAL, "Failed to set SO_SNDTIMEO");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_RCVTIMEO, &h->sock.rcvtimeo)) {
+		ckpt_err(ctx, -EINVAL, "Failed to set SO_RCVTIMEO");
+		return -EINVAL;
+	}
+
+	if (op == CKPT_CPT) {
+		h->sock.flags = sk->sk_flags;
+		h->socket.flags = sk->sk_socket->flags;
+	} else {
+		int ret;
+		mm_segment_t old_fs;
+
+		old_fs = get_fs();
+		set_fs(KERNEL_DS);
+		ret = sock_restore_flags(sk->sk_socket, h);
+		set_fs(old_fs);
+		if (ret)
+			return ret;
+	}
+
+	if ((h->socket.state == SS_CONNECTED) &&
+	    (h->sock.state != TCP_ESTABLISHED)) {
+		ckpt_err(ctx, -EINVAL, "sock/et in inconsistent state: %i/%i",
+			 h->socket.state, h->sock.state);
+		return -EINVAL;
+	} else if ((h->sock.state < TCP_ESTABLISHED) ||
+		   (h->sock.state >= TCP_MAX_STATES)) {
+		ckpt_err(ctx, -EINVAL,
+			 "sock in invalid state: %i", h->sock.state);
+		return -EINVAL;
+	} else if (h->socket.state > SS_DISCONNECTING) {
+		ckpt_err(ctx, -EINVAL, "socket in invalid state: %i",
+			 h->socket.state);
+		return -EINVAL;
+	}
+
+	if (op == CKPT_RST)
+		return sock_cptrst_verify(h);
+	else
+		return 0;
+}
+
+static int __do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+	struct ckpt_hdr_socket *h;
+	int ret;
+
+	if (!sock->ops->checkpoint) {
+		ckpt_err(ctx, -ENOSYS, "%(T)%(V)%(P)socket: proto_ops\n",
+			       sock->ops, sock);
+		return -ENOSYS;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (!h)
+		return -ENOMEM;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sk, h, CKPT_CPT);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	/* part II: per socket type state */
+	ret = sock->ops->checkpoint(ctx, sock);
+	if (ret < 0)
+		goto out;
+
+	/* part III: socket buffers */
+	if ((sk->sk_state != TCP_LISTEN) && (!sock_flag(sk, SOCK_DEAD)))
+		ret = sock_defer_write_buffers(ctx, sk);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct sock *sk = ptr;
+	struct socket *sock;
+	int ret;
+
+	if (sk->sk_socket)
+		return __do_sock_checkpoint(ctx, sk);
+
+	/* Temporarily adopt this orphan socket */
+	ret = sock_create(sk->sk_family, sk->sk_type, 0, &sock);
+	if (ret < 0)
+		return ret;
+	sock_graft(sk, sock);
+
+	ret = __do_sock_checkpoint(ctx, sk);
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+	sock_release(sock);
+
+	return ret;
+}
+
+int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_socket *h;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_SOCKET;
+
+	h->sock_objref = checkpoint_obj(ctx, sk, CKPT_OBJ_SOCK);
+	if (h->sock_objref < 0) {
+		ret = h->sock_objref;
+		goto out;
+	}
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int sock_collect_skbs(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
+{
+	struct sk_buff_head tmpq;
+	struct sk_buff *skb;
+	int ret = 0;
+	uint32_t bytes;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &bytes);
+	if (ret < 0)
+		return ret;
+
+	skb_queue_walk(&tmpq, skb) {
+		/* Socket buffers do not maintain a ref count on their
+		 * owning sock because they're counted in sock_wmem_alloc.
+		 * So, we only need to collect sockets from the queue that
+		 * won't be collected any other way (i.e. DEAD sockets that
+		 * are hanging around only because they're waiting for us
+		 * to process their skb).
+		 */
+
+		if (!ckpt_obj_lookup(ctx, skb->sk, CKPT_OBJ_SOCK) &&
+		    sock_flag(skb->sk, SOCK_DEAD)) {
+			ret = ckpt_obj_collect(ctx, skb->sk, CKPT_OBJ_SOCK);
+			if (ret < 0)
+				break;
+		}
+	}
+
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_write_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_receive_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = ckpt_obj_collect(ctx, sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sock->ops->collect)
+		ret = sock->ops->collect(ctx, sock);
+
+	return ret;
+}
+
+static void *restore_sock(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_socket *h;
+	struct socket *sock;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	/* silently clear flags, e.g. SOCK_NONBLOCK or SOCK_CLOEXEC */
+	h->sock.type &= SOCK_TYPE_MASK;
+
+	ret = sock_create(h->sock_common.family, h->sock.type,
+			  h->sock.protocol, &sock);
+	if (ret < 0)
+		goto err;
+
+	if (!sock->ops->restore) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret, "proto_ops lacks restore %pS\n", sock->ops);
+		goto err;
+	}
+
+	/*
+	 * part II: per socket type state
+	 * (also takes care of part III: socket buffer)
+	 */
+	ret = sock->ops->restore(ctx, sock, h);
+	if (ret < 0)
+		goto err;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sock->sk, h, CKPT_RST);
+	if (ret < 0)
+		goto err;
+
+	ckpt_hdr_put(ctx, h);
+	return sock->sk;
+ err:
+	ckpt_hdr_put(ctx, h);
+	sock_release(sock);
+	return ERR_PTR(ret);
+}
+
+struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_socket *h = (struct ckpt_hdr_file_socket *)ptr;
+	struct sock *sk;
+	struct file *file;
+	int fd, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE || ptr->f_type != CKPT_FILE_SOCKET)
+		return ERR_PTR(-EINVAL);
+
+	sk = ckpt_obj_fetch(ctx, h->sock_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk))
+		return ERR_PTR(PTR_ERR(sk));
+
+	fd = sock_alloc_file(sk->sk_socket, &file, O_RDWR);
+	if (fd < 0)
+		return ERR_PTR(fd);
+	put_unused_fd(fd); /* We'll let the checkpoint code re-allocate this */
+
+	/* Since objhash assumes the initial reference for a socket,
+	 * we bump it here for this descriptor, unlike other places in
+	 * the socket code which assume the descriptor is the owner.
+	 */
+	sock_hold(sk);
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		return ERR_PTR(ret);
+	}
+
+	return file;
+}
+
+/*
+ * sock-related checkpoint objects
+ */
+
+static int obj_sock_grab(void *ptr)
+{
+	sock_hold((struct sock *) ptr);
+	return 0;
+}
+
+static void obj_sock_drop(void *ptr, int lastref)
+{
+	struct sock *sk = (struct sock *) ptr;
+
+	/*
+	 * Sockets created during restart are graft()ed, i.e. have a
+	 * valid @sk->sk_socket. Because only an fput() results in the
+	 * necessary sock_release(), we may leak the struct socket of
+	 * sockets that were not attached to a file. Therefore, if
+	 * @lastref is set, we hereby invoke sock_release() on sockets
+	 * that we have put into the objhash but were never attached
+	 * to a file.
+	 */
+	if (lastref && sk->sk_socket && !sk->sk_socket->file) {
+		struct socket *sock = sk->sk_socket;
+		sock_orphan(sk);
+		sock->sk = NULL;
+		sock_release(sock);
+	}
+
+	sock_put((struct sock *) ptr);
+}
+
+static int obj_sock_users(void *ptr)
+{
+	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
+}
+
+/* sock object */
+static const struct ckpt_obj_ops ckpt_obj_sock_ops = {
+	.obj_name = "SOCKET",
+	.obj_type = CKPT_OBJ_SOCK,
+	.ref_drop = obj_sock_drop,
+	.ref_grab = obj_sock_grab,
+	.ref_users = obj_sock_users,
+	.checkpoint = checkpoint_sock,
+	.restore = restore_sock,
+};
+
+static int __init checkpoint_register_sock(void)
+{
+	return register_checkpoint_obj(&ckpt_obj_sock_ops);
+}
+module_init(checkpoint_register_sock);
diff --git a/net/socket.c b/net/socket.c
index b9f421b..da2864f 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -148,6 +148,10 @@ static const struct file_operations socket_file_ops = {
 	.sendpage =	sock_sendpage,
 	.splice_write = generic_splice_sendpage,
 	.splice_read =	sock_splice_read,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint =   sock_file_checkpoint,
+	.collect = sock_file_collect,
+#endif
 };
 
 /*
@@ -343,7 +347,7 @@ static const struct dentry_operations sockfs_dentry_operations = {
  *	but we take care of internal coherence yet.
  */
 
-static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
+int sock_alloc_file(struct socket *sock, struct file **f, int flags)
 {
 	struct qstr name = { .name = "" };
 	struct path path;
diff --git a/net/unix/Makefile b/net/unix/Makefile
index b852a2b..fbff1e6 100644
--- a/net/unix/Makefile
+++ b/net/unix/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_UNIX)	+= unix.o
 
 unix-y			:= af_unix.o garbage.o
 unix-$(CONFIG_SYSCTL)	+= sysctl_net_unix.o
+unix-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3d9122e..a7d0cff 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -523,6 +523,9 @@ static const struct proto_ops unix_stream_ops = {
 	.recvmsg =	unix_stream_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static const struct proto_ops unix_dgram_ops = {
@@ -544,6 +547,9 @@ static const struct proto_ops unix_dgram_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static const struct proto_ops unix_seqpacket_ops = {
@@ -565,6 +571,9 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static struct proto unix_proto = {
diff --git a/net/unix/checkpoint.c b/net/unix/checkpoint.c
new file mode 100644
index 0000000..c90a497
--- /dev/null
+++ b/net/unix/checkpoint.c
@@ -0,0 +1,646 @@
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/fs_struct.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/user.h>
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+struct dq_join {
+	struct ckpt_ctx *ctx;
+	int src_objref;
+	int dst_objref;
+};
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	int sk_objref; /* objref of the socket these buffers belong to */
+};
+
+#define UNIX_ADDR_EMPTY(a) (a <= sizeof(short))
+
+static inline int unix_need_cwd(struct sockaddr_un *addr, unsigned long len)
+{
+	return (!UNIX_ADDR_EMPTY(len)) &&
+		addr->sun_path[0] &&
+		(addr->sun_path[0] != '/');
+}
+
+static int unix_join(struct sock *src, struct sock *dst)
+{
+	if (unix_sk(src)->peer != NULL)
+		return 0; /* We're second */
+
+	sock_hold(dst);
+	unix_sk(src)->peer = dst;
+
+	return 0;
+
+}
+
+static int unix_deferred_join(void *data)
+{
+	struct dq_join *dq = (struct dq_join *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *src;
+	struct sock *dst;
+
+	src = ckpt_obj_fetch(ctx, dq->src_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(src)) {
+		ckpt_err(ctx, -EINVAL, "%(O)Bad src sock\n", dq->src_objref);
+		return -EINVAL;
+	}
+
+	dst = ckpt_obj_fetch(ctx, dq->dst_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(dst)) {
+		ckpt_err(ctx, -EINVAL, "%(O)Bad dst sock\n", dq->dst_objref);
+		return -EINVAL;
+	}
+
+	return unix_join(src, dst);
+}
+
+static int unix_defer_join(struct ckpt_ctx *ctx,
+			   int src_objref,
+			   int dst_objref)
+{
+	struct dq_join dq;
+
+	dq.ctx = ctx;
+	dq.src_objref = src_objref;
+	dq.dst_objref = dst_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_join, NULL);
+}
+
+static int unix_write_cwd(struct ckpt_ctx *ctx,
+			  struct sock *sk, const char *sockpath)
+{
+	struct path path;
+	char *buf;
+	char *fqpath;
+	int offset;
+	int len = PATH_MAX;
+	int ret = -ENOENT;
+
+	buf = kmalloc(len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	path.dentry = unix_sk(sk)->dentry;
+	path.mnt = unix_sk(sk)->mnt;
+
+	fqpath = ckpt_fill_fname(&path, &ctx->root_fs_path, buf, &len);
+	if (IS_ERR(fqpath)) {
+		ret = PTR_ERR(fqpath);
+		goto out;
+	}
+
+	offset = strlen(fqpath) - strlen(sockpath);
+	if (offset <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fqpath[offset] = '\0';
+
+	ckpt_debug("writing socket directory: %s\n", fqpath);
+	ret = ckpt_write_string(ctx, fqpath, offset + 1);
+ out:
+	kfree(buf);
+	return ret;
+}
+
+int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	struct ckpt_hdr_socket_unix *un;
+	int new;
+	int ret = -ENOMEM;
+
+	if ((sock->sk->sk_state == TCP_LISTEN) &&
+	    !skb_queue_empty(&sock->sk->sk_receive_queue)) {
+		ckpt_err(ctx, -EBUSY,
+			 "%(T)%(E)%(P)af_unix: listen with pending peers\n",
+			 sock);
+		return -EBUSY;
+	}
+
+	un = ckpt_hdr_get_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (!un)
+		return -ENOMEM;
+
+	ret = ckpt_sock_getnames(ctx, sock,
+				 (struct sockaddr *)&un->laddr, &un->laddr_len,
+				 (struct sockaddr *)&un->raddr, &un->raddr_len);
+	if (ret)
+		goto out;
+
+	if (sk->dentry && (sk->dentry->d_inode->i_nlink > 0))
+		un->flags |= CKPT_UNIX_LINKED;
+
+	un->this = ckpt_obj_lookup_add(ctx, sk, CKPT_OBJ_SOCK, &new);
+	if (un->this < 0)
+		goto out;
+
+	if (sk->peer)
+		un->peer = checkpoint_obj(ctx, sk->peer, CKPT_OBJ_SOCK);
+	else
+		un->peer = 0;
+
+	if (un->peer < 0) {
+		ret = un->peer;
+		goto out;
+	}
+
+	un->peercred_uid = sock->sk->sk_peercred.uid;
+	un->peercred_gid = sock->sk->sk_peercred.gid;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) un);
+	if (ret < 0)
+		goto out;
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len))
+		ret = unix_write_cwd(ctx, sock->sk, un->laddr.sun_path);
+ out:
+	ckpt_hdr_put(ctx, un);
+
+	return ret;
+}
+
+int unix_collect(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, sock->sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sk->peer)
+		ret = ckpt_obj_collect(ctx, sk->peer, CKPT_OBJ_SOCK);
+
+	return (ret < 0) ? ret : 0;
+}
+
+static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx,
+				    struct sockaddr *addr,
+				    unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	struct sock *sk;
+	struct msghdr msg;
+	struct kvec kvec;
+	uint8_t sock_shutdown;
+	uint8_t peer_shutdown = 0;
+	void *buf = NULL;
+	int sndbuf;
+	int ret;
+
+	memset(&msg, 0, sizeof(msg));
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->lin_len > SKB_MAX_ALLOC) {
+		ckpt_err(ctx, ret, "socket buffer too big (%u > %lu)\n",
+			 h->lin_len, SKB_MAX_ALLOC);
+		goto out;
+	} else if (h->nr_frags != 0) {
+		ckpt_err(ctx, ret, "unix socket claims to have fragments\n");
+		goto out;
+	}
+
+	buf = kmalloc(h->lin_len, GFP_KERNEL);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	kvec.iov_len = h->lin_len;
+	kvec.iov_base = buf;
+	ret = _ckpt_read_obj_type(ctx, kvec.iov_base,
+				  h->lin_len, CKPT_HDR_BUFFER);
+	ckpt_debug("read unix socket buffer %u: %i\n", h->lin_len, ret);
+	if (ret < h->lin_len) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	sk = ckpt_obj_fetch(ctx, h->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ret = PTR_ERR(sk);
+		goto out;
+	}
+
+	/* If we have neither a destination address nor a peer yet,
+	 * this skb must belong to a connected socket, so join with
+	 * the peer recorded in the buffer header before sending.
+	 */
+	if (!addrlen && !unix_sk(sk)->peer) {
+		struct sock *pr;
+		pr = ckpt_obj_fetch(ctx, h->pr_objref, CKPT_OBJ_SOCK);
+		if (IS_ERR(pr)) {
+			ret = PTR_ERR(pr);
+			ckpt_err(ctx, ret, "Failed to fetch peer\n");
+			goto out;
+		}
+		ret = unix_join(sk, pr);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to join sockets\n");
+			goto out;
+		}
+	}
+
+	msg.msg_name = addr;
+	msg.msg_namelen = addrlen;
+
+	/* If the socket is shut down, clear that while we send */
+	sock_shutdown = sk->sk_shutdown;
+	sk->sk_shutdown &= ~SHUTDOWN_MASK;
+
+	/* Unshutdown peer too, if necessary */
+	if (unix_sk(sk)->peer) {
+		peer_shutdown = unix_sk(sk)->peer->sk_shutdown;
+		unix_sk(sk)->peer->sk_shutdown &= ~SHUTDOWN_MASK;
+	}
+
+	/* Make sure there's room in the send buffer: Worst case, we
+	 * give them the benefit of the doubt and set the buffer limit
+	 * to sysctl_wmem_max.  This should cover the case where the
+	 * user set the limit low after loading up the buffer.
+	 *
+	 * However, if there isn't room in the buffer and
+	 * sysctl_wmem_max won't accommodate them either, then increase
+	 * the limit as needed, only if they have CAP_NET_ADMIN.
+	 */
+	sndbuf = sk->sk_sndbuf;
+	if (((sk->sk_sndbuf - atomic_read(&sk->sk_wmem_alloc)) < h->lin_len) &&
+	    (h->lin_len > sysctl_wmem_max) &&
+	    capable(CAP_NET_ADMIN))
+		sk->sk_sndbuf += h->lin_len;
+	else
+		sk->sk_sndbuf = sysctl_wmem_max;
+
+	ret = kernel_sendmsg(sk->sk_socket, &msg, &kvec, 1, h->lin_len);
+	ckpt_debug("kernel_sendmsg(%i,%u): %i\n",
+		   h->sk_objref, h->lin_len, ret);
+	if ((ret > 0) && (ret != h->lin_len))
+		ret = -ENOMEM;
+
+	sk->sk_sndbuf = sndbuf;
+	sk->sk_shutdown = sock_shutdown;
+	if (peer_shutdown)
+		unix_sk(sk)->peer->sk_shutdown = peer_shutdown;
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(buf);
+	return ret;
+}
+
+static int unix_read_buffers(struct ckpt_ctx *ctx,
+			     struct sockaddr *addr,
+			     unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_queue *h;
+	int ret = 0;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	for (i = 0; i < h->skb_count; i++) {
+		ret = sock_read_buffer_sendmsg(ctx, addr, addrlen);
+		ckpt_debug("read_buffer_sendmsg(%i): %i\n", i, ret);
+		if (ret < 0)
+			goto out;
+
+		if (ret > h->total_bytes) {
+			ret = -EINVAL;
+			ckpt_err(ctx, ret, "Buffers exceeded claim\n");
+			goto out;
+		}
+
+		h->total_bytes -= ret;
+	}
+
+	ret = h->skb_count;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
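+/* Runs from the deferqueue: read the socket's queued receive buffers
+ * from the checkpoint image and re-queue them via kernel_sendmsg().
+ * The send queue record is read too, but must be empty for UNIX sockets.
+ */
+static int unix_deferred_restore_buffers(void *data)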
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *sk;
+	struct sockaddr *addr = NULL;
+	unsigned int addrlen = 0;
+	int ret;
+
+	sk = ckpt_obj_fetch(ctx, dq->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ckpt_err(ctx, PTR_ERR(sk), "%(O) missing sock\n",
+			 dq->sk_objref);
+		return PTR_ERR(sk);
+	}
+
+	if ((sk->sk_type == SOCK_DGRAM) && (unix_sk(sk)->addr != NULL)) {
+		addr = (struct sockaddr *)&unix_sk(sk)->addr->name;
+		addrlen = unix_sk(sk)->addr->len;
+	}
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read send buffers: %i\n", ret);
+	if (ret > 0)
+		ret = -EINVAL; /* No send buffers for UNIX sockets */
+
+	return ret;
+}
+
+static int unix_defer_restore_buffers(struct ckpt_ctx *ctx, int sk_objref)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk_objref = sk_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_restore_buffers, NULL);
+}
+
+static struct unix_address *unix_makeaddr(struct sockaddr_un *sun_addr,
+					  unsigned len)
+{
+	struct unix_address *addr;
+
+	if (len > sizeof(struct sockaddr_un))
+		return ERR_PTR(-EINVAL);
+
+	addr = kmalloc(sizeof(*addr) + len, GFP_KERNEL);
+	if (!addr)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(addr->name, sun_addr, len);
+	addr->len = len;
+	atomic_set(&addr->refcnt, 1);
+
+	return addr;
+}
+
+static int unix_restore_connected(struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_socket *h,
+				  struct ckpt_hdr_socket_unix *un,
+				  struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr *addr = NULL;
+	unsigned long flags = h->sock.flags;
+	unsigned int addrlen = 0;
+	int dead = test_bit(SOCK_DEAD, &flags);
+	int ret = 0;
+
+
+	if (un->peer == 0) {
+		/* These get propagated to the msghdr, so only set them
+		 * if we're not connected to a peer, else we'll get an error
+		 * when we sendmsg()
+		 */
+		addr = (struct sockaddr *)&un->laddr;
+		addrlen = un->laddr_len;
+	}
+
+	sk->sk_peercred.pid = task_tgid_vnr(current);
+
+	if (may_setuid(ctx->realcred->user->user_ns, un->peercred_uid) &&
+	    may_setgid(un->peercred_gid)) {
+		sk->sk_peercred.uid = un->peercred_uid;
+		sk->sk_peercred.gid = un->peercred_gid;
+	} else {
+		ckpt_err(ctx, -EPERM, "peercred %i:%i would require setuid\n",
+			 un->peercred_uid, un->peercred_gid);
+		return -EPERM;
+	}
+
+	if (!dead && (un->peer > 0)) {
+		ret = unix_defer_join(ctx, un->this, un->peer);
+		ckpt_debug("unix_defer_join: %i\n", ret);
+	}
+
+	if (!dead && !ret)
+		ret = unix_defer_restore_buffers(ctx, un->this);
+
+	return ret;
+}
+
+static int unix_unlink(const char *name)
+{
+	struct path spath;
+	struct path ppath;
+	int ret;
+
+	ret = kern_path(name, 0, &spath);
+	if (ret)
+		return ret;
+
+	ret = kern_path(name, LOOKUP_PARENT, &ppath);
+	if (ret)
+		goto out_s;
+
+	if (!spath.dentry) {
+		ckpt_debug("No dentry found for %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	if (!ppath.dentry || !ppath.dentry->d_inode) {
+		ckpt_debug("No inode for parent of %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	ret = vfs_unlink(ppath.dentry->d_inode, spath.dentry);
+ out_p:
+	path_put(&ppath);
+ out_s:
+	path_put(&spath);
+
+	return ret;
+}
+
+/* Call bind() for socket, optionally changing (temporarily) to @path first
+ * if non-NULL
+ */
+static int unix_chdir_and_bind(struct socket *sock,
+			       const char *path,
+			       struct sockaddr *addr,
+			       unsigned long addrlen)
+{
+	struct sockaddr_un *un = (struct sockaddr_un *)addr;
+	struct path cur = { .mnt = NULL, .dentry = NULL };
+	struct path dir = { .mnt = NULL, .dentry = NULL };
+	int ret;
+
+	if (path) {
+		ckpt_debug("switching to cwd %s for unix bind\n", path);
+
+		ret = kern_path(path, 0, &dir);
+		if (ret)
+			return ret;
+
+		ret = inode_permission(dir.dentry->d_inode,
+				       MAY_EXEC | MAY_ACCESS);
+		if (ret)
+			goto out;
+
+		write_lock(&current->fs->lock);
+		cur = current->fs->pwd;
+		current->fs->pwd = dir;
+		write_unlock(&current->fs->lock);
+	}
+
+	ret = unix_unlink(un->sun_path);
+	ckpt_debug("unlink(%s): %i\n", un->sun_path, ret);
+	if ((ret == 0) || (ret == -ENOENT))
+		ret = sock_bind(sock, addr, addrlen);
+
+	if (path) {
+		write_lock(&current->fs->lock);
+		current->fs->pwd = cur;
+		write_unlock(&current->fs->lock);
+	}
+ out:
+	if (path)
+		path_put(&dir);
+
+	return ret;
+}
+
+static int unix_fakebind(struct socket *sock,
+			 struct sockaddr_un *addr, unsigned long len)
+{
+	struct unix_address *uaddr;
+
+	uaddr = unix_makeaddr(addr, len);
+	if (IS_ERR(uaddr))
+		return PTR_ERR(uaddr);
+
+	unix_sk(sock->sk)->addr = uaddr;
+
+	return 0;
+}
+
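+/* Restore the local address of a socket:
+ *  - dead sockets and bound-but-unlinked paths just get the address
+ *    attached (no filesystem node is created),
+ *  - abstract names (leading NUL in sun_path) are bound normally,
+ *  - linked paths are unlinked and re-bound, relative to @path if given.
+ */
+static int unix_restore_bind(struct ckpt_hdr_socket *h,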
+			     struct ckpt_hdr_socket_unix *un,
+			     struct socket *sock,
+			     const char *path)
+{
+	struct sockaddr *addr = (struct sockaddr *)&un->laddr;
+	unsigned long len = un->laddr_len;
+	unsigned long flags = h->sock.flags;
+	int dead = test_bit(SOCK_DEAD, &flags);
+
+	if (dead)
+		return unix_fakebind(sock, &un->laddr, len);
+	else if (!un->laddr.sun_path[0])
+		return sock_bind(sock, addr, len);
+	else if (!(un->flags & CKPT_UNIX_LINKED))
+		return unix_fakebind(sock, &un->laddr, len);
+	else
+		return unix_chdir_and_bind(sock, path, addr, len);
+}
+
+/* Some easy pre-flight checks before we get underway */
+static int unix_precheck(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	struct net *net = sock_net(sock->sk);
+	unsigned long sk_flags = h->sock.flags;
+
+	if ((h->socket.state == SS_CONNECTING) ||
+	    (h->socket.state == SS_DISCONNECTING) ||
+	    (h->socket.state == SS_FREE)) {
+		ckpt_debug("AF_UNIX socket in bad state %i\n",
+			   h->socket.state);
+		return -EINVAL;
+	}
+
+	/* AF_UNIX overloads the backlog setting to define the maximum
+	 * queue length for DGRAM sockets.  Make sure we don't let the
+	 * caller exceed that value on restart.
+	 */
+	if ((h->sock.type == SOCK_DGRAM) &&
+	    (h->sock.backlog > net->unx.sysctl_max_dgram_qlen)) {
+		ckpt_debug("DGRAM backlog of %i exceeds system max of %i\n",
+			   h->sock.backlog, net->unx.sysctl_max_dgram_qlen);
+		return -EINVAL;
+	}
+
+	if (test_bit(SOCK_USE_WRITE_QUEUE, &sk_flags)) {
+		ckpt_debug("AF_UNIX socket has SOCK_USE_WRITE_QUEUE set\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+		 struct ckpt_hdr_socket *h)
+
+{
+	struct ckpt_hdr_socket_unix *un;
+	int ret = -EINVAL;
+	char *cwd = NULL;
+
+	ret = unix_precheck(sock, h);
+	if (ret)
+		return ret;
+
+	un = ckpt_read_obj_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (IS_ERR(un))
+		return PTR_ERR(un);
+
+	if (un->peer < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len)) {
+		cwd = ckpt_read_string(ctx, PATH_MAX);
+		if (IS_ERR(cwd)) {
+			ret = PTR_ERR(cwd);
+			goto out;
+		}
+	}
+
+	if ((h->sock.state != TCP_ESTABLISHED) &&
+	    !UNIX_ADDR_EMPTY(un->laddr_len)) {
+		ret = unix_restore_bind(h, un, sock, cwd);
+		if (ret)
+			goto out;
+	}
+
+	if ((h->sock.state == TCP_ESTABLISHED) ||
+	    (h->sock.state == TCP_CLOSE)) {
+		ret = unix_restore_connected(ctx, h, un, sock);
+	} else if (h->sock.state == TCP_LISTEN) {
+		ret = sock->ops->listen(sock, h->sock.backlog);
+	} else {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret, "bad af_unix state %i\n", h->sock.state);
+	}
+
+ out:
+	ckpt_hdr_put(ctx, un);
+	kfree(cwd);
+	return ret;
+}
-- 
1.6.3.3

