Netdev List
 help / color / mirror / Atom feed
* Re: pktgen: tricks
From: Eric Dumazet @ 2009-09-24 10:32 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Stephen Hemminger, Jesper Dangaard Brouer, Robert Olsson, netdev
In-Reply-To: <200909241310.05297.denys@visp.net.lb>

Denys Fedoryschenko a écrit :
> On Thursday 24 September 2009 03:41:41 Stephen Hemminger wrote:
>> Other kernel config help:
>>   - turn off lock dependency checker, kmecheck, page alloc debug
>>     basically anything that slows stuff down
>>   - turn off content group scheduler
> Maybe, but i'm not sure (i can't test it):
> Disable randomize VA space? On embedded boards it was helping. 
> In some case disabling SMP helped, when various SMP locks involved, but maybe 
> not for pktgen.
> 
>

pktgen is a kernel module, and is not affected by randomize VA space.

But of course, disabling SMP must help, as long as your machine needs
one cpu only :)


^ permalink raw reply

* Re: pktgen: tricks
From: Denys Fedoryschenko @ 2009-09-24 10:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jesper Dangaard Brouer, Robert Olsson, netdev
In-Reply-To: <20090923174141.1d350103@s6510>

On Thursday 24 September 2009 03:41:41 Stephen Hemminger wrote:
> Other kernel config help:
>   - turn off lock dependency checker, kmecheck, page alloc debug
>     basically anything that slows stuff down
>   - turn off content group scheduler
Maybe, but i'm not sure (i can't test it):
Disable randomize VA space? On embedded boards it was helping. 
In some case disabling SMP helped, when various SMP locks involved, but maybe 
not for pktgen.


> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply

* Re: [PATCH] ems_pci: fix size of CAN controllers BAR mapping for CPC-PCI v2
From: Wolfgang Grandegger @ 2009-09-24  8:37 UTC (permalink / raw)
  To: Sebastian Haas
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <20090923153747.30154.99860.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

Hi Sebastian,

Sebastian Haas wrote:
> The driver mapped only 128 bytes of the CAN controller address space when a
> CPC-PCI v2 was detected (incl. CPC-104P). This patch will fix it by always
> mapping the whole address space (4096 bytes on all boards) of the
> corresponding PCI BAR.
> 
> Signed-off-by: Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
> ---
> 
>  drivers/net/can/sja1000/ems_pci.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/can/sja1000/ems_pci.c b/drivers/net/can/sja1000/ems_pci.c
> index 7d84b8a..ba98063 100644
> --- a/drivers/net/can/sja1000/ems_pci.c
> +++ b/drivers/net/can/sja1000/ems_pci.c
> @@ -94,12 +94,14 @@ struct ems_pci_card {
>  #define EMS_PCI_CDR             (CDR_CBP | CDR_CLKOUT_MASK)
>  
>  #define EMS_PCI_V1_BASE_BAR     1
> -#define EMS_PCI_V1_MEM_SIZE     4096
> +#define EMS_PCI_V1_MEM_SIZE     4096 /* size of PITA control area */
>  #define EMS_PCI_V2_BASE_BAR     2
> -#define EMS_PCI_V2_MEM_SIZE     128
> +#define EMS_PCI_V2_MEM_SIZE     128 /* size of PLX control area */
>  #define EMS_PCI_CAN_BASE_OFFSET 0x400 /* offset where the controllers starts */
>  #define EMS_PCI_CAN_CTRL_SIZE   0x200 /* memory size for each controller */
>  
> +#define EMS_PCI_CONTR_MEM_SIZE  4096 /* size of controller area */
> +
>  static struct pci_device_id ems_pci_tbl[] = {
>  	/* CPC-PCI v1 */
>  	{PCI_VENDOR_ID_SIEMENS, 0x2104, PCI_ANY_ID, PCI_ANY_ID,},
> @@ -266,7 +268,7 @@ static int __devinit ems_pci_add_card(struct pci_dev *pdev,
>  		goto failure_cleanup;
>  	}
>  
> -	card->base_addr = pci_iomap(pdev, base_bar, mem_size);
> +	card->base_addr = pci_iomap(pdev, base_bar, EMS_PCI_CONTR_MEM_SIZE);
>  	if (card->base_addr == NULL) {
>  		err = -ENOMEM;
>  		goto failure_cleanup;

I see. To avoid confusion I suggest renaming some variables and defines:

s/EMS_PCI_V1_MEM_SIZE/EMS_PCI_V1_CONF_SIZE/
s/EMS_PCI_V2_MEM_SIZE/EMS_PCI_V2_CONF_SIZE/
s/mem_size/conf_size/
s/EMS_PCI_CONTR_MEM_SIZE/EMS_PCI_BASE_SIZE/

Would that not be more appropriate?

Wolfgang.

^ permalink raw reply

* Re: [AX25] kernel panic
From: Jarek Poplawski @ 2009-09-24  8:07 UTC (permalink / raw)
  To: Bernard Pidoux F6BVP
  Cc: Bernard Pidoux, Ralf Baechle DL5RB, Linux Netdev List, linux-hams
In-Reply-To: <4ABA9058.3010605@free.fr>

On Wed, Sep 23, 2009 at 11:17:12PM +0200, Bernard Pidoux F6BVP wrote:
> Hi Jarek,
Hi Bernard,

>
> After applying your second patch I had to wait until today before I catched
> these kernel messages.
> The last three ones where single lines.
> There was no kernel panic. 

Probably you didn't hit the some kind of traffic/problems, because
there was only some additional reporting in this patch vs. take 1.

In the meantime I found some strangeness in refcounting around
ax25_send_frame, and below is a patch which IMHO should fix it. I
don't know if it matters in your case (the previous report suggested
there should be something more), but let's try. Please, don't remove
the previous debugging patch yet, while testing this one.

Thanks,
Jarek P.
--- (ax25_send_frame patch take 1 for testing)

 include/net/netrom.h  |    2 ++
 net/ax25/ax25_out.c   |    6 ++++++
 net/netrom/nr_route.c |   11 ++++++-----
 net/rose/rose_link.c  |    8 ++++++++
 net/rose/rose_route.c |    5 +++++
 5 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/net/netrom.h b/include/net/netrom.h
index 15696b1..ab170a6 100644
--- a/include/net/netrom.h
+++ b/include/net/netrom.h
@@ -132,6 +132,8 @@ static __inline__ void nr_node_put(struct nr_node *nr_node)
 static __inline__ void nr_neigh_put(struct nr_neigh *nr_neigh)
 {
 	if (atomic_dec_and_test(&nr_neigh->refcount)) {
+		if (nr_neigh->ax25)
+			ax25_cb_put(nr_neigh->ax25);
 		kfree(nr_neigh->digipeat);
 		kfree(nr_neigh);
 	}
diff --git a/net/ax25/ax25_out.c b/net/ax25/ax25_out.c
index bf706f8..1491260 100644
--- a/net/ax25/ax25_out.c
+++ b/net/ax25/ax25_out.c
@@ -92,6 +92,12 @@ ax25_cb *ax25_send_frame(struct sk_buff *skb, int paclen, ax25_address *src, ax2
 #endif
 	}
 
+	/*
+	 * There is one ref for the state machine; a caller needs
+	 * one more to put it back, just like with the existing one.
+	 */
+	ax25_cb_hold(ax25);
+
 	ax25_cb_add(ax25);
 
 	ax25->state = AX25_STATE_1;
diff --git a/net/netrom/nr_route.c b/net/netrom/nr_route.c
index 4eb1ac9..850ffc0 100644
--- a/net/netrom/nr_route.c
+++ b/net/netrom/nr_route.c
@@ -842,12 +842,13 @@ int nr_route_frame(struct sk_buff *skb, ax25_cb *ax25)
 	dptr  = skb_push(skb, 1);
 	*dptr = AX25_P_NETROM;
 
-	ax25s = ax25_send_frame(skb, 256, (ax25_address *)dev->dev_addr, &nr_neigh->callsign, nr_neigh->digipeat, nr_neigh->dev);
-	if (nr_neigh->ax25 && ax25s) {
-		/* We were already holding this ax25_cb */
+	ax25s = nr_neigh->ax25;
+	nr_neigh->ax25 = ax25_send_frame(skb, 256,
+					 (ax25_address *)dev->dev_addr,
+					 &nr_neigh->callsign,
+					 nr_neigh->digipeat, nr_neigh->dev);
+	if (ax25s)
 		ax25_cb_put(ax25s);
-	}
-	nr_neigh->ax25 = ax25s;
 
 	dev_put(dev);
 	ret = (nr_neigh->ax25 != NULL);
diff --git a/net/rose/rose_link.c b/net/rose/rose_link.c
index bd86a63..5ef5f69 100644
--- a/net/rose/rose_link.c
+++ b/net/rose/rose_link.c
@@ -101,13 +101,17 @@ static void rose_t0timer_expiry(unsigned long param)
 static int rose_send_frame(struct sk_buff *skb, struct rose_neigh *neigh)
 {
 	ax25_address *rose_call;
+	ax25_cb *ax25s;
 
 	if (ax25cmp(&rose_callsign, &null_ax25_address) == 0)
 		rose_call = (ax25_address *)neigh->dev->dev_addr;
 	else
 		rose_call = &rose_callsign;
 
+	ax25s = neigh->ax25;
 	neigh->ax25 = ax25_send_frame(skb, 260, rose_call, &neigh->callsign, neigh->digipeat, neigh->dev);
+	if (ax25s)
+		ax25_cb_put(ax25s);
 
 	return (neigh->ax25 != NULL);
 }
@@ -120,13 +124,17 @@ static int rose_send_frame(struct sk_buff *skb, struct rose_neigh *neigh)
 static int rose_link_up(struct rose_neigh *neigh)
 {
 	ax25_address *rose_call;
+	ax25_cb *ax25s;
 
 	if (ax25cmp(&rose_callsign, &null_ax25_address) == 0)
 		rose_call = (ax25_address *)neigh->dev->dev_addr;
 	else
 		rose_call = &rose_callsign;
 
+	ax25s = neigh->ax25;
 	neigh->ax25 = ax25_find_cb(rose_call, &neigh->callsign, neigh->digipeat, neigh->dev);
+	if (ax25s)
+		ax25_cb_put(ax25s);
 
 	return (neigh->ax25 != NULL);
 }
diff --git a/net/rose/rose_route.c b/net/rose/rose_route.c
index 9478d9b..8e5d5ac 100644
--- a/net/rose/rose_route.c
+++ b/net/rose/rose_route.c
@@ -234,6 +234,8 @@ static void rose_remove_neigh(struct rose_neigh *rose_neigh)
 
 	if ((s = rose_neigh_list) == rose_neigh) {
 		rose_neigh_list = rose_neigh->next;
+		if (rose_neigh->ax25)
+			ax25_cb_put(rose_neigh->ax25);
 		kfree(rose_neigh->digipeat);
 		kfree(rose_neigh);
 		return;
@@ -242,6 +244,8 @@ static void rose_remove_neigh(struct rose_neigh *rose_neigh)
 	while (s != NULL && s->next != NULL) {
 		if (s->next == rose_neigh) {
 			s->next = rose_neigh->next;
+			if (rose_neigh->ax25)
+				ax25_cb_put(rose_neigh->ax25);
 			kfree(rose_neigh->digipeat);
 			kfree(rose_neigh);
 			return;
@@ -814,6 +818,7 @@ void rose_link_failed(ax25_cb *ax25, int reason)
 
 	if (rose_neigh != NULL) {
 		rose_neigh->ax25 = NULL;
+		ax25_cb_put(ax25);
 
 		rose_del_route_by_neigh(rose_neigh);
 		rose_kill_by_neigh(rose_neigh);

^ permalink raw reply related

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-24  8:03 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABA78DC.7070604@redhat.com>

On 09/23/2009 10:37 PM, Avi Kivity wrote:
>
> Example: feature negotiation.  If it happens in userspace, it's easy 
> to limit what features we expose to the guest.  If it happens in the 
> kernel, we need to add an interface to let the kernel know which 
> features it should expose to the guest.  We also need to add an 
> interface to let userspace know which features were negotiated, if we 
> want to implement live migration.  Something fairly trivial bloats 
> rapidly.

btw, we have this issue with kvm reporting cpuid bits to the guest.  
Instead of letting kvm talk directly to the hardware and the guest, kvm 
gets the cpuid bits from the hardware, strips away features it doesn't 
support, exposes that to userspace, and expects userspace to program the 
cpuid bits it wants to expose to the guest (which may be different than 
what kvm exposed to userspace, and different from guest to guest).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 3/3 V2] i2400m-sdio: select IWMC3200TOP in Kconfig
From: Tomas Winkler @ 2009-09-24  8:00 UTC (permalink / raw)
  To: davem, linville, netdev, linux-wireless, linux-mmc
  Cc: yi.zhu, inaky.perez-gonzalez, cindy.h.kao, guy.cohen,
	ron.rindjunsky, Tomas Winkler

i2400m-sdio requires iwmc3200top for its operation

create separate config option to separate 3200 specifics
from eventual further wimax sdio HW.

Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
---
V2: create separate config option as discussed with Inaky

 drivers/net/wimax/i2400m/Kconfig |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drivers/net/wimax/i2400m/Kconfig b/drivers/net/wimax/i2400m/Kconfig
index d623b3d..9723c5c 100644
--- a/drivers/net/wimax/i2400m/Kconfig
+++ b/drivers/net/wimax/i2400m/Kconfig
@@ -31,6 +31,15 @@ config WIMAX_I2400M_SDIO
 
 	  If unsure, it is safe to select M (module).
 
+config WIMAX_IWMC3200_SDIO
+	bool "Intel Wireless Multicom WiMAX Connection 3200 over SDIO"
+	depends on WIMAX_I2400M_SDIO
+	select IWMC3200TOP
+	help
+	  Select if you have a device based on the Intel Multicom WiMAX
+          Connection 3200 over SDIO.
+ 
+
 config WIMAX_I2400M_DEBUG_LEVEL
 	int "WiMAX i2400m debug level"
 	depends on WIMAX_I2400M
-- 
1.6.0.6

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


^ permalink raw reply related

* [PATCH] tunnels: Optimize tx path
From: Eric Dumazet @ 2009-09-24  7:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

We currently dirty a cache line to update tunnel device stats
(tx_packets/tx_bytes). We better use the txq->tx_bytes/tx_packets
counters that already are present in cpu cache, in the cache
line shared with txq->_xmit_lock

This patch extends IPTUNNEL_XMIT() macro to use txq pointer
provided by the caller.

Also &tunnel->dev->stats can be replaced by &dev->stats

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/ipip.h |    6 +++---
 net/ipv4/ip_gre.c  |    5 +++--
 net/ipv4/ipip.c    |    5 +++--
 net/ipv6/sit.c     |    5 +++--
 4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/net/ipip.h b/include/net/ipip.h
index 5d3036f..8f990bb 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -50,9 +50,9 @@ struct ip_tunnel_prl_entry
 	ip_select_ident(iph, &rt->u.dst, NULL);				\
 									\
 	err = ip_local_out(skb);					\
-	if (net_xmit_eval(err) == 0) {					\
-		stats->tx_bytes += pkt_len;				\
-		stats->tx_packets++;					\
+	if (likely(net_xmit_eval(err) == 0)) {				\
+		txq->tx_bytes += pkt_len;				\
+		txq->tx_packets++;					\
 	} else {							\
 		stats->tx_errors++;					\
 		stats->tx_aborted_errors++;				\
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index d9645c9..45b8785 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -665,7 +665,8 @@ drop_nolock:
 static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ip_tunnel *tunnel = netdev_priv(dev);
-	struct net_device_stats *stats = &tunnel->dev->stats;
+	struct net_device_stats *stats = &dev->stats;
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
 	struct iphdr  *old_iph = ip_hdr(skb);
 	struct iphdr  *tiph;
 	u8     tos;
@@ -818,7 +819,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 		struct sk_buff *new_skb = skb_realloc_headroom(skb, max_headroom);
 		if (!new_skb) {
 			ip_rt_put(rt);
-			stats->tx_dropped++;
+			txq->tx_dropped++;
 			dev_kfree_skb(skb);
 			tunnel->recursion--;
 			return NETDEV_TX_OK;
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 62548cb..40b3730 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -390,7 +390,8 @@ static int ipip_rcv(struct sk_buff *skb)
 static netdev_tx_t ipip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ip_tunnel *tunnel = netdev_priv(dev);
-	struct net_device_stats *stats = &tunnel->dev->stats;
+	struct net_device_stats *stats = &dev->stats;
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
 	struct iphdr  *tiph = &tunnel->parms.iph;
 	u8     tos = tunnel->parms.iph.tos;
 	__be16 df = tiph->frag_off;
@@ -483,7 +484,7 @@ static netdev_tx_t ipip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 		struct sk_buff *new_skb = skb_realloc_headroom(skb, max_headroom);
 		if (!new_skb) {
 			ip_rt_put(rt);
-			stats->tx_dropped++;
+			txq->tx_dropped++;
 			dev_kfree_skb(skb);
 			tunnel->recursion--;
 			return NETDEV_TX_OK;
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0ae4f64..a4aeb2b 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -613,7 +613,8 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 				     struct net_device *dev)
 {
 	struct ip_tunnel *tunnel = netdev_priv(dev);
-	struct net_device_stats *stats = &tunnel->dev->stats;
+	struct net_device_stats *stats = &dev->stats;
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
 	struct iphdr  *tiph = &tunnel->parms.iph;
 	struct ipv6hdr *iph6 = ipv6_hdr(skb);
 	u8     tos = tunnel->parms.iph.tos;
@@ -751,7 +752,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 		struct sk_buff *new_skb = skb_realloc_headroom(skb, max_headroom);
 		if (!new_skb) {
 			ip_rt_put(rt);
-			stats->tx_dropped++;
+			txq->tx_dropped++;
 			dev_kfree_skb(skb);
 			tunnel->recursion--;
 			return NETDEV_TX_OK;

^ permalink raw reply related

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-24  7:18 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABA8FDC.5010008@gmail.com>

On 09/24/2009 12:15 AM, Gregory Haskins wrote:
>
>>> There are various aspects about designing high-performance virtual
>>> devices such as providing the shortest paths possible between the
>>> physical resources and the consumers.  Conversely, we also need to
>>> ensure that we meet proper isolation/protection guarantees at the same
>>> time.  What this means is there are various aspects to any
>>> high-performance PV design that require to be placed in-kernel to
>>> maximize the performance yet properly isolate the guest.
>>>
>>> For instance, you are required to have your signal-path (interrupts and
>>> hypercalls), your memory-path (gpa translation), and
>>> addressing/isolation model in-kernel to maximize performance.
>>>
>>>        
>> Exactly.  That's what vhost puts into the kernel and nothing more.
>>      
> Actually, no.  Generally, _KVM_ puts those things into the kernel, and
> vhost consumes them.  Without KVM (or something equivalent), vhost is
> incomplete.  One of my goals with vbus is to generalize the "something
> equivalent" part here.
>    

I don't really see how vhost and vbus are different here.  vhost expects 
signalling to happen through a couple of eventfds and requires someone 
to supply them and implement kernel support (if needed).  vbus requires 
someone to write a connector to provide the signalling implementation.  
Neither will work out-of-the-box when implementing virtio-net over 
falling dominos, for example.

>>> Vbus accomplishes its in-kernel isolation model by providing a
>>> "container" concept, where objects are placed into this container by
>>> userspace.  The host kernel enforces isolation/protection by using a
>>> namespace to identify objects that is only relevant within a specific
>>> container's context (namely, a "u32 dev-id").  The guest addresses the
>>> objects by its dev-id, and the kernel ensures that the guest can't
>>> access objects outside of its dev-id namespace.
>>>
>>>        
>> vhost manages to accomplish this without any kernel support.
>>      
> No, vhost manages to accomplish this because of KVMs kernel support
> (ioeventfd, etc).   Without a KVM-like in-kernel support, vhost is a
> merely a kind of "tuntap"-like clone signalled by eventfds.
>    

Without a vbus-connector-falling-dominos, vbus-venet can't do anything 
either.  Both vhost and vbus need an interface, vhost's is just narrower 
since it doesn't do configuration or enumeration.

> This goes directly to my rebuttal of your claim that vbus places too
> much in the kernel.  I state that, one way or the other, address decode
> and isolation _must_ be in the kernel for performance.  Vbus does this
> with a devid/container scheme.  vhost+virtio-pci+kvm does it with
> pci+pio+ioeventfd.
>    

vbus doesn't do kvm guest address decoding for the fast path.  It's 
still done by ioeventfd.

>>   The guest
>> simply has not access to any vhost resources other than the guest->host
>> doorbell, which is handed to the guest outside vhost (so it's somebody
>> else's problem, in userspace).
>>      
> You mean _controlled_ by userspace, right?  Obviously, the other side of
> the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
> vhost would be pointless: e.g. just use vanilla tuntap if you don't need
> fast in-kernel decoding.
>    

Yes (though for something like level-triggered interrupts we're probably 
keeping it in userspace, enjoying the benefits of vhost data path while 
paying more for signalling).

>>> All that is required is a way to transport a message with a "devid"
>>> attribute as an address (such as DEVCALL(devid)) and the framework
>>> provides the rest of the decode+execute function.
>>>
>>>        
>> vhost avoids that.
>>      
> No, it doesn't avoid it.  It just doesn't specify how its done, and
> relies on something else to do it on its behalf.
>    

That someone else can be in userspace, apart from the actual fast path.

> Conversely, vbus specifies how its done, but not how to transport the
> verb "across the wire".  That is the role of the vbus-connector abstraction.
>    

So again, vbus does everything in the kernel (since it's so easy and 
cheap) but expects a vbus-connector.  vhost does configuration in 
userspace (since it's so clunky and fragile) but expects a couple of 
eventfds.

>>> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
>>>
>>>        
>> It's the wrong name.  vhost implements only the data path.
>>      
> Understood, but vhost+virtio-pci is what I am contrasting, and I use
> "vhost" for short from that point on because I am too lazy to type the
> whole name over and over ;)
>    

If you #define A A+B+C don't expect intelligent conversation afterwards.

>>> It is not immune to requiring in-kernel addressing support either, but
>>> rather it just does it differently (and its not as you might expect via
>>> qemu).
>>>
>>> Vhost relies on QEMU to render PCI objects to the guest, which the guest
>>> assigns resources (such as BARs, interrupts, etc).
>>>        
>> vhost does not rely on qemu.  It relies on its user to handle
>> configuration.  In one important case it's qemu+pci.  It could just as
>> well be the lguest launcher.
>>      
> I meant vhost=vhost+virtio-pci here.  Sorry for the confusion.
>
> The point I am making specifically is that vhost in general relies on
> other in-kernel components to function.  I.e. It cannot function without
> having something like the PCI model to build an IO namespace.  That
> namespace (in this case, pio addresses+data tuples) are used for the
> in-kernel addressing function under KVM + virtio-pci.
>
> The case of the lguest launcher is a good one to highlight.  Yes, you
> can presumably also use lguest with vhost, if the requisite facilities
> are exposed to lguest-bus, and some eventfd based thing like ioeventfd
> is written for the host (if it doesnt exist already).
>
> And when the next virt design "foo" comes out, it can make a "foo-bus"
> model, and implement foo-eventfd on the backend, etc, etc.
>    

It's exactly the same with vbus needing additional connectors for 
additional transports.

> Ira can make ira-bus, and ira-eventfd, etc, etc.
>
> Each iteration will invariably introduce duplicated parts of the stack.
>    

Invariably?  Use libraries (virtio-shmem.ko, libvhost.so).


>> For the N+1th time, no.  vhost is perfectly usable without pci.  Can we
>> stop raising and debunking this point?
>>      
> Again, I understand vhost is decoupled from PCI, and I don't mean to
> imply anything different.  I use PCI as an example here because a) its
> the only working example of vhost today (to my knowledge), and b) you
> have stated in the past that PCI is the only "right" way here, to
> paraphrase.  Perhaps you no longer feel that way, so I apologize if you
> feel you already recanted your position on PCI and I missed it.
>    

For kvm/x86 pci definitely remains king.  I was talking about the two 
lguest users and Ira.

> I digress.  My point here isn't PCI.  The point here is the missing
> component for when PCI is not present.  The component that is partially
> satisfied by vbus's devid addressing scheme.  If you are going to use
> vhost, and you don't have PCI, you've gotta build something to replace it.
>    

Yes, that's why people have keyboards.  They'll write that glue code if 
they need it.  If it turns out to be a hit an people start having virtio 
transport module writing parties, they'll figure out a way to share code.

>>> All you really need is a simple decode+execute mechanism, and a way to
>>> program it from userspace control.  vbus tries to do just that:
>>> commoditize it so all you need is the transport of the control messages
>>> (like DEVCALL()), but the decode+execute itself is reuseable, even
>>> across various environments (like KVM or Iras rig).
>>>
>>>        
>> If you think it should be "commodotized", write libvhostconfig.so.
>>      
> I know you are probably being facetious here, but what do you propose
> for the parts that must be in-kernel?
>    

On the guest side, virtio-shmem.ko can unify the ring access.  It 
probably makes sense even today.  On the host side I eventfd is the 
kernel interface and libvhostconfig.so can provide the configuration 
when an existing ABI is not imposed.

>>> And your argument, I believe, is that vbus allows both to be implemented
>>> in the kernel (though to reiterate, its optional) and is therefore a bad
>>> design, so lets discuss that.
>>>
>>> I believe the assertion is that things like config-space are best left
>>> to userspace, and we should only relegate fast-path duties to the
>>> kernel.  The problem is that, in my experience, a good deal of
>>> config-space actually influences the fast-path and thus needs to
>>> interact with the fast-path mechanism eventually anyway.
>>> Whats left
>>> over that doesn't fall into this category may cheaply ride on existing
>>> plumbing, so its not like we created something new or unnatural just to
>>> support this subclass of config-space.
>>>
>>>        
>> Flexibility is reduced, because changing code in the kernel is more
>> expensive than in userspace, and kernel/user interfaces aren't typically
>> as wide as pure userspace interfaces.  Security is reduced, since a bug
>> in the kernel affects the host, while a bug in userspace affects just on
>> guest.
>>      
> For a mac-address attribute?  Thats all we are really talking about
> here.  These points you raise, while true of any kernel code I suppose,
> are a bit of a stretch in this context.
>    

Look at the virtio-net feature negotiation.  There's a lot more there 
than the MAC address, and it's going to grow.

>> Example: feature negotiation.  If it happens in userspace, it's easy to
>> limit what features we expose to the guest.
>>      
> Its not any harder in the kernel.  I do this today.
>
> And when you are done negotiating said features, you will generally have
> to turn around and program the feature into the backend anyway (e.g.
> ioctl() to vhost module).  Now you have to maintain some knowledge of
> that particular feature and how to program it in two places.
>    

No, you can leave it enabled unconditionally in vhost (the guest won't 
use what it doesn't know about).

> Conversely, I am eliminating the (unnecessary) middleman by letting the
> feature negotiating take place directly between the two entities that
> will consume it.
>    

The middleman is necessary, if you want to support live migration, or to 
restrict a guest to a subset of your features.

>>   If it happens in the
>> kernel, we need to add an interface to let the kernel know which
>> features it should expose to the guest.
>>      
> You need this already either way for both models anyway.  As an added
> bonus, vbus has generalized that interface using sysfs attributes, so
> all models are handled in a similar and community accepted way.
>    

vhost doesn't need it since userspace takes care of it.

>>   We also need to add an
>> interface to let userspace know which features were negotiated, if we
>> want to implement live migration.  Something fairly trivial bloats rapidly.
>>      
> Can you elaborate on the requirements for live-migration?  Wouldnt an
> opaque save/restore model work here? (e.g. why does userspace need to be
> able to interpret the in-kernel state, just pass it along as a blob to
> the new instance).
>    

A blob would work, if you commit to forward and backward compatibility 
in the kernel side (i.e. an older kernel must be able to accept a blob 
from a newer one).  I don't like blobs though, they tie you to the 
implemenetation.

>> As you can see above, userspace needs to be involved in this, and the
>> number of interfaces required is smaller if it's in userspace:
>>      
> Actually, no.  My experience has been the opposite.  Anytime I sat down
> and tried to satisfy your request to move things to the userspace,
> things got ugly and duplicative really quick.  I suspect part of the
> reason you may think its easier because you already have part of
> virtio-net in userspace and its surrounding support, but that is not the
> case moving forward for new device types.
>    

I can't comment on your experience, but we'll definitely build on 
existing code for new device types.

>> you only
>> need to know which features the kernel supports (they can be enabled
>> unconditionally, just not exposed).
>>
>> Further, some devices are perfectly happy to be implemented in
>> userspace, so we need userspace configuration support anyway.  Why
>> reimplement it in the kernel?
>>      
> Thats fine.  vbus is targetted for high-performance IO.  So if you have
> a robust userspace (like KVM+QEMU) and low-performance constraints (say,
> for a console or something), put it in userspace and vbus is not
> involved.  I don't care.
>    

So now the hypothetical non-pci hypervisor needs to support two busses.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [BUG] af_unix race in close?
From: Stephen Hemminger @ 2009-09-24  5:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jike Song
In-Reply-To: <4ABAF727.8060905@gmail.com>

On Thu, 24 Sep 2009 06:35:51 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> inted (2.6.31-0.125.4.2.rc5.git2.fc12.i686 #1) AOA110
> > EIP: 0060:[] EFLAGS: 00010202 CPU: 0
> > EIP is at unix_write_space+0x45/0x87
> > EAX: 6b6b6b6b EBX: ec988780 ECX: 00000000 EDX: 6b6b6b8f
> > ESI: ec988950 EDI: ffffff20 EBP: ec941e28 ESP: ec941e1c
> >  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > Process metacity (pid: 6809, ti=ec940000 task=e63095c0 task.ti=ec940000)
> > Stack:
> >  37dc7803 ec988780 000000e1 ec941e40 c0772142 37dc7803 dcc1c900 dcc1c900  
> > <0> c07f6a02 ec941e50 c0775766 37dc7803 dcc1c900 ec941e60 c07754ae 37dc7803
> > <0> dcc1c900 ec941e78 c07755db 37dc7803 ec98b0c0 dcc1c900 00000000 ec941ea0  
> > Call Trace:
> >  [] ? sock_wfree+0x44/0x68
> >  [] ? unix_release_sock+0x182/0x1e0
> >  [] ? skb_release_head_state+0x6c/0xcb
> >  [] ? __kfree_skb+0x20/0x94
> >  [] ? kfree_skb+0x68/0x7f
> >  [] ? unix_release_sock+0x182/0x1e0
> >  [] ? unix_release+0x2f/0x42
> >  [] ? sock_release+0x29/0x7f
> >  [] ? sock_close+0x30/0x45
> >  [] ? __fput+0x101/0x1a2

Good thanks. It should probably go up to stable as well.

^ permalink raw reply

* Re: [BUG] af_unix race in close?
From: Eric Dumazet @ 2009-09-24  4:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev, Jike Song
In-Reply-To: <20090923165421.60e0d49c@s6510>

Stephen Hemminger a écrit :
> This oops seems to show lots of times:
> http://www.kerneloops.org/guilty.php?guilty=unix_write_space&version=2.6.31-release&start=2064384&end=2097151&class=oops
> Looks like race in unix domain socket close with data outstanding.
> 
> BUG: unable to handle kernel paging request at 6b6b6b8f
> IP: [] unix_write_space+0x45/0x87
> *pde = 00000000 
> Oops: 0000 [#1] SMP 
> last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/BAT1/charge_full
> Modules linked in: ext2 fuse nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_multipath uinput uvcvideo videodev v4l1_compat arc4 snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support ecb serio_raw i2c_i801 snd_hda_intel joydev snd_hda_codec snd_hwdep snd_pcm snd_timer ath5k r8169 snd mac80211 mii soundcore ath snd_page_alloc jmb38x_ms cfg80211 memstick rfkill wmi squashfs vfat fat mmc_block i915 sdhci_pci ata_generic pata_acpi sdhci mmc_core drm i2c_algo_bit i2c_core usb_storage video output [last unloaded: microcode]
> 
> Pid: 6809, comm: metacity Not tainted (2.6.31-0.125.4.2.rc5.git2.fc12.i686 #1) AOA110
> EIP: 0060:[] EFLAGS: 00010202 CPU: 0
> EIP is at unix_write_space+0x45/0x87
> EAX: 6b6b6b6b EBX: ec988780 ECX: 00000000 EDX: 6b6b6b8f
> ESI: ec988950 EDI: ffffff20 EBP: ec941e28 ESP: ec941e1c
>  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process metacity (pid: 6809, ti=ec940000 task=e63095c0 task.ti=ec940000)
> Stack:
>  37dc7803 ec988780 000000e1 ec941e40 c0772142 37dc7803 dcc1c900 dcc1c900
> <0> c07f6a02 ec941e50 c0775766 37dc7803 dcc1c900 ec941e60 c07754ae 37dc7803
> <0> dcc1c900 ec941e78 c07755db 37dc7803 ec98b0c0 dcc1c900 00000000 ec941ea0
> Call Trace:
>  [] ? sock_wfree+0x44/0x68
>  [] ? unix_release_sock+0x182/0x1e0
>  [] ? skb_release_head_state+0x6c/0xcb
>  [] ? __kfree_skb+0x20/0x94
>  [] ? kfree_skb+0x68/0x7f
>  [] ? unix_release_sock+0x182/0x1e0
>  [] ? unix_release+0x2f/0x42
>  [] ? sock_release+0x29/0x7f
>  [] ? sock_close+0x30/0x45
>  [] ? __fput+0x101/0x1a2
>  [] ? fput+0x27/0x3a
>  [] ? filp_close+0x64/0x7f
>  [] ? put_files_struct+0x68/0xbd
>  [] ? exit_files+0x43/0x59
>  [] ? do_exit+0x1d6/0x648
>  [] ? audit_syscall_entry+0x134/0x167
>  [] ? do_group_exit+0x72/0x99
>  [] ? sys_exit_group+0x27/0x3c
>  [] ? syscall_call+0x7/0xb
> Code: 00 89 45 f4 31 c0 89 f0 e8 9a 76 02 00 8b 83 dc 00 00 00 c1 e0 02 3b 83 e4 00 00 00 7f 32 8b 83 a4 00 00 00 85 c0 74 17 8d 50 24 <39> 50 24 74 0f b9 01 00 00 00 ba 01 00 00 00 e8 bb cf c3 ff b9 
> EIP: [] unix_write_space+0x45/0x87 SS:ESP 0068:ec941e1c
> CR2: 000000006b6b6b8f
> ---[ end trace 4a36bd1eb2fc9896 ]---
> 

Hello Stephen

I already took a look at the problem, and I re-sent possible fix for this yesterday

http://patchwork.ozlabs.org/patch/34162/

First reporter I am aware of was Jike Song

Thanks

^ permalink raw reply

* Re: [PATCH 6/8] [RFC] CAIF Protocol Stack
From: Randy Dunlap @ 2009-09-24  4:00 UTC (permalink / raw)
  To: sjur.brandeland; +Cc: netdev, Kim.xx.Lilliestierna
In-Reply-To: <1253727091-10383-1-git-send-email-sjur.brandeland@stericsson.com>

On Wed, 23 Sep 2009 19:31:31 +0200 sjur.brandeland@stericsson.com wrote:

> From: Kim Lilliestierna <Kim.xx.Lilliestierna@ericsson.com>
> 
> Signed-off-by: sjur.brandeland@stericsson.com
> 
> ---
>  drivers/net/caif/Kconfig      |   64 +++
>  drivers/net/caif/Makefile     |   29 ++
>  drivers/net/caif/chnl_tty.c   |  220 +++++++++++
>  drivers/net/caif/phyif_loop.c |  309 +++++++++++++++
>  drivers/net/caif/phyif_ser.c  |  189 +++++++++
>  drivers/net/caif/phyif_shm.c  |  870 +++++++++++++++++++++++++++++++++++++++++
>  drivers/net/caif/shm.h        |   95 +++++
>  drivers/net/caif/shm_cfgifc.c |   60 +++
>  drivers/net/caif/shm_mbxifc.c |   98 +++++
>  drivers/net/caif/shm_smbx.c   |   81 ++++
>  10 files changed, 2015 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/caif/Kconfig
>  create mode 100644 drivers/net/caif/Makefile
>  create mode 100644 drivers/net/caif/chnl_tty.c
>  create mode 100644 drivers/net/caif/phyif_loop.c
>  create mode 100644 drivers/net/caif/phyif_ser.c
>  create mode 100644 drivers/net/caif/phyif_shm.c
>  create mode 100644 drivers/net/caif/shm.h
>  create mode 100644 drivers/net/caif/shm_cfgifc.c
>  create mode 100644 drivers/net/caif/shm_mbxifc.c
>  create mode 100644 drivers/net/caif/shm_smbx.c
> 
> diff --git a/drivers/net/caif/Kconfig b/drivers/net/caif/Kconfig
> new file mode 100644
> index 0000000..3cbe302
> --- /dev/null
> +++ b/drivers/net/caif/Kconfig
> @@ -0,0 +1,64 @@
> +#
> +# CAIF net configurations
> +#
> +
> +if CAIF
> +
> +# Include physical drivers
> +# should be broken out into its own config file
> +# source "drivers/net/caif/Kconfig"
> +
> +# Some options here should be mande platform dependent

                                 made

> +
> +comment "CAIF physical drivers"
> +
> +config CAIF_TTY
> +	tristate "CAIF TTY transport driver "

	no trailing space, please (before the final ")

> +	default CAIF
> +	---help---
> +	The CAIF TTY transport driver
> +	If sou say yes here you will also need to build a users space utility to set the line disicpline on the the

	If you ......                                     userspace [or user space]

	drop final (duplicate) "the" above.

> +	tty, see Documentation/net/caif/examples/linedsc
> +
> +config CAIF_SHM
> +	tristate "CAIF shared memory transport driver"
> +	default n
> +	---help---
> +	The caif low level driver for the shared memory driver

	End sentences with a period ('.').

> +	Be aware that if you enable this you need to also enable a low level shared memory driver

	End above with period.  Begin below with "The".

> +        the default is to include the loopback test driver.
> +
> +config CAIF_LOOPBACK
> +	tristate "CAIF loopback driver test driver"
> +	default CAIF
> +	---help---
> +	Loopback test driver
> +
> +if CAIF_SHM
> +
> +comment "CAIF shared memory low level physical drivers"
> +
> +config CAIF_SHM_LOOPBACK
> +	tristate "Caif shared memory loopback driver"
> +	default CAIF_USE_SHM
> +	---help---
> +	loop back driver that emulates a real shared memory transport
> +	mainly used for debugging.
> +
> +config CAIF_MBXIF
> +	tristate "Caif shared mailbox interface"
> +	default CAIF_SHM
> +	---help---
> +	Generic shared mailbox interface
> +
> +config CAIF_SMBX
> +	tristate "Use Simluated Mail box"
> +	default CAIF_MBXIF
> +	---help---
> +	Answer y if you whant to use a simulated mail box interface for caif shared memory transport

	                want
	End above sentence with a period.

> +	Mainly used for debugging and as example driver
> +	This can also be built as a module

Ditto.

> +
> +endif #CAIF_USE_SHM
> +
> +endif # CAIF



---
~Randy

^ permalink raw reply

* Re: pull request: wireless-next-2.6 2009-09-23
From: John W. Linville @ 2009-09-24  3:50 UTC (permalink / raw)
  To: David Miller
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20090923.162449.141768862.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Wed, Sep 23, 2009 at 04:24:49PM -0700, David Miller wrote:

> Also, please put the branch name, even if it's simply 'master'
> in your GIT pull urls you give me.

Noted...thanks!

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: ixgbe patch to provide NIC's tx/rx counters via ethtool
From: Ben Greear @ 2009-09-24  2:08 UTC (permalink / raw)
  To: Rick Jones; +Cc: NetDev
In-Reply-To: <4ABAB727.2020507@hp.com>

Rick Jones wrote:
> Ben Greear wrote:
>> When LRO is enabled, the received packet and byte counters represent the
>> LRO'd packets, not the packets/bytes on the wire. 
>
> When LRO is enabled, are all the bytes on the wire actually 
> transferred into the host?
No...the ethernet, IP and TCP headers and such are not, for packets that 
are combined into a single
large SKB.

That is why the driver counts them wrong.  The bytes are off by a few 
percentage points, but the
packet count is off by an order of magnitude.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply

* Re: [PATCH 2/5] Implement loss counting on TFRC-SP receiver
From: Ivo Calado @ 2009-09-24  1:43 UTC (permalink / raw)
  To: Gerrit Renker; +Cc: dccp, netdev
In-Reply-To: <425e6efa0909231501p499059a4y3530d1a9f55b5a2a@mail.gmail.com>

On Sat, Sep 19, 2009 at 9:11 AM,  <gerrit@erg.abdn.ac.uk> wrote:
>>>                s64 len = dccp_delta_seqno(cur->li_seqno,
>>> cong_evt_seqno);
>>>                if ((len <= 0) ||
>>>                    (!tfrc_lh_closed_check(cur,
>>> cong_evt->tfrchrx_ccval))) {
>>> +                       cur->li_losses += rh->num_losses;
>>>                        return false;
>>>                }
>>> This has a multiplicative effect, since rh->num_losses is added to
> cur->li_losses
>>> each time the condition is evaluated. E.g. if 3 times in a row
> reordered
>>> (earlier)
>>> sequence numbers arrive, or if the CCvals do not differ (high-speed
> networks),
>>> we end up with 3 * rh->num_losses, which can't be correct.
>>
>>
>> The following code would be correct then?
>>
>>               if ((len <= 0) ||
>>                   (!tfrc_lh_closed_check(cur, cong_evt->tfrchrx_ccval)))
> {
>> +                       cur->li_losses += rh->num_losses;
>> +                       rh->num_losses  = 0;
>>                       return false;
>> With this change I suppose the could be fixed. With that, the
>> rh->num_losses couldn't added twice. Am I correct?
>>
>>
> The function tfrc_lh_interval_add() is called when
>  * __two_after_loss() returns true (a new loss is detected) or
>  * a data packet is ECN-CE marked.
>
> I am still not sure about the 'len <= 0' case; this would be true
> if an ECN-marked packet arrives whose sequence number is 'before'
> the start of the current loss interval, or if a loss is detected
> which is older than the start of the current loss interval.
>
> The other case (tfrc_lh_closed_check) returns 1 if the current loss
> interval is 'closed' according to RFC 4342, 10.2.
>
> Intuitively, in the first case it refers to the preceding loss
> interval (i.e. not cur->...), in the second case it seems correct.
>
> Doing the first case is complicated due to going back in history.
> The simplest solution I can think of at the moment is to ignore
> the exception-case of reordered packets and do something like
>
>    if (len <= 0) {
>       /* FIXME: this belongs into the previous loss interval */
> tfrc_pr_debug("Warning: ignoring loss due to reordering");
>       return false;
>    }
>    if (!tfrc_lh_closed_check(...)) {
>        // your code from above
>    }

Okay, i'll add your sugestion. But i don't know how this would be fixed at all.

>
> However, there is a much deeper underlying question: currently the
> implementation is not really what the specification says; if we
> wanted to abide by the letter of the law, we would have to implement
> the Loss Intervals Option first, and then sort out such details as
> above. Discussion continues further below.
>

Yes, the loss intervals options is mandatory. I'll discuss this too below.

>>> --- dccp_tree_work4.orig/net/dccp/ccids/lib/packet_history_sp.c +++
> dccp_tree_work4/net/dccp/ccids/lib/packet_history_sp.c
>>> @@ -244,6 +244,7 @@
>>>                h->loss_count = 3;
>>>                tfrc_sp_rx_hist_entry_from_skb(tfrc_rx_hist_entry(h, 3),
>                                               skb, n3);
>>> +               h->num_losses = dccp_loss_count(s2, s3, n3);
>>>                return 1;
>>>        }
>>> This only measures the gap between s2 and s3, but the "hole" starts at s0,
>>> so it would need to be dccp_loss_count(s0, s3, n3). Algorithm is
> documented at
>>> http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/\
>>> ccid3/ccid3_packet_reception/loss_detection/loss_detection_algorithm_notes.txt
>>
>> <snip>
>>
>>>  }
>>> Here it is also between s0 and s2, not between s0 and s3. It is case
> VI(c.3).
>>> However, the above still is a crude approximation, since it only
> measures between
>>> the last sequence number received before the loss and the third
> sequence
>>> number
>>> after the loss. It would be better to either
>>>  * use the first sequence number after the loss (this can be s1, s2, or
> s3) or
>>>  * check if there are more holes between the first/second and the
> second/third
>>>   sequence numbers after the loss.
>>> The second option would be the correct one, it should also take the NDP
> counts
>>> of each gap into account. And already we have a fairly complicated
> algorithm.
>>
>>
>>
>> I'll study loss_detection_algorithm_notes.txt and correct the code. But
> I have one question, that i don't know if is already answered by the
> documentation:
>> Further holes, between the the first and third packet received after the
> hole are accounted only in future calls to the function, right? Because
> the receiver needs to receive more packets to confirm loss, right?
>> So, it's really necessary to look for other holes after the loss? Will
> not this other holes be identified as losses in future?
> I stand corrected, you are right: only the hole between
>  * the highest sequence number before the loss (S0) and
>  * the first sequence number after the loss
>   (S1 or S3 depending on reordering)
> are relevant.

Thanks, already corrected in next patch.

>
> Continuing the point from above, I would like to ask which way you would
> like to go with your implementation:
>  (a) receiver computes the Loss Event Rate, sender just uses this value
>  (b) receiver only gathers the data (loss intervals, lost packets),
>     sender does all the verification, book-keeping, and computation.
>
> From reading your patches, I think it is going in the direction of (a).
> But if this is the case, we don't need the Dropped Packets Option from
> RFC 5622, 8.7. By definition it only makes sense if Loss Intervals
> Options are also present.
>
> So it is necessary to decide whether to go the full way, which means
>  * support Loss Intervals and Dropped Packets alike
>  * modify TFRC library (it will be a redesign)
>  * modify receiver code
>  * modify sender code,
> or to use the present approach where
>  * the receiver computes the Loss Rate and
>  * a Mandatory Send Loss Event Rate feature is present during feature
>   negotiation, to avoid problems with incompatible senders
>   (there is a comment explaining this, in net/dccp/feat.c).
>
> Thoughts?
>

Initially I conceived that the receiver would compute the loss event
rate and send it, along with loss intervals and dropped packets
option. And this would enable the sender to just use the loss event
rate reported, as is done currently (in current implementation of
TFRC), or compute it itself, using loss intervals option (in case of
CCID3) and dropped packets option too (CCID4's case).
In my code, I compute the loss event rate using the options, then
compare the two values. I don't know if would be better to keep all
the loss event rate calc only at one side, sender or receiver.

I believe that the first way is better (to "support Loss Intervals and
Dropped Packets alike..."), because RFC requires loss intervals option
to be sent. And so, proceed and implement dropped packets option for
TFRC-SP. You are right, this would need a redesign and rewrite of
sender and receiver code.




-- 
Ivo Augusto Andrade Rocha Calado
MSc. Candidate
Embedded Systems and Pervasive Computing Lab - http://embedded.ufcg.edu.br
Systems and Computing Department - http://www.dsc.ufcg.edu.br
Electrical Engineering and Informatics Center - http://www.ceei.ufcg.edu.br
Federal University of Campina Grande - http://www.ufcg.edu.br

PGP: 0x03422935
Quidquid latine dictum sit, altum viditur.

^ permalink raw reply

* Re: r8169, enabling TX checksumming breaks things?
From: David Dillow @ 2009-09-24  1:12 UTC (permalink / raw)
  To: Denys Fedoryschenko; +Cc: romieu, netdev
In-Reply-To: <200909231724.23132.denys@visp.net.lb>

On Wed, 2009-09-23 at 17:24 +0300, Denys Fedoryschenko wrote:
> On Wednesday 23 September 2009 17:02:24 David Dillow wrote:
> > On Wed, 2009-09-23 at 09:15 +0300, Denys Fedoryschenko wrote:
> > > Hi
> > >
> > > Is it expected that:
> > > 1)TX checksumming is off by default
> > > 2)If i try to enable it over ethtool -K eth0 tx on , TCP sessions on
> > > proxy getting stuck, even in tcpdump looks everything fine and packets
> > > reaching destination, i don't understand what is a reason of failure.
> > > Maybe if this feature supposed to not work - user must not be able just
> > > to turn it on?
> >
> > It is broken for large swaths of the hardware -- I have patches that got
> > it and TSO working on my hardware, and they provide a framework to see
> > about getting it working on yours.
> >
> > Basically, the fields are in different places depending on the chip
> > revision. I'll try to dig those out tonight and send them along so we
> > can experiment.
> Thanks, i have 8 hosts (4 hosts with RTL8168b/8111b. and 4 with 
> RTL8168d/8111d) to test. Ready for patches to test them :-)

This is the patch I have against HEAD, which Works For Me (TM). If I
recall correctly the list in rtl8169_set_tso_csum() is from combing the
vendor drivers, but I'm not 100% on that. Too much time has passed.

I added the MAC version to the driver banner, which should make it
easier to figure out which MAC you have, and to change that function if
it does not work for you out of the box. It seems that some chips don't
work with this at all, based on testing by Michael Riepe (XID 3c4000c0).

As should be obvious, this isn't ready to go anywhere near upstream yet
I think.

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 50c6a3c..c58062e 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -28,7 +28,7 @@
 #include <asm/io.h>
 #include <asm/irq.h>
 
-#define RTL8169_VERSION "2.3LK-NAPI"
+#define RTL8169_VERSION "2.3LK-NAPI-TSO"
 #define MODULENAME "r8169"
 #define PFX MODULENAME ": "
 
@@ -384,13 +384,19 @@ enum desc_status_bit {
 	FirstFrag	= (1 << 29), /* First segment of a packet */
 	LastFrag	= (1 << 28), /* Final segment of a packet */
 
-	/* Tx private */
+	/* Tx private -- TxDesc->opts1 */
 	LargeSend	= (1 << 27), /* TCP Large Send Offload (TSO) */
 	MSSShift	= 16,        /* MSS value position */
 	MSSMask		= 0xfff,     /* MSS value + LargeSend bit: 12 bits */
 	IPCS		= (1 << 18), /* Calculate IP checksum */
 	UDPCS		= (1 << 17), /* Calculate UDP/IP checksum */
 	TCPCS		= (1 << 16), /* Calculate TCP/IP checksum */
+
+	/* Tx private -- TxDesc->opts2 */
+	UDPCS_2		= (1 << 31), /* Calculate UDP/IP checksum */
+	TCPCS_2		= (1 << 30), /* Calculate TCP/IP checksum */
+	IPCS_2		= (1 << 29), /* Calculate IP checksum */
+	MSSShift_2	= 18,        /* MSS value position */
 	TxVlanTag	= (1 << 17), /* Add VLAN tag */
 
 	/* Rx private */
@@ -2310,11 +2316,12 @@ rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (netif_msg_probe(tp)) {
 		u32 xid = RTL_R32(TxConfig) & 0x9cf0f8ff;
 
-		printk(KERN_INFO "%s: %s at 0x%lx, "
+		printk(KERN_INFO "%s: %s (%d) at 0x%lx, "
 		       "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x, "
 		       "XID %08x IRQ %d\n",
 		       dev->name,
 		       rtl_chip_info[tp->chipset].name,
+		       rtl_chip_info[tp->chipset].mac_version,
 		       dev->base_addr,
 		       dev->dev_addr[0], dev->dev_addr[1],
 		       dev->dev_addr[2], dev->dev_addr[3],
@@ -3299,7 +3306,7 @@ static void rtl8169_tx_timeout(struct net_device *dev)
 }
 
 static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
-			      u32 opts1)
+			      u32 opts1, u32 opts2)
 {
 	struct skb_shared_info *info = skb_shinfo(skb);
 	unsigned int cur_frag, entry;
@@ -3323,6 +3330,7 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 		status = opts1 | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
 
 		txd->opts1 = cpu_to_le32(status);
+		txd->opts2 = cpu_to_le32(opts2);
 		txd->addr = cpu_to_le64(mapping);
 
 		tp->tx_skb[entry].len = len;
@@ -3336,24 +3344,64 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
 	return cur_frag;
 }
 
-static inline u32 rtl8169_tso_csum(struct sk_buff *skb, struct net_device *dev)
+static void rtl8169_set_tso_csum(struct sk_buff *skb, struct net_device *dev,
+				u32 *opts1, u32 *opts2)
 {
-	if (dev->features & NETIF_F_TSO) {
-		u32 mss = skb_shinfo(skb)->gso_size;
+	struct rtl8169_private *tp = netdev_priv(dev);
+	const struct iphdr *ip = ip_hdr(skb);
+	u32 mss = 0;
 
-		if (mss)
-			return LargeSend | ((mss & MSSMask) << MSSShift);
-	}
-	if (skb->ip_summed == CHECKSUM_PARTIAL) {
-		const struct iphdr *ip = ip_hdr(skb);
+	if (dev->features & NETIF_F_TSO)
+		mss = skb_shinfo(skb)->gso_size;
 
-		if (ip->protocol == IPPROTO_TCP)
-			return IPCS | TCPCS;
-		else if (ip->protocol == IPPROTO_UDP)
-			return IPCS | UDPCS;
-		WARN_ON(1);	/* we need a WARN() */
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_01: case RTL_GIGA_MAC_VER_02:
+	case RTL_GIGA_MAC_VER_03: case RTL_GIGA_MAC_VER_04:
+	case RTL_GIGA_MAC_VER_05: case RTL_GIGA_MAC_VER_06:
+	case RTL_GIGA_MAC_VER_10: case RTL_GIGA_MAC_VER_11:
+	case RTL_GIGA_MAC_VER_12: case RTL_GIGA_MAC_VER_13:
+	case RTL_GIGA_MAC_VER_16: case RTL_GIGA_MAC_VER_17:
+		/* LargeSend implies checksums, as IPCS et al occupy the
+		 * same bits as the MSS.
+		 *
+		 * FIXME: MSS may actually go in opts2 on all MAC versions,
+		 * but I only have the one below so I'll keep this as it was.
+		 */
+		if (mss) {
+			*opts1 |= LargeSend;
+			*opts1 |= (mss & MSSMask) << MSSShift;
+		} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			if (ip->protocol == IPPROTO_TCP)
+				*opts1 |= IPCS | TCPCS;
+			else if (ip->protocol == IPPROTO_UDP)
+				*opts1 |= IPCS | UDPCS;
+			else
+				WARN_ON_ONCE(1);
+		}
+		break;
+	default:
+		/* LargeSend implies checksums
+		 */
+		if (mss) {
+			/* This used to put (mss & MSSMask) << MSSShift
+			 * in opts1, but that causes my NIC to retransmit
+			 * the last packet over and over and over again...
+			 *
+			 * I have XID 281000c0, so this may be different
+			 * for other chips. Testers welcomed!
+			 */
+			*opts1 |= LargeSend;
+			*opts2 |= (mss & MSSMask) << MSSShift_2;
+		} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			if (ip->protocol == IPPROTO_TCP)
+				*opts2 |= IPCS_2 | TCPCS_2;
+			else if (ip->protocol == IPPROTO_UDP)
+				*opts2 |= IPCS_2 | UDPCS_2;
+			else
+				WARN_ON_ONCE(1);
+		}
+		break;
 	}
-	return 0;
 }
 
 static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
@@ -3365,7 +3413,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	void __iomem *ioaddr = tp->mmio_addr;
 	dma_addr_t mapping;
 	u32 status, len;
-	u32 opts1;
+	u32 opts1, opts2;
 
 	if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) {
 		if (netif_msg_drv(tp)) {
@@ -3379,9 +3427,11 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	if (unlikely(le32_to_cpu(txd->opts1) & DescOwn))
 		goto err_stop;
 
-	opts1 = DescOwn | rtl8169_tso_csum(skb, dev);
+	opts1 = DescOwn;
+	opts2 = rtl8169_tx_vlan_tag(tp, skb);
+	rtl8169_set_tso_csum(skb, dev, &opts1, &opts2);
 
-	frags = rtl8169_xmit_frags(tp, skb, opts1);
+	frags = rtl8169_xmit_frags(tp, skb, opts1, opts2);
 	if (frags) {
 		len = skb_headlen(skb);
 		opts1 |= FirstFrag;
@@ -3395,7 +3445,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	tp->tx_skb[entry].len = len;
 	txd->addr = cpu_to_le64(mapping);
-	txd->opts2 = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
+	txd->opts2 = cpu_to_le32(opts2);
 
 	wmb();
 



^ permalink raw reply related

* Re: pktgen: tricks
From: Rick Jones @ 2009-09-24  1:05 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jesper Dangaard Brouer, Robert Olsson, netdev
In-Reply-To: <20090923174141.1d350103@s6510>

Stephen Hemminger wrote:
> On Tue, 22 Sep 2009 22:49:02 -0700
> Stephen Hemminger <shemminger@vyatta.com> wrote:
> 
> 
>>I thought others want to know how to get maximum speed of pktgen.
>>
>>1. Run nothing else (even X11), just a command line
>>2. Make sure ethernet flow control is disabled
>>   ethtool -A eth0 autoneg off rx off tx off
>>3. Make sure clocksource is TSC.  On my old SMP Opteron's
>>   needed to get patch since in 2.6.30 or later, the clock guru's
>>   decided to remove it on all non Intel machines.  Look for patch
>>   than enables "tsc=reliable"
>>4. Compile Ethernet drivers in, the overhead of the indirect
>>   function call required for modules (or cache footprint),
>>   slows things down.
>>5. Increase transmit ring size to 1000
>>   ethtool -G eth0 tx 1000
>>
>>Result: OK: 70408581(c70405979+d2602) nsec, 100000000 (60byte,0frags)
>>  1420281pps 681Mb/sec (681734880bps) errors: 0
> 
> 
> Other kernel config help:
>   - turn off lock dependency checker, kmecheck, page alloc debug
>     basically anything that slows stuff down
>   - turn off content group scheduler

I and thought netperf was getting away from real-world?-)

rick jones

^ permalink raw reply

* Re: pktgen: tricks
From: Stephen Hemminger @ 2009-09-24  0:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jesper Dangaard Brouer, Robert Olsson, netdev
In-Reply-To: <20090922224902.17ed6cc4@nehalam>

On Tue, 22 Sep 2009 22:49:02 -0700
Stephen Hemminger <shemminger@vyatta.com> wrote:

> I thought others want to know how to get maximum speed of pktgen.
> 
> 1. Run nothing else (even X11), just a command line
> 2. Make sure ethernet flow control is disabled
>    ethtool -A eth0 autoneg off rx off tx off
> 3. Make sure clocksource is TSC.  On my old SMP Opteron's
>    needed to get patch since in 2.6.30 or later, the clock guru's
>    decided to remove it on all non Intel machines.  Look for patch
>    than enables "tsc=reliable"
> 4. Compile Ethernet drivers in, the overhead of the indirect
>    function call required for modules (or cache footprint),
>    slows things down.
> 5. Increase transmit ring size to 1000
>    ethtool -G eth0 tx 1000
> 
> Result: OK: 70408581(c70405979+d2602) nsec, 100000000 (60byte,0frags)
>   1420281pps 681Mb/sec (681734880bps) errors: 0

Other kernel config help:
  - turn off lock dependency checker, kmecheck, page alloc debug
    basically anything that slows stuff down
  - turn off content group scheduler

^ permalink raw reply

* [PULL] virtio_net updates
From: Rusty Russell @ 2009-09-24  0:35 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Herbert Xu, Mark McLoughlin, Dinesh Subhraveti, Amit Shah

Hi Dave,

   Now Linus has the prereq add_buf changes, these can all feed via you.  Note
that the driver is changed to return TX_BUSY first which is simplest, then
complicated again to avoid it.

These have been in linux-next since last merge window, too.

Thanks!
Rusty.


The following changes since commit a724eada8c2a7b62463b73ccf73fd0bb6e928aeb:
  Linus Torvalds (1):
        Merge branch 'ixp4xx' of git://git.kernel.org/.../chris/linux-2.6

are available in the git repository at:

  ssh://master.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-davem.git master

Amit Shah (1):
      virtio_net: Check for room in the vq before adding buffer

Rusty Russell (5):
      virtio_net: skb_orphan() and nf_reset() in xmit path.
      virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.
      virtio_net: don't free buffers in xmit ring
      virtio_net: formalize skb_vnet_hdr
      virtio_net: avoid (most) NETDEV_TX_BUSY by stopping queue early.

 drivers/net/virtio_net.c |  229 ++++++++++++++++++----------------------------
 1 files changed, 88 insertions(+), 141 deletions(-)

commit 2b5bbe3b8bee8b38bdc27dd9c0270829b6eb7eeb
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Thu Sep 24 09:59:17 2009 -0600

    virtio_net: skb_orphan() and nf_reset() in xmit path.
    
    The complex transmit free logic was introduced to avoid hangs on
    removing the ip_conntrack module and also because drivers aren't
    generally supposed to keep stale skbs for unbounded times.
    
    After some debate, it was decided that while doing skb_orphan()
    generally is a rat's nest, we can do it in this driver.  Following
    patches take advantage of this.
    
    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

 drivers/net/virtio_net.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

commit 8958f574dbe7e41cc54df0df1accc861bb9f6be8
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Thu Sep 24 09:59:18 2009 -0600

    virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.
    
    This effectively reverts 99ffc696d10b28580fe93441d627cf290ac4484c
    "virtio: wean net driver off NETDEV_TX_BUSY".
    
    The complexity of queuing an skb (setting a tasklet to re-xmit) is
    questionable, especially once we get rid of the other reason for the
    tasklet in the next patch.
    
    If the skb won't fit in the tx queue, just return NETDEV_TX_BUSY.
    This is frowned upon, so a followup patch uses a more complex solution.
    
    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>

 drivers/net/virtio_net.c |   46 ++++++++++------------------------------------
 1 files changed, 10 insertions(+), 36 deletions(-)

commit b0c39dbdc204006ef3558a66716ff09797619778
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Thu Sep 24 09:59:19 2009 -0600

    virtio_net: don't free buffers in xmit ring
    
    The virtio_net driver is complicated by the two methods of freeing old
    xmit buffers (in addition to freeing old ones at the start of the xmit
    path).
    
    The original code used a 1/10 second timer attached to xmit_free(),
    reset on every xmit.  Before we orphaned skbs on xmit, the
    transmitting userspace could block with a full socket until the timer
    fired, the skb destructor was called, and they were re-woken.
    
    So we added the VIRTIO_F_NOTIFY_ON_EMPTY feature: supporting devices
    send an interrupt (even if normally suppressed) on an empty xmit ring
    which makes us schedule xmit_tasklet().  This was a benchmark win.
    
    Unfortunately, VIRTIO_F_NOTIFY_ON_EMPTY makes quite a lot of work: a
    host which is faster than the guest will fire the interrupt every xmit
    packet (slowing the guest down further).  Attempting mitigation in the
    host adds overhead of userspace timers (possibly with the additional
    pain of signals), and risks increasing latency anyway if you get it
    wrong.
    
    In practice, this effect was masked by benchmarks which take advantage
    of GSO (with its inherent transmit batching), but it's still there.
    
    Now we orphan xmitted skbs, the pressure is off: remove both paths and
    no longer request VIRTIO_F_NOTIFY_ON_EMPTY.  Note that the current
    QEMU will notify us even if we don't negotiate this feature (legal,
    but suboptimal); a patch is outstanding to improve that.
    
    Move the skb_orphan/nf_reset to after we've done the send and notified
    the other end, for a slight optimization.
    
    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Mark McLoughlin <markmc@redhat.com>

 drivers/net/virtio_net.c |   64 +++------------------------------------------
 1 files changed, 5 insertions(+), 59 deletions(-)

commit b3f24698a7faa6e9d8a14124cfdc25353fc8ca19
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Thu Sep 24 09:59:19 2009 -0600

    virtio_net: formalize skb_vnet_hdr
    
    We put the virtio_net_hdr into the skb's cb region; turn this into a
    union to clean up the code slightly and allow future expansion.
    
    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Mark McLoughlin <markmc@redhat.com>
    Cc: Dinesh Subhraveti <dineshs@us.ibm.com>

 drivers/net/virtio_net.c |   81 +++++++++++++++++++++++++---------------------
 1 files changed, 44 insertions(+), 37 deletions(-)

commit 48925e372f04f5e35fec6269127c62b2c71ab794
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Thu Sep 24 09:59:20 2009 -0600

    virtio_net: avoid (most) NETDEV_TX_BUSY by stopping queue early.
    
    Now we can tell the theoretical capacity remaining in the output
    queue, virtio_net can waste entries by stopping the queue early.
    
    It doesn't work in the case of indirect buffers and kmalloc failure,
    but that's rare (we could drop the packet in that case, but other
    drivers return TX_BUSY for similar reasons).
    
    For the record, I think this patch reflects poorly on the linux
    network API.
    
    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Dinesh Subhraveti <dineshs@us.ibm.com>

 drivers/net/virtio_net.c |   64 ++++++++++++++++++++++++++++-----------------
 1 files changed, 40 insertions(+), 24 deletions(-)

commit 0aea51c37fc5868cd723f670af9056c2ef694fee
Author: Amit Shah <amit.shah@redhat.com>
Date:   Wed Aug 26 14:58:28 2009 +0530

    virtio_net: Check for room in the vq before adding buffer
    
    Saves us one cycle of alloc-add-free if the queue was full.
    
    Signed-off-by: Amit Shah <amit.shah@redhat.com>
    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (modified)

 drivers/net/virtio_net.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

^ permalink raw reply

* [PATCH]ks8851_mll ethernet network driver
From: david choi @ 2009-09-24  0:05 UTC (permalink / raw)
  To: davem; +Cc: greg, netdev, Charles Li, choi, jgarzik, shemminger

Hello David Miller,

Based on email_client.txt, I use add-on firefox extension of
ViewSourceWith to send this patch.
Hope to fix line-wrapping issue.

Subject         : [PATCH] ks8851_mll ethernet network driver
>From            : David J. Choi( david.choi@micrel.com )
Body of the explanation :
                Fix up issues as followings;
                -Remove the lines of "#define DEBUG", "#define MALLOC"
                -Remove the compile warnings.
                -Add to return IRQ_NONE when there is no IRQ
indication in hardware.
                -Remove mutex_lock/unlock() in ks_net_open() because
they are redundancy in the function.
Signed-off-by   : David J. Choi

Regards,
David J. Choi
------------------------------------
--- linux-2.6.31-rc3/drivers/net/ks8851_mll.c.orig	2009-09-17
10:18:56.000000000 -0700
+++ linux-2.6.31-rc3/drivers/net/ks8851_mll.c	2009-09-17
10:09:37.000000000 -0700
@@ -21,8 +21,6 @@
  * KS8851 16bit MLL chip from Micrel Inc.
  */

-#define DEBUG
-
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/netdevice.h>
@@ -419,7 +417,6 @@ union ks_tx_hdr {
  * or one of the work queues.
  *
  */
-#define MALLOC(x)		kmalloc(x, GFP_KERNEL)

 /* Receive multiplex framer header info */
 struct type_frame_head {
@@ -552,11 +549,9 @@ static void ks_wrreg16(struct ks_net *ks
  */
 static inline void ks_inblk(struct ks_net *ks, u16 *wptr, u32 len)
 {
-	u32 data_port = (u32)ks->hw_addr;
 	len >>= 1;
-	do {
-		*wptr++ = (u16)ioread16(data_port);
-	} while (--len);
+	while (len--)
+		*wptr++ = (u16)ioread16(ks->hw_addr);
 }

 /**
@@ -568,11 +563,9 @@ static inline void ks_inblk(struct ks_ne
  */
 static inline void ks_outblk(struct ks_net *ks, u16 *wptr, u32 len)
 {
-	u32 data_port = (u32)ks->hw_addr;
 	len >>= 1;
-	do {
-		iowrite16(*wptr++, data_port);
-	} while (--len);
+	while (len--)
+		iowrite16(*wptr++, ks->hw_addr);
 }

 /**
@@ -818,6 +811,11 @@ static irqreturn_t ks_irq(int irq, void
 	ks_save_cmd_reg(ks);

 	status = ks_rdreg16(ks, KS_ISR);
+	if (unlikely(!status)) {
+		ks_restore_cmd_reg(ks);
+		return IRQ_NONE;
+	}
+
 	ks_wrreg16(ks, KS_ISR, status);

 	if (likely(status & IRQ_RXI))
@@ -858,7 +856,6 @@ static int ks_net_open(struct net_device
 	/* lock the card, even if we may not actually do anything
 	 * else at the moment.
 	 */
-	mutex_lock(&ks->lock);

 	if (netif_msg_ifup(ks))
 		ks_dbg(ks, "%s - entry\n", __func__);
@@ -875,8 +872,6 @@ static int ks_net_open(struct net_device
 	if (netif_msg_ifup(ks))
 		ks_dbg(ks, "network device %s up\n", netdev->name);

-	mutex_unlock(&ks->lock);
-
 	return 0;
 }

@@ -1515,12 +1510,13 @@ void ks_enable(struct ks_net *ks)

 static int ks_hw_init(struct ks_net *ks)
 {
+#define	MHEADER_SIZE	(sizeof(struct type_frame_head) * MAX_RECV_FRAMES)
 	ks->promiscuous = 0;
 	ks->all_mcast = 0;
 	ks->mcast_lst_size = 0;

 	ks->frame_head_info = (struct type_frame_head *) \
-		MALLOC(sizeof(struct type_frame_head) * MAX_RECV_FRAMES);
+		kmalloc(MHEADER_SIZE, GFP_KERNEL);
 	if (!ks->frame_head_info) {
 		printk(KERN_ERR "Error: Fail to allocate frame memory\n");
 		return false;

^ permalink raw reply

* Re: ixgbe patch to provide NIC's tx/rx counters via ethtool
From: Rick Jones @ 2009-09-24  0:02 UTC (permalink / raw)
  To: Ben Greear; +Cc: NetDev
In-Reply-To: <4ABAA2D0.4030608@candelatech.com>

Ben Greear wrote:
> When LRO is enabled, the received packet and byte counters represent the
> LRO'd packets, not the packets/bytes on the wire. 

When LRO is enabled, are all the bytes on the wire actually transferred into the 
host?

rick jones

^ permalink raw reply

* [BUG] af_unix race in close?
From: Stephen Hemminger @ 2009-09-23 23:54 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

This oops seems to show lots of times:
http://www.kerneloops.org/guilty.php?guilty=unix_write_space&version=2.6.31-release&start=2064384&end=2097151&class=oops
Looks like race in unix domain socket close with data outstanding.

BUG: unable to handle kernel paging request at 6b6b6b8f
IP: [] unix_write_space+0x45/0x87
*pde = 00000000 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/BAT1/charge_full
Modules linked in: ext2 fuse nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_multipath uinput uvcvideo videodev v4l1_compat arc4 snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support ecb serio_raw i2c_i801 snd_hda_intel joydev snd_hda_codec snd_hwdep snd_pcm snd_timer ath5k r8169 snd mac80211 mii soundcore ath snd_page_alloc jmb38x_ms cfg80211 memstick rfkill wmi squashfs vfat fat mmc_block i915 sdhci_pci ata_generic pata_acpi sdhci mmc_core drm i2c_algo_bit i2c_core usb_storage video output [last unloaded: microcode]

Pid: 6809, comm: metacity Not tainted (2.6.31-0.125.4.2.rc5.git2.fc12.i686 #1) AOA110
EIP: 0060:[] EFLAGS: 00010202 CPU: 0
EIP is at unix_write_space+0x45/0x87
EAX: 6b6b6b6b EBX: ec988780 ECX: 00000000 EDX: 6b6b6b8f
ESI: ec988950 EDI: ffffff20 EBP: ec941e28 ESP: ec941e1c
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process metacity (pid: 6809, ti=ec940000 task=e63095c0 task.ti=ec940000)
Stack:
 37dc7803 ec988780 000000e1 ec941e40 c0772142 37dc7803 dcc1c900 dcc1c900
<0> c07f6a02 ec941e50 c0775766 37dc7803 dcc1c900 ec941e60 c07754ae 37dc7803
<0> dcc1c900 ec941e78 c07755db 37dc7803 ec98b0c0 dcc1c900 00000000 ec941ea0
Call Trace:
 [] ? sock_wfree+0x44/0x68
 [] ? unix_release_sock+0x182/0x1e0
 [] ? skb_release_head_state+0x6c/0xcb
 [] ? __kfree_skb+0x20/0x94
 [] ? kfree_skb+0x68/0x7f
 [] ? unix_release_sock+0x182/0x1e0
 [] ? unix_release+0x2f/0x42
 [] ? sock_release+0x29/0x7f
 [] ? sock_close+0x30/0x45
 [] ? __fput+0x101/0x1a2
 [] ? fput+0x27/0x3a
 [] ? filp_close+0x64/0x7f
 [] ? put_files_struct+0x68/0xbd
 [] ? exit_files+0x43/0x59
 [] ? do_exit+0x1d6/0x648
 [] ? audit_syscall_entry+0x134/0x167
 [] ? do_group_exit+0x72/0x99
 [] ? sys_exit_group+0x27/0x3c
 [] ? syscall_call+0x7/0xb
Code: 00 89 45 f4 31 c0 89 f0 e8 9a 76 02 00 8b 83 dc 00 00 00 c1 e0 02 3b 83 e4 00 00 00 7f 32 8b 83 a4 00 00 00 85 c0 74 17 8d 50 24 <39> 50 24 74 0f b9 01 00 00 00 ba 01 00 00 00 e8 bb cf c3 ff b9 
EIP: [] unix_write_space+0x45/0x87 SS:ESP 0068:ec941e1c
CR2: 000000006b6b6b8f
---[ end trace 4a36bd1eb2fc9896 ]---



^ permalink raw reply

* [PATCH v18 70/80] c/r: Add AF_UNIX support (v12)
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith,
	Alexey Dobriyan, netdev, Oren Laadan
In-Reply-To: <1253749920-18673-1-git-send-email-orenl@librato.com>

From: Dan Smith <danms@us.ibm.com>

This patch adds basic checkpoint/restart support for AF_UNIX sockets.  It
has been tested with a single and multiple processes, and with data inflight
at the time of checkpoint.  It supports socketpair()s, path-based, and
abstract sockets.

Changes in v12:
  - Collect sockets for leak-detection
  - Adjust socket reference count during leak detection phase

Changes in v11:
  - Create a struct socket for orphan socket during checkpoint
  - Make sockets proper objhash objects and use checkpoint_obj() on them
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - Remove struct timeval from socket header
  - Save and restore UNIX socket peer credentials
  - Set socket flags on restore using sock_setsockopt() where possible
  - Fail on the TIMESTAMPING_* flags for the moment (with a TODO)
  - Remove other explicit flag checks that are no longer copied blindly
  - Changed functions/variables names to follow existing conventions
  - Use proto_ops->{checkpoint,restart} methods for af_unix
  - Cleanup sock_file_restore()/sock_file_checkpoint()
  - Make ckpt_hdr_socket be part of ckpt_hdr_file_socket
  - Fold do_sock_file_checkpoint() into sock_file_checkpoint()
  - Fold do_sock_file_restore() into sock_file_restore()
  - Move sock_file_{checkpoint,restore} to net/checkpoint.c
  - Properly define sock_file_{checkpoint,restore} in header file
  - sock_file_restore() now calls restore_file_common()

Changes in v10:
  - Moved header structure definitions back to checkpoint_hdr.h
  - Moved AF_UNIX checkpoint/restart code to net/unix/checkpoint.c
  - Make sock_unix_*() functions only compile if CONFIG_UNIX=y
  - Add TODO for CONFIG_UNIX=m case

Changes in v9:
  - Fix double-free of skb's in the list and target holding queue in the
    error path of sock_copy_buffers()
  - Adjust use of ckpt_read_string() to match new signature

Changes in v8:
  - Fix stale dev_alloc_skb() from before the conversion to skb_clone()
  - Fix a couple of broken error paths
  - Fix memory leak of kvec.iov_base on successful return from sendmsg()
  - Fix condition for deciding when to run sock_cptrst_verify()
  - Fix buffer queue copy algorithm to hold the lock during walk(s)
  - Log the errno when either getname() or getpeer() fails
  - Add comments about ancillary messages in the UNIX queue
  - Add TODO comments for credential restore and flags via setsockopt()
  - Add TODO comment about strangely-connected dgram sockets and the use
    of sendmsg(peer)

Changes in v7:
  - Fix failure to free iov_base in error path of sock_read_buffer()
  - Change sock_read_buffer() to use _ckpt_read_obj_type() to get the
    header length and then use ckpt_kread() directly to read the payload
  - Change sock_read_buffers() to sock_unix_read_buffers() and break out
    some common functionality to better accommodate the subsequent INET
    patch
  - Generalize sock_unix_getnames() into sock_getnames() so INET can use it
  - Change skb_morph() to skb_clone() which uses the more common path and
    still avoids the copy
  - Add check to validate the socket type before creating socket
    on restore
  - Comment the CAP_NET_ADMIN override in sock_read_buffer_hdr
  - Strengthen the comment about priming the buffer limits
  - Change the objhash functions to deny direct checkpoint of sockets and
    remove the reference counting function
  - Change SOCKET_BUFFERS to SOCKET_QUEUE
  - Change this,peer objrefs to signed integers
  - Remove names from internal socket structures
  - Fix handling of sock_copy_buffers() result
  - Use ckpt_fill_fname() instead of d_path() for writing CWD
  - Use sock_getname() and sock_getpeer() for proper security hookage
  - Return -ENOSYS for unsupported socket families in checkpoint and restart
  - Use sock_setsockopt() and sock_getsockopt() where possible to save and
    restore socket option values
  - Check for SOCK_DESTROY flag in the global verify function because none
    of our supported socket types use it
  - Check for SOCK_USE_WRITE_QUEUE in AF_UNIX restore function because
    that flag should not be used on such a socket
  - Check socket state in UNIX restart path to validate the subset of valid
    values

Changes in v6:
  - Moved the socket addresses to the per-type header
  - Eliminated the HASCWD flag
  - Remove use of ckpt_write_err() in restart paths
  - Change the order in which buffers are read so that we can set the
    socket's limit equal to the size of the image's buffers (if appropriate)
    and then restore the original values afterwards.
  - Use the ckpt_validate_errno() helper
  - Add a check to make sure that we didn't restore a (UNIX) socket with
    any skb's in the send buffer
  - Fix up sock_unix_join() to not leave addr uninitialized for socketpair
  - Remove inclusion of checkpoint_hdr.h in the socket files
  - Make sock_unix_write_cwd() use ckpt_write_string() and use the new
    ckpt_read_string() for reading the cwd
  - Use the restored realcred credentials in sock_unix_join()
  - Fix error path of the chdir_and_bind
  - Change the algorithm for reloading the socket buffers to use sendmsg()
    on the socket's peer for better accounting
  - For DGRAM sockets, check the backlog value against the system max
    to avoid letting a restart bypass the overloaded queue length
  - Use sock_bind() instead of sock->ops->bind() to gain the security hook
  - Change "restart" to "restore" in some of the function names

Changes in v5:
  - Change laddr and raddr buffers in socket header to be long enough
    for INET6 addresses
  - Place socket.c and sock.h function definitions inside #ifdef
    CONFIG_CHECKPOINT
  - Add explicit check in sock_unix_makeaddr() to refuse if the
    checkpoint image specifies an addr length of 0
  - Split sock_unix_restart() into a few pieces to facilitate:
  - Changed behavior of the unix restore code so that unlinked LISTEN
    sockets don't do a bind()...unlink()
  - Save the base path of a bound socket's path so that we can chdir()
    to the base before bind() if it is a relative path
  - Call bind() for any socket that is not established but has a
    non-zero-length local address
  - Enforce the current sysctl limit on socket buffer size during restart
    unless the user holds CAP_NET_ADMIN
  - Unlink a path-based socket before calling bind()

Changes in v4:
  - Changed the signdness of rcvlowat, rcvtimeo, sndtimeo, and backlog
    to match their struct sock definitions.  This should avoid issues
    with sign extension.
  - Add a sock_cptrst_verify() function to be run at restore time to
    validate several of the values in the checkpoint image against
    limits, flag masks, etc.
  - Write an error string with ctk_write_err() in the obscure cases
  - Don't write socket buffers for listen sockets
  - Sanity check address lengths before we agree to allocate memory
  - Check the result of inserting the peer object in the objhash on
    restart
  - Check return value of sock_cptrst() on restart
  - Change logic in remote getname() phase of checkpoint to not fail for
    closed (et al) sockets
  - Eliminate the memory copy while reading socket buffers on restart

Changes in v3:
  - Move sock_file_checkpoint() above sock_file_restore()
  - Change __sock_file_*() functions to do_sock_file_*()
  - Adjust some of the struct cr_hdr_socket alignment
  - Improve the sock_copy_buffers() algorithm to avoid locking the source
    queue for the entire operation
  - Fix alignment in the socket header struct(s)
  - Move the per-protocol structure (ckpt_hdr_socket_un) out of the
    common socket header and read/write it separately
  - Fix missing call to sock_cptrst() in restore path
  - Break out the socket joining into another function
  - Fix failure to restore the socket address thus fixing getname()
  - Check the state values on restart
  - Fix case of state being TCP_CLOSE, which allows dgram sockets to be
    properly connected (if appropriate) to their peer and maintain the
    sockaddr for getname() operation
  - Fix restoring a listening socket that has been unlink()'d
  - Fix checkpointing sockets with an in-flight FD-passing SKB.  Fail
    with EBUSY.
  - Fix checkpointing listening sockets with an unaccepted connection.
    Fail with EBUSY.
  - Changed 'un' to 'unix' in function and structure names

Changes in v2:
  - Change GFP_KERNEL to GFP_ATOMIC in sock_copy_buffers() (this seems
    to be rather common in other uses of skb_copy())
  - Move the ckpt_hdr_socket structure definition to linux/socket.h
  - Fix whitespace issue
  - Move sock_file_checkpoint() to net/socket.c for symmetry

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: netdev@vger.kernel.org
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |    7 +
 checkpoint/objhash.c           |   69 ++++
 include/linux/checkpoint.h     |    7 +
 include/linux/checkpoint_hdr.h |   87 +++++
 include/linux/net.h            |    2 +
 include/net/af_unix.h          |   14 +
 include/net/sock.h             |   12 +
 net/Makefile                   |    2 +
 net/checkpoint.c               |  752 ++++++++++++++++++++++++++++++++++++++++
 net/socket.c                   |    6 +-
 net/unix/Makefile              |    1 +
 net/unix/af_unix.c             |    9 +
 net/unix/checkpoint.c          |  634 +++++++++++++++++++++++++++++++++
 13 files changed, 1601 insertions(+), 1 deletions(-)
 create mode 100644 net/checkpoint.c
 create mode 100644 net/unix/checkpoint.c

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 1de89d6..058bc0e 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -22,6 +22,7 @@
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 
 /**************************************************************************
@@ -591,6 +592,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_FIFO,
 		.restore = fifo_file_restore,
 	},
+	/* socket */
+	{
+		.file_name = "SOCKET",
+		.file_type = CKPT_FILE_SOCKET,
+		.restore = sock_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index bf2f761..0978060 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -20,6 +20,7 @@
 #include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 struct ckpt_obj;
 struct ckpt_obj_ops;
@@ -234,6 +235,40 @@ static void obj_groupinfo_drop(void *ptr, int lastref)
 	put_group_info((struct group_info *) ptr);
 }
 
+static int obj_sock_grab(void *ptr)
+{
+	sock_hold((struct sock *) ptr);
+	return 0;
+}
+
+static void obj_sock_drop(void *ptr, int lastref)
+{
+	struct sock *sk = (struct sock *) ptr;
+
+	/*
+	 * Sockets created during restart are graft()ed, i.e. have a
+	 * valid @sk->sk_socket. Because only an fput() results in the
+	 * necessary sock_release(), we may leak the struct socket of
+	 * sockets that were not attached to a file. Therefore, if
+	 * @lastref is set, we hereby invoke sock_release() on sockets
+	 * that we have put into the objhash but were never attached
+	 * to a file.
+	 */
+	if (lastref && sk->sk_socket && !sk->sk_socket->file) {
+		struct socket *sock = sk->sk_socket;
+		sock_orphan(sk);
+		sock->sk = NULL;
+		sock_release(sock);
+	}
+
+	sock_put((struct sock *) ptr);
+}
+
+static int obj_sock_users(void *ptr)
+{
+	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -362,6 +397,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_groupinfo,
 		.restore = restore_groupinfo,
 	},
+	/* sock object */
+	{
+		.obj_name = "SOCKET",
+		.obj_type = CKPT_OBJ_SOCK,
+		.ref_drop = obj_sock_drop,
+		.ref_grab = obj_sock_grab,
+		.ref_users = obj_sock_users,
+		.checkpoint = checkpoint_sock,
+		.restore = restore_sock,
+	},
 };
 
 
@@ -751,6 +796,26 @@ static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
  */
 
 /**
+ * obj_sock_adjust_users - remove implicit reference on DEAD sockets
+ * @obj: CKPT_OBJ_SOCK object to adjust
+ *
+ * Sockets that have been disconnected from their struct file have
+ * a reference count one less than normal sockets.  The objhash's
+ * assumption of such a reference is therefore incorrect, so we correct
+ * it here.
+ */
+static inline void obj_sock_adjust_users(struct ckpt_obj *obj)
+{
+	struct sock *sk = (struct sock *)obj->ptr;
+
+	if (sock_flag(sk, SOCK_DEAD)) {
+		obj->users--;
+		ckpt_debug("Adjusting SOCK %i count to %i\n",
+			   obj->objref, obj->users);
+	}
+}
+
+/**
  * ckpt_obj_contained - test if shared objects are contained in checkpoint
  * @ctx: checkpoint context
  *
@@ -773,6 +838,10 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)
 	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
 		if (!obj->ops->ref_users)
 			continue;
+
+		if (obj->ops->obj_type == CKPT_OBJ_SOCK)
+			obj_sock_adjust_users(obj);
+
 		if (obj->ops->ref_users(obj->ptr) != obj->users) {
 			ckpt_debug("usage leak: %s\n", obj->ops->obj_name);
 			ckpt_write_err(ctx, "OP", "leak: usage (%d != %d (%s)",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ec98a43..92a21b2 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <net/sock.h>
 
 /* ckpt_ctx: kflags */
 #define CKPT_CTX_CHECKPOINT_BIT		0
@@ -77,6 +78,12 @@ extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 extern char *ckpt_fill_fname(struct path *path, struct path *root,
 			     char *buf, int *len);
 
+/* socket functions */
+extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
+			      struct socket *socket,
+			      struct sockaddr *loc, unsigned *loc_len,
+			      struct sockaddr *rem, unsigned *rem_len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index e4dfbd7..ac16c59 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,14 @@
 
 #include <linux/types.h>
 
+#ifdef __KERNEL__
+#include <linux/socket.h>
+#include <linux/un.h>
+#else
+#include <sys/socket.h>
+#include <sys/un.h>
+#endif
+
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
@@ -92,6 +100,11 @@ enum {
 	CKPT_HDR_SIGNAL_TASK,
 	CKPT_HDR_SIGPENDING,
 
+	CKPT_HDR_SOCKET = 701,
+	CKPT_HDR_SOCKET_QUEUE,
+	CKPT_HDR_SOCKET_BUFFER,
+	CKPT_HDR_SOCKET_UNIX,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -127,6 +140,7 @@ enum obj_type {
 	CKPT_OBJ_CRED,
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
+	CKPT_OBJ_SOCK,
 	CKPT_OBJ_MAX
 };
 
@@ -353,6 +367,7 @@ enum file_type {
 	CKPT_FILE_GENERIC,
 	CKPT_FILE_PIPE,
 	CKPT_FILE_FIFO,
+	CKPT_FILE_SOCKET,
 	CKPT_FILE_MAX
 };
 
@@ -376,6 +391,78 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));
 
+/* socket */
+struct ckpt_hdr_socket {
+	struct ckpt_hdr h;
+
+	struct { /* struct socket */
+		__u64 flags;
+		__u8 state;
+	} socket __attribute__ ((aligned(8)));
+
+	struct { /* struct sock_common */
+		__u32 bound_dev_if;
+		__u32 reuse;
+		__u16 family;
+		__u8 state;
+	} sock_common __attribute__ ((aligned(8)));
+
+	struct { /* struct sock */
+		__s64 rcvlowat;
+		__u64 flags;
+
+		__s64 rcvtimeo;
+		__s64 sndtimeo;
+
+		__u32 err;
+		__u32 err_soft;
+		__u32 priority;
+		__s32 rcvbuf;
+		__s32 sndbuf;
+		__u16 type;
+		__s16 backlog;
+
+		__u8 protocol;
+		__u8 state;
+		__u8 shutdown;
+		__u8 userlocks;
+		__u8 no_check;
+
+		struct linger linger;
+	} sock __attribute__ ((aligned(8)));
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_queue {
+	struct ckpt_hdr h;
+	__u32 skb_count;
+	__u32 total_bytes;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_buffer {
+	struct ckpt_hdr h;
+	__s32 sk_objref;
+	__s32 pr_objref;
+};
+
+#define CKPT_UNIX_LINKED 1
+struct ckpt_hdr_socket_unix {
+	struct ckpt_hdr h;
+	__s32 this;
+	__s32 peer;
+	__u32 peercred_uid;
+	__u32 peercred_gid;
+	__u32 flags;
+	__u32 laddr_len;
+	__u32 raddr_len;
+	struct sockaddr_un laddr;
+	struct sockaddr_un raddr;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_file_socket {
+	struct ckpt_hdr_file common;
+	__s32 sock_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/net.h b/include/linux/net.h
index b99f350..d1ce6eb 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -232,6 +232,8 @@ extern int   	     sock_sendmsg(struct socket *sock, struct msghdr *msg,
 				  size_t len);
 extern int	     sock_recvmsg(struct socket *sock, struct msghdr *msg,
 				  size_t size, int flags);
+extern int	     sock_attach_fd(struct socket *sock, struct file *file,
+				    int flags);
 extern int 	     sock_map_fd(struct socket *sock, int flags);
 extern struct socket *sockfd_lookup(int fd, int *err);
 #define		     sockfd_put(sock) fput(sock->file)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 1614d78..e42a714 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -68,4 +68,18 @@ static inline int unix_sysctl_register(struct net *net) { return 0; }
 static inline void unix_sysctl_unregister(struct net *net) {}
 #endif
 #endif
+
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_socket;
+extern int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+extern int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+			struct ckpt_hdr_socket *h);
+extern int unix_collect(struct ckpt_ctx *ctx, struct socket *sock);
+
+#else
+#define unix_checkpoint NULL
+#define unix_restore NULL
+#endif /* CONFIG_CHECKPOINT */
+
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index 12530bf..ec351f9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1646,4 +1646,16 @@ extern int sysctl_optmem_max;
 extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;
 
+#ifdef CONFIG_CHECKPOINT
+/* Checkpoint/Restart Functions */
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_sock(struct ckpt_ctx *ctx);
+extern int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+extern struct file *sock_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *h);
+extern int sock_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#endif
+
 #endif	/* _SOCK_H */
diff --git a/net/Makefile b/net/Makefile
index ba324ae..91d12fe 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -66,3 +66,5 @@ ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)		+= sysctl_net.o
 endif
 obj-$(CONFIG_WIMAX)		+= wimax/
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/net/checkpoint.c b/net/checkpoint.c
new file mode 100644
index 0000000..a11ec7a
--- /dev/null
+++ b/net/checkpoint.c
@@ -0,0 +1,752 @@
+/*
+ *  Copyright 2009 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms@us.ibm.com>
+ *             Oren Laadan <orenl@cs.columbia.edu>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/socket.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include <linux/fs_struct.h>
+
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	struct sock *sk;
+};
+
+static int sock_copy_buffers(struct sk_buff_head *from,
+			     struct sk_buff_head *to,
+			     uint32_t *total_bytes)
+{
+	int count1 = 0;
+	int count2 = 0;
+	int i;
+	struct sk_buff *skb;
+	struct sk_buff **skbs;
+
+	*total_bytes = 0;
+
+	spin_lock(&from->lock);
+	skb_queue_walk(from, skb)
+		count1++;
+	spin_unlock(&from->lock);
+
+	skbs = kzalloc(sizeof(*skbs) * count1, GFP_KERNEL);
+	if (!skbs)
+		return -ENOMEM;
+
+	for (i = 0; i < count1;  i++) {
+		skbs[i] = dev_alloc_skb(0);
+		if (!skbs[i])
+			goto err;
+	}
+
+	i = 0;
+	spin_lock(&from->lock);
+	skb_queue_walk(from, skb) {
+		if (++count2 > count1)
+			break; /* The queue changed as we read it */
+
+		skb_morph(skbs[i], skb);
+		skbs[i]->sk = skb->sk;
+		skb_queue_tail(to, skbs[i]);
+
+		*total_bytes += skb->len;
+		i++;
+	}
+	spin_unlock(&from->lock);
+
+	if (count1 != count2)
+		goto err;
+
+	kfree(skbs);
+
+	return count1;
+ err:
+	while (skb_dequeue(to))
+		; /* Pull all the buffers out of the queue */
+	for (i = 0; i < count1; i++)
+		kfree_skb(skbs[i]);
+	kfree(skbs);
+
+	return -EAGAIN;
+}
+
+static int __sock_write_buffers(struct ckpt_ctx *ctx,
+				struct sk_buff_head *queue,
+				int dst_objref)
+{
+	struct sk_buff *skb;
+
+	skb_queue_walk(queue, skb) {
+		struct ckpt_hdr_socket_buffer *h;
+		int ret = 0;
+
+		/* FIXME: This could be a false positive for non-unix
+		 *        buffers, so add a type check here in the
+		 *        future
+		 */
+		if (UNIXCB(skb).fp) {
+			ckpt_write_err(ctx, "TE", "af_unix: pass fd", -EBUSY);
+			return -EBUSY;
+		}
+
+		/* The other ancillary messages are always present
+		 * unlike descriptors.  Even though we can't detect
+		 * them and fail the checkpoint, we're not at risk
+		 * because we don't save out (or restore) the control
+		 * information contained in the skb.
+		 */
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+		if (!h)
+			return -ENOMEM;
+
+		BUG_ON(!skb->sk);
+		ret = checkpoint_obj(ctx, skb->sk, CKPT_OBJ_SOCK);
+		if (ret < 0)
+			goto end;
+		h->sk_objref = ret;
+		h->pr_objref = dst_objref;
+
+		ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+		if (ret < 0)
+			goto end;
+
+		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
+					  CKPT_HDR_BUFFER);
+	end:
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int sock_write_buffers(struct ckpt_ctx *ctx,
+			      struct sk_buff_head *queue,
+			      int dst_objref)
+{
+	struct ckpt_hdr_socket_queue *h;
+	struct sk_buff_head tmpq;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (!h)
+		return -ENOMEM;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &h->total_bytes);
+	if (ret < 0)
+		goto out;
+
+	h->skb_count = ret;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (!ret)
+		ret = __sock_write_buffers(ctx, &tmpq, dst_objref);
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_deferred_write_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	int ret;
+	int dst_objref;
+
+	dst_objref = ckpt_obj_lookup(ctx, dq->sk, CKPT_OBJ_SOCK);
+	if (dst_objref < 0) {
+		ckpt_write_err(ctx, "TE", "socket: owner gone?", dst_objref);
+		return dst_objref;
+	}
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_receive_queue, dst_objref);
+	ckpt_debug("write recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_write_queue, dst_objref);
+	ckpt_debug("write send buffers: %i\n", ret);
+
+	return ret;
+}
+
+int sock_defer_write_buffers(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk = sk;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      sock_deferred_write_buffers, NULL);
+}
+
+int ckpt_sock_getnames(struct ckpt_ctx *ctx, struct socket *sock,
+		       struct sockaddr *loc, unsigned *loc_len,
+		       struct sockaddr *rem, unsigned *rem_len)
+{
+	int ret;
+
+	ret = sock_getname(sock, loc, loc_len);
+	if (ret) {
+		ckpt_write_err(ctx, "TEP", "socket: getname local", ret, sock);
+		return -EINVAL;
+	}
+
+	ret = sock_getpeer(sock, rem, rem_len);
+	if (ret) {
+		if ((sock->sk->sk_type != SOCK_DGRAM) &&
+		    (sock->sk->sk_state == TCP_ESTABLISHED)) {
+			ckpt_write_err(ctx, "TEP", "socket: getname peer",
+				       ret, sock);
+			return -EINVAL;
+		}
+		*rem_len = 0;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst_verify(struct ckpt_hdr_socket *h)
+{
+	uint8_t userlocks_mask = SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK |
+		                 SOCK_BINDADDR_LOCK | SOCK_BINDPORT_LOCK;
+
+	if (h->sock.shutdown & ~SHUTDOWN_MASK)
+		return -EINVAL;
+	if (h->sock.userlocks & ~userlocks_mask)
+		return -EINVAL;
+	if (!ckpt_validate_errno(h->sock.err))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sock_cptrst_opt(int op, struct socket *sock,
+			   int optname, char *opt, int len)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (op == CKPT_CPT)
+		ret = sock_getsockopt(sock, SOL_SOCKET, optname, opt, &len);
+	else
+		ret = sock_setsockopt(sock, SOL_SOCKET, optname, opt, len);
+
+	set_fs(fs);
+
+	return ret;
+}
+
+#define CKPT_COPY_SOPT(op, sk, name, opt) \
+	sock_cptrst_opt(op, sk->sk_socket, name, (char *)opt, sizeof(*opt))
+
+static int sock_cptrst_bufopts(int op, struct sock *sk,
+			       struct ckpt_hdr_socket *h)
+{
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVBUF, &h->sock.rcvbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_RCVBUFFORCE, &h->sock.rcvbuf)) {
+			ckpt_debug("Failed to set SO_RCVBUF");
+			return -EINVAL;
+		}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_SNDBUF, &h->sock.sndbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_SNDBUFFORCE, &h->sock.sndbuf)) {
+			ckpt_debug("Failed to set SO_SNDBUF");
+			return -EINVAL;
+		}
+
+	/* It's silly that we have to fight ourselves here, but
+	 * sock_setsockopt() doubles the initial value, so divide here
+	 * to store the user's value and avoid doubling on restart
+	 */
+	if ((op == CKPT_CPT) && (h->sock.rcvbuf != SOCK_MIN_RCVBUF))
+		h->sock.rcvbuf >>= 1;
+
+	if ((op == CKPT_CPT) && (h->sock.sndbuf != SOCK_MIN_SNDBUF))
+		h->sock.sndbuf >>= 1;
+
+	return 0;
+}
+
+struct sock_flag_mapping {
+	int opt;
+	int flag;
+};
+
+struct sock_flag_mapping sk_flag_map[] = {
+	{SO_OOBINLINE, SOCK_URGINLINE},
+	{SO_KEEPALIVE, SOCK_KEEPOPEN},
+	{SO_BROADCAST, SOCK_BROADCAST},
+	{SO_TIMESTAMP, SOCK_RCVTSTAMP},
+	{SO_TIMESTAMPNS, SOCK_RCVTSTAMPNS},
+	{SO_DEBUG, SOCK_DBG},
+	{SO_DONTROUTE, SOCK_LOCALROUTE},
+};
+
+struct sock_flag_mapping sock_flag_map[] = {
+	{SO_PASSCRED, SOCK_PASSCRED},
+};
+
+static int sock_restore_flag(struct socket *sock,
+			     unsigned long *flags,
+			     int flag,
+			     int option)
+{
+	int v = 1;
+	int ret = 0;
+
+	if (test_and_clear_bit(flag, flags))
+		ret = sock_setsockopt(sock, SOL_SOCKET, option,
+				      (char *)&v, sizeof(v));
+
+	return ret;
+}
+
+
+static int sock_restore_flags(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	unsigned long sk_flags = h->sock.flags;
+	unsigned long sock_flags = h->socket.flags;
+	int ret;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(sk_flag_map); i++) {
+		int opt = sk_flag_map[i].opt;
+		int flag = sk_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sk_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set skopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(sock_flag_map); i++) {
+		int opt = sock_flag_map[i].opt;
+		int flag = sock_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sock_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set sockopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	/* TODO: Handle SOCK_TIMESTAMPING_* flags */
+	if (test_bit(SOCK_TIMESTAMPING_TX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_TX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RAW_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SYS_HARDWARE, &sk_flags)) {
+		ckpt_debug("SOF_TIMESTAMPING_* flags are not supported\n");
+		return -ENOSYS;
+	}
+
+	if (test_and_clear_bit(SOCK_DEAD, &sk_flags))
+		sock_set_flag(sock->sk, SOCK_DEAD);
+
+
+	/* Anything that is still set in the flags that isn't part of
+	 * our protocol's default set, indicates an error
+	 */
+	if (sk_flags & ~sock->sk->sk_flags) {
+		ckpt_debug("Unhandled sock flags: %lx\n", sk_flags);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_copy_timeval(int op, struct sock *sk,
+			     int sockopt, __s64 *saved)
+{
+	struct timeval tv;
+
+	if (op == CKPT_CPT) {
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+		*saved = timeval_to_ns(&tv);
+	} else {
+		tv = ns_to_timeval(*saved);
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst(struct ckpt_ctx *ctx, struct sock *sk,
+		       struct ckpt_hdr_socket *h, int op)
+{
+	if (sk->sk_socket) {
+		CKPT_COPY(op, h->socket.state, sk->sk_socket->state);
+	}
+
+	CKPT_COPY(op, h->sock_common.bound_dev_if, sk->sk_bound_dev_if);
+	CKPT_COPY(op, h->sock_common.family, sk->sk_family);
+
+	CKPT_COPY(op, h->sock.shutdown, sk->sk_shutdown);
+	CKPT_COPY(op, h->sock.userlocks, sk->sk_userlocks);
+	CKPT_COPY(op, h->sock.no_check, sk->sk_no_check);
+	CKPT_COPY(op, h->sock.protocol, sk->sk_protocol);
+	CKPT_COPY(op, h->sock.err, sk->sk_err);
+	CKPT_COPY(op, h->sock.err_soft, sk->sk_err_soft);
+	CKPT_COPY(op, h->sock.type, sk->sk_type);
+	CKPT_COPY(op, h->sock.state, sk->sk_state);
+	CKPT_COPY(op, h->sock.backlog, sk->sk_max_ack_backlog);
+
+	if (sock_cptrst_bufopts(op, sk, h))
+		return -EINVAL;
+
+	if (CKPT_COPY_SOPT(op, sk, SO_REUSEADDR, &h->sock_common.reuse)) {
+		ckpt_debug("Failed to set SO_REUSEADDR");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_PRIORITY, &h->sock.priority)) {
+		ckpt_debug("Failed to set SO_PRIORITY");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVLOWAT, &h->sock.rcvlowat)) {
+		ckpt_debug("Failed to set SO_RCVLOWAT");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_LINGER, &h->sock.linger)) {
+		ckpt_debug("Failed to set SO_LINGER");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_SNDTIMEO, &h->sock.sndtimeo)) {
+		ckpt_debug("Failed to set SO_SNDTIMEO");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_RCVTIMEO, &h->sock.rcvtimeo)) {
+		ckpt_debug("Failed to set SO_RCVTIMEO");
+		return -EINVAL;
+	}
+
+	if (op == CKPT_CPT) {
+		h->sock.flags = sk->sk_flags;
+		h->socket.flags = sk->sk_socket->flags;
+	} else {
+		int ret;
+		mm_segment_t old_fs;
+
+		old_fs = get_fs();
+		set_fs(KERNEL_DS);
+		ret = sock_restore_flags(sk->sk_socket, h);
+		set_fs(old_fs);
+		if (ret)
+			return ret;
+	}
+
+	if ((h->socket.state == SS_CONNECTED) &&
+	    (h->sock.state != TCP_ESTABLISHED)) {
+		ckpt_debug("socket/sock in inconsistent state: %i/%i",
+			   h->socket.state, h->sock.state);
+		return -EINVAL;
+	} else if ((h->sock.state < TCP_ESTABLISHED) ||
+		   (h->sock.state >= TCP_MAX_STATES)) {
+		ckpt_debug("sock in invalid state: %i", h->sock.state);
+		return -EINVAL;
+	} else if ((h->socket.state < SS_FREE) ||
+		   (h->socket.state > SS_DISCONNECTING)) {
+		ckpt_debug("socket in invalid state: %i",
+			   h->socket.state);
+		return -EINVAL;
+	}
+
+	if (op == CKPT_RST)
+		return sock_cptrst_verify(h);
+	else
+		return 0;
+}
+
+static int __do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+	struct ckpt_hdr_socket *h;
+	int ret;
+
+	if (!sock->ops->checkpoint) {
+		ckpt_write_err(ctx, "TEVP", "socket: proto_ops",
+			       -ENOSYS, sock->ops, sock);
+		return -ENOSYS;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (!h)
+		return -ENOMEM;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sk, h, CKPT_CPT);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	/* part II: per socket type state */
+	ret = sock->ops->checkpoint(ctx, sock);
+	if (ret < 0)
+		goto out;
+
+	/* part III: socket buffers */
+	if ((sk->sk_state != TCP_LISTEN) && (!sock_flag(sk, SOCK_DEAD)))
+		ret = sock_defer_write_buffers(ctx, sk);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock;
+	int ret;
+
+	if (sk->sk_socket)
+		return __do_sock_checkpoint(ctx, sk);
+
+	/* Temporarily adopt this orphan socket */
+	ret = sock_create(sk->sk_family, sk->sk_type, 0, &sock);
+	if (ret < 0)
+		return ret;
+	sock_graft(sk, sock);
+
+	ret = __do_sock_checkpoint(ctx, sk);
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+	sock_release(sock);
+
+	return ret;
+}
+
+int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_sock_checkpoint(ctx, (struct sock *)ptr);
+}
+
+int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_socket *h;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_SOCKET;
+
+	h->sock_objref = checkpoint_obj(ctx, sk, CKPT_OBJ_SOCK);
+	if (h->sock_objref < 0) {
+		ret = h->sock_objref;
+		goto out;
+	}
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int sock_collect_skbs(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
+{
+	struct sk_buff_head tmpq;
+	struct sk_buff *skb;
+	int ret = 0;
+	int bytes;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &bytes);
+	if (ret < 0)
+		return ret;
+
+	skb_queue_walk(&tmpq, skb) {
+		/* Socket buffers do not maintain a ref count on their
+		 * owning sock because they're counted in sock_wmem_alloc.
+		 * So, we only need to collect sockets from the queue that
+		 * won't be collected any other way (i.e. DEAD sockets that
+		 * are hanging around only because they're waiting for us
+		 * to process their skb.
+		 */
+
+		if (!ckpt_obj_lookup(ctx, skb->sk, CKPT_OBJ_SOCK) &&
+		    sock_flag(skb->sk, SOCK_DEAD)) {
+			ret = ckpt_obj_collect(ctx, skb->sk, CKPT_OBJ_SOCK);
+			if (ret < 0)
+				break;
+		}
+	}
+
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_write_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_receive_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = ckpt_obj_collect(ctx, sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sock->ops->collect)
+		ret = sock->ops->collect(ctx, sock);
+
+	return ret;
+}
+
+static struct file *sock_alloc_attach_fd(struct socket *sock)
+{
+	struct file *file;
+	int err;
+
+	file = get_empty_filp();
+	if (!file)
+		return ERR_PTR(ENOMEM);
+
+	err = sock_attach_fd(sock, file, 0);
+	if (err < 0) {
+		put_filp(file);
+		file = ERR_PTR(err);
+	}
+
+	/* Since objhash assumes the initial reference for a socket,
+	 * we bump it here for this descriptor, unlike other places in
+	 * the socket code which assume the descriptor is the owner.
+	 */
+	sock_hold(sock->sk);
+
+	return file;
+}
+
+struct sock *do_sock_restore(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_socket *h;
+	struct socket *sock;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	/* silently clear flags, e.g. SOCK_NONBLOCK or SOCK_CLOEXEC */
+	h->sock.type &= SOCK_TYPE_MASK;
+
+	ret = sock_create(h->sock_common.family, h->sock.type, 0, &sock);
+	if (ret < 0)
+		goto err;
+
+	if (!sock->ops->restore) {
+		ckpt_debug("proto_ops lacks checkpoint: %pS\n", sock->ops);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	/*
+	 * part II: per socket type state
+	 * (also takes care of part III: socket buffer)
+	 */
+	ret = sock->ops->restore(ctx, sock, h);
+	if (ret < 0)
+		goto err;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sock->sk, h, CKPT_RST);
+	if (ret < 0)
+		goto err;
+
+	ckpt_hdr_put(ctx, h);
+	return sock->sk;
+ err:
+	ckpt_hdr_put(ctx, h);
+	sock_release(sock);
+	return ERR_PTR(ret);
+}
+
+void *restore_sock(struct ckpt_ctx *ctx)
+{
+	return do_sock_restore(ctx);
+}
+
+struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_socket *h = (struct ckpt_hdr_file_socket *)ptr;
+	struct sock *sk;
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE || ptr->f_type != CKPT_FILE_SOCKET)
+		return ERR_PTR(-EINVAL);
+
+	sk = ckpt_obj_fetch(ctx, h->sock_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk))
+		return ERR_PTR(PTR_ERR(sk));
+
+	file = sock_alloc_attach_fd(sk->sk_socket);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		return ERR_PTR(ret);
+	}
+
+	return file;
+}
diff --git a/net/socket.c b/net/socket.c
index 63c4498..0a4d539 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -140,6 +140,10 @@ static const struct file_operations socket_file_ops = {
 	.sendpage =	sock_sendpage,
 	.splice_write = generic_splice_sendpage,
 	.splice_read =	sock_splice_read,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint =   sock_file_checkpoint,
+	.collect = sock_file_collect,
+#endif
 };
 
 /*
@@ -368,7 +372,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
 	return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
+int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
diff --git a/net/unix/Makefile b/net/unix/Makefile
index b852a2b..fbff1e6 100644
--- a/net/unix/Makefile
+++ b/net/unix/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_UNIX)	+= unix.o
 
 unix-y			:= af_unix.o garbage.o
 unix-$(CONFIG_SYSCTL)	+= sysctl_net_unix.o
+unix-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fc3ebb9..b3d4f16 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -523,6 +523,9 @@ static const struct proto_ops unix_stream_ops = {
 	.recvmsg =	unix_stream_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static const struct proto_ops unix_dgram_ops = {
@@ -544,6 +547,9 @@ static const struct proto_ops unix_dgram_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static const struct proto_ops unix_seqpacket_ops = {
@@ -565,6 +571,9 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static struct proto unix_proto = {
diff --git a/net/unix/checkpoint.c b/net/unix/checkpoint.c
new file mode 100644
index 0000000..8b7cb22
--- /dev/null
+++ b/net/unix/checkpoint.c
@@ -0,0 +1,634 @@
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/fs_struct.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/user.h>
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+struct dq_join {
+	struct ckpt_ctx *ctx;
+	int src_objref;
+	int dst_objref;
+};
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	int sk_objref; /* objref of the socket these buffers belong to */
+};
+
+#define UNIX_ADDR_EMPTY(a) (a <= sizeof(short))
+
+static inline int unix_need_cwd(struct sockaddr_un *addr, unsigned long len)
+{
+	return (!UNIX_ADDR_EMPTY(len)) &&
+		addr->sun_path[0] &&
+		(addr->sun_path[0] != '/');
+}
+
+static int unix_join(struct sock *src, struct sock *dst)
+{
+	if (unix_sk(src)->peer != NULL)
+		return 0; /* We're second */
+
+	sock_hold(dst);
+	unix_sk(src)->peer = dst;
+
+	return 0;
+
+}
+
+static int unix_deferred_join(void *data)
+{
+	struct dq_join *dq = (struct dq_join *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *src;
+	struct sock *dst;
+
+	src = ckpt_obj_fetch(ctx, dq->src_objref, CKPT_OBJ_SOCK);
+	if (!src) {
+		ckpt_debug("Missing src sock ref %i\n", dq->src_objref);
+		return -EINVAL;
+	}
+
+	dst = ckpt_obj_fetch(ctx, dq->dst_objref, CKPT_OBJ_SOCK);
+	if (!src) {
+		ckpt_debug("Missing dst sock ref %i\n", dq->dst_objref);
+		return -EINVAL;
+	}
+
+	return unix_join(src, dst);
+}
+
+static int unix_defer_join(struct ckpt_ctx *ctx,
+			   int src_objref,
+			   int dst_objref)
+{
+	struct dq_join dq;
+
+	dq.ctx = ctx;
+	dq.src_objref = src_objref;
+	dq.dst_objref = dst_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_join, NULL);
+}
+
+static int unix_write_cwd(struct ckpt_ctx *ctx,
+			  struct sock *sk, const char *sockpath)
+{
+	struct path path;
+	char *buf;
+	char *fqpath;
+	int offset;
+	int len = PATH_MAX;
+	int ret = -ENOENT;
+
+	buf = kmalloc(len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	path.dentry = unix_sk(sk)->dentry;
+	path.mnt = unix_sk(sk)->mnt;
+
+	fqpath = ckpt_fill_fname(&path, &ctx->fs_mnt, buf, &len);
+	if (IS_ERR(fqpath)) {
+		ret = PTR_ERR(fqpath);
+		goto out;
+	}
+
+	offset = strlen(fqpath) - strlen(sockpath);
+	if (offset <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fqpath[offset] = '\0';
+
+	ckpt_debug("writing socket directory: %s\n", fqpath);
+	ret = ckpt_write_string(ctx, fqpath, offset + 1);
+ out:
+	kfree(buf);
+	return ret;
+}
+
+int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	struct ckpt_hdr_socket_unix *un;
+	int new;
+	int ret = -ENOMEM;
+
+	if ((sock->sk->sk_state == TCP_LISTEN) &&
+	    !skb_queue_empty(&sock->sk->sk_receive_queue)) {
+		ckpt_write_err(ctx, "TEP", "af_unix: listen with pending peers",
+			       -EBUSY, sock);
+		return -EBUSY;
+	}
+
+	un = ckpt_hdr_get_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (!un)
+		return -EINVAL;
+
+	ret = ckpt_sock_getnames(ctx, sock,
+				 (struct sockaddr *)&un->laddr, &un->laddr_len,
+				 (struct sockaddr *)&un->raddr, &un->raddr_len);
+	if (ret)
+		goto out;
+
+	if (sk->dentry && (sk->dentry->d_inode->i_nlink > 0))
+		un->flags |= CKPT_UNIX_LINKED;
+
+	un->this = ckpt_obj_lookup_add(ctx, sk, CKPT_OBJ_SOCK, &new);
+	if (un->this < 0)
+		goto out;
+
+	if (sk->peer)
+		un->peer = checkpoint_obj(ctx, sk->peer, CKPT_OBJ_SOCK);
+	else
+		un->peer = 0;
+
+	if (un->peer < 0) {
+		ret = un->peer;
+		goto out;
+	}
+
+	un->peercred_uid = sock->sk->sk_peercred.uid;
+	un->peercred_gid = sock->sk->sk_peercred.gid;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) un);
+	if (ret < 0)
+		goto out;
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len))
+		ret = unix_write_cwd(ctx, sock->sk, un->laddr.sun_path);
+ out:
+	ckpt_hdr_put(ctx, un);
+
+	return ret;
+}
+
+int unix_collect(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, sock->sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sk->peer)
+		ret = ckpt_obj_collect(ctx, sk->peer, CKPT_OBJ_SOCK);
+
+	return 0;
+}
+
+static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx,
+				    struct sockaddr *addr,
+				    unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	struct sock *sk;
+	struct msghdr msg;
+	struct kvec kvec;
+	uint8_t sock_shutdown;
+	uint8_t peer_shutdown = 0;
+	void *buf = NULL;
+	int sndbuf;
+	int len;
+	int ret = 0;
+
+	memset(&msg, 0, sizeof(msg));
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
+	if (len < 0) {
+		ret = len;
+		goto out;
+	} else if (len > SKB_MAX_ALLOC) {
+		ckpt_debug("Socket buffer too big (%i > %lu)",
+			   len, SKB_MAX_ALLOC);
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	sk = ckpt_obj_fetch(ctx, h->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ret = PTR_ERR(sk);
+		goto out;
+	}
+
+	/* If we don't have a destination or a peer and we know the
+	 * destination of this skb, then we must need to join with our
+	 * peer
+	 */
+	if (!addrlen && !unix_sk(sk)->peer) {
+		struct sock *pr;
+		pr = ckpt_obj_fetch(ctx, h->pr_objref, CKPT_OBJ_SOCK);
+		if (IS_ERR(pr)) {
+			ckpt_debug("Failed to get our peer: %li\n", PTR_ERR(pr));
+			ret = PTR_ERR(pr);
+			goto out;
+		}
+		ret = unix_join(sk, pr);
+		if (ret < 0) {
+			ckpt_debug("Failed to join: %i\n", ret);
+			goto out;
+		}
+	}
+
+	kvec.iov_len = len;
+	buf = kmalloc(len, GFP_KERNEL);
+	kvec.iov_base = buf;
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = ckpt_kread(ctx, kvec.iov_base, len);
+	if (ret < 0)
+		goto out;
+
+	msg.msg_name = addr;
+	msg.msg_namelen = addrlen;
+
+	/* If peer is shutdown, unshutdown it for this process */
+	sock_shutdown = sk->sk_shutdown;
+	sk->sk_shutdown &= ~SHUTDOWN_MASK;
+
+	/* Unshutdown peer too, if necessary */
+	if (unix_sk(sk)->peer) {
+		peer_shutdown = unix_sk(sk)->peer->sk_shutdown;
+		unix_sk(sk)->peer->sk_shutdown &= ~SHUTDOWN_MASK;
+	}
+
+	/* Make sure there's room in the send buffer */
+	sndbuf = sk->sk_sndbuf;
+	if (((sk->sk_sndbuf - atomic_read(&sk->sk_wmem_alloc)) < len) &&
+	    capable(CAP_NET_ADMIN))
+		sk->sk_sndbuf += len;
+	else
+		sk->sk_sndbuf = sysctl_wmem_max;
+
+	ret = kernel_sendmsg(sk->sk_socket, &msg, &kvec, 1, len);
+	ckpt_debug("kernel_sendmsg(%i,%i): %i\n", h->sk_objref, len, ret);
+	if ((ret > 0) && (ret != len))
+		ret = -ENOMEM;
+
+	sk->sk_sndbuf = sndbuf;
+	sk->sk_shutdown = sock_shutdown;
+	if (peer_shutdown)
+		unix_sk(sk)->peer->sk_shutdown = peer_shutdown;
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(buf);
+	return ret;
+}
+
+static int unix_read_buffers(struct ckpt_ctx *ctx,
+			     struct sockaddr *addr,
+			     unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_queue *h;
+	int ret = 0;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	for (i = 0; i < h->skb_count; i++) {
+		ret = sock_read_buffer_sendmsg(ctx, addr, addrlen);
+		ckpt_debug("read_buffer_sendmsg(%i): %i\n", i, ret);
+		if (ret < 0)
+			goto out;
+
+		if (ret > h->total_bytes) {
+			ckpt_debug("Buffers exceeded claim");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		h->total_bytes -= ret;
+		ret = 0;
+	}
+
+	ret = h->skb_count;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int unix_deferred_restore_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *sk;
+	struct sockaddr *addr = NULL;
+	unsigned int addrlen = 0;
+	int ret;
+
+	sk = ckpt_obj_fetch(ctx, dq->sk_objref, CKPT_OBJ_SOCK);
+	if (!sk) {
+		ckpt_debug("Missing sock ref %i\n", dq->sk_objref);
+		return -EINVAL;
+	}
+
+	if ((sk->sk_type == SOCK_DGRAM) && (unix_sk(sk)->addr != NULL)) {
+		addr = (struct sockaddr *)&unix_sk(sk)->addr->name;
+		addrlen = unix_sk(sk)->addr->len;
+	}
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read send buffers: %i\n", ret);
+	if (ret > 0)
+		ret = -EINVAL; /* No send buffers for UNIX sockets */
+
+	return ret;
+}
+
+static int unix_defer_restore_buffers(struct ckpt_ctx *ctx, int sk_objref)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk_objref = sk_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_restore_buffers, NULL);
+}
+
+static struct unix_address *unix_makeaddr(struct sockaddr_un *sun_addr,
+					  unsigned len)
+{
+	struct unix_address *addr;
+
+	if (len > sizeof(struct sockaddr_un))
+		return ERR_PTR(-EINVAL);
+
+	addr = kmalloc(sizeof(*addr) + len, GFP_KERNEL);
+	if (!addr)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(addr->name, sun_addr, len);
+	addr->len = len;
+	atomic_set(&addr->refcnt, 1);
+
+	return addr;
+}
+
+static int unix_restore_connected(struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_socket *h,
+				  struct ckpt_hdr_socket_unix *un,
+				  struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr *addr = NULL;
+	unsigned long flags = h->sock.flags;
+	unsigned int addrlen = 0;
+	int dead = test_bit(SOCK_DEAD, &flags);
+	int ret = 0;
+
+
+	if (un->peer == 0) {
+		/* These get propagated to the msghdr, so only set them
+		 * if we're not connected to a peer, else we'll get an error
+		 * when we sendmsg()
+		 */
+		addr = (struct sockaddr *)&un->laddr;
+		addrlen = un->laddr_len;
+	}
+
+	sk->sk_peercred.pid = task_tgid_vnr(current);
+
+	if (may_setuid(ctx->realcred->user->user_ns, un->peercred_uid) &&
+	    may_setgid(un->peercred_gid)) {
+		sk->sk_peercred.uid = un->peercred_uid;
+		sk->sk_peercred.gid = un->peercred_gid;
+	} else {
+		ckpt_debug("peercred %i:%i would require setuid",
+			   un->peercred_uid, un->peercred_gid);
+		return -EPERM;
+	}
+
+	if (!dead && (un->peer > 0)) {
+		ret = unix_defer_join(ctx, un->this, un->peer);
+		ckpt_debug("unix_defer_join: %i\n", ret);
+	}
+
+	if (!dead && !ret)
+		ret = unix_defer_restore_buffers(ctx, un->this);
+
+	return ret;
+}
+
+static int unix_unlink(const char *name)
+{
+	struct path spath;
+	struct path ppath;
+	int ret;
+
+	ret = kern_path(name, 0, &spath);
+	if (ret)
+		return ret;
+
+	ret = kern_path(name, LOOKUP_PARENT, &ppath);
+	if (ret)
+		goto out_s;
+
+	if (!spath.dentry) {
+		ckpt_debug("No dentry found for %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	if (!ppath.dentry || !ppath.dentry->d_inode) {
+		ckpt_debug("No inode for parent of %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	ret = vfs_unlink(ppath.dentry->d_inode, spath.dentry);
+ out_p:
+	path_put(&ppath);
+ out_s:
+	path_put(&spath);
+
+	return ret;
+}
+
+/* Call bind() for socket, optionally changing (temporarily) to @path first
+ * if non-NULL
+ */
+static int unix_chdir_and_bind(struct socket *sock,
+			       const char *path,
+			       struct sockaddr *addr,
+			       unsigned long addrlen)
+{
+	struct sockaddr_un *un = (struct sockaddr_un *)addr;
+	struct path cur = { .mnt = NULL, .dentry = NULL };
+	struct path dir = { .mnt = NULL, .dentry = NULL };
+	int ret;
+
+	if (path) {
+		ckpt_debug("switching to cwd %s for unix bind", path);
+
+		ret = kern_path(path, 0, &dir);
+		if (ret)
+			return ret;
+
+		ret = inode_permission(dir.dentry->d_inode,
+				       MAY_EXEC | MAY_ACCESS);
+		if (ret)
+			goto out;
+
+		write_lock(&current->fs->lock);
+		cur = current->fs->pwd;
+		current->fs->pwd = dir;
+		write_unlock(&current->fs->lock);
+	}
+
+	ret = unix_unlink(un->sun_path);
+	ckpt_debug("unlink(%s): %i\n", un->sun_path, ret);
+	if ((ret == 0) || (ret == -ENOENT))
+		ret = sock_bind(sock, addr, addrlen);
+
+	if (path) {
+		write_lock(&current->fs->lock);
+		current->fs->pwd = cur;
+		write_unlock(&current->fs->lock);
+	}
+ out:
+	if (path)
+		path_put(&dir);
+
+	return ret;
+}
+
+static int unix_fakebind(struct socket *sock,
+			 struct sockaddr_un *addr, unsigned long len)
+{
+	struct unix_address *uaddr;
+
+	uaddr = unix_makeaddr(addr, len);
+	if (IS_ERR(uaddr))
+		return PTR_ERR(uaddr);
+
+	unix_sk(sock->sk)->addr = uaddr;
+
+	return 0;
+}
+
+static int unix_restore_bind(struct ckpt_hdr_socket *h,
+			     struct ckpt_hdr_socket_unix *un,
+			     struct socket *sock,
+			     const char *path)
+{
+	struct sockaddr *addr = (struct sockaddr *)&un->laddr;
+	unsigned long len = un->laddr_len;
+	unsigned long flags = h->sock.flags;
+	int dead = test_bit(SOCK_DEAD, &flags);
+
+	if (dead)
+		return unix_fakebind(sock, &un->laddr, len);
+	else if (!un->laddr.sun_path[0])
+		return sock_bind(sock, addr, len);
+	else if (!(un->flags & CKPT_UNIX_LINKED))
+		return unix_fakebind(sock, &un->laddr, len);
+	else
+		return unix_chdir_and_bind(sock, path, addr, len);
+}
+
+/* Some easy pre-flight checks before we get underway */
+static int unix_precheck(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	struct net *net = sock_net(sock->sk);
+	unsigned long sk_flags = h->sock.flags;
+
+	if ((h->socket.state == SS_CONNECTING) ||
+	    (h->socket.state == SS_DISCONNECTING) ||
+	    (h->socket.state == SS_FREE)) {
+		ckpt_debug("AF_UNIX socket can't be SS_(DIS)CONNECTING");
+		return -EINVAL;
+	}
+
+	/* AF_UNIX overloads the backlog setting to define the maximum
+	 * queue length for DGRAM sockets.  Make sure we don't let the
+	 * caller exceed that value on restart.
+	 */
+	if ((h->sock.type == SOCK_DGRAM) &&
+	    (h->sock.backlog > net->unx.sysctl_max_dgram_qlen)) {
+		ckpt_debug("DGRAM backlog of %i exceeds system max of %i\n",
+			   h->sock.backlog, net->unx.sysctl_max_dgram_qlen);
+		return -EINVAL;
+	}
+
+	if (test_bit(SOCK_USE_WRITE_QUEUE, &sk_flags)) {
+		ckpt_debug("AF_UNIX socket has SOCK_USE_WRITE_QUEUE set");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+		 struct ckpt_hdr_socket *h)
+
+{
+	struct ckpt_hdr_socket_unix *un;
+	int ret = -EINVAL;
+	char *cwd = NULL;
+
+	ret = unix_precheck(sock, h);
+	if (ret)
+		return ret;
+
+	un = ckpt_read_obj_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (IS_ERR(un))
+		return PTR_ERR(un);
+
+	if (un->peer < 0)
+		goto out;
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len)) {
+		cwd = ckpt_read_string(ctx, PATH_MAX);
+		if (IS_ERR(cwd)) {
+			ret = PTR_ERR(cwd);
+			goto out;
+		}
+	}
+
+	if ((h->sock.state != TCP_ESTABLISHED) &&
+	    !UNIX_ADDR_EMPTY(un->laddr_len)) {
+		ret = unix_restore_bind(h, un, sock, cwd);
+		if (ret)
+			goto out;
+	}
+
+	if ((h->sock.state == TCP_ESTABLISHED) || (h->sock.state == TCP_CLOSE))
+		ret = unix_restore_connected(ctx, h, un, sock);
+	else if (h->sock.state == TCP_LISTEN)
+		ret = sock->ops->listen(sock, h->sock.backlog);
+	else
+		ckpt_debug("unsupported UNIX socket state %i\n", h->sock.state);
+ out:
+	ckpt_hdr_put(ctx, un);
+	kfree(cwd);
+	return ret;
+}
-- 
1.6.0.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v18 68/80] Add common socket helpers to unify the security hooks
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith, netdev
In-Reply-To: <1253749920-18673-1-git-send-email-orenl@librato.com>

From: Dan Smith <danms@us.ibm.com>

This moves the meat out of the bind(), getsockname(), and getpeername() syscalls
into helper functions that performs security_socket_bind() and then the
sock->ops->call().  This allows a unification of this behavior between the
syscalls and the pending socket restart logic.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Cc: netdev@vger.kernel.org
---
 include/net/sock.h |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/socket.c       |   29 ++++++-----------------------
 2 files changed, 54 insertions(+), 23 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 950409d..12530bf 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1578,6 +1578,54 @@ extern void sock_enable_timestamp(struct sock *sk, int flag);
 extern int sock_get_timestamp(struct sock *, struct timeval __user *);
 extern int sock_get_timestampns(struct sock *, struct timespec __user *);
 
+/* bind() helper shared between any callers needing to perform a bind on
+ * behalf of userspace (syscall and restart) with the security hooks.
+ */
+static inline int sock_bind(struct socket *sock,
+			    struct sockaddr *addr,
+			    int addr_len)
+{
+	int err;
+
+	err = security_socket_bind(sock, addr, addr_len);
+	if (err)
+		return err;
+	else
+		return sock->ops->bind(sock, addr, addr_len);
+}
+
+/* getname() helper shared between any callers needing to perform a getname on
+ * behalf of userspace (syscall and restart) with the security hooks.
+ */
+static inline int sock_getname(struct socket *sock,
+			       struct sockaddr *addr,
+			       int *addr_len)
+{
+	int err;
+
+	err = security_socket_getsockname(sock);
+	if (err)
+		return err;
+	else
+		return sock->ops->getname(sock, addr, addr_len, 0);
+}
+
+/* getpeer() helper shared between any callers needing to perform a getpeer on
+ * behalf of userspace (syscall and restart) with the security hooks.
+ */
+static inline int sock_getpeer(struct socket *sock,
+			       struct sockaddr *addr,
+			       int *addr_len)
+{
+	int err;
+
+	err = security_socket_getpeername(sock);
+	if (err)
+		return err;
+	else
+		return sock->ops->getname(sock, addr, addr_len, 1);
+}
+
 /* 
  *	Enable debug/info messages 
  */
diff --git a/net/socket.c b/net/socket.c
index 6d47165..63c4498 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1414,15 +1414,10 @@ SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
 	sock = sockfd_lookup_light(fd, &err, &fput_needed);
 	if (sock) {
 		err = move_addr_to_kernel(umyaddr, addrlen, (struct sockaddr *)&address);
-		if (err >= 0) {
-			err = security_socket_bind(sock,
-						   (struct sockaddr *)&address,
-						   addrlen);
-			if (!err)
-				err = sock->ops->bind(sock,
-						      (struct sockaddr *)
-						      &address, addrlen);
-		}
+		if (err >= 0)
+			err = sock_bind(sock,
+					(struct sockaddr *)&address,
+					addrlen);
 		fput_light(sock->file, fput_needed);
 	}
 	return err;
@@ -1610,11 +1605,7 @@ SYSCALL_DEFINE3(getsockname, int, fd, struct sockaddr __user *, usockaddr,
 	if (!sock)
 		goto out;
 
-	err = security_socket_getsockname(sock);
-	if (err)
-		goto out_put;
-
-	err = sock->ops->getname(sock, (struct sockaddr *)&address, &len, 0);
+	err = sock_getname(sock, (struct sockaddr *)&address, &len);
 	if (err)
 		goto out_put;
 	err = move_addr_to_user((struct sockaddr *)&address, len, usockaddr, usockaddr_len);
@@ -1639,15 +1630,7 @@ SYSCALL_DEFINE3(getpeername, int, fd, struct sockaddr __user *, usockaddr,
 
 	sock = sockfd_lookup_light(fd, &err, &fput_needed);
 	if (sock != NULL) {
-		err = security_socket_getpeername(sock);
-		if (err) {
-			fput_light(sock->file, fput_needed);
-			return err;
-		}
-
-		err =
-		    sock->ops->getname(sock, (struct sockaddr *)&address, &len,
-				       1);
+		err = sock_getpeer(sock, (struct sockaddr *)&address, &len);
 		if (!err)
 			err = move_addr_to_user((struct sockaddr *)&address, len, usockaddr,
 						usockaddr_len);
-- 
1.6.0.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: pull request: wireless-next-2.6 2009-09-23
From: David Miller @ 2009-09-23 23:24 UTC (permalink / raw)
  To: linville-2XuSBdqkA4R54TAoqtyWWQ
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20090923155339.GA14597-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

From: "John W. Linville" <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
Date: Wed, 23 Sep 2009 11:53:40 -0400

> Please accept this one last round of wireless bits for the merge window.
> They are fixes for the most part, and the teams involved are very eager
> to have these for the 2.6.32 cycle.  In particular, the b43 and ath9k
> teams have some late breakers -- yes, I have reminded them of the
> process, but they are quite insistent...
> 
> Please let me know if there are problems!

Pulled, but yes you guys need to stick to pure bug fixes at
this point, please.

Also, please put the branch name, even if it's simply 'master'
in your GIT pull urls you give me.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] af_packet: add interframe drop cmsg
From: Neil Horman @ 2009-09-23 23:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem
In-Reply-To: <4ABA8B30.9060904@gmail.com>

On Wed, Sep 23, 2009 at 10:55:12PM +0200, Eric Dumazet wrote:
> Neil Horman a écrit :
> >     Add Ancilliary data to better represent loss information
> >     
> >     I've had a few requests recently to provide more detail regarding frame loss
> >     during an AF_PACKET packet capture session.  Specifically the requestors want to
> >     see where in a packet sequence frames were lost, i.e. they want to see that 40
> >     frames were lost between frames 302 and 303 in a packet capture file.  In order
> >     to do this we need:
> >     
> >     1) The kernel to export this data to user space
> >     2) The applications to make use of it
> >     
> >     This patch addresses item (1).  It does this by doing the following:
> >     
> >     A) attaching ancilliary data to any skb enqueued to a socket recieve queue for
> >     which frames were lost between it and the previously enqueued frame.  Note I use
> >     a ring buffer with a correlator value (the skb pointer) to do this.  This was
> >     done because the skb->cb array is exhausted already for AF_PACKET
> 
> Hmm, how mmap() users can find this information ? I thought recent libpcap were
> using mmap(), in order to reduce losses :)
> 
Yeah, in some/most cases it does, but to be honest, I think any solution for
determining frame loss with sequence encoding is going to diverge between a
descriptor based solution (i.e. recvmsg), and an mmap solution is going to be
divergent.  About the only solution I could see that could be common would be
the use of some sort of marker frame getting inserted into the receive queue,
and I'm pretty certain thats going to be a hard sell.

> >     
> >     B) For any frame dequeued that has ancilliary data in the ring buffer (as
> >     determined by the correlator value), we add a cmsg structure to the msghdr that
> >     gets copied to user space, this cmsg structure is of cmsg_level AF_PACKET, and
> >     cmsg_type PACKET_GAPDATA.  It contains a u32 value which counts the number of
> >     frames lost between the reception of the frame being currently recevied and the
> >     frame most recently preceding it.  Note this creates a situation in which if we
> >     have packet loss followed immediately by a socket close operation we might miss
> >     some gap information.  This situation is covered by the use of the
> >     PACKET_AUXINFO socket option, which provides total loss stats (from which the
> >     final gap can be computed).
> >     
> >     I've tested this patch myself, and it works well.
> 
> Okay :)
> 
Thanks for the vote of confidence :).  I augmented the patch to randomly drop
frames, then wrote a applicatoin to loop on an AF_PACKET frame receive, and
compared a printk showing the in-kernel drop rates with what the user space app
recorded.

> >     
> >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> > 
> >  include/linux/if_packet.h |    2 +
> >  net/packet/af_packet.c    |   90 +++++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 91 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
> > index dea7d6b..e5d200f 100644
> > --- a/include/linux/if_packet.h
> > +++ b/include/linux/if_packet.h
> > @@ -48,11 +48,13 @@ struct sockaddr_ll
> >  #define PACKET_RESERVE			12
> >  #define PACKET_TX_RING			13
> >  #define PACKET_LOSS			14
> > +#define PACKET_GAPDATA			15
> >  
> >  struct tpacket_stats
> >  {
> >  	unsigned int	tp_packets;
> >  	unsigned int	tp_drops;
> > +	unsigned int    tp_gap;
> >  };
> >  
> >  struct tpacket_auxdata
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index d3d52c6..b74a91c 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -179,6 +179,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
> >  
> >  static void packet_flush_mclist(struct sock *sk);
> >  
> > +struct packet_gap_record {
> > +		struct sk_buff *skb;
> > +		__u32 gap;
> > +};
> > +
> >  struct packet_sock {
> >  	/* struct sock has to be the first member of packet_sock */
> >  	struct sock		sk;
> > @@ -197,6 +202,11 @@ struct packet_sock {
> >  	int			ifindex;	/* bound device		*/
> >  	__be16			num;
> >  	struct packet_mclist	*mclist;
> > +	struct packet_gap_record *gaps;
> > +	unsigned int gap_head;
> > +	unsigned int gap_tail;
> > +	unsigned int gap_list_size;
> > +
> >  #ifdef CONFIG_PACKET_MMAP
> >  	atomic_t		mapped;
> >  	enum tpacket_versions	tp_version;
> > @@ -524,6 +534,55 @@ static inline unsigned int run_filter(struct sk_buff *skb, struct sock *sk,
> >  }
> >  
> >  /*
> > + * If we've lost frames since the last time we queued one to the
> > + * sk_receive_queue, we need to record it here.
> > + * This must be called under the protection of the socket lock
> > + * to prevent racing with other softirqs and user space
> > + */
> > +static void record_packet_gap(struct sk_buff *skb, struct packet_sock *po)
> > +{
> > +	/*
> > +	 * do nothing if there is no gap
> > +	 */
> > +	if (!po->stats.tp_gap)
> > +		return;
> > +
> > +	/*
> > +	 * If there is, check the gap list tail to make sure we
> > +	 * have an open entry
> > +	 */
> > +	if (po->gaps[po->gap_tail].skb != NULL) {
> > +		if (net_ratelimit())
> > +			printk(KERN_WARNING "packet socket gap list is full!\n");
> 
> New code can use pr_warning() macro
> 
good point.

> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * We have a free entry, record it
> > +	 */
> > +	po->gaps[po->gap_tail].skb = skb;
> > +	po->gaps[po->gap_tail].gap = po->stats.tp_gap;
> > +	po->gap_tail = (po->gap_tail+1) % po->gap_list_size;
> 
> you could avoid this divide
> 
> 	if (++po->gap_tail == po->gap_list_size)
> 		po->gap_tail = 0;
> 
Yup, I can do that.

> > +	po->stats.tp_gap = 0;
> > +	return;
> > +
> > +}
> > +
> > +static __u32 check_packet_gap(struct sk_buff *skb, struct packet_sock *po)
> > +{
> > +	__u32 gap = 0;
> > +
> > +	if (po->gaps[po->gap_head].skb != skb)
> > +		return 0;
> > +
> > +	gap = po->gaps[po->gap_head].gap;
> > +	po->gaps[po->gap_head].skb = NULL;
> > +	po->gap_head = (po->gap_head + 1) % po->gap_list_size;
> 
> ditto
> 
ditto :)

> > +	return gap;
> > +}
> > +
> > +
> > +/*
> >     This function makes lazy skb cloning in hope that most of packets
> >     are discarded by BPF.
> >  
> > @@ -626,6 +685,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
> >  
> >  	spin_lock(&sk->sk_receive_queue.lock);
> >  	po->stats.tp_packets++;
> > +	record_packet_gap(skb, po);
> >  	__skb_queue_tail(&sk->sk_receive_queue, skb);
> >  	spin_unlock(&sk->sk_receive_queue.lock);
> >  	sk->sk_data_ready(sk, skb->len);
> > @@ -634,6 +694,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
> >  drop_n_acct:
> >  	spin_lock(&sk->sk_receive_queue.lock);
> >  	po->stats.tp_drops++;
> > +	po->stats.tp_gap++;
> >  	spin_unlock(&sk->sk_receive_queue.lock);
> >  
> >  drop_n_restore:
> > @@ -811,6 +872,7 @@ drop:
> >  
> >  ring_is_full:
> >  	po->stats.tp_drops++;
> > +	po->stats.tp_gap++;
> >  	spin_unlock(&sk->sk_receive_queue.lock);
> >  
> >  	sk->sk_data_ready(sk, 0);
> > @@ -1223,6 +1285,8 @@ static int packet_release(struct socket *sock)
> >  	skb_queue_purge(&sk->sk_receive_queue);
> >  	sk_refcnt_debug_release(sk);
> >  
> > +	kfree(po->gaps);
> > +
> >  	sock_put(sk);
> >  	return 0;
> >  }
> > @@ -1350,6 +1414,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol)
> >  	struct packet_sock *po;
> >  	__be16 proto = (__force __be16)protocol; /* weird, but documented */
> >  	int err;
> > +	unsigned int num_records = PAGE_SIZE/sizeof(struct packet_gap_record);
> >  
> >  	if (!capable(CAP_NET_RAW))
> >  		return -EPERM;
> > @@ -1360,6 +1425,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol)
> >  	sock->state = SS_UNCONNECTED;
> >  
> >  	err = -ENOBUFS;
> > +
> >  	sk = sk_alloc(net, PF_PACKET, GFP_KERNEL, &packet_proto);
> >  	if (sk == NULL)
> >  		goto out;
> > @@ -1374,6 +1440,19 @@ static int packet_create(struct net *net, struct socket *sock, int protocol)
> >  	sk->sk_family = PF_PACKET;
> >  	po->num = proto;
> >  
> > +	err = -ENOMEM;
> > +	po->gaps = kmalloc(sizeof(struct packet_gap_record)*num_records,
> > +				GFP_KERNEL);
> 
> kzalloc(), and no need for some following lines
> 
Will do.

> > +	if (!po->gaps)
> > +		goto out_free;
> > +	po->gap_tail = po->gap_head = 0;
> > +	po->gap_list_size = num_records;
> > +
> > +	for (num_records = 0; num_records < po->gap_list_size; num_records++) {
> > +		po->gaps[num_records].skb = NULL;
> > +		po->gaps[num_records].gap = 0;
> > +	}
> > +
> >  	sk->sk_destruct = packet_sock_destruct;
> >  	sk_refcnt_debug_inc(sk);
> >  
> > @@ -1402,6 +1481,9 @@ static int packet_create(struct net *net, struct socket *sock, int protocol)
> >  	sock_prot_inuse_add(net, &packet_proto, 1);
> >  	write_unlock_bh(&net->packet.sklist_lock);
> >  	return 0;
> > +
> > +out_free:
> > +	sk_free(sk);
> >  out:
> >  	return err;
> >  }
> > @@ -1418,6 +1500,7 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
> >  	struct sk_buff *skb;
> >  	int copied, err;
> >  	struct sockaddr_ll *sll;
> > +	__u32 gap;
> >  
> >  	err = -EINVAL;
> >  	if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT))
> > @@ -1492,10 +1575,15 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
> >  		aux.tp_mac = 0;
> >  		aux.tp_net = skb_network_offset(skb);
> >  		aux.tp_vlan_tci = skb->vlan_tci;
> > -
> 
> Please dont mix cleanups 
> 
Doh, I thought I'd removed that.  Thanks

> >  		put_cmsg(msg, SOL_PACKET, PACKET_AUXDATA, sizeof(aux), &aux);
> >  	}
> >  
> > +	lock_sock(sk);
> 
> strange locking here. this doesnt match locking used at record time.
> 
> ( spin_lock(&sk->sk_receive_queue.lock);)
> 
Not sure why AF_PACKET didn't use sock_lock_bh, there, I think that probably
requires a cleanup.  The intent was to protect the ring buffer using the socket
lock.  I think we likey need to change the existing lock usage in af_packet.c to
use sock_lock_bh in this case.

> > +	gap = check_packet_gap(skb, pkt_sk(sk));
> > +	release_sock(sk);
> > +	if (gap)
> > +		put_cmsg(msg, SOL_PACKET, PACKET_GAPDATA, sizeof(u32), &gap);
> > +
> >  	/*
> >  	 *	Free or return the buffer as appropriate. Again this
> >  	 *	hides all the races and re-entrancy issues from us.
> 
> Thanks
> 
Thanks for the notes Eric, I'll cleanup and repost in the AM.

Regards
Neil

> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox