Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net-next-2.6 PATCH v2 3/3] net_sched: implement a root container qdisc sch_mclass
From: Jarek Poplawski @ 2011-01-03 17:02 UTC (permalink / raw)
  To: John Fastabend
  Cc: davem@davemloft.net, netdev@vger.kernel.org, hadi@cyberus.ca,
	shemminger@vyatta.com, tgraf@infradead.org,
	eric.dumazet@gmail.com, bhutchings@solarflare.com,
	nhorman@tuxdriver.com
In-Reply-To: <4D2161FF.4070804@intel.com>

On Sun, Jan 02, 2011 at 09:43:27PM -0800, John Fastabend wrote:
> On 12/30/2010 3:37 PM, Jarek Poplawski wrote:
> > John Fastabend wrote:
> >> This implements a mclass 'multi-class' queueing discipline that by
> >> default creates multiple mq qdisc's one for each traffic class. Each
> >> mq qdisc then owns a range of queues per the netdev_tc_txq mappings.
> > 
> > Is it really necessary to add one more abstraction layer for this,
> > probably not most often used (or even asked by users), functionality?
> > Why mclass can't simply do these few things more instead of attaching
> > (and changing) mq?
> > 
> 
> The statistics work nicely when the mq qdisc is used. 

Well, I sometimes add leaf qdiscs only to get class stats with less
typing, too ;-)

> 
> qdisc mclass 8002: root  tc 4 map 0 1 2 3 0 1 2 3 1 1 1 1 1 1 1 1
>              queues:(0:1) (2:3) (4:5) (6:15)
>  Sent 140 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc mq 8003: parent 8002:1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc mq 8004: parent 8002:2
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc mq 8005: parent 8002:3
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc mq 8006: parent 8002:4
>  Sent 140 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc sfq 8007: parent 8005:1 limit 127p quantum 1514b
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc sfq 8008: parent 8005:2 limit 127p quantum 1514b
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> 
> The mclass gives the statistics for the interface and then statistics on the mq qdisc gives statistics for each traffic class. Also, when using the 'mq qdisc' with this abstraction other qdisc can be grafted onto the queue. For example the sch_sfq is used in the above example.

IMHO, these tc offsets and counts make simply two level hierarchy
(classes with leaf subclasses) similarly (or simpler) to other
classful qdisc which manage it all inside one module. Of course,
we could think of another way of code organization, but it should
be rather done at the beginning of schedulers design. The mq qdisc
broke the design a bit adding a fake root, but I doubt we should go
deeper unless it's necessary. Doing mclass (or something) as a more
complex alternative to mq should be enough. Why couldn't mclass graft
sch_sfq the same way as mq?

> 
> Although I am not too hung up on this use case it does seem to be a good abstraction to me. Is it strictly necessary though no and looking at the class statistics of mclass could be used to get stats per traffic class.

I am not too hung up on this either, especially if it's OK to others,
especially to DaveM ;-)

> 
> > ...
> >> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> >> index 0af57eb..723ee52 100644
> >> --- a/include/net/sch_generic.h
> >> +++ b/include/net/sch_generic.h
> >> @@ -50,6 +50,7 @@ struct Qdisc {
> >>  #define TCQ_F_INGRESS		4
> >>  #define TCQ_F_CAN_BYPASS	8
> >>  #define TCQ_F_MQROOT		16
> >> +#define TCQ_F_MQSAFE		32
> > 
> > If every other qdisc added a flag for qdiscs it likes...
> > 
> 
> then we run out of bits and get unneeded complexity. I think I will drop the MQSAFE bit completely and let user space catch this. The worst that should happen is the noop qdisc is used.

Maybe you're right. On the other hand, usually flags are added for
more general purpose and the optimal/wrong configs are the matter of
documentation.

> 
> >> @@ -709,7 +709,13 @@ static void attach_default_qdiscs(struct net_device *dev)
> >>  		dev->qdisc = txq->qdisc_sleeping;
> >>  		atomic_inc(&dev->qdisc->refcnt);
> >>  	} else {
> >> -		qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops, TC_H_ROOT);
> >> +		if (dev->num_tc)
> > 
> > Actually, where this num_tc is expected to be set? I can see it inside
> > mclass only, with unsetting on destruction, but probably I miss something.
> 
> Either through mclass as you noted or a driver could set the num_tc. One of the RFC's I sent out has ixgbe setting the num_tc when DCB was enabled.

OK, I probably missed this second possibility in the last version.

...
> >> +	/* Unwind attributes on failure */
> >> +	u8 unwnd_tc = dev->num_tc;
> >> +	u8 unwnd_map[16];
> > 
> > [TC_MAX_QUEUE] ?
> 
> Actually TC_BITMASK+1 is probably more accurate. This array maps the skb priority to a traffic class after the priority is masked with TC_BITMASK.
> 
> > 
> >> +	struct netdev_tc_txq unwnd_txq[16];
> >> +
> 
> Although unwnd_txq should be TC_MAX_QUEUE.
...
> >> +	/* Always use supplied priority mappings */
> >> +	for (i = 0; i < 16; i++) {
> > 
> > i < qopt->num_tc ?
> 
> Nope, TC_BITMASK+1 here. If we only have 4 tcs for example we still need to map all 16 priority values to a tc.

OK, anyway, all these '16' should be 'upgraded'.
 
Thanks,
Jarek P.

^ permalink raw reply

* Re: [PATCH v2 00/12] make rpc_pipefs be mountable multiple time
From: Kirill A. Shutemov @ 2011-01-03 16:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Rob Landley, Rob Landley, Trond Myklebust, J. Bruce Fields,
	Neil Brown, Pavel Emelyanov, linux-nfs, David S. Miller, netdev,
	linux-kernel
In-Reply-To: <20101231130329.GA3610@shutemov.name>

On Fri, Dec 31, 2010 at 03:03:29PM +0200, Kirill A. Shutemov wrote:
> On Thu, Dec 30, 2010 at 06:52:43AM -0600, Rob Landley wrote:
> > On 12/30/2010 05:45 AM, Kirill A. Shutemov wrote:
> > > Currently, there is no association between rpc_pipefs and mount namespace,
> > 
> > There is in that the root context doesn't need to have this mounted, and 
> > new namespaces do.  So there's an existing association between a LACK of 
> > a namespace and a different default behavior.
> >
> > My understanding (correct me if I'm wrong) is that the historical 
> > behavior is that there's only one, and it doesn't actually live anywhere 
> > in the filesystem tree.  You're adding a special location.  I'm 
> > wondering if there's any way for that location not to be special.
> 
> /var/lib/net/rpc_pipefs is default path where userspace part of NFS stack
> (gssd, idmapd) want to see rpc_pipefs
> 
> > > so I don't see simple way to restrict number of rpc_pipefs per mount
> > > namespace. Associating mount namespace with rpc_pipefs is not a good idea,
> > > I think.
> > 
> > I'm talking about associating a default rpc_pipefs instance with a 
> > namespace, which it seems to me you're already doing by emulating the 
> > legacy behavior.  Before you CLONE_NEWNS you get a magic default mount 
> > that doesn't exist in the tree.  After you CLONE_NEWNS you get something 
> > like -EINVAL unless you supply your own default.
> 
> Root namespace is special. In case of nfsroot you need rpc_pipefs before
> root available.
> 
> > (I'm actually not sure 
> > why new namespaces don't fall back to the magic global one...)
> 
> It breaks isolation. Container should not use host's rpc_pipefs without
> host's permission.
>  
> > I'm suggesting that if the user doesn't specify -o rpcmount then the 
> > default could be the first rpc_pipefs mount visible to the current 
> > process context, rather than a specific path.  Logic to do that exists 
> > in the proc/self/mounts code (which I'm reading through now...).
> 
> static int check_rpc_pipefs(struct vfsmount *mnt, void *arg)
> {
>         struct vfsmount **rpcmount = arg;
>         struct path path = {
>                 .mnt = mnt,
>                 .dentry = mnt->mnt_root,
>         };
> 
>         if (!mnt->mnt_sb)
>                 return 0;
>         if (mnt->mnt_sb->s_magic != RPCAUTH_GSSMAGIC)
>                 return 0;
> 
>         if (!path_is_under(&path, &current->fs->root))
>                 return 0;
> 
>         *rpcmount = mntget(mnt);
>         return 1;
> }
> 
> struct vfsmount *get_rpc_pipefs(const char *p)
> {
>         int error;
>         struct vfsmount *rpcmount = ERR_PTR(-EINVAL);
>         struct path path;
> 
>         if (!p) {
>                 iterate_mounts(check_rpc_pipefs, &rpcmount,
>                                 current->nsproxy->mnt_ns->root);
> 
>                 if (IS_ERR(rpcmount) && (current->nsproxy->mnt_ns ==
>                                         init_task.nsproxy->mnt_ns))
>                         return mntget(init_rpc_pipefs);
> 
>                 return rpcmount;
>         }
> 
>         error = kern_path(p, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &path);
>         if (error)
>                 return ERR_PTR(error);
> 
>         check_rpc_pipefs(path.mnt, &rpcmount);
>         path_put(&path);
> 
>         return rpcmount;
> }
> EXPORT_SYMBOL_GPL(get_rpc_pipefs);
> 
> Something like this? Patch to replace patch #10 attached.

Any comments?

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH net-next-2.6 1/2] can: add driver for Softing card
From: Kurt Van Dijck @ 2011-01-03 16:38 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4D148788.3010808-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>

On Fri, Dec 24, 2010 at 12:44:08PM +0100, Marc Kleine-Budde wrote:
> 
> >> hmmm..all stuff behind dpram is __iomem, isn't it? I think it should
> >> only be accessed with via the ioread/iowrite operators. Please check
> > I did an ioremap_nocache. Since it is unaligned, ioread/iowrite would render
> > a lot of statements.
> 
> The thing is, ioremapped mem should not be accessed directly. Instead
> ioread/iowrite should be used. The softing driver should work on non x86
> platforms, too.
> 
I use __attribute__((packed)) structs to refer to the iomemory.
To read an unaligned uint16_t, is should then use 2 readb()'s ??

I could of course turn that sequence into a macro ....

Kurt

^ permalink raw reply

* Re: [PATCH net-next-2.6 1/2] can: add driver for Softing card
From: Kurt Van Dijck @ 2011-01-03 16:28 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4D135BC3.6070707-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>

On Thu, Dec 23, 2010 at 03:25:07PM +0100, Marc Kleine-Budde wrote:
> 
> Another option is to write a threaded interrupt handler.
It seems a threaded irq is the way to go.
During the implementation, a locking issue came up.
Am I right that within a threaded irq handler, I should use spin_lock_bh()
to prevent the ndo_start_xmit() to be called (ie. prevent the softirq).
> 

Kurt

^ permalink raw reply

* [PATCH, second try] Typo in comments in include/linux/igmp.h
From: François-Xavier Le Bail @ 2011-01-03 16:03 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 922 bytes --]

[Second try, patch invalid, sorry]

Hello,

There is a typo in comments in include/linux/igmp.h:

83 #define IGMP_HOST_MEMBERSHIP_QUERY      0x11    /* From RFC1112 */
84 #define IGMP_HOST_MEMBERSHIP_REPORT     0x12    /* Ditto */
85 #define IGMP_DVMRP                      0x13    /* DVMRP routing */
86 #define IGMP_PIM                        0x14    /* PIM routing */
87 #define IGMP_TRACE                      0x15
88 #define IGMPV2_HOST_MEMBERSHIP_REPORT   0x16    /* V2 version of 0x11 */
89 #define IGMP_HOST_LEAVE_MESSAGE         0x17
90 #define IGMPV3_HOST_MEMBERSHIP_REPORT   0x22    /* V3 version of 0x11 */

The line 88 and 90 are about REPORT messages.
The IGMP_HOST_MEMBERSHIP_REPORT value is 0x12.
So the comment on line 88 must be /* V2 version of 0x12 */,
and the comment on line 90 must be /* V3 version of 0x12 */.

Here is a patch.

Thank you,
Francois-Xavier Le Bail


      

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: igmp_h.diff --]
[-- Type: text/x-diff; name="igmp_h.diff", Size: 693 bytes --]

diff -ru a/include/linux/igmp.h b/include/linux/igmp.h
--- a/include/linux/igmp.h	2010-08-27 01:47:12.000000000 +0200
+++ b/include/linux/igmp.h	2010-12-15 09:50:47.808363144 +0100
@@ -85,9 +85,9 @@
 #define IGMP_DVMRP			0x13	/* DVMRP routing */
 #define IGMP_PIM			0x14	/* PIM routing */
 #define IGMP_TRACE			0x15
-#define IGMPV2_HOST_MEMBERSHIP_REPORT	0x16	/* V2 version of 0x11 */
+#define IGMPV2_HOST_MEMBERSHIP_REPORT	0x16	/* V2 version of 0x12 */
 #define IGMP_HOST_LEAVE_MESSAGE 	0x17
-#define IGMPV3_HOST_MEMBERSHIP_REPORT	0x22	/* V3 version of 0x11 */
+#define IGMPV3_HOST_MEMBERSHIP_REPORT	0x22	/* V3 version of 0x12 */
 
 #define IGMP_MTRACE_RESP		0x1e
 #define IGMP_MTRACE			0x1f

^ permalink raw reply

* [PATCH] Typo in comments in include/linux/igmp.h
From: François-Xavier Le Bail @ 2011-01-03 15:25 UTC (permalink / raw)
  To: netdev

Hello,

There is a typo in comments in include/linux/igmp.h:

83 #define IGMP_HOST_MEMBERSHIP_QUERY      0x11    /* From RFC1112 */
84 #define IGMP_HOST_MEMBERSHIP_REPORT     0x12    /* Ditto */
85 #define IGMP_DVMRP                      0x13    /* DVMRP routing */
86 #define IGMP_PIM                        0x14    /* PIM routing */
87 #define IGMP_TRACE                      0x15
88 #define IGMPV2_HOST_MEMBERSHIP_REPORT   0x16    /* V2 version of 0x11 */
89 #define IGMP_HOST_LEAVE_MESSAGE         0x17
90 #define IGMPV3_HOST_MEMBERSHIP_REPORT   0x22    /* V3 version of 0x11 */

The line 88 and 90 are about REPORT messages.
The IGMP_HOST_MEMBERSHIP_REPORT value is 0x12.
So the comment on line 88 must be /* V2 version of 0x12 */,
and the comment on line 90 must be /* V3 version of 0x12 */.

Here is a patch.

Thank you,
Francois-Xavier Le Bail

------------------------------------

diff -ru a/include/linux/igmp.h b/include/linux/igmp.h
--- a/include/linux/igmp.h    2010-08-27 01:47:12.000000000 +0200
+++ b/include/linux/igmp.h    2010-12-15 09:50:47.808363144 +0100
@@ -85,9 +85,9 @@
 #define IGMP_DVMRP            0x13    /* DVMRP routing */
 #define IGMP_PIM            0x14    /* PIM routing */
 #define IGMP_TRACE            0x15
-#define IGMPV2_HOST_MEMBERSHIP_REPORT    0x16    /* V2 version of 0x11 */
+#define IGMPV2_HOST_MEMBERSHIP_REPORT    0x16    /* V2 version of 0x12 */
 #define IGMP_HOST_LEAVE_MESSAGE     0x17
-#define IGMPV3_HOST_MEMBERSHIP_REPORT    0x22    /* V3 version of 0x11 */
+#define IGMPV3_HOST_MEMBERSHIP_REPORT    0x22    /* V3 version of 0x12 */
 
 #define IGMP_MTRACE_RESP        0x1e
 #define IGMP_MTRACE            0x1f





      

^ permalink raw reply

* Re: [PATCH 14/15]include:media:davinci:vpss.h Typo change diable to disable.
From: Justin P. Mattock @ 2011-01-03 15:11 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: devel, linux-m68k, linux-scsi, netdev, linux-usb, linux-wireless,
	linux-kernel, ivtv-devel, spi-devel-general,
	Mauro Carvalho Chehab, linux-media
In-Reply-To: <alpine.LNX.2.00.1101031600510.26685@pobox.suse.cz>

On 01/03/2011 07:01 AM, Jiri Kosina wrote:
> On Fri, 31 Dec 2010, Mauro Carvalho Chehab wrote:
>
>> Em 30-12-2010 21:08, Justin P. Mattock escreveu:
>>> The below patch fixes a typo "diable" to "disable". Please let me know if this
>>> is correct or not.
>>>
>>> Signed-off-by: Justin P. Mattock<justinmattock@gmail.com>
>> Acked-by: Mauro Carvalho Chehab<mchehab@redhat.com>
>
> Applied.
>
>>
>> PS.: Next time, please c/c linux-media ONLY on patches related to media
>> drivers (/drivers/video and the corresponding include files). Having to
>> dig into a series of 15 patches to just actually look on 3 patches
>> is not nice.
>
> Absolutely.
>
> Justin, no kernel developer should be afraid of being CCed. But try to
> avoid really unnecessary spamming (which this was).
>

alright..

Justin P. Mattock

^ permalink raw reply

* Re: [PATCH 08/15]drivers:scsi:lpfc:lpfc_init.c Typo change diable to disable.
From: Jiri Kosina @ 2011-01-03 15:09 UTC (permalink / raw)
  To: Justin P. Mattock
  Cc: linux-m68k, linux-kernel, netdev, ivtv-devel, linux-media,
	linux-wireless, linux-scsi, spi-devel-general, devel, linux-usb
In-Reply-To: <1293750484-1161-8-git-send-email-justinmattock@gmail.com>

On Thu, 30 Dec 2010, Justin P. Mattock wrote:

> The below patch fixes a typo "diable" to "disable". Please let me know if this 
> is correct or not.
> 
> Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>

Folded patched 8, 9, 10 and 13 together and applied.

-- 
Jiri Kosina
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 01/15]arch:m68k:ifpsp060:src:fpsp.S Typo change diable to disable.
From: Jiri Kosina @ 2011-01-03 15:07 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: devel, linux-m68k, linux-scsi, netdev, linux-usb, linux-wireless,
	linux-kernel, ivtv-devel, Justin P. Mattock, spi-devel-general,
	linux-media
In-Reply-To: <AANLkTins7rj1o4rEcEFmVSA2=1yXZSfLdO000gqQP7cg@mail.gmail.com>

On Fri, 31 Dec 2010, Geert Uytterhoeven wrote:

> On Fri, Dec 31, 2010 at 00:07, Justin P. Mattock
> <justinmattock@gmail.com> wrote:
> > The below patch fixes a typo "diable" to "disable". Please let me know if this
> > is correct or not.
> >
> > Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
> 
> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Applied, thanks.

-- 
Jiri Kosina
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 07/15]drivers:net:wireless:iwlwifi Typo change diable to disable.
From: Jiri Kosina @ 2011-01-03 15:06 UTC (permalink / raw)
  To: Larry Finger
  Cc: Justin P. Mattock, linux-m68k, linux-kernel, netdev, ivtv-devel,
	linux-media, linux-wireless, linux-scsi, spi-devel-general, devel,
	linux-usb
In-Reply-To: <4D1DF7CA.8040504@lwfinger.net>

On Fri, 31 Dec 2010, Larry Finger wrote:

> On 12/30/2010 05:07 PM, Justin P. Mattock wrote:
> > The below patch fixes a typo "diable" to "disable". Please let me know if this 
> > is correct or not.
> > 
> > Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
> > 
> 
> ACKed-by: Larry Finger <Larry.Finger@lwfinger.net>

Applied, thanks.

-- 
Jiri Kosina
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 11/15]drivers:media:video:cx18:cx23418.h Typo change diable to disable.
From: Jiri Kosina @ 2011-01-03 15:04 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Justin P. Mattock, linux-m68k, linux-kernel, netdev, ivtv-devel,
	linux-media, linux-wireless, linux-scsi, spi-devel-general, devel,
	linux-usb
In-Reply-To: <4D1DAF2D.5070604@gmail.com>

On Fri, 31 Dec 2010, Mauro Carvalho Chehab wrote:

> Em 30-12-2010 21:08, Justin P. Mattock escreveu:
> > The below patch fixes a typo "diable" to "disable". Please let me know if this 
> > is correct or not.
> > 
> > Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>

Folded all three 'media' fixes into one and applied with your Ack. Thanks.

-- 
Jiri Kosina
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [RFC] net_sched: mark packet staying on queue too long
From: jamal @ 2011-01-03 15:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jarek Poplawski, David Miller, Jesper Dangaard Brouer,
	Patrick McHardy, netdev
In-Reply-To: <1294063372.2892.408.camel@edumazet-laptop>

On Mon, 2011-01-03 at 15:02 +0100, Eric Dumazet wrote:
> Le lundi 03 janvier 2011 à 08:52 -0500, jamal a écrit :

> I got fairly good results here, but admit-idly on a LAN.

Maybe just adding the randomness marking factor alone may help.
It all depends on RTT.

> Yep, maybe adding RED on each SFQ slot ;) Should be fairly cheap, and
> actually needed in case ECN is not possible and we must earlly drop
> instead.
> 

That would essentially be achieving the goal of SF Blue.

> I found BLUE very expensive in term of cache line accesses. Especially
> with double hashing.

If you can do it cheaply as you describe above, maybe should be
sufficient.

> local tcp, for a router ? Hmm... But yes I see your point.

Oh;-> thought you were talking host where my mumbling would make
more sense.

> Speaking of ECN marking, it seems we (in RED/GRED or tunnels) change skb
> data even if it is shared (can happen on ingress path)
> 
> Probably harmless, but tcpdump can show ECN bit being marked even on skb
> snapshot before ingress (and later, ECN marked) or tunnels, while it
> came unset from the wire.
> 
> Is it worth fixing this ? maybe using skb_make_writable() [once moved to
> core network from netfilter]

Typically the netdev owns the packet once it gets to that level and
it can do whatever it wants with it. But if you are seeing it on
ingress (probably using ifb?), then it makes sense to fix it.

cheers,
jamal



^ permalink raw reply

* Re: [PATCH 14/15]include:media:davinci:vpss.h Typo change diable to disable.
From: Jiri Kosina @ 2011-01-03 15:01 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Justin P. Mattock, linux-m68k, linux-kernel, netdev, ivtv-devel,
	linux-media, linux-wireless, linux-scsi, spi-devel-general, devel,
	linux-usb
In-Reply-To: <4D1DAFF5.3090108@gmail.com>

On Fri, 31 Dec 2010, Mauro Carvalho Chehab wrote:

> Em 30-12-2010 21:08, Justin P. Mattock escreveu:
> > The below patch fixes a typo "diable" to "disable". Please let me know if this 
> > is correct or not.
> > 
> > Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>

Applied.

> 
> PS.: Next time, please c/c linux-media ONLY on patches related to media
> drivers (/drivers/video and the corresponding include files). Having to
> dig into a series of 15 patches to just actually look on 3 patches 
> is not nice.

Absolutely.

Justin, no kernel developer should be afraid of being CCed. But try to 
avoid really unnecessary spamming (which this was).

-- 
Jiri Kosina
SUSE Labs, Novell Inc.

^ permalink raw reply

* [PATCH] new UDPCP Communication Protocol
From: stefani @ 2011-01-03 14:34 UTC (permalink / raw)
  To: linux-kernel, akpm, davem, netdev, eric.dumazet, shemminger, jj,
	daniel.baluta
  Cc: stefani

From: Stefani Seibold <stefani@seibold.net>

Changelog:
31.12.2010 first proposal
01.01.2011 code cleanup and fixes suggest by Eric Dumazet
02.01.2011 kick away UDP-Lite support
           change spin_lock_irq into spin_lock_bh
	   faster udpcp_release_sock
	   base is now linux-next
02.01.2011 fix camel style
           fix coding style
	   fix types in comments
	   add per socket max. connection limit (pevents against abuse)
	   make udpcp adjustable through /proc/sys/net/ipv4/udpcp_
03.01.2011 remove version info message
           add Documentation/networking/udpcp.txt API description

UDPCP is a communication protocol specified by the Open Base Station
Architecture Initiative Special Interest Group (OBSAI SIG). The
protocol is based on UDP and is designed to meet the needs of "Mobile
Communcation Base Station" internal communications. It is widely used by
the major networks infrastructure supplier.

The UDPCP communication service supports the following features:

-Connectionless communication for serial mode data transfer
-Acknowledged and unacknowledged transfer modes
-Retransmissions Algorithm
-Checksum Algorithm using Adler32
-Fragmentation of long messages (disassembly/reassembly) to match to the MTU
 during transport:
-Broadcasting and multicasting messages to multiple peers in unacknowledged
  transfer mode

UDPCP supports application level messages up to 64 KBytes (limited by 16-bit
packet data length field). Messages that are longer than the MTU will be
fragmented to the MTU.

UDPCP provides a reliable transport service that will perform message
retransmissions in case transport failures occur.

A documentation about the UDPCP protocol can be found here:

http://read.pudn.com/downloads76/doc/project/283718/OBSAI/OBSAI/RP1_V2.0.PDF

The code is also a nice example how to implement a UDP based protocol as
a kernel socket modules.

Due the nature of UDPCP which has no sliding windows support, the latency has
a huge impact. The perfomance increase by implementing as a kernel module is
about the factor 10.

Implementing it in User Space is to slow, due the context switches. Also
the net/sunrpc  approach in the kernel is not faster due the using of kernel
threads which are not better than user space (okay, a little bit because not
switching the MMU).

Handling the UDPCP into the data_ready() bh function is much faster:
- No context switch
- Assembly Multi-Fragment Message is very efficient using skb buffer chaining.
- Immediately handling an ack or data message save a lot of latency
- Less memory consuming

The implementation is now clean. There are no side effects to the network
subsystems so i ask for merge it into linux-next.

The patch is against linux next-20101231

- Stefani

Signed-off-by: Stefani Seibold <stefani@seibold.net>
---
 Documentation/networking/udpcp.txt |   82 +
 include/linux/socket.h             |    9 +-
 include/net/udp.h                  |    1 +
 include/net/udpcp.h                |   47 +
 net/Kconfig                        |    1 +
 net/Makefile                       |    1 +
 net/ipv4/ip_output.c               |    2 +
 net/ipv4/ip_sockglue.c             |    2 +
 net/udpcp/Kconfig                  |   34 +
 net/udpcp/Makefile                 |    5 +
 net/udpcp/udpcp.c                  | 2887 ++++++++++++++++++++++++++++++++++++
 11 files changed, 3068 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/networking/udpcp.txt
 create mode 100644 include/net/udpcp.h
 create mode 100644 net/udpcp/Kconfig
 create mode 100644 net/udpcp/Makefile
 create mode 100644 net/udpcp/udpcp.c

diff --git a/Documentation/networking/udpcp.txt b/Documentation/networking/udpcp.txt
new file mode 100644
index 0000000..c850218
--- /dev/null
+++ b/Documentation/networking/udpcp.txt
@@ -0,0 +1,82 @@
+UDPCP socket interface programming manual
+-----------------------------------------
+
+The socket interface is a derivate of the UDP sockets. All setsockopt(),
+getsockopt() and ioctl() kernel system calls  which are valid for UDP
+sockets should work on UDPCP sockets. There are some extensions to the
+sockopt and ioctl interface for the UDPCP sockets.
+
+Include the C header file <net/udpcp.h> to use the UDPCP socket options
+and ioctl calls.
+
+A UDPCP can be opened with socket(PF_INET, SOCK_DGRAM, PF_UDPCP). All
+operation which are valid for UDP sockets can also performed with UDPCP
+sockets.
+
+sockopt interface
+-----------------
+
+The level parameter for the UDPCP socket is SOL_UDPCP, where the
+following options are defined:
+
+- UDPCP_OPT_TRANSFER_MODE
+  Set default transfer mode. The optval is one of the following:
+   UDPCP_NOACK: no ACK for the transmitted message is requiered
+   UDPCP_ACK: a ACK for each transmitted message fragment is requiered
+   UDPCP_SINGLE_ACK: only a ACK for the last transmitted message fragment
+   is requiered
+
+- UDPCP_OPT_CHECKSUM_MODE
+  Set the default checksum mode. The optval is one of the following:
+   UDPCP_NOCHECKSUM: no checksum for the transmitted message is required
+   UDPCP_CHECKSUM: a checksum test for the transmitted message is required
+
+- UDPCP_OPT_TX_TIMEOUT
+  The timeout for a awaited ACK in milliseconds.
+  The optval should between >= 1 and max. UDPCP_MAX_WAIT_SEC * 1000
+
+- UDPCP_OPT_RX_TIMEOUT
+  Timeout for a outstanding incoming message fragment in milliseconds.
+  The optval should between >= 1 and max. UDPCP_MAX_WAIT_SEC * 1000
+
+- UDPCP_OPT_MAXTRY
+  The number of tries to send a message fragment.
+  The optval should between >= 1 and <= 10
+
+- UDPCP_OPT_OUTSTANDING_ACKS
+  The number of outstanding acks.
+  The optval should between >=1 and <= 255
+
+All optlen parameters are int's. Therefor the optlen should be sizeof(optlen).
+
+The values UDPCP_NOACK, UDPCP_ACK, UDPCP_SINGLE_ACK, UDPCP_NOCHECKSUM
+and UDPCP_CHECKSUM can also passed as control message with sendmsg(). For
+details look at the manual page for sendmsg().
+
+ioctl interface
+---------------
+
+For UDPCP sockets there are the following request commands defined:
+
+- UDPCP_IOCTL_GET_STATISTICS
+  This command returns the statistics of the socket in a struct
+  udpcp_statistics. The address of this struct must be passed as third
+  argument.
+
+- UDPCP_IOCTL_RESET_STATISTICS
+  This command resets the statistics of the socket
+
+- UDPCP_IOCTL_SYNC 
+  This command waits until all message fragments are transmitted. If the
+  third argument is not zero, this is the max. timeout value in
+  milliseconds, otherwise this call can block indefinitely.
+
+sysctl interface
+----------------
+
+/proc/sys/net/ipv4/udpcp/udpcp_max_connections
+  Maximum UDPCP connections per socket
+
+/proc/sys/net/ipv4/udpcp/udpcp_debug
+  kernel lock debug messages enabled or not
+
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 2dccbeb..2e9157c 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -171,7 +171,7 @@ struct ucred {
 #define AF_DECnet	12	/* Reserved for DECnet project	*/
 #define AF_NETBEUI	13	/* Reserved for 802.2LLC project*/
 #define AF_SECURITY	14	/* Security callback pseudo AF */
-#define AF_KEY		15      /* PF_KEY key management API */
+#define AF_KEY		15	/* PF_KEY key management API */
 #define AF_NETLINK	16
 #define AF_ROUTE	AF_NETLINK /* Alias to emulate 4.4BSD */
 #define AF_PACKET	17	/* Packet family		*/
@@ -194,7 +194,8 @@ struct ucred {
 #define AF_IEEE802154	36	/* IEEE802154 sockets		*/
 #define AF_CAIF		37	/* CAIF sockets			*/
 #define AF_ALG		38	/* Algorithm sockets		*/
-#define AF_MAX		39	/* For now.. */
+#define AF_UDPCP	39	/* UDPCP sockets		*/
+#define AF_MAX		40	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -204,7 +205,7 @@ struct ucred {
 #define PF_AX25		AF_AX25
 #define PF_IPX		AF_IPX
 #define PF_APPLETALK	AF_APPLETALK
-#define	PF_NETROM	AF_NETROM
+#define PF_NETROM	AF_NETROM
 #define PF_BRIDGE	AF_BRIDGE
 #define PF_ATMPVC	AF_ATMPVC
 #define PF_X25		AF_X25
@@ -236,6 +237,7 @@ struct ucred {
 #define PF_IEEE802154	AF_IEEE802154
 #define PF_CAIF		AF_CAIF
 #define PF_ALG		AF_ALG
+#define PF_UDPCP	AF_UDPCP
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -310,6 +312,7 @@ struct ucred {
 #define SOL_IUCV	277
 #define SOL_CAIF	278
 #define SOL_ALG		279
+#define SOL_UDPCP	280
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/include/net/udp.h b/include/net/udp.h
index bb967dd..82c95a7 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -47,6 +47,7 @@ struct udp_skb_cb {
 	} header;
 	__u16		cscov;
 	__u8		partial_cov;
+	__u8            udpcp_flag;
 };
 #define UDP_SKB_CB(__skb)	((struct udp_skb_cb *)((__skb)->cb))
 
diff --git a/include/net/udpcp.h b/include/net/udpcp.h
new file mode 100644
index 0000000..0745b15
--- /dev/null
+++ b/include/net/udpcp.h
@@ -0,0 +1,47 @@
+/* Definitions for UDPCP sockets. */
+
+#ifndef __LINUX_IF_UDPCP
+#define __LINUX_IF_UDPCP
+
+#include "linux/ioctl.h"
+
+#define UDPCP_MAX_MSGSIZE	65487
+
+#define UDPCP_MAX_WAIT_SEC	60
+
+#define UDPCP_OPT_TRANSFER_MODE		0
+#define UDPCP_OPT_CHECKSUM_MODE		1
+#define UDPCP_OPT_TX_TIMEOUT		2
+#define UDPCP_OPT_RX_TIMEOUT		3
+#define UDPCP_OPT_MAXTRY		4
+#define UDPCP_OPT_OUTSTANDING_ACKS	5
+
+#define UDPCP_NOACK		0
+#define UDPCP_ACK		1
+#define UDPCP_SINGLE_ACK	2
+#define UDPCP_NOCHECKSUM	3
+#define UDPCP_CHECKSUM		4
+
+#define UDPCP_IOC_MAGIC  251
+
+#define UDPCP_IOCTL_GET_STATISTICS \
+	_IOR(UDPCP_IOC_MAGIC, 0x01, struct udpcp_statistics *)
+#define UDPCP_IOCTL_RESET_STATISTICS \
+	_IO(UDPCP_IOC_MAGIC, 0x02)
+#define UDPCP_IOCTL_SYNC \
+	_IOR(UDPCP_IOC_MAGIC, 0x03, unsigned long)
+
+struct udpcp_statistics {
+	unsigned int txMsgs;		/* Num of transmitted messages */
+	unsigned int rxMsgs;		/* Num of received messages */
+	unsigned int txNodes;		/* Num of transmitter nodes */
+	unsigned int rxNodes;		/* Num of receiver nodes */
+	unsigned int txTimeout;		/* Num of unsuccessful transmissions */
+	unsigned int rxTimeout;		/* Num of partial message receptions */
+	unsigned int txRetries;		/* Num of resends */
+	unsigned int rxDiscardedFrags;	/* Num of discarded fragments */
+	unsigned int crcErrors;		/* Num of crc errors detected */
+};
+
+#endif
+
diff --git a/net/Kconfig b/net/Kconfig
index 7284062..4b3b619 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -302,6 +302,7 @@ source "net/rfkill/Kconfig"
 source "net/9p/Kconfig"
 source "net/caif/Kconfig"
 source "net/ceph/Kconfig"
+source "net/udpcp/Kconfig"
 
 
 endif   # if NET
diff --git a/net/Makefile b/net/Makefile
index a3330eb..388a582 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -70,3 +70,4 @@ obj-$(CONFIG_WIMAX)		+= wimax/
 obj-$(CONFIG_DNS_RESOLVER)	+= dns_resolver/
 obj-$(CONFIG_CEPH_LIB)		+= ceph/
 obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
+obj-$(CONFIG_UDPCP)		+= udpcp/
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04c7b3b..41f9276 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1084,6 +1084,7 @@ error:
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	return err;
 }
+EXPORT_SYMBOL(ip_append_data);
 
 ssize_t	ip_append_page(struct sock *sk, struct page *page,
 		       int offset, size_t size, int flags)
@@ -1340,6 +1341,7 @@ error:
 	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 	goto out;
 }
+EXPORT_SYMBOL(ip_push_pending_frames);
 
 /*
  *	Throw away all pending data on the socket.
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 3948c86..310369c 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -226,6 +226,7 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc)
 	}
 	return 0;
 }
+EXPORT_SYMBOL(ip_cmsg_send);
 
 
 /* Special input handler for packets caught by router alert option.
@@ -369,6 +370,7 @@ void ip_local_error(struct sock *sk, int err, __be32 daddr, __be16 port, u32 inf
 	if (sock_queue_err_skb(sk, skb))
 		kfree_skb(skb);
 }
+EXPORT_SYMBOL(ip_local_error);
 
 /*
  *	Handle MSG_ERRQUEUE
diff --git a/net/udpcp/Kconfig b/net/udpcp/Kconfig
new file mode 100644
index 0000000..a58c1b0
--- /dev/null
+++ b/net/udpcp/Kconfig
@@ -0,0 +1,34 @@
+#
+# UDPCP protocol
+#
+
+config UDPCP
+	tristate "UDPCP Communication Protocol"
+	depends on INET
+	---help---
+	  UDPCP is a communication protocol specified by the Open Base Station
+	  Architecture Initiative Special Interest Group (OBSAI SIG). The
+	  protocol is based on UDP and is designed to meet the needs of "Mobile
+	  Communcation Base Station" internal communications.
+
+	  The UDPCP communication service supports the following features:
+
+          -Connectionless communication for serial mode data transfer
+          -Acknowledged and unacknowledged transfer modes
+          -Retransmissions Algorithm
+          -Checksum Algorithm using Adler32
+          -Fragmentation of long messages (disassembly/reassembly) to
+           match to the MTU during transport:
+          -Broadcasting and multicasting messages to multiple peers in
+           unacknowledged transfer mode
+
+          UDPCP supports application level messages up to 64 KBytes (limited
+          by 16-bit packet data length field). Messages that are longer than the
+          MTU will be fragmented to the MTU.
+
+          UDPCP provides a reliable transport service that will perform message
+          retransmissions in case transport failures occur.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called udpcp.
+
diff --git a/net/udpcp/Makefile b/net/udpcp/Makefile
new file mode 100644
index 0000000..37f87c5
--- /dev/null
+++ b/net/udpcp/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for UDPCP support code.
+#
+
+obj-$(CONFIG_UDPCP) += udpcp.o
diff --git a/net/udpcp/udpcp.c b/net/udpcp/udpcp.c
new file mode 100644
index 0000000..5475000
--- /dev/null
+++ b/net/udpcp/udpcp.c
@@ -0,0 +1,2887 @@
+/*
+ * UDPCP communication protocol
+ *
+ * Copyright (C) 2010 Stefani Seibold <stefani@seibold.net>
+ * in order of NSN Ulm/Germany
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ */
+
+#include <net/xfrm.h>
+#include <net/protocol.h>
+#include <net/ip.h>
+#include <net/udp.h>
+#include <net/inet_common.h>
+#include <linux/zutil.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/spinlock.h>
+#include <linux/errqueue.h>
+#include <linux/atomic.h>
+
+#include <net/udpcp.h>
+
+/*
+ * UDPCP Protocol default parameters
+ */
+#define UDPCP_TX_TIMEOUT	100	/* milliseconds */
+#define UDPCP_RX_TIMEOUT	1000	/* milliseconds */
+#define UDPCP_TX_MAXTRY		5
+#define UDPCP_OUTSTANDING_ACKS	1
+
+/*
+ * UDPCP Protocol definitions
+ */
+#define UDPCP_MSG_TYPE_BIT		14
+#define UDPCP_PROTOCOL_VERSION_BIT	11
+#define UDPCP_NO_ACK_BIT		10
+#define UDPCP_CHECKSUM_BIT		9
+#define UDPCP_SINGLE_ACK_BIT		8
+#define UDPCP_DUPLICATE_BIT		7
+
+#define UDPCP_MSG_TYPE_MASK		(3 << UDPCP_MSG_TYPE_BIT)
+#define UDPCP_PROTOCOL_MASK		(7 << UDPCP_PROTOCOL_VERSION_BIT)
+
+#define UDPCP_MSG_TYPE_DATA		(1 << UDPCP_MSG_TYPE_BIT)
+#define UDPCP_MSG_TYPE_ACK		(2 << UDPCP_MSG_TYPE_BIT)
+#define UDPCP_PROTOCOL_VERSION_2	(2 << UDPCP_PROTOCOL_VERSION_BIT)
+
+#define UDPCP_NO_ACK_FLAG		(1 << UDPCP_NO_ACK_BIT)
+#define UDPCP_CHECKSUM_FLAG		(1 << UDPCP_CHECKSUM_BIT)
+#define UDPCP_SINGLE_ACK_FLAG		(1 << UDPCP_SINGLE_ACK_BIT)
+#define UDPCP_DUPLICATE_FLAG		(1 << UDPCP_DUPLICATE_BIT)
+
+/*
+ * helper macros
+ */
+#define list_to_udpcpdest(d) container_of(d, struct udpcp_dest, list)
+#define list_to_udpcpsock(d) container_of(d, struct udpcp_sock, udpcplist)
+
+#define UDPCP_HDRSIZE	(sizeof(struct udpcphdr)-sizeof(struct udphdr))
+
+#define RX_NODE	1
+#define TX_NODE	2
+
+/*
+ * name of the /proc entry
+ */
+#define UDPCP_PROC	"driver/udpcp"
+
+/*
+ * UDPCP message header
+ */
+struct udpcphdr {
+	struct udphdr udphdr;
+	__be32 chksum;
+	__be16 msginfo;
+	u8 fragamount;
+	u8 fragnum;
+	__be16 msgid;
+	__be16 length;
+};
+
+/*
+ * UDPCP destination descriptor
+ *
+ * For each communication address an individual destination descriptor will
+ * be create.
+ *
+ * The fields has the following meanings:
+ *
+ * list:		link list: part of udpcp_sock.destlist
+ * xmit:		messages fragments to be transmit
+ * tx_time:		timestamp of the last transmitted message fragment
+ * rx_time:		timestamp ot the last received message fragment
+ * tx_timeout:		statistic use only: number of transmit timeout
+ * rx_timeout:		statistic use only: number of receive timeout
+ * tx_retries:		statistic use only: number of transmit retries
+ * rx_discarded_frags:	statistic use only: number of discarded messages
+ * xmit_wait:		message fragment which is waiting for an ACK
+ * xmit_last:		last fragment transmitted
+ * recv_msg:		first fragment of the received message
+ * recv_last:		last fragment of the received message
+ * lastmsg:		last messages fragment header received
+ * ipc:			linux internal ipc cookie
+ * fl:			flow/routing information
+ * rt:			routing entry currently used for this destination
+ * addr:		ipv4 destination address
+ * port:		destination port number
+ * msgid:		current message id for outgoing data messages
+ * use_flag:		statistic use only: flag for dest using TX and/or RX
+ * insync:		flag for protocol synchronization
+ * ackmode;		ack mode for the current assembled message
+ * chkmode;		checksum mode for the current assembled message
+ * try:			current number of retries xmit_wait message
+ * acks:		number of outstandig ack's
+ */
+struct udpcp_dest {
+	struct list_head list;
+	struct sk_buff_head xmit;
+	unsigned long tx_time;
+	unsigned long rx_time;
+	u32 tx_timeout;
+	u32 rx_timeout;
+	u32 tx_retries;
+	u32 rx_discarded_frags;
+	struct sk_buff *xmit_wait;
+	struct sk_buff *xmit_last;
+	struct sk_buff *recv_msg;
+	struct sk_buff *recv_last;
+	struct udpcphdr lastmsg;
+	struct ipcm_cookie ipc;
+	struct flowi fl;
+	struct rtable *rt;
+	__be32 addr;
+	__be16 port;
+	u16 msgid;
+	u8 use_flag;
+	u8 insync;
+	u8 ackmode;
+	u8 chkmode;
+	u8 try;
+	u8 acks;
+};
+
+/*
+ * UDPCP socket descriptor
+ *
+ * For each opened socket individual socket descriptor will
+ * be created
+ *
+ * The fields has the following meanings:
+ *
+ * udpsock:		UDP socket has to be the first member of udpcp_sock
+ * assembly:		messages fragments currently assembled
+ * assembly_len:	current length of the assembled message
+ * assembly_dest:	current destination assembled
+ * wq:			wait queue for UDPCP_IOCTL_SYNC
+ * destlist:		head of destination descriptors link list
+ * udpcplist:		link list: part of udpcp_list
+ * timer:		timeout handler
+ * stat:		statistics for this socket
+ * pending:		number of pending messages fragment in the queues
+ * tx_timeout:		transmit timeout in jiffies
+ * rx_timeout:		receive timeout in jiffies
+ * udp_data_ready:	original data_ready handler for this socket
+ * ackmode:		default ack mode
+ * chkmode:		default checksum mode
+ * maxtry:		max. number of resends
+ * acks:		max. number of outstandig ack's
+ * timeout:		flag for unhandled timeout
+ */
+struct udpcp_sock {
+	struct udp_sock udpsock;
+	struct sk_buff_head assembly;
+	u32 assembly_len;
+	struct udpcp_dest *assembly_dest;
+	wait_queue_head_t wq;
+	struct list_head destlist;
+	struct list_head udpcplist;
+	struct timer_list timer;
+	struct udpcp_statistics stat;
+	u32 pending;
+	unsigned long tx_timeout;
+	unsigned long rx_timeout;
+	u32 connections;
+	void (*udp_data_ready) (struct sock *sk, int bytes);
+	u8 ackmode;
+	u8 chkmode;
+	u8 maxtry;
+	u8 acks;
+	u8 timeout;
+};
+
+/* head of struct udpcp_sock.udpcplist link list */
+static struct list_head udpcp_list;
+
+/* spinlock for race free access to the static variables */
+static spinlock_t udpcp_lock;
+
+/* debug flag, set != 0 to enable debug */
+static int udpcp_max_connections = 64;
+
+/* /proc/sys/net/ipv4/udpcp_* table */
+static struct ctl_table_header *udpcp_ctl_table;
+
+/* debug flag, set != 0 to enable debug */
+static int debug;
+
+/* overall UDPCP statistics */
+static atomic_t udpcp_tx_msgs;
+static atomic_t udpcp_rx_msgs;
+static atomic_t udpcp_tx_nodes;
+static atomic_t udpcp_rx_nodes;
+static atomic_t udpcp_tx_timeout;
+static atomic_t udpcp_rx_timeout;
+static atomic_t udpcp_tx_retries;
+static atomic_t udpcp_rx_discarded_frags;
+static atomic_t udpcp_crc_errors;
+
+module_param(debug, int, 0);
+MODULE_PARM_DESC(debug, "Debug enabled or not");
+
+module_param(udpcp_max_connections, int, 0);
+MODULE_PARM_DESC(udpcp_max_connections, "maximum connections per sockets");
+
+static int zero;
+
+static struct ctl_table ipv4_udpcp_table[] = {
+	{
+		.procname	= "udpcp_max_connections",
+		.data		= &udpcp_max_connections,
+		.maxlen		= sizeof(udpcp_max_connections),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero
+	},
+	{
+		.procname	= "udpcp_debug",
+		.data		= &debug,
+		.maxlen		= sizeof(debug),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero
+	},
+	{ }
+};
+
+#ifdef CONFIG_PROC_FS
+/*
+ * Handle /proc/driver/udpcp
+ *
+ * Show the statistics information
+ */
+static int udpcp_proc(char *page, char **start, off_t off, int count, int *eof,
+		      void *data)
+{
+	int len;
+
+	len = snprintf(page, count,
+		       "txMsgs:          %u\n"
+		       "rxMsgs:          %u\n"
+		       "txNodes:         %u\n"
+		       "rxNodes:         %u\n"
+		       "txTimeout:       %u\n"
+		       "rxTimeout:       %u\n"
+		       "txRetries:       %u\n"
+		       "rxDiscaredFrags: %u\n"
+		       "crcErrors:       %u\n",
+			atomic_read(&udpcp_tx_msgs),
+			atomic_read(&udpcp_rx_msgs),
+			atomic_read(&udpcp_tx_nodes),
+			atomic_read(&udpcp_rx_nodes),
+			atomic_read(&udpcp_tx_timeout),
+			atomic_read(&udpcp_rx_timeout),
+			atomic_read(&udpcp_tx_retries),
+			atomic_read(&udpcp_rx_discarded_frags),
+			atomic_read(&udpcp_crc_errors)
+		);
+
+	if (len <= off)
+		return 0;
+
+	len -= off;
+
+	if (len > count)
+		return count;
+
+	return len;
+}
+#endif
+
+/*
+ * Helper for the UDPCP header from a socket buffer
+ */
+static inline struct udpcphdr *udpcp_hdr(const struct sk_buff *skb)
+{
+	return (struct udpcphdr *)skb_transport_header(skb);
+}
+
+/*
+ * Helper for conversion a basic socket into a UDPCP socket
+ */
+static inline struct udpcp_sock *udpcp_sk(const struct sock *sk)
+{
+	return (struct udpcp_sock *)sk;
+}
+
+/*
+ * Dump the transport data of a socket buffer
+ */
+static inline void dump_data(struct sk_buff *skb, unsigned int max)
+{
+	unsigned int i;
+	unsigned char *data;
+	int data_len;
+
+	data = skb_transport_header(skb) + sizeof(struct udpcphdr);
+	data_len = skb_tail_pointer(skb) - data;
+
+	pr_debug(" data: ");
+
+	if (!data_len) {
+		pr_cont("<none>\n");
+		return;
+	}
+
+	if (max > data_len)
+		max = data_len;
+
+	for (i = 0; i < max; i++)
+		pr_cont("%02x ", data[i]);
+
+	if (data_len > max)
+		pr_cont("...");
+	pr_cont("\n");
+}
+
+/*
+ * Dump and decode a msginfo value
+ */
+static inline void dump_msginfo(u16 msginfo)
+{
+	pr_debug(" msginfo:0x%04x (", msginfo);
+
+	pr_cont("PCKT:");
+	switch (msginfo & UDPCP_MSG_TYPE_MASK) {
+	case UDPCP_MSG_TYPE_DATA:
+		pr_cont("DATA");
+		break;
+	case UDPCP_MSG_TYPE_ACK:
+		pr_cont("ACK");
+		break;
+	default:
+		pr_cont("UNKNOWN");
+		break;
+	}
+	pr_cont(" VER:%d",
+	       (msginfo & UDPCP_PROTOCOL_MASK) >> UDPCP_PROTOCOL_VERSION_BIT);
+
+	if (msginfo & UDPCP_NO_ACK_FLAG)
+		pr_cont(" NO_ACK");
+	if (msginfo & UDPCP_CHECKSUM_FLAG)
+		pr_cont(" CHECKSUM");
+	if (msginfo & UDPCP_SINGLE_ACK_FLAG)
+		pr_cont(" SINGLE_ACK");
+	if (msginfo & UDPCP_DUPLICATE_FLAG)
+		pr_cont(" DUPLICATE");
+	pr_cont(")\n");
+}
+
+/*
+ * Dump and decode a UDPCP message fragment
+ */
+static void dump_msg(const char *action, struct sk_buff *skb, __be32 saddr,
+		     __be32 daddr)
+{
+	struct udpcphdr *uh = udpcp_hdr(skb);
+
+	pr_debug("udpcp: %s (%lu)\n", action, jiffies);
+
+	pr_debug(" src:0x%08x:%d dst:0x%08x:%d fraglen:%d\n",
+	       saddr, uh->udphdr.source, daddr, uh->udphdr.dest, skb->len);
+
+	pr_debug(" fragamount:%u fragnum:%u msgid:%u%s"
+		 " length:%u checksum:0x%08x\n",
+	       uh->fragamount, uh->fragnum, ntohs(uh->msgid),
+	       (!uh->msgid) ? "(Sync)" : "", ntohs(uh->length),
+	       ntohl(uh->chksum)
+	    );
+
+	dump_msginfo(ntohs(uh->msginfo));
+	dump_data(skb, 16);
+}
+
+/*
+ * Create a new destination descriptor for the given IPV4 address and port
+ */
+static struct udpcp_dest *new_dest(struct sock *sk, __be32 addr, __be16 port)
+{
+	struct udpcp_dest *dest;
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	if (usk->connections >= udpcp_max_connections)
+		return NULL;
+
+	dest = kzalloc(sizeof(*dest), sk->sk_allocation);
+
+	if (dest) {
+		usk->connections++;
+		skb_queue_head_init(&dest->xmit);
+		dest->addr = addr;
+		dest->port = port;
+		dest->ackmode = UDPCP_ACK;
+		list_add_tail(&dest->list, &usk->destlist);
+	}
+
+	return dest;
+}
+
+/*
+ * Lookup for a destination descriptor for the given IPV4 address and port
+ */
+static struct udpcp_dest *__find_dest(struct sock *sk, __be32 addr, __be16 port)
+{
+	struct udpcp_dest *dest;
+	struct list_head *p;
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	list_for_each(p, &usk->destlist) {
+		dest = list_to_udpcpdest(p);
+
+		if ((dest->addr == addr) && (dest->port == port))
+			return dest;
+	}
+	return NULL;
+}
+
+/*
+ * Lookup for a destination descriptor and create a new one if no
+ * descriptor was found.
+ */
+static struct udpcp_dest *find_dest(struct sock *sk, __be32 addr, __be16 port)
+{
+	struct udpcp_dest *dest = __find_dest(sk, addr, port);
+
+	if (!dest)
+		dest = new_dest(sk, addr, port);
+
+	return dest;
+}
+
+/*
+ * Calculate udp checksum, mostly stolen from udp stack
+ */
+static void udpcp_do_csum(struct sock *sk, struct sk_buff *skb,
+			  struct udpcp_dest *dest)
+{
+	struct flowi *fl = &dest->fl;
+	struct udphdr *uh = udp_hdr(skb);
+	__wsum csum = 0;
+	unsigned short len = ntohs(uh->len);
+
+	if (sk->sk_no_check == UDP_CSUM_NOXMIT) {
+		skb->ip_summed = CHECKSUM_NONE;
+		return;
+	}
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		/* UDP hardware csum */
+		skb->csum_start = skb_transport_header(skb) - skb->head;
+		skb->csum_offset = offsetof(struct udphdr, check);
+		uh->check =
+		    ~csum_tcpudp_magic(fl->fl4_src, fl->fl4_dst, len,
+				       sk->sk_protocol, 0);
+		return;
+	}
+	csum = csum_partial(uh, sizeof(struct udpcphdr), 0);
+	csum = csum_add(csum, skb->csum);
+
+	/* add protocol-dependent pseudo-header */
+	uh->check =
+	    csum_tcpudp_magic(fl->fl4_src, fl->fl4_dst, len, sk->sk_protocol,
+			      csum);
+	if (uh->check == 0)
+		uh->check = CSUM_MANGLED_0;
+}
+
+/*
+ * Fetch data from kernel space and fill in checksum if needed.
+ */
+static int ip_reply_glue_bits(void *dptr, char *to, int offset,
+			      int len, int odd, struct sk_buff *skb)
+{
+	__wsum csum;
+
+	csum = csum_partial_copy_nocheck(dptr+offset, to, len, 0);
+	skb->csum = csum_block_add(skb->csum, csum, odd);
+	return 0;
+}
+
+/*
+ * Send an ack for a received data message fragment
+ *
+ * If the argument duplicate is true a ACK with UDPCP_DUPLICATE_FLAG set will
+ * be send
+ */
+static void udpcp_send_ack(struct sock *sk, struct sk_buff *skb,
+			   struct udpcp_dest *dest, int duplicate)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct udpcphdr *uh = udpcp_hdr(skb);
+	struct rtable *rt = NULL;
+	__wsum csum;
+	struct ipcm_cookie ipc;
+	struct udpcphdr rep;
+
+	memset(&rep, 0, sizeof(rep));
+
+	/* Swap the send and the receive ports. */
+	rep.udphdr.source = uh->udphdr.dest;
+	rep.udphdr.dest = uh->udphdr.source;
+	rep.udphdr.len = htons(sizeof(struct udpcphdr));
+
+	rep.msginfo = htons(UDPCP_MSG_TYPE_ACK |
+			    UDPCP_NO_ACK_FLAG |
+			    UDPCP_SINGLE_ACK_FLAG | UDPCP_PROTOCOL_VERSION_2);
+	if (duplicate)
+		rep.msginfo |= htons(UDPCP_DUPLICATE_FLAG);
+	else
+		memcpy(&dest->lastmsg, uh, sizeof(dest->lastmsg));
+	rep.msgid = uh->msgid;
+	rep.fragamount = uh->fragamount;
+	rep.fragnum = uh->fragnum;
+	rep.length = 0;
+	rep.chksum = 0;
+	if (ntohs(uh->msginfo) & UDPCP_CHECKSUM_FLAG) {
+		u8 *data;
+		u32 data_len;
+
+		data = (u8 *) &rep + sizeof(struct udphdr);
+		data_len = sizeof(struct udpcphdr)-sizeof(struct udphdr);
+
+		rep.msginfo |= htons(UDPCP_CHECKSUM_FLAG);
+		rep.chksum = htonl(zlib_adler32(1, data, data_len));
+	}
+
+	if (unlikely(debug)) {
+		struct sk_buff tmp;
+
+		tmp.len = ntohs(rep.udphdr.len);
+		tmp.head = tmp.transport_header = tmp.data = (void *)&rep;
+		tmp.tail = tmp.head + tmp.len;
+
+		dump_msg("ack msg", &tmp, ip_hdr(skb)->daddr,
+			 ip_hdr(skb)->saddr);
+	}
+
+	csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
+				      ip_hdr(skb)->saddr,
+				      sizeof(rep), sk->sk_protocol, 0);
+
+	ipc.addr = dest->addr;
+	ipc.opt = NULL;
+	ipc.tx_flags = 0;
+
+	{
+		struct flowi fl = {
+			.nl_u = { .ip4_u = {
+						.daddr = ipc.addr,
+						.saddr = ip_hdr(skb)->daddr,
+						.tos = RT_TOS(ip_hdr(skb)->tos)
+					      }
+			},
+			.uli_u = { .ports = {
+						.sport = udp_hdr(skb)->dest,
+						.dport = udp_hdr(skb)->source
+				       }
+			},
+			.proto = sk->sk_protocol,
+		};
+		security_skb_classify_flow(skb, &fl);
+		if (ip_route_output_key(sock_net(sk), &rt, &fl))
+			return;
+	}
+
+	inet->tos = ip_hdr(skb)->tos;
+	sk->sk_priority = skb->priority;
+	sk->sk_protocol = ip_hdr(skb)->protocol;
+	sk->sk_bound_dev_if = 0;
+	ip_append_data(sk, ip_reply_glue_bits, &rep, sizeof(rep),
+				0, &ipc, &rt, MSG_DONTWAIT);
+	skb = skb_peek(&sk->sk_write_queue);
+	if (skb) {
+		*((__sum16 *)skb_transport_header(skb) +
+		  offsetof(struct udphdr, check) / 2) =
+			csum_fold(csum_add(skb->csum, csum));
+		skb->ip_summed = CHECKSUM_NONE;
+		ip_push_pending_frames(sk);
+	}
+
+	ip_rt_put(rt);
+
+	UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_OUTDATAGRAMS, 0);
+}
+
+/*
+ * Pass a UDPCP skb buffer to the ip stack and send it
+ */
+static int udpcp_send_skb(struct sock *sk, struct sk_buff *skb,
+			  struct udpcp_dest *dest, struct ip_options *opt)
+{
+	int err;
+
+	skb_dst_set(skb, dst_clone(&dest->rt->dst));
+
+	err = ip_build_and_send_pkt(skb, sk, dest->fl.fl4_src,
+					dest->fl.fl4_dst, opt);
+
+	if (!err)
+		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_OUTDATAGRAMS, 0);
+	return err;
+}
+
+/*
+ * Release a routing table entry if no packet will be assembled
+ */
+static void udpcp_dst_release(struct udpcp_sock *usk, struct udpcp_dest *dest)
+{
+	if (usk->assembly_dest != dest) {
+		dst_release(&dest->rt->dst);
+		dest->rt = NULL;
+	}
+}
+
+/*
+ * Return true if the passed skb socket buffer is the last in the list
+ */
+static inline bool skb_is_eoq(const struct sk_buff_head *list,
+			      const struct sk_buff *skb)
+{
+	return (skb->next == (struct sk_buff *)list);
+}
+
+/*
+ * Arm the timeout handler for the socket
+ */
+static void udpcp_timer(struct sock *sk, unsigned long timeout)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	mod_timer(&usk->timer, timeout);
+}
+
+/*
+ * Decrement the socket pending counter and wakeup a waiting UDPCP_IOCTL_SYNC
+ */
+static inline void udpcp_dec_pending(struct sock *sk)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	if (!--usk->pending) {
+		if (waitqueue_active(&usk->wq))
+			wake_up_interruptible(&usk->wq);
+	}
+}
+
+/*
+ * Returns true is the passed message fragment is the last fragment
+ */
+static inline int udpcp_is_last_frag(struct udpcphdr *uh)
+{
+	return uh->fragamount == uh->fragnum + 1;
+}
+
+/*
+ * Transmit data message fragments
+ */
+static int _udpcp_xmit(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	struct sk_buff *skb = NULL;
+	struct sk_buff *skbc;
+	struct udpcphdr *uh;
+	int err = 0;
+
+	if (dest->acks >= usk->acks)
+		goto out;
+
+	if (!dest->xmit_last) {
+		/*
+		 * handle data message fragments without an ack
+		 */
+		while ((skb = skb_peek(&dest->xmit))) {
+			uh = udpcp_hdr(skb);
+
+			if (!(ntohs(uh->msginfo) & UDPCP_NO_ACK_FLAG))
+				break;
+			if (udpcp_is_last_frag(uh)) {
+				usk->stat.txMsgs++;
+				atomic_inc(&udpcp_tx_msgs);
+			}
+			skb_unlink(skb, &dest->xmit);
+			udpcp_dec_pending(sk);
+			if (unlikely(debug))
+				dump_msg("send msg", skb, dest->fl.fl4_src,
+					 dest->fl.fl4_dst);
+			err = udpcp_send_skb(sk, skb, dest,
+						(struct ip_options *)skb->cb);
+			if (err) {
+				kfree_skb(skb);
+				skb = NULL;
+				break;
+			}
+		}
+		dest->xmit_wait = skb;
+	} else {
+		/*
+		 * handle next data message fragment waiting for an ack
+		 */
+		uh = udpcp_hdr(dest->xmit_last);
+
+		if (udpcp_is_last_frag(uh))
+			goto out;
+
+		/*
+		 * get next data message fragment
+		 */
+		skb = dest->xmit_last->next;
+	}
+
+	/*
+	 * send all data message fragment till the first which must be acked
+	 */
+	while (skb) {
+		skbc = skb_clone(skb, sk->sk_allocation);
+
+		if (!skbc)
+			break;
+
+		if (unlikely(debug))
+			dump_msg("send msg", skbc, dest->fl.fl4_src,
+				 dest->fl.fl4_dst);
+		err = udpcp_send_skb(sk, skbc, dest,
+					(struct ip_options *)skb->cb);
+		if (err) {
+			kfree_skb(skbc);
+			break;
+		}
+
+		uh = udpcp_hdr(skb);
+
+		if (!(ntohs(uh->msginfo) & UDPCP_SINGLE_ACK_FLAG)
+		    || udpcp_is_last_frag(uh)) {
+			dest->xmit_last = skb;
+
+			if (++dest->acks >= usk->acks || udpcp_is_last_frag(uh))
+				break;
+		}
+
+		skb = skb_is_eoq(&dest->xmit, skb) ? NULL : skb->next;
+	}
+
+out:
+	if (skb_queue_empty(&dest->xmit))
+		udpcp_dst_release(usk, dest);
+
+	return err;
+}
+
+/*
+ * Transmit data message fragments and rearm the timeout handler if necessary
+ */
+static int udpcp_xmit(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	int ret;
+
+	ret = _udpcp_xmit(sk, dest);
+
+	if (dest->xmit_wait) {
+		dest->tx_time = jiffies;
+
+		if (!timer_pending(&usk->timer))
+			udpcp_timer(sk, dest->tx_time + usk->tx_timeout);
+	}
+	return ret;
+}
+
+/*
+ * Queue the assembled message fragment into the transmit queue
+ */
+static void udpcp_queue_xmit(struct sock *sk, struct udpcp_dest *dest,
+			     u8 ackmode, u8 chkmode)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	struct udpcphdr *uh;
+	struct sk_buff *skb;
+	u8 fragamount;
+	u8 fragnum;
+	unsigned short msginfo;
+	struct flowi *fl = &dest->fl;
+
+	msginfo = UDPCP_MSG_TYPE_DATA | UDPCP_PROTOCOL_VERSION_2;
+	switch (ackmode) {
+	case UDPCP_NOACK:
+		msginfo |= UDPCP_NO_ACK_FLAG;
+		break;
+	case UDPCP_SINGLE_ACK:
+		msginfo |= UDPCP_SINGLE_ACK_FLAG;
+		break;
+	case UDPCP_ACK:
+	default:
+		break;
+	}
+	switch (chkmode) {
+	case UDPCP_NOCHECKSUM:
+		break;
+	case UDPCP_CHECKSUM:
+	default:
+		msginfo |= UDPCP_CHECKSUM_FLAG;
+		break;
+	}
+
+	fragamount = skb_queue_len(&usk->assembly);
+
+	udpcp_sk(sk)->pending += fragamount;
+
+	for (fragnum = 0; fragnum != fragamount; fragnum++) {
+		unsigned char *data;
+		int data_len;
+
+		skb = skb_dequeue(&usk->assembly);
+		uh = udpcp_hdr(skb);
+
+		/*
+		 * setup a UDPCP header
+		 */
+		uh->chksum = 0;
+		uh->msginfo = htons(msginfo);
+		uh->fragnum = fragnum;
+		uh->fragamount = fragamount;
+		uh->msgid = htons(dest->msgid);
+		uh->length = htons(usk->assembly_len);
+
+		data = skb_transport_header(skb) + sizeof(struct udphdr);
+		data_len = skb_tail_pointer(skb) - data;
+
+		if (chkmode == UDPCP_CHECKSUM)
+			uh->chksum = htonl(zlib_adler32(1, data, data_len));
+		/*
+		 * create a UDP header
+		 */
+		uh->udphdr.source = fl->fl_ip_sport;
+		uh->udphdr.dest = fl->fl_ip_dport;
+		uh->udphdr.len = htons(sizeof(struct udphdr) + data_len);
+		uh->udphdr.check = 0;
+
+		/*
+		 * create UDP checksum
+		 */
+		udpcp_do_csum(sk, skb, dest);
+
+		/*
+		 * add to xmit queue
+		 */
+		skb_queue_tail(&dest->xmit, skb);
+	}
+
+	dest->msgid++;
+	usk->assembly_len = 0;
+	usk->assembly_dest = NULL;
+}
+
+/*
+ * Remove all data message fragments of the first message from the transmit
+ * queue all fragments will be merged together
+ */
+static struct sk_buff *udpcp_dequeue_msg(struct sock *sk,
+					 struct udpcp_dest *dest)
+{
+	struct sk_buff *msg;
+	struct sk_buff *skb;
+	struct sk_buff **next;
+	struct udpcphdr *uh;
+
+	msg = skb_dequeue(&dest->xmit);
+	if (!msg)
+		return NULL;
+	skb_orphan(msg);
+
+	uh = udpcp_hdr(msg);
+	if (!uh->msgid) {
+		/*
+		 * sync message
+		 */
+		kfree_skb(msg);
+		return NULL;
+	}
+
+	skb_pull(msg, sizeof(struct udpcphdr));
+	if (udpcp_is_last_frag(uh))
+		return msg;
+
+	next = &skb_shinfo(msg)->frag_list;
+	for (;;) {
+		skb = skb_dequeue(&dest->xmit);
+		if (!skb)
+			break;
+		skb_orphan(skb);
+		uh = udpcp_hdr(skb);
+		skb_pull(msg, sizeof(struct udpcphdr));
+		msg->len += skb->len;
+		msg->data_len += skb->len;
+		*next = skb;
+		if (udpcp_is_last_frag(uh))
+			break;
+		next = &skb->next;
+	}
+	return msg;
+}
+
+static void udpcp_flush_err(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	if (!inet->recverr) {
+		skb_queue_purge(&dest->xmit);
+	} else {
+		struct sock_exterr_skb *serr;
+		struct iphdr *iph;
+		struct sk_buff *skb;
+
+		while (!skb_queue_empty(&dest->xmit)) {
+			skb = udpcp_dequeue_msg(sk, dest);
+			if (!skb)
+				continue;
+
+			if (unlikely(debug))
+				dump_msg("flush outgoing message", skb,
+					 dest->fl.fl4_src, dest->fl.fl4_dst);
+
+			skb_push(skb, sizeof(struct iphdr));
+			skb_reset_network_header(skb);
+			iph = ip_hdr(skb);
+			iph->daddr = dest->rt->rt_dst;
+
+			serr = SKB_EXT_ERR(skb);
+			serr->ee.ee_errno = EPROTO;
+			serr->ee.ee_origin = SO_EE_ORIGIN_LOCAL;
+			serr->ee.ee_type = 0;
+			serr->ee.ee_code = 0;
+			serr->ee.ee_pad = 0;
+			serr->ee.ee_info = 0;
+			serr->ee.ee_data = 0;
+			serr->addr_offset = (u8 *) &iph->daddr -
+						skb_network_header(skb);
+			serr->port = dest->fl.fl_ip_dport;
+
+			skb_reset_transport_header(skb);
+			skb_pull(skb, sizeof(struct iphdr));
+
+			/*
+			 * set a flag for UDPCP message
+			 */
+			UDP_SKB_CB(skb)->udpcp_flag = 1;
+
+			/*
+			 * pass the dequeued message to the error queue of the
+			 * socket
+			 */
+			skb_set_owner_r(skb, sk);
+			skb_queue_tail(&sk->sk_error_queue, skb);
+			if (!sock_flag(sk, SOCK_DEAD)) {
+				if (usk->udp_data_ready)
+					usk->udp_data_ready(sk, skb->len);
+			}
+		}
+	}
+
+	dest->xmit_wait = 0;
+	dest->xmit_last = 0;
+	dest->try = 0;
+	dest->acks = 0;
+
+	usk->pending = 0;
+	if (waitqueue_active(&usk->wq))
+		wake_up_interruptible(&usk->wq);
+}
+
+/*
+ * Purge the current incoming data message
+ */
+static void udpcp_purge_incoming(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	if (dest->recv_last) {
+		u32 fragnum = udpcp_hdr(dest->recv_last)->fragnum + 1;
+
+		dest->rx_discarded_frags += fragnum;
+		usk->stat.rxDiscardedFrags += fragnum;
+		atomic_add(fragnum, &udpcp_rx_discarded_frags);
+
+		dest->lastmsg.msgid = 0;
+
+		if (unlikely(debug))
+			dump_msg("purge incoming message", dest->recv_msg,
+				 dest->fl.fl4_src, dest->fl.fl4_dst);
+	}
+
+	kfree_skb(dest->recv_msg);
+	dest->recv_msg = 0;
+	dest->recv_last = 0;
+}
+
+/*
+ * Resend all data message fragments to the one which is currently waiting for
+ * an ack
+ */
+static int udpcp_resend(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct sk_buff *skb;
+	struct sk_buff *skbc;
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	int err;
+
+	if (++dest->try >= usk->maxtry) {
+		dest->insync = 0;
+		udpcp_flush_err(sk, dest);
+		udpcp_purge_incoming(sk, dest);
+		udpcp_dst_release(usk, dest);
+		return 0;
+	}
+
+	dest->tx_retries++;
+	usk->stat.txRetries++;
+	atomic_inc(&udpcp_tx_retries);
+
+	if (!dest->xmit_last) {
+		_udpcp_xmit(sk, dest);
+	} else {
+		skb = dest->xmit_wait;
+
+		for (;;) {
+			skbc = skb_clone(skb, sk->sk_allocation);
+
+			if (skbc == NULL)
+				break;
+
+			if (unlikely(debug))
+				dump_msg("resend msg", skbc, dest->fl.fl4_src,
+					 dest->fl.fl4_dst);
+			err = udpcp_send_skb(sk, skbc, dest,
+						(struct ip_options *)skb->cb);
+			if (err) {
+				kfree_skb(skbc);
+				break;
+			}
+
+			if (skb == dest->xmit_last) {
+				_udpcp_xmit(sk, dest);
+				break;
+			}
+
+			skb = skb->next;
+		}
+	}
+	dest->tx_time = jiffies;
+
+	return 1;
+}
+
+/*
+ * Handle udpcp timeout
+ */
+static void udpcp_handle_timeout(struct sock *sk)
+{
+	struct udpcp_dest *dest;
+	struct list_head *p;
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	int wflag = 0;
+	unsigned long t = jiffies + UDPCP_MAX_WAIT_SEC * HZ + 1;
+
+	usk->timeout = 0;
+
+	/*
+	 * walk through all destinations
+	 */
+	list_for_each(p, &usk->destlist) {
+		dest = list_to_udpcpdest(p);
+
+		if (dest->xmit_wait) {
+			if (time_is_before_eq_jiffies
+			    (dest->tx_time + usk->tx_timeout)) {
+				/*
+				 * transmit timeout expired
+				 */
+				if (unlikely(debug))
+					dump_msg("send timeout",
+						 dest->xmit_wait,
+						 dest->fl.fl4_src,
+						 dest->fl.fl4_dst);
+				if (udpcp_resend(sk, dest) == 0) {
+					dest->tx_timeout++;
+					usk->stat.txTimeout++;
+					atomic_inc(&udpcp_tx_timeout);
+					goto check_incoming;
+				}
+				wflag = 1;
+			}
+			if (time_before(dest->tx_time + usk->tx_timeout, t)) {
+				/*
+				 * calculate new timeout timer value
+				 */
+				t = dest->tx_time + usk->tx_timeout;
+				wflag = 1;
+			}
+		}
+check_incoming:
+		if (dest->recv_msg) {
+			if (time_is_before_eq_jiffies
+			    (dest->rx_time + usk->rx_timeout)) {
+				/*
+				 * receive timeout occurred
+				 */
+				if (unlikely(debug))
+					dump_msg("receive timeout",
+						 dest->recv_last,
+						 dest->fl.fl4_src,
+						 dest->fl.fl4_dst);
+				udpcp_purge_incoming(sk, dest);
+				dest->rx_timeout++;
+				usk->stat.rxTimeout++;
+				atomic_inc(&udpcp_rx_timeout);
+			} else
+			if (time_before(dest->rx_time + usk->rx_timeout, t)) {
+				/*
+				 * calculate new timeout timer value
+				 */
+				t = dest->rx_time + usk->rx_timeout;
+				wflag = 1;
+			}
+		}
+	}
+	/*
+	 * restart timer if necessary
+	 */
+	if (wflag)
+		udpcp_timer(sk, t);
+}
+
+/*
+ * Timeout function
+ */
+static void udpcp_timeout(unsigned long data)
+{
+	struct sock *sk = (struct sock *)data;
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	bh_lock_sock(sk);
+	if (!sock_owned_by_user(sk)) {
+		udpcp_handle_timeout(sk);
+	} else {
+		/*
+		 * bad, cannot handle the timeout because the socket is in use
+		 * set flag for unhandled timeout and rearm the timer
+		 */
+		usk->timeout = 1;
+		udpcp_timer(sk, jiffies + 1);
+	}
+	bh_unlock_sock(sk);
+}
+
+/*
+ * Handle timeout if an the unhandled timeout flag is set
+ */
+static inline void check_timeout(struct sock *sk)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	while (usk->timeout) {
+		lock_sock(sk);
+		while (usk->timeout)
+			udpcp_handle_timeout(sk);
+		release_sock(sk);
+	}
+}
+
+/*
+ * Release the socket lock and test for unhandled timeouts
+ */
+static inline void udpcp_release_sock(struct sock *sk)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	while (usk->timeout)
+		udpcp_handle_timeout(sk);
+	release_sock(sk);
+	check_timeout(sk);
+}
+
+/*
+ * Parse sendmsg() control message
+ */
+static int udpcp_cmsg_send(struct msghdr *msg, u8 * ackmode, u8 * chkmode)
+{
+	struct cmsghdr *cmsg;
+
+	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+		if (!CMSG_OK(msg, cmsg))
+			return -EINVAL;
+		if (cmsg->cmsg_level != SOL_UDPCP)
+			continue;
+		switch (cmsg->cmsg_type) {
+		case UDPCP_NOACK:
+		case UDPCP_ACK:
+		case UDPCP_SINGLE_ACK:
+			*ackmode = cmsg->cmsg_type;
+			break;
+		case UDPCP_CHECKSUM:
+		case UDPCP_NOCHECKSUM:
+			*chkmode = cmsg->cmsg_type;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Validate a skb buffer
+ */
+static int udpcp_validate_skb(struct sk_buff *skb)
+{
+	if (skb->next) {
+		pr_err("udpcp: unexpected skb_buff->next != NULL\n");
+		BUG();
+		return 1;
+	}
+	if (skb_shinfo(skb)->frag_list) {
+		pr_err("udpcp: unexpected skb_shinfo(skb)->frag_list != NULL\n");
+		BUG();
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * Split a message into fragments and store it into the assemble queue
+ * mostly stolen from UDP stack
+ */
+static int udpcp_data(struct sock *sk, struct udpcp_dest *dest,
+		      struct iovec *from, int length, unsigned int flags)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	struct inet_sock *inet = inet_sk(sk);
+	struct sk_buff *skb;
+	struct ipcm_cookie *ipc = &dest->ipc;
+	struct ip_options *opt = ipc->opt;
+	int hh_len;
+	int exthdrlen;
+	int mtu;
+	int copy;
+	int err;
+	int offset = 0;
+	unsigned int maxfraglen, fragheaderlen;
+	int csummode = CHECKSUM_NONE;
+	int transhdrlen = sizeof(struct udpcphdr);
+	struct rtable *rt = dest->rt;
+
+	if (opt && sizeof(skb->cb) < optlength(opt)) {
+		err = -EFAULT;
+		goto error;
+	}
+
+	usk->assembly_len += length;
+	usk->assembly_dest = dest;
+
+	if (usk->assembly_len > UDPCP_MAX_MSGSIZE) {
+		ip_local_error(sk, EMSGSIZE, rt->rt_dst, dest->fl.fl_ip_dport,
+				usk->assembly_len);
+		err = -EMSGSIZE;
+		goto error;
+	}
+
+	mtu = (inet->pmtudisc == IP_PMTUDISC_PROBE) ?
+		rt->dst.dev->mtu : dst_mtu(rt->dst.path);
+	sk->sk_sndmsg_page = NULL;
+	sk->sk_sndmsg_off = 0;
+	exthdrlen = rt->dst.header_len;
+	length += exthdrlen;
+	transhdrlen += exthdrlen;
+
+	hh_len = LL_RESERVED_SPACE(rt->dst.dev);
+
+	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
+	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
+
+	if (rt->dst.dev->features & NETIF_F_V4_CSUM && !exthdrlen)
+		csummode = CHECKSUM_PARTIAL;
+
+	skb = skb_peek_tail(&usk->assembly);
+	if (skb) {
+		unsigned int off;
+
+		off = skb->len;
+
+		copy = mtu - skb->len;
+		if (copy > length)
+			copy = length;
+
+		if (copy > 0 &&
+		    ip_generic_getfrag(
+		     from, skb_put(skb, copy), 0, copy, off, skb) < 0) {
+			__skb_trim(skb, off);
+			err = -EFAULT;
+			goto error;
+		}
+		length -= copy;
+		offset += copy;
+
+		if (!length)
+			return 0;
+	}
+
+	do {
+		char *data;
+		unsigned int datalen;
+		unsigned int fraglen;
+		unsigned int alloclen;
+
+		length += transhdrlen;
+		/*
+		 * If remaining data exceeds the mtu,
+		 * we know we need more fragment(s).
+		 */
+		datalen = length;
+		if (datalen > mtu - fragheaderlen)
+			datalen = maxfraglen - fragheaderlen;
+		fraglen = datalen + fragheaderlen;
+
+		if ((flags & MSG_MORE)
+		    && !(rt->dst.dev->features & NETIF_F_SG))
+			alloclen = mtu;
+		else
+			alloclen = fraglen;
+
+		alloclen += rt->dst.trailer_len + hh_len + 15;
+
+		udpcp_release_sock(sk);
+		skb = sock_alloc_send_skb(sk, alloclen,
+					(flags & MSG_DONTWAIT), &err);
+		lock_sock(sk);
+		if (skb == NULL)
+			goto error;
+
+		if (udpcp_validate_skb(skb)) {
+			kfree_skb(skb);
+
+			goto error;
+		}
+
+		/*
+		 * Fill in the control structures
+		 */
+		skb->ip_summed = csummode;
+		skb->csum = 0;
+		skb_reserve(skb, hh_len);
+
+		/*
+		 * Find where to start putting bytes.
+		 */
+		data = skb_put(skb, fraglen);
+		skb_set_network_header(skb, exthdrlen);
+		skb->transport_header = (skb->network_header + fragheaderlen);
+		data += fragheaderlen;
+
+		copy = datalen - transhdrlen;
+
+		if (copy > 0 &&
+		  ip_generic_getfrag(
+		   from, data + transhdrlen, offset, copy, 0, skb) < 0) {
+			err = -EFAULT;
+			kfree_skb(skb);
+			goto error;
+		}
+
+		offset += copy;
+		length -= datalen;
+
+		if (ipc->opt)
+			memcpy(skb->cb, &ipc->opt, optlength(opt));
+
+		skb_pull(skb, fragheaderlen);
+		skb_queue_tail(&usk->assembly, skb);
+	} while (length > 0);
+
+	return 0;
+error:
+	skb_queue_purge(&usk->assembly);
+	usk->assembly_len = 0;
+
+	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	return err;
+}
+
+/*
+ * This function will be called by send(), sento() and sendmsg()
+ */
+static int udpcp_sendmsg(struct kiocb *iocb, struct sock *sk,
+			 struct msghdr *msg, size_t len)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	struct ipcm_cookie *ipc;
+	struct rtable *rt = NULL;
+	int free = 0;
+	int connected = 0;
+	__be32 daddr, faddr, saddr;
+	__be16 dport;
+	u8 tos;
+	int err = 0;
+	int corkreq = usk->udpsock.corkflag || msg->msg_flags & MSG_MORE;
+	struct udpcp_dest *dest;
+
+	if (len > UDPCP_MAX_MSGSIZE)
+		return -EMSGSIZE;
+
+	/*
+	 * Check the flags.
+	 */
+	if (msg->msg_flags & MSG_OOB)
+		return -EOPNOTSUPP;
+
+	/*
+	 * check if socket is binded to a port
+	 */
+	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK) || !inet->inet_num)
+		return -ENOTCONN;
+
+	/*
+	 * Get and verify the address.
+	 */
+	if (msg->msg_name) {
+		struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
+		if (msg->msg_namelen < sizeof(*usin))
+			return -EINVAL;
+		if (usin->sin_family != AF_INET) {
+			if (usin->sin_family != AF_UNSPEC)
+				return -EAFNOSUPPORT;
+		}
+
+		daddr = usin->sin_addr.s_addr;
+		dport = usin->sin_port;
+	} else {
+		if (sk->sk_state != TCP_ESTABLISHED)
+			return -EDESTADDRREQ;
+		daddr = inet->inet_daddr;
+		dport = inet->inet_dport;
+		/* Open fast path for connected socket.
+		   Route will not be used, if at least one option is set.
+		 */
+		connected = 1;
+	}
+
+	if (dport == 0)
+		return -EINVAL;
+
+	dest = find_dest(sk, daddr, dport);
+	if (!dest)
+		return -ENOMEM;
+
+	if (!(dest->use_flag & TX_NODE)) {
+		dest->use_flag |= TX_NODE;
+		usk->stat.txNodes++;
+		atomic_inc(&udpcp_tx_nodes);
+	}
+
+	ipc = &dest->ipc;
+
+	if (!skb_queue_empty(&usk->assembly)) {
+		/*
+		 * assembly is ongoing
+		 */
+		lock_sock(sk);
+		if (likely(!skb_queue_empty(&usk->assembly))) {
+			if (usk->assembly_dest != dest) {
+				udpcp_release_sock(sk);
+				return -EUSERS;
+			}
+			ipc->opt =
+			    (struct ip_options *)skb_peek(&usk->assembly)->cb;
+			goto queue_data;
+		}
+		udpcp_release_sock(sk);
+	}
+
+	ipc->addr = inet->inet_saddr;
+	ipc->oif = sk->sk_bound_dev_if;
+
+	dest->ackmode = usk->ackmode;
+	dest->chkmode = usk->chkmode;
+
+	if (msg->msg_controllen) {
+		/*
+		 * handle control message
+		 */
+		err = udpcp_cmsg_send(msg, &dest->ackmode, &dest->chkmode);
+		if (err)
+			return err;
+		err = ip_cmsg_send(sock_net(sk), msg, ipc);
+		if (err)
+			return err;
+		if (ipc->opt)
+			free = 1;
+		connected = 0;
+	}
+
+	if (!ipc->opt)
+		ipc->opt = inet->opt;
+
+	saddr = ipc->addr;
+	ipc->addr = faddr = daddr;
+
+	if (ipc->opt && ipc->opt->srr) {
+		if (!daddr)
+			return -EINVAL;
+		faddr = ipc->opt->faddr;
+		connected = 0;
+	}
+	tos = RT_TOS(inet->tos);
+	if (sock_flag(sk, SOCK_LOCALROUTE) ||
+	    (msg->msg_flags & MSG_DONTROUTE) ||
+	    (ipc->opt && ipc->opt->is_strictroute)) {
+		tos |= RTO_ONLINK;
+		connected = 0;
+	}
+
+	if (ipv4_is_multicast(daddr)) {
+		if (dest->ackmode != UDPCP_NOACK) {
+			err = EOPNOTSUPP;
+			goto out;
+		}
+		if (!ipc->oif)
+			ipc->oif = inet->mc_index;
+		if (!saddr)
+			saddr = inet->mc_addr;
+		connected = 0;
+	}
+
+	lock_sock(sk);
+	rt = dest->rt;
+	if (rt)
+		goto queue_data;
+	udpcp_release_sock(sk);
+
+	/*
+	 * calculate routing
+	 */
+	if (connected)
+		rt = (struct rtable *)sk_dst_check(sk, 0);
+
+	if (rt == NULL) {
+		struct flowi fl = {.oif = ipc->oif,
+			.nl_u = {.ip4_u = {.daddr = faddr,
+					   .saddr = saddr,
+					   .tos = tos} },
+			.proto = sk->sk_protocol,
+			.uli_u = {.ports = {.sport = inet->inet_sport,
+					    .dport = dport} }
+		};
+		struct net *net = sock_net(sk);
+
+		security_sk_classify_flow(sk, &fl);
+		err = ip_route_output_flow(net, &rt, &fl, sk, 1);
+		if (err) {
+			if (err == -ENETUNREACH)
+				IP_INC_STATS_BH(net, IPSTATS_MIB_OUTNOROUTES);
+			goto out;
+		}
+
+		err = -EACCES;
+		if ((rt->rt_flags & RTCF_BROADCAST) &&
+		    !sock_flag(sk, SOCK_BROADCAST))
+			goto out;
+		if (connected)
+			sk_dst_set(sk, dst_clone(&rt->dst));
+	}
+
+	if (msg->msg_flags & MSG_CONFIRM)
+		goto do_confirm;
+back_from_confirm:
+
+	saddr = rt->rt_src;
+	if (!ipc->addr)
+		daddr = ipc->addr = rt->rt_dst;
+
+	lock_sock(sk);
+
+	dest->fl.fl4_dst = daddr;
+	dest->fl.fl_ip_dport = dport;
+	dest->fl.fl4_src = saddr;
+	dest->fl.fl_ip_sport = inet->inet_sport;
+	dest->rt = rt;
+
+queue_data:
+	if (msg->msg_flags & MSG_PROBE)
+		goto release;
+
+	if (!dest->insync && skb_queue_empty(&dest->xmit)) {
+		/*
+		 * if not synced, queue a SYNC message
+		 */
+		err = udpcp_data(sk, dest, NULL, 0, 0);
+		if (err)
+			goto release;
+		dest->msgid = 0;
+		udpcp_queue_xmit(sk, dest, UDPCP_ACK, UDPCP_CHECKSUM);
+	}
+
+	/*
+	 * split message and store it to the assembly queue
+	 */
+	err = udpcp_data(sk, dest, msg->msg_iov, len,
+		       corkreq ? msg->msg_flags | MSG_MORE : msg->msg_flags);
+	if (err)
+		goto release;
+
+	if (!dest->msgid)
+		dest->msgid = 1;
+
+	if (!corkreq) {
+		/*
+		 * message is complete, transfer it from the assembly queue
+		 * into the transmit queue
+		 */
+		udpcp_queue_xmit(sk, dest, dest->ackmode, dest->chkmode);
+		/*
+		 * start transmit if possible
+		 */
+		err = udpcp_xmit(sk, dest);
+	}
+release:
+	udpcp_release_sock(sk);
+out:
+	if (free)
+		kfree(ipc->opt);
+
+	if (!err)
+		return len;
+	/*
+	 * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
+	 * ENOBUFS might not be good (it's not tunable per se), but otherwise
+	 * we don't have a good statistic (IpOutDiscards but it can be too many
+	 * things).  We could add another new stat but at least for now that
+	 * seems like overkill.
+	 */
+	if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
+		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_SNDBUFERRORS, 0);
+	return err;
+
+do_confirm:
+	dst_confirm(&rt->dst);
+	if (!(msg->msg_flags & MSG_PROBE) || len)
+		goto back_from_confirm;
+
+	err = 0;
+	goto out;
+}
+
+/*
+ * Sendpage() is not really implemented
+ */
+static int udpcp_sendpage(struct sock *sk, struct page *page, int offset,
+			  size_t size, int flags)
+{
+	return sock_no_sendpage(sk->sk_socket, page, offset, size, flags);
+}
+
+/*
+ * Release all message fragments of the first in the transmit queue
+ */
+static void udpcp_release_xmit(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	struct sk_buff *skb;
+	struct udpcphdr *uh;
+
+	for (;;) {
+		skb = skb_dequeue(&dest->xmit);
+
+		uh = udpcp_hdr(skb);
+
+		if (udpcp_is_last_frag(uh) && uh->msgid) {
+			usk->stat.txMsgs++;
+			atomic_inc(&udpcp_tx_msgs);
+		}
+
+		udpcp_dec_pending(sk);
+
+		kfree_skb(skb);
+		if (skb == dest->xmit_last)
+			break;
+	}
+
+	dest->xmit_wait = 0;
+	dest->xmit_last = 0;
+	dest->try = 0;
+}
+
+/*
+ * Set the sync state
+ */
+static void udpcp_sync(struct sock *sk, struct udpcp_dest *dest)
+{
+	dest->xmit_wait = 0;
+	dest->xmit_last = 0;
+	dest->try = 0;
+	dest->acks = 0;
+	dest->insync = 1;
+}
+
+/*
+ * Returns true if the first message in the transmit queue is a sync message
+ */
+static inline int udpcp_xmit_is_sync(struct udpcp_dest *dest)
+{
+	struct sk_buff *skb = skb_peek(&dest->xmit);
+
+	return skb && !udpcp_hdr(skb)->msgid;
+}
+
+static inline struct udpcphdr *udpcp_ack_scan(struct sk_buff *skb)
+{
+	struct udpcphdr *uh;
+
+	for (;;) {
+		uh = udpcp_hdr(skb);
+
+		if (!(ntohs(uh->msginfo) & UDPCP_SINGLE_ACK_FLAG)
+		    || udpcp_is_last_frag(uh))
+			return uh;
+
+		skb = skb->next;
+	}
+}
+
+/*
+ * Handle an incoming ack
+ */
+static void udpcp_handle_ack(struct sock *sk, struct sk_buff *skb,
+			     struct udpcp_dest *dest)
+{
+	struct udpcphdr *r_uh;
+	struct udpcphdr *q_uh;
+
+	if (!dest->acks)
+		return;
+
+	r_uh = udpcp_hdr(skb);
+
+	/*
+	 * acks doesn't have a payload
+	 */
+	if (r_uh->length)
+		return;
+
+	q_uh = udpcp_ack_scan(dest->xmit_wait);
+
+	/*
+	 * message id, fragnum and fragamount must match the awaited message
+	 * fragment
+	 */
+	if (r_uh->msgid != q_uh->msgid)
+		return;
+
+	if (r_uh->fragnum != q_uh->fragnum)
+		return;
+
+	if (r_uh->fragamount != q_uh->fragamount)
+		return;
+
+	dest->acks--;
+
+	/*
+	 * if last fragment release message
+	 */
+	if (udpcp_is_last_frag(q_uh)) {
+		udpcp_release_xmit(sk, dest);
+
+		/*
+		 * special handling for sync messages
+		 */
+		if (r_uh->msgid == 0)
+			udpcp_sync(sk, dest);
+	} else {
+		dest->xmit_wait = dest->xmit_wait->next;
+	}
+	/*
+	 * try to transmit next message/fragment
+	 */
+	udpcp_xmit(sk, dest);
+}
+
+/*
+ * Queue incoming message as owned by udpcp socket
+ */
+static void udpcp_set_owner_r(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct sk_buff *skb;
+
+	skb = dest->recv_msg;
+	skb_set_owner_r(skb, sk);
+
+	skb = skb_shinfo(skb)->frag_list;
+	if (!skb)
+		return;
+
+	for (;;) {
+		skb_set_owner_r(skb, sk);
+		if (udpcp_is_last_frag(udpcp_hdr(skb)))
+			break;
+		skb = skb->next;
+	}
+}
+
+/*
+ * Handle an incoming data message fragment
+ */
+static int udpcp_handle_data(struct sock *sk, struct sk_buff *skb,
+			     struct udpcp_dest *dest)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	struct udpcphdr *uh = udpcp_hdr(skb);
+	unsigned short msginfo = ntohs(uh->msginfo);
+	unsigned short length = ntohs(uh->length);
+
+	/*
+	 * special handling for sync messages
+	 */
+	if (uh->msgid == 0) {
+		/*
+		 * sync messages doesn't have a payload
+		 */
+		if (length)
+			return 1;
+
+		/*
+		 * sync messages doesn't have a ack rules
+		 */
+		if (msginfo & (UDPCP_NO_ACK_FLAG | UDPCP_SINGLE_ACK_FLAG))
+			return 1;
+
+		udpcp_send_ack(sk, skb, dest,
+			       memcmp(uh, &dest->lastmsg,
+				      sizeof(dest->lastmsg)) ? 0 : 1);
+
+		udpcp_purge_incoming(sk, dest);
+
+		/*
+		 * skip the first message in the queue if it is a sync messages
+		 */
+		if (udpcp_xmit_is_sync(dest)) {
+			dest->acks--;
+			udpcp_dec_pending(sk);
+			kfree_skb(skb_dequeue(&dest->xmit));
+		}
+
+		if (!dest->insync)
+			udpcp_sync(sk, dest);
+
+		udpcp_xmit(sk, dest);
+
+		return -1;
+	}
+
+	if (!dest->insync)
+		return 1;
+
+	if (length > UDPCP_MAX_MSGSIZE)
+		return 1;
+
+	length += sizeof(struct udpcphdr);
+
+	/*
+	 * if the message was still handled, send a duplicate ack
+	 */
+	if (!memcmp(uh, &dest->lastmsg, sizeof(dest->lastmsg))) {
+		udpcp_send_ack(sk, skb, dest, 1);
+		return 1;
+	}
+
+	if (dest->recv_msg) {
+		/*
+		 * if a fragment is already received validate the fragment
+		 */
+		if ((uh->msgid != udpcp_hdr(dest->recv_msg)->msgid) ||
+		    (uh->msginfo != udpcp_hdr(dest->recv_msg)->msginfo) ||
+		    (uh->length != udpcp_hdr(dest->recv_msg)->length) ||
+		    (uh->fragamount != udpcp_hdr(dest->recv_msg)->fragamount)
+		    ) {
+			udpcp_purge_incoming(sk, dest);
+			goto newmsg;
+		}
+
+		if (uh->fragnum != udpcp_hdr(dest->recv_last)->fragnum + 1)
+			return 1;
+
+		if (dest->recv_msg->len + skb->len - sizeof(struct udpcphdr) >
+		    length)
+			return 1;
+	} else {
+newmsg:
+		/*
+		 * first fragment must have the number 0
+		 */
+		if (uh->fragnum != 0)
+			return 1;
+
+		/*
+		 * UDPCP data length cannot be smaller then the UDP data length
+		 */
+		if (skb->len > length)
+			return 1;
+
+		/*
+		 * id of the last received is not valid
+		 */
+		if (dest->lastmsg.msgid == uh->msgid)
+			return 1;
+
+		/*
+		 * check against receive buffer limit
+		 */
+		if (atomic_read(&sk->sk_rmem_alloc) + length > sk->sk_rcvbuf)
+			return 1;
+	}
+
+	memset(&dest->lastmsg, 0, sizeof(dest->lastmsg));
+
+	if (!dest->recv_msg) {
+		/*
+		 * store the first message fragment
+		 */
+		if (skb->cloned) {
+			struct sk_buff *skbc;
+
+			skbc = skb_copy(skb, sk->sk_allocation);
+			if (skbc == NULL)
+				return 1;
+			kfree_skb(skb);
+			skb = skbc;
+		}
+		dest->recv_msg = skb;
+	} else {
+		/*
+		 * store the consecutively message fragment
+		 */
+		struct skb_shared_info *shinfo;
+
+		shinfo = skb_shinfo(dest->recv_msg);
+
+		if (!shinfo->frag_list)
+			shinfo->frag_list = skb;
+		else
+			dest->recv_last->next = skb;
+
+		skb_pull(skb, sizeof(struct udpcphdr));
+		dest->recv_msg->len += skb->len;
+		dest->recv_msg->data_len += skb->len;
+	}
+	dest->recv_last = skb;
+
+	msginfo = ntohs(uh->msginfo);
+
+	if (udpcp_is_last_frag(uh) || uh->fragamount == 0) {
+		/*
+		 * last fragment: queue it to the socket sk_receive_queue
+		 * and ack it
+		 */
+
+		if (dest->recv_msg->len != length) {
+			udpcp_purge_incoming(sk, dest);
+			return 0;
+		}
+
+		if (!(msginfo & UDPCP_NO_ACK_FLAG))
+			udpcp_send_ack(sk, skb, dest, 0);
+
+		memcpy(dest->recv_msg->data + UDPCP_HDRSIZE,
+		       dest->recv_msg->data, sizeof(struct udphdr));
+		skb_pull(dest->recv_msg, UDPCP_HDRSIZE);
+
+		usk->stat.rxMsgs++;
+		atomic_inc(&udpcp_rx_msgs);
+
+		/*
+		 * set a flag for UDPCP message
+		 */
+		UDP_SKB_CB(skb)->udpcp_flag = 1;
+
+		udpcp_set_owner_r(sk, dest);
+		skb_queue_tail(&sk->sk_receive_queue, dest->recv_msg);
+
+		/*
+		 * call the original data available handler
+		 */
+		if (usk->udp_data_ready)
+			usk->udp_data_ready(sk, dest->recv_msg->len);
+
+		dest->recv_msg = 0;
+		dest->recv_last = 0;
+	} else {
+		/*
+		 * ack fragment if requiered
+		 */
+		if (!(msginfo & UDPCP_NO_ACK_FLAG)
+		    && !(msginfo & UDPCP_SINGLE_ACK_FLAG))
+			udpcp_send_ack(sk, skb, dest, 0);
+
+		/*
+		 * setup timeout handler
+		 */
+		dest->rx_time = jiffies;
+
+		if (!timer_pending(&usk->timer))
+			udpcp_timer(sk, dest->rx_time + usk->rx_timeout);
+	}
+
+	return 0;
+}
+
+/*
+ * Deal with received UDPCP frames - sort out what type source it is
+ * and hand of it to the udpcp_handle_packet function.
+ */
+static void udpcp_data_ready(struct sock *sk, int slen)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	struct sk_buff *skb;
+	struct udpcp_dest *dest;
+	struct udpcphdr *uh;
+	unsigned short msginfo;
+	int ret;
+
+	skb = skb_peek_tail(&sk->sk_receive_queue);
+
+	/*
+	 * don't handle NULL pointer buffer and UDPCP messages
+	 */
+	if (skb == NULL || UDP_SKB_CB(skb)->udpcp_flag) {
+		if (usk->udp_data_ready)
+			usk->udp_data_ready(sk, slen);
+		return;
+	}
+
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	if (udpcp_validate_skb(skb)) {
+		kfree_skb(skb);
+
+		return;
+	}
+
+	skb_orphan(skb);
+
+	/*
+	 * do UDP checksum
+	 */
+	if (udp_lib_checksum_complete(skb)) {
+		UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, 0);
+		return;
+	}
+
+	if (unlikely(debug))
+		dump_msg("receive", skb, ip_hdr(skb)->saddr,
+			 ip_hdr(skb)->daddr);
+
+	uh = udpcp_hdr(skb);
+	msginfo = ntohs(uh->msginfo);
+
+	/*
+	 * handle only UDPCP protocol version 2
+	 */
+	if ((msginfo & UDPCP_PROTOCOL_MASK) != UDPCP_PROTOCOL_VERSION_2) {
+		kfree_skb(skb);
+		return;
+	}
+
+	/*
+	 * handle UDPCP checksum
+	 */
+	if (msginfo & UDPCP_CHECKSUM_FLAG) {
+		u8 *data;
+		u32 data_len;
+		u32 chksum;
+
+		chksum = ntohl(uh->chksum);
+		data = (u8 *) skb->data + sizeof(struct udphdr);
+		data_len = skb->len - sizeof(struct udphdr);
+
+		uh->chksum = 0;
+
+		if (chksum != zlib_adler32(1, data, data_len)) {
+			kfree_skb(skb);
+			usk->stat.crcErrors++;
+			atomic_inc(&udpcp_crc_errors);
+			return;
+		}
+	}
+
+	dest = __find_dest(sk, ip_hdr(skb)->saddr, udp_hdr(skb)->source);
+
+	if (!dest) {
+		/*
+		 * new communication destination must start with an sync message
+		 */
+		if (((msginfo & UDPCP_MSG_TYPE_MASK) != UDPCP_MSG_TYPE_DATA) ||
+		    (uh->msgid != 0)) {
+			kfree_skb(skb);
+			return;
+		}
+
+		dest = new_dest(sk, ip_hdr(skb)->saddr, udp_hdr(skb)->source);
+
+		if (!dest) {
+			kfree_skb(skb);
+			return;
+		}
+	}
+
+	/*
+	 * handle message type
+	 */
+	switch (msginfo & UDPCP_MSG_TYPE_MASK) {
+	case UDPCP_MSG_TYPE_DATA:
+		if (!(dest->use_flag & RX_NODE)) {
+			dest->use_flag |= RX_NODE;
+			usk->stat.rxNodes++;
+			atomic_inc(&udpcp_rx_nodes);
+		}
+
+		ret = udpcp_handle_data(sk, skb, dest);
+
+		if (ret > 0) {
+			dest->rx_discarded_frags++;
+			usk->stat.rxDiscardedFrags++;
+			atomic_inc(&udpcp_rx_discarded_frags);
+		}
+		break;
+	case UDPCP_MSG_TYPE_ACK:
+		udpcp_handle_ack(sk, skb, dest);
+	default:
+		ret = 1;
+		break;
+	}
+	if (ret)
+		kfree_skb(skb);
+}
+
+/*
+ * Set socket options
+ */
+static int udpcp_setsockopt(struct sock *sk, int level, int optname,
+			    char __user *optval, unsigned int optlen)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	int val, ret;
+
+	if (level != SOL_UDPCP) {
+		if (udp_prot.setsockopt) {
+			ret = udp_prot.setsockopt(sk, level, optname, optval,
+						optlen);
+			check_timeout(sk);
+			return ret;
+		}
+		return -ENOPROTOOPT;
+	}
+
+	if (optlen < sizeof(int))
+		return -EINVAL;
+
+	if (get_user(val, (int __user *)optval))
+		return -EFAULT;
+
+	switch (optname) {
+	case UDPCP_OPT_TRANSFER_MODE:
+		switch (val) {
+		case UDPCP_NOACK:
+		case UDPCP_ACK:
+		case UDPCP_SINGLE_ACK:
+			usk->ackmode = val;
+			break;
+		default:
+			return -EINVAL;
+		}
+		break;
+	case UDPCP_OPT_CHECKSUM_MODE:
+		switch (val) {
+		case UDPCP_NOCHECKSUM:
+		case UDPCP_CHECKSUM:
+			usk->chkmode = val;
+			break;
+		default:
+			return -EINVAL;
+		}
+		break;
+
+	case UDPCP_OPT_TX_TIMEOUT:
+		if ((val < 1) || (val > UDPCP_MAX_WAIT_SEC * 1000))
+			return -EINVAL;
+		usk->tx_timeout = msecs_to_jiffies(val);
+		break;
+
+	case UDPCP_OPT_RX_TIMEOUT:
+		if ((val < 1) || (val > UDPCP_MAX_WAIT_SEC * 1000))
+			return -EINVAL;
+		usk->rx_timeout = msecs_to_jiffies(val);
+		break;
+
+	case UDPCP_OPT_MAXTRY:
+		if ((val < 1) || (val > 10))
+			return -EINVAL;
+		usk->maxtry = val;
+		break;
+
+	case UDPCP_OPT_OUTSTANDING_ACKS:
+		if ((val < 1) || (val > 255))
+			return -EINVAL;
+		usk->acks = val;
+		break;
+
+	default:
+		return -ENOPROTOOPT;
+	}
+	return 0;
+}
+
+/*
+ * Get socket options
+ */
+static int udpcp_getsockopt(struct sock *sk, int level, int optname,
+			    char __user *optval, int __user *optlen)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	int val, len, ret;
+
+	if (level != SOL_UDPCP) {
+		if (udp_prot.getsockopt) {
+			ret = udp_prot.getsockopt(sk, level, optname, optval,
+						optlen);
+			check_timeout(sk);
+			return ret;
+		}
+		return -ENOPROTOOPT;
+	}
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+
+	len = min_t(unsigned int, len, sizeof(int));
+
+	if (len < 0)
+		return -EINVAL;
+
+	switch (optname) {
+	case UDPCP_OPT_TRANSFER_MODE:
+		val = usk->ackmode;
+		break;
+
+	case UDPCP_OPT_CHECKSUM_MODE:
+		val = usk->chkmode;
+		break;
+
+	case UDPCP_OPT_TX_TIMEOUT:
+		val = jiffies_to_msecs(usk->tx_timeout);
+		break;
+
+	case UDPCP_OPT_MAXTRY:
+		val = usk->maxtry;
+		break;
+
+	case UDPCP_OPT_OUTSTANDING_ACKS:
+		val = usk->acks;
+		break;
+
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	if (put_user(len, optlen))
+		return -EFAULT;
+	if (copy_to_user(optval, &val, len))
+		return -EFAULT;
+	return 0;
+}
+
+/*
+ * ioctl() requests applicable to the UDPCP protocol
+ */
+int udpcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
+{
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	int ret = 0;
+
+	switch (cmd) {
+	case UDPCP_IOCTL_GET_STATISTICS:
+		lock_sock(sk);
+		if (copy_to_user((void *)arg, &usk->stat, sizeof(usk->stat)))
+			ret = -EFAULT;
+		udpcp_release_sock(sk);
+		break;
+
+	case UDPCP_IOCTL_RESET_STATISTICS:
+		lock_sock(sk);
+		usk->stat.txMsgs = 0;
+		usk->stat.rxMsgs = 0;
+		usk->stat.txTimeout = 0;
+		usk->stat.rxTimeout = 0;
+		usk->stat.txRetries = 0;
+		usk->stat.rxDiscardedFrags = 0;
+		usk->stat.crcErrors = 0;
+		udpcp_release_sock(sk);
+		break;
+
+	case UDPCP_IOCTL_SYNC:
+		if (arg)
+			ret = wait_event_interruptible_timeout(usk->wq,
+				!usk->pending, msecs_to_jiffies(arg));
+		else
+			ret = wait_event_interruptible(usk->wq, !usk->pending);
+
+		break;
+
+	default:
+		if (udp_prot.ioctl) {
+			ret = udp_prot.ioctl(sk, cmd, arg);
+			check_timeout(sk);
+		} else {
+			ret = -ENOIOCTLCMD;
+		}
+		break;
+	}
+	return ret;
+}
+
+/*
+ * This function will be called by recv(), recvfrom() and revmsg()
+ */
+int udpcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+		  size_t len, int noblock, int flags, int *addr_len)
+{
+	int ret;
+
+	ret = udp_prot.recvmsg(iocb, sk, msg, len, noblock, flags, addr_len);
+	check_timeout(sk);
+	return ret;
+}
+
+/*
+ * This function will be called by socket() and initialized the socket
+ */
+static int udpcp_sockinit(struct sock *sk)
+{
+	int ret;
+	struct udpcp_sock *usk;
+
+	sk->sk_protocol = SOL_UDP;
+	sk->sk_allocation = GFP_ATOMIC;
+	if (udp_prot.init) {
+		ret = udp_prot.init(sk);
+
+		if (ret)
+			return ret;
+	}
+
+	usk = udpcp_sk(sk);
+	usk->timer.expires = 0;
+	usk->timer.function = udpcp_timeout;
+	usk->timer.data = (long)sk;
+	init_timer(&usk->timer);
+	INIT_LIST_HEAD(&usk->destlist);
+	init_waitqueue_head(&usk->wq);
+	usk->pending = 0;
+	usk->ackmode = UDPCP_ACK;
+	usk->chkmode = UDPCP_CHECKSUM;
+	usk->maxtry = UDPCP_TX_MAXTRY;
+	usk->acks = UDPCP_OUTSTANDING_ACKS;
+	usk->tx_timeout = msecs_to_jiffies(UDPCP_TX_TIMEOUT);
+	usk->rx_timeout = msecs_to_jiffies(UDPCP_RX_TIMEOUT);
+	usk->udp_data_ready = sk->sk_data_ready;
+	sk->sk_data_ready = udpcp_data_ready;
+	usk->udpsock.pending = 0;
+	skb_queue_head_init(&usk->assembly);
+	usk->assembly_len = 0;
+	usk->assembly_dest = NULL;
+
+	spin_lock_bh(&udpcp_lock);
+	list_add_tail(&usk->udpcplist, &udpcp_list);
+	spin_unlock_bh(&udpcp_lock);
+
+#ifdef MODULE
+	try_module_get(THIS_MODULE);
+#endif
+	return 0;
+}
+
+/*
+ * This function will be called by close()
+ */
+static void udpcp_destroy(struct sock *sk)
+{
+	struct list_head *p;
+	struct list_head *n;
+	struct udpcp_sock *usk = udpcp_sk(sk);
+
+	spin_lock_bh(&udpcp_lock);
+	list_del(&usk->udpcplist);
+	spin_unlock_bh(&udpcp_lock);
+
+	if (udp_prot.destroy)
+		udp_prot.destroy(sk);
+
+	lock_sock(sk);
+
+	del_timer_sync(&usk->timer);
+	sk->sk_data_ready = usk->udp_data_ready;
+
+	skb_queue_purge(&usk->assembly);
+
+	list_for_each_safe(p, n, &usk->destlist) {
+		struct udpcp_dest *dest;
+
+		dest = list_to_udpcpdest(p);
+
+		skb_queue_purge(&dest->xmit);
+
+		kfree_skb(dest->recv_msg);
+
+		if (dest->rt)
+			dst_release(&dest->rt->dst);
+
+		kfree(dest);
+	}
+
+	atomic_sub(usk->stat.txNodes, &udpcp_tx_nodes);
+	atomic_sub(usk->stat.rxNodes, &udpcp_rx_nodes);
+
+	usk->pending = 0;
+
+	if (waitqueue_active(&usk->wq))
+		wake_up_interruptible(&usk->wq);
+
+	release_sock(sk);
+
+#ifdef MODULE
+	module_put(THIS_MODULE);
+#endif
+}
+
+static struct proto udpcp_prot;
+
+/*
+ * inet protocol stack descriptor
+ */
+static struct inet_protosw udpcp_protosw = {
+	.type = SOCK_DGRAM,
+	.protocol = PF_UDPCP,
+	.prot = &udpcp_prot,
+	.ops = &inet_dgram_ops,
+	.no_check = UDP_CSUM_DEFAULT,
+	.flags = 0,
+};
+
+#ifdef CONFIG_PROC_FS
+/*
+ * The following functions handles the /proc/net/udpcp entry
+ */
+struct udpcp_seq_afinfo {
+	char *name;
+	const struct file_operations seq_fops;
+	const struct seq_operations seq_ops;
+};
+
+struct udpcp_iter_state {
+	struct seq_net_private p;
+	struct sock *sk;
+	struct list_head *list;
+	int bucket;
+};
+
+static int udpcp_get_destlist(struct udpcp_sock *usk,
+			struct udpcp_iter_state *state)
+{
+	struct sock *sk = (struct sock *)usk;
+
+	if (sock_flag(sk, SOCK_DEAD))
+		return 0;
+
+	sock_hold(sk);
+	if (!list_empty(&usk->destlist)) {
+		state->sk = sk;
+		state->list = &usk->destlist;
+		return 1;
+	}
+	sock_put(sk);
+
+	return 0;
+}
+
+static inline int udpcp_next_dest(struct udpcp_iter_state *state)
+{
+	struct sock *sk = state->sk;
+	struct udpcp_sock *usk = udpcp_sk(sk);
+	int found = 0;
+
+	if (sock_flag(sk, SOCK_DEAD))
+		return 0;
+
+	lock_sock(sk);
+	if (!list_is_last(state->list, &usk->destlist)) {
+		state->list = state->list->next;
+		state->bucket++;
+		found = 1;
+	}
+	udpcp_release_sock(sk);
+	return found;
+}
+
+static void *udpcp_get_next(struct seq_file *seq)
+{
+	struct udpcp_iter_state *state = seq->private;
+	struct udpcp_sock *usk;
+	struct sock *sk;
+
+	while (state) {
+		if (udpcp_next_dest(state))
+			return state;
+
+		sk = state->sk;
+		usk = udpcp_sk(sk);
+
+		spin_lock_bh(&udpcp_lock);
+		while (!list_is_last(&usk->udpcplist, &udpcp_list)) {
+			usk = list_entry(usk->udpcplist.next, struct udpcp_sock,
+				       udpcplist);
+
+			if (udpcp_get_destlist(usk, state))
+				goto found;
+		}
+		state->sk = NULL;
+		state = NULL;
+found:
+		spin_unlock_bh(&udpcp_lock);
+		sock_put(sk);
+	}
+	return state;
+}
+
+static void *udpcp_get_first(struct seq_file *seq)
+{
+	struct list_head *p;
+	struct udpcp_iter_state *state = seq->private;
+	int found = 0;
+
+	if (!state)
+		return NULL;
+
+	spin_lock_bh(&udpcp_lock);
+	list_for_each(p, &udpcp_list) {
+		found = udpcp_get_destlist(list_to_udpcpsock(p), state);
+		if (found)
+			goto found;
+	}
+found:
+	spin_unlock_bh(&udpcp_lock);
+
+	if (!found)
+		return NULL;
+	return udpcp_get_next(seq);
+}
+
+static void *udpcp_get_idx(struct seq_file *seq, loff_t pos)
+{
+	if (!udpcp_get_first(seq))
+		return NULL;
+
+	while (pos--) {
+		if (!udpcp_get_next(seq))
+			return NULL;
+	}
+	return seq->private;
+}
+
+static void *udpcp_seq_start(struct seq_file *seq, loff_t * pos)
+{
+	return *pos ? udpcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
+}
+
+static void *udpcp_seq_next(struct seq_file *seq, void *v, loff_t * pos)
+{
+	void *private;
+
+	if (v == SEQ_START_TOKEN)
+		private = udpcp_get_idx(seq, 0);
+	else
+		private = udpcp_get_next(seq);
+
+	++*pos;
+	return private;
+}
+
+static void udpcp_seq_stop(struct seq_file *seq, void *v)
+{
+	struct udpcp_iter_state *state = seq->private;
+
+	if (state->sk)
+		sock_put(state->sk);
+}
+
+static int udpcp_seq_open(struct inode *inode, struct file *file)
+{
+	struct udpcp_seq_afinfo *afinfo = PDE(inode)->data;
+	int err;
+
+	err = seq_open_net(inode, file, &afinfo->seq_ops,
+			   sizeof(struct udpcp_iter_state));
+	if (err < 0)
+		return err;
+
+	return err;
+}
+
+int udpcp_proc_register(struct net *net, struct udpcp_seq_afinfo *afinfo)
+{
+	struct proc_dir_entry *p;
+	int rc = 0;
+
+	p = proc_create_data(afinfo->name, S_IRUGO, net->proc_net,
+			     &afinfo->seq_fops, afinfo);
+	if (!p)
+		rc = -ENOMEM;
+	return rc;
+}
+
+void udpcp_proc_unregister(struct net *net, struct udpcp_seq_afinfo *afinfo)
+{
+	proc_net_remove(net, afinfo->name);
+}
+
+static unsigned int udpcp_tx_queue_len(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct sk_buff *skb;
+	unsigned int n = 0;
+
+	skb_queue_walk(&dest->xmit, skb)
+	    n += skb->len;
+	return n;
+}
+
+static unsigned int udpcp_rx_queue_len(struct sock *sk, struct udpcp_dest *dest)
+{
+	struct sk_buff *skb;
+	unsigned int n = 0;
+
+	skb_queue_walk(&sk->sk_receive_queue, skb) {
+		if (udp_hdr(skb)->source == dest->port
+		    && ip_hdr(skb)->saddr == dest->addr)
+			n += skb->len;
+	}
+	return n;
+}
+
+static void udpcp_format_sock(struct seq_file *seq, int *len)
+{
+	struct udpcp_iter_state *state = seq->private;
+	struct sock *sk = state->sk;
+	struct inet_sock *inet = inet_sk(sk);
+	struct udpcp_dest *p = list_to_udpcpdest(state->list);
+	__be32 src = inet->inet_rcv_saddr;
+	__u16 srcp = ntohs(inet->inet_sport);
+	__be32 dest = p->addr;
+	__u16 destp = ntohs(p->port);
+
+	lock_sock(sk);
+	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
+		   " %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %u%n",
+		   state->bucket, src, srcp, dest, destp, sk->sk_state,
+		   udpcp_tx_queue_len(sk, p),
+		   udpcp_rx_queue_len(sk, p),
+		   0, 0L, p->tx_retries, sock_i_uid(sk),
+		   p->tx_timeout, sock_i_ino(sk),
+		   atomic_read(&sk->sk_refcnt), sk, p->rx_timeout,
+		   len);
+	udpcp_release_sock(sk);
+}
+
+int udpcp_seq_show(struct seq_file *seq, void *v)
+{
+	if (v == SEQ_START_TOKEN) {
+		seq_printf(seq, "%-127s\n",
+			   "  sl  local_address rem_address   st tx_queue "
+			   "rx_queue tr tm->when retrnsmt   uid  timeout "
+			   "inode ref pointer drops");
+	} else {
+		int len;
+
+		udpcp_format_sock(seq, &len);
+		seq_printf(seq, "%*s\n", 127 - len, "");
+	}
+	return 0;
+}
+
+static struct udpcp_seq_afinfo udpcp_seq_afinfo = {
+	.name = "udpcp",
+	.seq_fops = {
+			.owner = THIS_MODULE,
+			.open = udpcp_seq_open,
+			.read = seq_read,
+			.llseek = seq_lseek,
+			.release = seq_release_net,
+		     },
+	.seq_ops = {
+			.show = udpcp_seq_show,
+			.start = udpcp_seq_start,
+			.next = udpcp_seq_next,
+			.stop = udpcp_seq_stop,
+		    },
+};
+
+static int udpcp_proc_init_net(struct net *net)
+{
+	return udpcp_proc_register(net, &udpcp_seq_afinfo);
+}
+
+static void udpcp_proc_exit_net(struct net *net)
+{
+	udpcp_proc_unregister(net, &udpcp_seq_afinfo);
+}
+
+static struct pernet_operations udpcp_net_ops = {
+	.init = udpcp_proc_init_net,
+	.exit = udpcp_proc_exit_net,
+};
+
+static int __init udpcp_proc_init(void)
+{
+	return register_pernet_subsys(&udpcp_net_ops);
+}
+
+static void udpcp_proc_exit(void)
+{
+	unregister_pernet_subsys(&udpcp_net_ops);
+}
+#endif /* CONFIG_PROC_FS */
+
+/*
+ * Install and init module
+ */
+static int __init udpcp_init(void)
+{
+	int ret;
+	struct proc_dir_entry *proc_entry = NULL;
+
+	spin_lock_init(&udpcp_lock);
+
+	INIT_LIST_HEAD(&udpcp_list);
+
+	/*
+	 * to prevent to rewrite the whole UDP protocol,
+	 * assign struct proto udp to the struct proto udpcp
+	 */
+	udpcp_prot = udp_prot;
+
+	/*
+	 * change the protocol name
+	 */
+	strcpy(udpcp_prot.name, "UDPCP");
+
+	/*
+	 * overload the following function, all other
+	 * functions will use the UDP protocol functions
+	 */
+	udpcp_prot.sendmsg = udpcp_sendmsg;
+	udpcp_prot.sendpage = udpcp_sendpage;
+	udpcp_prot.init = udpcp_sockinit;
+	udpcp_prot.destroy = udpcp_destroy;
+	udpcp_prot.setsockopt = udpcp_setsockopt;
+	udpcp_prot.getsockopt = udpcp_getsockopt;
+	udpcp_prot.ioctl = udpcp_ioctl;
+	udpcp_prot.recvmsg = udpcp_recvmsg;
+
+	/*
+	 * fix the object size for the embedded udpcp_sock structure
+	 */
+	udpcp_prot.obj_size = sizeof(struct udpcp_sock);
+
+	/*
+	 * register the UDPCP protocol
+	 */
+	ret = proto_register(&udpcp_prot, 1);
+	if (ret)
+		return ret;
+
+	/*
+	 * register the inet socket for UDPCP
+	 */
+	inet_register_protosw(&udpcp_protosw);
+
+	/*
+	 * register the /proc/sys/net/ipv4/udpcp_ entries
+	 */
+	udpcp_ctl_table =
+		register_sysctl_paths(net_ipv4_ctl_path, ipv4_udpcp_table);
+	if (udpcp_ctl_table == NULL) {
+		ret = -ENOMEM;
+		goto err1;
+	}
+
+#ifdef CONFIG_PROC_FS
+	/*
+	 * register /proc/driver/udpcp entry
+	 */
+	proc_entry =
+	    create_proc_read_entry(UDPCP_PROC, S_IRUSR | S_IRGRP | S_IROTH,
+				   NULL, udpcp_proc, NULL);
+
+	if (!proc_entry) {
+		ret = -ENOMEM;
+		goto err2;
+	}
+	/*
+	 * register /proc/net/udpcp entry
+	 */
+	ret = udpcp_proc_init();
+
+	if (ret)
+		goto err3;
+#endif
+	pr_info("UDPCP protocol stack\n");
+	return 0;
+#ifdef CONFIG_PROC_FS
+err3:
+	remove_proc_entry(UDPCP_PROC, NULL);
+err2:
+	unregister_sysctl_table(udpcp_ctl_table);
+#endif
+err1:
+	inet_unregister_protosw(&udpcp_protosw);
+	proto_unregister(&udpcp_prot);
+	return ret;
+}
+
+/*
+ * Cleanup and exit module
+ */
+static void __exit udpcp_exit(void)
+{
+#ifdef CONFIG_PROC_FS
+	udpcp_proc_exit();
+	remove_proc_entry(UDPCP_PROC, NULL);
+#endif
+	unregister_sysctl_table(udpcp_ctl_table);
+	inet_unregister_protosw(&udpcp_protosw);
+	proto_unregister(&udpcp_prot);
+}
+
+module_init(udpcp_init);
+module_exit(udpcp_exit);
+
+MODULE_AUTHOR("Stefani Seibold <stefani@seibold.net>");
+MODULE_DESCRIPTION("UDPCP protocol stack");
+MODULE_LICENSE("GPL");
+
-- 
1.7.3.4

^ permalink raw reply related

* [PATCH 1/1] bridge: stp: ensure mac header is set
From: Florian Westphal @ 2011-01-03 14:16 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, acme

commit bf9ae5386bca8836c16e69ab8fdbe46767d7452a
(llc: use dev_hard_header) removed the
skb_reset_mac_header call from llc_mac_hdr_init.

This seems fine itself, but br_send_bpdu() invokes ebtables LOCAL_OUT.

We oops in ebt_basic_match() because it assumes eth_hdr(skb) returns
a meaningful result.

Cc: acme@ghostprotocols.net
References: https://bugzilla.kernel.org/show_bug.cgi?id=24532
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/bridge/br_stp_bpdu.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/bridge/br_stp_bpdu.c b/net/bridge/br_stp_bpdu.c
index 3d9a55d..289646e 100644
--- a/net/bridge/br_stp_bpdu.c
+++ b/net/bridge/br_stp_bpdu.c
@@ -50,6 +50,8 @@ static void br_send_bpdu(struct net_bridge_port *p,
 
 	llc_mac_hdr_init(skb, p->dev->dev_addr, p->br->group_addr);
 
+	skb_reset_mac_header(skb);
+
 	NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_OUT, skb, NULL, skb->dev,
 		dev_queue_xmit);
 }
-- 
1.7.2.2


^ permalink raw reply related

* Re: [PATCH] new UDPCP Communication Protocol
From: Stefani Seibold @ 2011-01-03 14:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Juhl, linux-kernel, akpm, davem, netdev, shemminger,
	daniel.baluta, jochen
In-Reply-To: <1294051199.2892.198.camel@edumazet-laptop>

Am Montag, den 03.01.2011, 11:39 +0100 schrieb Eric Dumazet:
> Le lundi 03 janvier 2011 à 10:54 +0100, Stefani Seibold a écrit :
> 
> > How can you do a routing, how can you determinate the MTU of the route.
> > This are basics. Look into other code how this things will be handled is
> > in my opinion the right way, since there a no function provide to do
> > this.
> > 
> 
> Hmm, how user land can perform this task then ?
> 

Userspace is much more complicate and more overhead than kernel space.
The UDPCP implementation in userspace is about the factor 10 slower.

> Is there an open source implementation of UDPCP ?
> 

I don't know any. These is the first one.

> What are its problems ? You say its dog slow, I really wonder why.
> UDP stack is pretty scalable these days, yet some improvements are
> possible.
> 

UDP is fast... but UDPCP depends extremely on latency due the missing of
sliding windows. 

> Why not adding generic helpers if you believe you miss some
> infrastructure ? This could benefit to other 'stacks' as well.
> 

Maybe i don't have the knowledge, maybe i don't have the time. Getting
in new API functions into LINUX is much more complicate than getting new
driver into LINUX. I know what i am talk, it takes me one year to the
new kfifo API (kfifo.c, kfifo.h) into the kernel.

> > Otherwise you can say the same about all the filesystem or PCI
> > drvivers , which do also a lot in the same way. But since this is the
> > way to do it, it is the right way.
> > 
> 
> These drivers are here because of high performance on top of high
> performance specs.
> 
> While UDPCP is only a layer above UDP. If the problem comes from UDP
> being too slow, it'll be slow too.
> 

Because of latency. Handling the UDPCP into the data_read() bh function
is much faster:
- No context switch
- Assembly Multi-Fragment Message is very efficient using skb buffer
chaining.
- Immediately handling an ack or data message save a lot of latency

Implementing it in User Space is to slow, due the context switches. Also
the sunrpc approach is not faster due the using of kernel threads which
are not better than user space (okay, a little bit because not switching
the MMU).

The implementation is clean. I did fix all issues what i was asked for.
The protocol has now absolut no side effects. So i ask again for merge
into linux-next.

- Stefani

^ permalink raw reply

* Re: [RFC] net_sched: mark packet staying on queue too long
From: Eric Dumazet @ 2011-01-03 14:02 UTC (permalink / raw)
  To: hadi
  Cc: Jarek Poplawski, David Miller, Jesper Dangaard Brouer,
	Patrick McHardy, netdev
In-Reply-To: <1294062755.2472.11.camel@mojatatu>

Le lundi 03 janvier 2011 à 08:52 -0500, jamal a écrit :
> On Sun, 2011-01-02 at 22:27 +0100, Eric Dumazet wrote:
> > While playing with SFQ and other AQM, I was bothered to see how easy it
> > was for a single tcp flow to 'fill the pipe' and consume lot of memory
> > buffers in queues. I know Jesper use more than 50.000 SFQ on his
> > routers, and with GRO packets this can consume a lot of memory.
> > 
> > I played a bit adding ECN in SFQ, first by marking packets for a
> > particular flow if this flow qlen was above a given threshold, and later
> > using another trick : ECN mark packet if it stayed longer than a given
> > delay in the queue. This of course could be done on other modules, what
> > do you think ?
> > 
> 
> I think for this to be effective, it would require maintaining some
> history of the effect (some form of moving window average)
> and probably a randomness in marking instead of a deterministic one.
> Something like what Stochastic Fair RED/BLUE Queueing does.
> Otherwise you get a burst of marked packets then silence then a burst
> etc (i.e the classical synchronization effect).
> 

I got fairly good results here, but admit-idly on a LAN.

Yep, maybe adding RED on each SFQ slot ;) Should be fairly cheap, and
actually needed in case ECN is not possible and we must earlly drop
instead.

I found BLUE very expensive in term of cache line accesses. Especially
with double hashing.

> It would probably be more effective to provide feedback to the local tcp
> since we can detect this locally instead of waiting to some round trip
> (or half roundtrip) effect at the receiver with ECN i.e in the same
> spirit as NET_XMIT_CN but for which local TCP does something useful with
> that info (instead of "retransmit shortly"). But even that would require
> maintaining some state on the scheduler per hash in this case....
> 

local tcp, for a router ? Hmm... But yes I see your point.

Speaking of ECN marking, it seems we (in RED/GRED or tunnels) change skb
data even if it is shared (can happen on ingress path)

Probably harmless, but tcpdump can show ECN bit being marked even on skb
snapshot before ingress (and later, ECN marked) or tunnels, while it
came unset from the wire.

Is it worth fixing this ? maybe using skb_make_writable() [once moved to
core network from netfilter]




^ permalink raw reply

* Re: [RFC] net_sched: mark packet staying on queue too long
From: jamal @ 2011-01-03 13:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jarek Poplawski, David Miller, Jesper Dangaard Brouer,
	Patrick McHardy, netdev
In-Reply-To: <1294003631.2535.253.camel@edumazet-laptop>

On Sun, 2011-01-02 at 22:27 +0100, Eric Dumazet wrote:
> While playing with SFQ and other AQM, I was bothered to see how easy it
> was for a single tcp flow to 'fill the pipe' and consume lot of memory
> buffers in queues. I know Jesper use more than 50.000 SFQ on his
> routers, and with GRO packets this can consume a lot of memory.
> 
> I played a bit adding ECN in SFQ, first by marking packets for a
> particular flow if this flow qlen was above a given threshold, and later
> using another trick : ECN mark packet if it stayed longer than a given
> delay in the queue. This of course could be done on other modules, what
> do you think ?
> 

I think for this to be effective, it would require maintaining some
history of the effect (some form of moving window average)
and probably a randomness in marking instead of a deterministic one.
Something like what Stochastic Fair RED/BLUE Queueing does.
Otherwise you get a burst of marked packets then silence then a burst
etc (i.e the classical synchronization effect).

It would probably be more effective to provide feedback to the local tcp
since we can detect this locally instead of waiting to some round trip
(or half roundtrip) effect at the receiver with ECN i.e in the same
spirit as NET_XMIT_CN but for which local TCP does something useful with
that info (instead of "retransmit shortly"). But even that would require
maintaining some state on the scheduler per hash in this case....

cheers,
jamal

^ permalink raw reply

* RE: [net-next-2.6 08/08] r8169: more 8168dp support.
From: hayeswang @ 2011-01-03 12:50 UTC (permalink / raw)
  To: 'Francois Romieu', davem; +Cc: netdev, 'Ben Hutchings'
In-Reply-To: <20110102233751.GI5780@electric-eye.fr.zoreil.com>

> From: Francois Romieu [mailto:romieu@fr.zoreil.com] 
> Sent: Monday, January 03, 2011 7:38 AM
> To: davem@davemloft.net
> Cc: netdev@vger.kernel.org; Hayeswang; Ben Hutchings
> Subject: [net-next-2.6 08/08] r8169: more 8168dp support.
> 
> Adapted from version 8.019.00 of Realtek's r8168 driver
> 
> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
> Cc: Hayes <hayeswang@realtek.com>
> ---
> @@ -3038,8 +3104,10 @@ static void __devexit 
> rtl8169_remove_one(struct pci_dev *pdev)
>  	struct net_device *dev = pci_get_drvdata(pdev);
>  	struct rtl8169_private *tp = netdev_priv(dev);
>  
> -	if (tp->mac_version == RTL_GIGA_MAC_VER_27)
> +	if ((tp->mac_version == RTL_GIGA_MAC_VER_27) ||
> +	    (tp->mac_version == RTL_GIGA_MAC_VER_28)) {
>  		rtl8168_driver_stop(tp);
> +	}
>  
>  	cancel_delayed_work_sync(&tp->task);
>  
> @@ -3122,11 +3190,19 @@ err_pm_runtime_put:
>  	goto out;
>  }
>  
> -static void rtl8169_hw_reset(void __iomem *ioaddr)
> +static void rtl8169_hw_reset(struct rtl8169_private *tp)
>  {
> +	void __iomem *ioaddr = tp->mmio_addr;
> +
>  	/* Disable interrupts */
>  	rtl8169_irq_mask_and_ack(ioaddr);
>  
> +	if (tp->mac_version == RTL_GIGA_MAC_VER_28) {

This check should include RTL_GIGA_MAC_VER_27.

> +		while (RTL_R8(TxPoll) & NPQ)
> +			udelay(20);
> +
> +	}
> +
>  	/* Reset the chipset */
>  	RTL_W8(ChipCmd, CmdReset);
>  

After the reset, there are something to do for RTL_GIGA_MAC_VER_27. You may
check the soure code of realtek. Find "rtl8168_nic_reset".
 
Best Regards,
Hayes



^ permalink raw reply

* RE: [net-next-2.6 04/08] r8169: 8168DP specific MII registers access methods.
From: hayeswang @ 2011-01-03 12:30 UTC (permalink / raw)
  To: 'Francois Romieu', davem; +Cc: netdev, 'Ben Hutchings'
In-Reply-To: <20110102233704.GE5780@electric-eye.fr.zoreil.com>

> From: Francois Romieu [mailto:romieu@fr.zoreil.com] 
> Sent: Monday, January 03, 2011 7:37 AM
> To: davem@davemloft.net
> Cc: netdev@vger.kernel.org; Hayeswang; Ben Hutchings
> Subject: [net-next-2.6 04/08] r8169: 8168DP specific MII 
> registers access methods.
> 
> Adapted from version 8.019.00 of Realtek's r8168 driver.
> 
> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
> Cc: Hayes <hayeswang@realtek.com>
> ---
>  drivers/net/r8169.c |   83 
> +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 81 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c index 
> f9d8ff0..7f6fd12 100644
> --- a/drivers/net/r8169.c
> +++ b/drivers/net/r8169.c
> @@ -277,6 +277,20 @@ enum rtl8168_8101_registers {
>  #define	EFUSEAR_DATA_MASK		0xff
>  };
>  
> +enum rtl8168_registers {
> +	EPHY_RXER_NUM		= 0x7c,
> +	OCPDR			= 0xb0,	/* OCP GPHY access */
> +#define OCPDR_WRITE_CMD			0x80000000
> +#define OCPDR_READ_CMD			0x00000000
> +#define OCPDR_REG_MASK			0xff
> +#define OCPDR_GPHY_REG_SHIFT		12

The source code of realtek makes a mistake. The value of OCPDR_GPHY_REG_SHIFT
must be 16, not 12. The reg should be at bit 16 ~ 22.

> +#define OCPDR_DATA_MASK			0xffff
> +	OCPAR			= 0xb4,
> +#define OCPAR_FLAG			0x80000000
> +#define OCPAR_GPHY_WRITE_CMD		0x8000f060
> +#define OCPAR_GPHY_READ_CMD		0x0000f060
> +};
> +


^ permalink raw reply

* Re: [PATCH V8 08/13] posix clocks: cleanup the CLOCK_DISPTACH macro
From: Richard Cochran @ 2011-01-03 11:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	John Stultz, Krzysztof Halasa, Rodolfo Giometti, Thomas Gleixner
In-Reply-To: <1294046949.2016.36.camel@laptop>

On Mon, Jan 03, 2011 at 10:29:09AM +0100, Peter Zijlstra wrote:
> On Fri, 2010-12-31 at 20:15 +0100, Richard Cochran wrote:
> > +#define CLOCK_DISPATCH(clock, call, arglist) dispatch_##call arglist
> 
> How about you run something like:
> 
>  :% s/CLOCK_DISPATCH([^,]*, \([^,]*\), \([^)]*)\))/dispatch_\1\2/g
> 
> and remove that cruft all together?

Gladly ;^)

Richard

^ permalink raw reply

* Re: [PATCH] net: bridge: check the length of skb after nf_bridge_maybe_copy_header()
From: Changli Gao @ 2011-01-03 10:44 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, bridge, netdev
In-Reply-To: <20101231.111003.48499853.davem@davemloft.net>

On Sat, Jan 1, 2011 at 3:10 AM, David Miller <davem@davemloft.net> wrote:
> From: Changli Gao <xiaosuo@gmail.com>
> Date: Sat, 25 Dec 2010 21:41:30 +0800
>
>> Since nf_bridge_maybe_copy_header() may change the length of skb,
>> we should check the length of skb after it to handle the ppoe skbs.
>>
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>
> This is really strange.
>
> packet_length() subtracts VLAN_HLEN from the value it returns, so the
> correct fix seems to be to make this function handle the PPPOE case
> too.
>

It is correct. The actual MTU of 802.1q frame is 4 bytes larger. For
example, the MTU of ethernet is normally 1500, however the actual MTU
of the 802.1Q is 1504.

Please see this patch:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commitdiff;h=c893b8066c7bf6156e4d760e5acaf4c148e37190;hp=3c0fef0b7d36e5f8d3ea3731a8228102274e3c23

> Otherwise I suspect you have many other functions to fix as well.
>
> I'm not applying this patch.
>



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* RE: [PATCH 1/1 V3] bridge: fix br_multicast_ipv6_rcv for paged skbs
From: Johannes Berg @ 2011-01-03 10:41 UTC (permalink / raw)
  To: Winkler, Tomas
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Stephen Hemminger
In-Reply-To: <6F5C1D715B2DA5498A628E6B9C124F04019BF9E48E@hasmsx504.ger.corp.intel.com>

On Mon, 2011-01-03 at 12:17 +0200, Winkler, Tomas wrote:

> > > > > -		struct mld_msg *mld = (struct mld_msg *)icmp6h;
> > > > > +		struct mld_msg *mld;
> > > > > +		if (!pskb_may_pull(skb2, sizeof(*mld))) {
> > > > > +			err = -EINVAL;
> > > > > +			goto out;
> > > > > +		}
> > > > > +		mld = (struct mld_msg *)icmp6h;
> > > >
> > > > This (and the second instance) is incorrect afaict -- the pointer
> > > > "icmp6h" should be reloaded after the pskb_may_pull(), no?
> > >
> > > mld_msg is bigger than icmp6h by sizeof(in6_addr) so we have to try pull
> > again a bigger chunk.
> > 
> > Right, I know, the pskb_may_pull() is needed, but I believe you need to
> > re-calculate icmp6h here.
> 
> You are right, it can be moved to new memory buffer.
> 
> Probably something like that will do it:
> 
> if (!pskb_may_pull(skb2, sizeof(*mld))) {
> 	err = -EINVAL;
> 	goto out;
> }
> mld = (struct mld_msg *)skb_transport_header(skb2) 

Yeah, that'll do, although I see more callers of icmp6_hdr() here
instead even for mld stuff (which is an inline though doing just this)

johannes


^ permalink raw reply

* Re: [PATCH] new UDPCP Communication Protocol
From: Eric Dumazet @ 2011-01-03 10:39 UTC (permalink / raw)
  To: Stefani Seibold
  Cc: Jesper Juhl, linux-kernel, akpm, davem, netdev, shemminger,
	daniel.baluta, jochen
In-Reply-To: <1294048469.20187.13.camel@wall-e>

Le lundi 03 janvier 2011 à 10:54 +0100, Stefani Seibold a écrit :

> How can you do a routing, how can you determinate the MTU of the route.
> This are basics. Look into other code how this things will be handled is
> in my opinion the right way, since there a no function provide to do
> this.
> 

Hmm, how user land can perform this task then ?

Is there an open source implementation of UDPCP ?

What are its problems ? You say its dog slow, I really wonder why.
UDP stack is pretty scalable these days, yet some improvements are
possible.

Why not adding generic helpers if you believe you miss some
infrastructure ? This could benefit to other 'stacks' as well.

> Otherwise you can say the same about all the filesystem or PCI
> drvivers , which do also a lot in the same way. But since this is the
> way to do it, it is the right way.
> 

These drivers are here because of high performance on top of high
performance specs.

While UDPCP is only a layer above UDP. If the problem comes from UDP
being too slow, it'll be slow too.

^ permalink raw reply

* [PATCH V4] bridge: fix br_multicast_ipv6_rcv for paged skbs
From: Tomas Winkler @ 2011-01-03 10:37 UTC (permalink / raw)
  To: davem; +Cc: netdev, Tomas Winkler, Johannes Berg, Stephen Hemminger

use pskb_may_pull to access ipv6 header correctly for paged skbs
It was omitted in the bridge code leading to crash in blind
__skb_pull

since the skb is cloned undonditionally we also simplify the
the exit path

this fixes bug https://bugzilla.kernel.org/show_bug.cgi?id=25202

Dec 15 14:36:40 User-PC hostapd: wlan0: STA 00:15:00:60:5d:34 IEEE 802.11: authenticated
Dec 15 14:36:40 User-PC hostapd: wlan0: STA 00:15:00:60:5d:34 IEEE 802.11: associated (aid 2)
Dec 15 14:36:40 User-PC hostapd: wlan0: STA 00:15:00:60:5d:34 RADIUS: starting accounting session 4D0608A3-00000005
Dec 15 14:36:41 User-PC kernel: [175576.120287] ------------[ cut here ]------------
Dec 15 14:36:41 User-PC kernel: [175576.120452] kernel BUG at include/linux/skbuff.h:1178!
Dec 15 14:36:41 User-PC kernel: [175576.120609] invalid opcode: 0000 [#1] SMP
Dec 15 14:36:41 User-PC kernel: [175576.120749] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/uevent
Dec 15 14:36:41 User-PC kernel: [175576.121035] Modules linked in: approvals binfmt_misc bridge stp llc parport_pc ppdev arc4 iwlagn snd_hda_codec_realtek iwlcore i915 snd_hda_intel mac80211 joydev snd_hda_codec snd_hwdep snd_pcm snd_seq_midi drm_kms_helper snd_rawmidi drm snd_seq_midi_event snd_seq snd_timer snd_seq_device cfg80211 eeepc_wmi usbhid psmouse intel_agp i2c_algo_bit intel_gtt uvcvideo agpgart videodev sparse_keymap snd shpchp v4l1_compat lp hid video serio_raw soundcore output snd_page_alloc ahci libahci atl1c
Dec 15 14:36:41 User-PC kernel: [175576.122712]
Dec 15 14:36:41 User-PC kernel: [175576.122769] Pid: 0, comm: kworker/0:0 Tainted: G        W   2.6.37-rc5-wl+ #3 1015PE/1016P
Dec 15 14:36:41 User-PC kernel: [175576.123012] EIP: 0060:[<f83edd65>] EFLAGS: 00010283 CPU: 1
Dec 15 14:36:41 User-PC kernel: [175576.123193] EIP is at br_multicast_rcv+0xc95/0xe1c [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.123362] EAX: 0000001c EBX: f5626318 ECX: 00000000 EDX: 00000000
Dec 15 14:36:41 User-PC kernel: [175576.123550] ESI: ec512262 EDI: f5626180 EBP: f60b5ca0 ESP: f60b5bd8
Dec 15 14:36:41 User-PC kernel: [175576.123737]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Dec 15 14:36:41 User-PC kernel: [175576.123902] Process kworker/0:0 (pid: 0, ti=f60b4000 task=f60a8000 task.ti=f60b0000)
Dec 15 14:36:41 User-PC kernel: [175576.124137] Stack:
Dec 15 14:36:41 User-PC kernel: [175576.124181]  ec556500 f6d06800 f60b5be8 c01087d8 ec512262 00000030 00000024 f5626180
Dec 15 14:36:41 User-PC kernel: [175576.124181]  f572c200 ef463440 f5626300 3affffff f6d06dd0 e60766a4 000000c4 f6d06860
Dec 15 14:36:41 User-PC kernel: [175576.124181]  ffffffff ec55652c 00000001 f6d06844 f60b5c64 c0138264 c016e451 c013e47d
Dec 15 14:36:41 User-PC kernel: [175576.124181] Call Trace:
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c01087d8>] ? sched_clock+0x8/0x10
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c0138264>] ? enqueue_entity+0x174/0x440
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c016e451>] ? sched_clock_cpu+0x131/0x190
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c013e47d>] ? select_task_rq_fair+0x2ad/0x730
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c0524fc1>] ? nf_iterate+0x71/0x90
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f83e4914>] ? br_handle_frame_finish+0x184/0x220 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f83e4790>] ? br_handle_frame_finish+0x0/0x220 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f83e46e9>] ? br_handle_frame+0x189/0x230 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f83e4790>] ? br_handle_frame_finish+0x0/0x220 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f83e4560>] ? br_handle_frame+0x0/0x230 [bridge]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c04ff026>] ? __netif_receive_skb+0x1b6/0x5b0
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c04f7a30>] ? skb_copy_bits+0x110/0x210
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c0503a7f>] ? netif_receive_skb+0x6f/0x80
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f82cb74c>] ? ieee80211_deliver_skb+0x8c/0x1a0 [mac80211]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f82cc836>] ? ieee80211_rx_handlers+0xeb6/0x1aa0 [mac80211]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c04ff1f0>] ? __netif_receive_skb+0x380/0x5b0
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c016e242>] ? sched_clock_local+0xb2/0x190
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c012b688>] ? default_spin_lock_flags+0x8/0x10
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c05d83df>] ? _raw_spin_lock_irqsave+0x2f/0x50
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f82cd621>] ? ieee80211_prepare_and_rx_handle+0x201/0xa90 [mac80211]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f82ce154>] ? ieee80211_rx+0x2a4/0x830 [mac80211]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f815a8d6>] ? iwl_update_stats+0xa6/0x2a0 [iwlcore]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f8499212>] ? iwlagn_rx_reply_rx+0x292/0x3b0 [iwlagn]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c05d83df>] ? _raw_spin_lock_irqsave+0x2f/0x50
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f8483697>] ? iwl_rx_handle+0xe7/0x350 [iwlagn]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<f8486ab7>] ? iwl_irq_tasklet+0xf7/0x5c0 [iwlagn]
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c01aece1>] ? __rcu_process_callbacks+0x201/0x2d0
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c0150d05>] ? tasklet_action+0xc5/0x100
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c0150a07>] ? __do_softirq+0x97/0x1d0
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c05d910c>] ? nmi_stack_correct+0x2f/0x34
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c0150970>] ? __do_softirq+0x0/0x1d0
Dec 15 14:36:41 User-PC kernel: [175576.124181]  <IRQ>
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c01508f5>] ? irq_exit+0x65/0x70
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c05df062>] ? do_IRQ+0x52/0xc0
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c01036b0>] ? common_interrupt+0x30/0x38
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c03a1fc2>] ? intel_idle+0xc2/0x160
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c04daebb>] ? cpuidle_idle_call+0x6b/0x100
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c0101dea>] ? cpu_idle+0x8a/0xf0
Dec 15 14:36:41 User-PC kernel: [175576.124181]  [<c05d2702>] ? start_secondary+0x1e8/0x1ee

Cc: David Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
---
V2: Implement David Miller's suggestion 
V3: Fix the mld_msg access:
	Length check was wrong and psk_may_pull performs itself the length check
V4: mld_msg pointer need to be recalculated after pull
    simplify the exit path

 net/bridge/br_multicast.c |   29 +++++++++++++++++++----------
 1 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index f19e347..67296b7 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1430,7 +1430,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
 				 struct net_bridge_port *port,
 				 struct sk_buff *skb)
 {
-	struct sk_buff *skb2 = skb;
+	struct sk_buff *skb2;
 	struct ipv6hdr *ip6h;
 	struct icmp6hdr *icmp6h;
 	u8 nexthdr;
@@ -1469,15 +1469,16 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
 	if (!skb2)
 		return -ENOMEM;
 
+
+	err = -EINVAL;
+	if (!pskb_may_pull(skb2, offset + sizeof(struct icmp6hdr)))
+		goto out;
+
 	len -= offset - skb_network_offset(skb2);
 
 	__skb_pull(skb2, offset);
 	skb_reset_transport_header(skb2);
 
-	err = -EINVAL;
-	if (!pskb_may_pull(skb2, sizeof(*icmp6h)))
-		goto out;
-
 	icmp6h = icmp6_hdr(skb2);
 
 	switch (icmp6h->icmp6_type) {
@@ -1516,7 +1517,12 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
 	switch (icmp6h->icmp6_type) {
 	case ICMPV6_MGM_REPORT:
 	    {
-		struct mld_msg *mld = (struct mld_msg *)icmp6h;
+		struct mld_msg *mld;
+		if (!pskb_may_pull(skb2, sizeof(*mld))) {
+			err = -EINVAL;
+			goto out;
+		}
+		mld = (struct mld_msg *)skb_transport_header(skb2);
 		BR_INPUT_SKB_CB(skb2)->mrouters_only = 1;
 		err = br_ip6_multicast_add_group(br, port, &mld->mld_mca);
 		break;
@@ -1529,15 +1535,18 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
 		break;
 	case ICMPV6_MGM_REDUCTION:
 	    {
-		struct mld_msg *mld = (struct mld_msg *)icmp6h;
+		struct mld_msg *mld;
+		if (!pskb_may_pull(skb2, sizeof(*mld))) {
+			err = -EINVAL;
+			goto out;
+		}
+		mld = (struct mld_msg *)skb_transport_header(skb2);
 		br_ip6_multicast_leave_group(br, port, &mld->mld_mca);
 	    }
 	}
 
 out:
-	__skb_push(skb2, offset);
-	if (skb2 != skb)
-		kfree_skb(skb2);
+	kfree_skb(skb2);
 	return err;
 }
 #endif
-- 
1.7.3.4

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox