Netdev List
 help / color / mirror / Atom feed
* Re: Increasing TCP initial cwnd
From: Ilpo Järvinen @ 2009-10-26 13:54 UTC (permalink / raw)
  To: Yair Gottdenker; +Cc: Netdev
In-Reply-To: <fa3cddca0910260640s22d0396ey5fc51c51047bc2c@mail.gmail.com>

On Mon, 26 Oct 2009, Yair Gottdenker wrote:

> I am working on research project to control the sender initial
> congestion window size. I am trying to allow user space to set the
> initial congestion window size but with no luck. The sender always
> sends just 4 packets disregarding from the snd_cwnd.
> 
> I am working on kernel version 2.6.31.3.
> 
> I made the following changes:
> 1. tcp_ipv4.c -> in function tcp_v4_init_sock, changed from
> tp->snd_cwnd = 2 to tp->snd_cwnd = user_space_initial_cwnd.
> 2. tcp_output.c -> tcp_init_cwnd - to always return user_space_initial_cwnd.
> 3. tcp_cong.c -> tcp_slow_start- to always return user_space_initial_cwnd.
> 
> Any help will be more than welcome.

How about trying tcp_init_metrics in tcp_input.c?

-- 
 i.

^ permalink raw reply

* Re: linux-next: manual merge of the net tree with the i2c tree
From: Ben Hutchings @ 2009-10-26 13:46 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, netdev, linux-next, linux-kernel, Mika Kuoppala,
	Jean Delvare
In-Reply-To: <20091026133757.7cf87e49.sfr@canb.auug.org.au>

On Mon, 2009-10-26 at 13:37 +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the net tree got a conflict in
> drivers/net/sfc/sfe4001.c between commit
> 3f7c0648f727a6d5baf6117653e4001dc877b90b ("i2c: Prevent priority
> inversion on top of bus lock") from the i2c tree and commit
> c9597d4f89565b6562bd3026adbe6eac6c317f47 ("sfc: Merge sfe4001.c into
> falcon_boards.c") from the net tree.
> 
> I have applied the following merge fixup patch (after removing
> drivers/net/sfc/sfe4001.c) and can carry it as necessary.
> 
> -- 
> Cheers,
> Stephen Rothwell                    sfr@canb.auug.org.au
> 
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Mon, 26 Oct 2009 12:24:55 +1100
> Subject: [PATCH] net: merge fixup for drivers/net/sfc/falcon_boards.c
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* [PATCH 2/2] tc35815: Enable NAPI
From: Atsushi Nemoto @ 2009-10-26 13:46 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Ralf Roesch

This driver has NAPI code but it has been disabled.  Enable it now.
The non-napi code will be removed lator.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
---
 drivers/net/tc35815.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/tc35815.c b/drivers/net/tc35815.c
index 3d38479..0d621ca 100644
--- a/drivers/net/tc35815.c
+++ b/drivers/net/tc35815.c
@@ -22,6 +22,7 @@
  * All Rights Reserved.
  */
 
+#define TC35815_NAPI
 #ifdef TC35815_NAPI
 #define DRV_VERSION	"1.38-NAPI"
 #else
-- 
1.5.6.5


^ permalink raw reply related

* [PATCH 1/2] tc35815: Fix return value of tc35815_do_interrupt when NAPI enabled
From: Atsushi Nemoto @ 2009-10-26 13:46 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Ralf Roesch

Return received count correctly even if tx completed at the same time.
Currently NAPI is disabled for this driver so this patch does not fix
any real problem.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
---
 drivers/net/tc35815.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/tc35815.c b/drivers/net/tc35815.c
index d1298e5..3d38479 100644
--- a/drivers/net/tc35815.c
+++ b/drivers/net/tc35815.c
@@ -1592,7 +1592,12 @@ static int tc35815_do_interrupt(struct net_device *dev, u32 status)
 		lp->lstats.tx_ints++;
 		tc35815_txdone(dev);
 		netif_wake_queue(dev);
+#ifdef TC35815_NAPI
+		if (ret < 0)
+			ret = 0;
+#else
 		ret = 0;
+#endif
 	}
 	return ret;
 }
-- 
1.5.6.5


^ permalink raw reply related

* Re: [PATCH net-next-2.6]  net: sysfs: ethtool_ops can be NULL
From: Andy Gospodarek @ 2009-10-26 13:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List, Andy Gospodarek
In-Reply-To: <4AE586B5.4030308@gmail.com>

On Mon, Oct 26, 2009 at 12:23:33PM +0100, Eric Dumazet wrote:
> commit d519e17e2d01a0ee9abe083019532061b4438065
> (net: export device speed and duplex via sysfs)
> made the wrong assumption that netdev->ethtool_ops was always set.
> 
> This makes possible to crash kernel and let rtnl in locked state.
> 
> modprobe dummy
> ip link set dummy0 up
> (udev runs and crash)
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/core/net-sysfs.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 753c420..89de182 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -139,7 +139,9 @@ static ssize_t show_speed(struct device *dev,
>  	if (!rtnl_trylock())
>  		return restart_syscall();
>  
> -	if (netif_running(netdev) && netdev->ethtool_ops->get_settings) {
> +	if (netif_running(netdev) &&
> +	    netdev->ethtool_ops &&
> +	    netdev->ethtool_ops->get_settings) {
>  		struct ethtool_cmd cmd = { ETHTOOL_GSET };
>  
>  		if (!netdev->ethtool_ops->get_settings(netdev, &cmd))
> @@ -158,7 +160,9 @@ static ssize_t show_duplex(struct device *dev,
>  	if (!rtnl_trylock())
>  		return restart_syscall();
>  
> -	if (netif_running(netdev) && netdev->ethtool_ops->get_settings) {
> +	if (netif_running(netdev) &&
> +	    netdev->ethtool_ops &&
> +	    netdev->ethtool_ops->get_settings) {
>  		struct ethtool_cmd cmd = { ETHTOOL_GSET };
>  
>  		if (!netdev->ethtool_ops->get_settings(netdev, &cmd))

Nice catch, Eric.

Acked-by: Andy Gospodarek <andy@greyhouse.net>


^ permalink raw reply

* Increasing TCP initial cwnd
From: Yair Gottdenker @ 2009-10-26 13:40 UTC (permalink / raw)
  To: netdev

Hi all,

I am working on research project to control the sender initial
congestion window size. I am trying to allow user space to set the
initial congestion window size but with no luck. The sender always
sends just 4 packets disregarding from the snd_cwnd.

I am working on kernel version 2.6.31.3.

I made the following changes:
1. tcp_ipv4.c -> in function tcp_v4_init_sock, changed from
tp->snd_cwnd = 2 to tp->snd_cwnd = user_space_initial_cwnd.
2. tcp_output.c -> tcp_init_cwnd - to always return user_space_initial_cwnd.
3. tcp_cong.c -> tcp_slow_start- to always return user_space_initial_cwnd.

Any help will be more than welcome.

Yair

^ permalink raw reply

* Re: [PATCH next-next-2.6] netdev: better dev_name_hash
From: Krishna Kumar2 @ 2009-10-26 13:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Hagen Paul Pfeifer, netdev, Octavian Purdila
In-Reply-To: <4AE53382.8020308@gmail.com>

Eric Dumazet wrote on 10/26/2009 10:58:34 AM:

> In fact, new10 should be the 'perfect' hash for the "eth%d"
> netdev use, not jhash (way more expensive in cpu cycles BTW)
>
> Most linux machines in the world have less than 10 interfaces, jhash
> would be really overkill.
>
>
> Thanks
>
>
> [PATCH net-next-2.6] netdev: better dev_name_hash

Changing Eric's test program to pass a multiplier to string_hash()
and calling string_hash with multipler=<2 - 63> confirms that 10
is almost always the best number for varying netdev names. I print
the number of times perfect 64 was scored, and for most passed
device names, the best is for n=10, followed by n=5 and others.
Almost the worst was n=31, slightly better was n=17.

But other variables matter too, like fewer entries (4K or 1K) but
above values for are still better compared to n=31.

Thanks,

- KK

#include <stdio.h>
#include <string.h>

#define NUM_ENTRIES 1024

static inline unsigned int
string_hash_n(const unsigned char *name, unsigned int len, int n)
{
        unsigned hash = 0;
        int i;

        for (i = 0; i < len; ++i)
                hash = n * hash + name[i];
        return hash;
}

int best_n = -1, best_n_val = -1;

void print_dist(unsigned int *dist, int n)
{
        unsigned int i;
        int perfect = 0;


printf("-----------------------------------------------------------\n");
        printf("n=%d[] = {\n", n);
        for (i = 0; i < 256; i++) {
                if (dist[i] == NUM_ENTRIES/256)
                        perfect++;
                printf("%d, ", dist[i]);
                if ((i & 15) == 15) putchar('\n');
        }
        if (perfect > best_n_val) {
                best_n_val = perfect;
                best_n = n;
        }
        printf("}; (PERFECT = %d (n=%d))\n", perfect, n);
}

int main(int argc, char *argv[])
{
        int n, i;
        char *str, name[16];
        unsigned int hash, hashdist[256];

        if (argc == 1)
                str="eth";
        else
                str=argv[1];

        for (n = 2; n < 63; n++) {
                memset(hashdist, 0, 256*sizeof(int));
                for (i = 0; i < NUM_ENTRIES; i++) {
                        unsigned int len = sprintf(name, "%s%d", str, i);

                        hash = string_hash_n(name, len, n);
                        hashdist[hash & 255]++;
                }
                print_dist(hashdist, n);
        }
        printf("Best N was %d, and perfect entries: %d\n",
                best_n, best_n_val);
        return 0;
}


^ permalink raw reply

* [PATCH net-next-2.6]  net: sysfs: ethtool_ops can be NULL
From: Eric Dumazet @ 2009-10-26 11:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List, Andy Gospodarek

commit d519e17e2d01a0ee9abe083019532061b4438065
(net: export device speed and duplex via sysfs)
made the wrong assumption that netdev->ethtool_ops was always set.

This makes possible to crash kernel and let rtnl in locked state.

modprobe dummy
ip link set dummy0 up
(udev runs and crash)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/net-sysfs.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 753c420..89de182 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -139,7 +139,9 @@ static ssize_t show_speed(struct device *dev,
 	if (!rtnl_trylock())
 		return restart_syscall();
 
-	if (netif_running(netdev) && netdev->ethtool_ops->get_settings) {
+	if (netif_running(netdev) &&
+	    netdev->ethtool_ops &&
+	    netdev->ethtool_ops->get_settings) {
 		struct ethtool_cmd cmd = { ETHTOOL_GSET };
 
 		if (!netdev->ethtool_ops->get_settings(netdev, &cmd))
@@ -158,7 +160,9 @@ static ssize_t show_duplex(struct device *dev,
 	if (!rtnl_trylock())
 		return restart_syscall();
 
-	if (netif_running(netdev) && netdev->ethtool_ops->get_settings) {
+	if (netif_running(netdev) &&
+	    netdev->ethtool_ops &&
+	    netdev->ethtool_ops->get_settings) {
 		struct ethtool_cmd cmd = { ETHTOOL_GSET };
 
 		if (!netdev->ethtool_ops->get_settings(netdev, &cmd))

^ permalink raw reply related

* Re: [PATCH] virtio-net: fix data corruption with OOM
From: Michael S. Tsirkin @ 2009-10-26  9:07 UTC (permalink / raw)
  To: Rusty Russell; +Cc: virtualization, kvm, netdev
In-Reply-To: <200910261211.52148.rusty@rustcorp.com.au>

On Mon, Oct 26, 2009 at 12:11:51PM +1030, Rusty Russell wrote:
> On Mon, 26 Oct 2009 03:33:40 am Michael S. Tsirkin wrote:
> > virtio net used to unlink skbs from send queues on error,
> > but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
> > we do not do this. This causes guest data corruption and crashes
> > with vhost since net core can requeue the skb or free it without
> > it being taken off the list.
> > 
> > This patch fixes this by queueing the skb after successfull
> > transmit.
> 
> I originally thought that this was racy: as soon as we do add_buf, we need to
> make sure we're ready for the callback (for virtio_pci, it's ->kick, but we
> shouldn't rely on that).
> 
> So a comment would be nice.  How's this?
> 
> Subject: virtio-net: fix data corruption with OOM
> Date: Sun, 25 Oct 2009 19:03:40 +0200
> From: "Michael S. Tsirkin" <mst@redhat.com>

Another, and hopefully the last, note, is that
git-am can only handle Subject/From lines
at the beginning of the message.
So git style of the mail would be

	Subject: XXX
	From: YYY

	Description

	---

	Discussion and notes.

I think it's weird. We could invent some kind of separator
that would make git-am accept Subject/From/Date lines in
the middle of the message, so that discussion can come before
the description. Worth it?

> 
> virtio net used to unlink skbs from send queues on error,
> but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
> we do not do this. This causes guest data corruption and crashes
> with vhost since net core can requeue the skb or free it without
> it being taken off the list.
> 
> This patch fixes this by queueing the skb after successful
> transmit.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (+ comment)
> ---
> 
> Rusty, here's a fix for another data corrupter I saw.
> This fixes a regression from 2.6.31, so definitely
> 2.6.32 I think. Comments?
> 
>  drivers/net/virtio_net.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -516,8 +516,7 @@ again:
>  	/* Free up any pending old buffers before queueing new ones. */
>  	free_old_xmit_skbs(vi);
>  
> -	/* Put new one in send queue and do transmit */
> -	__skb_queue_head(&vi->send, skb);
> +	/* Try to transmit */
>  	capacity = xmit_skb(vi, skb);
>  
>  	/* This can happen with OOM and indirect buffers. */
> @@ -531,8 +530,17 @@ again:
>  		}
>  		return NETDEV_TX_BUSY;
>  	}
> +	vi->svq->vq_ops->kick(vi->svq);
>  
> -	vi->svq->vq_ops->kick(vi->svq);
> +	/*
> +	 * Put new one in send queue.  You'd expect we'd need this before
> +	 * xmit_skb calls add_buf(), since the callback can be triggered
> +	 * immediately after that.  But since the callback just triggers
> +	 * another call back here, normal network xmit locking prevents the
> +	 * race.
> +	 */
> +	__skb_queue_head(&vi->send, skb);
> +
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
>  	nf_reset(skb);

^ permalink raw reply

* Re: [RFC] make per interface sysctl entries configurable
From: Cosmin Ratiu @ 2009-10-26  9:01 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: Eric Dumazet, Benjamin LaHaise, Stephen Hemminger, netdev
In-Reply-To: <200910260021.48785.opurdila@ixiacom.com>

On Monday 26 October 2009 00:21:48 Octavian Purdila wrote:
> On Sunday 25 October 2009 23:37:19 you wrote:
> > Octavian Purdila a écrit :
> > > RFC patches are attached.
> > >
> > > Another possible approach: add an interface flag and use it to decide
> > > whether we want per interface sysctl entries or not.
> >
> > Hmm, could we speedup sysctl instead, adding rbtree or something ?
> 
> Very good point, I think this is the best solution for people using a
> moderately high number of interfaces (a few thousand).
> 
> But for really large setups there is another issue: memory consumption. In
> fact, in order to be able to scale to 128K interfaces and still have a
> significant amount of memory available to applications we also had to
>  disable sysfs and #ifdef CONFIG_SYSFS struct device from net_device.
> 
> I would also argue that when you have such a large number of interfaces you
> don't need to change setting on a per interface basis. Or at least this is
>  our case :)  and I suspect that the case with a large number of PPP
>  interfaces is similar.
> 

Another possible approach: shared settings for an interface group. If you have 
a large number of interfaces of the same type it would be nice if you could 
change some setting for the whole group instead of globally or individually. 

Is this approach feasible anyway? Or I'm talking rubbish.

Cosmin.

^ permalink raw reply

* Re: [PATCH] virtio-net: fix data corruption with OOM
From: Michael S. Tsirkin @ 2009-10-26  9:00 UTC (permalink / raw)
  To: Rusty Russell; +Cc: virtualization, kvm, netdev
In-Reply-To: <200910261211.52148.rusty@rustcorp.com.au>

On Mon, Oct 26, 2009 at 12:11:51PM +1030, Rusty Russell wrote:
> On Mon, 26 Oct 2009 03:33:40 am Michael S. Tsirkin wrote:
> > virtio net used to unlink skbs from send queues on error,
> > but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
> > we do not do this. This causes guest data corruption and crashes
> > with vhost since net core can requeue the skb or free it without
> > it being taken off the list.
> > 
> > This patch fixes this by queueing the skb after successfull
> > transmit.
> 
> I originally thought that this was racy: as soon as we do add_buf, we need to
> make sure we're ready for the callback (for virtio_pci, it's ->kick, but we
> shouldn't rely on that).

BTW, wanted to note that unlink on error would *also* be racy if we did any
processing in the callback.

> So a comment would be nice.  How's this?
> 
> Subject: virtio-net: fix data corruption with OOM
> Date: Sun, 25 Oct 2009 19:03:40 +0200
> From: "Michael S. Tsirkin" <mst@redhat.com>
> 
> virtio net used to unlink skbs from send queues on error,
> but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
> we do not do this. This causes guest data corruption and crashes
> with vhost since net core can requeue the skb or free it without
> it being taken off the list.
> 
> This patch fixes this by queueing the skb after successful
> transmit.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (+ comment)
> ---
> 
> Rusty, here's a fix for another data corrupter I saw.
> This fixes a regression from 2.6.31, so definitely
> 2.6.32 I think. Comments?
> 
>  drivers/net/virtio_net.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -516,8 +516,7 @@ again:
>  	/* Free up any pending old buffers before queueing new ones. */
>  	free_old_xmit_skbs(vi);
>  
> -	/* Put new one in send queue and do transmit */
> -	__skb_queue_head(&vi->send, skb);
> +	/* Try to transmit */
>  	capacity = xmit_skb(vi, skb);
>  
>  	/* This can happen with OOM and indirect buffers. */
> @@ -531,8 +530,17 @@ again:
>  		}
>  		return NETDEV_TX_BUSY;
>  	}
> +	vi->svq->vq_ops->kick(vi->svq);
>  
> -	vi->svq->vq_ops->kick(vi->svq);
> +	/*
> +	 * Put new one in send queue.  You'd expect we'd need this before
> +	 * xmit_skb calls add_buf(), since the callback can be triggered
> +	 * immediately after that.  But since the callback just triggers
> +	 * another call back here, normal network xmit locking prevents the
> +	 * race.
> +	 */
> +	__skb_queue_head(&vi->send, skb);
> +
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
>  	nf_reset(skb);

^ permalink raw reply

* Re: [PATCH 00/13] TProxy IPv6 support 2nd round
From: Balazs Scheidler @ 2009-10-26  9:00 UTC (permalink / raw)
  To: Harald Welte; +Cc: netfilter-devel, netdev
In-Reply-To: <20091025101612.GS29852@prithivi.gnumonks.org>

On Sun, 2009-10-25 at 11:16 +0100, Harald Welte wrote:
> Dear Balazs,
> 
> as you might have read from other mails (and by the long period of silence),
> Patrick McHardy is currently unavailable to perform his usual maintainer
> role.
> 
> I personally am too much out of touch with recent developments in
> netfitler-land to be able to confidently review your patches...  
> 
> So unless somebody else from the team (Jozsef?, Pablo?) feels confident in
> ACKing your patchset, I will have to ask for your patience until Patrick is
> back and can do it by himself.

Thanks for letting me know, hopefully Patrick gets better soon. I've
planned another round of the TProxy patches as I got some comments at
the previous round I'm yet to address.

So no need to hurry.

-- 
Bazsi


^ permalink raw reply

* Re: [PATCH] net: Adjust softirq raising in __napi_schedule
From: Johannes Berg @ 2009-10-26  8:56 UTC (permalink / raw)
  To: Tilman Schmidt
  Cc: Jarek Poplawski, David Miller, hidave.darkstar, linux-kernel,
	tglx, linux-wireless, linux-ppp, netdev, paulus, Michael Buesch,
	Oliver Hartkopp
In-Reply-To: <4AE56217.3040708@imap.cc>

[-- Attachment #1: Type: text/plain, Size: 1582 bytes --]

On Mon, 2009-10-26 at 10:47 +0200, Tilman Schmidt wrote:

> > However, I lost track now of why we're discussing this.
> 
> The starting point were several reports of the kernel message:
> 
> NOHZ: local_softirq_pending 08
> 
> Originally most if not all of them came from wireless networking,
> but I muddied the waters by adding to the mix a case involving ISDN.
> You stated that all the solutions proposed so far were wrong, so
> we're naturally turning to you for guidance on what the right
> solution might be.

Right. Sorry about getting lost here.

> > Basically it boils down to using netif_rx() when in (soft)irq, and
> > netif_rx_ni() when in process context. That could just be an
> > optimisation, but it's a very valid one.
> 
> Hmmm. That seems to contradict your earlier statement to me that
> simply replacing a call to netif_rx() by one to netif_rx_ni()
> when not in interrupt context isn't a valid solution either.
> What am I missing?

Well, I think you misunderstood me. It would be correct to do this, if
and only if the code that calls it doesn't need the extra guarantee.

Any code (say ISDN code) that calls netif_rx() is clearly assuming to
always be running in (soft)irq context, otherwise it couldn't call
netif_rx() unconditionally. Agree so far?

So now if you change the ISDN code to call netif_rx_ni(), you've changed
the assumption that the ISDN code makes -- that it is running in
(soft)irq context. Therefore, you need to verify that this is actually a
correct change, which is what I tried to say.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: [PATCH] virtio-net: fix data corruption with OOM
From: Michael S. Tsirkin @ 2009-10-26  8:54 UTC (permalink / raw)
  To: Rusty Russell; +Cc: virtualization, kvm, netdev
In-Reply-To: <200910261211.52148.rusty@rustcorp.com.au>

On Mon, Oct 26, 2009 at 12:11:51PM +1030, Rusty Russell wrote:
> On Mon, 26 Oct 2009 03:33:40 am Michael S. Tsirkin wrote:
> > virtio net used to unlink skbs from send queues on error,
> > but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
> > we do not do this. This causes guest data corruption and crashes
> > with vhost since net core can requeue the skb or free it without
> > it being taken off the list.
> > 
> > This patch fixes this by queueing the skb after successfull
> > transmit.
> 
> I originally thought that this was racy: as soon as we do add_buf, we need to
> make sure we're ready for the callback (for virtio_pci, it's ->kick, but we
> shouldn't rely on that).
> 
> So a comment would be nice.  How's this?

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> Subject: virtio-net: fix data corruption with OOM
> Date: Sun, 25 Oct 2009 19:03:40 +0200
> From: "Michael S. Tsirkin" <mst@redhat.com>
> 
> virtio net used to unlink skbs from send queues on error,
> but ever since 48925e372f04f5e35fec6269127c62b2c71ab794
> we do not do this. This causes guest data corruption and crashes
> with vhost since net core can requeue the skb or free it without
> it being taken off the list.
> 
> This patch fixes this by queueing the skb after successful
> transmit.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (+ comment)
> ---
> 
> Rusty, here's a fix for another data corrupter I saw.
> This fixes a regression from 2.6.31, so definitely
> 2.6.32 I think. Comments?
> 
>  drivers/net/virtio_net.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -516,8 +516,7 @@ again:
>  	/* Free up any pending old buffers before queueing new ones. */
>  	free_old_xmit_skbs(vi);
>  
> -	/* Put new one in send queue and do transmit */
> -	__skb_queue_head(&vi->send, skb);
> +	/* Try to transmit */
>  	capacity = xmit_skb(vi, skb);
>  
>  	/* This can happen with OOM and indirect buffers. */
> @@ -531,8 +530,17 @@ again:
>  		}
>  		return NETDEV_TX_BUSY;
>  	}
> +	vi->svq->vq_ops->kick(vi->svq);
>  
> -	vi->svq->vq_ops->kick(vi->svq);
> +	/*
> +	 * Put new one in send queue.  You'd expect we'd need this before
> +	 * xmit_skb calls add_buf(), since the callback can be triggered
> +	 * immediately after that.  But since the callback just triggers
> +	 * another call back here, normal network xmit locking prevents the
> +	 * race.
> +	 */
> +	__skb_queue_head(&vi->send, skb);
> +
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
>  	nf_reset(skb);

^ permalink raw reply

* Re: VLAN and ARP failure on tg3 drivers
From: Eric Dumazet @ 2009-10-26  8:54 UTC (permalink / raw)
  To: Benny Amorsen; +Cc: Gertjan Hofman, Matt Carlson, netdev@vger.kernel.org
In-Reply-To: <m34opmr2fz.fsf@ursa.amorsen.dk>

Benny Amorsen a écrit :
> Gertjan Hofman <gertjan_hofman@yahoo.com> writes:
> 
>> Dear Matt, Eric, Benny,
>>
>> Sorry about the slow response to your fast replies. I think Benny is
>> correct, the 'problem' lies in the fact that we were using a VLAN ID
>> of 0, without knowing its special significance. User error.
>>
>> I tested it with other VLAN id's (>0) and it appears to work fine. We
>> are not entirely sure we understand  why it used to work with VLAN ID
>> 0 on the Broadcom chips and still does with a number of different
>> cards (with >2.6.27 kernels).  What is the 'correct' behaviour for
>> this incorrect usage ?
> 
> VLAN 0 isn't incorrect, it's just surprising. When you send a packet
> tagged with VLAN 0, it means that the packet should be interpreted as
> being the same VLAN as a completely untagged packet.
> 
> So in theory, if both ends are using VLAN 0 and you aren't using eth0
> for anything, traffic should flow, at least if both ends are on the same
> kernel version. Feel free to debug why that isn't the case for you, of
> course...
> 

VLAN id 0 is not usable on current kernel because we use 16 bits in skb to
 store vlan_tci, and vlan_tci = 0 means there is no VLAN tagging.


We could use high order bit (0x8000) to tell if vlan tagging is set or not.

^ permalink raw reply

* Re: [PATCH] net: Adjust softirq raising in __napi_schedule
From: Tilman Schmidt @ 2009-10-26  8:47 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Jarek Poplawski, David Miller, hidave.darkstar, linux-kernel,
	tglx, linux-wireless, linux-ppp, netdev, paulus, Michael Buesch,
	Oliver Hartkopp
In-Reply-To: <1256543932.28230.9.camel@johannes.local>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Am 26.10.2009 09:58 schrieb Johannes Berg:
> On Mon, 2009-10-26 at 07:54 +0000, Jarek Poplawski wrote:
> 
>>> No, I wrote that I didn't know. I suppose now that I looked at it I do
>>> know, and only disabling preemption is required.
>> But netif_rx has preemption disabled most of the time (by hardirqs
>> disabling). So maybe disabling preemption isn't the main reason here
>> either?
> 
> Not for netpoll though, which may or may not be relevant (if I were to
> venture a guess I'd say it isn't and it disables preemption to be able
> to do the softirq thing)
> 
> However, I lost track now of why we're discussing this.

The starting point were several reports of the kernel message:

NOHZ: local_softirq_pending 08

Originally most if not all of them came from wireless networking,
but I muddied the waters by adding to the mix a case involving ISDN.
You stated that all the solutions proposed so far were wrong, so
we're naturally turning to you for guidance on what the right
solution might be.

> Basically it boils down to using netif_rx() when in (soft)irq, and
> netif_rx_ni() when in process context. That could just be an
> optimisation, but it's a very valid one.

Hmmm. That seems to contradict your earlier statement to me that
simply replacing a call to netif_rx() by one to netif_rx_ni()
when not in interrupt context isn't a valid solution either.
What am I missing?

Thanks,
Tilman

- --
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.4 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFK5WIXQ3+did9BuFsRAsj1AJ0e4VJ7Nsp69ROXCiT4oM/Q6lIpSwCfZvXS
4nV4tWTIzgk4IRlCt0CCF3Y=
=r15I
-----END PGP SIGNATURE-----

^ permalink raw reply

* Re: VLAN and ARP failure on tg3 drivers
From: Benny Amorsen @ 2009-10-26  8:20 UTC (permalink / raw)
  To: Gertjan Hofman; +Cc: Matt Carlson, netdev@vger.kernel.org, Eric Dumazet
In-Reply-To: <477963.52849.qm@web32605.mail.mud.yahoo.com>

Gertjan Hofman <gertjan_hofman@yahoo.com> writes:

> Dear Matt, Eric, Benny,
>
> Sorry about the slow response to your fast replies. I think Benny is
> correct, the 'problem' lies in the fact that we were using a VLAN ID
> of 0, without knowing its special significance. User error.
>
> I tested it with other VLAN id's (>0) and it appears to work fine. We
> are not entirely sure we understand  why it used to work with VLAN ID
> 0 on the Broadcom chips and still does with a number of different
> cards (with >2.6.27 kernels).  What is the 'correct' behaviour for
> this incorrect usage ?

VLAN 0 isn't incorrect, it's just surprising. When you send a packet
tagged with VLAN 0, it means that the packet should be interpreted as
being the same VLAN as a completely untagged packet.

So in theory, if both ends are using VLAN 0 and you aren't using eth0
for anything, traffic should flow, at least if both ends are on the same
kernel version. Feel free to debug why that isn't the case for you, of
course...


/Benny


^ permalink raw reply

* [PATCH v3 2/7] Allow tcp_parse_options to consult dst entry
From: Gilad Ben-Yossef @ 2009-10-26  8:06 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef, Gilad Ben-Yossef
In-Reply-To: <1256544393-12450-1-git-send-email-gilad@codefidence.com>

From: Gilad Ben-Yossef <gby@watson.codefidence.com>

We need tcp_parse_options to be aware of dst_entry to
take into account per dst_entry TCP options settings

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>

---
 include/net/tcp.h        |    3 ++-
 net/ipv4/syncookies.c    |   27 ++++++++++++++-------------
 net/ipv4/tcp_input.c     |    9 ++++++---
 net/ipv4/tcp_ipv4.c      |   21 ++++++++++++---------
 net/ipv4/tcp_minisocks.c |    7 +++++--
 net/ipv6/syncookies.c    |   28 +++++++++++++++-------------
 net/ipv6/tcp_ipv6.c      |    3 ++-
 7 files changed, 56 insertions(+), 42 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..740d09b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -409,7 +409,8 @@ extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
 
 extern void			tcp_parse_options(struct sk_buff *skb,
 						  struct tcp_options_received *opt_rx,
-						  int estab);
+						  int estab,
+						  struct dst_entry *dst);
 
 extern u8			*tcp_parse_md5sig_option(struct tcphdr *th);
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index a6e0e07..4990dd4 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -276,13 +276,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
-	/* check for timestamp cookie support */
-	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
-
-	if (tcp_opt.saw_tstamp)
-		cookie_check_timestamp(&tcp_opt);
-
 	ret = NULL;
 	req = inet_reqsk_alloc(&tcp_request_sock_ops); /* for safety */
 	if (!req)
@@ -298,12 +291,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	ireq->loc_addr		= ip_hdr(skb)->daddr;
 	ireq->rmt_addr		= ip_hdr(skb)->saddr;
 	ireq->ecn_ok		= 0;
-	ireq->snd_wscale	= tcp_opt.snd_wscale;
-	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
-	ireq->sack_ok		= tcp_opt.sack_ok;
-	ireq->wscale_ok		= tcp_opt.wscale_ok;
-	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
-	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
@@ -351,6 +338,20 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	/* check for timestamp cookie support */
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(skb, &tcp_opt, 0, &rt->u.dst);
+
+	if (tcp_opt.saw_tstamp)
+		cookie_check_timestamp(&tcp_opt);
+
+	ireq->snd_wscale        = tcp_opt.snd_wscale;
+	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
+	ireq->sack_ok           = tcp_opt.sack_ok;
+	ireq->wscale_ok         = tcp_opt.wscale_ok;
+	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
+	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+
 	/* Try to redo what tcp_v4_send_synack did. */
 	req->window_clamp = tp->window_clamp ? :dst_metric(&rt->u.dst, RTAX_WINDOW);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..d502f49 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3698,12 +3698,14 @@ old_ack:
  * the fast version below fails.
  */
 void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
-		       int estab)
+		       int estab,  struct dst_entry *dst)
 {
 	unsigned char *ptr;
 	struct tcphdr *th = tcp_hdr(skb);
 	int length = (th->doff * 4) - sizeof(struct tcphdr);
 
+	BUG_ON(!estab && !dst);
+
 	ptr = (unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
 
@@ -3820,7 +3822,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
 		if (tcp_parse_aligned_timestamp(tp, th))
 			return 1;
 	}
-	tcp_parse_options(skb, &tp->rx_opt, 1);
+	tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
 	return 1;
 }
 
@@ -5364,8 +5366,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int saved_clamp = tp->rx_opt.mss_clamp;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
-	tcp_parse_options(skb, &tp->rx_opt, 0);
+	tcp_parse_options(skb, &tp->rx_opt, 0, dst);
 
 	if (th->ack) {
 		/* rfc793:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..1d611e3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1256,11 +1256,21 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
+	ireq = inet_rsk(req);
+	ireq->loc_addr = daddr;
+	ireq->rmt_addr = saddr;
+	ireq->no_srccheck = inet_sk(sk)->transparent;
+	ireq->opt = tcp_v4_save_options(sk, skb);
+
+	dst = inet_csk_route_req(sk, req);
+	if(!dst)
+		goto drop_and_free;
+
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = 536;
 	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
@@ -1269,14 +1279,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	tcp_openreq_init(req, &tmp_opt, skb);
 
-	ireq = inet_rsk(req);
-	ireq->loc_addr = daddr;
-	ireq->rmt_addr = saddr;
-	ireq->no_srccheck = inet_sk(sk)->transparent;
-	ireq->opt = tcp_v4_save_options(sk, skb);
-
 	if (security_inet_conn_request(sk, skb, req))
-		goto drop_and_free;
+		goto drop_and_release;
 
 	if (!want_cookie)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
@@ -1301,7 +1305,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		 */
 		if (tmp_opt.saw_tstamp &&
 		    tcp_death_row.sysctl_tw_recycle &&
-		    (dst = inet_csk_route_req(sk, req)) != NULL &&
 		    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
 		    peer->v4daddr == saddr) {
 			if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index c49a550..70ff955 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -101,7 +101,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	int paws_reject = 0;
 
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 1);
+		tcp_parse_options(skb, &tmp_opt, 1, NULL);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
@@ -499,10 +499,11 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	int paws_reject = 0;
 	struct tcp_options_received tmp_opt;
 	struct sock *child;
+	struct dst_entry *dst = inet_csk_route_req(sk, req);
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
@@ -515,6 +516,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	dst_release(dst);
+
 	/* Check for pure retransmitted SYN. */
 	if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn &&
 	    flg == TCP_FLAG_SYN &&
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 6b6ae91..6ece408 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -184,13 +184,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
-	/* check for timestamp cookie support */
-	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
-
-	if (tcp_opt.saw_tstamp)
-		cookie_check_timestamp(&tcp_opt);
-
 	ret = NULL;
 	req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
 	if (!req)
@@ -224,12 +217,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	req->expires = 0UL;
 	req->retrans = 0;
 	ireq->ecn_ok		= 0;
-	ireq->snd_wscale	= tcp_opt.snd_wscale;
-	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
-	ireq->sack_ok		= tcp_opt.sack_ok;
-	ireq->wscale_ok		= tcp_opt.wscale_ok;
-	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
-	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 	treq->rcv_isn = ntohl(th->seq) - 1;
 	treq->snt_isn = cookie;
 
@@ -264,6 +251,21 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 			goto out_free;
 	}
 
+	/* check for timestamp cookie support */
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(skb, &tcp_opt, 0, dst);
+
+	if (tcp_opt.saw_tstamp)
+		cookie_check_timestamp(&tcp_opt);
+
+	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+
+	ireq->snd_wscale        = tcp_opt.snd_wscale;
+	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
+	ireq->sack_ok           = tcp_opt.sack_ok;
+	ireq->wscale_ok         = tcp_opt.wscale_ok;
+	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
+
 	req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
 	tcp_select_initial_window(tcp_full_space(sk), req->mss,
 				  &req->rcv_wnd, &req->window_clamp,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 21d100b..2eebab5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1165,6 +1165,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct request_sock *req = NULL;
 	__u32 isn = TCP_SKB_CB(skb)->when;
+	struct dst_entry *dst = __sk_dst_get(sk);
 #ifdef CONFIG_SYN_COOKIES
 	int want_cookie = 0;
 #else
@@ -1203,7 +1204,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH v3 3/7] Add dst_feature to query route entry features
From: Gilad Ben-Yossef @ 2009-10-26  8:06 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256544393-12450-1-git-send-email-gilad@codefidence.com>

Adding an accessor to existing  dst_entry feautres field and
refactor the only supported feature (allfrag) to use it.


Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>

---
 include/net/dst.h |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 5a900dd..b562be3 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -111,6 +111,12 @@ dst_metric(const struct dst_entry *dst, int metric)
 	return dst->metrics[metric-1];
 }
 
+static inline u32
+dst_feature(const struct dst_entry *dst, u32 feature)
+{
+	return dst_metric(dst, RTAX_FEATURES) & feature;
+}
+
 static inline u32 dst_mtu(const struct dst_entry *dst)
 {
 	u32 mtu = dst_metric(dst, RTAX_MTU);
@@ -136,7 +142,7 @@ static inline void set_dst_metric_rtt(struct dst_entry *dst, int metric,
 static inline u32
 dst_allfrag(const struct dst_entry *dst)
 {
-	int ret = dst_metric(dst, RTAX_FEATURES) & RTAX_FEATURE_ALLFRAG;
+	int ret = dst_feature(dst,  RTAX_FEATURE_ALLFRAG);
 	/* Yes, _exactly_. This is paranoia. */
 	barrier();
 	return ret;
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH v3 5/7] Allow disabling TCP timestamp options per route
From: Gilad Ben-Yossef @ 2009-10-26  8:06 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256544393-12450-1-git-send-email-gilad@codefidence.com>

Implement querying and acting upon the no timestamp bit in the feature 
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>

---
 include/linux/rtnetlink.h |    2 +-
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    8 ++++++--
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 9c802a6..2ab8c75 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -378,7 +378,7 @@ enum
 
 #define RTAX_FEATURE_ECN	0x00000001
 #define RTAX_FEATURE_NO_SACK	0x00000002
-#define RTAX_FEATURE_TIMESTAMP	0x00000004
+#define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 
 struct rta_session
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b14f780..d2f9742 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3755,7 +3755,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 			case TCPOPT_TIMESTAMP:
 				if ((opsize == TCPOLEN_TIMESTAMP) &&
 				    ((estab && opt_rx->tstamp_ok) ||
-				     (!estab && sysctl_tcp_timestamps))) {
+				     (!estab && sysctl_tcp_timestamps &&
+				      !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP)))) {
 					opt_rx->saw_tstamp = 1;
 					opt_rx->rcv_tsval = get_unaligned_be32(ptr);
 					opt_rx->rcv_tsecr = get_unaligned_be32(ptr + 4);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 64db8dd..8f30c18 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -488,7 +488,9 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	opts->mss = tcp_advertise_mss(sk);
 	size += TCPOLEN_MSS_ALIGNED;
 
-	if (likely(sysctl_tcp_timestamps && *md5 == NULL)) {
+	if (likely(sysctl_tcp_timestamps &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP) &&
+		   *md5 == NULL)) {
 		opts->options |= OPTION_TS;
 		opts->tsval = TCP_SKB_CB(skb)->when;
 		opts->tsecr = tp->rx_opt.ts_recent;
@@ -2317,7 +2319,9 @@ static void tcp_connect_init(struct sock *sk)
 	 * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
 	 */
 	tp->tcp_header_len = sizeof(struct tcphdr) +
-		(sysctl_tcp_timestamps ? TCPOLEN_TSTAMP_ALIGNED : 0);
+		(sysctl_tcp_timestamps &&
+		(!dst_feature(dst, RTAX_FEATURE_NO_TSTAMP) ?
+		  TCPOLEN_TSTAMP_ALIGNED : 0));
 
 #ifdef CONFIG_TCP_MD5SIG
 	if (tp->af_specific->md5_lookup(sk, sk) != NULL)
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH v3 1/7] Only parse time stamp TCP option in time wait sock
From: Gilad Ben-Yossef @ 2009-10-26  8:06 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef, Yony Amit
In-Reply-To: <1256544393-12450-1-git-send-email-gilad@codefidence.com>

Since we only use tcp_parse_options here to check for the exietence
of TCP timestamp option in the header, it is better to call with
the "established" flag on.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>

---
 net/ipv4/tcp_minisocks.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 624c3c9..c49a550 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -100,9 +100,8 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_options_received tmp_opt;
 	int paws_reject = 0;
 
-	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tcp_parse_options(skb, &tmp_opt, 1);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH v3 7/7] Allow disabling of DSACK TCP option per route
From: Gilad Ben-Yossef @ 2009-10-26  8:06 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256544393-12450-1-git-send-email-gilad@codefidence.com>

Add and use no DSCAK bit in the features field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>

---
 include/linux/rtnetlink.h |    1 +
 net/ipv4/tcp_input.c      |    8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 6784b34..e78b60c 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -381,6 +381,7 @@ enum
 #define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 #define RTAX_FEATURE_NO_WSCALE	0x00000010
+#define RTAX_FEATURE_NO_DSACK	0x00000020
 
 struct rta_session
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4f5e914..4262da5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4080,8 +4080,10 @@ static inline int tcp_sack_extend(struct tcp_sack_block *sp, u32 seq,
 static void tcp_dsack_set(struct sock *sk, u32 seq, u32 end_seq)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct dst_entry *dst = __sk_dst_get(sk);
 
-	if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
+	if (tcp_is_sack(tp) && sysctl_tcp_dsack &&
+	    !dst_feature(dst, RTAX_FEATURE_NO_DSACK)) {
 		int mib_idx;
 
 		if (before(seq, tp->rcv_nxt))
@@ -4110,13 +4112,15 @@ static void tcp_dsack_extend(struct sock *sk, u32 seq, u32 end_seq)
 static void tcp_send_dupack(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct dst_entry *dst = __sk_dst_get(sk);
 
 	if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
 	    before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
 		tcp_enter_quickack_mode(sk);
 
-		if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
+		if (tcp_is_sack(tp) && sysctl_tcp_dsack &&
+		    !dst_feature(dst, RTAX_FEATURE_NO_DSACK)) {
 			u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
 			if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH v3 4/7] Add the no SACK route option feature
From: Gilad Ben-Yossef @ 2009-10-26  8:06 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256544393-12450-1-git-send-email-gilad@codefidence.com>

Implement querying and acting upon the no sack bit in the features
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>

---
 include/linux/rtnetlink.h |    2 +-
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    4 +++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index adf2068..9c802a6 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -377,7 +377,7 @@ enum
 #define RTAX_MAX (__RTAX_MAX - 1)
 
 #define RTAX_FEATURE_ECN	0x00000001
-#define RTAX_FEATURE_SACK	0x00000002
+#define RTAX_FEATURE_NO_SACK	0x00000002
 #define RTAX_FEATURE_TIMESTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d502f49..b14f780 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3763,7 +3763,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				break;
 			case TCPOPT_SACK_PERM:
 				if (opsize == TCPOLEN_SACK_PERM && th->syn &&
-				    !estab && sysctl_tcp_sack) {
+				    !estab && sysctl_tcp_sack &&
+				    !dst_feature(dst, RTAX_FEATURE_NO_SACK)) {
 					opt_rx->sack_ok = 1;
 					tcp_sack_reset(opt_rx);
 				}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fcd278a..64db8dd 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -464,6 +464,7 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 				struct tcp_md5sig_key **md5) {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned size = 0;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
@@ -498,7 +499,8 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		opts->options |= OPTION_WSCALE;
 		size += TCPOLEN_WSCALE_ALIGNED;
 	}
-	if (likely(sysctl_tcp_sack)) {
+	if (likely(sysctl_tcp_sack &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_SACK))) {
 		opts->options |= OPTION_SACK_ADVERTISE;
 		if (unlikely(!(OPTION_TS & opts->options)))
 			size += TCPOLEN_SACKPERM_ALIGNED;
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH v3 6/7] Allow to turn off TCP window scale opt per route
From: Gilad Ben-Yossef @ 2009-10-26  8:06 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256544393-12450-1-git-send-email-gilad@codefidence.com>

Add and use no window scale bit in the features field.

Note that this is not the same as setting a window scale of 0 
as would happen with window limit on route.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
---
 include/linux/rtnetlink.h |    1 +
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    6 ++++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 2ab8c75..6784b34 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -380,6 +380,7 @@ enum
 #define RTAX_FEATURE_NO_SACK	0x00000002
 #define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
+#define RTAX_FEATURE_NO_WSCALE	0x00000010
 
 struct rta_session
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d2f9742..4f5e914 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3739,7 +3739,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				break;
 			case TCPOPT_WINDOW:
 				if (opsize == TCPOLEN_WINDOW && th->syn &&
-				    !estab && sysctl_tcp_window_scaling) {
+				    !estab && sysctl_tcp_window_scaling &&
+				    !dst_feature(dst, RTAX_FEATURE_NO_WSCALE)) {
 					__u8 snd_wscale = *(__u8 *)ptr;
 					opt_rx->wscale_ok = 1;
 					if (snd_wscale > 14) {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8f30c18..ff60a21 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -496,7 +496,8 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		opts->tsecr = tp->rx_opt.ts_recent;
 		size += TCPOLEN_TSTAMP_ALIGNED;
 	}
-	if (likely(sysctl_tcp_window_scaling)) {
+	if (likely(sysctl_tcp_window_scaling &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_WSCALE))) {
 		opts->ws = tp->rx_opt.rcv_wscale;
 		opts->options |= OPTION_WSCALE;
 		size += TCPOLEN_WSCALE_ALIGNED;
@@ -2347,7 +2348,8 @@ static void tcp_connect_init(struct sock *sk)
 				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
-				  sysctl_tcp_window_scaling,
+				  (sysctl_tcp_window_scaling &&
+				   !dst_feature(dst, RTAX_FEATURE_NO_WSCALE)),
 				  &rcv_wscale);
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH v3 0/7] Per route TCP options
From: Gilad Ben-Yossef @ 2009-10-26  8:06 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef

Third iteration of patch to allow disablng of TCP SACK, DSCAK, 
time stamp and window scale TCP options on a per route basis, now
with 100% less remote DoS opportunities (thank you Ilpo for 
spotting it ;-)

You usualy want to disable SACK, DSACK, time stamp or window
scale if you've got a piece of broken networking equipment somewhere 
as a stop gap until you can bring a big enough hammer to deal with
the broken network equipment. It doesn't make sense to "punish" the
entire connections going through the machine to destinations not 
related to the broken equipment.

This is doubly true when you're dealing with network containers
used to isolate several virtual domains.

Per route options implemented in free bits in the features route
entry property, which in some cases were reserved by name for these
options, so this does not inflate any structure.

Global sysctl based kill switches for these options are still
preserved, as some people seems to want them, so behaviour
is default to on, unless switched off either globaly or on
per route basis.

Tested on x86 using Qemu/KVM.  

Working but crude matching patch to iproute2 sent earlier to the list.

Patchset based on original work by Ori Finkelman and Yony Amit 
from ComSleep Ltd.

Gilad Ben-Yossef (7):
  Only parse time stamp TCP option in time wait sock
  Allow tcp_parse_options to consult dst entry
  Infrastructure for querying route entry features
  Add the no SACK route option feature
  Allow disabling TCP timestamp options per route
  Allow to turn off TCP window scale opt per route
  Allow disabling of DSACK TCP option per route

 include/linux/rtnetlink.h |    6 ++++--
 include/net/dst.h         |    8 +++++++-
 include/net/tcp.h         |    3 ++-
 net/ipv4/syncookies.c     |   27 ++++++++++++++-------------
 net/ipv4/tcp_input.c      |   26 ++++++++++++++++++--------
 net/ipv4/tcp_ipv4.c       |   21 ++++++++++++---------
 net/ipv4/tcp_minisocks.c  |    8 +++++---
 net/ipv4/tcp_output.c     |   18 +++++++++++++-----
 net/ipv6/syncookies.c     |   28 +++++++++++++++-------------
 net/ipv6/tcp_ipv6.c       |    3 ++-
 10 files changed, 92 insertions(+), 56 deletions(-)


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox