Netdev List


* Re: [PATCH net-next-2.6] xfrm: cleanup of xfrm_input.c.
From: David Miller @ 2010-07-15  0:59 UTC (permalink / raw)
  To: ramirose; +Cc: netdev
In-Reply-To: <AANLkTik6qHgbCVwZfMGXpso4p2DTC9w6U9agdFVl1ZbN@mail.gmail.com>

From: Rami Rosen <ramirose@gmail.com>
Date: Wed, 14 Jul 2010 11:18:41 +0300

> Hi,
>  The patch removes unneeded inclusion of header files
> (linux/module.h, linux/netdevice.h, net/dst.h and net/ip.h)
>  in net/xfrm/xfrm_input.c
> 
> Regards,
> Rami Rosen
> 
> Signed-off-by: Rami Rosen <ramirose@gmail.com>

If you do this, I also want to see you add includes for things like
linux/skbuff.h since data structures such as "struct sk_buff"
are used in this file.

Otherwise, this is how we end up with obscure build failures on
some configurations and not others, either now or in the future
when a similar change is made to some header file.
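
The point about direct includes can be illustrated outside the kernel. The following is a minimal userspace sketch, not code from xfrm_input.c: the function name copy_label is hypothetical, and the standard headers stand in for linux/skbuff.h. Each header is included for a symbol the file itself uses, so the file should keep building even if some other header stops pulling it in indirectly.

```c
#include <stddef.h>   /* size_t: used directly below */
#include <string.h>   /* strlen, memcpy: used directly below */

/*
 * Copy at most dst_size - 1 bytes of src into dst and NUL-terminate it.
 * Returns the number of bytes copied, not counting the NUL.
 */
size_t copy_label(char *dst, size_t dst_size, const char *src)
{
	size_t len;

	if (dst_size == 0)
		return 0;

	len = strlen(src);
	if (len > dst_size - 1)
		len = dst_size - 1;

	memcpy(dst, src, len);
	dst[len] = '\0';
	return len;
}
```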

^ permalink raw reply

* Re: [PATCH net-next-2.6] net/core: neighbour update Oops
From: David Miller @ 2010-07-15  1:02 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rdkehn, netdev
In-Reply-To: <1279035511.2634.456.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 13 Jul 2010 17:38:31 +0200

> On Tuesday, 13 July 2010 at 08:23 -0700, Doug Kehn wrote:
>> When configuring DMVPN (GRE + openNHRP) and a GRE remote
>> address is configured a kernel Oops is observed.  The
>> observed Oops is caused by a NULL header_ops pointer
>> (neigh->dev->header_ops) in neigh_update_hhs() when
>> 
>> void (*update)(struct hh_cache*, const struct net_device*, const unsigned char *)
>> = neigh->dev->header_ops->cache_update;
>> 
>> is executed.  The dev associated with the NULL header_ops is
>> the GRE interface.  This patch guards against the
>> possibility that header_ops is NULL.
>> 
>> This Oops was first observed in kernel version 2.6.26.8.
>> 
>> Signed-off-by: Doug Kehn <rdkehn@yahoo.com>
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied and queued up for -stable, thanks!
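
The guard described in the report can be sketched in userspace as follows. This is an illustration under stated assumptions, not the applied patch: every type and function with a _model suffix, eth_like_cache_update, update_cached_header and HADDR_LEN are hypothetical stand-ins for the kernel structures named above. The only point shown is that header_ops is checked before cache_update is read through it.

```c
#include <stddef.h>
#include <string.h>

#define HADDR_LEN 6	/* assumption: Ethernet-sized hardware address */

struct hh_cache_model {
	unsigned char haddr[HADDR_LEN];
};

struct net_device_model;

struct header_ops_model {
	void (*cache_update)(struct hh_cache_model *hh,
			     const struct net_device_model *dev,
			     const unsigned char *haddr);
};

struct net_device_model {
	/* NULL on devices that provide no header operations */
	const struct header_ops_model *header_ops;
};

/* Example hook: copy the new hardware address into the cached header. */
void eth_like_cache_update(struct hh_cache_model *hh,
			   const struct net_device_model *dev,
			   const unsigned char *haddr)
{
	(void)dev;
	memcpy(hh->haddr, haddr, HADDR_LEN);
}

/* Returns 1 if the cached header was refreshed, 0 if the device has no hook. */
int update_cached_header(struct hh_cache_model *hh,
			 const struct net_device_model *dev,
			 const unsigned char *haddr)
{
	void (*update)(struct hh_cache_model *,
		       const struct net_device_model *,
		       const unsigned char *) = NULL;

	/* Guard: only look at cache_update when header_ops exists. */
	if (dev->header_ops)
		update = dev->header_ops->cache_update;

	if (!update)
		return 0;

	update(hh, dev, haddr);
	return 1;
}
```

With a device whose header_ops is NULL, update_cached_header() returns 0 instead of dereferencing the NULL pointer.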

^ permalink raw reply

* Re: [PATCH] bonding: fix a buffer overflow in bonding_show_queue_id.
From: David Miller @ 2010-07-15  1:25 UTC (permalink / raw)
  To: nicolas.2p.debian; +Cc: bonding-devel, andy, fubar, netdev
In-Reply-To: <1279146277-9381-1-git-send-email-nicolas.2p.debian@free.fr>

From: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Date: Thu, 15 Jul 2010 00:24:37 +0200

> The test for buffer overflow ensures we have room for 6 more bytes.
> sprintf, called with %s:%d, slave->dev->name, slave->queue_id may yield
> far more than 6 bytes.
> 
> The correct test is res > (PAGE_SIZE - IFNAMSIZ - 6) .
> 
> Signed-off-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>

Applied to net-next-2.6, thanks.
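
The reasoning behind the new bound can be modelled in userspace. This is a minimal sketch, not the bonding driver code: struct slave_model, show_queue_ids and PAIR_RESERVE are hypothetical names, the buffer size is a parameter instead of PAGE_SIZE, and snprintf is used so that the sketch stays in bounds even when the margin is exhausted. The margin itself is the IFNAMSIZ + 6 from the patch description.

```c
#include <stddef.h>
#include <stdio.h>

#define IFNAMSIZ 16			/* interface name buffer size, NUL included */
#define PAIR_RESERVE (IFNAMSIZ + 6)	/* margin taken from the patch description */

struct slave_model {
	char name[IFNAMSIZ];
	unsigned short queue_id;
};

/*
 * Append "name:queue_id " pairs to buf and return the number of bytes
 * written, not counting the terminating NUL.
 */
size_t show_queue_ids(char *buf, size_t size,
		      const struct slave_model *slaves, size_t count)
{
	size_t res = 0;
	size_t i;

	if (size == 0)
		return 0;
	buf[0] = '\0';
	if (size < PAIR_RESERVE)
		return 0;

	for (i = 0; i < count; i++) {
		int n;

		/* Stop once a worst-case pair is no longer guaranteed to fit. */
		if (res > size - PAIR_RESERVE)
			break;

		n = snprintf(buf + res, size - res, "%s:%d ",
			     slaves[i].name, slaves[i].queue_id);
		if (n < 0 || (size_t)n >= size - res) {
			buf[res] = '\0';	/* drop a partially written pair */
			break;
		}
		res += (size_t)n;
	}
	return res;
}
```

A test that only reserves 6 bytes would let a long interface name run past the end of the buffer; reserving room for the name as well avoids that.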

^ permalink raw reply

* Re: [PATCH] Net: ethernet: eth: fix some code style issues
From: David Miller @ 2010-07-15  1:26 UTC (permalink / raw)
  To: chihau; +Cc: eric.dumazet, opurdila, netdev, linux-devel, mitov
In-Reply-To: <1279138937-12985-1-git-send-email-chihau@gmail.com>

From: Chihau Chau <chihau@gmail.com>
Date: Wed, 14 Jul 2010 16:22:17 -0400

> @@ -180,7 +180,8 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
>  	 *      seems to set IFF_PROMISC.
>  	 */
>  
> -	else if (1 /*dev->flags&IFF_PROMISC */ ) {
> +	/*dev->flags&IFF_PROMISC */
> +	else if (1) {

This makes the code look worse, not better.

If the commented test is in the parenthesis, it is unambiguous
which piece of code the suggested test is meant for.

^ permalink raw reply

* Re: [PATCH] Net: ethernet: pe2.c: fix EXPORT_SYMBOL macro code style issue
From: David Miller @ 2010-07-15  1:27 UTC (permalink / raw)
  To: chihau; +Cc: tj, netdev, linux-kernel
In-Reply-To: <1279139074-13040-1-git-send-email-chihau@gmail.com>

From: Chihau Chau <chihau@gmail.com>
Date: Wed, 14 Jul 2010 16:24:34 -0400

> From: Chihau Chau <chihau@gmail.com>
> 
> This patch fixes a code style issue: if a function is exported, the
> EXPORT_SYMBOL macro for it should follow immediately after the closing
> function brace line.
> 
> Signed-off-by: Chihau Chau <chihau@gmail.com>

Applied.

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Bill Fink @ 2010-07-15  2:52 UTC (permalink / raw)
  To: David Miller; +Cc: davidsen, lists, linux-kernel, netdev
In-Reply-To: <20100714.111553.104052157.davem@davemloft.net>

On Wed, 14 Jul 2010, David Miller wrote:

> From: Bill Davidsen <davidsen@tmr.com>
> Date: Wed, 14 Jul 2010 11:21:15 -0400
> 
> > You may have to go into /proc/sys/net/core and crank up the
> > rmem_* settings, depending on your distribution.
> 
> You should never, ever, have to touch the various networking sysctl
> values to get good performance in any normal setup.  If you do, it's a
> bug, report it so we can fix it.
> 
> I cringe every time someone says to do this, so please do me a favor
> and don't spread this further. :-)
> 
> For one thing, TCP dynamically adjusts the socket buffer sizes based
> upon the behavior of traffic on the connection.
> 
> And the TCP memory limit sysctls (not the core socket ones) are sized
> based upon available memory.  They are there to protect you from
> situations such as having so much memory dedicated to socket buffers
> that there is none left to do other things effectively.  It's a
> protective limit, rather than a setting meant to increase or improve
> performance.  So like the others, leave these alone too.

What's normal?  :-)

netem1% cat /proc/version 
Linux version 2.6.30.10-105.2.23.fc11.x86_64 (mockbuild@x86-01.phx2.fedoraproject.org) (gcc version 4.4.1 20090725 (Red Hat 4.4.1-2) (GCC) ) #1 SMP Thu Feb 11 07:06:34 UTC 2010

Linux TCP autotuning across an 80 ms RTT cross country network path:

netem1% nuttcp -T10 -i1 192.168.1.18
   14.1875 MB /   1.00 sec =  119.0115 Mbps     0 retrans
  558.0000 MB /   1.00 sec = 4680.7169 Mbps     0 retrans
  872.8750 MB /   1.00 sec = 7322.3527 Mbps     0 retrans
  869.6875 MB /   1.00 sec = 7295.5478 Mbps     0 retrans
  858.4375 MB /   1.00 sec = 7201.0165 Mbps     0 retrans
  857.3750 MB /   1.00 sec = 7192.2116 Mbps     0 retrans
  865.5625 MB /   1.00 sec = 7260.7193 Mbps     0 retrans
  872.3750 MB /   1.00 sec = 7318.2095 Mbps     0 retrans
  862.7500 MB /   1.00 sec = 7237.2571 Mbps     0 retrans
  857.6250 MB /   1.00 sec = 7194.1864 Mbps     0 retrans

 7504.2771 MB /  10.09 sec = 6236.5068 Mbps 11 %TX 25 %RX 0 retrans 80.59 msRTT

Manually specified 100 MB TCP socket buffer on the same path:

netem1% nuttcp -T10 -i1 -w100m 192.168.1.18
  106.8125 MB /   1.00 sec =  895.9598 Mbps     0 retrans
 1092.0625 MB /   1.00 sec = 9160.3254 Mbps     0 retrans
 1111.2500 MB /   1.00 sec = 9322.6424 Mbps     0 retrans
 1115.4375 MB /   1.00 sec = 9356.2569 Mbps     0 retrans
 1116.4375 MB /   1.00 sec = 9365.6937 Mbps     0 retrans
 1115.3125 MB /   1.00 sec = 9356.2749 Mbps     0 retrans
 1121.2500 MB /   1.00 sec = 9405.6233 Mbps     0 retrans
 1125.5625 MB /   1.00 sec = 9441.6949 Mbps     0 retrans
 1130.0000 MB /   1.00 sec = 9478.7479 Mbps     0 retrans
 1139.0625 MB /   1.00 sec = 9555.8559 Mbps     0 retrans

10258.5120 MB /  10.20 sec = 8440.3558 Mbps 15 %TX 40 %RX 0 retrans 80.59 msRTT

The manually selected TCP socket buffer size both ramps up
quicker and achieves a much higher steady state rate.

					-Bill
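
The 100 MB figure lines up with the bandwidth-delay product of the path. The sketch below is a worked calculation, not part of nuttcp: the function names are hypothetical, and the 10 Gbit/s link rate is an assumption inferred from the measured rates, since the message does not state it. For 10 Gbit/s and an 80 ms RTT, 10e9 * 0.080 / 8 = 100,000,000 bytes must be in flight to keep the path full.

```c
/* Bytes that must be in flight to keep a path of the given rate and RTT full. */
double bdp_bytes(double rate_bits_per_sec, double rtt_sec)
{
	return rate_bits_per_sec * rtt_sec / 8.0;
}

/* Highest rate in bits/s a sender can reach when limited to window_bytes per RTT. */
double window_limited_rate(double window_bytes, double rtt_sec)
{
	return window_bytes * 8.0 / rtt_sec;
}
```

Read the other way round, a steady state of about 7.2 Gbit/s at 80.59 ms would correspond to roughly 72 MB in flight under this simple model, which ignores everything except window size and RTT.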

^ permalink raw reply

* Re: [PATCH] fec: use interrupt for MDIO completion indication
From: Bryan Wu @ 2010-07-15  3:31 UTC (permalink / raw)
  To: Baruch Siach
  Cc: netdev, linux-arm-kernel, Sascha Hauer, Greg Ungerer,
	Wolfram Sang
In-Reply-To: <006416d38a8e51ba8dd8631613a991528dc7976a.1278918594.git.baruch@tkos.co.il>

Baruch,

Thanks for this patch, we tested on our i.MX51 board with Ubuntu. It works fine.

Wolfram, you can pick up this, too. -;)

-Bryan

On 07/12/2010 03:12 PM, Baruch Siach wrote:
> With the move to phylib (commit e6b043d) I was seeing sporadic "MDIO write
> timeout" messages. Measure of the actual time spent showed latency times of
> more than 1600us.
> 
> This patch uses the MII event indication of the FEC hardware to detect
> completion of MDIO transactions.
> 
> Signed-off-by: Baruch Siach <baruch@tkos.co.il>
> ---
>  drivers/net/fec.c |   55 ++++++++++++++++++++++++----------------------------
>  1 files changed, 25 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/net/fec.c b/drivers/net/fec.c
> index edfff92..89f3393 100644
> --- a/drivers/net/fec.c
> +++ b/drivers/net/fec.c
> @@ -187,6 +187,7 @@ struct fec_enet_private {
>  	int	index;
>  	int	link;
>  	int	full_duplex;
> +	struct	completion mdio_done;
>  };
>  
>  static irqreturn_t fec_enet_interrupt(int irq, void * dev_id);
> @@ -205,7 +206,7 @@ static void fec_stop(struct net_device *dev);
>  #define FEC_MMFR_TA		(2 << 16)
>  #define FEC_MMFR_DATA(v)	(v & 0xffff)
>  
> -#define FEC_MII_TIMEOUT		10000
> +#define FEC_MII_TIMEOUT		1000 /* us */
>  
>  /* Transmitter timeout */
>  #define TX_TIMEOUT (2 * HZ)
> @@ -334,6 +335,11 @@ fec_enet_interrupt(int irq, void * dev_id)
>  			ret = IRQ_HANDLED;
>  			fec_enet_tx(dev);
>  		}
> +
> +		if (int_events & FEC_ENET_MII) {
> +			ret = IRQ_HANDLED;
> +			complete(&fep->mdio_done);
> +		}
>  	} while (int_events);
>  
>  	return ret;
> @@ -608,18 +614,13 @@ spin_unlock:
>  		phy_print_status(phy_dev);
>  }
>  
> -/*
> - * NOTE: a MII transaction is during around 25 us, so polling it...
> - */
>  static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
>  {
>  	struct fec_enet_private *fep = bus->priv;
> -	int timeout = FEC_MII_TIMEOUT;
> +	unsigned long time_left;
>  
>  	fep->mii_timeout = 0;
> -
> -	/* clear MII end of transfer bit*/
> -	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
> +	init_completion(&fep->mdio_done);
>  
>  	/* start a read op */
>  	writel(FEC_MMFR_ST | FEC_MMFR_OP_READ |
> @@ -627,13 +628,12 @@ static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
>  		FEC_MMFR_TA, fep->hwp + FEC_MII_DATA);
>  
>  	/* wait for end of transfer */
> -	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
> -		cpu_relax();
> -		if (timeout-- < 0) {
> -			fep->mii_timeout = 1;
> -			printk(KERN_ERR "FEC: MDIO read timeout\n");
> -			return -ETIMEDOUT;
> -		}
> +	time_left = wait_for_completion_timeout(&fep->mdio_done,
> +			usecs_to_jiffies(FEC_MII_TIMEOUT));
> +	if (time_left == 0) {
> +		fep->mii_timeout = 1;
> +		printk(KERN_ERR "FEC: MDIO read timeout\n");
> +		return -ETIMEDOUT;
>  	}
>  
>  	/* return value */
> @@ -644,12 +644,10 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
>  			   u16 value)
>  {
>  	struct fec_enet_private *fep = bus->priv;
> -	int timeout = FEC_MII_TIMEOUT;
> +	unsigned long time_left;
>  
>  	fep->mii_timeout = 0;
> -
> -	/* clear MII end of transfer bit*/
> -	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
> +	init_completion(&fep->mdio_done);
>  
>  	/* start a read op */
>  	writel(FEC_MMFR_ST | FEC_MMFR_OP_READ |
> @@ -658,13 +656,12 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
>  		fep->hwp + FEC_MII_DATA);
>  
>  	/* wait for end of transfer */
> -	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
> -		cpu_relax();
> -		if (timeout-- < 0) {
> -			fep->mii_timeout = 1;
> -			printk(KERN_ERR "FEC: MDIO write timeout\n");
> -			return -ETIMEDOUT;
> -		}
> +	time_left = wait_for_completion_timeout(&fep->mdio_done,
> +			usecs_to_jiffies(FEC_MII_TIMEOUT));
> +	if (time_left == 0) {
> +		fep->mii_timeout = 1;
> +		printk(KERN_ERR "FEC: MDIO write timeout\n");
> +		return -ETIMEDOUT;
>  	}
>  
>  	return 0;
> @@ -1222,7 +1219,8 @@ fec_restart(struct net_device *dev, int duplex)
>  	writel(0, fep->hwp + FEC_R_DES_ACTIVE);
>  
>  	/* Enable interrupts we wish to service */
> -	writel(FEC_ENET_TXF | FEC_ENET_RXF, fep->hwp + FEC_IMASK);
> +	writel(FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII,
> +			fep->hwp + FEC_IMASK);
>  }
>  
>  static void
> @@ -1242,9 +1240,6 @@ fec_stop(struct net_device *dev)
>  	writel(1, fep->hwp + FEC_ECNTRL);
>  	udelay(10);
>  
> -	/* Clear outstanding MII command interrupts. */
> -	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
> -
>  	writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED);
>  }
>  


^ permalink raw reply

* [PATCH] net: fix problem in reading sock TX queue
From: Tom Herbert @ 2010-07-15  3:48 UTC (permalink / raw)
  To: davem, netdev

Fix problem in reading the tx_queue recorded in a socket.  In
dev_pick_tx, the TX queue is read by doing a check with
sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
The problem is that there is no mutual exclusion across these
calls in the socket, so it is possible that the queue in the
sock can be invalidated after sk_tx_queue_recorded is called so
that sk_tx_queue_get returns -1, which sets 65535 in queue_index
and thus dev_pick_tx returns 65536 which is a bogus queue and
can cause crash in dev_queue_xmit.

We fix this by only calling sk_tx_queue_get which does the proper
checks.  The interface is that sk_tx_queue_get returns the TX queue
if the sock argument is non-NULL and TX queue is recorded, else it
returns -1.  sk_tx_queue_recorded is no longer used so it can be
completely removed.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/net/sock.h b/include/net/sock.h
index 3100e71..a441c9c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1226,12 +1226,7 @@ static inline void sk_tx_queue_clear(struct sock *sk)

 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-	return sk->sk_tx_queue_mapping;
-}
-
-static inline bool sk_tx_queue_recorded(const struct sock *sk)
-{
-	return (sk && sk->sk_tx_queue_mapping >= 0);
+	return sk ? sk->sk_tx_queue_mapping : -1;
 }

 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
diff --git a/net/core/dev.c b/net/core/dev.c
index e2b9fa2..f071252 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2053,12 +2053,11 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	u16 queue_index;
+	int queue_index;
 	struct sock *sk = skb->sk;

-	if (sk_tx_queue_recorded(sk)) {
-		queue_index = sk_tx_queue_get(sk);
-	} else {
+	queue_index = sk_tx_queue_get(sk);
+	if (queue_index < 0) {
 		const struct net_device_ops *ops = dev->netdev_ops;

 		if (ops->ndo_select_queue) {

^ permalink raw reply related

* RE: Splice status
From: Ofer Heifetz @ 2010-07-15  3:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Jens Axboe, netdev@vger.kernel.org
In-Reply-To: <1279030308.2634.349.camel@edumazet-laptop>

Hi,

I managed to get splice to use up to 64K per call, which looks to me like a Samba limitation (the SO_RCVBUF setting in smb.conf, I think), but I still do not see any performance improvement from splice. The write numbers with splice are about the same as for regular read/write, even though splice avoids copy_to_user and copy_from_user.

-Ofer

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Tuesday, July 13, 2010 5:12 PM
To: Ofer Heifetz
Cc: Changli Gao; Jens Axboe; netdev@vger.kernel.org
Subject: RE: Splice status

On Tuesday, 13 July 2010 at 14:41 +0300, Ofer Heifetz wrote:
> Hi,
> 
> I wanted to let you know that I have been testing Samba splice on a Marvell 6282 SoC on 2.6.35_rc3 and noticed that it gave worse performance than not using it. I also noticed that iowait is high when rewriting a file.
> 
> iometer using 2G file (file is created before test)
> 
> Splice  write cpu% iow%
> -----------------------
>  No     58    98    0
> Yes     14   100   48
> 
> iozone using 2G file (file created during test)
> 
> Splice  write cpu% iow%  re-write cpu% iow%  
> -------------------------------------------
>  No     35    85    4    58.2     70    0
> Yes     33    85    4    15.7    100   58
> 
> Any clue why splice introduces a high iowait?
> I noticed samba uses up to 16K per splice syscall, changing the samba to try more did not help, so I guess it is a kernel limitation.
> 

splice(socket -> pipe) provides partial buffers (depending on the MTU)

With typical MTU=1500 and tcp timestamps, each network frame contains
1448 bytes of payload, partially filling one page (of 4096 bytes)

When doing the splice(pipe -> file), kernel has to coalesce partial
data, but amount of written data per syscall() is small (about 20
Kbytes)

Without splice(), the write() syscall provides more data, and vfs
overhead is smaller as buffer size is a power of two.

Samba uses a 128 KBytes TRANSFER_BUF_SIZE in its default_sys_recvfile()
implementation, it easily outperforms splice() implementation.

You could try extending pipe size (fcntl(fd, F_SETPIPE_SZ, 256)), maybe
it will be a bit better. (and ask 256*4096 bytes to splice())

I tried this and got about 256Kbytes per splice() call...

# perf report
# Events: 13K
#
# Overhead         Command      Shared Object  Symbol
# ........  ..............  .................  ......
#
     8.69%  splice-fromnet  [kernel.kallsyms]  [k] memcpy
     3.82%  splice-fromnet  [kernel.kallsyms]  [k] kunmap_atomic
     3.51%  splice-fromnet  [kernel.kallsyms]  [k] __block_prepare_write
     2.79%  splice-fromnet  [kernel.kallsyms]  [k] __skb_splice_bits
     2.58%  splice-fromnet  [kernel.kallsyms]  [k] ext3_mark_iloc_dirty
     2.45%  splice-fromnet  [kernel.kallsyms]  [k] do_get_write_access
     2.04%  splice-fromnet  [kernel.kallsyms]  [k] __find_get_block
     1.89%  splice-fromnet  [kernel.kallsyms]  [k] _raw_spin_lock
     1.83%  splice-fromnet  [kernel.kallsyms]  [k] journal_add_journal_head
     1.46%  splice-fromnet  [bnx2x]            [k] bnx2x_rx_int
     1.46%  splice-fromnet  [kernel.kallsyms]  [k] kfree
     1.42%  splice-fromnet  [kernel.kallsyms]  [k] journal_put_journal_head
     1.29%  splice-fromnet  [kernel.kallsyms]  [k] __ext3_get_inode_loc
     1.26%  splice-fromnet  [kernel.kallsyms]  [k] journal_dirty_metadata
     1.25%  splice-fromnet  [kernel.kallsyms]  [k] page_address
     1.20%  splice-fromnet  [kernel.kallsyms]  [k] journal_cancel_revoke
     1.15%  splice-fromnet  [kernel.kallsyms]  [k] tcp_read_sock
     1.09%  splice-fromnet  [kernel.kallsyms]  [k] unlock_buffer
     1.09%  splice-fromnet  [kernel.kallsyms]  [k] pipe_to_file
     1.05%  splice-fromnet  [kernel.kallsyms]  [k] radix_tree_lookup_element
     1.04%  splice-fromnet  [kernel.kallsyms]  [k] kmap_atomic_prot
     1.04%  splice-fromnet  [kernel.kallsyms]  [k] kmem_cache_free
     1.03%  splice-fromnet  [kernel.kallsyms]  [k] kmem_cache_alloc
     1.01%  splice-fromnet  [bnx2x]            [k] bnx2x_poll
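
The pipe-size suggestion above can be sketched as follows. This is a minimal Linux-only example, not Samba code: grow_pipe is a hypothetical helper, and it assumes the interface documented in the fcntl(2) man page, where F_SETPIPE_SZ takes a capacity in bytes. The 256 in the message reads as a page count, so the equivalent request here would be 256 * 4096 bytes. The numeric fallbacks are the Linux ABI values, used only if the libc headers do not expose the constants.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#endif
#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032
#endif

/*
 * Ask the kernel to grow the pipe behind pipe_fd to at least want_bytes.
 * Returns the resulting capacity in bytes, or -1 with errno set.
 */
int grow_pipe(int pipe_fd, int want_bytes)
{
	if (fcntl(pipe_fd, F_SETPIPE_SZ, want_bytes) < 0)
		return -1;

	/* The kernel may round the request up, so read back the real size. */
	return fcntl(pipe_fd, F_GETPIPE_SZ);
}
```

A caller would create the pipe with pipe(), call grow_pipe() on either end, and then pass the returned capacity as the length argument of each splice() call. Unprivileged requests above /proc/sys/fs/pipe-max-size are refused.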



^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Bill Fink @ 2010-07-15  3:49 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: David Miller, rick.jones2, lists, davidsen, linux-kernel, netdev
In-Reply-To: <20100714221301.GI6682@nuttenaction>

On Thu, 15 Jul 2010, Hagen Paul Pfeifer wrote:

> * David Miller | 2010-07-14 14:55:47 [-0700]:
> 
> >Although section 3 of RFC 5681 is a great text, it does not say at all
> >that increasing the initial CWND would lead to fairness issues.
> 
> Because it is only one side of the coin; conservatively probing the available
> link capacity in conjunction with n simultaneously probing TCP/SCTP/DCCP
> instances is the other.
> 
> >To be honest, I think google's proposal holds a lot of weight.  If
> >over time link sizes and speeds are increasing (they are) then nudging
> >the initial CWND every so often is a legitimate proposal.  Were
> >someone to claim that utilization is lower than it could be because of
> >the currently specified initial CWND, I would have no problem
> >believing them.
> >
> >And I'm happy to make Linux use an increased value once it has
> >traction in the standardization community.
> 
> Currently I know of no working approach, without active network
> feedback, for conservatively probing the available link capacity with a
> high CWND. I am curious about any future trends.

A long, long time ago, I suggested a Path BW Discovery mechanism
to the IETF, analogous to the Path MTU Discovery mechanism, but
it didn't get any traction.  Such information could be extremely
useful to TCP endpoints, to determine a maximum window size to
use, to effectively rate limit a much stronger sender from
overpowering a much weaker receiver (for example 10-GigE -> GigE),
resulting in abominable performance across large RTT paths
(as low as 12 Mbps), even in the absence of any real network
contention.

						-Bill

^ permalink raw reply

* Re: [PATCH] net: fix problem in reading sock TX queue
From: David Miller @ 2010-07-15  3:50 UTC (permalink / raw)
  To: therbert; +Cc: netdev
In-Reply-To: <alpine.DEB.1.00.1007142025430.22782@pokey.mtv.corp.google.com>

From: Tom Herbert <therbert@google.com>
Date: Wed, 14 Jul 2010 20:48:08 -0700 (PDT)

> Fix problem in reading the tx_queue recorded in a socket.  In
> dev_pick_tx, the TX queue is read by doing a check with
> sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
> The problem is that there is no mutual exclusion across these
> calls in the socket, so it is possible that the queue in the
> sock can be invalidated after sk_tx_queue_recorded is called so
> that sk_tx_queue_get returns -1, which sets 65535 in queue_index
> and thus dev_pick_tx returns 65536 which is a bogus queue and
> can cause crash in dev_queue_xmit.
> 
> We fix this by only calling sk_tx_queue_get which does the proper
> checks.  The interface is that sk_tx_queue_get returns the TX queue
> if the sock argument is non-NULL and TX queue is recorded, else it
> returns -1.  sk_tx_queue_recorded is no longer used so it can be
> completely removed.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>

Applied, thanks Tom!
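
The two halves of the bug can be shown in a small userspace model. This is a sketch, not the kernel code: struct sock_model and the three functions are hypothetical names. It shows that storing -1 in a 16-bit unsigned variable yields 65535, and that reading the mapping once and testing the signed result leaves no window between a check and a later read.

```c
#include <stddef.h>
#include <stdint.h>

struct sock_model {
	int tx_queue_mapping;	/* -1 means no TX queue is recorded */
};

/* What happens when a signed "no queue" value lands in a u16. */
uint16_t store_as_u16(int queue)
{
	return (uint16_t)queue;
}

/* Single accessor: the recorded queue, or -1 if sk is NULL or nothing is recorded. */
int tx_queue_get(const struct sock_model *sk)
{
	return sk ? sk->tx_queue_mapping : -1;
}

/* Read the mapping exactly once, then decide based on that one value. */
int pick_queue(const struct sock_model *sk, int fallback)
{
	int queue_index = tx_queue_get(sk);

	if (queue_index < 0)
		queue_index = fallback;

	return queue_index;
}
```

Keeping queue_index as an int until after the sign test is what lets the -1 sentinel survive long enough to be detected.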

^ permalink raw reply

* Re: [PATCH] fec: use interrupt for MDIO completion indication
From: Baruch Siach @ 2010-07-15  4:09 UTC (permalink / raw)
  To: Bryan Wu; +Cc: netdev, linux-arm-kernel, Sascha Hauer, Greg Ungerer,
	Wolfram Sang
In-Reply-To: <4C3E812C.10303@canonical.com>

Hi Bryan,

On Thu, Jul 15, 2010 at 11:31:56AM +0800, Bryan Wu wrote:
> Thanks for this patch, we tested on our i.MX51 board with Ubuntu. It works 
> fine.
> 
> Wolfram, you can pick up this, too. -;)

Dave has already applied this patch to his net-next tree.

baruch

> On 07/12/2010 03:12 PM, Baruch Siach wrote:
> > With the move to phylib (commit e6b043d) I was seeing sporadic "MDIO write
> > timeout" messages. Measure of the actual time spent showed latency times of
> > more than 1600us.
> > 
> > This patch uses the MII event indication of the FEC hardware to detect
> > completion of MDIO transactions.
> > 
> > Signed-off-by: Baruch Siach <baruch@tkos.co.il>
> > ---
> >  drivers/net/fec.c |   55 ++++++++++++++++++++++++----------------------------
> >  1 files changed, 25 insertions(+), 30 deletions(-)
> > 

-- 
                                                     ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.2.679.5364, http://www.tkos.co.il -

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Tom Herbert @ 2010-07-15  4:12 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Rick Jones, Ed W, David Miller, davidsen, linux-kernel, netdev,
	Jerry Chu, Nandita Dukkipati
In-Reply-To: <20100714203919.GD6682@nuttenaction>

On Wed, Jul 14, 2010 at 1:39 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> * Rick Jones | 2010-07-14 13:17:24 [-0700]:
>
>>There is an effort under way, led by some folks at Google and
>>including some others, to get the RFCs enhanced in support of the
>>concept of larger initial congestion windows.  Some of the discussion
>>may be in the "tcpm" mailing list (assuming I've not gotten my
>>mailing lists confused).  There may be some previous discussion of
>>that work in the netdev archives as well.
>
> tcpm is the right mailing list but there is currently no effort to develop
> this topic. Why? Because it is not a standardization issue, rather it is a
> technical issue. You cannot raise the initial CWND and expect fair behavior.
> This was discussed several times and is documented in several documents and
> RFCs.
>
> RFC 5681 Section 3.1. Google employees should start with Section 3. This topic
> pops up every two months in netdev and until now I _never_ read a
> consolidated contribution.
>

There is an Internet draft
(http://datatracker.ietf.org/doc/draft-hkchu-tcpm-initcwnd/) on
raising the default Initial Congestion window to 10 segments, as well
as a SIGCOMM paper (http://ccr.sigcomm.org/online/?q=node/621).  We
presented this proposal and data supporting it at Anaheim IETF, and
will be following up in Netherlands with more data including some of
which should further address fairness questions.

In terms of Linux implementation, setting ICW via ip route is
sufficient support on the server side.  There is also a proposed patch
which could allow applications to set ICW themselves (in hopes that
application can reduce number of simultaneous connections).  On the
client side we can now adjust the receive window to advertise larger
initial windows.  Among current implementations, Linux advertises the
smallest default receive window of major OSes, so it turns out Linux
clients won't get lower latency benefits currently (so we'll probably
ask to raise the default some day :-)).

Tom

> Partial local issues can already be "fixed" via route specific ip options -
> see initcwnd.
>
> HGN
>
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: H.K. Jerry Chu @ 2010-07-15  4:51 UTC (permalink / raw)
  To: David Miller; +Cc: davidsen, lists, linux-kernel, netdev
In-Reply-To: <20100714.111553.104052157.davem@davemloft.net>

On Wed, Jul 14, 2010 at 11:15 AM, David Miller <davem@davemloft.net> wrote:
> From: Bill Davidsen <davidsen@tmr.com>
> Date: Wed, 14 Jul 2010 11:21:15 -0400
>
>> You may have to go into /proc/sys/net/core and crank up the
>> rmem_* settings, depending on your distribution.
>
> You should never, ever, have to touch the various networking sysctl
> values to get good performance in any normal setup.  If you do, it's a
> bug, report it so we can fix it.

Agreed, except there are indeed bugs in the code today in that the
code in various places assumes initcwnd as per RFC 3390. So when
initcwnd is raised, the actual value may be limited unnecessarily by
the initial wmem/sk_sndbuf.

Will try to find time to submit a patch.

Jerry

>
> I cringe every time someone says to do this, so please do me a favor
> and don't spread this further. :-)
>
> For one thing, TCP dynamically adjusts the socket buffer sizes based
> upon the behavior of traffic on the connection.
>
> And the TCP memory limit sysctls (not the core socket ones) are sized
> based upon available memory.  They are there to protect you from
> situations such as having so much memory dedicated to socket buffers
> that there is none left to do other things effectively.  It's a
> protective limit, rather than a setting meant to increase or improve
> performance.  So like the others, leave these alone too.

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: H.K. Jerry Chu @ 2010-07-15  5:09 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Rick Jones, Ed W, David Miller, davidsen, linux-kernel, netdev
In-Reply-To: <20100714203919.GD6682@nuttenaction>

On Wed, Jul 14, 2010 at 1:39 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> * Rick Jones | 2010-07-14 13:17:24 [-0700]:
>
>>There is an effort under way, led by some folks at Google and
>>including some others, to get the RFCs enhanced in support of the
>>concept of larger initial congestion windows.  Some of the discussion
>>may be in the "tcpm" mailing list (assuming I've not gotten my
>>mailing lists confused).  There may be some previous discussion of
>>that work in the netdev archives as well.
>
> tcpm is the right mailing list but there is currently no effort to develop
> this topic. Why? Because it is not a standardization issue, rather it is a

Please don't mislead. Raising the initcwnd is actively being pursued at the IETF
right now. If not here, where else? It is following the same path by which initcwnd
was first raised in the late '90s through RFC 2414/RFC 3390.

The IETF is not a standards organization just for protocol lawyers to play
word games. It is responsible for solving real technical issues as well.

Jerry

> technical issue. You cannot raise the initial CWND and expect fair behavior.
> This was discussed several times and is documented in several documents and
> RFCs.
>
> RFC 5681 Section 3.1. Google employees should start with Section 3. This topic
> pops up every two months in netdev and until now I _never_ read a
> consolidated contribution.
>
> Partial local issues can already be "fixed" via route specific ip options -
> see initcwnd.
>
> HGN

^ permalink raw reply

* Re: [PATCH net-next-2.6] xfrm: cleanup of xfrm_input.c. (resend)
From: Rami Rosen @ 2010-07-15  5:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 997 bytes --]

 Hi,
  The patch removes unneeded inclusion of header files
 (linux/module.h, linux/netdevice.h, net/dst.h and net/ip.h) and adds inclusion
of linux/skbuff.h instead, in net/xfrm/xfrm_input.c.

 Regards,
 Rami Rosen


On Thu, Jul 15, 2010 at 3:59 AM, David Miller <davem@davemloft.net> wrote:
> From: Rami Rosen <ramirose@gmail.com>
> Date: Wed, 14 Jul 2010 11:18:41 +0300
>
>> Hi,
>>  The patch removes unneeded inclusion of header files
>> (linux/module.h, linux/netdevice.h, net/dst.h and net/ip.h)
>>  in net/xfrm/xfrm_input.c
>>
>> Regards,
>> Rami Rosen
>>
>> Signed-off-by: Rami Rosen <ramirose@gmail.com>
>
> If you do this, I also want to see you add includes for things like
> linux/skbuff.h since data structures such as "struct sk_buff"
> are used in this file.
>
> Otherwise, this is how we end up with obscure build failures on
> some configurations and not others, either now or in the future
> when a similar change is made to some header file.
>
>

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 464 bytes --]

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 45f1c98..c87aec8 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -6,12 +6,9 @@
  * 		Split up af-specific portion
  *
  */
-
+ 
+#include <linux/skbuff.h>
 #include <linux/slab.h>
-#include <linux/module.h>
-#include <linux/netdevice.h>
-#include <net/dst.h>
-#include <net/ip.h>
 #include <net/xfrm.h>
 
 static struct kmem_cache *secpath_cachep __read_mostly;

^ permalink raw reply related

* RE: [PATCHv3 NEXT 0/5]qlcnic: aer state
From: Amit Salecha @ 2010-07-15  5:24 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, Ameen Rahman
In-Reply-To: <20100714.103147.193705183.davem@davemloft.net>

Thanks.

-----Original Message-----
From: David Miller [mailto:davem@davemloft.net] 
Sent: Wednesday, July 14, 2010 11:02 PM
To: Amit Salecha
Cc: netdev@vger.kernel.org; Ameen Rahman
Subject: Re: [PATCHv3 NEXT 0/5]qlcnic: aer state

From: amit.salecha@qlogic.com
Date: Tue, 13 Jul 2010 23:33:30 -0700

>   I was under the impression that using a space after a tab is illegal.

What's undesirable is using spaces instead of tabs for fresh
code-block lines like:

	foo();

Also, it is undesirable to have spaces mixed into the middle of a
sequence of tab characters.  The spaces, when used to align expression
continuation lines or lists of function arguments, should be at the
end.

So in cases like:

	if (foo &&
	    bar)

The second line should be a tab character, then the spaces to make
the alignment happen.
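
As an illustration of the ordering rule described above (tabs first, alignment spaces last), here is a minimal sketch of a check on a line's leading whitespace. The function name `indent_is_tabs_then_spaces` is hypothetical and is not taken from any kernel tool. The sketch only covers the "no spaces in the middle of a tab sequence" rule; it does not try to decide whether a line is a fresh code-block line that should use tabs only, since that needs context from the previous line.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: return true when the leading whitespace of `line`
 * is zero or more tabs followed by zero or more spaces, so that no space
 * appears before a tab in the indentation.
 */
bool indent_is_tabs_then_spaces(const char *line)
{
	size_t i = 0;

	/* Skip the leading run of tabs. */
	while (line[i] == '\t')
		i++;
	/* Skip the alignment spaces that may follow the tabs. */
	while (line[i] == ' ')
		i++;
	/* A tab at this point means spaces were mixed into the tabs. */
	return line[i] != '\t';
}
```

Under this check, both lines of the `if (foo &&` example pass, while a line indented with tab, spaces, tab would be rejected.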

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: H.K. Jerry Chu @ 2010-07-15  5:29 UTC (permalink / raw)
  To: Bill Fink
  Cc: Hagen Paul Pfeifer, David Miller, rick.jones2, lists, davidsen,
	linux-kernel, netdev
In-Reply-To: <20100714234917.924f420d.billfink@mindspring.com>

On Wed, Jul 14, 2010 at 8:49 PM, Bill Fink <billfink@mindspring.com> wrote:
> On Thu, 15 Jul 2010, Hagen Paul Pfeifer wrote:
>
>> * David Miller | 2010-07-14 14:55:47 [-0700]:
>>
>> >Although section 3 of RFC 5681 is a great text, it does not say at all
>> >that increasing the initial CWND would lead to fairness issues.
>>
>> Because that is only one side of the coin; conservatively probing the available
>> link capacity in conjunction with n simultaneously probing TCP/SCTP/DCCP
>> instances is the other.
>>
>> >To be honest, I think google's proposal holds a lot of weight.  If
>> >over time link sizes and speeds are increasing (they are) then nudging
>> >the initial CWND every so often is a legitimate proposal.  Were
>> >someone to claim that utilization is lower than it could be because of
>> >the currently specified initial CWND, I would have no problem
>> >believing them.
>> >
>> >And I'm happy to make Linux use an increased value once it has
>> >traction in the standardization community.
>>
>> Currently I know of no working link capacity probing approach, without active
>> network feedback, that conservatively probes the available link capacity with a
>> high CWND. I am curious about any future trends.
>
> A long, long time ago, I suggested a Path BW Discovery mechanism
> to the IETF, analogous to the Path MTU Discovery mechanism, but
> it didn't get any traction.  Such information could be extremely
> useful to TCP endpoints, to determine a maximum window size to
> use, to effectively rate limit a much stronger sender from
> overpowering a much weaker receiver (for example 10-GigE -> GigE),
> resulting in abominable performance across large RTT paths
> (as low as 12 Mbps), even in the absence of any real network
> contention.

Unfortunately that is not going to help initcwnd (unless one can invent a
PBWD protocol from just 3WHS), and the web is dominated by short-lived
connections so the small initcwnd becomes a choke point.
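
To make the "choke point" argument concrete, here is a rough sketch that counts how many round trips an idealized slow start needs to deliver a short response. It assumes the window doubles every round trip, with no loss, no delayed ACKs, and no receiver window limit; real stacks behave differently. The function name `slow_start_rounds` is hypothetical, and the initial window values 3 and 10 used below are only illustrative.

```c
/*
 * Hypothetical model: number of round trips needed to send `segments`
 * full-sized segments when the congestion window starts at `initcwnd`
 * segments and doubles every round trip (idealized slow start).
 * Returns 0 when there is nothing to send or when initcwnd is 0.
 */
unsigned int slow_start_rounds(unsigned int segments, unsigned int initcwnd)
{
	unsigned long long sent = 0;
	unsigned long long cwnd = initcwnd;
	unsigned int rounds = 0;

	if (initcwnd == 0)
		return 0;

	while (sent < segments) {
		sent += cwnd;	/* one window's worth goes out per round trip */
		cwnd *= 2;	/* every ACKed segment grows the window by one */
		rounds++;
	}
	return rounds;
}
```

Under these simplifications, a 30-segment response takes 4 round trips from an initial window of 3 segments and 2 round trips from an initial window of 10.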

Jerry

>
>                                                -Bill
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH repost] sched: export sched_set/getaffinity to modules
From: Sridhar Samudrala @ 2010-07-15  5:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michael S. Tsirkin, Peter Zijlstra, Tejun Heo, Ingo Molnar,
	netdev, lkml, kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev,
	Jiri Kosina, Thomas Gleixner, Andi Kleen
In-Reply-To: <20100715000506.GB27258@redhat.com>

On 7/14/2010 5:05 PM, Oleg Nesterov wrote:
> On 07/14, Sridhar Samudrala wrote:
>    
>> OK. So we want to create a thread that is a child of kthreadd, but inherits the cgroup/cpumask
>> from the caller. How about an exported kthread function kthread_create_in_current_cg()
>> that does this?
>>      
> Well. I must admit, this looks a bit strange to me ;)
>
> Instead of exporting sched_xxxaffinity() we export the new function
> which calls them. And I don't think this new helper is very useful
> in general. Maybe I am wrong...
>    
If we agree on exporting sched_xxxaffinity() functions, we don't need
this new kthread function and we can do the same in vhost as the
original patch did.

Thanks
Sridhar

^ permalink raw reply

* Re: [PATCH net-next-2.6] xfrm: cleanup of xfrm_input.c. (resend)
From: David Miller @ 2010-07-15  5:42 UTC (permalink / raw)
  To: ramirose; +Cc: netdev
In-Reply-To: <AANLkTim7fVRtycf2B07iI7U4chv2NqD267dvunZJYqPj@mail.gmail.com>

From: Rami Rosen <ramirose@gmail.com>
Date: Thu, 15 Jul 2010 08:21:03 +0300

>  Hi,
>   The patch removes unneeded inclusion of header files
>  (linux/module.h, linux/netdevice.h, net/dst.h and net/ip.h) and adds inclusion
> of linux/skbuff.h instead, in net/xfrm/xfrm_input.c.

Rami, please don't process my feedback like an automaton. :-/

I specifically mentioned linux/skbuff.h because that happened to be
the very first obvious header dependency I found in the xfrm_input.c
file.

There are almost certainly others, and I'd like for you to take the
time to skim over the file and figure out what those might be.

Also, when submitting a new version of a patch, do not hijack the
existing thread.  Instead, make a full fresh submission with a
plain fresh subject line, commit log message, signoff, and patch.

^ permalink raw reply

* Re: [5/5] Remove REDWOOD_5 and REDWOOD_6 config options and conditional code
From: Milton Miller @ 2010-07-15  7:42 UTC (permalink / raw)
  To: Christian Dietrich
  Cc: Josh Boyer, Matt Porter, Benjamin Herrenschmidt, Paul Mackerras,
	Solomon Peachy, David Woodhouse, Mike Frysinger, Jiri Kosina,
	Artem Bityutskiy, Alexander Kurz, John Linn, David S. Miller,
	Randy Dunlap, Florian Fainelli, linuxppc-dev, linux-kernel,
	linux-mtd, netdev, vamos-dev
In-Reply-To: <4f07b3092cafbbba37d61d367cc7484e24d18d2a.1279116162.git.qy03fugy@stud.informatik.uni-erlangen.de>

On Wed, 14 Jul 2010 about 04:05:05 -0000, Christian Dietrich wrote:
> 
> The config options for REDWOOD_[56] were commented out in the powerpc
> Kconfig. The ifdefs referencing these options are therefore dead and all
> references to them can be removed (also dependencies in other Kconfig
> files).
> 
> Signed-off-by: Christian Dietrich <qy03fugy@stud.informatik.uni-erlangen.de>
> 
> ---
> arch/powerpc/platforms/40x/Kconfig |   16 -------------
>  drivers/mtd/maps/Kconfig           |    2 +-
>  drivers/mtd/maps/redwood.c         |   43 ------------------------------------
>  drivers/net/Kconfig                |    2 +-
>  4 files changed, 2 insertions(+), 61 deletions(-)

> diff --git a/drivers/mtd/maps/Kconfig b/drivers/mtd/maps/Kconfig
> index f22bc9f..b5ebb72 100644
> --- a/drivers/mtd/maps/Kconfig
> +++ b/drivers/mtd/maps/Kconfig
> @@ -321,7 +321,7 @@ config MTD_CFI_FLAGADM
>  
>  config MTD_REDWOOD
>  	tristate "CFI Flash devices mapped on IBM Redwood"
> -	depends on MTD_CFI && ( REDWOOD_4 || REDWOOD_5 || REDWOOD_6 )
> +	depends on MTD_CFI && REDWOOD_4
>  	help
>  	  This enables access routines for the flash chips on the IBM
>  	  Redwood board. If you have one of these boards and would like to

REDWOOD_4 does not appear to be in the tree either, so this mapping driver
should be deleted if the patch is otherwise acceptable.  Besides, we
would express the info contained in this simple map driver in the device
tree using physmap_of.

milton

^ permalink raw reply

* Re: Raise initial congestion window size / speedup slow start?
From: Ed W @ 2010-07-15  7:48 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Hagen Paul Pfeifer, Rick Jones, David Miller, davidsen,
	linux-kernel, netdev, Jerry Chu, Nandita Dukkipati
In-Reply-To: <AANLkTikTyZGQWWIBzYSQWpUK30xxbMXLbJXeHahWnlIi@mail.gmail.com>

On 15/07/2010 05:12, Tom Herbert wrote:
> There is an Internet draft
> (http://datatracker.ietf.org/doc/draft-hkchu-tcpm-initcwnd/) on
> raising the default Initial Congestion window to 10 segments, as well
> as a SIGCOMM paper (http://ccr.sigcomm.org/online/?q=node/621).
>    

You guys have obviously done a lot of work on this, however, it seems 
that there is a case for introducing some heuristics into the choice of 
init cwnd as well as offering the option to go larger?  An initial size 
of 10 packets is just another magic number that obviously works with the 
median bandwidth delay product on today's networks - can we not do 
better still?

Seems like a bunch of clever folks have already suggested tweaks to the 
steady stage congestion avoidance, but so far everyone is afraid to 
touch the early stage heuristics?

Also would you guys not benefit from wider deployment of ECN?  Can you 
not help find some ways that deployment could be increased?  At present 
there are big warnings all over the option that it causes some problems, 
but there is no quantification of how much and really whether this 
warning is still appropriate?

Ed W

^ permalink raw reply

* Re: [PATCH 01/11] Removing dead RT2800PCI_SOC
From: Bartlomiej Zolnierkiewicz @ 2010-07-15  8:41 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: John W. Linville, Ivo Van Doorn, Christoph Egger,
	Gertjan van Wingerde, Helmut Schaa,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	users-poMEt7QlJxcwIE2E9O76wjtx2kNaKg5H,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	vamos-dev-DSeRVwK2yFR7NQr57o4jwX20dTPRyWU8FLXUG6abMr4,
	Luis Correia
In-Reply-To: <4C3DCD5C.1080705-p3rKhJxN3npAfugRpC6u6w@public.gmane.org>

On Wednesday 14 July 2010 04:44:44 pm Felix Fietkau wrote:
> On 2010-07-14 3:15 PM, John W. Linville wrote:
> > On Wed, Jul 14, 2010 at 02:52:14PM +0200, Ivo Van Doorn wrote:
> >> On Wed, Jul 14, 2010 at 2:46 PM, Luis Correia <luis.f.correia-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> > On Wed, Jul 14, 2010 at 13:39, Christoph Egger <siccegge-n+aIp8eCc/CzQB+pC5nmwQ@public.gmane.org> wrote:
> >> >> While RT2800PCI_SOC exists in Kconfig, it depends on either
> >> >> RALINK_RT288X or RALINK_RT305X which are both not available in Kconfig
> >> >> so all Code depending on that can't ever be selected and, if there's
> >> >> no plan to add these options, should be cleaned up
> >> >>
> >> >> Signed-off-by: Christoph Egger <siccegge-n+aIp8eCc/CzQB+pC5nmwQ@public.gmane.org>
> >> >
> >> > NAK,
> >> >
> >> > this is not dead code, it is needed for the Ralink System-on-Chip
> >> > Platform devices.
> >> >
> >> > While I can't fix Kconfig errors and the current KConfig file may be
> >> > wrong, this code cannot and will not be deleted.
> >> 
> >> When the config option was introduced, the config options RALINK_RT288X and
> >> RALINK_RT305X were supposed to be merged as well soon after by somebody (Felix?)
> >> 
> >> But since testing is done on SoC boards by Helmut and Felix, I assume the code
> >> isn't dead but actually in use.
> > 
> > Perhaps Helmut and Felix can send us the missing code?
> The missing code is a MIPS platform port, which is currently being
> maintained in OpenWrt, but is not ready for upstream submission yet.
> I'm not working on this code at the moment, but I think it will be
> submitted once it's ready.

People are using automatic scripts to catch unused config options nowadays
so the issue is quite likely to come back again sooner or later..

Would it be possible to improve the situation somehow until the missing
parts get merged?  Maybe by adding a tiny comment documenting the
RT2800PCI_SOC situation to Kconfig (if the config option itself really
cannot be removed) until all code is ready etc.?

I bet that Christoph would be willing to update his patch if you ask him
nicely..

Thanks,
--
Bartlomiej Zolnierkiewicz

^ permalink raw reply

* Re: [PATCH] netfilter: xtables: userspace notification target
From: Patrick McHardy @ 2010-07-15  9:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Luciano Coelho, Changli Gao, Samuel Ortiz, David S. Miller,
	netdev@vger.kernel.org, netfilter-devel@vger.kernel.org
In-Reply-To: <4C3DE715.8070502@netfilter.org>

Am 14.07.2010 18:34, schrieb Pablo Neira Ayuso:
> Hi Luciano,
> 
> On 14/07/10 14:22, Luciano Coelho wrote:
>> On Wed, 2010-07-14 at 13:48 +0200, ext Patrick McHardy wrote:
>>> If you're using connection tracking, you can use conntrack marks
>>> to avoid sending more than a single message:
>>>
>>> iptables ... -m connmark --mark 0x1/0x1 -j RETURN
>>> iptables ... -j NFLOG ...
>>> iptables ... -j CONNMARK --set-mark 0x1/0x1
>>
>> Cool, thanks.
>>
>> It seems that there are lots of possibilities to get this to work, but
>> this is starting to get quite complex.  I would still prefer having the
>> NFNOTIF module included, since we would be able to do what we want in a
>> very simple way.  It's also probably much more efficient than using
>> several rules, which would increase the CPU usage considerably (in our
>> device we are already reaching the limit of a reasonable CPU resource
>> usage with high throughput WLAN connections).

It's hard to believe that a connmark match filtering out notifications
would require more CPU time than doing the same in a new target module.

>> While I agree that it is possible to achieve the NFNOTIF functionality
>> with existing modules, I still think there is a "niche" for such module,
>> because it is very simple, has a very clear purpose and would make the
>> ruleset simpler and more efficient.
>>
>> Does this make any sense?
> 
> I don't think that the NFNOTIF infrastructure fulfills the policy for
> inclusion. It seems to me like something quite specific to your needs.
> It is simple, yes, but we already have this feature in the kernel. I
> don't think that this will reduce CPU usage considerably with regards to
> the NFLOG way.
> 
> I would still prefer adding the once-per-matching notification feature
> to NFLOG than these extra lines in the kernel, Patrick?

I agree with Pablo.

^ permalink raw reply

* Re: [PATCH] netfilter: xtables: userspace notification target
From: Luciano Coelho @ 2010-07-15  9:18 UTC (permalink / raw)
  To: ext Patrick McHardy
  Cc: Pablo Neira Ayuso, Changli Gao, Samuel Ortiz, David S. Miller,
	netdev@vger.kernel.org, netfilter-devel@vger.kernel.org
In-Reply-To: <4C3ECF6D.50409@trash.net>

On Thu, 2010-07-15 at 11:05 +0200, ext Patrick McHardy wrote:
> Am 14.07.2010 18:34, schrieb Pablo Neira Ayuso:
> >> It seems that there are lots of possibilities to get this to work, but
> >> this is starting to get quite complex.  I would still prefer having the
> >> NFNOTIF module included, since we would be able to do what we want in a
> >> very simple way.  It's also probably much more efficient than using
> >> several rules, which would increase the CPU usage considerably (in our
> >> device we are already reaching the limit of a reasonable CPU resource
> >> usage with high throughput WLAN connections).
> 
> It's hard to believe that a connmark match filtering out notifications
> would require more CPU time than doing the same in a new target module.

Okay, you have convinced me. :) I studied connmark a bit more and now I
realize that it won't take more CPU.  In the solution with connmark that
you proposed the packets coming from a connection that is already marked
will be quickly returned to normal processing, so it will be fairly
efficient and certainly not more CPU hungry than the NFNOTIF.


> >> While I agree that it is possible to achieve the NFNOTIF functionality
> >> with existing modules, I still think there is a "niche" for such module,
> >> because it is very simple, has a very clear purpose and would make the
> >> ruleset simpler and more efficient.
> >>
> >> Does this make any sense?
> > 
> > I don't think that the NFNOTIF infrastructure fulfills the policy for
> > inclusion. It seems to me like something quite specific to your needs.
> > It is simple, yes, but we already have this feature in the kernel. I
> > don't think that this will reduce CPU usage considerably with regards to
> > the NFLOG way.
> > 
> > I would still prefer adding the once-per-matching notification feature
> > to NFLOG than these extra lines in the kernel, Patrick?
> 
> I agree with Pablo.

I have to admit that you're right here again.  I think that it will not
be necessary to make this change in the NFLOG, since the connmark
solution is actually pretty clear too.  If needed, I'll make this simple
change in the NFLOG module and submit.

Thanks for your help.


-- 
Cheers,
Luca.


^ permalink raw reply
