Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v3 1/1 net-next] net: fec: put tx to napi poll function to fix dead lock
From: Shawn Guo @ 2013-03-20 11:48 UTC (permalink / raw)
  To: Frank Li; +Cc: B38611, netdev, s.hauer, lznuaa, davem, linux-arm-kernel
In-Reply-To: <1362368065-5809-1-git-send-email-Frank.Li@freescale.com>

Frank,

I'm running v3.9-rc3 image with NFS on imx6q, and seeing system fail
to resume back.  The resume seems being stopped by fec driver.  If you
wait long enough, you might see the repeated "eth0: tx queue full!."
error message.

Reverting the patch gets resume back to work.  Can you please
investigate?

Shawn

On Mon, Mar 04, 2013 at 11:34:25AM +0800, Frank Li wrote:
> up stack ndo_start_xmit already hold lock.
> fec_enet_start_xmit needn't spin lock.
> stat_xmit just update fep->cur_tx
> fec_enet_tx just update fep->dirty_tx
> 
> Reserve a empty bdb to check full or empty
> cur_tx == dirty_tx    means full
> cur_tx == dirty_tx +1 means empty
> 
> So needn't is_full variable.
> 
> Fix spin lock deadlock
> =================================
> [ INFO: inconsistent lock state ]
> 3.8.0-rc5+ #107 Not tainted
> ---------------------------------
> inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> ptp4l/615 [HC1[1]:SC0[0]:HE0:SE1] takes:
>  (&(&list->lock)->rlock#3){?.-...}, at: [<8042c3c4>] skb_queue_tail+0x20/0x50
>  {HARDIRQ-ON-W} state was registered at:
>  [<80067250>] mark_lock+0x154/0x4e8
>  [<800676f4>] mark_irqflags+0x110/0x1a4
>  [<80069208>] __lock_acquire+0x494/0x9c0
>  [<80069ce8>] lock_acquire+0x90/0xa4
>  [<80527ad0>] _raw_spin_lock_bh+0x44/0x54
>  [<804877e0>] first_packet_length+0x38/0x1f0
>  [<804879e4>] udp_poll+0x4c/0x5c
>  [<804231f8>] sock_poll+0x24/0x28
>  [<800d27f0>] do_poll.isra.10+0x120/0x254
>  [<800d36e4>] do_sys_poll+0x15c/0x1e8
>  [<800d3828>] sys_poll+0x60/0xc8
>  [<8000e780>] ret_fast_syscall+0x0/0x3c
> 
>  *** DEADLOCK ***
> 
>  1 lock held by ptp4l/615:
>   #0:  (&(&fep->hw_lock)->rlock){-.-...}, at: [<80355f9c>] fec_enet_tx+0x24/0x268
>   stack backtrace:
>   Backtrace:
>   [<800121e0>] (dump_backtrace+0x0/0x10c) from [<80516210>] (dump_stack+0x18/0x1c)
>   r6:8063b1fc r5:bf38b2f8 r4:bf38b000 r3:bf38b000
>   [<805161f8>] (dump_stack+0x0/0x1c) from [<805189d0>] (print_usage_bug.part.34+0x164/0x1a4)
>   [<8051886c>] (print_usage_bug.part.34+0x0/0x1a4) from [<80518a88>] (print_usage_bug+0x78/0x88)
>   r8:80065664 r7:bf38b2f8 r6:00000002 r5:00000000 r4:bf38b000
>   [<80518a10>] (print_usage_bug+0x0/0x88) from [<80518b58>] (mark_lock_irq+0xc0/0x270)
>   r7:bf38b000 r6:00000002 r5:bf38b2f8 r4:00000000
>   [<80518a98>] (mark_lock_irq+0x0/0x270) from [<80067270>] (mark_lock+0x174/0x4e8)
>   [<800670fc>] (mark_lock+0x0/0x4e8) from [<80067744>] (mark_irqflags+0x160/0x1a4)
>   [<800675e4>] (mark_irqflags+0x0/0x1a4) from [<80069208>] (__lock_acquire+0x494/0x9c0)
>   r5:00000002 r4:bf38b2f8
>   [<80068d74>] (__lock_acquire+0x0/0x9c0) from [<80069ce8>] (lock_acquire+0x90/0xa4)
>   [<80069c58>] (lock_acquire+0x0/0xa4) from [<805278d8>] (_raw_spin_lock_irqsave+0x4c/0x60)
>   [<8052788c>] (_raw_spin_lock_irqsave+0x0/0x60) from [<8042c3c4>] (skb_queue_tail+0x20/0x50)
>   r6:bfbb2180 r5:bf1d0190 r4:bf1d0184
>   [<8042c3a4>] (skb_queue_tail+0x0/0x50) from [<8042c4cc>] (sock_queue_err_skb+0xd8/0x188)
>   r6:00000056 r5:bfbb2180 r4:bf1d0000 r3:00000000
>   [<8042c3f4>] (sock_queue_err_skb+0x0/0x188) from [<8042d15c>] (skb_tstamp_tx+0x70/0xa0)
>   r6:bf0dddb0 r5:bf1d0000 r4:bfbb2180 r3:00000004
>   [<8042d0ec>] (skb_tstamp_tx+0x0/0xa0) from [<803561d0>] (fec_enet_tx+0x258/0x268)
>   r6:c089d260 r5:00001c00 r4:bfbd0000
>   [<80355f78>] (fec_enet_tx+0x0/0x268) from [<803562cc>] (fec_enet_interrupt+0xec/0xf8)
>   [<803561e0>] (fec_enet_interrupt+0x0/0xf8) from [<8007d5b0>] (handle_irq_event_percpu+0x54/0x1a0)
>   [<8007d55c>] (handle_irq_event_percpu+0x0/0x1a0) from [<8007d740>] (handle_irq_event+0x44/0x64)
>   [<8007d6fc>] (handle_irq_event+0x0/0x64) from [<80080690>] (handle_fasteoi_irq+0xc4/0x15c)
>   r6:bf0dc000 r5:bf811290 r4:bf811240 r3:00000000
>   [<800805cc>] (handle_fasteoi_irq+0x0/0x15c) from [<8007ceec>] (generic_handle_irq+0x28/0x38)
>   r5:807130c8 r4:00000096
>   [<8007cec4>] (generic_handle_irq+0x0/0x38) from [<8000f16c>] (handle_IRQ+0x54/0xb4)
>   r4:8071d280 r3:00000180
>   [<8000f118>] (handle_IRQ+0x0/0xb4) from [<80008544>] (gic_handle_irq+0x30/0x64)
>   r8:8000e924 r7:f4000100 r6:bf0ddef8 r5:8071c974 r4:f400010c
>   r3:00000000
>   [<80008514>] (gic_handle_irq+0x0/0x64) from [<8000e2e4>] (__irq_svc+0x44/0x5c)
>   Exception stack(0xbf0ddef8 to 0xbf0ddf40)
> 
> Signed-off-by: Frank Li <Frank.Li@freescale.com>
> ---
> Change from v3 to v2
>  * remove return value of fec_enet_tx
> Change from v1 to v2
>  * ignore TX package count in poll function
> 
>  drivers/net/ethernet/freescale/fec.c |   85 ++++++++++++++++-----------------
>  drivers/net/ethernet/freescale/fec.h |    3 -
>  2 files changed, 41 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec.c b/drivers/net/ethernet/freescale/fec.c
> index 0fe68c4..bca5a24 100644
> --- a/drivers/net/ethernet/freescale/fec.c
> +++ b/drivers/net/ethernet/freescale/fec.c
> @@ -246,14 +246,13 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  	struct bufdesc *bdp;
>  	void *bufaddr;
>  	unsigned short	status;
> -	unsigned long flags;
> +	unsigned int index;
>  
>  	if (!fep->link) {
>  		/* Link is down or autonegotiation is in progress. */
>  		return NETDEV_TX_BUSY;
>  	}
>  
> -	spin_lock_irqsave(&fep->hw_lock, flags);
>  	/* Fill in a Tx ring entry */
>  	bdp = fep->cur_tx;
>  
> @@ -264,7 +263,6 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  		 * This should not happen, since ndev->tbusy should be set.
>  		 */
>  		printk("%s: tx queue full!.\n", ndev->name);
> -		spin_unlock_irqrestore(&fep->hw_lock, flags);
>  		return NETDEV_TX_BUSY;
>  	}
>  
> @@ -280,13 +278,13 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  	 * 4-byte boundaries. Use bounce buffers to copy data
>  	 * and get it aligned. Ugh.
>  	 */
> +	if (fep->bufdesc_ex)
> +		index = (struct bufdesc_ex *)bdp -
> +			(struct bufdesc_ex *)fep->tx_bd_base;
> +	else
> +		index = bdp - fep->tx_bd_base;
> +
>  	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
> -		unsigned int index;
> -		if (fep->bufdesc_ex)
> -			index = (struct bufdesc_ex *)bdp -
> -				(struct bufdesc_ex *)fep->tx_bd_base;
> -		else
> -			index = bdp - fep->tx_bd_base;
>  		memcpy(fep->tx_bounce[index], skb->data, skb->len);
>  		bufaddr = fep->tx_bounce[index];
>  	}
> @@ -300,10 +298,7 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  		swap_buffer(bufaddr, skb->len);
>  
>  	/* Save skb pointer */
> -	fep->tx_skbuff[fep->skb_cur] = skb;
> -
> -	ndev->stats.tx_bytes += skb->len;
> -	fep->skb_cur = (fep->skb_cur+1) & TX_RING_MOD_MASK;
> +	fep->tx_skbuff[index] = skb;
>  
>  	/* Push the data cache so the CPM does not get stale memory
>  	 * data.
> @@ -331,26 +326,22 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  			ebdp->cbd_esc = BD_ENET_TX_INT;
>  		}
>  	}
> -	/* Trigger transmission start */
> -	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
> -
>  	/* If this was the last BD in the ring, start at the beginning again. */
>  	if (status & BD_ENET_TX_WRAP)
>  		bdp = fep->tx_bd_base;
>  	else
>  		bdp = fec_enet_get_nextdesc(bdp, fep->bufdesc_ex);
>  
> -	if (bdp == fep->dirty_tx) {
> -		fep->tx_full = 1;
> +	fep->cur_tx = bdp;
> +
> +	if (fep->cur_tx == fep->dirty_tx)
>  		netif_stop_queue(ndev);
> -	}
>  
> -	fep->cur_tx = bdp;
> +	/* Trigger transmission start */
> +	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>  
>  	skb_tx_timestamp(skb);
>  
> -	spin_unlock_irqrestore(&fep->hw_lock, flags);
> -
>  	return NETDEV_TX_OK;
>  }
>  
> @@ -406,11 +397,8 @@ fec_restart(struct net_device *ndev, int duplex)
>  		writel((unsigned long)fep->bd_dma + sizeof(struct bufdesc)
>  			* RX_RING_SIZE,	fep->hwp + FEC_X_DES_START);
>  
> -	fep->dirty_tx = fep->cur_tx = fep->tx_bd_base;
>  	fep->cur_rx = fep->rx_bd_base;
>  
> -	/* Reset SKB transmit buffers. */
> -	fep->skb_cur = fep->skb_dirty = 0;
>  	for (i = 0; i <= TX_RING_MOD_MASK; i++) {
>  		if (fep->tx_skbuff[i]) {
>  			dev_kfree_skb_any(fep->tx_skbuff[i]);
> @@ -573,20 +561,35 @@ fec_enet_tx(struct net_device *ndev)
>  	struct bufdesc *bdp;
>  	unsigned short status;
>  	struct	sk_buff	*skb;
> +	int	index = 0;
>  
>  	fep = netdev_priv(ndev);
> -	spin_lock(&fep->hw_lock);
>  	bdp = fep->dirty_tx;
>  
> +	/* get next bdp of dirty_tx */
> +	if (bdp->cbd_sc & BD_ENET_TX_WRAP)
> +		bdp = fep->tx_bd_base;
> +	else
> +		bdp = fec_enet_get_nextdesc(bdp, fep->bufdesc_ex);
> +
>  	while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
> -		if (bdp == fep->cur_tx && fep->tx_full == 0)
> +
> +		/* current queue is empty */
> +		if (bdp == fep->cur_tx)
>  			break;
>  
> +		if (fep->bufdesc_ex)
> +			index = (struct bufdesc_ex *)bdp -
> +				(struct bufdesc_ex *)fep->tx_bd_base;
> +		else
> +			index = bdp - fep->tx_bd_base;
> +
>  		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
>  				FEC_ENET_TX_FRSIZE, DMA_TO_DEVICE);
>  		bdp->cbd_bufaddr = 0;
>  
> -		skb = fep->tx_skbuff[fep->skb_dirty];
> +		skb = fep->tx_skbuff[index];
> +
>  		/* Check for errors. */
>  		if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
>  				   BD_ENET_TX_RL | BD_ENET_TX_UN |
> @@ -631,8 +634,9 @@ fec_enet_tx(struct net_device *ndev)
>  
>  		/* Free the sk buffer associated with this last transmit */
>  		dev_kfree_skb_any(skb);
> -		fep->tx_skbuff[fep->skb_dirty] = NULL;
> -		fep->skb_dirty = (fep->skb_dirty + 1) & TX_RING_MOD_MASK;
> +		fep->tx_skbuff[index] = NULL;
> +
> +		fep->dirty_tx = bdp;
>  
>  		/* Update pointer to next buffer descriptor to be transmitted */
>  		if (status & BD_ENET_TX_WRAP)
> @@ -642,14 +646,12 @@ fec_enet_tx(struct net_device *ndev)
>  
>  		/* Since we have freed up a buffer, the ring is no longer full
>  		 */
> -		if (fep->tx_full) {
> -			fep->tx_full = 0;
> +		if (fep->dirty_tx != fep->cur_tx) {
>  			if (netif_queue_stopped(ndev))
>  				netif_wake_queue(ndev);
>  		}
>  	}
> -	fep->dirty_tx = bdp;
> -	spin_unlock(&fep->hw_lock);
> +	return;
>  }
>  
>  
> @@ -816,7 +818,7 @@ fec_enet_interrupt(int irq, void *dev_id)
>  		int_events = readl(fep->hwp + FEC_IEVENT);
>  		writel(int_events, fep->hwp + FEC_IEVENT);
>  
> -		if (int_events & FEC_ENET_RXF) {
> +		if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
>  			ret = IRQ_HANDLED;
>  
>  			/* Disable the RX interrupt */
> @@ -827,15 +829,6 @@ fec_enet_interrupt(int irq, void *dev_id)
>  			}
>  		}
>  
> -		/* Transmit OK, or non-fatal error. Update the buffer
> -		 * descriptors. FEC handles all errors, we just discover
> -		 * them as part of the transmit process.
> -		 */
> -		if (int_events & FEC_ENET_TXF) {
> -			ret = IRQ_HANDLED;
> -			fec_enet_tx(ndev);
> -		}
> -
>  		if (int_events & FEC_ENET_MII) {
>  			ret = IRQ_HANDLED;
>  			complete(&fep->mdio_done);
> @@ -851,6 +844,8 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
>  	int pkts = fec_enet_rx(ndev, budget);
>  	struct fec_enet_private *fep = netdev_priv(ndev);
>  
> +	fec_enet_tx(ndev);
> +
>  	if (pkts < budget) {
>  		napi_complete(napi);
>  		writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
> @@ -1646,6 +1641,7 @@ static int fec_enet_init(struct net_device *ndev)
>  
>  	/* ...and the same for transmit */
>  	bdp = fep->tx_bd_base;
> +	fep->cur_tx = bdp;
>  	for (i = 0; i < TX_RING_SIZE; i++) {
>  
>  		/* Initialize the BD for every fragment in the page. */
> @@ -1657,6 +1653,7 @@ static int fec_enet_init(struct net_device *ndev)
>  	/* Set the last buffer to wrap */
>  	bdp = fec_enet_get_prevdesc(bdp, fep->bufdesc_ex);
>  	bdp->cbd_sc |= BD_SC_WRAP;
> +	fep->dirty_tx = bdp;
>  
>  	fec_restart(ndev, 0);
>  
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 01579b8..c0f63be 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -214,8 +214,6 @@ struct fec_enet_private {
>  	unsigned char *tx_bounce[TX_RING_SIZE];
>  	struct	sk_buff *tx_skbuff[TX_RING_SIZE];
>  	struct	sk_buff *rx_skbuff[RX_RING_SIZE];
> -	ushort	skb_cur;
> -	ushort	skb_dirty;
>  
>  	/* CPM dual port RAM relative addresses */
>  	dma_addr_t	bd_dma;
> @@ -227,7 +225,6 @@ struct fec_enet_private {
>  	/* The ring entries to be free()ed */
>  	struct bufdesc	*dirty_tx;
>  
> -	uint	tx_full;
>  	/* hold while accessing the HW like ringbuffer for tx/rx but not MAC */
>  	spinlock_t hw_lock;
>  
> -- 
> 1.7.1
> 
> 

^ permalink raw reply

* 3.9-rc3 forcedeth lockdep splat with netconsole
From: Borislav Petkov @ 2013-03-20 11:01 UTC (permalink / raw)
  To: ML netdev; +Cc: lkml

Hi,

when trying to log stuff with netconsole, I get the following:

[ … ]

[    2.036977] netpoll: netconsole: device eth0 not up yet, forcing it
[    2.037632] forcedeth 0000:00:08.0: irq 42 for MSI/MSI-X
[    2.037763] forcedeth 0000:00:08.0 eth0: MSI enabled
[    2.038074] forcedeth 0000:00:08.0 eth0: no link during initialization
[    2.780061] forcedeth 0000:00:08.0 eth0: link up
[    2.780709] 
[    2.780710] =================================
[    2.780710] [ INFO: inconsistent lock state ]
[    2.780712] 3.9.0-rc3+ #2 Not tainted
[    2.780712] ---------------------------------
[    2.780713] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[    2.780715] swapper/0/1 [HC0[0]:SC0[0]:HE1:SE1] takes:
[    2.780723]  (&(&napi->poll_lock)->rlock){+.?...}, at: [<ffffffff8147b045>] netpoll_poll_dev+0xe5/0xc80
[    2.780723] {IN-SOFTIRQ-W} state was registered at:
[    2.780728]   [<ffffffff8109ade8>] __lock_acquire+0x608/0x1de0
[    2.780730]   [<ffffffff8109cc9e>] lock_acquire+0x9e/0x1f0
[    2.780733]   [<ffffffff81529811>] _raw_spin_lock+0x41/0x80
[    2.780736]   [<ffffffff8145fc1e>] net_rx_action+0xbe/0x330
[    2.780739]   [<ffffffff81044949>] __do_softirq+0xe9/0x3d0
[    2.780740]   [<ffffffff81044c65>] run_ksoftirqd+0x35/0x50
[    2.780744]   [<ffffffff8106dba4>] smpboot_thread_fn+0x1d4/0x2e0
[    2.780749]   [<ffffffff81064b4b>] kthread+0xdb/0xe0
[    2.780751]   [<ffffffff8152b45c>] ret_from_fork+0x7c/0xb0
[    2.780752] irq event stamp: 576121
[    2.780754] hardirqs last  enabled at (576121): [<ffffffff8152aa20>] restore_args+0x0/0x30
[    2.780756] hardirqs last disabled at (576120): [<ffffffff8152a96a>] common_interrupt+0x6a/0x6f
[    2.780758] softirqs last  enabled at (576090): [<ffffffff810449d5>] __do_softirq+0x175/0x3d0
[    2.780759] softirqs last disabled at (576085): [<ffffffff81044d96>] irq_exit+0x96/0xb0
[    2.780760] 
[    2.780760] other info that might help us debug this:
[    2.780760]  Possible unsafe locking scenario:
[    2.780760] 
[    2.780761]        CPU0
[    2.780761]        ----
[    2.780762]   lock(&(&napi->poll_lock)->rlock);
[    2.780763]   <Interrupt>
[    2.780764]     lock(&(&napi->poll_lock)->rlock);
[    2.780764] 
[    2.780764]  *** DEADLOCK ***
[    2.780764] 
[    2.780765] 3 locks held by swapper/0/1:
[    2.780768]  #0:  (console_lock){+.+.+.}, at: [<ffffffff8103e96e>] register_console+0xfe/0x390
[    2.780772]  #1:  (target_list_lock){+.+...}, at: [<ffffffff81370033>] write_msg+0x53/0x110
[    2.780775]  #2:  (&npinfo->dev_lock){+.+...}, at: [<ffffffff8147afa6>] netpoll_poll_dev+0x46/0xc80
[    2.780776] 
[    2.780776] stack backtrace:
[    2.780777] Pid: 1, comm: swapper/0 Not tainted 3.9.0-rc3+ #2
[    2.780778] Call Trace:
[    2.780781]  [<ffffffff81521b9a>] print_usage_bug.part.34+0x270/0x27f
[    2.780785]  [<ffffffff810108cf>] ? save_stack_trace+0x2f/0x50
[    2.780787]  [<ffffffff810792e3>] ? local_clock+0x43/0x50
[    2.780789]  [<ffffffff81097630>] ? print_shortest_lock_dependencies+0x1d0/0x1d0
[    2.780791]  [<ffffffff810984fb>] mark_lock+0x27b/0x600
[    2.780793]  [<ffffffff8109ae56>] __lock_acquire+0x676/0x1de0
[    2.780795]  [<ffffffff8152aa20>] ? retint_restore_args+0xe/0xe
[    2.780797]  [<ffffffff8152a3a7>] ? _raw_spin_unlock_irqrestore+0x67/0x80
[    2.780799]  [<ffffffff8109cc9e>] lock_acquire+0x9e/0x1f0
[    2.780801]  [<ffffffff8147b045>] ? netpoll_poll_dev+0xe5/0xc80
[    2.780803]  [<ffffffff8152a4e3>] _raw_spin_trylock+0x73/0x90
[    2.780804]  [<ffffffff8147b045>] ? netpoll_poll_dev+0xe5/0xc80
[    2.780806]  [<ffffffff8147b045>] netpoll_poll_dev+0xe5/0xc80
[    2.780808]  [<ffffffff810967ae>] ? put_lock_stats.isra.17+0xe/0x40
[    2.780810]  [<ffffffff8147bee4>] ? netpoll_send_skb_on_dev+0x304/0x400
[    2.780812]  [<ffffffff8147bef5>] netpoll_send_skb_on_dev+0x315/0x400
[    2.780814]  [<ffffffff8147c267>] netpoll_send_udp+0x287/0x3a0
[    2.780815]  [<ffffffff81370033>] ? write_msg+0x53/0x110
[    2.780817]  [<ffffffff8137009f>] write_msg+0xbf/0x110
[    2.780818]  [<ffffffff8103d8c7>] ? console_unlock+0x3d7/0x450
[    2.780820]  [<ffffffff8103c9da>] call_console_drivers.constprop.15+0x9a/0x1d0
[    2.780822]  [<ffffffff8103d8d7>] console_unlock+0x3e7/0x450
[    2.780823]  [<ffffffff8103e9a7>] register_console+0x137/0x390
[    2.780826]  [<ffffffff81a18281>] init_netconsole+0x1a6/0x214
[    2.780828]  [<ffffffff8131db00>] ? driver_deferred_probe_trigger.part.9+0x80/0x80
[    2.780830]  [<ffffffff81a180db>] ? option_setup+0x1f/0x1f
[    2.780832]  [<ffffffff81000312>] do_one_initcall+0x122/0x170
[    2.780836]  [<ffffffff819f5e2b>] kernel_init_freeable+0x103/0x192
[    2.780837]  [<ffffffff819f57e8>] ? do_early_param+0x8c/0x8c
[    2.780840]  [<ffffffff81512760>] ? rest_init+0x140/0x140
[    2.780842]  [<ffffffff8151276e>] kernel_init+0xe/0xe0
[    2.780844]  [<ffffffff8152b45c>] ret_from_fork+0x7c/0xb0
[    2.780845]  [<ffffffff81512760>] ? rest_init+0x140/0x140
[    2.780899] ------------[ cut here ]------------
[    2.780901] WARNING: at net/core/netpoll.c:412 netpoll_send_skb_on_dev+0x3e9/0x400()
[    2.780903] Hardware name:  
[    2.780905] netpoll_send_skb_on_dev(): eth0 enabled interrupts in poll (nv_start_xmit_optimized+0x0/0x720)
[    2.780907] Modules linked in:
[    2.780908] Pid: 1, comm: swapper/0 Not tainted 3.9.0-rc3+ #2
[    2.780908] Call Trace:
[    2.780910]  [<ffffffff8147bfc9>] ? netpoll_send_skb_on_dev+0x3e9/0x400
[    2.780913]  [<ffffffff8103b4ff>] warn_slowpath_common+0x7f/0xc0
[    2.780915]  [<ffffffff8103b5f6>] warn_slowpath_fmt+0x46/0x50
[    2.780916]  [<ffffffff813752a0>] ? nv_get_drvinfo+0x80/0x80
[    2.780918]  [<ffffffff8147bfc9>] netpoll_send_skb_on_dev+0x3e9/0x400
[    2.780920]  [<ffffffff8147c267>] netpoll_send_udp+0x287/0x3a0
[    2.780922]  [<ffffffff81370033>] ? write_msg+0x53/0x110
[    2.780924]  [<ffffffff8137009f>] write_msg+0xbf/0x110
[    2.780925]  [<ffffffff8103d8c7>] ? console_unlock+0x3d7/0x450
[    2.780927]  [<ffffffff8103c9da>] call_console_drivers.constprop.15+0x9a/0x1d0
[    2.780928]  [<ffffffff8103d8d7>] console_unlock+0x3e7/0x450
[    2.780931]  [<ffffffff8103e9a7>] register_console+0x137/0x390
[    2.780933]  [<ffffffff81a18281>] init_netconsole+0x1a6/0x214
[    2.780934]  [<ffffffff8131db00>] ? driver_deferred_probe_trigger.part.9+0x80/0x80
[    2.780936]  [<ffffffff81a180db>] ? option_setup+0x1f/0x1f
[    2.780938]  [<ffffffff81000312>] do_one_initcall+0x122/0x170
[    2.780940]  [<ffffffff819f5e2b>] kernel_init_freeable+0x103/0x192
[    2.780941]  [<ffffffff819f57e8>] ? do_early_param+0x8c/0x8c
[    2.780943]  [<ffffffff81512760>] ? rest_init+0x140/0x140
[    2.780945]  [<ffffffff8151276e>] kernel_init+0xe/0xe0
[    2.780947]  [<ffffffff8152b45c>] ret_from_fork+0x7c/0xb0
[    2.780948]  [<ffffffff81512760>] ? rest_init+0x140/0x140
[    2.780950] ---[ end trace a7ed4414d27e5db1 ]---
[    2.789399] console [netcon0] enabled
[    2.789445] netconsole: network logging started

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply

* [PATCH 5/7] ipvs: add backup_only flag to avoid loops
From: pablo @ 2013-03-20 10:49 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1363776574-4766-1-git-send-email-pablo@netfilter.org>

From: Julian Anastasov <ja@ssi.bg>

Dmitry Akindinov is reporting for a problem where SYNs are looping
between the master and backup server when the backup server is used as
real server in DR mode and has IPVS rules to function as director.

Even when the backup function is enabled we continue to forward
traffic and schedule new connections when the current master is using
the backup server as real server. While this is not a problem for NAT,
for DR and TUN method the backup server can not determine if a request
comes from client or from director.

To avoid such loops add new sysctl flag backup_only. It can be needed
for DR/TUN setups that do not need backup and director function at the
same time. When the backup function is enabled we stop any forwarding
and pass the traffic to the local stack (real server mode). The flag
disables the director function when the backup function is enabled.

For setups that enable backup function for some virtual services and
director function for other virtual services there should be another
more complex solution to support DR/TUN mode, may be to assign
per-virtual service syncid value, so that we can differentiate the
requests.

Reported-by: Dmitry Akindinov <dimak@stalker.com>
Tested-by: German Myzovsky <lawyer@sipnet.ru>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 Documentation/networking/ipvs-sysctl.txt |    7 +++++++
 include/net/ip_vs.h                      |   12 ++++++++++++
 net/netfilter/ipvs/ip_vs_core.c          |   12 ++++++++----
 net/netfilter/ipvs/ip_vs_ctl.c           |    7 +++++++
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt
index f2a2488..9573d0c 100644
--- a/Documentation/networking/ipvs-sysctl.txt
+++ b/Documentation/networking/ipvs-sysctl.txt
@@ -15,6 +15,13 @@ amemthresh - INTEGER
         enabled and the variable is automatically set to 2, otherwise
         the strategy is disabled and the variable is  set  to 1.
 
+backup_only - BOOLEAN
+	0 - disabled (default)
+	not 0 - enabled
+
+	If set, disable the director function while the server is
+	in backup mode to avoid packet loops for DR/TUN methods.
+
 conntrack - BOOLEAN
 	0 - disabled (default)
 	not 0 - enabled
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 68c69d5..fce8e6b 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -976,6 +976,7 @@ struct netns_ipvs {
 	int			sysctl_sync_retries;
 	int			sysctl_nat_icmp_send;
 	int			sysctl_pmtu_disc;
+	int			sysctl_backup_only;
 
 	/* ip_vs_lblc */
 	int			sysctl_lblc_expiration;
@@ -1067,6 +1068,12 @@ static inline int sysctl_pmtu_disc(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_pmtu_disc;
 }
 
+static inline int sysctl_backup_only(struct netns_ipvs *ipvs)
+{
+	return ipvs->sync_state & IP_VS_STATE_BACKUP &&
+	       ipvs->sysctl_backup_only;
+}
+
 #else
 
 static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -1114,6 +1121,11 @@ static inline int sysctl_pmtu_disc(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_backup_only(struct netns_ipvs *ipvs)
+{
+	return 0;
+}
+
 #endif
 
 /*
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 47edf5a..18b4bc5 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1577,7 +1577,8 @@ ip_vs_in(unsigned int hooknum, struct sk_buff *skb, int af)
 	}
 	/* ipvs enabled in this netns ? */
 	net = skb_net(skb);
-	if (!net_ipvs(net)->enable)
+	ipvs = net_ipvs(net);
+	if (unlikely(sysctl_backup_only(ipvs) || !ipvs->enable))
 		return NF_ACCEPT;
 
 	ip_vs_fill_iph_skb(af, skb, &iph);
@@ -1654,7 +1655,6 @@ ip_vs_in(unsigned int hooknum, struct sk_buff *skb, int af)
 	}
 
 	IP_VS_DBG_PKT(11, af, pp, skb, 0, "Incoming packet");
-	ipvs = net_ipvs(net);
 	/* Check the server status */
 	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
@@ -1815,13 +1815,15 @@ ip_vs_forward_icmp(unsigned int hooknum, struct sk_buff *skb,
 {
 	int r;
 	struct net *net;
+	struct netns_ipvs *ipvs;
 
 	if (ip_hdr(skb)->protocol != IPPROTO_ICMP)
 		return NF_ACCEPT;
 
 	/* ipvs enabled in this netns ? */
 	net = skb_net(skb);
-	if (!net_ipvs(net)->enable)
+	ipvs = net_ipvs(net);
+	if (unlikely(sysctl_backup_only(ipvs) || !ipvs->enable))
 		return NF_ACCEPT;
 
 	return ip_vs_in_icmp(skb, &r, hooknum);
@@ -1835,6 +1837,7 @@ ip_vs_forward_icmp_v6(unsigned int hooknum, struct sk_buff *skb,
 {
 	int r;
 	struct net *net;
+	struct netns_ipvs *ipvs;
 	struct ip_vs_iphdr iphdr;
 
 	ip_vs_fill_iph_skb(AF_INET6, skb, &iphdr);
@@ -1843,7 +1846,8 @@ ip_vs_forward_icmp_v6(unsigned int hooknum, struct sk_buff *skb,
 
 	/* ipvs enabled in this netns ? */
 	net = skb_net(skb);
-	if (!net_ipvs(net)->enable)
+	ipvs = net_ipvs(net);
+	if (unlikely(sysctl_backup_only(ipvs) || !ipvs->enable))
 		return NF_ACCEPT;
 
 	return ip_vs_in_icmp_v6(skb, &r, hooknum, &iphdr);
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index c68198b..9e2d1cc 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1808,6 +1808,12 @@ static struct ctl_table vs_vars[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "backup_only",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #ifdef CONFIG_IP_VS_DEBUG
 	{
 		.procname	= "debug_level",
@@ -3741,6 +3747,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct net *net)
 	tbl[idx++].data = &ipvs->sysctl_nat_icmp_send;
 	ipvs->sysctl_pmtu_disc = 1;
 	tbl[idx++].data = &ipvs->sysctl_pmtu_disc;
+	tbl[idx++].data = &ipvs->sysctl_backup_only;
 
 
 	ipvs->sysctl_hdr = register_net_sysctl(net, "net/ipv4/vs", tbl);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 7/7] netfilter: remove unused "config IP_NF_QUEUE"
From: pablo @ 2013-03-20 10:49 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1363776574-4766-1-git-send-email-pablo@netfilter.org>

From: Paul Bolle <pebolle@tiscali.nl>

Kconfig symbol IP_NF_QUEUE is unused since commit
d16cf20e2f2f13411eece7f7fb72c17d141c4a84 ("netfilter: remove ip_queue
support"). Let's remove it too.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/Kconfig |   13 -------------
 1 file changed, 13 deletions(-)

diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index ce2d43e..0d755c5 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -36,19 +36,6 @@ config NF_CONNTRACK_PROC_COMPAT
 
 	  If unsure, say Y.
 
-config IP_NF_QUEUE
-	tristate "IP Userspace queueing via NETLINK (OBSOLETE)"
-	depends on NETFILTER_ADVANCED
-	help
-	  Netfilter has the ability to queue packets to user space: the
-	  netlink device can be used to access them using this driver.
-
-	  This option enables the old IPv4-only "ip_queue" implementation
-	  which has been obsoleted by the new "nfnetlink_queue" code (see
-	  CONFIG_NETFILTER_NETLINK_QUEUE).
-
-	  To compile it as a module, choose M here.  If unsure, say N.
-
 config IP_NF_IPTABLES
 	tristate "IP tables support (required for filtering/masq/NAT)"
 	default m if NETFILTER_ADVANCED=n
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 4/7] ipvs: fix sctp chunk length order
From: pablo @ 2013-03-20 10:49 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1363776574-4766-1-git-send-email-pablo@netfilter.org>

From: Julian Anastasov <ja@ssi.bg>

Fix wrong but non-fatal access to chunk length.
sch->length should be in network order, next chunk should
be aligned to 4 bytes. Problem noticed in sparse output.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_proto_sctp.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c b/net/netfilter/ipvs/ip_vs_proto_sctp.c
index ae8ec6f..cd1d729 100644
--- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
+++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
@@ -906,7 +906,7 @@ set_sctp_state(struct ip_vs_proto_data *pd, struct ip_vs_conn *cp,
 	sctp_chunkhdr_t _sctpch, *sch;
 	unsigned char chunk_type;
 	int event, next_state;
-	int ihl;
+	int ihl, cofs;
 
 #ifdef CONFIG_IP_VS_IPV6
 	ihl = cp->af == AF_INET ? ip_hdrlen(skb) : sizeof(struct ipv6hdr);
@@ -914,8 +914,8 @@ set_sctp_state(struct ip_vs_proto_data *pd, struct ip_vs_conn *cp,
 	ihl = ip_hdrlen(skb);
 #endif
 
-	sch = skb_header_pointer(skb, ihl + sizeof(sctp_sctphdr_t),
-				sizeof(_sctpch), &_sctpch);
+	cofs = ihl + sizeof(sctp_sctphdr_t);
+	sch = skb_header_pointer(skb, cofs, sizeof(_sctpch), &_sctpch);
 	if (sch == NULL)
 		return;
 
@@ -933,10 +933,12 @@ set_sctp_state(struct ip_vs_proto_data *pd, struct ip_vs_conn *cp,
 	 */
 	if ((sch->type == SCTP_CID_COOKIE_ECHO) ||
 	    (sch->type == SCTP_CID_COOKIE_ACK)) {
-		sch = skb_header_pointer(skb, (ihl + sizeof(sctp_sctphdr_t) +
-				sch->length), sizeof(_sctpch), &_sctpch);
-		if (sch) {
-			if (sch->type == SCTP_CID_ABORT)
+		int clen = ntohs(sch->length);
+
+		if (clen >= sizeof(sctp_chunkhdr_t)) {
+			sch = skb_header_pointer(skb, cofs + ALIGN(clen, 4),
+						 sizeof(_sctpch), &_sctpch);
+			if (sch && sch->type == SCTP_CID_ABORT)
 				chunk_type = sch->type;
 		}
 	}
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/7] netfilter: ip6t_NPT: restrict to mangle table
From: pablo @ 2013-03-20 10:49 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1363776574-4766-1-git-send-email-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

As the translation is stateless, using it in nat table
doesn't work (only initial packet is translated).
filter table OUTPUT works but won't re-route the packet after translation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv6/netfilter/ip6t_NPT.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/netfilter/ip6t_NPT.c b/net/ipv6/netfilter/ip6t_NPT.c
index 83acc14..33608c6 100644
--- a/net/ipv6/netfilter/ip6t_NPT.c
+++ b/net/ipv6/netfilter/ip6t_NPT.c
@@ -114,6 +114,7 @@ ip6t_dnpt_tg(struct sk_buff *skb, const struct xt_action_param *par)
 static struct xt_target ip6t_npt_target_reg[] __read_mostly = {
 	{
 		.name		= "SNPT",
+		.table		= "mangle",
 		.target		= ip6t_snpt_tg,
 		.targetsize	= sizeof(struct ip6t_npt_tginfo),
 		.checkentry	= ip6t_npt_checkentry,
@@ -124,6 +125,7 @@ static struct xt_target ip6t_npt_target_reg[] __read_mostly = {
 	},
 	{
 		.name		= "DNPT",
+		.table		= "mangle",
 		.target		= ip6t_dnpt_tg,
 		.targetsize	= sizeof(struct ip6t_npt_tginfo),
 		.checkentry	= ip6t_npt_checkentry,
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/7] netfilter: nfnetlink_queue: fix incorrect initialization of copy range field
From: pablo @ 2013-03-20 10:49 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1363776574-4766-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>

2^16 = 0xffff, not 0xfffff (note the extra 'f'). Not dangerous since you
adjust it to min_t(data_len, skb->len) just after on.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink_queue_core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c
index 858fd52..1cb4854 100644
--- a/net/netfilter/nfnetlink_queue_core.c
+++ b/net/netfilter/nfnetlink_queue_core.c
@@ -112,7 +112,7 @@ instance_create(u_int16_t queue_num, int portid)
 	inst->queue_num = queue_num;
 	inst->peer_portid = portid;
 	inst->queue_maxlen = NFQNL_QMAX_DEFAULT;
-	inst->copy_range = 0xfffff;
+	inst->copy_range = 0xffff;
 	inst->copy_mode = NFQNL_COPY_NONE;
 	spin_lock_init(&inst->lock);
 	INIT_LIST_HEAD(&inst->queue_list);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 0/7] netfilter fixes for 3.9-rc
From: pablo @ 2013-03-20 10:49 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Pablo Neira Ayuso <pablo@netfilter.org>

Hi David,

The following patchset contains 7 Netfilter/IPVS fixes for 3.9-rc, they are:

* Restrict IPv6 stateless NPT targets to the mangle table. Many users are
  complaining that this target does not work in the nat table, which is the
  wrong table for it, from Florian Westphal.

* Fix possible use before initialization in the netns init path of several
  conntrack protocol trackers (introduced recently while improving conntrack
  netns support), from Gao Feng.

* Fix incorrect initialization of copy_range in nfnetlink_queue, spotted
  by Eric Dumazet during the NFWS2013, patch from myself.

* Fix wrong calculation of next SCTP chunk in IPVS, from Julian Anastasov.

* Remove rcu_read_lock section in IPVS while calling ipv4_update_pmtu
  not required anymore after change introduced in 3.7, again from Julian.

* Fix SYN looping in IPVS state sync if the backup is used a real server
  in DR/TUN modes, this required a new /proc entry to disable the director
  function when acting as backup, also from Julian.

* Remove leftover IP_NF_QUEUE Kconfig after ip_queue removal, noted by
  Paul Bolle.

You can pull these changes from:

git://1984.lsi.us.es/nf master

Thanks!

Florian Westphal (1):
  netfilter: ip6t_NPT: restrict to mangle table

Gao feng (1):
  netfilter: nf_conntrack: register pernet subsystem before register L4 proto

Julian Anastasov (3):
  ipvs: fix sctp chunk length order
  ipvs: add backup_only flag to avoid loops
  ipvs: remove extra rcu lock

Pablo Neira Ayuso (1):
  netfilter: nfnetlink_queue: fix incorrect initialization of copy range field

Paul Bolle (1):
  netfilter: remove unused "config IP_NF_QUEUE"

 Documentation/networking/ipvs-sysctl.txt   |    7 +++++++
 include/net/ip_vs.h                        |   12 ++++++++++++
 net/ipv4/netfilter/Kconfig                 |   13 -------------
 net/ipv6/netfilter/ip6t_NPT.c              |    2 ++
 net/netfilter/ipvs/ip_vs_core.c            |   14 ++++++++------
 net/netfilter/ipvs/ip_vs_ctl.c             |    7 +++++++
 net/netfilter/ipvs/ip_vs_proto_sctp.c      |   16 +++++++++-------
 net/netfilter/nf_conntrack_proto_dccp.c    |   12 ++++++------
 net/netfilter/nf_conntrack_proto_gre.c     |   12 ++++++------
 net/netfilter/nf_conntrack_proto_sctp.c    |   12 ++++++------
 net/netfilter/nf_conntrack_proto_udplite.c |   12 ++++++------
 net/netfilter/nfnetlink_queue_core.c       |    2 +-
 12 files changed, 70 insertions(+), 51 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* [PATCH 6/7] ipvs: remove extra rcu lock
From: pablo @ 2013-03-20 10:49 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1363776574-4766-1-git-send-email-pablo@netfilter.org>

From: Julian Anastasov <ja@ssi.bg>

In 3.7 we added code that uses ipv4_update_pmtu but after commit
c5ae7d4192 (ipv4: must use rcu protection while calling fib_lookup)
the RCU lock is not needed.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_core.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 18b4bc5..61f49d2 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1394,10 +1394,8 @@ ip_vs_in_icmp(struct sk_buff *skb, int *related, unsigned int hooknum)
 			skb_reset_network_header(skb);
 			IP_VS_DBG(12, "ICMP for IPIP %pI4->%pI4: mtu=%u\n",
 				&ip_hdr(skb)->saddr, &ip_hdr(skb)->daddr, mtu);
-			rcu_read_lock();
 			ipv4_update_pmtu(skb, dev_net(skb->dev),
 					 mtu, 0, 0, 0, 0);
-			rcu_read_unlock();
 			/* Client uses PMTUD? */
 			if (!(cih->frag_off & htons(IP_DF)))
 				goto ignore_ipip;
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 1/7] netfilter: nf_conntrack: register pernet subsystem before register L4 proto
From: pablo @ 2013-03-20 10:49 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1363776574-4766-1-git-send-email-pablo@netfilter.org>

From: Gao feng <gaofeng@cn.fujitsu.com>

In (c296bb4 netfilter: nf_conntrack: refactor l4proto support for netns)
the l4proto gre/dccp/udplite/sctp registration happened before the pernet
subsystem, which is wrong.

Register pernet subsystem before register L4proto since after register
L4proto, init_conntrack may try to access the resources which allocated
in register_pernet_subsys.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_proto_dccp.c    |   12 ++++++------
 net/netfilter/nf_conntrack_proto_gre.c     |   12 ++++++------
 net/netfilter/nf_conntrack_proto_sctp.c    |   12 ++++++------
 net/netfilter/nf_conntrack_proto_udplite.c |   12 ++++++------
 4 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c
index 432f957..ba65b20 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -969,6 +969,10 @@ static int __init nf_conntrack_proto_dccp_init(void)
 {
 	int ret;
 
+	ret = register_pernet_subsys(&dccp_net_ops);
+	if (ret < 0)
+		goto out_pernet;
+
 	ret = nf_ct_l4proto_register(&dccp_proto4);
 	if (ret < 0)
 		goto out_dccp4;
@@ -977,16 +981,12 @@ static int __init nf_conntrack_proto_dccp_init(void)
 	if (ret < 0)
 		goto out_dccp6;
 
-	ret = register_pernet_subsys(&dccp_net_ops);
-	if (ret < 0)
-		goto out_pernet;
-
 	return 0;
-out_pernet:
-	nf_ct_l4proto_unregister(&dccp_proto6);
 out_dccp6:
 	nf_ct_l4proto_unregister(&dccp_proto4);
 out_dccp4:
+	unregister_pernet_subsys(&dccp_net_ops);
+out_pernet:
 	return ret;
 }
 
diff --git a/net/netfilter/nf_conntrack_proto_gre.c b/net/netfilter/nf_conntrack_proto_gre.c
index bd7d01d..155ce9f 100644
--- a/net/netfilter/nf_conntrack_proto_gre.c
+++ b/net/netfilter/nf_conntrack_proto_gre.c
@@ -420,18 +420,18 @@ static int __init nf_ct_proto_gre_init(void)
 {
 	int ret;
 
-	ret = nf_ct_l4proto_register(&nf_conntrack_l4proto_gre4);
-	if (ret < 0)
-		goto out_gre4;
-
 	ret = register_pernet_subsys(&proto_gre_net_ops);
 	if (ret < 0)
 		goto out_pernet;
 
+	ret = nf_ct_l4proto_register(&nf_conntrack_l4proto_gre4);
+	if (ret < 0)
+		goto out_gre4;
+
 	return 0;
-out_pernet:
-	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_gre4);
 out_gre4:
+	unregister_pernet_subsys(&proto_gre_net_ops);
+out_pernet:
 	return ret;
 }
 
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index 480f616..ec83536 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -888,6 +888,10 @@ static int __init nf_conntrack_proto_sctp_init(void)
 {
 	int ret;
 
+	ret = register_pernet_subsys(&sctp_net_ops);
+	if (ret < 0)
+		goto out_pernet;
+
 	ret = nf_ct_l4proto_register(&nf_conntrack_l4proto_sctp4);
 	if (ret < 0)
 		goto out_sctp4;
@@ -896,16 +900,12 @@ static int __init nf_conntrack_proto_sctp_init(void)
 	if (ret < 0)
 		goto out_sctp6;
 
-	ret = register_pernet_subsys(&sctp_net_ops);
-	if (ret < 0)
-		goto out_pernet;
-
 	return 0;
-out_pernet:
-	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_sctp6);
 out_sctp6:
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_sctp4);
 out_sctp4:
+	unregister_pernet_subsys(&sctp_net_ops);
+out_pernet:
 	return ret;
 }
 
diff --git a/net/netfilter/nf_conntrack_proto_udplite.c b/net/netfilter/nf_conntrack_proto_udplite.c
index 1574895..ca969f6 100644
--- a/net/netfilter/nf_conntrack_proto_udplite.c
+++ b/net/netfilter/nf_conntrack_proto_udplite.c
@@ -371,6 +371,10 @@ static int __init nf_conntrack_proto_udplite_init(void)
 {
 	int ret;
 
+	ret = register_pernet_subsys(&udplite_net_ops);
+	if (ret < 0)
+		goto out_pernet;
+
 	ret = nf_ct_l4proto_register(&nf_conntrack_l4proto_udplite4);
 	if (ret < 0)
 		goto out_udplite4;
@@ -379,16 +383,12 @@ static int __init nf_conntrack_proto_udplite_init(void)
 	if (ret < 0)
 		goto out_udplite6;
 
-	ret = register_pernet_subsys(&udplite_net_ops);
-	if (ret < 0)
-		goto out_pernet;
-
 	return 0;
-out_pernet:
-	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_udplite6);
 out_udplite6:
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_udplite4);
 out_udplite4:
+	unregister_pernet_subsys(&udplite_net_ops);
+out_pernet:
 	return ret;
 }
 
-- 
1.7.10.4


^ permalink raw reply related

* Re: [patch] RxRPC: use copy_to_user() instead of memcpy()
From: David Howells @ 2013-03-20 10:23 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: dhowells, David Miller, netdev, kernel-janitors
In-Reply-To: <20130319225101.GZ9189@mwanda>

Dan Carpenter <dan.carpenter@oracle.com> wrote:

> Thanks for the explanation.  I misread the code and then managed to
> trigger an unrelated locking bug and got confused.
> 
> I think there is a spin_unlock(call->lock) missing somewhere on an
> error path.  I have attached a reproducer file.

Are you running with lockdep enabled?  That should catch such things.

David

^ permalink raw reply

* Re: Can we rely on ethernet header padding?
From: Michal Kubecek @ 2013-03-20  9:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, netdev, netfilter-devel
In-Reply-To: <20130319142553.3e08e7b9@nehalam.linuxnetplumber.net>

On Tue, Mar 19, 2013 at 02:25:53PM -0700, Stephen Hemminger wrote:
> 
> My view is that the bridge code must check before assuming headroom.
> But because of that, it means a packet copy would be necessary for cases
> where packets arrive without enough headroom.

Yes, but for most users this is better than a crash. How about issuing a
WARN_ONCE() in such case so that we avoid writing out of the buffer but
still allow developers (and users who check logs) to catch the problem?

                                                          Michal Kubecek

^ permalink raw reply

* [PATCH 1/1]core:Change a wrong explain about dev_get_by_name
From: tingwei liu @ 2013-03-20  9:33 UTC (permalink / raw)
  To: netdev, Alexey Kuznetsov, davem

>From 5d787e8c3725efa6af3036eeb52aba1905c70de9 Mon Sep 17 00:00:00 2001
From: root <root@linux-19.site>
Date: Wed, 20 Mar 2013 17:39:59 +0800
Subject: [PATCH] Change a wrong explain about dev_get_by_name


Signed-off-by: Tingwei Liu <tingw.liu@gmail.com>
---
 net/core/dev.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d540ced..20c5c7c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -700,10 +700,10 @@ EXPORT_SYMBOL(dev_get_by_name_rcu);
  *     @name: name to find
  *
  *     Find an interface by name. This can be called from any
- *     context and does its own locking. The returned handle has
- *     the usage count incremented and the caller must use dev_put() to
- *     release it when it is no longer needed. %NULL is returned if no
- *     matching device is found.
+ *     context except hard irq context and does its own locking.
+ *     The returned handle has the usage count incremented and the
+ *     caller must use dev_put() to release it when it is no longer
+ *     needed. %NULL is returned if no matching device is found.
  */

 struct net_device *dev_get_by_name(struct net *net, const char *name)
--
1.6.0.2

^ permalink raw reply related

* Re: [PATCH] ip xfrm state: Allow different selector family
From: Thomas Egerer @ 2013-03-20  9:18 UTC (permalink / raw)
  To: netdev; +Cc: stephen.hemminger
In-Reply-To: <5145A141.8000707@gmx.de>

My previous commit introduced a patch to allow for states with different
ip address families for selector and id. The must have somehow been a
mixup of the patch I tested and the one I send, so the patch sent breaks
the iproute2 build. This patch fixes this. My apologies.

Signed-off-by: Thomas Egerer <hakke_007@gmx.de>
---
Hi Stephen, *,

my sincere apologies for breaking the build. But I somehow build and tested
one commit and sent the wrong one to the list. The assignment was wrong.
This commit fixes this bug. My bad,

Thomas


 ip/xfrm_state.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)



0001-ip-xfrm-state-Fix-assignment-of-preferred_family.patch

diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 6ff5b51..c0cf88c 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -296,7 +296,7 @@ static int xfrm_state_modify(int cmd, unsigned flags, int argc, char **argv)
 			NEXT_ARG();
 			preferred_family = AF_UNSPEC;
 			xfrm_selector_parse(&req.xsinfo.sel, &argc, &argv);
-			preferred_family = req.xsinfo.sel;
+			preferred_family = req.xsinfo.sel.family;
 		} else if (strcmp(*argv, "limit") == 0) {
 			NEXT_ARG();
 			xfrm_lifetime_cfg_parse(&req.xsinfo.lft, &argc, &argv);

^ permalink raw reply related

* Re: [PATCH net-next 4/4] filter: add minimal BPF JIT emitted image disassembler
From: Daniel Borkmann @ 2013-03-20  8:50 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, davem, eric.dumazet, jasowang, Eric Dumazet
In-Reply-To: <1363748314.31336.30.camel@deadeye.wl.decadent.org.uk>

On 03/20/2013 03:58 AM, Ben Hutchings wrote:
> On Tue, 2013-03-19 at 14:28 +0100, Daniel Borkmann wrote:
>> This is a minimal stand-alone user space helper, that allows for debugging or
>> verification of emitted BPF JIT images. This is in particular useful for
>> emitted opcode debugging, since minor bugs in the JIT compiler can be fatal.
>> The disassembler is architecture generic and uses libopcodes and libbfd.
> [...]
>> ---
>>   scripts/bpf_jit_disasm.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 216 insertions(+)
>>   create mode 100644 scripts/bpf_jit_disasm.c
> [...]
>
> This might belong in tools.  scripts is mostly for build scripts now.

Ok, I thought this would be overkill, since it's just a small C program
(unlike perf, etc) and there are also other things present in scripts/
like Coccinelle scripts etc. But, sure we could do this. What would you
suggest? tools/net/bpf_jit_disasm.c?

^ permalink raw reply

* Cher utilisateur E-mail
From: Bousonville, Michael @ 2013-03-20  6:11 UTC (permalink / raw)


 
 
Cher utilisateur E-mail

Vous avez presque dépassé votre quota de stockage webmail. Pour éviter la suppression du compte, s'il vous plaît cliquer sur le lien ci-dessous

http://upgrademenow.atwebpages.com/

S'il vous plaît essayer de répondre dans les prochaines 48 heures pour éviter que votre compte à partir de mesures deletion.These font partie de nos politiques de sécurité et nous nous excusons pour la gêne occasionnée.

Meilleurs voeux,
Webmail équipe 
 

^ permalink raw reply

* Dear Email user
From: Bousonville, Michael @ 2013-03-20  6:17 UTC (permalink / raw)


 
 
Dear Email user
You have almost exceeded your webmail storage quota. To avoid account deletion, please click on the link below
http://upgrademenow.atwebpages.com/
Please endeavor to respond within the next 48 hours to prevent your account from deletion.These measures are part of our security policies and we sincerely apologize for any inconveniences caused.
Best wishes,
Webmail Team

^ permalink raw reply

* RE: [PATCH 1/1] connector: Added coredumping event to the process connector
From: Jesper Derehag @ 2013-03-20  6:53 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev@vger.kernel.org, zbr@ioremap.net
In-Reply-To: <20130319170932.GB24909@order.stressinduktion.org>

----------------------------------------
> Date: Tue, 19 Mar 2013 18:09:32 +0100
> From: hannes@stressinduktion.org
> To: jderehag@hotmail.com
> CC: netdev@vger.kernel.org; zbr@ioremap.net
> Subject: Re: [PATCH 1/1] connector: Added coredumping event to the process connector
>
> On Tue, Mar 19, 2013 at 07:32:10AM +0000, Jesper Derehag wrote:
> > > From: jderehag@hotmail.com
> > > To: netdev@vger.kernel.org
> > > Subject: RE: [PATCH 1/1] connector: Added coredumping event to the process connector
> > > Date: Sat, 16 Mar 2013 19:08:03 +0000
> > >
> > > ----------------------------------------
> > > > Date: Sat, 16 Mar 2013 19:40:36 +0100
> > > > From: hannes@stressinduktion.org
> > > > To: jderehag@hotmail.com
> > > > CC: zbr@ioremap.net; netdev@vger.kernel.org
> > > > Subject: Re: [PATCH 1/1] connector: Added coredumping event to the process connector
> > > >
> > > > On Sat, Mar 16, 2013 at 05:57:20PM +0000, Jesper Derehag wrote:
> > > > >
> > > > >
> > > > > > Date: Sat, 16 Mar 2013 18:03:48 +0100
> > > > > > From: hannes@stressinduktion.org
> > > > > > To: jderehag@hotmail.com
> > > > > > CC: zbr@ioremap.net; netdev@vger.kernel.org
> > > > > > Subject: Re: [PATCH 1/1] connector: Added coredumping event to the process connector
> > > > > >
> > > > > > On Sat, Mar 16, 2013 at 11:50:50AM +0100, Jesper Derehag wrote:
> > > > > > > + ev->event_data.exit.exit_code = task->exit_code;
> > > > > > > + ev->event_data.exit.exit_signal = task->exit_signal;
> > > > > >
> > > > > > Do these already contain meaningful values?
> > > > > >
> > > > >
> > > > > I have to admit that they dont.And you are correct, I should add a new event struct specific for the coredump event instead of piggybacking on the exit struct.Will re-submit a patch..
> > > >
> > > > Hm, I am still unsure if such a patch is needed. Couldn't you test for
> > > > coredump by inspecting exit_code on PROC_EVENT_EXIT?
> > >
> > > *** resubmitted message due to it got dropped by vger.kernel.org ***
> > >
> > > Well, what this patch adds I think is more a question of timing.
> > > As an example, say you want to quickly detect process failures. In that case if we would only have the EXIT event, that would mean that we get notified after the dump is done, which could take minutes depending on how large the dump is.
> > > If we instead watch for both EXIT & COREDUMP events, it would mean that we would quickly catch any failing process, regardless of if its actually starting to coredump or if its exited for some other reason.
> >
> >
> > Any other comments on this before I send a v2 patch with the exit vs the coredump event struct change?
>
> I would say just go on and submit the patch. Perhaps you can Cc someone
> looking after the change in signal.c (maybe lkml). I just checked that
> you don't hold any spinlocks while doing the netlink send.
>

Great.
Just submitted the v2 patch (and cc:ing lkml), thanks for your help! Much appreciated.. 		 	   		  

^ permalink raw reply

* [PATCH v2 1/1] connector: Added coredumping event to the process connector
From: Jesper Derehag @ 2013-03-20  6:50 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev, linux-kernel, Jesper Derehag

Process connector can now also detect coredumping events.
Main aim of patch is get notified at start of coredumping, instead of having to wait for it to finish and then being notified through EXIT event.
Could be used for instance by process-managers that want to get notified as soon as possible about process failures, and not necessarily beeing notified after coredump, which could be in the order of minutes depending on size of coredump, piping and so on.

Signed-off-by: Jesper Derehag <jderehag@hotmail.com>
---
 drivers/connector/cn_proc.c  |   25 +++++++++++++++++++++++++
 include/linux/cn_proc.h      |    4 ++++
 include/uapi/linux/cn_proc.h |   10 +++++++++-
 kernel/signal.c              |    2 ++
 4 filer ändrade, 40 tillägg(+), 1 borttagning(-)

diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c
index fce2000..c613ab3 100644
--- a/drivers/connector/cn_proc.c
+++ b/drivers/connector/cn_proc.c
@@ -232,6 +232,31 @@ void proc_comm_connector(struct task_struct *task)
 	cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL);
 }
 
+void proc_coredump_connector(struct task_struct *task)
+{
+	struct cn_msg *msg;
+	struct proc_event *ev;
+	__u8 buffer[CN_PROC_MSG_SIZE];
+	struct timespec ts;
+
+	if (atomic_read(&proc_event_num_listeners) < 1)
+		return;
+
+	msg = (struct cn_msg *)buffer;
+	ev = (struct proc_event *)msg->data;
+	get_seq(&msg->seq, &ev->cpu);
+	ktime_get_ts(&ts); /* get high res monotonic timestamp */
+	put_unaligned(timespec_to_ns(&ts), (__u64 *)&ev->timestamp_ns);
+	ev->what = PROC_EVENT_COREDUMP;
+	ev->event_data.coredump.process_pid = task->pid;
+	ev->event_data.coredump.process_tgid = task->tgid;
+
+	memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id));
+	msg->ack = 0; /* not used */
+	msg->len = sizeof(*ev);
+	cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL);
+}
+
 void proc_exit_connector(struct task_struct *task)
 {
 	struct cn_msg *msg;
diff --git a/include/linux/cn_proc.h b/include/linux/cn_proc.h
index 2c1bc1e..1d5b02a 100644
--- a/include/linux/cn_proc.h
+++ b/include/linux/cn_proc.h
@@ -26,6 +26,7 @@ void proc_id_connector(struct task_struct *task, int which_id);
 void proc_sid_connector(struct task_struct *task);
 void proc_ptrace_connector(struct task_struct *task, int which_id);
 void proc_comm_connector(struct task_struct *task);
+void proc_coredump_connector(struct task_struct *task);
 void proc_exit_connector(struct task_struct *task);
 #else
 static inline void proc_fork_connector(struct task_struct *task)
@@ -48,6 +49,9 @@ static inline void proc_ptrace_connector(struct task_struct *task,
 					 int ptrace_id)
 {}
 
+static inline void proc_coredump_connector(struct task_struct *task)
+{}
+
 static inline void proc_exit_connector(struct task_struct *task)
 {}
 #endif	/* CONFIG_PROC_EVENTS */
diff --git a/include/uapi/linux/cn_proc.h b/include/uapi/linux/cn_proc.h
index 0d7b499..f6c2710 100644
--- a/include/uapi/linux/cn_proc.h
+++ b/include/uapi/linux/cn_proc.h
@@ -56,7 +56,9 @@ struct proc_event {
 		PROC_EVENT_PTRACE = 0x00000100,
 		PROC_EVENT_COMM = 0x00000200,
 		/* "next" should be 0x00000400 */
-		/* "last" is the last process event: exit */
+		/* "last" is the last process event: exit,
+		 * while "next to last" is coredumping event */
+		PROC_EVENT_COREDUMP = 0x40000000,
 		PROC_EVENT_EXIT = 0x80000000
 	} what;
 	__u32 cpu;
@@ -110,11 +112,17 @@ struct proc_event {
 			char           comm[16];
 		} comm;
 
+		struct coredump_proc_event {
+			__kernel_pid_t process_pid;
+			__kernel_pid_t process_tgid;
+		} coredump;
+
 		struct exit_proc_event {
 			__kernel_pid_t process_pid;
 			__kernel_pid_t process_tgid;
 			__u32 exit_code, exit_signal;
 		} exit;
+
 	} event_data;
 };
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 2ec870a..55f035b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -32,6 +32,7 @@
 #include <linux/user_namespace.h>
 #include <linux/uprobes.h>
 #include <linux/compat.h>
+#include <linux/cn_proc.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
 
@@ -2347,6 +2348,7 @@ relock:
 		if (sig_kernel_coredump(signr)) {
 			if (print_fatal_signals)
 				print_fatal_signal(info->si_signo);
+			proc_coredump_connector(current);
 			/*
 			 * If it was able to dump core, this kills all
 			 * other threads in the group and synchronizes with
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next] net: fix psock_fanout selftest hash collision
From: Willem de Bruijn @ 2013-03-20  6:42 UTC (permalink / raw)
  To: davem, netdev; +Cc: Willem de Bruijn
In-Reply-To: <CA+FuTSeN1wXrSP8-piCdnv8MuqgokndSYsiidvdtC=d2tCqcDQ@mail.gmail.com>

Fix flaky results with PACKET_FANOUT_HASH depending on whether the
two flows hash into the same packet socket or not.

Also adds tests for PACKET_FANOUT_LB and PACKET_FANOUT_CPU and
replaces the counting method with a packet ring.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 .../testing/selftests/net-afpacket/psock_fanout.c  | 160 +++++++++++++++------
 1 file changed, 120 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/net-afpacket/psock_fanout.c b/tools/testing/selftests/net-afpacket/psock_fanout.c
index af9b019..f765f09 100644
--- a/tools/testing/selftests/net-afpacket/psock_fanout.c
+++ b/tools/testing/selftests/net-afpacket/psock_fanout.c
@@ -16,11 +16,11 @@
  *   The test currently runs for
  *   - PACKET_FANOUT_HASH
  *   - PACKET_FANOUT_HASH with PACKET_FANOUT_FLAG_ROLLOVER
+ *   - PACKET_FANOUT_LB
+ *   - PACKET_FANOUT_CPU
  *   - PACKET_FANOUT_ROLLOVER
  *
  * Todo:
- * - datapath: PACKET_FANOUT_LB
- * - datapath: PACKET_FANOUT_CPU
  * - functionality: PACKET_FANOUT_FLAG_DEFRAG
  *
  * License (GPLv2):
@@ -39,18 +39,23 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#define _GNU_SOURCE		/* for sched_setaffinity */
+
 #include <arpa/inet.h>
 #include <errno.h>
+#include <fcntl.h>
 #include <linux/filter.h>
 #include <linux/if_packet.h>
 #include <net/ethernet.h>
 #include <netinet/ip.h>
 #include <netinet/udp.h>
-#include <fcntl.h>
+#include <poll.h>
+#include <sched.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <sys/mman.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
 #include <sys/types.h>
@@ -58,6 +63,8 @@
 
 #define DATA_LEN			100
 #define DATA_CHAR			'a'
+#define RING_NUM_FRAMES			20
+#define PORT_BASE			8000
 
 static void pair_udp_open(int fds[], uint16_t port)
 {
@@ -162,37 +169,55 @@ static int sock_fanout_open(uint16_t typeflags, int num_packets)
 		return -1;
 	}
 
-	val = sizeof(struct iphdr) + sizeof(struct udphdr) + DATA_LEN;
-	val *= num_packets;
-	/* hack: apparently, the above calculation is too small (TODO: fix) */
-	val *= 3;
-	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val))) {
-		perror("setsockopt SO_RCVBUF");
-		exit(1);
-	}
-
 	sock_fanout_setfilter(fd);
 	return fd;
 }
 
-static void sock_fanout_read(int fds[], const int expect[])
+static char *sock_fanout_open_ring(int fd)
 {
-	struct tpacket_stats stats;
-	socklen_t ssize;
-	int ret[2];
+	struct tpacket_req req = {
+		.tp_block_size = getpagesize(),
+		.tp_frame_size = getpagesize(),
+		.tp_block_nr   = RING_NUM_FRAMES,
+		.tp_frame_nr   = RING_NUM_FRAMES,
+	};
+	char *ring;
 
-	ssize = sizeof(stats);
-	if (getsockopt(fds[0], SOL_PACKET, PACKET_STATISTICS, &stats, &ssize)) {
-		perror("getsockopt statistics 0");
+	if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req,
+		       sizeof(req))) {
+		perror("packetsock ring setsockopt");
 		exit(1);
 	}
-	ret[0] = stats.tp_packets - stats.tp_drops;
-	ssize = sizeof(stats);
-	if (getsockopt(fds[1], SOL_PACKET, PACKET_STATISTICS, &stats, &ssize)) {
-		perror("getsockopt statistics 1");
+
+	ring = mmap(0, req.tp_block_size * req.tp_block_nr,
+		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (!ring) {
+		fprintf(stderr, "packetsock ring mmap\n");
 		exit(1);
 	}
-	ret[1] = stats.tp_packets - stats.tp_drops;
+
+	return ring;
+}
+
+static int sock_fanout_read_ring(int fd, void *ring)
+{
+	struct tpacket_hdr *header = ring;
+	int count = 0;
+
+	while (header->tp_status & TP_STATUS_USER && count < RING_NUM_FRAMES) {
+		count++;
+		header = ring + (count * getpagesize());
+	}
+
+	return count;
+}
+
+static int sock_fanout_read(int fds[], char *rings[], const int expect[])
+{
+	int ret[2];
+
+	ret[0] = sock_fanout_read_ring(fds[0], rings[0]);
+	ret[1] = sock_fanout_read_ring(fds[1], rings[1]);
 
 	fprintf(stderr, "info: count=%d,%d, expect=%d,%d\n",
 			ret[0], ret[1], expect[0], expect[1]);
@@ -200,8 +225,10 @@ static void sock_fanout_read(int fds[], const int expect[])
 	if ((!(ret[0] == expect[0] && ret[1] == expect[1])) &&
 	    (!(ret[0] == expect[1] && ret[1] == expect[0]))) {
 		fprintf(stderr, "ERROR: incorrect queue lengths\n");
-		exit(1);
+		return 1;
 	}
+
+	return 0;
 }
 
 /* Test illegal mode + flag combination */
@@ -253,11 +280,12 @@ static void test_control_group(void)
 	}
 }
 
-static void test_datapath(uint16_t typeflags,
-			  const int expect1[], const int expect2[])
+static int test_datapath(uint16_t typeflags, int port_off,
+			 const int expect1[], const int expect2[])
 {
 	const int expect0[] = { 0, 0 };
-	int fds[2], fds_udp[2][2];
+	char *rings[2];
+	int fds[2], fds_udp[2][2], ret;
 
 	fprintf(stderr, "test: datapath 0x%hx\n", typeflags);
 
@@ -267,41 +295,93 @@ static void test_datapath(uint16_t typeflags,
 		fprintf(stderr, "ERROR: failed open\n");
 		exit(1);
 	}
-	pair_udp_open(fds_udp[0], 8000);
-	pair_udp_open(fds_udp[1], 8002);
-	sock_fanout_read(fds, expect0);
+	rings[0] = sock_fanout_open_ring(fds[0]);
+	rings[1] = sock_fanout_open_ring(fds[1]);
+	pair_udp_open(fds_udp[0], PORT_BASE);
+	pair_udp_open(fds_udp[1], PORT_BASE + port_off);
+	sock_fanout_read(fds, rings, expect0);
 
 	/* Send data, but not enough to overflow a queue */
 	pair_udp_send(fds_udp[0], 15);
 	pair_udp_send(fds_udp[1], 5);
-	sock_fanout_read(fds, expect1);
+	ret = sock_fanout_read(fds, rings, expect1);
 
 	/* Send more data, overflow the queue */
 	pair_udp_send(fds_udp[0], 15);
 	/* TODO: ensure consistent order between expect1 and expect2 */
-	sock_fanout_read(fds, expect2);
+	ret |= sock_fanout_read(fds, rings, expect2);
 
+	if (munmap(rings[1], RING_NUM_FRAMES * getpagesize()) ||
+	    munmap(rings[0], RING_NUM_FRAMES * getpagesize())) {
+		fprintf(stderr, "close rings\n");
+		exit(1);
+	}
 	if (close(fds_udp[1][1]) || close(fds_udp[1][0]) ||
 	    close(fds_udp[0][1]) || close(fds_udp[0][0]) ||
 	    close(fds[1]) || close(fds[0])) {
 		fprintf(stderr, "close datapath\n");
 		exit(1);
 	}
+
+	return ret;
+}
+
+static int set_cpuaffinity(int cpuid)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpuid, &mask);
+	if (sched_setaffinity(0, sizeof(mask), &mask)) {
+		if (errno != EINVAL) {
+			fprintf(stderr, "setaffinity %d\n", cpuid);
+			exit(1);
+		}
+		return 1;
+	}
+
+	return 0;
 }
 
 int main(int argc, char **argv)
 {
-	const int expect_hash[2][2]	= { { 15, 5 }, { 5, 0 } };
-	const int expect_hash_rb[2][2]	= { { 15, 5 }, { 5, 10 } };
-	const int expect_rb[2][2]	= { { 20, 0 }, { 0, 15 } };
+	const int expect_hash[2][2]	= { { 15, 5 },  { 20, 5 } };
+	const int expect_hash_rb[2][2]	= { { 15, 5 },  { 20, 15 } };
+	const int expect_lb[2][2]	= { { 10, 10 }, { 18, 17 } };
+	const int expect_rb[2][2]	= { { 20, 0 },  { 20, 15 } };
+	const int expect_cpu0[2][2]	= { { 20, 0 },  { 20, 0 } };
+	const int expect_cpu1[2][2]	= { { 0, 20 },  { 0, 20 } };
+	int port_off = 2, tries = 5, ret;
 
 	test_control_single();
 	test_control_group();
 
-	test_datapath(PACKET_FANOUT_HASH, expect_hash[0], expect_hash[1]);
-	test_datapath(PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_ROLLOVER,
-		      expect_hash_rb[0], expect_hash_rb[1]);
-	test_datapath(PACKET_FANOUT_ROLLOVER, expect_rb[0], expect_rb[1]);
+	/* find a set of ports that do not collide onto the same socket */
+	ret = test_datapath(PACKET_FANOUT_HASH, port_off,
+			    expect_hash[0], expect_hash[1]);
+	while (ret && tries--) {
+		fprintf(stderr, "info: trying alternate ports (%d)\n", tries);
+		ret = test_datapath(PACKET_FANOUT_HASH, ++port_off,
+				    expect_hash[0], expect_hash[1]);
+	}
+
+	ret |= test_datapath(PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_ROLLOVER,
+			     port_off, expect_hash_rb[0], expect_hash_rb[1]);
+	ret |= test_datapath(PACKET_FANOUT_LB,
+			     port_off, expect_lb[0], expect_lb[1]);
+	ret |= test_datapath(PACKET_FANOUT_ROLLOVER,
+			     port_off, expect_rb[0], expect_rb[1]);
+
+	set_cpuaffinity(0);
+	ret |= test_datapath(PACKET_FANOUT_CPU, port_off,
+			     expect_cpu0[0], expect_cpu0[1]);
+	if (!set_cpuaffinity(1))
+		/* TODO: test that choice alternates with previous */
+		ret |= test_datapath(PACKET_FANOUT_CPU, port_off,
+				     expect_cpu1[0], expect_cpu1[1]);
+
+	if (ret)
+		return 1;
 
 	printf("OK. All tests passed\n");
 	return 0;
-- 
1.8.1.3

^ permalink raw reply related

* Re: [PATCH net-next v3] packet: packet fanout rollover during socket overload
From: Willem de Bruijn @ 2013-03-20  6:37 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev
In-Reply-To: <CA+FuTSeDd4Xjh0KsbdZscrkSMk_o_1y+w_t_N-7FM3GG4i+LzA@mail.gmail.com>

On Tue, Mar 19, 2013 at 5:34 PM, Willem de Bruijn <willemb@google.com> wrote:
> On Tue, Mar 19, 2013 at 5:16 PM, David Miller <davem@davemloft.net> wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Tue, 19 Mar 2013 13:37:17 -0700
>>
>>> On Tue, 2013-03-19 at 16:18 -0400, Willem de Bruijn wrote:
>>>> Changes:
>>>>   v3->v2: rebase (no other changes)
>>>>           passes selftest
>>>>   v2->v1: read f->num_members only once
>>>>           fix bug: test rollover mode + flag
>>  ...
>>>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>>  ...
>>> Reviewed-by: Eric Dumazet <edumazet@google.com>
>>
>> Applied, and I added the selftest too
>
> Thanks!
>
>>, but it fails for me on my
>> 128-cpu sparc64 machine:
>>
>> make[1]: Entering directory `/home/davem/src/GIT/net-next/tools/testing/selftests/net-afpacket'
>> --------------------
>> running psock_fanout test
>> --------------------
>> test: control single socket
>> test: control multiple sockets
>> test: datapath 0x0
>> info: count=0,0, expect=0,0
>> info: count=0,20, expect=15,5
>> ERROR: incorrect queue lengths
>
> I'm looking into it. The test needs the packet sockets to be able to
> hold an exact number of test packets and then overflow. Queue
> length configured in bytes (SO_RCVBUF) was arrived at by
> experimentation. My best hunch so far is that this differs between
> platforms/nic, if skb->truesize does. If queue length is the problem,
> I'll switch to a packet ring. Booting another machine right now.

I was able to reproduce this. The issue is not queue length, but an
rxhash collision. The test either passes or fails consistently between
reboots. Rebooting to get a different hashrnd seed may change the test
outcome.

Before finding this, I had already updated the patch to use a packet
ring (still better than the estimated queue length) and added
testcases for the LB and CPU modes. I may need to clean it up, but I
will send the current patch to see if it resolves the bug.


>> [FAIL]

^ permalink raw reply

* Re: [PATCH net-next 1/2] net_sched: don't do precise pkt_len computation for untrusted packets
From: Jason Wang @ 2013-03-20  6:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, linux-kernel, mst, edumazet,
	Daniel Borkmann
In-Reply-To: <1363697906.21184.53.camel@edumazet-glaptop>

On 03/19/2013 08:58 PM, Eric Dumazet wrote:
> On Tue, 2013-03-19 at 05:10 -0700, Eric Dumazet wrote:
>> On Tue, 2013-03-19 at 17:25 +0800, Jason Wang wrote:
>>
>>> I believe before doing header check for untrusted packets, the only
>>> thing we can trust is skb->len and that's we've used before
>>> 1def9238d4aa2. But after that, we're trying to use unchecked or
>>> meaningless value (e.g gso_segs were reset to zero in
>>> tun/macvtap/packet), and guest then can utilize this to result a very
>>> huge (-1U) pkt_len by filling evil value in the header. Can all kinds of
>>> packet scheduler survive this kinds of possible DOS?
>> I would use the flow dissector to fix the transport header from all
>> DODGY providers.
>>
>> Daniel Borkmann is working on a patch serie adding nhoff into flow_keys,
>> and adding __skb_get_poff(const struct sk_buff *skb), for a BPF
>> extension we talked about in Copenhagen / Netfilter Workshop.
>>
>> You could then set the transport header offset to the right value.
>>
>> (and drop evil packets before they go further in the stack)
>>
>> if (gso_packet(skb)) {
>> 	u32 poff = __skb_get_poff(skb);
>>
>> 	if (!poff) {
>> 		drop_evil_packet(skb);
>> 	} else {
>> 		skb_set_transport_header(skb, poff);
>> 		...
>> 	}
>> }
>
> Oh well, no need to use __skb_get_poff() but plain skb_flow_dissect()
> (once patched to include thoff in struct flow_keys)
>
> struct flow_keys keys;
>
> if (!skb_flow_dissect(skb, &keys))
> 	goto drop;
>
> if ((gso_type & (SKB_GSO_TCPV4|SKB_GSO_TCPV6)) &&
>     keys.ip_proto != IP_PROTO_TCP)
> 	goto drop;
>
> skb_set_transport_header(skb, keys.thoff);

I was consider a just skb_reset_transport_header() here. Consider the
transport header maybe checked and reset during header check for packets
of gso or partial checksum. And bypass precise pkt len computation.

Some problems with skb_flow_dissect():

- it can only recognizes a subset of all ethernet protocols. The may
blocks guest who may want to use other protocol such as IPX.
- almost no check in the validity of the L4 protocol header which may be
used by qdisc_pkt_len_init(), which may still give a chance to evil
guest to use
- gso_segs were untouched (still zero)

Another method is doing header check here which needs more work.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] drivers/isdn: break out of the loop after call isdn_tty_send_msg
From: Chen Gang @ 2013-03-20  5:03 UTC (permalink / raw)
  To: Jiri Slaby, David Miller; +Cc: Jiri Kosina, isdn, Greg KH, Alan Cox, netdev
In-Reply-To: <51428134.50601@asianux.com>

Hello Maintainers:

  did I send incorrect mail address ?
    ./scripts/get_maintainers.pl leads to cc netdev, but exclude David Miller.
    is it a bug of ./scripts/get_maintainers.pl ?
      (this time, I include him in this mail address).

  welcome any members to providing suggestions or completions.

  thanks.

gchen.

On 2013年03月15日 10:02, Chen Gang wrote:
> Hello Maintainers:
> 
>   is it qualified to be applied ?
> 
>   thanks.
> 
> 
> 于 2013年02月28日 17:54, Jiri Slaby 写道:
>> On 02/28/2013 03:57 AM, Chen Gang wrote:
>>>
>>>   need break out of the loop after call isdn_tty_send_msg.
>>>     isdn_tty_send_msg is intended to eat the rest of the string.
>>>     so need not scan again the string which appended "+M...".
>>
>> Yes, looks good.
>>
>>> Signed-off-by: Chen Gang <gang.chen@asianux.com>
>>> ---
>>>  drivers/isdn/i4l/isdn_tty.c |    4 +++-
>>>  1 files changed, 3 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/drivers/isdn/i4l/isdn_tty.c b/drivers/isdn/i4l/isdn_tty.c
>>> index d8a7d83..8ac7b33 100644
>>> --- a/drivers/isdn/i4l/isdn_tty.c
>>> +++ b/drivers/isdn/i4l/isdn_tty.c
>>> @@ -3587,7 +3587,7 @@ isdn_tty_parse_at(modem_info *info)
>>>  			case 'M':	/* MESSAGE */
>>>  				p++;
>>>  				isdn_tty_send_msg(info, m, p);
>>> -				break;
>>> +				goto tail;
>>>  			default:
>>>  				PARSE_ERROR;
>>>  			}
>>> @@ -3601,6 +3601,8 @@ isdn_tty_parse_at(modem_info *info)
>>>  			PARSE_ERROR;
>>>  		}
>>>  	}
>>> +
>>> +tail:
>>>  #ifdef CONFIG_ISDN_AUDIO
>>>  	if (!info->vonline)
>>>  #endif
>>>
>>
>>
> 
> 


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [PATCH] tcp: dont handle MTU reduction on LISTEN socket
From: Eric Dumazet @ 2013-03-20  4:51 UTC (permalink / raw)
  To: dormando; +Cc: David Miller, netdev
In-Reply-To: <alpine.DEB.2.02.1303192114580.4793@dflat>

On Tue, 2013-03-19 at 21:16 -0700, dormando wrote:

> Thanks! We are all incredibly appreciative of your quick help with this.
> 

Well, given that I introduced the bug, I better fix it ;)

> I have 3.8.3 + this patch running on our machine for 20 hours so far. We
> haven't had a kernel newer than 3.2 go for a full day before. I'm giving
> it another peak cycle before calling it dead but it looks like you nailed
> it.
> 
> I assume you no longer need the kernel disassembly?

Cross our fingers that the bug was really this one.

If you have another crash, please let us know.

Thanks !

^ permalink raw reply

* Re: [PATCH] tcp: dont handle MTU reduction on LISTEN socket
From: dormando @ 2013-03-20  4:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1363626088.29475.155.camel@edumazet-glaptop>

> From: Eric Dumazet <edumazet@google.com>
>
> When an ICMP ICMP_FRAG_NEEDED (or ICMPV6_PKT_TOOBIG) message finds a
> LISTEN socket, and this socket is currently owned by the user, we
> set TCP_MTU_REDUCED_DEFERRED flag in listener tsq_flags.
>
> This is bad because if we clone the parent before it had a chance to
> clear the flag, the child inherits the tsq_flags value, and next
> tcp_release_cb() on the child will decrement sk_refcnt.
>
> Result is that we might free a live TCP socket, as reported by
> Dormando.
>
> IPv4: Attempt to release TCP socket in state 1
>
> Fix this issue by testing sk_state against TCP_LISTEN early, so that we
> set TCP_MTU_REDUCED_DEFERRED on appropriate sockets (not a LISTEN one)
>
> This bug was introduced in commit 563d34d05786
> (tcp: dont drop MTU reduction indications)
>
> Reported-by: dormando <dormando@rydia.net>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/tcp_ipv4.c |   14 +++++++-------
>  net/ipv6/tcp_ipv6.c |    7 +++++++
>  2 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 4a8ec45..d09203c 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -274,13 +274,6 @@ static void tcp_v4_mtu_reduced(struct sock *sk)
>  	struct inet_sock *inet = inet_sk(sk);
>  	u32 mtu = tcp_sk(sk)->mtu_info;
>
> -	/* We are not interested in TCP_LISTEN and open_requests (SYN-ACKs
> -	 * send out by Linux are always <576bytes so they should go through
> -	 * unfragmented).
> -	 */
> -	if (sk->sk_state == TCP_LISTEN)
> -		return;
> -
>  	dst = inet_csk_update_pmtu(sk, mtu);
>  	if (!dst)
>  		return;
> @@ -408,6 +401,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
>  			goto out;
>
>  		if (code == ICMP_FRAG_NEEDED) { /* PMTU discovery (RFC1191) */
> +			/* We are not interested in TCP_LISTEN and open_requests
> +			 * (SYN-ACKs send out by Linux are always <576bytes so
> +			 * they should go through unfragmented).
> +			 */
> +			if (sk->sk_state == TCP_LISTEN)
> +				goto out;
> +
>  			tp->mtu_info = info;
>  			if (!sock_owned_by_user(sk)) {
>  				tcp_v4_mtu_reduced(sk);
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 9b64600..f6d629f 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -389,6 +389,13 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
>  	}
>
>  	if (type == ICMPV6_PKT_TOOBIG) {
> +		/* We are not interested in TCP_LISTEN and open_requests
> +		 * (SYN-ACKs send out by Linux are always <576bytes so
> +		 * they should go through unfragmented).
> +		 */
> +		if (sk->sk_state == TCP_LISTEN)
> +			goto out;
> +
>  		tp->mtu_info = ntohl(info);
>  		if (!sock_owned_by_user(sk))
>  			tcp_v6_mtu_reduced(sk);
>
>
>

Thanks! We are all incredibly appreciative of your quick help with this.

I have 3.8.3 + this patch running on our machine for 20 hours so far. We
haven't had a kernel newer than 3.2 go for a full day before. I'm giving
it another peak cycle before calling it dead but it looks like you nailed
it.

I assume you no longer need the kernel disassembly?

thanks again,
-Dormando

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox