Netdev List
 help / color / mirror / Atom feed
* Re: [patch 1/2] be2net: add unlock on error path
From: Sarveshwar Bandi @ 2010-05-27  6:31 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Sathya Perla, Subbu Seetharaman, Ajit Khaparde, David S. Miller,
	netdev, kernel-janitors
In-Reply-To: <20100526144634.GL22515@bicker>

Thanks

Acked-by: Sarveshwar Bandi <sarveshwarb@serverengines.com> 
On 26/05/10 16:46 +0200, Dan Carpenter wrote:
> The unlock accidentally got removed from the error path in dd131e76e5:
> "be2net: Bug fix to avoid disabling bottom half during firmware upgrade."
>
> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> diff --git a/drivers/net/benet/be_cmds.c b/drivers/net/benet/be_cmds.c
> index c911bfb..18d5789 100644
> --- a/drivers/net/benet/be_cmds.c
> +++ b/drivers/net/benet/be_cmds.c
> @@ -1429,7 +1429,7 @@ int be_cmd_write_flashrom(struct be_adapter *adapter, struct be_dma_mem *cmd,
>  	wrb = wrb_from_mccq(adapter);
>  	if (!wrb) {
>  		status = -EBUSY;
> -		goto err;
> +		goto err_unlock;
>  	}
>  	req = cmd->va;
>  	sge = nonembedded_sgl(wrb);
> @@ -1457,7 +1457,10 @@ int be_cmd_write_flashrom(struct be_adapter *adapter, struct be_dma_mem *cmd,
>  	else
>  		status = adapter->flash_status;
>  
> -err:
> +	return status;
> +
> +err_unlock:
> +	spin_unlock_bh(&adapter->mcc_lock);
>  	return status;
>  }
>  
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net: fix lock_sock_bh/unlock_sock_bh
From: Anton Blanchard @ 2010-05-27  6:09 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20100526.222332.233688655.davem@davemloft.net>


> > [PATCH v2] net: fix lock_sock_bh/unlock_sock_bh
> > 
> > This new sock lock primitive was introduced to speedup some user context
> > socket manipulation. But it is unsafe to protect two threads, one using
> > regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh
> > 
> > This patch changes lock_sock_bh to be careful against 'owned' state.
> > If owned is found to be set, we must take the slow path.
> > lock_sock_bh() now returns a boolean to say if the slow path was taken,
> > and this boolean is used at unlock_sock_bh time to call the appropriate
> > unlock function.
> > 
> > After this change, BH are either disabled or enabled during the
> > lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
> > so we rename these functions to lock_sock_fast()/unlock_sock_fast().
> > 
> > Reported-by: Anton Blanchard <anton@samba.org>
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Looks good, I'll wait for positive testing from Anton before applying
> this.

Thanks guys, this fixed it.

Tested-by: Anton Blanchard <anton@samba.org>

Anton

^ permalink raw reply

* Re: [PATCH 3/3] cxgb4i_v3: iscsi and libcxgbi library for handling common part
From: Rakesh Ranjan @ 2010-05-27  6:05 UTC (permalink / raw)
  To: Mike Christie
  Cc: open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Rakesh Ranjan, NETDEVML,
	SCSIDEVML, LKML, Karen Xie, David Miller, James Bottomley,
	Anish Bhatt
In-Reply-To: <4BFE0A3E.9020804-hcNo3dDEHLuVc3sceRu5cw@public.gmane.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 05/27/2010 11:29 AM, Mike Christie wrote:
> On 05/15/2010 12:24 PM, Rakesh Ranjan wrote:
>> From: Rakesh Ranjan<rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
>>
>>
>> Signed-off-by: Rakesh Ranjan<rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
>> ---
>>   drivers/scsi/cxgb4i/cxgb4i_iscsi.c |  617
>> ++++++++++++++++++++++++++++++++++++
>>   drivers/scsi/cxgb4i/libcxgbi.c     |  589
>> ++++++++++++++++++++++++++++++++++
>>   drivers/scsi/cxgb4i/libcxgbi.h     |  430 +++++++++++++++++++++++++
> 
> I think the patch had some whitespace/newline issues. When I did git-am
> I got:
> 
> warning: squelched 1 whitespace error
> warning: 6 lines add whitespace errors.
> 
> I think James can just fix up when he merges with git, but in the future
> you might want to try a git-am on your patch before you send (git-am
> --whitespace=fix will fix it up for you).
> 
> 
> 
>> +
>> +int cxgbi_pdu_init(struct cxgbi_device *cdev)
>> +{
>> +    if (cdev->skb_tx_headroom>  (512 * MAX_SKB_FRAGS))
>> +        cdev->skb_extra_headroom = cdev->skb_tx_headroom;
>> +    cdev->pad_page = alloc_page(GFP_KERNEL);
>> +    if (cdev->pad_page)
>> +        return -ENOMEM;
>> +    memset(page_address(cdev->pad_page), 0, PAGE_SIZE);
>> +    return 0;
>> +
>> +    /*
>> +    if (SKB_TX_HEADROOM>  (512 * MAX_SKB_FRAGS))
>> +        skb_extra_headroom = SKB_TX_HEADROOM;
>> +    pad_page = alloc_page(GFP_KERNEL);
>> +    if (!pad_page)
>> +        return -ENOMEM;
>> +    memset(page_address(pad_page), 0, PAGE_SIZE);
>> +    return 0;*/
> 
> Clean this up.
> 
>> +
>> +void cxgbi_release_itt(struct iscsi_task *task, itt_t hdr_itt)
>> +{
>> +    struct scsi_cmnd *sc = task->sc;
>> +    struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
>> +    struct cxgbi_conn *cconn = tcp_conn->dd_data;
>> +    struct cxgbi_device *cdev = cconn->chba->cdev;
>> +    struct cxgbi_tag_format *tformat =&cdev->tag_format;
>> +    u32 tag = ntohl((__force u32)hdr_itt);
>> +
>> +    cxgbi_tag_debug("release tag 0x%x.\n", tag);
>> +
>> +    if (sc&&
>> +        (scsi_bidi_cmnd(sc) ||
>> +         sc->sc_data_direction == DMA_FROM_DEVICE)&&
>> +            cxgbi_is_ddp_tag(tformat, tag))
> 
> The formatting is a little weird. I think you want each line to start in
> the same column instead of each getting tabbed over.
> 
> 
>> +
>> +
>> +int cxgbi_reserve_itt(struct iscsi_task *task, itt_t *hdr_itt)
>> +{
>> +    struct scsi_cmnd *sc = task->sc;
>> +    struct iscsi_conn *conn = task->conn;
>> +    struct iscsi_session *sess = conn->session;
>> +    struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
>> +    struct cxgbi_conn *cconn = tcp_conn->dd_data;
>> +    struct cxgbi_device *cdev = cconn->chba->cdev;
>> +    struct cxgbi_tag_format *tformat =&cdev->tag_format;
>> +    u32 sw_tag = (sess->age<<  cconn->task_idx_bits) | task->itt;
>> +    u32 tag;
>> +    int err = -EINVAL;
>> +
>> +    if (sc&&
>> +        (scsi_bidi_cmnd(sc) ||
>> +         sc->sc_data_direction == DMA_FROM_DEVICE)&&
>> +            cxgbi_sw_tag_usable(tformat, sw_tag)) {
> 
> 
> Same tabbing.
> 
> 
>> +    volatile unsigned int state;
> 
> I did not get why this needed to be volatile (I see it is like this in
> cxgb3i and I did not see why in there too).
> 
> 
> 
>> +
>> +static inline void cxgbi_sock_hold(struct cxgbi_sock *csk)
>> +{
>> +    atomic_inc(&csk->refcnt);
> 
> 
> We want people to use krefs instead of their own refcounting now.
> 
>> +
>> +static inline void *cplhdr(struct sk_buff *skb)
>> +{
>> +    return skb->data;
>> +}
> 
> 
> Seems kinda useless.

Hi Mike,

Thanks for the review, will send you the updated patch.

Regards
Rakesh Ranjan

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iQIcBAEBCgAGBQJL/guKAAoJEBqoHbxtDU4JFvQP/2uIWFmZo4nqcJ7xmEiBjnaq
BeUfjbRGaiAImEM/6OuMIB/Fm9p9NGtW0GsZdqKYSdZBmgO1kWNT1onCZgVF4eW1
Muikjo+I2wXGsWIt4ZqbZs722COC1oQHXpyulfh/6ONzeWJ3R8vos/TnCw256Wxj
HL42jIO882JCImzmP/wmdPmDlTUI/Z0UEYbu0djy20yzESo2NDHq1PYYopyXIboH
joJ/sZvK0jRYrRsDdq84XJ8CUv1yDE8m3NiMV8cV9JN1EwP/gzBrq0DXfFgpdEDU
+4/pfzzxJ5y/ekdZaSZdx9ke/n/vmamJCaQONaL2WfQaznogxqZypaoM/NRzfMlH
40KY5lqajGRnm0aEBffn8uqBzEbopYpb1m+3mzt/vDtvdBRuSJqmcPbUandTNHLe
cZQnn689GtWqJfoyVF/dfyubtAZFLu8VVZkFxx1e3UDEqVDUuwVEHXoXSt6K/SkB
UnZI7hMUYSof+WI+UEV8DR4KMsRbE1cnvXGnsXTZOMhMsU3KoMielaGTgepAbxT7
bLfaARiMwvU+vle5546uK8IuxjGH81nbUEy6R86OeS6Osv6PZFiGYP9yxjzkp4K7
NWOTnVV9qTaPPTtN9SORINPXUXuI90JgIDbh9qnfDDKl2oj5HmwBpa7S3KmMxxlI
eoAqa/9LLjrMVi5Wicv3
=TOBq
-----END PGP SIGNATURE-----

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply

* Re: [PATCH 3/3] cxgb4i_v3: iscsi and libcxgbi library for handling common part
From: Mike Christie @ 2010-05-27  5:59 UTC (permalink / raw)
  To: open-iscsi
  Cc: Rakesh Ranjan, NETDEVML, SCSIDEVML, LKML, Karen Xie, David Miller,
	James Bottomley, Anish Bhatt, Rakesh Ranjan
In-Reply-To: <1273944249-311-4-git-send-email-rakesh@chelsio.com>

On 05/15/2010 12:24 PM, Rakesh Ranjan wrote:
> From: Rakesh Ranjan<rranjan@chelsio.com>
>
>
> Signed-off-by: Rakesh Ranjan<rakesh@chelsio.com>
> ---
>   drivers/scsi/cxgb4i/cxgb4i_iscsi.c |  617 ++++++++++++++++++++++++++++++++++++
>   drivers/scsi/cxgb4i/libcxgbi.c     |  589 ++++++++++++++++++++++++++++++++++
>   drivers/scsi/cxgb4i/libcxgbi.h     |  430 +++++++++++++++++++++++++

I think the patch had some whitespace/newline issues. When I did git-am 
I got:

warning: squelched 1 whitespace error
warning: 6 lines add whitespace errors.

I think James can just fix up when he merges with git, but in the future 
you might want to try a git-am on your patch before you send (git-am 
--whitespace=fix will fix it up for you).



> +
> +int cxgbi_pdu_init(struct cxgbi_device *cdev)
> +{
> +	if (cdev->skb_tx_headroom>  (512 * MAX_SKB_FRAGS))
> +		cdev->skb_extra_headroom = cdev->skb_tx_headroom;
> +	cdev->pad_page = alloc_page(GFP_KERNEL);
> +	if (cdev->pad_page)
> +		return -ENOMEM;
> +	memset(page_address(cdev->pad_page), 0, PAGE_SIZE);
> +	return 0;
> +
> +	/*
> +	if (SKB_TX_HEADROOM>  (512 * MAX_SKB_FRAGS))
> +		skb_extra_headroom = SKB_TX_HEADROOM;
> +	pad_page = alloc_page(GFP_KERNEL);
> +	if (!pad_page)
> +		return -ENOMEM;
> +	memset(page_address(pad_page), 0, PAGE_SIZE);
> +	return 0;*/

Clean this up.

> +
> +void cxgbi_release_itt(struct iscsi_task *task, itt_t hdr_itt)
> +{
> +	struct scsi_cmnd *sc = task->sc;
> +	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
> +	struct cxgbi_conn *cconn = tcp_conn->dd_data;
> +	struct cxgbi_device *cdev = cconn->chba->cdev;
> +	struct cxgbi_tag_format *tformat =&cdev->tag_format;
> +	u32 tag = ntohl((__force u32)hdr_itt);
> +
> +	cxgbi_tag_debug("release tag 0x%x.\n", tag);
> +
> +	if (sc&&
> +		(scsi_bidi_cmnd(sc) ||
> +		 sc->sc_data_direction == DMA_FROM_DEVICE)&&
> +			cxgbi_is_ddp_tag(tformat, tag))

The formatting is a little weird. I think you want each line to start in 
the same column instead of each getting tabbed over.


> +
> +
> +int cxgbi_reserve_itt(struct iscsi_task *task, itt_t *hdr_itt)
> +{
> +	struct scsi_cmnd *sc = task->sc;
> +	struct iscsi_conn *conn = task->conn;
> +	struct iscsi_session *sess = conn->session;
> +	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
> +	struct cxgbi_conn *cconn = tcp_conn->dd_data;
> +	struct cxgbi_device *cdev = cconn->chba->cdev;
> +	struct cxgbi_tag_format *tformat =&cdev->tag_format;
> +	u32 sw_tag = (sess->age<<  cconn->task_idx_bits) | task->itt;
> +	u32 tag;
> +	int err = -EINVAL;
> +
> +	if (sc&&
> +		(scsi_bidi_cmnd(sc) ||
> +		 sc->sc_data_direction == DMA_FROM_DEVICE)&&
> +			cxgbi_sw_tag_usable(tformat, sw_tag)) {


Same tabbing.


> +	volatile unsigned int state;

I did not get why this needed to be volatile (I see it is like this in 
cxgb3i and I did not see why in there too).



> +
> +static inline void cxgbi_sock_hold(struct cxgbi_sock *csk)
> +{
> +	atomic_inc(&csk->refcnt);


We want people to use krefs instead of their own refcounting now.

> +
> +static inline void *cplhdr(struct sk_buff *skb)
> +{
> +	return skb->data;
> +}


Seems kinda useless.

^ permalink raw reply

* [PATCH] mlx4_en: use net_device dev_id to indicate port number
From: Eli Cohen @ 2010-05-27  5:56 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	rdreier-FYB4Gu1CFyUAvxtiuMwx3w, yevgenyp-VPRAkNaXOzVS1MOuV/RT9w,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA

Today, there are no means to know which port of a hardware device a netdev
interface uses. struct net_device conatins a field, dev_id, that can be used
for that. Use this field to save the port number in ConnectX that is being used
by the net device; port numbers are zero based.

Signed-off-by: Eli Cohen <eli-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org>
---
 drivers/net/mlx4/en_netdev.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/mlx4/en_netdev.c b/drivers/net/mlx4/en_netdev.c
index 6c2b15b..ab1d480 100644
--- a/drivers/net/mlx4/en_netdev.c
+++ b/drivers/net/mlx4/en_netdev.c
@@ -978,6 +978,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	}
 
 	SET_NETDEV_DEV(dev, &mdev->dev->pdev->dev);
+	dev->dev_id =  port - 1;
 
 	/*
 	 * Initialize driver private data
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH] net: add rtnl assertion to linkwatch_run_queue
From: Denis Kirjanov @ 2010-05-27  5:51 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, shemminger, herbert, kaber, netdev

Add RTNL assertion to linkwatch_run_queue since function 
must be called with semaphore held

Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
---
net/core/link_watch.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/core/link_watch.c b/net/core/link_watch.c
index bdbce2f..e6d06c0 100644
--- a/net/core/link_watch.c
+++ b/net/core/link_watch.c
@@ -220,6 +220,7 @@ void linkwatch_forget_dev(struct net_device *dev)
 /* Must be called with the rtnl semaphore held */
 void linkwatch_run_queue(void)
 {
+	ASSERT_RTNL();
 	__linkwatch_run_queue(0);
 }
 

^ permalink raw reply related

* Re: [PATCH] net: fix lock_sock_bh/unlock_sock_bh
From: David Miller @ 2010-05-27  5:23 UTC (permalink / raw)
  To: eric.dumazet; +Cc: anton, netdev
In-Reply-To: <1274937618.2542.46.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 27 May 2010 07:20:18 +0200

> [PATCH v2] net: fix lock_sock_bh/unlock_sock_bh
> 
> This new sock lock primitive was introduced to speedup some user context
> socket manipulation. But it is unsafe to protect two threads, one using
> regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh
> 
> This patch changes lock_sock_bh to be careful against 'owned' state.
> If owned is found to be set, we must take the slow path.
> lock_sock_bh() now returns a boolean to say if the slow path was taken,
> and this boolean is used at unlock_sock_bh time to call the appropriate
> unlock function.
> 
> After this change, BH are either disabled or enabled during the
> lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
> so we rename these functions to lock_sock_fast()/unlock_sock_fast().
> 
> Reported-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, I'll wait for positive testing from Anton before applying
this.

^ permalink raw reply

* Re: [PATCH] net: fix lock_sock_bh/unlock_sock_bh
From: Eric Dumazet @ 2010-05-27  5:20 UTC (permalink / raw)
  To: David Miller; +Cc: anton, netdev
In-Reply-To: <1274936763.2542.42.camel@edumazet-laptop>

Le jeudi 27 mai 2010 à 07:06 +0200, Eric Dumazet a écrit :
> Le mercredi 26 mai 2010 à 21:21 -0700, David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 27 May 2010 06:18:46 +0200
> > 
> > > Process context :
> > > 
> > > lock only (sk_lock.slock = locked, sk_lock.owned = ???)
> > > 
> > > So I should add a test on owned. If set (by another thread), we should take the slow path.
> > 
> > Indeed, what you're doing now is broken.
> > 
> > If owned is non-zero when you take the spinlock you have to queue your
> > work, just like how we toss packets into the socket backlog, which is
> > processed by release_sock(), when this happens.
> 
> Here is the patch I am going to test today.
> 
> Anton, could you please test it too, just for sure ?
> 
> Thanks !
> 
> 
> [PATCH] net: fix lock_sock_bh/unlock_sock_bh
> 
> This new sock lock primitive was introduced to speedup some user context
> socket manipulation. But it is unsafe to protect two threads, one using
> regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh
> 
> This patch changes lock_sock_bh to be careful against 'owned' state.
> If owned is found to be set, we must take the slow path.
> lock_sock_bh() now returns a boolean to say if the slow path was taken,
> and this boolean is used at unlock_sock_bh time to call the appropriate
> unlock function.
> 
> After this change, BH are either disabled or enabled during the
> lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
> so we rename these functions to lock_sock_fast()/unlock_sock_fast().
> 
> Reported-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/net/sock.h  |   20 ++++++++++++++------
>  net/core/datagram.c |    6 ++++--
>  net/core/sock.c     |   32 ++++++++++++++++++++++++++++++++
>  net/ipv4/udp.c      |   14 ++++++++------
>  net/ipv6/udp.c      |    5 +++--
>  5 files changed, 61 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index d2a71b0..ca241ea 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1026,15 +1026,23 @@ extern void release_sock(struct sock *sk);
>  				SINGLE_DEPTH_NESTING)
>  #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
>  
> -static inline void lock_sock_bh(struct sock *sk)
> +extern bool lock_sock_fast(struct sock *sk);
> +/**
> + * unlock_sock_fast - complement of lock_sock_fast
> + * @sk: socket
> + * @slow: slow mode
> + *
> + * fast unlock socket for user context.
> + * If slow mode is on, we call regular release_sock()
> + */
> +static inline void unlock_sock_fast(struct sock *sk, bool slow)
>  {
> -	spin_lock_bh(&sk->sk_lock.slock);
> +	if (slow)
> +		release_sock(sk);
> +	else
> +		spin_unlock_bh(&sk->sk_lock.slock);
>  }
>  
> -static inline void unlock_sock_bh(struct sock *sk)
> -{
> -	spin_unlock_bh(&sk->sk_lock.slock);
> -}
>  
>  extern struct sock		*sk_alloc(struct net *net, int family,
>  					  gfp_t priority,
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index e009753..f5b6f43 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -229,15 +229,17 @@ EXPORT_SYMBOL(skb_free_datagram);
>  
>  void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
>  {
> +	bool slow;
> +
>  	if (likely(atomic_read(&skb->users) == 1))
>  		smp_rmb();
>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>  		return;
>  
> -	lock_sock_bh(sk);
> +	slow = lock_sock_fast(sk);
>  	skb_orphan(skb);
>  	sk_mem_reclaim_partial(sk);
> -	unlock_sock_bh(sk);
> +	unlock_sock_fast(sk, slow);
>  
>  	/* skb is now orphaned, can be freed outside of locked section */
>  	__kfree_skb(skb);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 37fe9b6..7ab6398 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2007,6 +2007,38 @@ void release_sock(struct sock *sk)
>  }
>  EXPORT_SYMBOL(release_sock);
>  
> +/**
> + * lock_sock_fast - fast version of lock_sock
> + * @sk: socket
> + *
> + * This version should be used for very small section, where process wont block
> + * return false if fast path is taken
> + *   sk_lock.slock locked, owned = 0, BH disabled
> + * return true if slow path is taken
> + *   sk_lock.slock unlocked, owned = 1, BH enabled
> + */
> +bool lock_sock_fast(struct sock *sk)
> +{
> +	might_sleep();
> +	spin_lock_bh(&sk->sk_lock.slock);
> +
> +	if (!sk->sk_lock.owned)
> +		/*
> +		 * Note : We must disable BH or risk a deadlock
> +		 */
> +		return false;
> +
> +	__lock_sock(sk);
> +	sk->sk_lock.owned = 1;
> +	spin_unlock(&sk->sk_lock.slock);
> +	/*
> +	 * The sk_lock has mutex_lock() semantics here:
> +	 */
> +	mutex_acquire(&sk->sk_lock.dep_map, 0, 0, _RET_IP_);

Oops, I missed a local_bh_enable(); here

> +	return true;
> +}
> +EXPORT_SYMBOL(lock_sock_fast);
> +
>  int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
>  {
>  	struct timeval tv;

Here is v2 of patch with local_bh_enable(); added in lock_sock_fast()

[PATCH v2] net: fix lock_sock_bh/unlock_sock_bh

This new sock lock primitive was introduced to speedup some user context
socket manipulation. But it is unsafe to protect two threads, one using
regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh

This patch changes lock_sock_bh to be careful against 'owned' state.
If owned is found to be set, we must take the slow path.
lock_sock_bh() now returns a boolean to say if the slow path was taken,
and this boolean is used at unlock_sock_bh time to call the appropriate
unlock function.

After this change, BH are either disabled or enabled during the
lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
so we rename these functions to lock_sock_fast()/unlock_sock_fast().

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h  |   20 ++++++++++++++------
 net/core/datagram.c |    6 ++++--
 net/core/sock.c     |   33 +++++++++++++++++++++++++++++++++
 net/ipv4/udp.c      |   14 ++++++++------
 net/ipv6/udp.c      |    5 +++--
 5 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index d2a71b0..ca241ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1026,15 +1026,23 @@ extern void release_sock(struct sock *sk);
 				SINGLE_DEPTH_NESTING)
 #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
 
-static inline void lock_sock_bh(struct sock *sk)
+extern bool lock_sock_fast(struct sock *sk);
+/**
+ * unlock_sock_fast - complement of lock_sock_fast
+ * @sk: socket
+ * @slow: slow mode
+ *
+ * fast unlock socket for user context.
+ * If slow mode is on, we call regular release_sock()
+ */
+static inline void unlock_sock_fast(struct sock *sk, bool slow)
 {
-	spin_lock_bh(&sk->sk_lock.slock);
+	if (slow)
+		release_sock(sk);
+	else
+		spin_unlock_bh(&sk->sk_lock.slock);
 }
 
-static inline void unlock_sock_bh(struct sock *sk)
-{
-	spin_unlock_bh(&sk->sk_lock.slock);
-}
 
 extern struct sock		*sk_alloc(struct net *net, int family,
 					  gfp_t priority,
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e009753..f5b6f43 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -229,15 +229,17 @@ EXPORT_SYMBOL(skb_free_datagram);
 
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
+	bool slow;
+
 	if (likely(atomic_read(&skb->users) == 1))
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))
 		return;
 
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	skb_orphan(skb);
 	sk_mem_reclaim_partial(sk);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	/* skb is now orphaned, can be freed outside of locked section */
 	__kfree_skb(skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index 37fe9b6..2cf7f9f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2007,6 +2007,39 @@ void release_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(release_sock);
 
+/**
+ * lock_sock_fast - fast version of lock_sock
+ * @sk: socket
+ *
+ * This version should be used for very small section, where process wont block
+ * return false if fast path is taken
+ *   sk_lock.slock locked, owned = 0, BH disabled
+ * return true if slow path is taken
+ *   sk_lock.slock unlocked, owned = 1, BH enabled
+ */
+bool lock_sock_fast(struct sock *sk)
+{
+	might_sleep();
+	spin_lock_bh(&sk->sk_lock.slock);
+
+	if (!sk->sk_lock.owned)
+		/*
+		 * Note : We must disable BH
+		 */
+		return false;
+
+	__lock_sock(sk);
+	sk->sk_lock.owned = 1;
+	spin_unlock(&sk->sk_lock.slock);
+	/*
+	 * The sk_lock has mutex_lock() semantics here:
+	 */
+	mutex_acquire(&sk->sk_lock.dep_map, 0, 0, _RET_IP_);
+	local_bh_enable();
+	return true;
+}
+EXPORT_SYMBOL(lock_sock_fast);
+
 int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
 {
 	struct timeval tv;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9de6a69..b9d0d40 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1063,10 +1063,11 @@ static unsigned int first_packet_length(struct sock *sk)
 	spin_unlock_bh(&rcvq->lock);
 
 	if (!skb_queue_empty(&list_kill)) {
-		lock_sock_bh(sk);
+		bool slow = lock_sock_fast(sk);
+
 		__skb_queue_purge(&list_kill);
 		sk_mem_reclaim_partial(sk);
-		unlock_sock_bh(sk);
+		unlock_sock_fast(sk, slow);
 	}
 	return res;
 }
@@ -1123,6 +1124,7 @@ int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int peeked;
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
+	bool slow;
 
 	/*
 	 *	Check any passed addresses
@@ -1197,10 +1199,10 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	if (!skb_kill_datagram(sk, skb, flags))
 		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	if (noblock)
 		return -EAGAIN;
@@ -1625,9 +1627,9 @@ int udp_rcv(struct sk_buff *skb)
 
 void udp_destroy_sock(struct sock *sk)
 {
-	lock_sock_bh(sk);
+	bool slow = lock_sock_fast(sk);
 	udp_flush_pending_frames(sk);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 }
 
 /*
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3d7a2c0..87be586 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -328,6 +328,7 @@ int udpv6_recvmsg(struct kiocb *iocb, struct sock *sk,
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	int is_udp4;
+	bool slow;
 
 	if (addr_len)
 		*addr_len=sizeof(struct sockaddr_in6);
@@ -424,7 +425,7 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	if (!skb_kill_datagram(sk, skb, flags)) {
 		if (is_udp4)
 			UDP_INC_STATS_USER(sock_net(sk),
@@ -433,7 +434,7 @@ csum_copy_err:
 			UDP6_INC_STATS_USER(sock_net(sk),
 					UDP_MIB_INERRORS, is_udplite);
 	}
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	if (flags & MSG_DONTWAIT)
 		return -EAGAIN;



^ permalink raw reply related

* [PATCH] net: fix lock_sock_bh/unlock_sock_bh
From: Eric Dumazet @ 2010-05-27  5:06 UTC (permalink / raw)
  To: David Miller; +Cc: anton, netdev
In-Reply-To: <20100526.212105.212688828.davem@davemloft.net>

Le mercredi 26 mai 2010 à 21:21 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 27 May 2010 06:18:46 +0200
> 
> > Process context :
> > 
> > lock only (sk_lock.slock = locked, sk_lock.owned = ???)
> > 
> > So I should add a test on owned. If set (by another thread), we should take the slow path.
> 
> Indeed, what you're doing now is broken.
> 
> If owned is non-zero when you take the spinlock you have to queue your
> work, just like how we toss packets into the socket backlog, which is
> processed by release_sock(), when this happens.

Here is the patch I am going to test today.

Anton, could you please test it too, just for sure ?

Thanks !


[PATCH] net: fix lock_sock_bh/unlock_sock_bh

This new sock lock primitive was introduced to speedup some user context
socket manipulation. But it is unsafe to protect two threads, one using
regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh

This patch changes lock_sock_bh to be careful against 'owned' state.
If owned is found to be set, we must take the slow path.
lock_sock_bh() now returns a boolean to say if the slow path was taken,
and this boolean is used at unlock_sock_bh time to call the appropriate
unlock function.

After this change, BH are either disabled or enabled during the
lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
so we rename these functions to lock_sock_fast()/unlock_sock_fast().

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h  |   20 ++++++++++++++------
 net/core/datagram.c |    6 ++++--
 net/core/sock.c     |   32 ++++++++++++++++++++++++++++++++
 net/ipv4/udp.c      |   14 ++++++++------
 net/ipv6/udp.c      |    5 +++--
 5 files changed, 61 insertions(+), 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index d2a71b0..ca241ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1026,15 +1026,23 @@ extern void release_sock(struct sock *sk);
 				SINGLE_DEPTH_NESTING)
 #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
 
-static inline void lock_sock_bh(struct sock *sk)
+extern bool lock_sock_fast(struct sock *sk);
+/**
+ * unlock_sock_fast - complement of lock_sock_fast
+ * @sk: socket
+ * @slow: slow mode
+ *
+ * fast unlock socket for user context.
+ * If slow mode is on, we call regular release_sock()
+ */
+static inline void unlock_sock_fast(struct sock *sk, bool slow)
 {
-	spin_lock_bh(&sk->sk_lock.slock);
+	if (slow)
+		release_sock(sk);
+	else
+		spin_unlock_bh(&sk->sk_lock.slock);
 }
 
-static inline void unlock_sock_bh(struct sock *sk)
-{
-	spin_unlock_bh(&sk->sk_lock.slock);
-}
 
 extern struct sock		*sk_alloc(struct net *net, int family,
 					  gfp_t priority,
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e009753..f5b6f43 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -229,15 +229,17 @@ EXPORT_SYMBOL(skb_free_datagram);
 
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
+	bool slow;
+
 	if (likely(atomic_read(&skb->users) == 1))
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))
 		return;
 
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	skb_orphan(skb);
 	sk_mem_reclaim_partial(sk);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	/* skb is now orphaned, can be freed outside of locked section */
 	__kfree_skb(skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index 37fe9b6..7ab6398 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2007,6 +2007,38 @@ void release_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(release_sock);
 
+/**
+ * lock_sock_fast - fast version of lock_sock
+ * @sk: socket
+ *
+ * This version should be used for very small section, where process wont block
+ * return false if fast path is taken
+ *   sk_lock.slock locked, owned = 0, BH disabled
+ * return true if slow path is taken
+ *   sk_lock.slock unlocked, owned = 1, BH enabled
+ */
+bool lock_sock_fast(struct sock *sk)
+{
+	might_sleep();
+	spin_lock_bh(&sk->sk_lock.slock);
+
+	if (!sk->sk_lock.owned)
+		/*
+		 * Note : We must disable BH or risk a deadlock
+		 */
+		return false;
+
+	__lock_sock(sk);
+	sk->sk_lock.owned = 1;
+	spin_unlock(&sk->sk_lock.slock);
+	/*
+	 * The sk_lock has mutex_lock() semantics here:
+	 */
+	mutex_acquire(&sk->sk_lock.dep_map, 0, 0, _RET_IP_);
+	return true;
+}
+EXPORT_SYMBOL(lock_sock_fast);
+
 int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
 {
 	struct timeval tv;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9de6a69..b9d0d40 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1063,10 +1063,11 @@ static unsigned int first_packet_length(struct sock *sk)
 	spin_unlock_bh(&rcvq->lock);
 
 	if (!skb_queue_empty(&list_kill)) {
-		lock_sock_bh(sk);
+		bool slow = lock_sock_fast(sk);
+
 		__skb_queue_purge(&list_kill);
 		sk_mem_reclaim_partial(sk);
-		unlock_sock_bh(sk);
+		unlock_sock_fast(sk, slow);
 	}
 	return res;
 }
@@ -1123,6 +1124,7 @@ int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int peeked;
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
+	bool slow;
 
 	/*
 	 *	Check any passed addresses
@@ -1197,10 +1199,10 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	if (!skb_kill_datagram(sk, skb, flags))
 		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	if (noblock)
 		return -EAGAIN;
@@ -1625,9 +1627,9 @@ int udp_rcv(struct sk_buff *skb)
 
 void udp_destroy_sock(struct sock *sk)
 {
-	lock_sock_bh(sk);
+	bool slow = lock_sock_fast(sk);
 	udp_flush_pending_frames(sk);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 }
 
 /*
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3d7a2c0..87be586 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -328,6 +328,7 @@ int udpv6_recvmsg(struct kiocb *iocb, struct sock *sk,
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	int is_udp4;
+	bool slow;
 
 	if (addr_len)
 		*addr_len=sizeof(struct sockaddr_in6);
@@ -424,7 +425,7 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	if (!skb_kill_datagram(sk, skb, flags)) {
 		if (is_udp4)
 			UDP_INC_STATS_USER(sock_net(sk),
@@ -433,7 +434,7 @@ csum_copy_err:
 			UDP6_INC_STATS_USER(sock_net(sk),
 					UDP_MIB_INERRORS, is_udplite);
 	}
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	if (flags & MSG_DONTWAIT)
 		return -EAGAIN;



^ permalink raw reply related

* Re: [PATCH 3/3] net: deliver skbs on inactive slaves to exact matches
From: John Fastabend @ 2010-05-27  4:52 UTC (permalink / raw)
  To: andy@greyhouse.net, fubar@us.ibm.com, nhorman@tuxdriver.com,
	"bonding-devel@lists.sourceforge.net" <bonding-dev
In-Reply-To: <20100513073117.3528.33132.stgit@jf-dev2-dcblab>

Fastabend, John R wrote:
> Currently, the accelerated receive path for VLAN's will
> drop packets if the real device is an inactive slave and
> is not one of the special pkts tested for in
> skb_bond_should_drop().  This behavior is different then
> the non-accelerated path and for pkts over a bonded vlan.
> 
> For example,
> 
> vlanx -> bond0 -> ethx
> 
> will be dropped in the vlan path and not delivered to any
> packet handlers at all.  However,
> 
> bond0 -> vlanx -> ethx
> 
> and
> 
> bond0 -> ethx
> 
> will be delivered to handlers that match the exact dev,
> because the VLAN path checks the real_dev which is not a
> slave and netif_recv_skb() doesn't drop frames but only
> delivers them to exact matches.
> 
> This patch adds a sk_buff flag which is used for tagging
> skbs that would previously been dropped and allows the
> skb to continue to skb_netif_recv().  Here we add
> logic to check for the bond_should_drop flag and if it
> is set only deliver to handlers that match exactly.  This
> makes both paths above consistent and gives pkt handlers
> a way to identify skbs that come from inactive slaves.
> Without this patch in some configurations skbs will be
> delivered to handlers with exact matches and in others
> be dropped out right in the vlan path.
> 
> I have tested the following 4 configurations in failover modes
> and load balancing modes.
> 
> # bond0 -> ethx
> 
> # vlanx -> bond0 -> ethx
> 
> # bond0 -> vlanx -> ethx
> 
> # bond0 -> ethx
>             |
>   vlanx -> --
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> 

Jay,

I know you were looking over this series at one point.  Did you have any 
comments?  Sorry for the ping just wanted to keep this on my radar.

Thanks,
John

^ permalink raw reply

* Re: Warning in net/ipv4/af_inet.c:154
From: Eric Dumazet @ 2010-05-27  4:21 UTC (permalink / raw)
  To: David Miller; +Cc: anton, netdev
In-Reply-To: <20100526.210600.242135655.davem@davemloft.net>

Le mercredi 26 mai 2010 à 21:06 -0700, David Miller a écrit :
> From: Anton Blanchard <anton@samba.org>
> Date: Thu, 27 May 2010 13:56:17 +1000
> 
> > I'm somewhat confused by the two stage locking in the socket lock
> > (ie sk_lock.slock and sk_lock.owned).
> > 
> > What state should the socket lock be in for serialising updates of
> > sk_forward_alloc? In some cases we appear to update it with sk_lock.slock =
> > unlocked, sk_lock.owned = 1:
> 
> If sh_lock.owned=1 the user has grabbed exclusive sleepable lock on the
> socket, it does this via something approximating:
> 
> retry:
> 	spin_lock(&sk_lock.slock);
> 	was_locked = sk_lock.owned;
> 	if (!was_locked)
> 		sk_lock.owned = 1;
> 	spin_unlock(&sk_lock.slock);
> 	if (was_locked) {
> 		sleep_on(condition(sk_lock.owned));
> 		goto retry;
> 	}
> 
> This allows user context code to sleep with exclusive access to the
> socket.
> 
> Code that cannot sleep takes the spinlock, and queues the work if the
> owner field if non-zero.  Else, it keeps the spinlock held and does
> the work.
> 
> In either case, socket modifications are thus done completely protected
> from other contexts.
> 
> 

Yes, but not on the case one user context uses the 'slow locking', and
another uses the 'fast locking'.

I am working on a patch now.





^ permalink raw reply

* Re: Warning in net/ipv4/af_inet.c:154
From: David Miller @ 2010-05-27  4:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: anton, netdev
In-Reply-To: <1274933926.2542.26.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 27 May 2010 06:18:46 +0200

> Process context :
> 
> lock only (sk_lock.slock = locked, sk_lock.owned = ???)
> 
> So I should add a test on owned. If set (by another thread), we should take the slow path.

Indeed, what you're doing now is broken.

If owned is non-zero when you take the spinlock you have to queue your
work, just like how we toss packets into the socket backlog, which is
processed by release_sock(), when this happens.

^ permalink raw reply

* Re: Warning in net/ipv4/af_inet.c:154
From: Eric Dumazet @ 2010-05-27  4:18 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: David Miller, netdev
In-Reply-To: <20100527035617.GB28295@kryten>

Le jeudi 27 mai 2010 à 13:56 +1000, Anton Blanchard a écrit :
> Hi Eric,
> 
> > You are 100% right David, maybe we should add a test when changing
> > sk_forward_alloc to test if socket is locked (lockdep only test), but
> > that's for 2.6.36 :)
> 
> Thanks for the patch, unfortunately I can still hit the WARN_ON. I'm somewhat
> confused by the two stage locking in the socket lock (ie sk_lock.slock and
> sk_lock.owned).
> 
> What state should the socket lock be in for serialising updates of
> sk_forward_alloc? In some cases we appear to update it with sk_lock.slock =
> unlocked, sk_lock.owned = 1:
> 
> NIP [c0000000005b4ad0] .sock_queue_rcv_skb
> LR [c0000000005b4acc] .sock_queue_rcv_skb
> Call Trace:
> [c0000000005f9fcc] .ip_queue_rcv_skb
> [c00000000061d604] .__udp_queue_rcv_skb
> [c0000000005b1a38] .release_sock
> [c0000000006205f0] .udp_sendmsg
> [c0000000006290d4] .inet_sendmsg
> [c0000000005abfb4] .sock_sendmsg
> [c0000000005ae9dc] .SyS_sendto
> [c0000000005ab6c0] .SyS_send
> 
> And other times we update it with sk_lock.slock = locked, sk_lock.owned = 0:
> 
> NIP [c0000000005b2b6c] .sock_rfree
> LR [c0000000005b2b68] .sock_rfree
> Call Trace:
> [c0000000005bca10] .skb_free_datagram_locked
> [c00000000061fe88] .udp_recvmsg
> [c0000000006285e8] .inet_recvmsg
> [c0000000005abe0c] .sock_recvmsg
> [c0000000005ae358] .SyS_recvfrom
> 
> I see we sometimes take sk_lock.slock then check the owned field, but we
> aren't doing that all the time.
> 

Old rule was :

A Process context was using 
lock + test_and_set_or_sleep + unlock (sk_lock.slock = unlocked,
sk_lock.owned = 1)

softirq handler was using :
(sk_lock.slock = locked, sk_lock.owned =0)

I added a shortcut, but it seems wrong as is

Process context :

lock only (sk_lock.slock = locked, sk_lock.owned = ???)

So I should add a test on owned. If set (by another thread), we should take the slow path.

Thanks !



^ permalink raw reply

* Re: [PATCH 1/2] net: ll_temac: fix interrupt bug when interrupt 0 is used
From: John Williams @ 2010-05-27  4:12 UTC (permalink / raw)
  To: John Linn
  Cc: netdev, linuxppc-dev, grant.likely, jwboyer, michal.simek,
	Brian Hill
In-Reply-To: <6f264a7a-e5da-494a-a24d-1578ca422807@VA3EHSMHS016.ehs.local>

On Thu, May 27, 2010 at 3:29 AM, John Linn <john.linn@xilinx.com> wrote:
> The code is not checking the interrupt for DMA correctly so that an
> interrupt number of 0 will cause a false error.
>
> Signed-off-by: Brian Hill <brian.hill@xilinx.com>
> Signed-off-by: John Linn <john.linn@xilinx.com>
> ---
>  drivers/net/ll_temac_main.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/ll_temac_main.c b/drivers/net/ll_temac_main.c
> index fa7620e..0615737 100644
> --- a/drivers/net/ll_temac_main.c
> +++ b/drivers/net/ll_temac_main.c
> @@ -950,7 +950,7 @@ temac_of_probe(struct of_device *op, const struct of_device_id *match)
>
>        lp->rx_irq = irq_of_parse_and_map(np, 0);
>        lp->tx_irq = irq_of_parse_and_map(np, 1);
> -       if (!lp->rx_irq || !lp->tx_irq) {
> +       if ((lp->rx_irq == NO_IRQ) || (lp->tx_irq == NO_IRQ)) {

Personally I think this is the right thing to do.  But, I thought the
IRQ 0 == NO_IRQ (AKA "all-the-world's-an-x86-and-if-not-it-should-be")
holy war was already fought and won (or lost, depending on your
perspective)?

I seem to recall giving reluctant assent to a patch from Grant a few
months ago that touched MicroBlaze thus?

John

^ permalink raw reply

* Re: [PATCH 9/17] net/iucv: Add missing spin_unlock
From: David Miller @ 2010-05-27  4:10 UTC (permalink / raw)
  To: julia
  Cc: ursula.braun, linux390, linux-s390, netdev, linux-kernel,
	kernel-janitors
In-Reply-To: <Pine.LNX.4.64.1005261756330.23743@ask.diku.dk>

From: Julia Lawall <julia@diku.dk>
Date: Wed, 26 May 2010 17:56:48 +0200 (CEST)

> Add a spin_unlock missing on the error path.  There seems like no reason
> why the lock should continue to be held if the kzalloc fail.

Applied, thanks.

^ permalink raw reply

* Re: Warning in net/ipv4/af_inet.c:154
From: David Miller @ 2010-05-27  4:06 UTC (permalink / raw)
  To: anton; +Cc: eric.dumazet, netdev
In-Reply-To: <20100527035617.GB28295@kryten>

From: Anton Blanchard <anton@samba.org>
Date: Thu, 27 May 2010 13:56:17 +1000

> I'm somewhat confused by the two stage locking in the socket lock
> (ie sk_lock.slock and sk_lock.owned).
> 
> What state should the socket lock be in for serialising updates of
> sk_forward_alloc? In some cases we appear to update it with sk_lock.slock =
> unlocked, sk_lock.owned = 1:

If sh_lock.owned=1 the user has grabbed exclusive sleepable lock on the
socket, it does this via something approximating:

retry:
	spin_lock(&sk_lock.slock);
	was_locked = sk_lock.owned;
	if (!was_locked)
		sk_lock.owned = 1;
	spin_unlock(&sk_lock.slock);
	if (was_locked) {
		sleep_on(condition(sk_lock.owned));
		goto retry;
	}

This allows user context code to sleep with exclusive access to the
socket.

Code that cannot sleep takes the spinlock, and queues the work if the
owner field if non-zero.  Else, it keeps the spinlock held and does
the work.

In either case, socket modifications are thus done completely protected
from other contexts.



^ permalink raw reply

* Re: Warning in net/ipv4/af_inet.c:154
From: Anton Blanchard @ 2010-05-27  3:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1274868776.2672.96.camel@edumazet-laptop>

 
Hi Eric,

> You are 100% right David, maybe we should add a test when changing
> sk_forward_alloc to test if socket is locked (lockdep only test), but
> that's for 2.6.36 :)

Thanks for the patch, unfortunately I can still hit the WARN_ON. I'm somewhat
confused by the two stage locking in the socket lock (ie sk_lock.slock and
sk_lock.owned).

What state should the socket lock be in for serialising updates of
sk_forward_alloc? In some cases we appear to update it with sk_lock.slock =
unlocked, sk_lock.owned = 1:

NIP [c0000000005b4ad0] .sock_queue_rcv_skb
LR [c0000000005b4acc] .sock_queue_rcv_skb
Call Trace:
[c0000000005f9fcc] .ip_queue_rcv_skb
[c00000000061d604] .__udp_queue_rcv_skb
[c0000000005b1a38] .release_sock
[c0000000006205f0] .udp_sendmsg
[c0000000006290d4] .inet_sendmsg
[c0000000005abfb4] .sock_sendmsg
[c0000000005ae9dc] .SyS_sendto
[c0000000005ab6c0] .SyS_send

And other times we update it with sk_lock.slock = locked, sk_lock.owned = 0:

NIP [c0000000005b2b6c] .sock_rfree
LR [c0000000005b2b68] .sock_rfree
Call Trace:
[c0000000005bca10] .skb_free_datagram_locked
[c00000000061fe88] .udp_recvmsg
[c0000000006285e8] .inet_recvmsg
[c0000000005abe0c] .sock_recvmsg
[c0000000005ae358] .SyS_recvfrom

I see we sometimes take sk_lock.slock then check the owned field, but we
aren't doing that all the time.

Anton

^ permalink raw reply

* Re: [PATCH 2/2] net: ll_temac: fix checksum offload logic
From: David Miller @ 2010-05-27  3:45 UTC (permalink / raw)
  To: john.linn
  Cc: netdev, linuxppc-dev, grant.likely, jwboyer, john.williams,
	michal.simek, brian.hill
In-Reply-To: <0690e798-1e5e-4416-90dc-04a9df16600f@SG2EHSMHS005.ehs.local>

aFrom: John Linn <john.linn@xilinx.com>
Date: Wed, 26 May 2010 11:29:19 -0600

> The current checksum offload code does not work and this corrects
> that functionality. It also updates the interrupt coallescing
> initialization so than there are fewer interrupts and performance
> is increased.
> 
> Signed-off-by: Brian Hill <brian.hill@xilinx.com>
> Signed-off-by: John Linn <john.linn@xilinx.com>

Applied, although I changed "temac_features" to be explicitly
"unsigned int" instead of just plain "unsigned"


^ permalink raw reply

* Re: [PATCH 1/2] net: ll_temac: fix interrupt bug when interrupt 0 is used
From: David Miller @ 2010-05-27  3:44 UTC (permalink / raw)
  To: john.linn
  Cc: netdev, linuxppc-dev, grant.likely, jwboyer, john.williams,
	michal.simek, brian.hill
In-Reply-To: <6f264a7a-e5da-494a-a24d-1578ca422807@VA3EHSMHS016.ehs.local>

From: John Linn <john.linn@xilinx.com>
Date: Wed, 26 May 2010 11:29:18 -0600

> The code is not checking the interrupt for DMA correctly so that an
> interrupt number of 0 will cause a false error.
> 
> Signed-off-by: Brian Hill <brian.hill@xilinx.com>
> Signed-off-by: John Linn <john.linn@xilinx.com>

Applied.

^ permalink raw reply

* Re: [PATCH] tcp: Socket option to set congestion window
From: David Miller @ 2010-05-27  3:04 UTC (permalink / raw)
  To: hagen; +Cc: andi, therbert, shemminger, netdev, ycheng
In-Reply-To: <20100526231512.GD2684@nuttenaction>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Thu, 27 May 2010 01:15:12 +0200

> How can a domain defined as {process,peer-IP} fair to the 1MB
> bottleneck link?

You're asking about a network level issue in terms of what can be done
on a local end-node.

All an end-node can do is abide by congestion control rules and respond
to packet drops, as has been going on for decades.

People have basically (especially in Europe) given up on crazy crap
like RSVP and other forms of bandwidth limiting and reservation.  They
just oversubscribe their links, and increase their capacity as traffic
increases dictate.  It just isn't all that manageable to put people's
traffic into classes and control what they do on a large scale.

I'm also skeptical about those who say the fight belongs squarely at
the end nodes.  If you want to control the network traffic of the
meeting point of your dumbbell, you'll need a machine there doing RED
or traffic limiting.  End-host schemes simply aren't going to work
because I can just add more end-hosts to reintroduce the problem.

The dumbbell situation is independant of the end-node issues, that's
all I'm really saying.

^ permalink raw reply

* UDP Fragmentation and DF bit..
From: Vincent, Pradeep @ 2010-05-27  1:45 UTC (permalink / raw)
  To: netdev@vger.kernel.org

After running into issues with UDP in multi-MTU network environment, I
started digging through the code and found somewhat inconsistent behavior
(if I read the code right)

OMan 7 ip¹ declares that ³The  system-wide  default  is  controlled  by  the
ip_no_pmtu_disc  sysctl  for SOCK_STREAM  sockets,  and  disabled  on all
others.² which led me to think ODF¹ bit will not be set for UDP packets.
But..

The code in ip_output.c seems to do the following,

1. If packet size <= current PMTU, then set the DF=1 and send the packet
out.
2. If packet size > current PMTU, then set DF=0 and send the packet out
after fragmentation.

In a network environment where MTU-big and MTU-small co-exist (and have
router¹s fragmentation turned off in favor of PMTU discovery), UDP packets
that are > MTU-small and < MTU-big find the PMTU effectively but UDP packets
that are > MTU-big get dropped. This looks like inconsistent behavior to me
and doesn¹t seem to match the advertised behavior.

Is there a reason why PMTU support for UDP is somewhat inconsistent ?

Is there a reason ODF¹ bit cannot be set on fragmented packets on UDP
transmission ? I couldn¹t find anything in RFC for IP protocol that
prohibited DF bit on fragmented packets. Did I miss
something here ?

Would it be reasonable if PMTU discovery is performed (DF bit set +
appropriate icmp logic) even for locally fragmented packets ? I think this
will be a great help for UDP users that don¹t have PMTU handling logic in
the application (most udp applications belong to this category in my
experience). Thoughts ?

Thanks,

- Pradeep Vincent


^ permalink raw reply

* issue about virtio-net driver to suppoprt vhost mergeable buffer with zero-copy to support PS mode
From: Xin, Xiaohui @ 2010-05-27  1:21 UTC (permalink / raw)
  To: mst@redhat.com
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Herbert Xu

Michael,
I'm now looking into the vhost mergeable buffer, and I tried to use it to support PS mode with zero-copy. And I found an issue there that I have to modify the guest virito-net driver.

When guest virtio-net driver submits mergeable buffers, it submits multiple pages outside. In zero-copy case, vhost cannot know which page is used to put header, and which page is used to put payload. Then vhost can only reserves 12 bytes for each page. That means, the page_offset of the payload DMAed into the guest buffer is always 12 bytes. But guest virtio-net driver always use offset 0 to put the data (See receive_mergeable()). That's where the zero-copy use mergeable buffer must modify.

Have I missed something here? And how do you think about it?

Thanks 
Xiaohui 

^ permalink raw reply

* [PATCH] fs_enet: Adjust BDs after tx error
From: Mark Ware @ 2010-05-27  1:12 UTC (permalink / raw)
  To: Linuxppc-dev Development, netdev; +Cc: Vitaly Bordug

This patch fixes an occasional transmit lockup in the mac-fcc which
occurs after a tx error.  The test scenario had the local port set
to autoneg and the other end fixed at 100FD, resulting in a large
number of late collisions.

According to the MPC8280RM 30.10.1.3 (also 8272RM 29.10.1.3), after
a tx error occurs, TBPTR may sometimes point beyond BDs still marked
as ready.  This patch walks back through the BDs and points TBPTR to
the earliest one marked as ready.

Tested on a custom board with a MPC8280.

Signed-off-by: Mark Ware <mware@elphinstone.net>
---
 drivers/net/fs_enet/mac-fcc.c |   49 ++++++++++++++++++++++++++++++++++++-----
 1 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/net/fs_enet/mac-fcc.c b/drivers/net/fs_enet/mac-fcc.c
index 22e5a84..6023f15 100644
--- a/drivers/net/fs_enet/mac-fcc.c
+++ b/drivers/net/fs_enet/mac-fcc.c
@@ -503,17 +503,54 @@ static int get_regs_len(struct net_device *dev)
 }
 
 /* Some transmit errors cause the transmitter to shut
- * down.  We now issue a restart transmit.  Since the
- * errors close the BD and update the pointers, the restart
- * _should_ pick up without having to reset any of our
- * pointers either.  Also, To workaround 8260 device erratum
- * CPM37, we must disable and then re-enable the transmitter
- * following a Late Collision, Underrun, or Retry Limit error.
+ * down.  We now issue a restart transmit.
+ * Also, to workaround 8260 device erratum CPM37, we must
+ * disable and then re-enable the transmitterfollowing a
+ * Late Collision, Underrun, or Retry Limit error.
+ * In addition, tbptr may point beyond BDs beyond still marked
+ * as ready due to internal pipelining, so we need to look back
+ * through the BDs and adjust tbptr to point to the last BD
+ * marked as ready.  This may result in some buffers being
+ * retransmitted.
  */
 static void tx_restart(struct net_device *dev)
 {
 	struct fs_enet_private *fep = netdev_priv(dev);
 	fcc_t __iomem *fccp = fep->fcc.fccp;
+	const struct fs_platform_info *fpi = fep->fpi;
+	fcc_enet_t __iomem *ep = fep->fcc.ep;
+	cbd_t __iomem *curr_tbptr;
+	cbd_t __iomem *recheck_bd;
+	cbd_t __iomem *prev_bd;
+	cbd_t __iomem *last_tx_bd;
+
+	last_tx_bd = fep->tx_bd_base + (fpi->tx_ring * sizeof(cbd_t));
+
+	/* get the current bd held in TBPTR  and scan back from this point */
+	recheck_bd = curr_tbptr = (cbd_t __iomem *)
+		((R32(ep, fen_genfcc.fcc_tbptr) - fep->ring_mem_addr) +
+		fep->ring_base);
+
+	prev_bd = (recheck_bd == fep->tx_bd_base) ? last_tx_bd : recheck_bd - 1;
+
+	/* Move through the bds in reverse, look for the earliest buffer
+	 * that is not ready.  Adjust TBPTR to the following buffer */
+	while ((CBDR_SC(prev_bd) & BD_ENET_TX_READY) != 0) {
+		/* Go back one buffer */
+		recheck_bd = prev_bd;
+
+		/* update the previous buffer */
+		prev_bd = (prev_bd == fep->tx_bd_base) ? last_tx_bd : prev_bd - 1;
+
+		/* We should never see all bds marked as ready, check anyway */
+		if (recheck_bd == curr_tbptr)
+			break;
+	}
+	/* Now update the TBPTR and dirty flag to the current buffer */
+	W32(ep, fen_genfcc.fcc_tbptr,
+		(uint) (((void *)recheck_bd - fep->ring_base) +
+		fep->ring_mem_addr));
+	fep->dirty_tx = recheck_bd;
 
 	C32(fccp, fcc_gfmr, FCC_GFMR_ENT);
 	udelay(10);
-- 
1.5.6.5

^ permalink raw reply related

* Re: [REGRESSION,BISECTED] MIPv6 support broken by f4f914b58019f0
From: Brian Haley @ 2010-05-27  0:48 UTC (permalink / raw)
  To: Arnaud Ebalard
  Cc: David Miller,
	YOSHIFUJI Hideaki / 吉藤英明, Jiri Olsa,
	Scott Otto, netdev
In-Reply-To: <87zkzmppfg.fsf@small.ssi.corp>

On 05/26/2010 01:01 PM, Arnaud Ebalard wrote:
> Hi,
> 
> I just updated my laptop's kernel to 2.6.34 (previously running .33 and
> configured to act as an IPsec/IKE-protected MIPv6 Mobile Node using
> racoon and umip): after rebooting on the new kernel, the transport mode
> SA protecting MIPv6 signaling traffic are missing.
> 
> I bisected the issue down to f4f914b58019f0e50d521bbbadfaee260d766f95
> (net: ipv6 bind to device issue) which was added after 2.6.34-rc5: 
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index c2438e8..05ebd78 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>  {
>         int flags = 0;
>  
> -       if (rt6_need_strict(&fl->fl6_dst))
> +       if (fl->oif || rt6_need_strict(&fl->fl6_dst))
>                 flags |= RT6_LOOKUP_F_IFACE;

Can you see if fl->oif is at least a sane value here?  Maybe there's some
partially un-initialized flowi getting passed-in, a quick source code check
didn't find anything obvious.

The other thought is that it's the tunnel code calling it, as it's going
to set 'oif' (actually it caches a whole flowi) from the tunnel parms ifindex/link
value.  It could have been setting it forever, but ip6_route_output() just
never enforced it until now.

My $.02.

-Brian

^ permalink raw reply

* Re: [PATCH] tcp: Socket option to set congestion window
From: Hagen Paul Pfeifer @ 2010-05-26 23:15 UTC (permalink / raw)
  To: David Miller; +Cc: andi, therbert, shemminger, netdev, ycheng
In-Reply-To: <20100526.151014.70204145.davem@davemloft.net>

* David Miller | 2010-05-26 15:10:14 [-0700]:

>From: Andi Kleen <andi@firstfloor.org>
>Date: Wed, 26 May 2010 23:27:45 +0200
>
>> As I understand the idea was that the application knows
>> what flows belong to a single peer and wants to have
>> a single cwnd for all of those. Perhaps there would
>> be a way to generalize that to tell it to the kernel.
>> 
>> e.g. have a "peer id"  that is known by applications
>> and the kernel could manage cwnds shared between connections
>> associated with the same peer id?
>> 
>> Just an idea, I admit I haven't thought very deeply
>> about this. Feel free to poke holes into it.
>
>Yes, a CWND "domain" that can include multiple sockets is
>something that might gain some traction.
>
>The "domain" could just simply be the tuple {process,peer-IP}

This discussion - as once a month - is about fairness. But if we define a
domain as a tuple of {process,peer-IP} the fairness is applied only for the
last link before "peer-IP".

But fairness applies to *all* links in between! For example: consider a
dumpbell scenario:


+------+                                   +------+ 
|      |                                   |      |  
|  H1  |                                   |  H3  | 
|      |                                   |      |  
+------+                                   +------+  
  10MB  \   +------+            +------+  / 10MB
         \  |      |   1MB/s    |      | / 
          > |  R1  |------------|  R2  |<    
         /  |      |            |      | \      
  10MB  /   +------+            +------+  \ 10MB 
+------+                                   +------+  
|      |                                   |      |        
|  H2  |                                   |  H4  | 
|      |                                   |      | 
+------+                                   +------+


How can a domain defined as {process,peer-IP} fair to the 1MB bottleneck link?
It is not fair! And it is also not fair to open n simultaneous streams and so
on. This problem is discussed in several RFC's.

.02


Best regards, Hagen


-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox