Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Vladislav Yasevich @ 2011-05-18 12:33 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: Jacek Luczak, Eric Dumazet, netdev
In-Reply-To: <4DD38B30.9090601@cn.fujitsu.com>

On 05/18/2011 05:02 AM, Wei Yongjun wrote:

> fix the race between sctp_bind_addr_free() and sctp_bind_addr_conflict(), maybe you just
> need to remove the socket from the port hash before empty the bind address list.
> some thing like this, not test.
> 
> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
> index c8cc24e..924d846 100644
> --- a/net/sctp/endpointola.c
> +++ b/net/sctp/endpointola.c
> @@ -268,12 +268,13 @@ static void sctp_endpoint_destroy(struct sctp_endpoint *ep)
>  
>  	/* Cleanup. */
>  	sctp_inq_free(&ep->base.inqueue);
> -	sctp_bind_addr_free(&ep->base.bind_addr);
>  
>  	/* Remove and free the port */
>  	if (sctp_sk(ep->base.sk)->bind_hash)
>  		sctp_put_port(ep->base.sk);
>  
> +	sctp_bind_addr_free(&ep->base.bind_addr);
> +
>  	/* Give up our hold on the sock. */
>  	if (ep->base.sk)
>  		sock_put(ep->base.sk);
> 
> 

I am not sure that this will guarantee avoidance of this crash, even though it is the right
thing to do in general.

We simply make the race condition much smaller and much harder to hit.  Potentially, one
cpu may be doing lookup of the socket while the other is doing the destroy.  If the lookup cpu
finds the socket just as this code removes the socket from the hash, we still have this potential
race condition.

I agree with Eric, rcu_read_lock() is not strictly necessary, as what we are really after is call_rcu()
based destruction.  We need to delay memory destruction for the rcu grace period.

Thinking a little more about how bind_addr_clean() is used, it would probably benefit from getting
converted to call_rcu().  That function is used as local clean-up in case of malloc failure; however,
that doesn't preclude a potential race.  The fact that we do not have this race simply points out that
we don't have any kind of parallel lookup that can hit it (and the code confirms it).  This doesn't
mean that we wouldn't have it in the future.

-vlad

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18 12:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1305707358.2983.14.camel@edumazet-laptop>

2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
> Le mercredi 18 mai 2011 à 10:06 +0200, Jacek Luczak a écrit :
>> 2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
>
>> > If you're removing items from this list, you must be a writer here, with
>> > exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.
>>
>> I could agree to some extend ... but strict RCU section IMO is needed here.
>> I can check this if the issue exists.
>>
>
> I can tell you for sure rcu_read_lock() is not needed here. Its only
> showing confusion from code's author.
>
> Please read Documentation/RCU/listRCU.txt for concise explanations,
> line 117.
>
>
>> > Therefore, I guess following code is better :
>> >
>> > list_for_each_entry(addr, &bp->address_list, list) {
>> >        list_del_rcu(&addr->list);
>> >        call_rcu(&addr->rcu, sctp_local_addr_free);
>> >        SCTP_DBG_OBJCNT_DEC(addr);
>> > }
>> >
>> > Then, why dont you fix sctp_bind_addr_clean() instead ?
>> >
>> > if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
>> > be protected as well.
>>
>> The _clean() as claimed by Vlad is called many times from various places
>> in code and this could give a overhead. I guess Vlad would need to comment.
>
> I guess a full review of this code is needed. You'll have to prove
> sctp_bind_addr_clean() is always called after one RCU grace period if
> you want to leave it as is.

Eric this is actually good point which I think did not found at first
glance. As the original
issue occurs between _clean() and _conflict() then if the former is
used here and
there we can hit the grace period not only in that case. By that then
_clean() should
be ,,fixed''. Right?

-Jacek

^ permalink raw reply

* Re: [PATCH 2/2] bna: Add Generic Netlink Interface
From: David Lamparter @ 2011-05-18 12:00 UTC (permalink / raw)
  To: Rasesh Mody; +Cc: davem, netdev, adapter_linux_open_src_team, Debashis Dutt
In-Reply-To: <1305694621-28023-3-git-send-email-rmody@brocade.com>

On Tue, May 17, 2011 at 09:57:01PM -0700, Rasesh Mody wrote:
> This patch adds the generic netlink communication interface to BNA driver. The
> in-kernel generic netlink infrastructure can be used to collect hardware
> specific control information and control attributes. The driver makes use of
> the "doit" handler provided by the generic netlink layer to accomplish this.
[...]
> +struct bnad_genl_ioc_info {
> +	int		status;
> +	u16		bnad_num;
> +	u16		rsvd;
> +	char		serialnum[64];
> +	char		hwpath[BFA_STRING_32];
> +	char		adapter_hwpath[BFA_STRING_32];
> +	char		guid[BFA_ADAPTER_SYM_NAME_LEN*2];
> +	char		name[BFA_ADAPTER_SYM_NAME_LEN];
> +	char		port_name[BFA_ADAPTER_SYM_NAME_LEN];
> +	char		eth_name[BFA_ADAPTER_SYM_NAME_LEN];
> +	u64		rsvd1[4];
> +	mac_t		mac;
> +	mac_t		factory_mac;	/* Factory mac address */
> +	mac_t		current_mac;	/* Currently assigned mac address */
> +	enum bfa_ioc_type     ioc_type;
> +	u16		pvid;		/* Port vlan id */
> +	u16		rsvd2;
> +	u32		host;
> +	u32		bandwidth;	/* For PF support */
> +	u32		rsvd3;
> +};
> +
> +struct bnad_genl_ioc_attr {
> +	int		status;
> +	u16		bnad_num;
> +	u16		rsvd;
> +	struct bfa_ioc_attr  ioc_attr;
> +};

These things all look like they're better put into sysfs. Why would you
create a genl protocol just to query some presumably static attributes?


-David


^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18 11:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Wei Yongjun, netdev, Vlad Yasevich
In-Reply-To: <1305718902.2991.3.camel@edumazet-laptop>

2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
> Le mercredi 18 mai 2011 à 13:01 +0200, Jacek Luczak a écrit :
>
>> @Eric, if you will take a look into the code you might notice
>> that there are few places where list operations could be
>> optimised and the main question here is do we really care
>> to have the data ,,safe'' so that things like that won't popup.
>> The good example can be the set of _local_ functions.

There should be a vertical space here ...

>> Ahhh... and I'm aware of how tricky can be abuse of RCU.
>
> I took a quick look at existing rcu_read_lock() uses in sctp and did not
> find other problems [or optimizations if you prefer this point of view].
> Please elaborate.

... so that the last line does not have any connection with what I've
wrote above.

I get your point Eric so you don't really have to prove it - if you still need.

-jacek

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 11:56 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <BANLkTim9RQ94MT4L0QHx-yzZT5nXdbi0aQ@mail.gmail.com>

On Wed, May 18, 2011 at 01:47:33PM +0200, Michał Mirosław wrote:
> W dniu 18 maja 2011 13:17 użytkownik Michael S. Tsirkin
> <mst@redhat.com> napisał:
> > On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> >> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> >> > > Hello Michael,
> >> >> > >
> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> >> > to
> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> >> > device?
> >> >> >
> >> >> > Is there any other restriction besides requiring driver to not recycle
> >> >> > the skb? Are there any drivers that recycle TX skbs?
> >> > Not just recycling skbs, keeping reference to any of the pages in the
> >> > skb. Another requirement is to invoke the callback
> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
> >> > that happens (skbs are only freed when some other packet is
> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
> >> > scenarious, right?
> >> Hmm. But every hardware driver supporting SG will keep reference to
> >> the pages until the packet is sent (or DMA'd to the device). This can
> >> take a long time if hardware queue happens to stall for some reason.
> > That's a fundamental property of zero copy transmit.
> > You can't let the application/guest reuse the memory until
> > no one looks at it anymore.
> 
> One more question: is userspace (or whatever is sending those packets)
> denied from modifying passed pages? I assume it is, but just want to
> be sure.
> 
> Best Regards,
> Michał Mirosław

Good point.

It's not denied in the sense that it still can modify them if it's
buggy (the pages might not be read-only).
But well-behaved userspace won't modify them until the callback
is invoked.

That would be a problem if the underlying device is
a bridge where we might try to e.g. filter these packets -
data can get modified after the filter. We'd have to copy
whatever the filter accesses and use the copy - it's rarely
the data itself.

That's not normally a problem for macvtap connected to a physical NIC,
as that already bypasses any and all filtering.

But that's another limitation we should note in the comment,
and another reason to limit to specific devices.

-- 
MST

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 11:47 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <BANLkTi=g_PFfS5653K8ZoQ2Jhp8DhCV1hg@mail.gmail.com>

On Wed, May 18, 2011 at 01:40:29PM +0200, Michał Mirosław wrote:
> W dniu 18 maja 2011 13:17 użytkownik Michael S. Tsirkin
> <mst@redhat.com> napisał:
> > On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> >> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> >> > > Hello Michael,
> >> >> > >
> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> >> > to
> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> >> > device?
> >> >> >
> >> >> > Is there any other restriction besides requiring driver to not recycle
> >> >> > the skb? Are there any drivers that recycle TX skbs?
> >> >
> >> > Not just recycling skbs, keeping reference to any of the pages in the
> >> > skb. Another requirement is to invoke the callback
> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
> >> > that happens (skbs are only freed when some other packet is
> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
> >> > scenarious, right?
> >>
> >> Hmm. But every hardware driver supporting SG will keep reference to
> >> the pages until the packet is sent (or DMA'd to the device). This can
> >> take a long time if hardware queue happens to stall for some reason.
> >
> > That's a fundamental property of zero copy transmit.
> > You can't let the application/guest reuse the memory until
> > no one looks at it anymore.
> >
> >> Is it that you mean keeping a reference after all skbs pointing to the
> >> pages are released?
> > No one should reference the pages after the callback is invoked, yes.
> 
> >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> >> >> OK to me from code review.
> >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> >> > references to pages. How is that ok? What do I miss?
> >> It's making copy of the skb_shinfo earlier, so the pages refcount
> >> stays the same.
> > Exactly. But the callback is invoked so the guest thinks it's ok to
> > change this memory. If it does a corrupted packet will be sent out.
> 
> Hmm. I tool a quick look at skb_clone(), and it looks like this
> sequence will break this scheme:
> 
> skb2 = skb_clone(skb...);
> kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> [use skb2, pages still referenced]
> kfree_skb(skb); /* callback called again */
> 
> This sequence is common in bridge, might be in other places.
> 
> Maybe this ubuf thing should just track clones? This will make it work
> on all devices then.
> 
> Best Regards,
> Michał Mirosław

Long term that's a good plan, but it's a lot of work.
pages can also get into weird places like VFS
or devices might hang on to them for a long time.

So I think as a first step, using a flag to white-list
simple devices that don't do any tricks like the
above makes sense.

Just be sure to list all of the restrictions in
the comment where the flag is described.

And hey, we get features extended to 64 bit as a bonus :)

-- 
MST

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michał Mirosław @ 2011-05-18 11:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <20110518111734.GO7589@redhat.com>

W dniu 18 maja 2011 13:17 użytkownik Michael S. Tsirkin
<mst@redhat.com> napisał:
> On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
>> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
>> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
>> >> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
>> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> > > Hello Michael,
>> >> > >
>> >> > > Looks like to use a new flag requires more time/work. I am thinking
>> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> >> > to
>> >> > > avoid the new flag for now since mavctap uses real NICs as lower
>> >> > device?
>> >> >
>> >> > Is there any other restriction besides requiring driver to not recycle
>> >> > the skb? Are there any drivers that recycle TX skbs?
>> > Not just recycling skbs, keeping reference to any of the pages in the
>> > skb. Another requirement is to invoke the callback
>> > in a timely fashion.  For example virtio-net doesn't limit the time until
>> > that happens (skbs are only freed when some other packet is
>> > transmitted), so we need to avoid zcopy for such (nested-virt)
>> > scenarious, right?
>> Hmm. But every hardware driver supporting SG will keep reference to
>> the pages until the packet is sent (or DMA'd to the device). This can
>> take a long time if hardware queue happens to stall for some reason.
> That's a fundamental property of zero copy transmit.
> You can't let the application/guest reuse the memory until
> no one looks at it anymore.

One more question: is userspace (or whatever is sending those packets)
denied from modifying passed pages? I assume it is, but just want to
be sure.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Eric Dumazet @ 2011-05-18 11:44 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <1305713111.2991.1.camel@edumazet-laptop>

Le mercredi 18 mai 2011 à 12:05 +0200, Eric Dumazet a écrit :
> Le mercredi 18 mai 2011 à 12:53 +0300, Denys Fedoryshchenko a écrit :
> 
> >  Yes, i will try. I should enable SLUB debugging only, or also anything 
> >  else?
> > 
> >  But possible it will take time to reproduce bug, seems it is occuring 
> >  rare. With 2.6.39 release i will rollout update to few hundreds PPPoE's, 
> >  maybe it will increase chances to get information.
> > 
> 
> I would try both : slub_debug=ZFP slub_nomerge
> 
> or maybe only slub_debug=ZFPU
> 
> Thanks !
> 

Hmm, do you have both ipv6/ipv4 trafic on your machine by any chance ?





^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-18 11:41 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: Wei Yongjun, netdev, Vlad Yasevich
In-Reply-To: <BANLkTikVxfxU2uW3AA--q8qt16og=HdDLg@mail.gmail.com>

Le mercredi 18 mai 2011 à 13:01 +0200, Jacek Luczak a écrit :

> @Eric, if you will take a look into the code you might notice
> that there are few places where list operations could be
> optimised and the main question here is do we really care
> to have the data ,,safe'' so that things like that won't popup.
> The good example can be the set of _local_ functions.
> Ahhh... and I'm aware of how tricky can be abuse of RCU.

I took a quick look at existing rcu_read_lock() uses in sctp and did not
find other problems [or optimizations if you prefer this point of view].
Please elaborate.




^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michał Mirosław @ 2011-05-18 11:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <20110518111734.GO7589@redhat.com>

W dniu 18 maja 2011 13:17 użytkownik Michael S. Tsirkin
<mst@redhat.com> napisał:
> On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
>> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
>> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
>> >> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
>> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> > > Hello Michael,
>> >> > >
>> >> > > Looks like to use a new flag requires more time/work. I am thinking
>> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> >> > to
>> >> > > avoid the new flag for now since mavctap uses real NICs as lower
>> >> > device?
>> >> >
>> >> > Is there any other restriction besides requiring driver to not recycle
>> >> > the skb? Are there any drivers that recycle TX skbs?
>> >
>> > Not just recycling skbs, keeping reference to any of the pages in the
>> > skb. Another requirement is to invoke the callback
>> > in a timely fashion.  For example virtio-net doesn't limit the time until
>> > that happens (skbs are only freed when some other packet is
>> > transmitted), so we need to avoid zcopy for such (nested-virt)
>> > scenarious, right?
>>
>> Hmm. But every hardware driver supporting SG will keep reference to
>> the pages until the packet is sent (or DMA'd to the device). This can
>> take a long time if hardware queue happens to stall for some reason.
>
> That's a fundamental property of zero copy transmit.
> You can't let the application/guest reuse the memory until
> no one looks at it anymore.
>
>> Is it that you mean keeping a reference after all skbs pointing to the
>> pages are released?
> No one should reference the pages after the callback is invoked, yes.

>> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
>> >> OK to me from code review.
>> > Hmm. pskb_expand_head calls skb_release_data while keeping
>> > references to pages. How is that ok? What do I miss?
>> It's making copy of the skb_shinfo earlier, so the pages refcount
>> stays the same.
> Exactly. But the callback is invoked so the guest thinks it's ok to
> change this memory. If it does a corrupted packet will be sent out.

Hmm. I tool a quick look at skb_clone(), and it looks like this
sequence will break this scheme:

skb2 = skb_clone(skb...);
kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
[use skb2, pages still referenced]
kfree_skb(skb); /* callback called again */

This sequence is common in bridge, might be in other places.

Maybe this ubuf thing should just track clones? This will make it work
on all devices then.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 11:17 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <BANLkTimc1RbC3uQsra1HUvS_Trg63iMWGA@mail.gmail.com>

On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> > > Hello Michael,
> >> > >
> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> > to
> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> > device?
> >> >
> >> > Is there any other restriction besides requiring driver to not recycle
> >> > the skb? Are there any drivers that recycle TX skbs?
> >
> > Not just recycling skbs, keeping reference to any of the pages in the
> > skb. Another requirement is to invoke the callback
> > in a timely fashion.  For example virtio-net doesn't limit the time until
> > that happens (skbs are only freed when some other packet is
> > transmitted), so we need to avoid zcopy for such (nested-virt)
> > scenarious, right?
> 
> Hmm. But every hardware driver supporting SG will keep reference to
> the pages until the packet is sent (or DMA'd to the device). This can
> take a long time if hardware queue happens to stall for some reason.

That's a fundamental property of zero copy transmit.
You can't let the application/guest reuse the memory until
no one looks at it anymore.

> Is it that you mean keeping a reference after all skbs pointing to the
> pages are released?

No one should reference the pages after the callback is invoked, yes.

> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> >> OK to me from code review.
> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > references to pages. How is that ok? What do I miss?
> 
> It's making copy of the skb_shinfo earlier, so the pages refcount
> stays the same.
> 
> Best Regards,
> Michał Mirosław

Exactly. But the callback is invoked so the guest thinks it's ok to
change this memory. If it does a corrupted packet will be sent out.

-- 
MST

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michał Mirosław @ 2011-05-18 11:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <20110518103819.GL7589@redhat.com>

2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
>> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
>> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> > > Hello Michael,
>> > >
>> > > Looks like to use a new flag requires more time/work. I am thinking
>> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> > to
>> > > avoid the new flag for now since mavctap uses real NICs as lower
>> > device?
>> >
>> > Is there any other restriction besides requiring driver to not recycle
>> > the skb? Are there any drivers that recycle TX skbs?
>
> Not just recycling skbs, keeping reference to any of the pages in the
> skb. Another requirement is to invoke the callback
> in a timely fashion.  For example virtio-net doesn't limit the time until
> that happens (skbs are only freed when some other packet is
> transmitted), so we need to avoid zcopy for such (nested-virt)
> scenarious, right?

Hmm. But every hardware driver supporting SG will keep reference to
the pages until the packet is sent (or DMA'd to the device). This can
take a long time if hardware queue happens to stall for some reason.

Is it that you mean keeping a reference after all skbs pointing to the
pages are released?

>> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
>> OK to me from code review.
> Hmm. pskb_expand_head calls skb_release_data while keeping
> references to pages. How is that ok? What do I miss?

It's making copy of the skb_shinfo earlier, so the pages refcount
stays the same.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18 11:01 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: Eric Dumazet, netdev, Vlad Yasevich
In-Reply-To: <4DD38B30.9090601@cn.fujitsu.com>

2011/5/18 Wei Yongjun <yjwei@cn.fujitsu.com>:
>
>> Le mercredi 18 mai 2011 à 10:06 +0200, Jacek Luczak a écrit :
>>> 2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
>>>> If you're removing items from this list, you must be a writer here, with
>>>> exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.
>>> I could agree to some extend ... but strict RCU section IMO is needed here.
>>> I can check this if the issue exists.
>>>
>> I can tell you for sure rcu_read_lock() is not needed here. Its only
>> showing confusion from code's author.
>>
>> Please read Documentation/RCU/listRCU.txt for concise explanations,
>> line 117.
>>
>>
>>>> Therefore, I guess following code is better :
>>>>
>>>> list_for_each_entry(addr, &bp->address_list, list) {
>>>>        list_del_rcu(&addr->list);
>>>>        call_rcu(&addr->rcu, sctp_local_addr_free);
>>>>        SCTP_DBG_OBJCNT_DEC(addr);
>>>> }
>>>>
>>>> Then, why dont you fix sctp_bind_addr_clean() instead ?
>>>>
>>>> if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
>>>> be protected as well.
>>> The _clean() as claimed by Vlad is called many times from various places
>>> in code and this could give a overhead. I guess Vlad would need to comment.
>> I guess a full review of this code is needed. You'll have to prove
>> sctp_bind_addr_clean() is always called after one RCU grace period if
>> you want to leave it as is.
>>
>> You cant get RCU for free, some rules must be followed or you risk
>> crashes.
>>
>
> fix the race between sctp_bind_addr_free() and sctp_bind_addr_conflict(), maybe you just
> need to remove the socket from the port hash before empty the bind address list.
> some thing like this, not test.
>
> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
> index c8cc24e..924d846 100644
> --- a/net/sctp/endpointola.c
> +++ b/net/sctp/endpointola.c
> @@ -268,12 +268,13 @@ static void sctp_endpoint_destroy(struct sctp_endpoint *ep)
>
>        /* Cleanup. */
>        sctp_inq_free(&ep->base.inqueue);
> -       sctp_bind_addr_free(&ep->base.bind_addr);
>
>        /* Remove and free the port */
>        if (sctp_sk(ep->base.sk)->bind_hash)
>                sctp_put_port(ep->base.sk);
>
> +       sctp_bind_addr_free(&ep->base.bind_addr);
> +
>        /* Give up our hold on the sock. */
>        if (ep->base.sk)
>                sock_put(ep->base.sk);
>

Done tests with that one and looks like it survive the flood.

How then this will be handled is up to you. Both ways fix
the issue which makes me happy as damn hell.

@Eric, if you will take a look into the code you might notice
that there are few places where list operations could be
optimised and the main question here is do we really care
to have the data ,,safe'' so that things like that won't popup.
The good example can be the set of _local_ functions.
Ahhh... and I'm aware of how tricky can be abuse of RCU.

-Jacek

^ permalink raw reply

* [PATCH net-next] bnx2x: add support for retrieving dcb peer configuration
From: Shmulik Ravid @ 2011-05-18 12:55 UTC (permalink / raw)
  To: davem; +Cc: Eilon Greenstein, netdev

This patch adds support to the bnx2x for retrieving dcb peer (remote)
configuration from the embedded DCBX stack.

Signed-off-by: Shmulik Ravid <shmulikr@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x.h     |    4 +
 drivers/net/bnx2x/bnx2x_dcb.c |  152 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 131 insertions(+), 25 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x.h b/drivers/net/bnx2x/bnx2x.h
index 16a76f0..668a578 100644
--- a/drivers/net/bnx2x/bnx2x.h
+++ b/drivers/net/bnx2x/bnx2x.h
@@ -1242,6 +1242,10 @@ struct bnx2x {
 	/* DCBX Negotiation results */
 	struct dcbx_features			dcbx_local_feat;
 	u32					dcbx_error;
+#ifdef BCM_DCBNL
+	struct dcbx_features			dcbx_remote_feat;
+	u32					dcbx_remote_flags;
+#endif
 	u32					pending_max;
 };
 
diff --git a/drivers/net/bnx2x/bnx2x_dcb.c b/drivers/net/bnx2x/bnx2x_dcb.c
index 0f83092..410a49e 100644
--- a/drivers/net/bnx2x/bnx2x_dcb.c
+++ b/drivers/net/bnx2x/bnx2x_dcb.c
@@ -485,6 +485,36 @@ static void bnx2x_dcbx_update_ets_params(struct bnx2x *bp)
 	}
 }
 
+#ifdef BCM_DCBNL
+static int bnx2x_dcbx_read_shmem_remote_mib(struct bnx2x *bp)
+{
+	struct lldp_remote_mib remote_mib = {0};
+	u32 dcbx_remote_mib_offset = SHMEM2_RD(bp, dcbx_remote_mib_offset);
+	int rc;
+
+	DP(NETIF_MSG_LINK, "dcbx_remote_mib_offset 0x%x\n",
+	   dcbx_remote_mib_offset);
+
+	if (SHMEM_DCBX_REMOTE_MIB_NONE == dcbx_remote_mib_offset) {
+		BNX2X_ERR("FW doesn't support dcbx_remote_mib_offset\n");
+		return -EINVAL;
+	}
+
+	rc = bnx2x_dcbx_read_mib(bp, (u32 *)&remote_mib, dcbx_remote_mib_offset,
+				 DCBX_READ_REMOTE_MIB);
+
+	if (rc) {
+		BNX2X_ERR("Faild to read remote mib from FW\n");
+		return rc;
+	}
+
+	/* save features and flags */
+	bp->dcbx_remote_feat = remote_mib.features;
+	bp->dcbx_remote_flags = remote_mib.flags;
+	return 0;
+}
+#endif
+
 static int bnx2x_dcbx_read_shmem_neg_results(struct bnx2x *bp)
 {
 	struct lldp_local_mib local_mib = {0};
@@ -601,6 +631,10 @@ void bnx2x_dcbx_set_params(struct bnx2x *bp, u32 state)
 			 * negotiation results
 			 */
 			bnx2x_dcbnl_update_applist(bp, true);
+
+			/* Read rmeote mib if dcbx is in the FW */
+			if (bnx2x_dcbx_read_shmem_remote_mib(bp))
+				return;
 #endif
 			/* Read neg results if dcbx is in the FW */
 			if (bnx2x_dcbx_read_shmem_neg_results(bp))
@@ -2031,7 +2065,6 @@ static u8 bnx2x_dcbnl_set_dcbx(struct net_device *netdev, u8 state)
 	return 0;
 }
 
-
 static u8 bnx2x_dcbnl_get_featcfg(struct net_device *netdev, int featid,
 				  u8 *flags)
 {
@@ -2111,31 +2144,100 @@ static u8 bnx2x_dcbnl_set_featcfg(struct net_device *netdev, int featid,
 	return rval;
 }
 
+static int bnx2x_peer_appinfo(struct net_device *netdev,
+			      struct dcb_peer_app_info *info, u16* app_count)
+{
+	int i;
+	struct bnx2x *bp = netdev_priv(netdev);
+
+	DP(NETIF_MSG_LINK, "APP-INFO\n");
+
+	info->willing = (bp->dcbx_remote_flags & DCBX_APP_REM_WILLING) ?: 0;
+	info->error = (bp->dcbx_remote_flags & DCBX_APP_RX_ERROR) ?: 0;
+	*app_count = 0;
+
+	for (i = 0; i < DCBX_MAX_APP_PROTOCOL; i++)
+		if (bp->dcbx_remote_feat.app.app_pri_tbl[i].appBitfield &
+		    DCBX_APP_ENTRY_VALID)
+			(*app_count)++;
+	return 0;
+}
+
+static int bnx2x_peer_apptable(struct net_device *netdev,
+			       struct dcb_app *table)
+{
+	int i, j;
+	struct bnx2x *bp = netdev_priv(netdev);
+
+	DP(NETIF_MSG_LINK, "APP-TABLE\n");
+
+	for (i = 0, j = 0; i < DCBX_MAX_APP_PROTOCOL; i++) {
+		struct dcbx_app_priority_entry *ent =
+			&bp->dcbx_remote_feat.app.app_pri_tbl[i];
+
+		if (ent->appBitfield & DCBX_APP_ENTRY_VALID) {
+			table[j].selector = bnx2x_dcbx_dcbnl_app_idtype(ent);
+			table[j].priority = bnx2x_dcbx_dcbnl_app_up(ent);
+			table[j++].protocol = ent->app_id;
+		}
+	}
+	return 0;
+}
+
+static int bnx2x_cee_peer_getpg(struct net_device *netdev, struct cee_pg *pg)
+{
+	int i;
+	struct bnx2x *bp = netdev_priv(netdev);
+
+	pg->willing = (bp->dcbx_remote_flags & DCBX_ETS_REM_WILLING) ?: 0;
+
+	for (i = 0; i < CEE_DCBX_MAX_PGS; i++) {
+		pg->pg_bw[i] =
+			DCBX_PG_BW_GET(bp->dcbx_remote_feat.ets.pg_bw_tbl, i);
+		pg->prio_pg[i] =
+			DCBX_PRI_PG_GET(bp->dcbx_remote_feat.ets.pri_pg_tbl, i);
+	}
+	return 0;
+}
+
+static int bnx2x_cee_peer_getpfc(struct net_device *netdev,
+				 struct cee_pfc *pfc)
+{
+	struct bnx2x *bp = netdev_priv(netdev);
+	pfc->tcs_supported = bp->dcbx_remote_feat.pfc.pfc_caps;
+	pfc->pfc_en = bp->dcbx_remote_feat.pfc.pri_en_bitmap;
+	return 0;
+}
+
 const struct dcbnl_rtnl_ops bnx2x_dcbnl_ops = {
-	.getstate       = bnx2x_dcbnl_get_state,
-	.setstate       = bnx2x_dcbnl_set_state,
-	.getpermhwaddr  = bnx2x_dcbnl_get_perm_hw_addr,
-	.setpgtccfgtx   = bnx2x_dcbnl_set_pg_tccfg_tx,
-	.setpgbwgcfgtx  = bnx2x_dcbnl_set_pg_bwgcfg_tx,
-	.setpgtccfgrx   = bnx2x_dcbnl_set_pg_tccfg_rx,
-	.setpgbwgcfgrx  = bnx2x_dcbnl_set_pg_bwgcfg_rx,
-	.getpgtccfgtx   = bnx2x_dcbnl_get_pg_tccfg_tx,
-	.getpgbwgcfgtx  = bnx2x_dcbnl_get_pg_bwgcfg_tx,
-	.getpgtccfgrx   = bnx2x_dcbnl_get_pg_tccfg_rx,
-	.getpgbwgcfgrx  = bnx2x_dcbnl_get_pg_bwgcfg_rx,
-	.setpfccfg      = bnx2x_dcbnl_set_pfc_cfg,
-	.getpfccfg      = bnx2x_dcbnl_get_pfc_cfg,
-	.setall         = bnx2x_dcbnl_set_all,
-	.getcap         = bnx2x_dcbnl_get_cap,
-	.getnumtcs      = bnx2x_dcbnl_get_numtcs,
-	.setnumtcs      = bnx2x_dcbnl_set_numtcs,
-	.getpfcstate    = bnx2x_dcbnl_get_pfc_state,
-	.setpfcstate    = bnx2x_dcbnl_set_pfc_state,
-	.setapp         = bnx2x_dcbnl_set_app_up,
-	.getdcbx        = bnx2x_dcbnl_get_dcbx,
-	.setdcbx        = bnx2x_dcbnl_set_dcbx,
-	.getfeatcfg     = bnx2x_dcbnl_get_featcfg,
-	.setfeatcfg     = bnx2x_dcbnl_set_featcfg,
+	.getstate		= bnx2x_dcbnl_get_state,
+	.setstate		= bnx2x_dcbnl_set_state,
+	.getpermhwaddr		= bnx2x_dcbnl_get_perm_hw_addr,
+	.setpgtccfgtx		= bnx2x_dcbnl_set_pg_tccfg_tx,
+	.setpgbwgcfgtx		= bnx2x_dcbnl_set_pg_bwgcfg_tx,
+	.setpgtccfgrx		= bnx2x_dcbnl_set_pg_tccfg_rx,
+	.setpgbwgcfgrx		= bnx2x_dcbnl_set_pg_bwgcfg_rx,
+	.getpgtccfgtx		= bnx2x_dcbnl_get_pg_tccfg_tx,
+	.getpgbwgcfgtx		= bnx2x_dcbnl_get_pg_bwgcfg_tx,
+	.getpgtccfgrx		= bnx2x_dcbnl_get_pg_tccfg_rx,
+	.getpgbwgcfgrx		= bnx2x_dcbnl_get_pg_bwgcfg_rx,
+	.setpfccfg		= bnx2x_dcbnl_set_pfc_cfg,
+	.getpfccfg		= bnx2x_dcbnl_get_pfc_cfg,
+	.setall			= bnx2x_dcbnl_set_all,
+	.getcap			= bnx2x_dcbnl_get_cap,
+	.getnumtcs		= bnx2x_dcbnl_get_numtcs,
+	.setnumtcs		= bnx2x_dcbnl_set_numtcs,
+	.getpfcstate		= bnx2x_dcbnl_get_pfc_state,
+	.setpfcstate		= bnx2x_dcbnl_set_pfc_state,
+	.setapp			= bnx2x_dcbnl_set_app_up,
+	.getdcbx		= bnx2x_dcbnl_get_dcbx,
+	.setdcbx		= bnx2x_dcbnl_set_dcbx,
+	.getfeatcfg		= bnx2x_dcbnl_get_featcfg,
+	.setfeatcfg		= bnx2x_dcbnl_set_featcfg,
+	.peer_getappinfo	= bnx2x_peer_appinfo,
+	.peer_getapptable	= bnx2x_peer_apptable,
+	.cee_peer_getpg		= bnx2x_cee_peer_getpg,
+	.cee_peer_getpfc	= bnx2x_cee_peer_getpfc,
 };
 
 #endif /* BCM_DCBNL */
-- 
1.7.3.5





^ permalink raw reply related

* Re: [Patch net-next-2.6] netpoll: disable netpoll when enslave a device
From: Neil Horman @ 2011-05-18 10:56 UTC (permalink / raw)
  To: Amerigo Wang
  Cc: linux-kernel, Neil Horman, Jay Vosburgh, Andy Gospodarek,
	David S. Miller, Alexey Dobriyan, Ferenc Wagner, Andrew Morton,
	Paul E. McKenney, Josh Triplett, Ian Campbell, netdev
In-Reply-To: <1305712845-11762-1-git-send-email-amwang@redhat.com>

On Wed, May 18, 2011 at 06:00:35PM +0800, Amerigo Wang wrote:
> Currently we do nothing when we enslave a net device which is running netconsole.
> Neil pointed out that we may get weird results in such case, so let's disable
> netpoll on the device being enslaved. I think it is too harsh to prevent
> the device being ensalved if it is running netconsole.
> 
> By the way, this patch also removes the NETDEV_GOING_DOWN from netconsole
> netdev notifier, because netpoll will check if the device is running or not
> and we don't handle NETDEV_PRE_UP neither.
> 
> Signed-off-by: WANG Cong <amwang@redhat.com>
> Cc: Neil Horman <nhorman@redhat.com>
> 
> ---
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 088fd84..b9c70c5 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1640,6 +1640,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>  		}
>  	}
>  
> +	netdev_bonding_change(slave_dev, NETDEV_ENSLAVE);
> +
>  	/* If this is the first slave, then we need to set the master's hardware
>  	 * address to be the same as the slave's. */
>  	if (is_zero_ether_addr(bond->dev->dev_addr))
> diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
> index a83e101..0c3e8de 100644
> --- a/drivers/net/netconsole.c
> +++ b/drivers/net/netconsole.c
> @@ -621,7 +621,8 @@ static int netconsole_netdev_event(struct notifier_block *this,
>  	bool stopped = false;
>  
>  	if (!(event == NETDEV_CHANGENAME || event == NETDEV_UNREGISTER ||
> -	      event == NETDEV_BONDING_DESLAVE || event == NETDEV_GOING_DOWN))
> +	      event == NETDEV_BONDING_DESLAVE || event == NETDEV_GOING_DOWN ||
> +	      event == NETDEV_ENSLAVE))
>  		goto done;
>  
>  	spin_lock_irqsave(&target_list_lock, flags);
> @@ -650,8 +651,8 @@ restart:
>  					goto restart;
>  				}
>  				/* Fall through */
> -			case NETDEV_GOING_DOWN:
>  			case NETDEV_BONDING_DESLAVE:
> +			case NETDEV_ENSLAVE:
>  				nt->enabled = 0;
>  				stopped = true;
>  				break;
This wasn't introduced by this patch, but looking at it made me realize that
nt->enabled, if it passes through this code path, doesn't properly track weather
or not netpoll_setup has been called on this interface.  If you look at
drop_netconsole_target, you'll see we only call netpoll_cleanup_target if
nt->enabled is set.  We should probably change the nt->enabled check there, and
in store_enabled to be if (nt->np.dev), like we do in the NETDEV_UNREGISTER case
in netconsole_netdev_event.

> @@ -660,10 +661,21 @@ restart:
>  		netconsole_target_put(nt);
>  	}
>  	spin_unlock_irqrestore(&target_list_lock, flags);
> -	if (stopped && (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE))
> +	if (stopped) {
>  		printk(KERN_INFO "netconsole: network logging stopped on "
> -			"interface %s as it %s\n",  dev->name,
> -			event == NETDEV_UNREGISTER ? "unregistered" : "released slaves");
> +		       "interface %s as it ", dev->name);
> +		switch (event) {
> +		case NETDEV_UNREGISTER:
> +			printk(KERN_CONT "unregistered\n");
> +			break;
> +		case NETDEV_BONDING_DESLAVE:
> +			printk(KERN_CONT "released slaves\n");
> +			break;
> +		case NETDEV_ENSLAVE:
> +			printk(KERN_CONT "is enslaved\n");
> +			break;
> +		}
> +	}
>  
>  done:
>  	return NOTIFY_DONE;
> diff --git a/include/linux/notifier.h b/include/linux/notifier.h
> index 621dfa1..3d82867 100644
> --- a/include/linux/notifier.h
> +++ b/include/linux/notifier.h
> @@ -211,6 +211,7 @@ static inline int notifier_to_errno(int ret)
>  #define NETDEV_UNREGISTER_BATCH 0x0011
>  #define NETDEV_BONDING_DESLAVE  0x0012
>  #define NETDEV_NOTIFY_PEERS	0x0013
> +#define NETDEV_ENSLAVE		0x0014
>  
Nit:
Shouldn't this be NETDEV_BONDING_ENSLAVE, to keep it in line with
NETDEV_BONDING_DESLAVE above?

>  #define SYS_DOWN	0x0001	/* Notify of system down */
>  #define SYS_RESTART	SYS_DOWN
> 


Other than those two points, this looks good to me
Thanks!
Neil

^ permalink raw reply

* [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Benoit Sigoure @ 2011-05-18 10:43 UTC (permalink / raw)
  To: davem, kuznet, pekkas, jmorris, yoshfuji, kaber
  Cc: netdev, linux-kernel, Benoit Sigoure
In-Reply-To: <1305619677.2850.11.camel@edumazet-laptop>

Instead of hardcoding the initial RTO to 3s and requiring
the kernel to be recompiled to change it, expose it as a
sysctl that can be tuned at runtime.  Leave the default
value unchanged.

Signed-off-by: Benoit Sigoure <tsunanet@gmail.com>
---

v2 of the patch to address Eric's comments.  Of course I had to forget
to convert things back and forth between jiffies and ms -- /me n00b.
Code compiles.  It seems like no one is opposed to this change, but if
one of you guys could express explicit interest in merging this change,
I'd be happy to spend a bit more time to test it.

The new sysctl is exposed in milliseconds but internally the value remains
in jiffies to avoid having to convert back / and forth between jiffies and
ms in most places.

I'm glad to hear that the default value will be tuned down to 1s.  This
change will help people play with this value and easily revert it back at
runtime if they feel like they preferred the current value.

Thank you for your time.

 Documentation/networking/ip-sysctl.txt |    8 ++++++++
 include/net/tcp.h                      |    3 ++-
 net/ipv4/syncookies.c                  |    2 +-
 net/ipv4/sysctl_net_ipv4.c             |   11 +++++++++++
 net/ipv4/tcp.c                         |    4 ++--
 net/ipv4/tcp_input.c                   |    8 ++++----
 net/ipv4/tcp_ipv4.c                    |    6 +++---
 net/ipv4/tcp_minisocks.c               |    6 +++---
 net/ipv4/tcp_output.c                  |    2 +-
 net/ipv4/tcp_timer.c                   |    9 +++++----
 net/ipv6/syncookies.c                  |    2 +-
 net/ipv6/tcp_ipv6.c                    |    6 +++---
 12 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d3d653a..7f3c7d2 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -384,6 +384,14 @@ tcp_retries2 - INTEGER
 	RFC 1122 recommends at least 100 seconds for the timeout,
 	which corresponds to a value of at least 8.
 
+tcp_initial_rto - INTEGER
+	This value sets the initial retransmit timeout (in milliseconds),
+	that is how long the kernel will wait before retransmitting the
+	initial SYN packet.
+
+	RFC 1122 says that this SHOULD be 3000 milliseconds, which is the
+	default.
+
 tcp_rfc1337 - BOOLEAN
 	If set, the TCP stack behaves conforming to RFC1337. If unset,
 	we are not conforming to RFC, but prevent TCP TIME_WAIT
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..d6d7dea 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -213,6 +213,7 @@ extern int sysctl_tcp_syn_retries;
 extern int sysctl_tcp_synack_retries;
 extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
+extern int sysctl_tcp_initial_rto;  /* in jiffies */
 extern int sysctl_tcp_orphan_retries;
 extern int sysctl_tcp_syncookies;
 extern int sysctl_tcp_retrans_collapse;
@@ -295,7 +296,7 @@ static inline void tcp_synq_overflow(struct sock *sk)
 static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
 {
 	unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;
-	return time_after(jiffies, last_overflow + TCP_TIMEOUT_INIT);
+	return time_after(jiffies, last_overflow + sysctl_tcp_initial_rto);
 }
 
 extern struct proto tcp_prot;
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8b44c6d..b035968 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -186,7 +186,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
  * sysctl_tcp_retries1. It's a rather complicated formula (exponential
  * backoff) to compute at runtime so it's currently hardcoded here.
  */
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto/HZ + 1)
 /*
  * Check if a ack sequence number is a valid syncookie.
  * Return the decoded mss if it is, or 0 if not.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 321e6e8..51c778d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -30,6 +30,8 @@ static int tcp_adv_win_scale_min = -31;
 static int tcp_adv_win_scale_max = 31;
 static int ip_ttl_min = 1;
 static int ip_ttl_max = 255;
+static int tcp_initial_rto_min = TCP_RTO_MIN;
+static int tcp_initial_rto_max = TCP_RTO_MAX;
 
 /* Update system visible IP port range */
 static void set_local_port_range(int range[2])
@@ -246,6 +248,15 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+        {
+		.procname       = "tcp_initial_rto",
+		.data           = &sysctl_tcp_initial_rto,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler	= proc_dointvec_ms_jiffies,
+		.extra1		= &tcp_initial_rto_min,
+		.extra2		= &tcp_initial_rto_max,
+	},
 	{
 		.procname	= "tcp_fin_timeout",
 		.data		= &sysctl_tcp_fin_timeout,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b22d450..e9e7c3f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2352,7 +2352,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	case TCP_DEFER_ACCEPT:
 		/* Translate value in seconds to number of retransmits */
 		icsk->icsk_accept_queue.rskq_defer_accept =
-			secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+			secs_to_retrans(val, sysctl_tcp_initial_rto / HZ,
 					TCP_RTO_MAX / HZ);
 		break;
 
@@ -2539,7 +2539,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		break;
 	case TCP_DEFER_ACCEPT:
 		val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
-				      TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
+				      sysctl_tcp_initial_rto / HZ, TCP_RTO_MAX / HZ);
 		break;
 	case TCP_WINDOW_CLAMP:
 		val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..39f6c27 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -890,7 +890,7 @@ static void tcp_init_metrics(struct sock *sk)
 	if (dst_metric(dst, RTAX_RTT) == 0)
 		goto reset;
 
-	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (sysctl_tcp_initial_rto << 3))
 		goto reset;
 
 	/* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +916,7 @@ static void tcp_init_metrics(struct sock *sk)
 		tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
 	}
 	tcp_set_rto(sk);
-	if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+	if (inet_csk(sk)->icsk_rto < sysctl_tcp_initial_rto && !tp->rx_opt.saw_tstamp) {
 reset:
 		/* Play conservative. If timestamps are not
 		 * supported, TCP will fail to recalculate correct
@@ -924,8 +924,8 @@ reset:
 		 */
 		if (!tp->rx_opt.saw_tstamp && tp->srtt) {
 			tp->srtt = 0;
-			tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
-			inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+			tp->mdev = tp->mdev_max = tp->rttvar = sysctl_tcp_initial_rto;
+			inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
 		}
 	}
 	tp->snd_cwnd = tcp_init_cwnd(tp, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..21920e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1383,7 +1383,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	    want_cookie)
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+	inet_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
 	return 0;
 
 drop_and_release:
@@ -1834,8 +1834,8 @@ static int tcp_v4_init_sock(struct sock *sk)
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
 
-	icsk->icsk_rto = TCP_TIMEOUT_INIT;
-	tp->mdev = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto = sysctl_tcp_initial_rto;
+	tp->mdev = sysctl_tcp_initial_rto;
 
 	/* So many TCP implementations out there (incorrectly) count the
 	 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 80b1f80..c63ffa0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -472,8 +472,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		tcp_init_wl(newtp, treq->rcv_isn);
 
 		newtp->srtt = 0;
-		newtp->mdev = TCP_TIMEOUT_INIT;
-		newicsk->icsk_rto = TCP_TIMEOUT_INIT;
+		newtp->mdev = sysctl_tcp_initial_rto;
+		newicsk->icsk_rto = sysctl_tcp_initial_rto;
 
 		newtp->packets_out = 0;
 		newtp->retrans_out = 0;
@@ -582,7 +582,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			 * it can be estimated (approximately)
 			 * from another data.
 			 */
-			tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans);
+			tmp_opt.ts_recent_stamp = get_seconds() - ((sysctl_tcp_initial_rto/HZ)<<req->retrans);
 			paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
 		}
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 17388c7..e34b0f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2599,7 +2599,7 @@ static void tcp_connect_init(struct sock *sk)
 	tp->rcv_wup = 0;
 	tp->copied_seq = 0;
 
-	inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+	inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
 	inet_csk(sk)->icsk_retransmits = 0;
 	tcp_clear_retrans(tp);
 }
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..b9da62b 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -29,6 +29,7 @@ int sysctl_tcp_keepalive_probes __read_mostly = TCP_KEEPALIVE_PROBES;
 int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
 int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
 int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+int sysctl_tcp_initial_rto __read_mostly = TCP_TIMEOUT_INIT;
 int sysctl_tcp_orphan_retries __read_mostly;
 int sysctl_tcp_thin_linear_timeouts __read_mostly;
 
@@ -135,8 +136,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 
 /* This function calculates a "timeout" which is equivalent to the timeout of a
  * TCP connection after "boundary" unsuccessful, exponentially backed-off
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
+ * retransmissions with an initial RTO of TCP_RTO_MIN or
+ * sysctl_tcp_initial_rto if syn_set flag is set.
  */
 static bool retransmits_timed_out(struct sock *sk,
 				  unsigned int boundary,
@@ -144,7 +145,7 @@ static bool retransmits_timed_out(struct sock *sk,
 				  bool syn_set)
 {
 	unsigned int linear_backoff_thresh, start_ts;
-	unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+	unsigned int rto_base = syn_set ? sysctl_tcp_initial_rto : TCP_RTO_MIN;
 
 	if (!inet_csk(sk)->icsk_retransmits)
 		return false;
@@ -495,7 +496,7 @@ out_unlock:
 static void tcp_synack_timer(struct sock *sk)
 {
 	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
-				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+				   sysctl_tcp_initial_rto, TCP_RTO_MAX);
 }
 
 void tcp_syn_ack_timeout(struct sock *sk, struct request_sock *req)
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 352c260..f8a07a8 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -45,7 +45,7 @@ static __u16 const msstab[] = {
  * sysctl_tcp_retries1. It's a rather complicated formula (exponential
  * backoff) to compute at runtime so it's currently hardcoded here.
  */
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto/HZ + 1)
 
 static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 					   struct request_sock *req,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4f49e5d..7e791e6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1349,7 +1349,7 @@ have_isn:
 	    want_cookie)
 		goto drop_and_free;
 
-	inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+	inet6_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
 	return 0;
 
 drop_and_release:
@@ -1957,8 +1957,8 @@ static int tcp_v6_init_sock(struct sock *sk)
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
 
-	icsk->icsk_rto = TCP_TIMEOUT_INIT;
-	tp->mdev = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto = sysctl_tcp_initial_rto;
+	tp->mdev = sysctl_tcp_initial_rto;
 
 	/* So many TCP implementations out there (incorrectly) count the
 	 * initial SYN frame in their delayed-ACK and congestion control
-- 
1.7.0.4

^ permalink raw reply related

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 10:38 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1305671318.10756.49.camel@localhost.localdomain>

On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> > > Hello Michael,
> > >
> > > Looks like to use a new flag requires more time/work. I am thinking
> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> > to
> > > avoid the new flag for now since mavctap uses real NICs as lower
> > device?
> > 
> > Is there any other restriction besides requiring driver to not recycle
> > the skb? Are there any drivers that recycle TX skbs? 

Not just recycling skbs, keeping reference to any of the pages in the
skb. Another requirement is to invoke the callback
in a timely fashion.  For example virtio-net doesn't limit the time until
that happens (skbs are only freed when some other packet is
transmitted), so we need to avoid zcopy for such (nested-virt)
scenarious, right?
> 
> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> OK to me from code review.

Hmm. pskb_expand_head calls skb_release_data while keeping
references to pages. How is that ok? What do I miss?

> Currently there is no drivers recycle TX skbs.
> 
> Thanks
> Shirley

Well, with e.g. bridge or veth the skb might enter
the host networking stack. Once that happens, we lose
track of the pages. Or is there anything that
prevents such setups?

-- 
MST

^ permalink raw reply

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Eric Dumazet @ 2011-05-18 10:05 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <0a630bbf006d1210220a6ba4d55c5804@visp.net.lb>

Le mercredi 18 mai 2011 à 12:53 +0300, Denys Fedoryshchenko a écrit :

>  Yes, i will try. I should enable SLUB debugging only, or also anything 
>  else?
> 
>  But possible it will take time to reproduce bug, seems it is occuring 
>  rare. With 2.6.39 release i will rollout update to few hundreds PPPoE's, 
>  maybe it will increase chances to get information.
> 

I would try both : slub_debug=ZFP slub_nomerge

or maybe only slub_debug=ZFPU

Thanks !



^ permalink raw reply

* [Patch net-next-2.6] netpoll: disable netpoll when enslave a device
From: Amerigo Wang @ 2011-05-18 10:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: WANG Cong, Neil Horman, Jay Vosburgh, Andy Gospodarek,
	David S. Miller, Neil Horman, Alexey Dobriyan, Ferenc Wagner,
	Andrew Morton, Paul E. McKenney, Josh Triplett, Ian Campbell,
	netdev

Currently we do nothing when we enslave a net device which is running netconsole.
Neil pointed out that we may get weird results in such case, so let's disable
netpoll on the device being enslaved. I think it is too harsh to prevent
the device being ensalved if it is running netconsole.

By the way, this patch also removes the NETDEV_GOING_DOWN from netconsole
netdev notifier, because netpoll will check if the device is running or not
and we don't handle NETDEV_PRE_UP neither.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>

---
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 088fd84..b9c70c5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1640,6 +1640,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		}
 	}
 
+	netdev_bonding_change(slave_dev, NETDEV_ENSLAVE);
+
 	/* If this is the first slave, then we need to set the master's hardware
 	 * address to be the same as the slave's. */
 	if (is_zero_ether_addr(bond->dev->dev_addr))
diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index a83e101..0c3e8de 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -621,7 +621,8 @@ static int netconsole_netdev_event(struct notifier_block *this,
 	bool stopped = false;
 
 	if (!(event == NETDEV_CHANGENAME || event == NETDEV_UNREGISTER ||
-	      event == NETDEV_BONDING_DESLAVE || event == NETDEV_GOING_DOWN))
+	      event == NETDEV_BONDING_DESLAVE || event == NETDEV_GOING_DOWN ||
+	      event == NETDEV_ENSLAVE))
 		goto done;
 
 	spin_lock_irqsave(&target_list_lock, flags);
@@ -650,8 +651,8 @@ restart:
 					goto restart;
 				}
 				/* Fall through */
-			case NETDEV_GOING_DOWN:
 			case NETDEV_BONDING_DESLAVE:
+			case NETDEV_ENSLAVE:
 				nt->enabled = 0;
 				stopped = true;
 				break;
@@ -660,10 +661,21 @@ restart:
 		netconsole_target_put(nt);
 	}
 	spin_unlock_irqrestore(&target_list_lock, flags);
-	if (stopped && (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE))
+	if (stopped) {
 		printk(KERN_INFO "netconsole: network logging stopped on "
-			"interface %s as it %s\n",  dev->name,
-			event == NETDEV_UNREGISTER ? "unregistered" : "released slaves");
+		       "interface %s as it ", dev->name);
+		switch (event) {
+		case NETDEV_UNREGISTER:
+			printk(KERN_CONT "unregistered\n");
+			break;
+		case NETDEV_BONDING_DESLAVE:
+			printk(KERN_CONT "released slaves\n");
+			break;
+		case NETDEV_ENSLAVE:
+			printk(KERN_CONT "is enslaved\n");
+			break;
+		}
+	}
 
 done:
 	return NOTIFY_DONE;
diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 621dfa1..3d82867 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -211,6 +211,7 @@ static inline int notifier_to_errno(int ret)
 #define NETDEV_UNREGISTER_BATCH 0x0011
 #define NETDEV_BONDING_DESLAVE  0x0012
 #define NETDEV_NOTIFY_PEERS	0x0013
+#define NETDEV_ENSLAVE		0x0014
 
 #define SYS_DOWN	0x0001	/* Notify of system down */
 #define SYS_RESTART	SYS_DOWN

^ permalink raw reply related

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Denys Fedoryshchenko @ 2011-05-18  9:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1305711471.2983.27.camel@edumazet-laptop>

 On Wed, 18 May 2011 11:37:51 +0200, Eric Dumazet wrote:
> Le mercredi 18 mai 2011 à 12:27 +0300, Denys Fedoryshchenko a écrit :
>> On Wed, 18 May 2011 01:16:29 +0300, Denys Fedoryshchenko wrote:
>> > Just got recently. 32Bit, PPPoE NAS, shapers, firewall, NAT
>> > Kernel i mention in subject, 2.6.39-rc7-git11
>> > If required i can give more information
>> >
>> > sharanal (sorry for ugly name) is libpcap based traffic analyser,
>> > sure userspace
>> >
>>  Here is some info, i hope it will be a little useful
>>
>>  (gdb)  l *(cleanup_once + 0x49)
>>  0xc02e85cc is in cleanup_once (include/linux/list.h:88).
>>  83       * This is only for internal list manipulation where we 
>> know
>>  84       * the prev/next entries already!
>>  85       */
>>  86      static inline void __list_del(struct list_head * prev, 
>> struct
>>  list_head * next)
>>  87      {
>>  88              next->prev = prev;
>>  89              prev->next = next;
>>  90      }
>>  91
>>  92      /**
>>
>>  (gdb)  l *(inet_getpeer + 0x2ab)
>>  0xc02e8ae8 is in inet_getpeer (net/ipv4/inetpeer.c:530).
>>  525             if (base->total >= inet_peer_threshold)
>>  526                     /* Remove one less-recently-used entry. */
>>  527                     cleanup_once(0, stack);
>>  528
>>  529             return p;
>>  530     }
>>  531
>>  532     static int compute_total(void)
>>  533     {
>>  534             return v4_peers.total + v6_peers.total;
>>
>
> I really begin to think we have a bug here...
>
> In previous reports, I suggested to use slub_nomerge because I 
> thought
> one corruption from another kernel layer was going on.
>
> (inetpeer was using 64 bytes objects). But now that inetpeer objects 
> are
> bigger and sit in another kmemcache, its bad news.
>
> Could you try this, and eventually add some SLUB debugging stuff as
> well ?

 Yes, i will try. I should enable SLUB debugging only, or also anything 
 else?

 But possible it will take time to reproduce bug, seems it is occuring 
 rare. With 2.6.39 release i will rollout update to few hundreds PPPoE's, 
 maybe it will increase chances to get information.


^ permalink raw reply

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Eric Dumazet @ 2011-05-18  9:37 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <41a1892fed59b411bb08d3ecb0d8cda5@visp.net.lb>

Le mercredi 18 mai 2011 à 12:27 +0300, Denys Fedoryshchenko a écrit :
> On Wed, 18 May 2011 01:16:29 +0300, Denys Fedoryshchenko wrote:
> > Just got recently. 32Bit, PPPoE NAS, shapers, firewall, NAT
> > Kernel i mention in subject, 2.6.39-rc7-git11
> > If required i can give more information
> >
> > sharanal (sorry for ugly name) is libpcap based traffic analyser,
> > sure userspace
> >
>  Here is some info, i hope it will be a little useful
> 
>  (gdb)  l *(cleanup_once + 0x49)
>  0xc02e85cc is in cleanup_once (include/linux/list.h:88).
>  83       * This is only for internal list manipulation where we know
>  84       * the prev/next entries already!
>  85       */
>  86      static inline void __list_del(struct list_head * prev, struct 
>  list_head * next)
>  87      {
>  88              next->prev = prev;
>  89              prev->next = next;
>  90      }
>  91
>  92      /**
> 
>  (gdb)  l *(inet_getpeer + 0x2ab)
>  0xc02e8ae8 is in inet_getpeer (net/ipv4/inetpeer.c:530).
>  525             if (base->total >= inet_peer_threshold)
>  526                     /* Remove one less-recently-used entry. */
>  527                     cleanup_once(0, stack);
>  528
>  529             return p;
>  530     }
>  531
>  532     static int compute_total(void)
>  533     {
>  534             return v4_peers.total + v6_peers.total;
> 

I really begin to think we have a bug here...

In previous reports, I suggested to use slub_nomerge because I thought
one corruption from another kernel layer was going on.

(inetpeer was using 64 bytes objects). But now that inetpeer objects are
bigger and sit in another kmemcache, its bad news.

Could you try this, and eventually add some SLUB debugging stuff as
well ?




^ permalink raw reply

* [PATCH net-next-2.6] net: make sure rtnl is held in rtnl_fill_ifinfo()
From: Eric Dumazet @ 2011-05-18  9:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Commit e67f88dd12f610 (net: dont hold rtnl mutex during netlink dump
callbacks) added a problem in rtnl_fill_ifinfo() not being always called
with RTNL held, which is racy.

1) This patch extends rtnl_mutex helper functions so that :

rtnl_lock() is able to BUG_ON() if recursively called.
[This was only provided if LOCKDEP was on]

rtnl_is_locked() is able to check if current thread owns rtnl_mutex
[ASSERT_RTNL() gets this added feature too]

2) Add one ASSERT_RTNL() in rtnl_fill_ifinfo()

3) Make sure rtnl is held in rtnl_dump_ifinfo()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/rtnetlink.c |   19 ++++++++++++++++---
 1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d2ba259..4e09f70 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -59,15 +59,19 @@ struct rtnl_link {
 };
 
 static DEFINE_MUTEX(rtnl_mutex);
+static struct thread_info  *rtnl_owner;
 
 void rtnl_lock(void)
 {
+	BUG_ON(rtnl_owner == current_thread_info());
 	mutex_lock(&rtnl_mutex);
+	rtnl_owner = current_thread_info();
 }
 EXPORT_SYMBOL(rtnl_lock);
 
 void __rtnl_unlock(void)
 {
+	rtnl_owner = NULL;
 	mutex_unlock(&rtnl_mutex);
 }
 
@@ -80,13 +84,17 @@ EXPORT_SYMBOL(rtnl_unlock);
 
 int rtnl_trylock(void)
 {
-	return mutex_trylock(&rtnl_mutex);
+	int ret = mutex_trylock(&rtnl_mutex);
+
+	if (ret)
+		rtnl_owner = current_thread_info();
+	return ret;
 }
 EXPORT_SYMBOL(rtnl_trylock);
 
 int rtnl_is_locked(void)
 {
-	return mutex_is_locked(&rtnl_mutex);
+	return mutex_is_locked(&rtnl_mutex) && rtnl_owner == current_thread_info();
 }
 EXPORT_SYMBOL(rtnl_is_locked);
 
@@ -850,6 +858,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	struct nlattr *attr, *af_spec;
 	struct rtnl_af_ops *af_ops;
 
+	ASSERT_RTNL();
 	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags);
 	if (nlh == NULL)
 		return -EMSGSIZE;
@@ -1003,10 +1012,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net_device *dev;
 	struct hlist_head *head;
 	struct hlist_node *node;
+	int rtnl_was_locked = rtnl_is_locked();
 
 	s_h = cb->args[0];
 	s_idx = cb->args[1];
-
+	if (!rtnl_was_locked)
+		rtnl_lock();
 	rcu_read_lock();
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
 		idx = 0;
@@ -1025,6 +1036,8 @@ cont:
 	}
 out:
 	rcu_read_unlock();
+	if (!rtnl_was_locked)
+		rtnl_unlock();
 	cb->args[1] = idx;
 	cb->args[0] = h;
 



^ permalink raw reply related

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Denys Fedoryshchenko @ 2011-05-18  9:27 UTC (permalink / raw)
  To: netdev
In-Reply-To: <54ec5cd14e5e5c76aa06c2e6899299ce@visp.net.lb>

 On Wed, 18 May 2011 01:16:29 +0300, Denys Fedoryshchenko wrote:
> Just got recently. 32Bit, PPPoE NAS, shapers, firewall, NAT
> Kernel i mention in subject, 2.6.39-rc7-git11
> If required i can give more information
>
> sharanal (sorry for ugly name) is libpcap based traffic analyser,
> sure userspace
>
 Here is some info, i hope it will be a little useful

 (gdb)  l *(cleanup_once + 0x49)
 0xc02e85cc is in cleanup_once (include/linux/list.h:88).
 83       * This is only for internal list manipulation where we know
 84       * the prev/next entries already!
 85       */
 86      static inline void __list_del(struct list_head * prev, struct 
 list_head * next)
 87      {
 88              next->prev = prev;
 89              prev->next = next;
 90      }
 91
 92      /**

 (gdb)  l *(inet_getpeer + 0x2ab)
 0xc02e8ae8 is in inet_getpeer (net/ipv4/inetpeer.c:530).
 525             if (base->total >= inet_peer_threshold)
 526                     /* Remove one less-recently-used entry. */
 527                     cleanup_once(0, stack);
 528
 529             return p;
 530     }
 531
 532     static int compute_total(void)
 533     {
 534             return v4_peers.total + v6_peers.total;



^ permalink raw reply

* Re: packet received in a wrong rx-queue?
From: David Miller @ 2011-05-18  9:13 UTC (permalink / raw)
  To: Jon.Zhou; +Cc: e1000-devel, netdev
In-Reply-To: <4A6A2125329CFD4D8CC40C9E8ABCAB9F250D748939@MILEXCH2.ds.jdsu.net>

From: Jon Zhou <Jon.Zhou@jdsu.com>
Date: Wed, 18 May 2011 02:00:50 -0700

> There are 2 packets in the traffic
> 
> #1 create PDP context request, IPV4--UDP--GTP, src_ip=A and dst_ip=B,src_port=C,dst_port=D
> 
> #2 create PDP context response, IPV4--UDP--GTP,src_ip=B, dst_ip=A, src_port=D,dst_port=C
> 
> I suppose both of them will be received in same rx-queue but actually it doesn't

Well, for hardware flow steering, it's not expected to.

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michał Mirosław @ 2011-05-18  9:06 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Ben Hutchings, David Miller, Eric Dumazet,
	Avi Kivity, Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <1305675865.10756.63.camel@localhost.localdomain>

W dniu 18 maja 2011 01:44 użytkownik Shirley Ma <mashirle@us.ibm.com> napisał:
> On Wed, 2011-05-18 at 00:58 +0200, Michał Mirosław wrote:
>> W dniu 18 maja 2011 00:28 użytkownik Shirley Ma <mashirle@us.ibm.com>
>> napisał:
>> > On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
>> >> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> > Looks like to use a new flag requires more time/work. I am
>> thinking
>> >> > whether we can just use HIGHDMA flag to enable zero-copy in
>> macvtap
>> >> to
>> >> > avoid the new flag for now since mavctap uses real NICs as lower
>> >> device?
>> >>
>> >> Is there any other restriction besides requiring driver to not
>> recycle
>> >> the skb? Are there any drivers that recycle TX skbs?
>> > Not more other restrictions, skb clone is OK. pskb_expand_head()
>> looks
>> > OK to me from code review.
>>
>> > Currently there is no drivers recycle TX skbs.
>>
>> So why do you require the target device to have some flags at all?
> We could use macvtap to check lower device HIGHDMA to enable zero-copy,
> but I am not sure whether it is sufficient. If it's sufficient then we
> don't need to use a new flag here. To be safe, it's better to use a new
> flag to enable each device who can pass zero-copy test.

>> Do I understand correctly, that this zero-copy feature is about
>> packets received from VMs?
> Yes, packets sent from VMs, and received in local host for TX zero-copy
> here.

What is the zero-copy test? On some arches the HIGHDMA is not needed
at all so might be not enabled on anything. It looks like the correct
test would be per-packet check of !illegal_highdma() or maybe
NETIF_F_SG as returned from harmonize_features(). For virtual devices
or other software forwarding this might lead to skb_linearize() in
some cases, but is it that bad?

Best Regards,
Michał Mirosław

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox