Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Tim Chen @ 2011-09-07 21:15 UTC (permalink / raw)
  To: davem@davemloft.net
  Cc: sedat.dilek@gmail.com, Eric Dumazet, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	jirislaby@gmail.com, Shi, Alex, Valdis Kletnieks, Yan, Zheng
In-Reply-To: <1315429583.2361.3.camel@schen9-mobl>

On Wed, 2011-09-07 at 14:06 -0700, Tim Chen wrote:
> On Thu, 2011-09-08 at 08:27 +0800, Yan, Zheng wrote:
> 
> > >  	err = -EPIPE;
> > >  out_err:
> > > -	if (skb == NULL)
> > > +	if (!steal_refs)
> > >  		scm_destroy(siocb->scm);
> > 
> > I think we should call scm_release() here in the case of
> > steal_refs == true. Otherwise siocb->scm->fp may leak.
> 
> Yan Zheng,
> 
> I've updated the patch.  If it looks good to you now, can you add your
> Signed-off-by to the patch.  Pending Sedat's testing on this patch,
> I think it is good to go.
> 
> Tim

Oops, I've forgotten to add Eric's previous Signed-off-by in my latest
patch log.  David, please add that when you pick up the patch.  
Once Yan Zheng added his sign off and Sedat tested the patch, I think it
will be good to go.

Thanks.

Tim 

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Yan, Zheng @ 2011-09-08  4:18 UTC (permalink / raw)
  To: Tim Chen
  Cc: sedat.dilek@gmail.com, Eric Dumazet, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	jirislaby@gmail.com, Shi, Alex, Valdis Kletnieks
In-Reply-To: <1315429583.2361.3.camel@schen9-mobl>

On 09/08/2011 05:06 AM, Tim Chen wrote:
> On Thu, 2011-09-08 at 08:27 +0800, Yan, Zheng wrote:
> 
>>>  	err = -EPIPE;
>>>  out_err:
>>> -	if (skb == NULL)
>>> +	if (!steal_refs)
>>>  		scm_destroy(siocb->scm);
>>
>> I think we should call scm_release() here in the case of
>> steal_refs == true. Otherwise siocb->scm->fp may leak.
> 
> Yan Zheng,
> 
> I've updated the patch.  If it looks good to you now, can you add your
> Signed-off-by to the patch.  Pending Sedat's testing on this patch,
> I think it is good to go.
> 
> Tim
> 
> ---
> Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
> in Unix socket's send and receive path) introduced a use-after-free bug.
> The sent skbs from unix_stream_sendmsg could be consumed and destructed 
> by the receive side, removing all references to the credentials, 
> before the send side has finished sending out all 
> packets. However, send side could continue to consturct new packets in the 
> stream, using credentials that have lost its last reference and been
> freed.  
> 
> In this fix, we don't steal the reference to credentials we have obtained 
> in scm_send at beginning of unix_stream_sendmsg, till we've reached
> the last packet.  This fixes the problem in commit 0856a30409.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reported-by: Jiri Slaby <jirislaby@gmail.com>
> Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
> Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
> ---
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 136298c..47780dc 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1383,10 +1383,11 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>  }
>  
>  static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
> -			   bool send_fds, bool ref)
> +			   bool send_fds, bool steal_refs)
>  {
>  	int err = 0;
> -	if (ref) {
> +
> +	if (!steal_refs) {
>  		UNIXCB(skb).pid  = get_pid(scm->pid);
>  		UNIXCB(skb).cred = get_cred(scm->cred);
>  	} else {
> @@ -1458,7 +1459,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  	if (skb == NULL)
>  		goto out;
>  
> -	err = unix_scm_to_skb(siocb->scm, skb, true, false);
> +	err = unix_scm_to_skb(siocb->scm, skb, true, true);
>  	if (err < 0)
>  		goto out_free;
>  	max_level = err + 1;
> @@ -1581,6 +1582,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  	int sent = 0;
>  	struct scm_cookie tmp_scm;
>  	bool fds_sent = false;
> +	bool steal_refs = false;
>  	int max_level;
>  
>  	if (NULL == siocb->scm)
> @@ -1642,11 +1644,14 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  		size = min_t(int, size, skb_tailroom(skb));
>  
>  
> -		/* Only send the fds and no ref to pid in the first buffer */
> -		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
> +		/* Only send the fds in first buffer
> +		 * Last buffer can steal our references to pid/cred
> +		 */
> +		steal_refs = (sent + size >= len);
> +		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
>  		if (err < 0) {
>  			kfree_skb(skb);
> -			goto out;
> +			goto out_err;
>  		}
>  		max_level = err + 1;
>  		fds_sent = true;
> @@ -1654,7 +1659,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  		err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
>  		if (err) {
>  			kfree_skb(skb);
> -			goto out;
> +			goto out_err;
>  		}
>  
>  		unix_state_lock(other);
> @@ -1671,7 +1676,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  		sent += size;
>  	}
>  
> -	if (skb)
> +	if (steal_refs)
>  		scm_release(siocb->scm);
>  	else
>  		scm_destroy(siocb->scm);
> @@ -1687,9 +1692,10 @@ pipe_err:
>  		send_sig(SIGPIPE, current, 0);
>  	err = -EPIPE;
>  out_err:
> -	if (skb == NULL)
> +	if (steal_refs)
> +		scm_release(siocb->scm);
> +	else
>  		scm_destroy(siocb->scm);
> -out:
>  	siocb->scm = NULL;
>  	return sent ? : err;
>  }
> 
> 
> 

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>

^ permalink raw reply

* Re: [PATCH] per-cgroup tcp buffer limitation
From: Glauber Costa @ 2011-09-08  4:44 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, linux-mm, containers, netdev, xemul,
	David S. Miller, Hiroyouki Kamezawa, Eric W. Biederman,
	Suleiman Souhlal
In-Reply-To: <CAHH2K0aq4s1_H-yY0kA3LhM00CCNNbJZyvyBoDD6rHC+qo_gNg@mail.gmail.com>

On 09/07/2011 06:35 PM, Greg Thelen wrote:
> On Tue, Sep 6, 2011 at 3:37 PM, Glauber Costa<glommer@parallels.com>  wrote:
>> I think memcg's usage is really all you need here. In the end of the day, it
>> tells you how many pages your container has available. The whole
>> point of kmem cgroup is not any kind of reservation or accounting.
>
> The memcg does not reserve memory.  It provides upper bound limits on
> memory usage.  A careful admin can configure soft_limit_in_bytes as an
> approximation of a memory reservation.  But the soft limit is really
> more like a reclaim target when there is global memory pressure.
>
>> Once a container (or cgroup) reaches a number of objects *pinned* in memory
>> (therefore, non-reclaimable), you won't be able to grab anything from it.
>>
>>> So
>>> far my use cases involve a single memory limit which includes both
>>> kernel and user memory.  So I would need a user space agent to poll
>>> {memcg,kmem}.usage_in_bytes to apply pressure to memcg if kmem grows
>>> and visa versa.
>>
>> Maybe not.
>> If userspace memory works for you today (supposing it does), why change?
>
> Good question.  Current upstream memcg user space memory limit does
> not work for me today.  I should have made that more obvious (sorry).
> See below for details.
>
>> Right now you assign X bytes of user memory to a container, and the kernel
>> memory is shared among all of them. If this works for you, kmem_cgroup won't
>> change that. It just will impose limits over which
>> your kernel objects can't grow.
>>
>> So you don't *need* a userspace agent doing this calculation, because
>> fundamentally, nothing changed: I am not unbilling memory in memcg to bill
>> it back in kmem_cg. Of course, once it is in, you will be able to do it in
>> such a fine grained fashion if you decide to do so.
>>
>>> Do you foresee instantiation of multiple kmem cgroups, so that a
>>> process could be added into kmem/K1 or kmem/K2?  If so do you plan on
>>> supporting migration between cgroups and/or migration of kmem charge
>>> between K1 to K2?
>>
>> Yes, each container should have its own cgroup, so at least in the use
>> cases I am concerned, we will have a lot of them. But the usual lifecycle,
>> is create, execute and die. Mobility between them
>> is not something I am overly concerned right now.
>>
>>
>>>>> Do you foresee the kmem cgroup growing to include reclaimable slab,
>>>>> where freeing one type of memory allows for reclaim of the other?
>>>>
>>>> Yes, absolutely.
>
> Now I see that you're using kmem to limit the amount of unreclaimable
> kernel memory.
>
> We have a work-in-progress patch series that adds kernel memory accounting to
> memcg.  These patches allow an admin to specify a single memory limit
> for a cgroup which encompasses both user memory (as upstream memcg
> does) and also includes many kernel memory allocations (especially
> slab, page-tables).  When kernel memory grows it puts pressure on user
> memory; when user memory grows it puts pressure on reclaimable kernel
> memory using registered shrinkers.  We are in the process of cleaning
> up these memcg slab accounting patches.
 >
> In my uses cases there is a single memory limit that applies to both
> kernel and user memory.  If a separate kmem cgroup is introduced to
> manage kernel memory outside of memcg with a distinct limit, then I
> would need a user space daemon which balances memory between the kmem
> and memcg subsystems.  As kmem grows, this daemon would apply pressure
> to memcg, and as memcg grows pressure would be applied to kmem.  As
> you stated kernel memory is not necessarily reclaimable.  So such
> reclaim may fail.  My resistance to this approach is that with a
> single memory cgroup admins can do a better job packing a machine.  If
> balancing daemons are employed then more memory would need to be
> reserved and more user space cpu time would be needed to apply VM
> pressure between the types of memory.
Well, it is a way to see this. The other way to see this, is that you're
proposing to move to the kernel, something that really belongs in 
userspace. That's because:

With the information you provided me, I have no reason to believe that 
the kernel has more condition to do this work. Do the kernel have access 
to any information that userspace do not, and can't be exported? If not, 
userspace is traditionally where this sort of stuff has been done.

Using userspace CPU is no different from using kernel cpu in this 
particular case. It is all overhead, regardless where it comes from. 
Moreover, you end up setting up a policy, instead of a mechanism. What 
should be this proportion?  Do we reclaim everything with the same 
frequency? Should we be more tolerant with a specific container?

Also, If you want to allow any flexibility in this scheme, like: "Should 
this network container be able to stress the network more, pinning more 
memory, but not other subsystems?", you end up having to touch all 
individual files anyway - probably with a userspace daemon.

Also, as you noticed yourself, kernel memory is fundamentally different 
from userspace memory. You can't just set reclaim limits, since you have 
no guarantees it will work. User memory is not a scarce resource.
Kernel memory is.

>
> While there are people (like me) who want a combined memory usage
> limit there are also people (like you) who want separate user and
> kernel limiting.

Combined excludes separate. Separate does not exclude combined.

> I have toyed with the idea of having a per cgroup
> flag that determines if kernel and user memory should be combined
> charged against a single limit or if they should have separate limits.

And then every other kind of mechanism one may think of involves a 
kernel patch, instead of a potentially simple userspace change.

>   I have also wondered if there was a way to wire the usage of two
> subsystems together, then it would also meet meet my needs.  But I am
> not sure how to do that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v2 3/9] socket: initial cgroup code.
From: Glauber Costa @ 2011-09-08  4:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, xemul, netdev, linux-mm, Eric W. Biederman,
	containers, David S. Miller
In-Reply-To: <20110907221710.GA7845@shutemov.name>

On 09/07/2011 07:17 PM, Kirill A. Shutemov wrote:
> On Wed, Sep 07, 2011 at 01:23:13AM -0300, Glauber Costa wrote:
>> We aim to control the amount of kernel memory pinned at any
>> time by tcp sockets. To lay the foundations for this work,
>> this patch adds a pointer to the kmem_cgroup to the socket
>> structure.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: David S. Miller<davem@davemloft.net>
>> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Eric W. Biederman<ebiederm@xmission.com>
>> ---
>>   include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
>>   include/net/sock.h          |    2 ++
>>   net/core/sock.c             |    5 ++---
>>   3 files changed, 33 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
>> index 0e4a74b..77076d8 100644
>> --- a/include/linux/kmem_cgroup.h
>> +++ b/include/linux/kmem_cgroup.h
>> @@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
>>   	return NULL;
>>   }
>>   #endif /* CONFIG_CGROUP_KMEM */
>> +
>> +#ifdef CONFIG_INET
>
> Will it break something if you define the helpers even if CONFIG_INET
> is not defined?
> It will be much cleaner. You can reuse ifdef CONFIG_CGROUP_KMEM in this
> case.

The helpers inside CONFIG_INET are needed for the network code, 
regardless of kmem cgroup is defined or not, not the other way around.

So I could remove CONFIG_INET, but I can't possibly move it inside
CONFIG_CGROUP_KMEM. So this buy us nothing.

>> +#include<net/sock.h>
>> +static inline void sock_update_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	sk->sk_cgrp = kcg_from_task(current);
>> +
>> +	/*
>> +	 * We don't need to protect against anything task-related, because
>> +	 * we are basically stuck with the sock pointer that won't change,
>> +	 * even if the task that originated the socket changes cgroups.
>> +	 *
>> +	 * What we do have to guarantee, is that the chain leading us to
>> +	 * the top level won't change under our noses. Incrementing the
>> +	 * reference count via cgroup_exclude_rmdir guarantees that.
>> +	 */
>> +	cgroup_exclude_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>> +
>> +static inline void sock_release_kmem_cgrp(struct sock *sk)
>> +{
>> +#ifdef CONFIG_CGROUP_KMEM
>> +	cgroup_release_and_wakeup_rmdir(&sk->sk_cgrp->css);
>> +#endif
>> +}
>> +
>> +#endif /* CONFIG_INET */
>>   #endif /* _LINUX_KMEM_CGROUP_H */
>
>> @@ -2252,9 +2254,6 @@ void sk_common_release(struct sock *sk)
>>   }
>>   EXPORT_SYMBOL(sk_common_release);
>>
>> -static DEFINE_RWLOCK(proto_list_lock);
>> -static LIST_HEAD(proto_list);
>> -
>
> Wrong patch?
Yes, it is. Thanks for noticing.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
From: Roopa Prabhu @ 2011-09-08  5:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, dragos.tatulea, arnd, dwang2, benve, kaber, sri, davem,
	eric.dumazet, mchan, kvm
In-Reply-To: <20110907123435.GF9337@redhat.com>

On 9/7/11 5:34 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote:
>> This patch is an attempt at providing address filtering support for macvtap
>> devices in PASSTHRU mode. Its still a work in progress.
>> Briefly tested for basic functionality. Wanted to get some feedback on the
>> direction before proceeding.
>> 
> 
> Good work, thanks.
> 

Thanks.

>> I have hopefully CC'ed all concerned people.
> 
> kvm crowd might also be interested.
> Try using ./scripts/get_maintainer.pl as well.
> 
Thanks for the tip. Expanded CC list a bit more.

>> PASSTHRU mode today sets the lowerdev in promiscous mode. In PASSTHRU mode
>> there is a 1-1 mapping between macvtap device and physical nic or VF. And all
>> filtering is done in lowerdev hw. The lowerdev does not need to be in
>> promiscous mode as long as the guest filters are passed down to the lowerdev.
>> This patch tries to remove the need for putting the lowerdev in promiscous
>> mode. 
>> I have also referred to the thread below where TUNSETTXFILTER was mentioned
>> in 
>> this context: 
>>  http://patchwork.ozlabs.org/patch/69297/
>> 
>> This patch basically passes the addresses got by TUNSETTXFILTER to macvlan
>> lowerdev.
>> 
>> I have looked at previous work and discussions on this for qemu-kvm
>> by Michael Tsirkin, Alex Williamson and Dragos Tatulea
>> http://patchwork.ozlabs.org/patch/78595/
>> http://patchwork.ozlabs.org/patch/47160/
>> https://patchwork.kernel.org/patch/474481/
>> 
>> Redhat bugzilla by Michael Tsirkin:
>> https://bugzilla.redhat.com/show_bug.cgi?id=655013
>> 
>> I used Michael's qemu-kvm patch for testing the changes with KVM
>> 
>> I would like to cover both MAC and vlan filtering in this work.
>> 
>> Open Questions/Issues:
>> - There is a need for vlan filtering to complete the patch. It will require
>>   a new tap ioctl cmd for vlans.
>>   Some ideas on this are:
>> 
>>   a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap filter
>> (similar to tun_filter for addresses). Passing the vlan id's to lower
>> device will mean going thru the whole list of vlans every time.
>> 
>>   OR
>> 
>>   b) TUNSETVLAN with vlan id and flag to set/unset
>> 
>>   Does option 'b' sound ok ?
>> 
>> - In this implementation we make the macvlan address list same as the address
>>   list that came in the filter with TUNSETTXFILTER. This will not cover cases
>>   where the macvlan device needs to have other addresses that are not
>>   necessarily in the filter. Is this a problem ?
> 
> What cases do you have in mind?
> 
This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't
see a problem with uc/mc address list being the same in all the stacked
netdevs in the path. I called that out above to make sure I was not missing
any case in PASSTHRU mode where this might be invalid. Otherwise I don't see
a problem in the simple PASSTHRU use case this patch supports.

>> - The patch currently only supports passing of IFF_PROMISC and IFF_MULTICAST
>> filter flags to lowerdev
>> 
>> This patch series implements the following
>> 01/3 - macvlan: Add support for unicast filtering in macvlan
>> 02/3 - macvlan: Add function to set addr filter on lower device in passthru
>> mode
>> 03/3 - macvtap: Add support for TUNSETTXFILTER
>> 
>> Please comment. Thanks.
>> 
>> Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
>> Signed-off-by: Christian Benvenuti <benve@cisco.com>
>> Signed-off-by: David Wang <dwang2@cisco.com>
> 
> The security isn't lower than with promisc, so I don't see
> a problem with this as such.
> 
> There are more features we'll want down the road though,
> so let's see whether the interface will be able to
> satisfy them in a backwards compatible way before we
> set it in stone. Here's what I came up with:
> 
> How will the filtering table be partitioned within guests?

Since this patch supports macvlan PASSTHRU mode only, in which the lower
device has 1-1 mapping to the guest nic, it does not require any
partitioning of filtering table within guests. Unless I missed understanding
something. 

If the lower device were being shared by multiple guest network interfaces
(non PASSTHRU mode), only then we will need to maintain separate filter
tables for each guest network interface in macvlan and forward the pkt to
respective guest interface after a filter lookup. This could affect
performance too I think.

I chose to support PASSTHRU Mode only at first because its simpler and all
code additions are in control path only.

> 
> A way to limit what the guest can do would also be useful.
> How can this be done? selinux?

I vaguely remember a thread on the same context.. had a suggestion to
maintain pre-approved address lists and allow guest filter registration of
only those addresses for security. This seemed reasonable. Plus the ability
to support additional address registration from guest could be made
configurable (One of your ideas again from prior work).

I am not an selinux expert, but I am thinking we can use it to only allow or
disallow access or operations to the macvtap device. (?). I will check more
on this.

> 
> Any thoughts on spoofing filtering?

I can only think of checking addresses against an allowed address list.
Don't know of any other ways. Any hints ?

In any case I am assuming all the protection/security measures should be
taken at the layer calling the TUNSETTXFILTER ie..In macvtap virtualization
use case its libvirt or qemu-kvm. No ?

> 
> Would it be possible to make the filtering programmable
> using netlink, e.g. ethtool, ip, or some such?

Should be possible via ethtool or ip calling ioctl TUNSETTXFILTER. Are you
thinking of macvlan having a netlink interface to set filter and not ioctl
?. Sure. But I was thinking the point of implementing TUNSETTXFILTER was to
maintain compatibility with the generic tap interface that does the same
thing. 
And having both the netlink op and ioctl interface might not be clean ?.
Sorry if I misunderstood your question.

> That would make this useful for bridged setups besides
> macvtap/virtualization.
> 

Thanks for the comments. 

^ permalink raw reply

* Re: [PATCH v2 3/9] socket: initial cgroup code.
From: Kirill A. Shutemov @ 2011-09-08  5:35 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, netdev, linux-mm, Eric W. Biederman,
	containers, David S. Miller
In-Reply-To: <4E684A6B.6030205@parallels.com>

On Thu, Sep 08, 2011 at 01:54:03AM -0300, Glauber Costa wrote:
> On 09/07/2011 07:17 PM, Kirill A. Shutemov wrote:
> > On Wed, Sep 07, 2011 at 01:23:13AM -0300, Glauber Costa wrote:
> >> We aim to control the amount of kernel memory pinned at any
> >> time by tcp sockets. To lay the foundations for this work,
> >> this patch adds a pointer to the kmem_cgroup to the socket
> >> structure.
> >>
> >> Signed-off-by: Glauber Costa<glommer@parallels.com>
> >> CC: David S. Miller<davem@davemloft.net>
> >> CC: Hiroyouki Kamezawa<kamezawa.hiroyu@jp.fujitsu.com>
> >> CC: Eric W. Biederman<ebiederm@xmission.com>
> >> ---
> >>   include/linux/kmem_cgroup.h |   29 +++++++++++++++++++++++++++++
> >>   include/net/sock.h          |    2 ++
> >>   net/core/sock.c             |    5 ++---
> >>   3 files changed, 33 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/kmem_cgroup.h b/include/linux/kmem_cgroup.h
> >> index 0e4a74b..77076d8 100644
> >> --- a/include/linux/kmem_cgroup.h
> >> +++ b/include/linux/kmem_cgroup.h
> >> @@ -49,5 +49,34 @@ static inline struct kmem_cgroup *kcg_from_task(struct task_struct *tsk)
> >>   	return NULL;
> >>   }
> >>   #endif /* CONFIG_CGROUP_KMEM */
> >> +
> >> +#ifdef CONFIG_INET
> >
> > Will it break something if you define the helpers even if CONFIG_INET
> > is not defined?
> > It will be much cleaner. You can reuse ifdef CONFIG_CGROUP_KMEM in this
> > case.
> 
> The helpers inside CONFIG_INET are needed for the network code, 
> regardless of kmem cgroup is defined or not, not the other way around.
> 
> So I could remove CONFIG_INET, but I can't possibly move it inside
> CONFIG_CGROUP_KMEM. So this buy us nothing.

You can define empty under CONFIG_CGROUP_KMEM's #else, can't you?
Like with kcg_from_cgroup()/kcg_from_task().

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Eric Dumazet @ 2011-09-08  5:59 UTC (permalink / raw)
  To: Tim Chen
  Cc: Yan, Zheng, sedat.dilek@gmail.com, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	jirislaby@gmail.com, Shi, Alex, Valdis Kletnieks
In-Reply-To: <1315429583.2361.3.camel@schen9-mobl>

Le mercredi 07 septembre 2011 à 14:06 -0700, Tim Chen a écrit :
> On Thu, 2011-09-08 at 08:27 +0800, Yan, Zheng wrote:
> 
> > >  	err = -EPIPE;
> > >  out_err:
> > > -	if (skb == NULL)
> > > +	if (!steal_refs)
> > >  		scm_destroy(siocb->scm);
> > 
> > I think we should call scm_release() here in the case of
> > steal_refs == true. Otherwise siocb->scm->fp may leak.
> 
> Yan Zheng,
> 
> I've updated the patch.  If it looks good to you now, can you add your
> Signed-off-by to the patch.  Pending Sedat's testing on this patch,
> I think it is good to go.
> 
> Tim
> 
> ---
> Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
> in Unix socket's send and receive path) introduced a use-after-free bug.
> The sent skbs from unix_stream_sendmsg could be consumed and destructed 
> by the receive side, removing all references to the credentials, 
> before the send side has finished sending out all 
> packets. However, send side could continue to consturct new packets in the 
> stream, using credentials that have lost its last reference and been
> freed.  
> 
> In this fix, we don't steal the reference to credentials we have obtained 
> in scm_send at beginning of unix_stream_sendmsg, till we've reached
> the last packet.  This fixes the problem in commit 0856a30409.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reported-by: Jiri Slaby <jirislaby@gmail.com>
> Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
> Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
> ---
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 136298c..47780dc 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1383,10 +1383,11 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>  }
>  
>  static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
> -			   bool send_fds, bool ref)
> +			   bool send_fds, bool steal_refs)
>  {
>  	int err = 0;
> -	if (ref) {
> +
> +	if (!steal_refs) {
>  		UNIXCB(skb).pid  = get_pid(scm->pid);
>  		UNIXCB(skb).cred = get_cred(scm->cred);
>  	} else {
> @@ -1458,7 +1459,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  	if (skb == NULL)
>  		goto out;
>  
> -	err = unix_scm_to_skb(siocb->scm, skb, true, false);
> +	err = unix_scm_to_skb(siocb->scm, skb, true, true);
>  	if (err < 0)
>  		goto out_free;
>  	max_level = err + 1;
> @@ -1581,6 +1582,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  	int sent = 0;
>  	struct scm_cookie tmp_scm;
>  	bool fds_sent = false;
> +	bool steal_refs = false;
>  	int max_level;
>  
>  	if (NULL == siocb->scm)
> @@ -1642,11 +1644,14 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  		size = min_t(int, size, skb_tailroom(skb));
>  
> 
> -		/* Only send the fds and no ref to pid in the first buffer */
> -		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
> +		/* Only send the fds in first buffer
> +		 * Last buffer can steal our references to pid/cred
> +		 */
> +		steal_refs = (sent + size >= len);
> +		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
>  		if (err < 0) {
>  			kfree_skb(skb);
> -			goto out;
> +			goto out_err;
>  		}
>  		max_level = err + 1;
>  		fds_sent = true;
> @@ -1654,7 +1659,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  		err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
>  		if (err) {
>  			kfree_skb(skb);
> -			goto out;
> +			goto out_err;
>  		}
>  
>  		unix_state_lock(other);
> @@ -1671,7 +1676,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  		sent += size;
>  	}
>  
> -	if (skb)
> +	if (steal_refs)
>  		scm_release(siocb->scm);
>  	else
>  		scm_destroy(siocb->scm);
> @@ -1687,9 +1692,10 @@ pipe_err:
>  		send_sig(SIGPIPE, current, 0);
>  	err = -EPIPE;
>  out_err:
> -	if (skb == NULL)
> +	if (steal_refs)
> +		scm_release(siocb->scm);
> +	else
>  		scm_destroy(siocb->scm);
> -out:
>  	siocb->scm = NULL;
>  	return sent ? : err;
>  }
> 
> 
> 

I dont think this patch is good.

Sedat traces have nothing to do with af_unix.

Once unix_scm_to_skb() was called and successful, and steal_refs is true
our refs are attached to this skb. They will be released by
skb_free(skb). Same for fp : They either were sent in a previous skb or
this one.

This is why the "goto out;" was OK.

^ permalink raw reply

* Re: [RFC] interface for outgoing packet
From: Eric Dumazet @ 2011-09-08  6:09 UTC (permalink / raw)
  To: Viral Mehta; +Cc: netdev
In-Reply-To: <CANX6DanC8DJsAZj4dV3oZo3DnLV6AyxBDm82vHaJm3C51aRqEg@mail.gmail.com>

Le mercredi 07 septembre 2011 à 20:24 -0400, Viral Mehta a écrit :
> Hi All,
> 
> How reasonably I can assume that,
> 
> on whatever interface, I received the last packet
> for a particular Socket Connection, the same interface will be
> used to SEND the next packet for same socket connection?
> 

No

> I am collecting the logs on one of my test machines.
> And it shows, I can assume all the time that "interface" which received packet
> for some socket connection, the same will be used to send packet for
> that connection
> 
> But, I am not sure if I am missing some scenarios
> or setup (for e.g., bonding) where it can be wrong ?
> 
> It would be more help if some one can shed some light.

By default, a socket is not bound to one interface, so you cant assume
this.

Check SO_BINDTODEVICE socket option if you need it.

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Eric Dumazet @ 2011-09-08  6:21 UTC (permalink / raw)
  To: Tim Chen
  Cc: davem@davemloft.net, sedat.dilek@gmail.com, Yan, Zheng,
	netdev@vger.kernel.org, sfr@canb.auug.org.au, jirislaby@gmail.com,
	Shi, Alex, Valdis Kletnieks, Yan, Zheng
In-Reply-To: <1315430115.2361.11.camel@schen9-mobl>

Le mercredi 07 septembre 2011 à 14:15 -0700, Tim Chen a écrit :

> Oops, I've forgotten to add Eric's previous Signed-off-by in my latest
> patch log.  David, please add that when you pick up the patch.  
> Once Yan Zheng added his sign off and Sedat tested the patch, I think it
> will be good to go.

Tim, as soon as you post another patch version, you must remove all
prior Signed-off-by, and only add yours.

So it was fine to do so.

By the way your last version introduce a new bug, so I would rather NACK
it ;)

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Yan, Zheng @ 2011-09-08  6:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tim Chen, sedat.dilek@gmail.com, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	jirislaby@gmail.com, Shi, Alex, Valdis Kletnieks
In-Reply-To: <1315461572.2532.7.camel@edumazet-laptop>

On 09/08/2011 01:59 PM, Eric Dumazet wrote:
> Le mercredi 07 septembre 2011 à 14:06 -0700, Tim Chen a écrit :
>> On Thu, 2011-09-08 at 08:27 +0800, Yan, Zheng wrote:
>>
>>>>  	err = -EPIPE;
>>>>  out_err:
>>>> -	if (skb == NULL)
>>>> +	if (!steal_refs)
>>>>  		scm_destroy(siocb->scm);
>>>
>>> I think we should call scm_release() here in the case of
>>> steal_refs == true. Otherwise siocb->scm->fp may leak.
>>
>> Yan Zheng,
>>
>> I've updated the patch.  If it looks good to you now, can you add your
>> Signed-off-by to the patch.  Pending Sedat's testing on this patch,
>> I think it is good to go.
>>
>> Tim
>>
>> ---
>> Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
>> in Unix socket's send and receive path) introduced a use-after-free bug.
>> The sent skbs from unix_stream_sendmsg could be consumed and destructed 
>> by the receive side, removing all references to the credentials, 
>> before the send side has finished sending out all 
>> packets. However, send side could continue to consturct new packets in the 
>> stream, using credentials that have lost its last reference and been
>> freed.  
>>
>> In this fix, we don't steal the reference to credentials we have obtained 
>> in scm_send at beginning of unix_stream_sendmsg, till we've reached
>> the last packet.  This fixes the problem in commit 0856a30409.
>>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Reported-by: Jiri Slaby <jirislaby@gmail.com>
>> Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
>> Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
>> ---
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index 136298c..47780dc 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -1383,10 +1383,11 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>>  }
>>  
>>  static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
>> -			   bool send_fds, bool ref)
>> +			   bool send_fds, bool steal_refs)
>>  {
>>  	int err = 0;
>> -	if (ref) {
>> +
>> +	if (!steal_refs) {
>>  		UNIXCB(skb).pid  = get_pid(scm->pid);
>>  		UNIXCB(skb).cred = get_cred(scm->cred);
>>  	} else {
>> @@ -1458,7 +1459,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>  	if (skb == NULL)
>>  		goto out;
>>  
>> -	err = unix_scm_to_skb(siocb->scm, skb, true, false);
>> +	err = unix_scm_to_skb(siocb->scm, skb, true, true);
>>  	if (err < 0)
>>  		goto out_free;
>>  	max_level = err + 1;
>> @@ -1581,6 +1582,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>  	int sent = 0;
>>  	struct scm_cookie tmp_scm;
>>  	bool fds_sent = false;
>> +	bool steal_refs = false;
>>  	int max_level;
>>  
>>  	if (NULL == siocb->scm)
>> @@ -1642,11 +1644,14 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>  		size = min_t(int, size, skb_tailroom(skb));
>>  
>>
>> -		/* Only send the fds and no ref to pid in the first buffer */
>> -		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
>> +		/* Only send the fds in first buffer
>> +		 * Last buffer can steal our references to pid/cred
>> +		 */
>> +		steal_refs = (sent + size >= len);
>> +		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
>>  		if (err < 0) {
>>  			kfree_skb(skb);
>> -			goto out;
>> +			goto out_err;
>>  		}
>>  		max_level = err + 1;
>>  		fds_sent = true;
>> @@ -1654,7 +1659,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>  		err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
>>  		if (err) {
>>  			kfree_skb(skb);
>> -			goto out;
>> +			goto out_err;
>>  		}
>>  
>>  		unix_state_lock(other);
>> @@ -1671,7 +1676,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>  		sent += size;
>>  	}
>>  
>> -	if (skb)
>> +	if (steal_refs)
>>  		scm_release(siocb->scm);
>>  	else
>>  		scm_destroy(siocb->scm);
>> @@ -1687,9 +1692,10 @@ pipe_err:
>>  		send_sig(SIGPIPE, current, 0);
>>  	err = -EPIPE;
>>  out_err:
>> -	if (skb == NULL)
>> +	if (steal_refs)
>> +		scm_release(siocb->scm);
>> +	else
>>  		scm_destroy(siocb->scm);
>> -out:
>>  	siocb->scm = NULL;
>>  	return sent ? : err;
>>  }
>>
>>
>>
> 
> I dont think this patch is good.
> 
> Sedat traces have nothing to do with af_unix.
> 
> Once unix_scm_to_skb() was called and successful, and steal_refs is true
> our refs are attached to this skb. They will be released by
> skb_free(skb). Same for fp : They either were sent in a previous skb or
> this one.
> 
> This is why the "goto out;" was OK.
> 

I don't think so. unix_scm_to_skb() calls unix_attach_fds(), it
always duplicates scm->fp. 

Regards

^ permalink raw reply

* [PATCH] net: phy: Add config option to specify external switch port to be used if switch is used as PHY
From: Lambrecht Jürgen @ 2011-09-08  6:54 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: linux-embedded@vger.kernel.org

Hello,

In our embedded designs, this is a useful patch. Maybe it can be useful 
for somebody else too.
Or maybe there are already better solutions?
I know I could also write a driver for our switch, but that is too much 
effort just to select the active port.

Kind regards,
Jürgen

   In embedded design, instead of a PHY, sometimes a switch is used that
           behaves as a PHY through its MII port. For example to use a daisy
           chain network configuration instead of an expensive star config.
           In that case, many phy ports are available, but only 1 should 
be used
           to check link status, and not the first one available as is 
the case
           without this configuration (that is, set to its default value 0).
           So this options specifies the switch port number to be used 
to check
           link status, because if the link is down, no data is sent by the
           TCP/IP stack.

Signed-off-by: Jürgen Lambrecht <J.Lambrecht@televic.com>
---
  drivers/net/phy/Kconfig    |   14 ++++++++++++++
  drivers/net/phy/mdio_bus.c |    9 +++++++++
  2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index a702443..554561f 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -13,6 +13,20 @@ menuconfig PHYLIB

  if PHYLIB

+config SWITCH_PHY
+       int "External switch port to be used if switch is used as PHY"
+       default "0"
+       help
+         In embedded design, instead of a PHY, sometimes a switch is 
used that
+         behaves as a PHY through its MII port. For example to use a daisy
+         chain network configuration instead of an expensive star config.
+         In that case, many phy ports are available, but only 1 should 
be used
+         to check link status, and not the first one available as is 
the case
+         without this configuration (that is, set to its default value 0).
+         So this options specifies the switch port number to be used to 
check
+         link status, because if the link is down, no data is sent by the
+         TCP/IP stack.
+
  comment "MII PHY device drivers"

  config MARVELL_PHY
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 6c58da2..016437a 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -112,7 +112,14 @@ int mdiobus_register(struct mii_bus *bus)
         if (bus->reset)
                 bus->reset(bus);

+       /* The config below is always availble with CONFIG_PHYLIB. If 0, the
+          behavior is as before without this patch (or P0 of the switch is
+          taken because it is the first one found). */
+#if CONFIG_SWITCH_PHY
+       i = CONFIG_SWITCH_PHY;
+#else
         for (i = 0; i < PHY_MAX_ADDR; i++) {
+#endif
                 if ((bus->phy_mask & (1 << i)) == 0) {
                         struct phy_device *phydev;

@@ -122,7 +129,9 @@ int mdiobus_register(struct mii_bus *bus)
                                 goto error;
                         }
                 }
+#if !CONFIG_SWITCH_PHY
         }
+#endif

         bus->state = MDIOBUS_REGISTERED;
         pr_info("%s: probed\n", bus->name);
--
1.7.1

^ permalink raw reply related

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Sedat Dilek @ 2011-09-08  7:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tim Chen, Yan, Zheng, Yan, Zheng, netdev@vger.kernel.org,
	davem@davemloft.net, sfr@canb.auug.org.au, jirislaby@gmail.com,
	Shi, Alex, Valdis Kletnieks
In-Reply-To: <1315461572.2532.7.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 5903 bytes --]

On Thu, Sep 8, 2011 at 7:59 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 07 septembre 2011 à 14:06 -0700, Tim Chen a écrit :
>> On Thu, 2011-09-08 at 08:27 +0800, Yan, Zheng wrote:
>>
>> > >   err = -EPIPE;
>> > >  out_err:
>> > > - if (skb == NULL)
>> > > + if (!steal_refs)
>> > >           scm_destroy(siocb->scm);
>> >
>> > I think we should call scm_release() here in the case of
>> > steal_refs == true. Otherwise siocb->scm->fp may leak.
>>
>> Yan Zheng,
>>
>> I've updated the patch.  If it looks good to you now, can you add your
>> Signed-off-by to the patch.  Pending Sedat's testing on this patch,
>> I think it is good to go.
>>
>> Tim
>>
>> ---
>> Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
>> in Unix socket's send and receive path) introduced a use-after-free bug.
>> The sent skbs from unix_stream_sendmsg could be consumed and destructed
>> by the receive side, removing all references to the credentials,
>> before the send side has finished sending out all
>> packets. However, send side could continue to consturct new packets in the
>> stream, using credentials that have lost its last reference and been
>> freed.
>>
>> In this fix, we don't steal the reference to credentials we have obtained
>> in scm_send at beginning of unix_stream_sendmsg, till we've reached
>> the last packet.  This fixes the problem in commit 0856a30409.
>>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Reported-by: Jiri Slaby <jirislaby@gmail.com>
>> Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
>> Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
>> ---
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index 136298c..47780dc 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -1383,10 +1383,11 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>>  }
>>
>>  static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
>> -                        bool send_fds, bool ref)
>> +                        bool send_fds, bool steal_refs)
>>  {
>>       int err = 0;
>> -     if (ref) {
>> +
>> +     if (!steal_refs) {
>>               UNIXCB(skb).pid  = get_pid(scm->pid);
>>               UNIXCB(skb).cred = get_cred(scm->cred);
>>       } else {
>> @@ -1458,7 +1459,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>       if (skb == NULL)
>>               goto out;
>>
>> -     err = unix_scm_to_skb(siocb->scm, skb, true, false);
>> +     err = unix_scm_to_skb(siocb->scm, skb, true, true);
>>       if (err < 0)
>>               goto out_free;
>>       max_level = err + 1;
>> @@ -1581,6 +1582,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>       int sent = 0;
>>       struct scm_cookie tmp_scm;
>>       bool fds_sent = false;
>> +     bool steal_refs = false;
>>       int max_level;
>>
>>       if (NULL == siocb->scm)
>> @@ -1642,11 +1644,14 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>               size = min_t(int, size, skb_tailroom(skb));
>>
>>
>> -             /* Only send the fds and no ref to pid in the first buffer */
>> -             err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
>> +             /* Only send the fds in first buffer
>> +              * Last buffer can steal our references to pid/cred
>> +              */
>> +             steal_refs = (sent + size >= len);
>> +             err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
>>               if (err < 0) {
>>                       kfree_skb(skb);
>> -                     goto out;
>> +                     goto out_err;
>>               }
>>               max_level = err + 1;
>>               fds_sent = true;
>> @@ -1654,7 +1659,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>               err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
>>               if (err) {
>>                       kfree_skb(skb);
>> -                     goto out;
>> +                     goto out_err;
>>               }
>>
>>               unix_state_lock(other);
>> @@ -1671,7 +1676,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>>               sent += size;
>>       }
>>
>> -     if (skb)
>> +     if (steal_refs)
>>               scm_release(siocb->scm);
>>       else
>>               scm_destroy(siocb->scm);
>> @@ -1687,9 +1692,10 @@ pipe_err:
>>               send_sig(SIGPIPE, current, 0);
>>       err = -EPIPE;
>>  out_err:
>> -     if (skb == NULL)
>> +     if (steal_refs)
>> +             scm_release(siocb->scm);
>> +     else
>>               scm_destroy(siocb->scm);
>> -out:
>>       siocb->scm = NULL;
>>       return sent ? : err;
>>  }
>>
>>
>>
>
> I dont think this patch is good.
>
> Sedat traces have nothing to do with af_unix.
>
> Once unix_scm_to_skb() was called and successful, and steal_refs is true
> our refs are attached to this skb. They will be released by
> skb_free(skb). Same for fp : They either were sent in a previous skb or
> this one.
>
> This is why the "goto out;" was OK.
>

Good morning,

/me sees so many patches :-).
Yes, I guess the patch by Eric has nothing to do with seen
call-traces, adding "irqpoll" to Kernel command line seems to fix them
(my uptime: approx. 10:30, Eric's proposal patch in my last
patch-series is attached).

- Sedat -

[-- Attachment #2: unix-stream-Fix-use-after-free-crashes-by-edumazet.patch --]
[-- Type: text/x-diff, Size: 2709 bytes --]

Commit 0856a30409 (Scm: Remove unnecessary pid & credential references
in Unix socket's send and receive path) introduced a use-after-free bug.
The sent skbs from unix_stream_sendmsg could be consumed and destructed 
by the receive side, removing all references to the credentials, 
before the send side has finished sending out all 
packets. However, send side could continue to consturct new packets in the 
stream, using credentials that have lost its last reference and been
freed.  

In this fix, we don't steal the reference to credentials we have obtained 
in scm_send at beginning of unix_stream_sendmsg, till we've reached
the last packet.  This fixes the problem in commit 0856a30409.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
---

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 136298c..4a324a0 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1383,10 +1383,11 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 }
 
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
-			   bool send_fds, bool ref)
+			   bool send_fds, bool steal_refs)
 {
 	int err = 0;
-	if (ref) {
+
+	if (!steal_refs) {
 		UNIXCB(skb).pid  = get_pid(scm->pid);
 		UNIXCB(skb).cred = get_cred(scm->cred);
 	} else {
@@ -1458,7 +1459,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (skb == NULL)
 		goto out;
 
-	err = unix_scm_to_skb(siocb->scm, skb, true, false);
+	err = unix_scm_to_skb(siocb->scm, skb, true, true);
 	if (err < 0)
 		goto out_free;
 	max_level = err + 1;
@@ -1581,6 +1582,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	int sent = 0;
 	struct scm_cookie tmp_scm;
 	bool fds_sent = false;
+	bool steal_refs = false;
 	int max_level;
 
 	if (NULL == siocb->scm)
@@ -1642,8 +1644,11 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		size = min_t(int, size, skb_tailroom(skb));
 
 
-		/* Only send the fds and no ref to pid in the first buffer */
-		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
+		/* Only send the fds in first buffer
+		 * Last buffer can steal our references to pid/cred
+		 */
+		steal_refs = (sent + size >= len);
+		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
 		if (err < 0) {
 			kfree_skb(skb);
 			goto out;
@@ -1671,7 +1676,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		sent += size;
 	}
 
-	if (skb)
+	if (steal_refs)
 		scm_release(siocb->scm);
 	else
 		scm_destroy(siocb->scm);



^ permalink raw reply related

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Eric Dumazet @ 2011-09-08  7:11 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: Tim Chen, sedat.dilek@gmail.com, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	jirislaby@gmail.com, Shi, Alex, Valdis Kletnieks
In-Reply-To: <4E685F19.6030407@intel.com>

Le jeudi 08 septembre 2011 à 14:22 +0800, Yan, Zheng a écrit :

> I don't think so. unix_scm_to_skb() calls unix_attach_fds(), it
> always duplicates scm->fp. 

What a mess. This code is a nightmare.

Part of the mess comes from scm_destroy() and scm_release() duplication.

We should have scm_destroy() only, as before, and NULLify scm->pid/cred
in unix_scm_to_skb() when we steal references.

It makes more sense and keeps things clearer.


 include/net/scm.h  |    9 ---------
 net/unix/af_unix.c |   27 +++++++++++++++------------
 2 files changed, 15 insertions(+), 21 deletions(-)


diff --git a/include/net/scm.h b/include/net/scm.h
index 68e1e48..2a5b42f 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -78,15 +78,6 @@ static __inline__ void scm_destroy(struct scm_cookie *scm)
 		__scm_destroy(scm);
 }
 
-static __inline__ void scm_release(struct scm_cookie *scm)
-{
-	/* keep ref on pid and cred */
-	scm->pid = NULL;
-	scm->cred = NULL;
-	if (scm->fp)
-		__scm_destroy(scm);
-}
-
 static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
 			       struct scm_cookie *scm)
 {
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e6d9d10..1fa270a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1379,15 +1379,18 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 }
 
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
-			   bool send_fds, bool ref)
+			   bool send_fds, bool steal_refs)
 {
 	int err = 0;
-	if (ref) {
+
+	if (!steal_refs) {
 		UNIXCB(skb).pid  = get_pid(scm->pid);
 		UNIXCB(skb).cred = get_cred(scm->cred);
 	} else {
 		UNIXCB(skb).pid  = scm->pid;
 		UNIXCB(skb).cred = scm->cred;
+		scm->pid = NULL;
+		scm->cred = NULL;
 	}
 	UNIXCB(skb).fp = NULL;
 	if (scm->fp && send_fds)
@@ -1454,7 +1457,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (skb == NULL)
 		goto out;
 
-	err = unix_scm_to_skb(siocb->scm, skb, true, false);
+	err = unix_scm_to_skb(siocb->scm, skb, true, true);
 	if (err < 0)
 		goto out_free;
 	max_level = err + 1;
@@ -1550,7 +1553,7 @@ restart:
 	unix_state_unlock(other);
 	other->sk_data_ready(other, len);
 	sock_put(other);
-	scm_release(siocb->scm);
+	scm_destroy(siocb->scm);
 	return len;
 
 out_unlock:
@@ -1577,6 +1580,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	int sent = 0;
 	struct scm_cookie tmp_scm;
 	bool fds_sent = false;
+	bool steal_refs = false;
 	int max_level;
 
 	if (NULL == siocb->scm)
@@ -1638,8 +1642,11 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		size = min_t(int, size, skb_tailroom(skb));
 
 
-		/* Only send the fds and no ref to pid in the first buffer */
-		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
+		/* Only send the fds in first buffer
+		 * Last buffer can steal our references to pid/cred
+		 */
+		steal_refs = (sent + size >= len);
+		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
 		if (err < 0) {
 			kfree_skb(skb);
 			goto out;
@@ -1667,10 +1674,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		sent += size;
 	}
 
-	if (skb)
-		scm_release(siocb->scm);
-	else
-		scm_destroy(siocb->scm);
+	scm_destroy(siocb->scm);
 	siocb->scm = NULL;
 
 	return sent;
@@ -1683,8 +1687,7 @@ pipe_err:
 		send_sig(SIGPIPE, current, 0);
 	err = -EPIPE;
 out_err:
-	if (skb == NULL)
-		scm_destroy(siocb->scm);
+	scm_destroy(siocb->scm);
 out:
 	siocb->scm = NULL;
 	return sent ? : err;

^ permalink raw reply related

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Yan, Zheng @ 2011-09-08  7:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tim Chen, sedat.dilek@gmail.com, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	jirislaby@gmail.com, Shi, Alex, Valdis Kletnieks
In-Reply-To: <1315465919.2532.19.camel@edumazet-laptop>

On 09/08/2011 03:11 PM, Eric Dumazet wrote:
> Le jeudi 08 septembre 2011 à 14:22 +0800, Yan, Zheng a écrit :
> 
>> I don't think so. unix_scm_to_skb() calls unix_attach_fds(), it
>> always duplicates scm->fp. 
> 
> What a mess. This code is a nightmare.
> 
> Part of the mess comes from scm_destroy() and scm_release() duplication.
> 
> We should have scm_destroy() only, as before, and NULLify scm->pid/cred
> in unix_scm_to_skb() when we steal references.
> 
> It makes more sense and keeps things clearer.
> 
> 
>  include/net/scm.h  |    9 ---------
>  net/unix/af_unix.c |   27 +++++++++++++++------------
>  2 files changed, 15 insertions(+), 21 deletions(-)
> 
> 
> diff --git a/include/net/scm.h b/include/net/scm.h
> index 68e1e48..2a5b42f 100644
> --- a/include/net/scm.h
> +++ b/include/net/scm.h
> @@ -78,15 +78,6 @@ static __inline__ void scm_destroy(struct scm_cookie *scm)
>  		__scm_destroy(scm);
>  }
>  
> -static __inline__ void scm_release(struct scm_cookie *scm)
> -{
> -	/* keep ref on pid and cred */
> -	scm->pid = NULL;
> -	scm->cred = NULL;
> -	if (scm->fp)
> -		__scm_destroy(scm);
> -}
> -
>  static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
>  			       struct scm_cookie *scm)
>  {
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index e6d9d10..1fa270a 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1379,15 +1379,18 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>  }
>  
>  static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
> -			   bool send_fds, bool ref)
> +			   bool send_fds, bool steal_refs)
>  {
>  	int err = 0;
> -	if (ref) {
> +
> +	if (!steal_refs) {
>  		UNIXCB(skb).pid  = get_pid(scm->pid);
>  		UNIXCB(skb).cred = get_cred(scm->cred);
>  	} else {
>  		UNIXCB(skb).pid  = scm->pid;
>  		UNIXCB(skb).cred = scm->cred;
> +		scm->pid = NULL;
> +		scm->cred = NULL;
>  	}
>  	UNIXCB(skb).fp = NULL;
>  	if (scm->fp && send_fds)
> @@ -1454,7 +1457,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  	if (skb == NULL)
>  		goto out;
>  
> -	err = unix_scm_to_skb(siocb->scm, skb, true, false);
> +	err = unix_scm_to_skb(siocb->scm, skb, true, true);
>  	if (err < 0)
>  		goto out_free;
>  	max_level = err + 1;
> @@ -1550,7 +1553,7 @@ restart:
>  	unix_state_unlock(other);
>  	other->sk_data_ready(other, len);
>  	sock_put(other);
> -	scm_release(siocb->scm);
> +	scm_destroy(siocb->scm);
>  	return len;
>  
>  out_unlock:
> @@ -1577,6 +1580,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  	int sent = 0;
>  	struct scm_cookie tmp_scm;
>  	bool fds_sent = false;
> +	bool steal_refs = false;
>  	int max_level;
>  
>  	if (NULL == siocb->scm)
> @@ -1638,8 +1642,11 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  		size = min_t(int, size, skb_tailroom(skb));
>  
>  
> -		/* Only send the fds and no ref to pid in the first buffer */
> -		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
> +		/* Only send the fds in first buffer
> +		 * Last buffer can steal our references to pid/cred
> +		 */
> +		steal_refs = (sent + size >= len);
> +		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
>  		if (err < 0) {
>  			kfree_skb(skb);
>  			goto out;
> @@ -1667,10 +1674,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>  		sent += size;
>  	}
>  
> -	if (skb)
> -		scm_release(siocb->scm);
> -	else
> -		scm_destroy(siocb->scm);
> +	scm_destroy(siocb->scm);
>  	siocb->scm = NULL;
>  
>  	return sent;
> @@ -1683,8 +1687,7 @@ pipe_err:
>  		send_sig(SIGPIPE, current, 0);
>  	err = -EPIPE;
>  out_err:
> -	if (skb == NULL)
> -		scm_destroy(siocb->scm);
> +	scm_destroy(siocb->scm);
>  out:
>  	siocb->scm = NULL;
>  	return sent ? : err;
> 
> 

This code looks great, except "goto out;" is still there. I think we should replace it with "goto out_err;" :)

Regards

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Eric Dumazet @ 2011-09-08  7:33 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: Tim Chen, sedat.dilek@gmail.com, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	jirislaby@gmail.com, Shi, Alex, Valdis Kletnieks
In-Reply-To: <4E686D71.30603@intel.com>

Le jeudi 08 septembre 2011 à 15:23 +0800, Yan, Zheng a écrit :

> This code looks great, except "goto out;" is still there. I think we should replace it with "goto out_err;" :)
> 

Indeed, you're right, thanks

 include/net/scm.h  |    9 ---------
 net/unix/af_unix.c |   32 +++++++++++++++++---------------
 2 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/include/net/scm.h b/include/net/scm.h
index 68e1e48..2a5b42f 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -78,15 +78,6 @@ static __inline__ void scm_destroy(struct scm_cookie *scm)
 		__scm_destroy(scm);
 }
 
-static __inline__ void scm_release(struct scm_cookie *scm)
-{
-	/* keep ref on pid and cred */
-	scm->pid = NULL;
-	scm->cred = NULL;
-	if (scm->fp)
-		__scm_destroy(scm);
-}
-
 static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
 			       struct scm_cookie *scm)
 {
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e6d9d10..c8a08ba 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1379,15 +1379,18 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 }
 
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
-			   bool send_fds, bool ref)
+			   bool send_fds, bool steal_refs)
 {
 	int err = 0;
-	if (ref) {
+
+	if (!steal_refs) {
 		UNIXCB(skb).pid  = get_pid(scm->pid);
 		UNIXCB(skb).cred = get_cred(scm->cred);
 	} else {
 		UNIXCB(skb).pid  = scm->pid;
 		UNIXCB(skb).cred = scm->cred;
+		scm->pid = NULL;
+		scm->cred = NULL;
 	}
 	UNIXCB(skb).fp = NULL;
 	if (scm->fp && send_fds)
@@ -1454,7 +1457,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (skb == NULL)
 		goto out;
 
-	err = unix_scm_to_skb(siocb->scm, skb, true, false);
+	err = unix_scm_to_skb(siocb->scm, skb, true, true);
 	if (err < 0)
 		goto out_free;
 	max_level = err + 1;
@@ -1550,7 +1553,7 @@ restart:
 	unix_state_unlock(other);
 	other->sk_data_ready(other, len);
 	sock_put(other);
-	scm_release(siocb->scm);
+	scm_destroy(siocb->scm);
 	return len;
 
 out_unlock:
@@ -1577,6 +1580,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	int sent = 0;
 	struct scm_cookie tmp_scm;
 	bool fds_sent = false;
+	bool steal_refs = false;
 	int max_level;
 
 	if (NULL == siocb->scm)
@@ -1638,11 +1642,14 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		size = min_t(int, size, skb_tailroom(skb));
 
 
-		/* Only send the fds and no ref to pid in the first buffer */
-		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
+		/* Only send the fds in first buffer
+		 * Last buffer can steal our references to pid/cred
+		 */
+		steal_refs = (sent + size >= len);
+		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
 		if (err < 0) {
 			kfree_skb(skb);
-			goto out;
+			goto out_err;
 		}
 		max_level = err + 1;
 		fds_sent = true;
@@ -1650,7 +1657,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
 		if (err) {
 			kfree_skb(skb);
-			goto out;
+			goto out_err;
 		}
 
 		unix_state_lock(other);
@@ -1667,10 +1674,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		sent += size;
 	}
 
-	if (skb)
-		scm_release(siocb->scm);
-	else
-		scm_destroy(siocb->scm);
+	scm_destroy(siocb->scm);
 	siocb->scm = NULL;
 
 	return sent;
@@ -1683,9 +1687,7 @@ pipe_err:
 		send_sig(SIGPIPE, current, 0);
 	err = -EPIPE;
 out_err:
-	if (skb == NULL)
-		scm_destroy(siocb->scm);
-out:
+	scm_destroy(siocb->scm);
 	siocb->scm = NULL;
 	return sent ? : err;
 }

^ permalink raw reply related

* e1000e: NIC not working (afer resume?)
From: Frederik Himpe @ 2011-09-08  7:45 UTC (permalink / raw)
  To: netdev

I have a Dell Latitude E6400 which has a network card supported by the 
e1000e driver. Often (I think after a suspend/resume cycle), the network 
card does not work at all: the NIC is correctly seen by ifconfig, but 
running ethtool just returns: "No such device". dhclient -v gives the 
impression that it's correctly sending out DHCPDISCOVER packets on the 
NIC, but a tcpdump running on the same machine does not see any packets 
going out.

I'm using Debian's 3.0.0-3 kernel (corresponding with Linux 3.0.3).

Full lspci, .config and dmesg output at
http://artipc10.vub.ac.be/~frederik/e1000e/

Here is some relevant summary.

How can I find out what is going wrong?

# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:21:70:e1:bb:4c  
          UP BROADCAST PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:22 Memory:f6ae0000-f6b00000 


# ethtool eth0
Settings for eth0:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
No data available

# lspci -vvnn
00:19.0 Ethernet controller [0200]: Intel Corporation 82567LM Gigabit Network Connection [8086:10f5] (rev 03)
        Subsystem: Dell Device [1028:0233]
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 22
        Region 0: Memory at f6ae0000 (32-bit, non-prefetchable) [disabled] [size=128K]
        Region 1: Memory at f6adb000 (32-bit, non-prefetchable) [disabled] [size=4K]
        Region 2: I/O ports at efe0 [disabled] [size=32]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=1 PME+
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 00000000fee0300c  Data: 4182
        Capabilities: [e0] PCI Advanced Features
                AFCap: TP+ FLR+
                AFCtrl: FLR-
                AFStatus: TP-
        Kernel driver in use: e1000e


# dmesg | grep -E "e1000e|eth0"
[    1.027437] e1000e: Intel(R) PRO/1000 Network Driver - 1.3.10-k2
[    1.027441] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[    1.027480] e1000e 0000:00:19.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
[    1.027491] e1000e 0000:00:19.0: setting latency timer to 64
[    1.027605] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[    1.231440] e1000e 0000:00:19.0: eth0: (PCI Express:2.5GT/s:Width x1) 00:21:70:e1:bb:4c
[    1.231444] e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
[    1.231470] e1000e 0000:00:19.0: eth0: MAC: 7, PHY: 8, PBA No: 1004FF-0FF
[   22.896268] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[   22.952097] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[   22.954132] ADDRCONF(NETDEV_UP): eth0: link is not ready
[   24.504903] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[   24.506413] e1000e 0000:00:19.0: eth0: 10/100 speed: disabling TSO
[   24.508402] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   34.788022] eth0: no IPv6 routers present
[   41.444922] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[   41.446325] e1000e 0000:00:19.0: eth0: 10/100 speed: disabling TSO
[25136.918393] e1000e 0000:00:19.0: PME# enabled
[25142.488050] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf6ae0000-0xf6afffff] (PCI address [0xf6ae0000-0xf6afffff])
[25142.488058] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf6adb000-0xf6adbfff] (PCI address [0xf6adb000-0xf6adbfff])
[25142.488066] e1000e 0000:00:19.0: BAR 2: set to [io  0xefe0-0xefff] (PCI address [0xefe0-0xefff])
[25142.488085] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[25142.488110] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[25142.488167] e1000e 0000:00:19.0: PME# disabled
[25142.510966] e1000e 0000:00:19.0: PCI INT A disabled
[25142.510971] e1000e 0000:00:19.0: PME# enabled
[25143.468668] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[25143.468689] e1000e 0000:00:19.0: restoring config space at offset 0x6 (was 0x1, writing 0xefe1)
[25143.468696] e1000e 0000:00:19.0: restoring config space at offset 0x5 (was 0x0, writing 0xf6adb000)
[25143.468703] e1000e 0000:00:19.0: restoring config space at offset 0x4 (was 0x0, writing 0xf6ae0000)
[25143.468713] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[25143.471149] e1000e 0000:00:19.0: PME# disabled
[25143.471277] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[25145.276441] e1000e 0000:00:19.0: PME# enabled
[25146.082051] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf6ae0000-0xf6afffff] (PCI address [0xf6ae0000-0xf6afffff])
[25146.082076] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf6adb000-0xf6adbfff] (PCI address [0xf6adb000-0xf6adbfff])
[25146.082090] e1000e 0000:00:19.0: BAR 2: set to [io  0xefe0-0xefff] (PCI address [0xefe0-0xefff])
[25146.082126] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[25146.082169] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[25146.082280] e1000e 0000:00:19.0: PME# disabled
[25146.176605] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[25146.232228] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[25146.234003] ADDRCONF(NETDEV_UP): eth0: link is not ready
[25147.328875] e1000e 0000:00:19.0: PME# enabled
[26261.896126] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf6ae0000-0xf6afffff] (PCI address [0xf6ae0000-0xf6afffff])
[26261.896140] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf6adb000-0xf6adbfff] (PCI address [0xf6adb000-0xf6adbfff])
[26261.896149] e1000e 0000:00:19.0: BAR 2: set to [io  0xefe0-0xefff] (PCI address [0xefe0-0xefff])
[26261.896169] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[26261.896196] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[26261.896241] e1000e 0000:00:19.0: PME# disabled
[26261.896316] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[26261.972148] e1000e 0000:00:19.0: PME# enabled
[26268.192160] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf6ae0000-0xf6afffff] (PCI address [0xf6ae0000-0xf6afffff])
[26268.192168] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf6adb000-0xf6adbfff] (PCI address [0xf6adb000-0xf6adbfff])
[26268.192175] e1000e 0000:00:19.0: BAR 2: set to [io  0xefe0-0xefff] (PCI address [0xefe0-0xefff])
[26268.192194] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[26268.192219] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[26268.192274] e1000e 0000:00:19.0: PME# disabled
[26268.213875] e1000e 0000:00:19.0: PME# enabled
[26269.172481] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[26269.172497] e1000e 0000:00:19.0: restoring config space at offset 0x6 (was 0x1, writing 0xefe1)
[26269.172502] e1000e 0000:00:19.0: restoring config space at offset 0x5 (was 0x0, writing 0xf6adb000)
[26269.172507] e1000e 0000:00:19.0: restoring config space at offset 0x4 (was 0x0, writing 0xf6ae0000)
[26269.172515] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[26269.174262] e1000e 0000:00:19.0: PME# disabled
[26269.174339] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[26272.384806] e1000e 0000:00:19.0: PME# enabled
[26274.688310] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26275.482490] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26276.277343] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26277.075194] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26277.870807] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26278.667881] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26279.463492] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26280.258982] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26281.056598] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26281.855386] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26282.649838] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26283.445674] e1000e 0000:00:19.0: eth0: Error reading PHY register
[26289.848036] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf6ae0000-0xf6afffff] (PCI address [0xf6ae0000-0xf6afffff])
[26289.848044] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf6adb000-0xf6adbfff] (PCI address [0xf6adb000-0xf6adbfff])
[26289.848050] e1000e 0000:00:19.0: BAR 2: set to [io  0xefe0-0xefff] (PCI address [0xefe0-0xefff])
[26289.848067] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[26289.848090] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[26289.848146] e1000e 0000:00:19.0: PME# disabled
[26289.940279] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[26289.996092] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[26289.996761] ADDRCONF(NETDEV_UP): eth0: link is not ready
[26291.088343] e1000e 0000:00:19.0: PME# enabled
[26900.272077] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf6ae0000-0xf6afffff] (PCI address [0xf6ae0000-0xf6afffff])
[26900.272096] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf6adb000-0xf6adbfff] (PCI address [0xf6adb000-0xf6adbfff])
[26900.272111] e1000e 0000:00:19.0: BAR 2: set to [io  0xefe0-0xefff] (PCI address [0xefe0-0xefff])
[26900.272142] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[26900.272180] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[26900.272250] e1000e 0000:00:19.0: PME# disabled
[26900.272379] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[26900.274426] e1000e 0000:00:19.0: eth0: MAC Wakeup cause - Link Status Change
[26905.560584] e1000e: Intel(R) PRO/1000 Network Driver - 1.3.10-k2
[26905.560591] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[26905.560650] e1000e 0000:00:19.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
[26905.560665] e1000e 0000:00:19.0: setting latency timer to 64
[26905.560836] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[26905.744547] e1000e 0000:00:19.0: eth0: (PCI Express:2.5GT/s:Width x1) 00:21:70:e1:bb:4c
[26905.744552] e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
[26905.744581] e1000e 0000:00:19.0: eth0: MAC: 7, PHY: 8, PBA No: 1004FF-0FF
[26905.744594] e1000e 0000:00:19.0: PME# enabled
[26905.880141] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf6ae0000-0xf6afffff] (PCI address [0xf6ae0000-0xf6afffff])
[26905.880149] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf6adb000-0xf6adbfff] (PCI address [0xf6adb000-0xf6adbfff])
[26905.880155] e1000e 0000:00:19.0: BAR 2: set to [io  0xefe0-0xefff] (PCI address [0xefe0-0xefff])
[26905.880172] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10a)
[26905.880194] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[26905.880247] e1000e 0000:00:19.0: PME# disabled
[26905.968320] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[26906.024395] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[26906.025907] ADDRCONF(NETDEV_UP): eth0: link is not ready
[26907.112415] e1000e 0000:00:19.0: PME# enabled
[27308.975856] device eth0 entered promiscuous mode
[27326.414382] device eth0 left promiscuous mode
[27336.815362] device eth0 entered promiscuous mode

* Right after start up: machine is in docking station, network is working 
fine.
* At 25136: resume machine, not in docking station, no network cable 
plugged in, so network connection was not tested.
* At 26261: resume machine, in docking station, network cable plugged in, 
but network connection is not working
* At 26900: rmmod e1000e && modprobe e1000e: network still not working

-- 
Frederik Himpe

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Jiri Slaby @ 2011-09-08  7:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yan, Zheng, Tim Chen, sedat.dilek@gmail.com, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	Shi, Alex, Valdis Kletnieks
In-Reply-To: <1315465919.2532.19.camel@edumazet-laptop>

On 09/08/2011 09:11 AM, Eric Dumazet wrote:
> Le jeudi 08 septembre 2011 à 14:22 +0800, Yan, Zheng a écrit :
> 
>> I don't think so. unix_scm_to_skb() calls unix_attach_fds(), it
>> always duplicates scm->fp. 
> 
> What a mess. This code is a nightmare.
> 
> Part of the mess comes from scm_destroy() and scm_release() duplication.
> 
> We should have scm_destroy() only, as before, and NULLify scm->pid/cred
> in unix_scm_to_skb() when we steal references.

This patch works for me. I haven't tried the out_err fixup from the
followup, but I assume I won't spot a difference as those are fail paths
anyway...

thanks,
-- 
js

^ permalink raw reply

* Re: [PATCH] net: phy: Add config option to specify external switch port to be used if switch is used as PHY
From: Florian Fainelli @ 2011-09-08  8:39 UTC (permalink / raw)
  To: Lambrecht Jürgen
  Cc: netdev@vger.kernel.org, linux-embedded@vger.kernel.org
In-Reply-To: <4E68668F.9060008@televic.com>

Hello Jurgen,

On Thursday 08 September 2011 08:54:07 Lambrecht Jürgen wrote:
> Hello,
> 
> In our embedded designs, this is a useful patch. Maybe it can be useful
> for somebody else too.
> Or maybe there are already better solutions?
> I know I could also write a driver for our switch, but that is too much
> effort just to select the active port.

This is not going to work well with all switches out there. You could use the 
fixed-PHY driver to make your ethernet driver see the link as always up between 
the MAC and switch CPU port.

A better solution would be to have proper switch drivers and user-space, which 
reminds me that we (OpenWrt) should at some point propose our switch drivers 
[1] for review.

[1]: 
https://dev.openwrt.org/browser/trunk/target/linux/generic/files/drivers/net/phy/

> 
> Kind regards,
> Jürgen
> 
>    In embedded design, instead of a PHY, sometimes a switch is used that
>            behaves as a PHY through its MII port. For example to use a
> daisy chain network configuration instead of an expensive star config. In
> that case, many phy ports are available, but only 1 should be used
>            to check link status, and not the first one available as is
> the case
>            without this configuration (that is, set to its default value
> 0). So this options specifies the switch port number to be used to check
>            link status, because if the link is down, no data is sent by the
>            TCP/IP stack.
> 
> Signed-off-by: Jürgen Lambrecht <J.Lambrecht@televic.com>
> ---
>   drivers/net/phy/Kconfig    |   14 ++++++++++++++
>   drivers/net/phy/mdio_bus.c |    9 +++++++++
>   2 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index a702443..554561f 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -13,6 +13,20 @@ menuconfig PHYLIB
> 
>   if PHYLIB
> 
> +config SWITCH_PHY
> +       int "External switch port to be used if switch is used as PHY"
> +       default "0"
> +       help
> +         In embedded design, instead of a PHY, sometimes a switch is
> used that
> +         behaves as a PHY through its MII port. For example to use a daisy
> +         chain network configuration instead of an expensive star config.
> +         In that case, many phy ports are available, but only 1 should
> be used
> +         to check link status, and not the first one available as is
> the case
> +         without this configuration (that is, set to its default value 0).
> +         So this options specifies the switch port number to be used to
> check
> +         link status, because if the link is down, no data is sent by the
> +         TCP/IP stack.
> +
>   comment "MII PHY device drivers"
> 
>   config MARVELL_PHY
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 6c58da2..016437a 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -112,7 +112,14 @@ int mdiobus_register(struct mii_bus *bus)
>          if (bus->reset)
>                  bus->reset(bus);
> 
> +       /* The config below is always availble with CONFIG_PHYLIB. If 0,
> the +          behavior is as before without this patch (or P0 of the
> switch is +          taken because it is the first one found). */
> +#if CONFIG_SWITCH_PHY
> +       i = CONFIG_SWITCH_PHY;
> +#else
>          for (i = 0; i < PHY_MAX_ADDR; i++) {
> +#endif
>                  if ((bus->phy_mask & (1 << i)) == 0) {
>                          struct phy_device *phydev;
> 
> @@ -122,7 +129,9 @@ int mdiobus_register(struct mii_bus *bus)
>                                  goto error;
>                          }
>                  }
> +#if !CONFIG_SWITCH_PHY
>          }
> +#endif
> 
>          bus->state = MDIOBUS_REGISTERED;
>          pr_info("%s: probed\n", bus->name);
> --
> 1.7.1
> N�����r��y����b�X��ǧv�^�)޺{.n�+���z�^�)����w*\x1fjg���\x1e�����ݢj/���z�ޖ��2�ޙ����
> &�)ߡ�a��\x7f��\x1e�G���h�\x0f�j:+v���w��٥

-- 
Florian

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Sedat Dilek @ 2011-09-08  8:43 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Eric Dumazet, Yan, Zheng, Tim Chen, Yan, Zheng,
	netdev@vger.kernel.org, davem@davemloft.net, sfr@canb.auug.org.au,
	Shi, Alex, Valdis Kletnieks
In-Reply-To: <4E687511.3040107@gmail.com>

On Thu, Sep 8, 2011 at 9:56 AM, Jiri Slaby <jirislaby@gmail.com> wrote:
> On 09/08/2011 09:11 AM, Eric Dumazet wrote:
>> Le jeudi 08 septembre 2011 à 14:22 +0800, Yan, Zheng a écrit :
>>
>>> I don't think so. unix_scm_to_skb() calls unix_attach_fds(), it
>>> always duplicates scm->fp.
>>
>> What a mess. This code is a nightmare.
>>
>> Part of the mess comes from scm_destroy() and scm_release() duplication.
>>
>> We should have scm_destroy() only, as before, and NULLify scm->pid/cred
>> in unix_scm_to_skb() when we steal references.
>
> This patch works for me. I haven't tried the out_err fixup from the
> followup, but I assume I won't spot a difference as those are fail paths
> anyway...
>
> thanks,
> --
> js
>

I have tested the same patch here (before shopping, but can test the
"final" patch, too.).

- Sedat -

^ permalink raw reply

* Re: [PATCH 08/11] netlink: implement memory mapped sendmsg()
From: Patrick McHardy @ 2011-09-08  9:33 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: davem, netfilter-devel, netdev
In-Reply-To: <20110907200319.GA29545@rere.qmqm.pl>

Am 07.09.2011 22:03, schrieb Michał Mirosław:
> On Wed, Sep 07, 2011 at 05:22:00PM +0200, Patrick McHardy wrote:
>> On 04.09.2011 18:18, Michał Mirosław wrote:
>>> On Sat, Sep 03, 2011 at 07:26:08PM +0200, kaber@trash.net wrote:
>>>> From: Patrick McHardy <kaber@trash.net>
>>>>
>>>> Add support for memory mapped sendmsg() to netlink. Userspace queued to
>>>> be processed frames into the TX ring and invokes sendmsg with
>>>> msg.iov.iov_base = NULL to trigger processing of all pending messages.
>>>>
>>>> Since the kernel usually performs full message validation before beginning
>>>> processing, userspace must be prevented from modifying the message
>>>> contents while the kernel is processing them. In order to do so, the
>>>> frames contents are copied to an allocated skb in case the the ring is
>>>> mapped more than once or the file descriptor is shared (f.i. through
>>>> AF_UNIX file descriptor passing).
>>>>
>>>> Otherwise an skb without a data area is allocated, the data pointer set
>>>> to point to the data area of the ring frame and the skb is processed.
>>>> Once the skb is freed, the destructor releases the frame back to userspace
>>>> by setting the status to NL_MMAP_STATUS_UNUSED.
>>>
>>> Is this protected from threads? Like: one thread waits on sendmsg() and
>>> another (same process) changes the buffer.
>> Yes, if the ring is mapped multiple times (or the file descriptor
>> is changed), the contents are copied to an allocated skb.
> 
> I mean:
> 
> [1] mmap()
> [1] fill buffers
> [1] pthread_create() [creates: 2]
> [1] sendmsg() starts
> [2] modify buffers
> [1] sendmsg() returns
> 
> So: no multiple mmaps, and no touching of the fd. I haven't dug into
> filesystem layer to see if threads affect file->f_count, but there
> sure are no multiple mappings here.

If CLONE_VM is given to clone(), the mapping is visible in both
threads and thus we have multiple mappings (vma_ops->open() is
invoked through clone()). Without CLONE_VM, the second thread
can't access the ring unless it mmap()s it itself, in case we'd
also have multiple mappings.

The file descriptor check is only meant for the case that
the fd is passed to a second process through AF_UNIX, the
first process invokes sendmsg(), sendmsg() checks for multiple
mappings and the second process invokes mmap() after that.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net: phy: Add config option to specify external switch port to be used if switch is used as PHY
From: Lambrecht Jürgen @ 2011-09-08 10:00 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev@vger.kernel.org, linux-embedded@vger.kernel.org
In-Reply-To: <201109081039.35272.florian@openwrt.org>

On 09/08/2011 10:39 AM, Florian Fainelli wrote:
>
> Hello Jurgen,
>
> On Thursday 08 September 2011 08:54:07 Lambrecht Jürgen wrote:
> > Hello,
> >
> > In our embedded designs, this is a useful patch. Maybe it can be useful
> > for somebody else too.
> > Or maybe there are already better solutions?
> > I know I could also write a driver for our switch, but that is too much
> > effort just to select the active port.
>
> This is not going to work well with all switches out there. You could 
> use the
>
Do not all switches follow the basic MII register map with room for 31 
phy's?
>
> fixed-PHY driver to make your ethernet driver see the link as always 
> up between
> the MAC and switch CPU port.
>
Indeed, I tried to, but it didn't work. (would be my preferred solution)
I juist enabled FIXED_PHY in menuconfig (and kept MII, NET_ETHERNET and 
FEC (for my iMX cpu); also PHYLIB is on that makes mdio_bus.c compile). 
I checked the architecture file for mpc866ads and didn't find any init 
for it, but maybe I need to initialize fixed-PHY somewhere?

However, it could be interesting sometimes from application side to know 
if the real external link is up, then fixed-PHY is not ok.
>
>
> A better solution would be to have proper switch drivers and 
> user-space, which
> reminds me that we (OpenWrt) should at some point propose our switch 
> drivers
> [1] for review.
>
> [1]:
> https://dev.openwrt.org/browser/trunk/target/linux/generic/files/drivers/net/phy/
>
Just had a fast look to the Marvell switch driver. Nice to know it is there.
>
>
> >
> > Kind regards,
> > Jürgen
> >
> >    In embedded design, instead of a PHY, sometimes a switch is used that
> >            behaves as a PHY through its MII port. For example to use a
> > daisy chain network configuration instead of an expensive star 
> config. In
> > that case, many phy ports are available, but only 1 should be used
> >            to check link status, and not the first one available as is
> > the case
> >            without this configuration (that is, set to its default value
> > 0). So this options specifies the switch port number to be used to check
> >            link status, because if the link is down, no data is sent 
> by the
> >            TCP/IP stack.
> >
> > Signed-off-by: Jürgen Lambrecht <J.Lambrecht@televic.com>
> > ---
>
[snip]
>
> --
> Florian
>


-- 
Jürgen Lambrecht
R&D Associate
Tel: +32 (0)51 303045    Fax: +32 (0)51 310670
http://www.televic-rail.com
Televic Rail NV - Leo Bekaertlaan 1 - 8870 Izegem - Belgium
Company number 0825.539.581 - RPR Kortrijk

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Sedat Dilek @ 2011-09-08 10:05 UTC (permalink / raw)
  To: Tim Chen
  Cc: Eric Dumazet, Yan, Zheng, netdev@vger.kernel.org,
	davem@davemloft.net, sfr@canb.auug.org.au, jirislaby@gmail.com,
	alex.shi
In-Reply-To: <1315339157.2576.3079.camel@schen9-DESK>

On Tue, Sep 6, 2011 at 9:59 PM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> On Tue, 2011-09-06 at 21:43 +0200, Eric Dumazet wrote:
>> Le mardi 06 septembre 2011 à 12:33 -0700, Tim Chen a écrit :
>>
>> > Yes, I think locking the sendmsg for the entire duration of
>> > unix_stream_sendmsg makes a lot of sense.  It simplifies the logic a lot
>> > more.  I'll try to cook something up in the next couple of days.
>>
>> Thats not really possible, we cant hold a spinlock and call
>> sock_alloc_send_skb() and/or memcpy_fromiovec(), wich might sleep.
>>
>> You would need to prepare the full skb list, then :
>> - stick the ref on the last skb of the list.
>>
>> Transfert the whole skb list in other->sk_receive_queue in one go,
>> instead of one after another.
>>
>> Unfortunately, this would break streaming (big send(), and another
>> thread doing the receive)
>>
>> Listen, I am wondering why hackbench even triggers SCM code. This is
>> really odd. We should not have a _single_ pid/cred ref/unref at all.
>>
>
> Hackbench triggers the code because it has a bunch of threads sending
> msgs on UNIX socket.
>>
>

# lsof | grep socket | wc -l
198

Aprrox 200 sockets in usage here, can you post your hackbench line, please?
I would compare hackbench results with and without new improvements in SCM code.

- Sedat -

[...]
>
> Tim
>

^ permalink raw reply

* Re: [PATCH -next v2] unix stream: Fix use-after-free crashes
From: Sedat Dilek @ 2011-09-08  9:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yan, Zheng, Tim Chen, Yan, Zheng, netdev@vger.kernel.org,
	davem@davemloft.net, sfr@canb.auug.org.au, jirislaby@gmail.com,
	Shi, Alex, Valdis Kletnieks
In-Reply-To: <1315467184.2532.22.camel@edumazet-laptop>

On Thu, Sep 8, 2011 at 9:33 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 08 septembre 2011 à 15:23 +0800, Yan, Zheng a écrit :
>
>> This code looks great, except "goto out;" is still there. I think we should replace it with "goto out_err;" :)
>>
>
> Indeed, you're right, thanks
>
>  include/net/scm.h  |    9 ---------
>  net/unix/af_unix.c |   32 +++++++++++++++++---------------
>  2 files changed, 17 insertions(+), 24 deletions(-)
>
> diff --git a/include/net/scm.h b/include/net/scm.h
> index 68e1e48..2a5b42f 100644
> --- a/include/net/scm.h
> +++ b/include/net/scm.h
> @@ -78,15 +78,6 @@ static __inline__ void scm_destroy(struct scm_cookie *scm)
>                __scm_destroy(scm);
>  }
>
> -static __inline__ void scm_release(struct scm_cookie *scm)
> -{
> -       /* keep ref on pid and cred */
> -       scm->pid = NULL;
> -       scm->cred = NULL;
> -       if (scm->fp)
> -               __scm_destroy(scm);
> -}
> -
>  static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
>                               struct scm_cookie *scm)
>  {
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index e6d9d10..c8a08ba 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1379,15 +1379,18 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>  }
>
>  static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb,
> -                          bool send_fds, bool ref)
> +                          bool send_fds, bool steal_refs)
>  {
>        int err = 0;
> -       if (ref) {
> +
> +       if (!steal_refs) {
>                UNIXCB(skb).pid  = get_pid(scm->pid);
>                UNIXCB(skb).cred = get_cred(scm->cred);
>        } else {
>                UNIXCB(skb).pid  = scm->pid;
>                UNIXCB(skb).cred = scm->cred;
> +               scm->pid = NULL;
> +               scm->cred = NULL;
>        }
>        UNIXCB(skb).fp = NULL;
>        if (scm->fp && send_fds)
> @@ -1454,7 +1457,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
>        if (skb == NULL)
>                goto out;
>
> -       err = unix_scm_to_skb(siocb->scm, skb, true, false);
> +       err = unix_scm_to_skb(siocb->scm, skb, true, true);
>        if (err < 0)
>                goto out_free;
>        max_level = err + 1;
> @@ -1550,7 +1553,7 @@ restart:
>        unix_state_unlock(other);
>        other->sk_data_ready(other, len);
>        sock_put(other);
> -       scm_release(siocb->scm);
> +       scm_destroy(siocb->scm);
>        return len;
>
>  out_unlock:
> @@ -1577,6 +1580,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>        int sent = 0;
>        struct scm_cookie tmp_scm;
>        bool fds_sent = false;
> +       bool steal_refs = false;
>        int max_level;
>
>        if (NULL == siocb->scm)
> @@ -1638,11 +1642,14 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>                size = min_t(int, size, skb_tailroom(skb));
>
>
> -               /* Only send the fds and no ref to pid in the first buffer */
> -               err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, fds_sent);
> +               /* Only send the fds in first buffer
> +                * Last buffer can steal our references to pid/cred
> +                */
> +               steal_refs = (sent + size >= len);
> +               err = unix_scm_to_skb(siocb->scm, skb, !fds_sent, steal_refs);
>                if (err < 0) {
>                        kfree_skb(skb);
> -                       goto out;
> +                       goto out_err;
>                }
>                max_level = err + 1;
>                fds_sent = true;
> @@ -1650,7 +1657,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>                err = memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size);
>                if (err) {
>                        kfree_skb(skb);
> -                       goto out;
> +                       goto out_err;
>                }
>
>                unix_state_lock(other);
> @@ -1667,10 +1674,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
>                sent += size;
>        }
>
> -       if (skb)
> -               scm_release(siocb->scm);
> -       else
> -               scm_destroy(siocb->scm);
> +       scm_destroy(siocb->scm);
>        siocb->scm = NULL;
>
>        return sent;
> @@ -1683,9 +1687,7 @@ pipe_err:
>                send_sig(SIGPIPE, current, 0);
>        err = -EPIPE;
>  out_err:
> -       if (skb == NULL)
> -               scm_destroy(siocb->scm);
> -out:
> +       scm_destroy(siocb->scm);
>        siocb->scm = NULL;
>        return sent ? : err;
>  }
>

I have tested this fixup patch on i386.
Can we have a separate patch with corrected descriptive text?

Thanks to all involved people.

- Sedat -

^ permalink raw reply

* Re: [PATCH] net: phy: Add config option to specify external switch port to be used if switch is used as PHY
From: Francois Romieu @ 2011-09-08 10:13 UTC (permalink / raw)
  To: Lambrecht Jürgen
  Cc: netdev@vger.kernel.org, linux-embedded@vger.kernel.org
In-Reply-To: <4E68668F.9060008@televic.com>

Lambrecht Jürgen <J.Lambrecht@TELEVIC.com> :
[...]
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 6c58da2..016437a 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -112,7 +112,14 @@ int mdiobus_register(struct mii_bus *bus)
>          if (bus->reset)
>                  bus->reset(bus);
> 
> +       /* The config below is always availble with CONFIG_PHYLIB. If 0, the
> +          behavior is as before without this patch (or P0 of the switch is
> +          taken because it is the first one found). */
> +#if CONFIG_SWITCH_PHY
> +       i = CONFIG_SWITCH_PHY;
> +#else
>          for (i = 0; i < PHY_MAX_ADDR; i++) {
> +#endif
>                  if ((bus->phy_mask & (1 << i)) == 0) {
>                          struct phy_device *phydev;

This config option may help your platform but it is nowhere reusable on a
slightly different one (each mii_bus behavior is changed).

Which driver(s) do you use that you can not set phy_mask directly ?

-- 
Ueimor

^ permalink raw reply

* Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
From: Michael S. Tsirkin @ 2011-09-08 11:08 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: netdev, dragos.tatulea, arnd, dwang2, benve, kaber, sri, davem,
	eric.dumazet, mchan, kvm
In-Reply-To: <CA8D9EAC.33A98%roprabhu@cisco.com>

On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote:
> On 9/7/11 5:34 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote:
> >> This patch is an attempt at providing address filtering support for macvtap
> >> devices in PASSTHRU mode. Its still a work in progress.
> >> Briefly tested for basic functionality. Wanted to get some feedback on the
> >> direction before proceeding.
> >> 
> > 
> > Good work, thanks.
> > 
> 
> Thanks.
> 
> >> I have hopefully CC'ed all concerned people.
> > 
> > kvm crowd might also be interested.
> > Try using ./scripts/get_maintainer.pl as well.
> > 
> Thanks for the tip. Expanded CC list a bit more.
> 
> >> PASSTHRU mode today sets the lowerdev in promiscous mode. In PASSTHRU mode
> >> there is a 1-1 mapping between macvtap device and physical nic or VF. And all
> >> filtering is done in lowerdev hw. The lowerdev does not need to be in
> >> promiscous mode as long as the guest filters are passed down to the lowerdev.
> >> This patch tries to remove the need for putting the lowerdev in promiscous
> >> mode. 
> >> I have also referred to the thread below where TUNSETTXFILTER was mentioned
> >> in 
> >> this context: 
> >>  http://patchwork.ozlabs.org/patch/69297/
> >> 
> >> This patch basically passes the addresses got by TUNSETTXFILTER to macvlan
> >> lowerdev.
> >> 
> >> I have looked at previous work and discussions on this for qemu-kvm
> >> by Michael Tsirkin, Alex Williamson and Dragos Tatulea
> >> http://patchwork.ozlabs.org/patch/78595/
> >> http://patchwork.ozlabs.org/patch/47160/
> >> https://patchwork.kernel.org/patch/474481/
> >> 
> >> Redhat bugzilla by Michael Tsirkin:
> >> https://bugzilla.redhat.com/show_bug.cgi?id=655013
> >> 
> >> I used Michael's qemu-kvm patch for testing the changes with KVM
> >> 
> >> I would like to cover both MAC and vlan filtering in this work.
> >> 
> >> Open Questions/Issues:
> >> - There is a need for vlan filtering to complete the patch. It will require
> >>   a new tap ioctl cmd for vlans.
> >>   Some ideas on this are:
> >> 
> >>   a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap filter
> >> (similar to tun_filter for addresses). Passing the vlan id's to lower
> >> device will mean going thru the whole list of vlans every time.
> >> 
> >>   OR
> >> 
> >>   b) TUNSETVLAN with vlan id and flag to set/unset
> >> 
> >>   Does option 'b' sound ok ?
> >> 
> >> - In this implementation we make the macvlan address list same as the address
> >>   list that came in the filter with TUNSETTXFILTER. This will not cover cases
> >>   where the macvlan device needs to have other addresses that are not
> >>   necessarily in the filter. Is this a problem ?
> > 
> > What cases do you have in mind?
> > 
> This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't
> see a problem with uc/mc address list being the same in all the stacked
> netdevs in the path. I called that out above to make sure I was not missing
> any case in PASSTHRU mode where this might be invalid. Otherwise I don't see
> a problem in the simple PASSTHRU use case this patch supports.
> 
> >> - The patch currently only supports passing of IFF_PROMISC and IFF_MULTICAST
> >> filter flags to lowerdev
> >> 
> >> This patch series implements the following
> >> 01/3 - macvlan: Add support for unicast filtering in macvlan
> >> 02/3 - macvlan: Add function to set addr filter on lower device in passthru
> >> mode
> >> 03/3 - macvtap: Add support for TUNSETTXFILTER
> >> 
> >> Please comment. Thanks.
> >> 
> >> Signed-off-by: Roopa Prabhu <roprabhu@cisco.com>
> >> Signed-off-by: Christian Benvenuti <benve@cisco.com>
> >> Signed-off-by: David Wang <dwang2@cisco.com>
> > 
> > The security isn't lower than with promisc, so I don't see
> > a problem with this as such.
> > 
> > There are more features we'll want down the road though,
> > so let's see whether the interface will be able to
> > satisfy them in a backwards compatible way before we
> > set it in stone. Here's what I came up with:
> > 
> > How will the filtering table be partitioned within guests?
> 
> Since this patch supports macvlan PASSTHRU mode only, in which the lower
> device has 1-1 mapping to the guest nic, it does not require any
> partitioning of filtering table within guests. Unless I missed understanding
> something. 
> If the lower device were being shared by multiple guest network interfaces
> (non PASSTHRU mode), only then we will need to maintain separate filter
> tables for each guest network interface in macvlan and forward the pkt to
> respective guest interface after a filter lookup. This could affect
> performance too I think.

Not with hardware filtering support. Which is where we'd need to
partition the host nic mac table between guests.

> I chose to support PASSTHRU Mode only at first because its simpler and all
> code additions are in control path only.

I agree. It would be a bit silly to have a dedicated interface
for passthough and a completely separate one for
non passthrough.

> > 
> > A way to limit what the guest can do would also be useful.
> > How can this be done? selinux?
> 
> I vaguely remember a thread on the same context.. had a suggestion to
> maintain pre-approved address lists and allow guest filter registration of
> only those addresses for security. This seemed reasonable. Plus the ability
> to support additional address registration from guest could be made
> configurable (One of your ideas again from prior work).
> 
> I am not an selinux expert, but I am thinking we can use it to only allow or
> disallow access or operations to the macvtap device. (?). I will check more
> on this.

We'd have to have a way to revoke that as well.

> > 
> > Any thoughts on spoofing filtering?
> 
> I can only think of checking addresses against an allowed address list.
> Don't know of any other ways. Any hints ?

Hardware (esp SRIOV) often has ways to do this check, too.

> 
> In any case I am assuming all the protection/security measures should be
> taken at the layer calling the TUNSETTXFILTER ie..In macvtap virtualization
> use case its libvirt or qemu-kvm. No ?

Ideally we'd have a way to separate these capabilities, so that libvirt
can override qemu.

> > 
> > Would it be possible to make the filtering programmable
> > using netlink, e.g. ethtool, ip, or some such?
> 
> Should be possible via ethtool or ip calling ioctl TUNSETTXFILTER. Are you
> thinking of macvlan having a netlink interface to set filter and not ioctl
> ?. Sure.

Yes.

> But I was thinking the point of implementing TUNSETTXFILTER was to
> maintain compatibility with the generic tap interface that does the same
> thing. 

Yes. OTOH I don't think anyone uses that ATM so it might not
be important if it's not a good fit.
E.g. we could notify libvirt and have it use netlink for us
if we like that better.

> And having both the netlink op and ioctl interface might not be clean ?.

No idea.

> Sorry if I misunderstood your question.
> 
> > That would make this useful for bridged setups besides
> > macvtap/virtualization.
> > 
> 
> Thanks for the comments. 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox