Netdev List
* Re: fanotify as syscalls
From: Davide Libenzi @ 2009-09-23 15:35 UTC (permalink / raw)
  To: hch@infradead.org
  Cc: Tvrtko Ursulin, Andreas Gruenbacher, Jamie Lokier, Eric Paris,
	Linus Torvalds, Evgeniy Polyakov, David Miller,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org,
	netdev@vger.kernel.org, viro@zeniv.linux.org.uk,
	alan@linux.intel.com
In-Reply-To: <20090923112018.GA2946@infradead.org>

On Wed, 23 Sep 2009, hch@infradead.org wrote:

> On Wed, Sep 23, 2009 at 09:39:33AM +0100, Tvrtko Ursulin wrote:
> > Lived with it because there was no other option. We used LSM while it was 
> > available for modules but then it was taken away. 
> > 
> > And not all vendors even use syscall interception, not even across platforms, 
> > of which you sound so sure about. You can't even scan something which is not 
> > in your namespace if you are at the syscall level. And you can't catch things 
> > like kernel nfsd. No, syscall interception is not really appropriate at all.
> 
> The "Anti-Malware" industry is just snake oil anyway.  I think the
> proper approach to support it is just to add various no-op exports that
> claim to do something, and all the people requiring anti-virus on Linux
> will be just as happy with it.

The fear is that this becomes a trojan horse (no pun intended) for more
and more hooks and "stuff", driven by we-really-need-this-too and
we-really-need-that-too. And once something is in, it's harder to say no,
under the pressure of offering a "limited solution".
This was the reason I threw the syscall tracing thing in, so they have a
low level generic hook, and they can knock themselves out in their module
(might need a few exports, but that's about it).



- Davide




* Re: fanotify as syscalls
From: Davide Libenzi @ 2009-09-23 15:26 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Andreas Gruenbacher, Jamie Lokier, Eric Paris, Linus Torvalds,
	Evgeniy Polyakov, David Miller, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org,
	viro@zeniv.linux.org.uk, alan@linux.intel.com, hch@infradead.org
In-Reply-To: <200909230939.34003.tvrtko.ursulin@sophos.com>

On Wed, 23 Sep 2009, Tvrtko Ursulin wrote:

> Lived with it because there was no other option. We used LSM while it was 
> available for modules but then it was taken away. 
> 
> And not all vendors even use syscall interception, not even across platforms, 
> of which you sound so sure about. You can't even scan something which is not 
> in your namespace if you are at the syscall level. And you can't catch things 
> like kernel nfsd. No, syscall interception is not really appropriate at all.

Really?
And *if* namespaces were the problem for the devices you were targeting,
what prevented you from resolving the object and offering a stream to
userspace?
In *your* module, hosting at the same time all the other logic required
for it (caches, whitelists, etc...), instead of pushing this stuff into
the kernel.
WRT the "other" system, I never said they were using syscall
interception, if you read carefully. I said that minifilters typically
send path names to userspace, which might drive you into the pitfall
Andreas was describing.


- Davide




* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-23 15:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABA32AF.50602@redhat.com>

Avi Kivity wrote:
> On 09/23/2009 05:26 PM, Gregory Haskins wrote:
>>
>>   
>>>> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci,
>>>> and
>>>> virtio-s390. It isn't especially easy. I can steal lots of code from
>>>> the
>>>> lguest bus model, but sometimes it is good to generalize, especially
>>>> after the fourth implementation or so. I think this is what GHaskins
>>>> tried
>>>> to do.
>>>>
>>>>        
>>> Yes.  vbus is more finely layered so there is less code duplication.
>>>      
>> To clarify, Ira was correct in stating that generalizing some of these
>> components was one of the goals for the vbus project: IOW vbus finely
>> layers and defines what's below virtio, not replaces it.
>>
>> You can think of a virtio-stack like this:
>>
>> --------------------------
>> | virtio-net
>> --------------------------
>> | virtio-ring
>> --------------------------
>> | virtio-bus
>> --------------------------
>> | ? undefined ?
>> --------------------------
>>
>> IOW: The way I see it, virtio is a device interface model only.  The
>> rest of it is filled in by the virtio-transport and some kind of
>> back-end.
>>
>> So today, we can complete the "? undefined ?" block like this for KVM:
>>
>> --------------------------
>> | virtio-pci
>> --------------------------
>>               |
>> --------------------------
>> | kvm.ko
>> --------------------------
>> | qemu
>> --------------------------
>> | tuntap
>> --------------------------
>>
>> In this case, kvm.ko and tuntap are providing plumbing, and qemu is
>> providing a backend device model (pci-based, etc).
>>
>> You can, of course, plug a different stack in (such as virtio-lguest,
>> virtio-ira, etc) but you are more or less on your own to recreate many
>> of the various facilities contained in that stack (such as things
>> provided by QEMU, like discovery/hotswap/addressing), as Ira is
>> discovering.
>>
>> Vbus tries to commoditize more components in the stack (like the bus
>> model and backend-device model) so they don't need to be redesigned each
>> time we solve this "virtio-transport" problem.  IOW: stop the
>> proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
>> virtio.  Instead, we can then focus on the value add on top, like the
>> models themselves or the simple glue between them.
>>
>> So now you might have something like
>>
>> --------------------------
>> | virtio-vbus
>> --------------------------
>> | vbus-proxy
>> --------------------------
>> | kvm-guest-connector
>> --------------------------
>>               |
>> --------------------------
>> | kvm.ko
>> --------------------------
>> | kvm-host-connector.ko
>> --------------------------
>> | vbus.ko
>> --------------------------
>> | virtio-net-backend.ko
>> --------------------------
>>
>> so now we don't need to worry about the bus-model or the device-model
>> framework.  We only need to implement the connector, etc.  This is handy
>> when you find yourself in an environment that doesn't support PCI (such
>> as Ira's rig, or userspace containers), or when you want to add features
>> that PCI doesn't have (such as fluid event channels for things like IPC
>> services, or prioritizable interrupts, etc).
>>    
> 
> Well, vbus does more, for example it tunnels interrupts instead of
> exposing them 1:1 on the native interface if it exists.

As I've previously explained, that trait is a function of the
kvm-connector I've chosen to implement, not of the overall design of vbus.

The reason why my kvm-connector is designed that way is because my early
testing/benchmarking shows that one of the issues in KVM performance is
that the ratio of exits per IO operation is fairly high, especially as you
scale the IO load.  Therefore, the connector achieves a substantial
reduction in that ratio by treating "interrupts" to the same kind of
benefits that NAPI brought to general networking: That is, we enqueue
"interrupt" messages into a lockless ring and only hit the IDT for the
first occurrence.  Subsequent interrupts are injected in a
parallel/lockless manner, without hitting the IDT nor incurring an extra
EOI.  This pays dividends as the IO rate increases, which is when the
guest needs the most help.
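
The coalescing idea described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: the names
(evt_ring, evt_post, evt_drain) are hypothetical and not taken from the
vbus sources, and the ring is a simple single-threaded array rather than
the lockless ring the connector uses. It shows just one property: a real
"interrupt" is raised only on the idle-to-pending transition, and later
events are picked up by the consumer's drain loop without another signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EVT_RING_SIZE 64	/* hypothetical capacity */

struct evt_ring {
	int      slots[EVT_RING_SIZE];
	size_t   head;		/* next slot to write */
	size_t   tail;		/* next slot to read */
	bool     pending;	/* consumer signalled and not yet drained */
	unsigned signals;	/* number of real "interrupts" raised */
};

/* Queue one event; returns true if a real interrupt had to be raised. */
static bool evt_post(struct evt_ring *r, int vector)
{
	assert(r->head - r->tail < EVT_RING_SIZE); /* sketch: no overflow handling */
	r->slots[r->head++ % EVT_RING_SIZE] = vector;
	if (r->pending)
		return false;	/* consumer will see it while draining */
	r->pending = true;
	r->signals++;		/* stands in for the IDT injection */
	return true;
}

/* Consumer side: handle everything queued, then re-arm signalling. */
static unsigned evt_drain(struct evt_ring *r)
{
	unsigned handled = 0;

	while (r->tail != r->head) {
		(void)r->slots[r->tail++ % EVT_RING_SIZE]; /* dispatch would go here */
		handled++;
	}
	r->pending = false;
	return handled;
}
```

Posting three events back to back raises one signal in this model; the
next signal is raised only after the consumer has drained the ring.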

OTOH, it is entirely possible to design the connector such that we
maintain a 1:1 ratio of signals to traditional IDT interrupts.  It is
also possible to design a connector which surfaces as something else,
such as PCI devices (by terminating the connector in QEMU and utilizing
its PCI emulation facilities), which would naturally employ 1:1 mapping.

So if 1:1 mapping is a critical feature (I would argue to the contrary),
vbus can support it.

> It also pulls parts of the device model into the host kernel.

That is the point.  Most of it needs to be there for performance.  And
what doesn't need to be there for performance can either be:

a) skipped at the discretion of the connector/device-model designer

OR

b) included because it's a trivially small subset of the model (e.g. a
mac-addr attribute) and it's nice to have a cohesive solution instead of
requiring a separate binary blob that can get out of sync, etc.

The example I've provided to date (venet on kvm) utilizes (b), but it
certainly doesn't have to.  Therefore, I don't think vbus as a whole can
be judged on this one point.

> 
>>> The virtio layering was more or less dictated by Xen which doesn't have
>>> shared memory (it uses grant references instead).  As a matter of fact
>>> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
>>> part is duplicated.  It's probably possible to add a virtio-shmem.ko
>>> library that people who do have shared memory can reuse.
>>>      
>> Note that I do not believe the Xen folk use virtio, so while I can
>> appreciate the foresight that went into that particular aspect of the
>> design of the virtio model, I am not sure if it's a realistic constraint.
>>    
> 
> Since a virtio goal was to reduce virtual device driver proliferation,
> it was necessary to accommodate Xen.

Fair enough, but I don't think the Xen community will ever use it.

To your point, a vbus goal was to reduce the bus-model and
backend-device-model proliferation for environments served by Linux as
the host.  This naturally complements virtio's driver non-proliferation
goal, but probably excludes Xen for reasons beyond the lack of shmem
(since it has its own non-linux hypervisor kernel).

In any case, I've already stated that we can simply make the virtio-shmem
(vbus-proxy-device) facility optionally defined, and unavailable on
non-shmem based architectures, to work around that issue.

The alternative is that we abstract the shmem concept further (ala
->add_buf() from the virtqueue world) but it is probably pointless to
try to accommodate shared-memory if you don't really have it, and no-one
will likely use it.

Kind Regards,
-Greg



* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-23 14:37 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABA3005.60905@gmail.com>

On 09/23/2009 05:26 PM, Gregory Haskins wrote:
>
>    
>>> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
>>> virtio-s390. It isn't especially easy. I can steal lots of code from the
>>> lguest bus model, but sometimes it is good to generalize, especially
>>> after the fourth implementation or so. I think this is what GHaskins tried
>>> to do.
>>>
>>>        
>> Yes.  vbus is more finely layered so there is less code duplication.
>>      
> To clarify, Ira was correct in stating that generalizing some of these
> components was one of the goals for the vbus project: IOW vbus finely
> layers and defines what's below virtio, not replaces it.
>
> You can think of a virtio-stack like this:
>
> --------------------------
> | virtio-net
> --------------------------
> | virtio-ring
> --------------------------
> | virtio-bus
> --------------------------
> | ? undefined ?
> --------------------------
>
> IOW: The way I see it, virtio is a device interface model only.  The
> rest of it is filled in by the virtio-transport and some kind of back-end.
>
> So today, we can complete the "? undefined ?" block like this for KVM:
>
> --------------------------
> | virtio-pci
> --------------------------
>               |
> --------------------------
> | kvm.ko
> --------------------------
> | qemu
> --------------------------
> | tuntap
> --------------------------
>
> In this case, kvm.ko and tuntap are providing plumbing, and qemu is
> providing a backend device model (pci-based, etc).
>
> You can, of course, plug a different stack in (such as virtio-lguest,
> virtio-ira, etc) but you are more or less on your own to recreate many
> of the various facilities contained in that stack (such as things
> provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering.
>
> Vbus tries to commoditize more components in the stack (like the bus
> model and backend-device model) so they don't need to be redesigned each
> time we solve this "virtio-transport" problem.  IOW: stop the
> proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
> virtio.  Instead, we can then focus on the value add on top, like the
> models themselves or the simple glue between them.
>
> So now you might have something like
>
> --------------------------
> | virtio-vbus
> --------------------------
> | vbus-proxy
> --------------------------
> | kvm-guest-connector
> --------------------------
>               |
> --------------------------
> | kvm.ko
> --------------------------
> | kvm-host-connector.ko
> --------------------------
> | vbus.ko
> --------------------------
> | virtio-net-backend.ko
> --------------------------
>
> so now we don't need to worry about the bus-model or the device-model
> framework.  We only need to implement the connector, etc.  This is handy
> when you find yourself in an environment that doesn't support PCI (such
> as Ira's rig, or userspace containers), or when you want to add features
> that PCI doesn't have (such as fluid event channels for things like IPC
> services, or prioritizable interrupts, etc).
>    

Well, vbus does more, for example it tunnels interrupts instead of 
exposing them 1:1 on the native interface if it exists.  It also pulls 
parts of the device model into the host kernel.

>> The virtio layering was more or less dictated by Xen which doesn't have
>> shared memory (it uses grant references instead).  As a matter of fact
>> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
>> part is duplicated.  It's probably possible to add a virtio-shmem.ko
>> library that people who do have shared memory can reuse.
>>      
> Note that I do not believe the Xen folk use virtio, so while I can
> appreciate the foresight that went into that particular aspect of the
> design of the virtio model, I am not sure if it's a realistic constraint.
>    

Since a virtio goal was to reduce virtual device driver proliferation, 
it was necessary to accommodate Xen.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-23 14:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB89C48.4020903@redhat.com>

Avi Kivity wrote:
> On 09/22/2009 12:43 AM, Ira W. Snyder wrote:
>>
>>> Sure, virtio-ira and he is on his own to make a bus-model under that, or
>>> virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
>>> model can work, I agree.
>>>
>>>      
>> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
>> virtio-s390. It isn't especially easy. I can steal lots of code from the
>> lguest bus model, but sometimes it is good to generalize, especially
>> after the fourth implementation or so. I think this is what GHaskins tried
>> to do.
>>    
> 
> Yes.  vbus is more finely layered so there is less code duplication.

To clarify, Ira was correct in stating that generalizing some of these
components was one of the goals for the vbus project: IOW vbus finely
layers and defines what's below virtio, not replaces it.

You can think of a virtio-stack like this:

--------------------------
| virtio-net
--------------------------
| virtio-ring
--------------------------
| virtio-bus
--------------------------
| ? undefined ?
--------------------------

IOW: The way I see it, virtio is a device interface model only.  The
rest of it is filled in by the virtio-transport and some kind of back-end.

So today, we can complete the "? undefined ?" block like this for KVM:

--------------------------
| virtio-pci
--------------------------
             |
--------------------------
| kvm.ko
--------------------------
| qemu
--------------------------
| tuntap
--------------------------

In this case, kvm.ko and tuntap are providing plumbing, and qemu is
providing a backend device model (pci-based, etc).

You can, of course, plug a different stack in (such as virtio-lguest,
virtio-ira, etc) but you are more or less on your own to recreate many
of the various facilities contained in that stack (such as things
provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering.

Vbus tries to commoditize more components in the stack (like the bus
model and backend-device model) so they don't need to be redesigned each
time we solve this "virtio-transport" problem.  IOW: stop the
proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
virtio.  Instead, we can then focus on the value add on top, like the
models themselves or the simple glue between them.

So now you might have something like

--------------------------
| virtio-vbus
--------------------------
| vbus-proxy
--------------------------
| kvm-guest-connector
--------------------------
             |
--------------------------
| kvm.ko
--------------------------
| kvm-host-connector.ko
--------------------------
| vbus.ko
--------------------------
| virtio-net-backend.ko
--------------------------

so now we don't need to worry about the bus-model or the device-model
framework.  We only need to implement the connector, etc.  This is handy
when you find yourself in an environment that doesn't support PCI (such
as Ira's rig, or userspace containers), or when you want to add features
that PCI doesn't have (such as fluid event channels for things like IPC
services, or prioritizable interrupts, etc).

> 
> The virtio layering was more or less dictated by Xen which doesn't have
> shared memory (it uses grant references instead).  As a matter of fact
> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
> part is duplicated.  It's probably possible to add a virtio-shmem.ko
> library that people who do have shared memory can reuse.

Note that I do not believe the Xen folk use virtio, so while I can
appreciate the foresight that went into that particular aspect of the
design of the virtio model, I am not sure if it's a realistic constraint.

The reason why I decided to not worry about that particular model is
twofold:

1) The cost of supporting non-shared-memory designs is prohibitively high
for my performance goals (for instance, requiring an exit on each
->add_buf() in addition to the ->kick()).

2) The Xen guys are unlikely to diverge from something like
xenbus/xennet anyway, so it would be for naught.

Therefore, I just went with a device model optimized for shared-memory
outright.

That said, I believe we can refactor what is called the
"vbus-proxy-device" into this virtio-shmem interface that you and
Anthony have described.  We could make the feature optional and only
support it on architectures where this makes sense.

<snip>

Kind Regards,
-Greg



* Re: r8169, enabling TX checksumming breaks things?
From: Denys Fedoryschenko @ 2009-09-23 14:24 UTC (permalink / raw)
  To: David Dillow; +Cc: romieu, netdev
In-Reply-To: <1253714544.3925.6.camel@lap75545.ornl.gov>

On Wednesday 23 September 2009 17:02:24 David Dillow wrote:
> On Wed, 2009-09-23 at 09:15 +0300, Denys Fedoryschenko wrote:
> > Hi
> >
> > Is it expected that:
> > 1) TX checksumming is off by default
> > 2) If I try to enable it with ethtool -K eth0 tx on, TCP sessions on the
> > proxy get stuck, even though in tcpdump everything looks fine and packets
> > reach the destination. I don't understand the reason for the failure.
> > Maybe if this feature is not supposed to work, the user should not be
> > able to turn it on at all?
>
> It is broken for large swaths of the hardware -- I have patches that got
> it and TSO working on my hardware, and they provide a framework to see
> about getting it working on yours.
>
> Basically, the fields are in different places depending on the chip
> revision. I'll try to dig those out tonight and send them along so we
> can experiment.
Thanks, I have 8 hosts (4 with RTL8168b/8111b and 4 with
RTL8168d/8111d) to test. Ready for patches to test them :-)


* Re: r8169, enabling TX checksumming breaks things?
From: David Dillow @ 2009-09-23 14:02 UTC (permalink / raw)
  To: Denys Fedoryschenko; +Cc: romieu, netdev
In-Reply-To: <200909230915.27854.denys@visp.net.lb>

On Wed, 2009-09-23 at 09:15 +0300, Denys Fedoryschenko wrote:
> Hi
> 
> Is it expected that:
> 1) TX checksumming is off by default
> 2) If I try to enable it with ethtool -K eth0 tx on, TCP sessions on the
> proxy get stuck, even though in tcpdump everything looks fine and packets
> reach the destination. I don't understand the reason for the failure.
> Maybe if this feature is not supposed to work, the user should not be
> able to turn it on at all?

It is broken for large swaths of the hardware -- I have patches that got
it and TSO working on my hardware, and they provide a framework to see
about getting it working on yours.

Basically, the fields are in different places depending on the chip
revision. I'll try to dig those out tonight and send them along so we
can experiment.


* [PATCH] skge: Make sure both ports initialize correctly
From: Mike McCormack @ 2009-09-23 13:50 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

If allocation of the second port fails, make sure that hw->ports
is not 2; otherwise we'll crash trying to access the second port.

This fix is copied from a similar fix in the sky2 driver (ca519274...),
but is untested, as I don't have a skge card.

Signed-off-by: Mike McCormack <mikem@ring3k.org>
---
 drivers/net/skge.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/skge.c b/drivers/net/skge.c
index 62e852e..21b816f 100644
--- a/drivers/net/skge.c
+++ b/drivers/net/skge.c
@@ -3982,14 +3982,17 @@ static int __devinit skge_probe(struct pci_dev *pdev,
 	}
 	skge_show_addr(dev);
 
-	if (hw->ports > 1 && (dev1 = skge_devinit(hw, 1, using_dac))) {
-		if (register_netdev(dev1) == 0)
+	if (hw->ports > 1) {
+		dev1 = skge_devinit(hw, 1, using_dac);
+		if (dev1 && register_netdev(dev1) == 0)
 			skge_show_addr(dev1);
 		else {
 			/* Failure to register second port need not be fatal */
 			dev_warn(&pdev->dev, "register of second port failed\n");
 			hw->dev[1] = NULL;
-			free_netdev(dev1);
+			hw->ports = 1;
+			if (dev1)
+				free_netdev(dev1);
 		}
 	}
 	pci_set_drvdata(pdev, hw);
-- 
1.5.6.5
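
The failure handling in the patch above can be modelled outside the kernel
roughly as follows. This is a sketch, not driver code: fake_hw, fake_dev,
fake_devinit and fake_probe_second are hypothetical stand-ins, and the
model assumes (as the hw->dev[1] = NULL line in the patch suggests) that
the init helper records the device in the hw->dev[] array. The point is
only that both failure paths, allocation and registration, leave the port
count at 1 so that later loops over hw->ports never touch port 1.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PORTS 2

struct fake_dev { int port; };

struct fake_hw {
	int ports;			/* number of usable ports */
	struct fake_dev *dev[MAX_PORTS];
};

/* Hypothetical allocator; 'fail' simulates an allocation failure. */
static struct fake_dev *fake_devinit(struct fake_hw *hw, int port, int fail)
{
	struct fake_dev *dev;

	assert(port >= 0 && port < MAX_PORTS);
	if (fail)
		return NULL;
	dev = malloc(sizeof(*dev));
	if (dev) {
		dev->port = port;
		hw->dev[port] = dev;
	}
	return dev;
}

/* Bring up the optional second port; never leave hw->ports at 2 on failure. */
static void fake_probe_second(struct fake_hw *hw, int alloc_fails,
			      int register_fails)
{
	struct fake_dev *dev1;

	if (hw->ports <= 1)
		return;
	dev1 = fake_devinit(hw, 1, alloc_fails);
	if (dev1 && !register_fails)
		return;		/* second port is live */
	hw->dev[1] = NULL;
	hw->ports = 1;		/* loops over hw->ports now skip port 1 */
	free(dev1);		/* free(NULL) is a no-op */
}
```

Calling fake_probe_second() with either failure flag set leaves ports at 1
and dev[1] at NULL; with both flags clear, ports stays at 2.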



* Re: [PATCH] net: Fix sock_wfree() race
From: Eric Dumazet @ 2009-09-23 13:44 UTC (permalink / raw)
  To: David Miller; +Cc: albcamus, parag.lkml, linux-kernel, netdev
In-Reply-To: <20090911.125242.244008840.davem@davemloft.net>

David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Fri, 11 Sep 2009 11:43:37 -0700 (PDT)
> 
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 09 Sep 2009 00:49:31 +0200
>>
>>> [PATCH] net: Fix sock_wfree() race
>>>
>>> Commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
>>> (net: No more expensive sock_hold()/sock_put() on each tx)
>>> opens a window in sock_wfree() where another cpu
>>> might free the socket we are working on.
>>>
>>> Fix is to call sk->sk_write_space(sk) only
>>> while still holding a reference on sk.
>>>
>>> Since doing this call is done before the 
>>> atomic_sub(truesize, &sk->sk_wmem_alloc), we should pass truesize as 
>>> a bias for possible sk_wmem_alloc evaluations.
>>>
>>> Reported-by: Jike Song <albcamus@gmail.com>
>>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Applied to net-next-2.6, thanks.  I'll queue up your simpler
>> version for -stable.
> 
> Eric, I have to revert, as you didn't update the callbacks
> of several protocols such as SCTP and RDS in this change.
> 
> Let me know when you have a fixed version of this patch :-)

Sorry for the delay David. But this is complex. I am not
sure we can do a clean and safe thing, not counting
the added bloat.

If we do :

void sock_wfree(struct sk_buff *skb)
{
        struct sock *sk = skb->sk;
        int res;

        if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
                sk->sk_write_space(sk, skb->truesize);

        res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
        /*
         * if sk_wmem_alloc reached 0, we are last user and should
         * free this sock, as sk_free() call could not do it.
         */
        if (res == 0)
                __sk_free(sk);
}


There is still a possibility that multiple cpus call sock_wfree()
for the same socket, and that they all call sk_write_space()
with their bias, yet the protocol still sees an estimate of
sk_wmem_alloc that is too big.

We could fail to wake up a blocked writer when a low sk->sk_sndbuf
value is set. (One could argue that with small sk_sndbuf
values we should not have many packets in flight: keep in mind
sk_sndbuf can be lowered by the user)


With second patch we instead have :

void sock_wfree(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	unsigned int len = skb->truesize;

	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
		/*
		 * Keep a reference on sk_wmem_alloc, this will be released
		 * after sk_write_space() call
		 */
		atomic_sub(len - 1, &sk->sk_wmem_alloc);
		sk->sk_write_space(sk);
		len = 1;
	}
	/*
	 * if sk_wmem_alloc reaches 0, we must finish what sk_free()
	 * could not do because of in-flight packets
	 */
	if (atomic_sub_return(len, &sk->sk_wmem_alloc) == 0)
		__sk_free(sk);
}

The accumulated transient error on sk_wmem_alloc is then < num_online_cpus(),
which should be OK even for very small sk_sndbuf values.

Of course TCP doesn't have to pay the price of sk_write_space() and the second
atomic operation re-added by this fix.
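
To make the bound concrete, here is a rough userspace model of the second
version, using C11 atomics in place of the kernel's atomic_t. The
fake_sock type, its fields and the fake_write_space()/fake_wfree() helpers
are stand-ins invented for this sketch; only the ordering of the two
subtractions mirrors the code above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct sock; only what the sketch needs. */
struct fake_sock {
	atomic_uint wmem_alloc;	/* models sk->sk_wmem_alloc */
	unsigned    wakeups;	/* times the write_space callback ran */
	unsigned    seen_in_cb;	/* wmem_alloc as observed by the callback */
	bool        freed;	/* set where __sk_free() would run */
};

static void fake_write_space(struct fake_sock *sk)
{
	sk->seen_in_cb = atomic_load(&sk->wmem_alloc);
	sk->wakeups++;
}

/* Models sock_wfree() for a buffer of 'truesize' bytes. */
static void fake_wfree(struct fake_sock *sk, unsigned truesize,
		       bool use_write_queue)
{
	unsigned len = truesize;

	assert(truesize > 0);
	if (!use_write_queue) {
		/*
		 * Give back all but one unit, so the count cannot reach
		 * zero (and the socket cannot be freed) during the callback.
		 */
		atomic_fetch_sub(&sk->wmem_alloc, len - 1);
		fake_write_space(sk);
		len = 1;
	}
	/* atomic_fetch_sub() returns the old value: old == len means now 0 */
	if (atomic_fetch_sub(&sk->wmem_alloc, len) == len)
		sk->freed = true;
}
```

In this model each caller over-reports by exactly one unit while its
callback runs, which is where a bound of one unit per cpu comes from.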

Here is the patch for reference :

[PATCH] net: Fix sock_wfree() race

Commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
(net: No more expensive sock_hold()/sock_put() on each tx)
opens a window in sock_wfree() where another cpu
might free the socket we are working on.

A fix is to call sk->sk_write_space(sk) while still
holding a reference on sk.


Reported-by: Jike Song <albcamus@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/sock.c |   19 ++++++++++++-------
 1 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 30d5446..e1f034e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1228,17 +1228,22 @@ void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
-	int res;
+	unsigned int len = skb->truesize;
 
-	/* In case it might be waiting for more memory. */
-	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
-	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
+	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
+		/*
+		 * Keep a reference on sk_wmem_alloc, this will be released
+		 * after sk_write_space() call
+		 */
+		atomic_sub(len - 1, &sk->sk_wmem_alloc);
 		sk->sk_write_space(sk);
+		len = 1;
+	}
 	/*
-	 * if sk_wmem_alloc reached 0, we are last user and should
-	 * free this sock, as sk_free() call could not do it.
+	 * if sk_wmem_alloc reaches 0, we must finish what sk_free()
+	 * could not do because of in-flight packets
 	 */
-	if (res == 0)
+	if (atomic_sub_return(len, &sk->sk_wmem_alloc) == 0)
 		__sk_free(sk);
 }
 EXPORT_SYMBOL(sock_wfree);



* [PATCH] Phonet: fix race for port number in concurrent bind()
From: Rémi Denis-Courmont @ 2009-09-23 13:17 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

Allocating a port number to a socket and hashing that socket shall be
an atomic operation with regard to other port allocations. Otherwise,
we could allocate a port that is already being allocated to another
socket.

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 net/phonet/socket.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index 7a4ee39..07aa9f0 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -113,6 +113,8 @@ void pn_sock_unhash(struct sock *sk)
 }
 EXPORT_SYMBOL(pn_sock_unhash);
 
+static DEFINE_MUTEX(port_mutex);
+
 static int pn_socket_bind(struct socket *sock, struct sockaddr *addr, int len)
 {
 	struct sock *sk = sock->sk;
@@ -140,9 +142,11 @@ static int pn_socket_bind(struct socket *sock, struct sockaddr *addr, int len)
 		err = -EINVAL; /* attempt to rebind */
 		goto out;
 	}
+	WARN_ON(sk_hashed(sk));
+	mutex_lock(&port_mutex);
 	err = sk->sk_prot->get_port(sk, pn_port(handle));
 	if (err)
-		goto out;
+		goto out_port;
 
 	/* get_port() sets the port, bind() sets the address if applicable */
 	pn->sobject = pn_object(saddr, pn_port(pn->sobject));
@@ -150,6 +154,8 @@ static int pn_socket_bind(struct socket *sock, struct sockaddr *addr, int len)
 
 	/* Enable RX on the socket */
 	sk->sk_prot->hash(sk);
+out_port:
+	mutex_unlock(&port_mutex);
 out:
 	release_sock(sk);
 	return err;
@@ -357,8 +363,6 @@ const struct proto_ops phonet_stream_ops = {
 };
 EXPORT_SYMBOL(phonet_stream_ops);
 
-static DEFINE_MUTEX(port_mutex);
-
 /* allocate port for a socket */
 int pn_sock_get_port(struct sock *sk, unsigned short sport)
 {
@@ -370,9 +374,7 @@ int pn_sock_get_port(struct sock *sk, unsigned short sport)
 
 	memset(&try_sa, 0, sizeof(struct sockaddr_pn));
 	try_sa.spn_family = AF_PHONET;
-
-	mutex_lock(&port_mutex);
-
+	WARN_ON(!mutex_is_locked(&port_mutex));
 	if (!sport) {
 		/* search free port */
 		int port, pmin, pmax;
@@ -401,8 +403,6 @@ int pn_sock_get_port(struct sock *sk, unsigned short sport)
 		else
 			sock_put(tmpsk);
 	}
-	mutex_unlock(&port_mutex);
-
 	/* the port must be in use already */
 	return -EADDRINUSE;
 
-- 
1.6.0.4



* [PATCH] Phonet: error on broadcast sending (unimplemented)
From: Rémi Denis-Courmont @ 2009-09-23 13:17 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont
In-Reply-To: <1253711831-7947-1-git-send-email-remi@remlab.net>

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

If we ever implement this, then we can stop returning an error.

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 include/linux/phonet.h |    1 +
 net/phonet/af_phonet.c |    6 ++++++
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/phonet.h b/include/linux/phonet.h
index 1ef5a07..e5126cf 100644
--- a/include/linux/phonet.h
+++ b/include/linux/phonet.h
@@ -38,6 +38,7 @@
 #define PNPIPE_IFINDEX		2
 
 #define PNADDR_ANY		0
+#define PNADDR_BROADCAST	0xFC
 #define PNPORT_RESOURCE_ROUTING	0
 
 /* Values for PNPIPE_ENCAP option */
diff --git a/net/phonet/af_phonet.c b/net/phonet/af_phonet.c
index a662e62..f60c0c2 100644
--- a/net/phonet/af_phonet.c
+++ b/net/phonet/af_phonet.c
@@ -168,6 +168,12 @@ static int pn_send(struct sk_buff *skb, struct net_device *dev,
 		goto drop;
 	}
 
+	/* Broadcast sending is not implemented */
+	if (pn_addr(dst) == PNADDR_BROADCAST) {
+		err = -EOPNOTSUPP;
+		goto drop;
+	}
+
 	skb_reset_transport_header(skb);
 	WARN_ON(skb_headroom(skb) & 1); /* HW assumes word alignment */
 	skb_push(skb, sizeof(struct phonethdr));
-- 
1.6.0.4


^ permalink raw reply related

* Re: [PATCH 1/3] iwmc3200top: Add Intel Wireless MultiCom 3200 top driver.
From: Tomas Winkler @ 2009-09-23 12:29 UTC (permalink / raw)
  To: Johannes Berg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, linville-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA, yi.zhu-ral2JQCrhuEAvxtiuMwx3w,
	inaky.perez-gonzalez-ral2JQCrhuEAvxtiuMwx3w,
	cindy.h.kao-ral2JQCrhuEAvxtiuMwx3w,
	guy.cohen-ral2JQCrhuEAvxtiuMwx3w,
	ron.rindjunsky-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <1253691283.4458.38.camel-YfaajirXv2244ywRPIzf9A@public.gmane.org>

On Wed, Sep 23, 2009 at 10:34 AM, Johannes Berg
<johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> wrote:
> On Wed, 2009-09-23 at 10:23 +0300, Tomas Winkler wrote:
>
>> From HW perspective your assumption is not exactly correct. All the
>> devices are visible on the SDIO bus but they are not operational
>> (probe won't succeed) until TOP download the firmware and kicks the
>> devices. From SW perspective to create another bus layer is an option.
>> I'm not sure if it's not more complicated one.
>
> Ah, ok, so it is quite different. Not sure how sdio probing works, so I
> guess I can't say much here.

This is not about SDIO probing; this is a rather unusual HW design.
Anyhow all comments and ideas are welcome.

Thanks
Tomas
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: fanotify as syscalls
From: Arjan van de Ven @ 2009-09-23 11:32 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Davide Libenzi, Andreas Gruenbacher, Jamie Lokier, Eric Paris,
	Linus Torvalds, Evgeniy Polyakov, David Miller,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org,
	netdev@vger.kernel.org, viro@zeniv.linux.org.uk,
	alan@linux.intel.com, hch@infradead.org
In-Reply-To: <200909230939.34003.tvrtko.ursulin@sophos.com>

On Wed, 23 Sep 2009 09:39:33 +0100
Tvrtko Ursulin <tvrtko.ursulin@sophos.com> wrote:

> Lived with it because there was no other option. We used LSM while it
> was available for modules but then it was taken away. 

... at which point you could have submitted your LSM module for
inclusion... you'd be the first (and only?) Anti Virus vendor that
would be in the mainline kernel.. speaking of competitive advantage,
coming out of the box in all distributions.

sadly this road hasn't been chosen....



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply

* Re: fanotify as syscalls
From: hch @ 2009-09-23 11:20 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Davide Libenzi, Andreas Gruenbacher, Jamie Lokier, Eric Paris,
	Linus Torvalds, Evgeniy Polyakov, David Miller,
	Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org,
	netdev@vger.kernel.org, viro@zeniv.linux.org.uk,
	alan@linux.intel.com, hch@infradead.org
In-Reply-To: <200909230939.34003.tvrtko.ursulin@sophos.com>

On Wed, Sep 23, 2009 at 09:39:33AM +0100, Tvrtko Ursulin wrote:
> Lived with it because there was no other option. We used LSM while it was 
> available for modules but then it was taken away. 
> 
> And not all vendors even use syscall interception, not even across platforms, 
> of which you sound so sure about. You can't even scan something which is not 
> in your namespace if you are at the syscall level. And you can't catch things 
> like kernel nfsd. No, syscall interception is not really appropriate at all.

The "Anti-Malware" industry is just snake oil anyway.  I think the
proper approach to support it is just to add various no-op exports claim
to do something and all the people requiring anti-virus on Linux will be
just as happy with it.


^ permalink raw reply

* Re: [PATCH][RESEND 3] IPv6: 6rd tunnel mode
From: Alexandre Cassen @ 2009-09-23 11:07 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: netdev
In-Reply-To: <20090923184314.a2a2701d.yoshfuji@linux-ipv6.org>

On Wed, 2009-09-23 at 18:43 +0900, YOSHIFUJI Hideaki wrote:
> Hello.
> 
> First of all, thank you for this work.
> 
> On Wed, 23 Sep 2009 00:02:51 +0200
> Alexandre Cassen <acassen@freebox.fr> wrote:
> 
> > This patch add support to 6rd tunnel mode as described into
> > draft-despres-6rd-03.
> > 
> > Patch history :
> > * http://patchwork.ozlabs.org/patch/26870/
> > * http://patchwork.ozlabs.org/patch/34026/
> > * http://patchwork.ozlabs.org/patch/34045/
> > 
> > IPv6 rapid deployment (draft-despres-6rd-03) builds upon mechanisms
> 
> Well, I was confused.  I think draft-softwire-ipv6-6rd
> is the latest one, no?

draft-despres-6rd-03    : targeting informational RFC
                          (=> currently pending)
draft-softwire-ipv6-6rd : targeting standard track

After the last IETF meeting, the previous draft-townsley-ipv6-6rd was pushed
to the IETF softwire WG.

So you are right, the ref should be set to draft-softwire-ipv6-6rd.

> Another comment is that we should combine 6to4
> and 6rd.

Completely agree. 6to4 is a special case of 6rd.

Okay, so I will stop producing (fixing according to the last comments) and
resending new patches for 6rd.

regs,
Alexandre


^ permalink raw reply

* Re: Resend: [PATCH] TCP Early Retransmit: reduce required dupacks for triggering fast retrans
From: Ilpo Järvinen @ 2009-09-23  9:58 UTC (permalink / raw)
  To: Christian Samsel, David Miller; +Cc: Netdev
In-Reply-To: <fab65f44d75b.4ab891f2@rwth-aachen.de>

On Tue, 22 Sep 2009, Christian Samsel wrote:

> This patch implements draft-ietf-tcpm-early-rexmt. The early retransmit 
> mechanism allows the transport to reduce the number of duplicate
> acknowledgments required to trigger a fast retransmission in case we
> don't expect enough dupacks, (e.g. because there are not enough
> packets inflight and nothing to send). This allows the transport to use
> fast retransmit to recover packet losses that would otherwise require
> a lengthy retransmission timeout.
> 
> See: http://tools.ietf.org/html/draft-ietf-tcpm-early-rexmt-01
> 
> Signed-off-by: Christian Samsel <christian.samsel@rwth-aachen.de>
> 
> ---
>  net/ipv4/tcp_input.c |   16 ++++++++++++++++
>  1 files changed, 16 insertions(+), 0 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index af6d6fa..c0cc4fd 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2913,6 +2913,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
>   int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
>                                      (tcp_fackets_out(tp) > tp->reordering));
>   int fast_rexmit = 0, mib_idx;
> + u32 in_flight;
>  
>   if (WARN_ON(!tp->packets_out && tp->sacked_out))
>           tp->sacked_out = 0;
> @@ -3062,6 +3063,21 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
>   if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
>           tcp_update_scoreboard(sk, fast_rexmit);
>   tcp_cwnd_down(sk, flag);
> +       
> +
> + /* draft-ietf-tcpm-early-rexmt: lowers dup ack threshold to prevent rto
> +         * in case we don't expect enough dup ack. if number of outstanding
> +         * packets is less than four and there is either no unsent data ready
> +         * for transmission or the advertised window does not permit new
> +         * segments.
> +         */
> + in_flight = tcp_packets_in_flight(tp);
> + if ( in_flight < 4 && (skb_queue_empty(&sk->sk_write_queue) ||
> +         tcp_may_send_now(sk) == 0) )
> +         tp->reordering = in_flight - 1;
> + else if (tp->reordering != sysctl_tcp_reordering)
> +         tp->reordering = sysctl_tcp_reordering;
> +
>   tcp_xmit_retransmit_queue(sk);
>  }

This is an entirely flawed approach; I'd recommend you start from 
scratch, as almost nothing of the current one is worth keeping (except 
parts of the comment). ...It will just not work for many cases; however, 
it's nice that you tried nevertheless.

First, the right place to change is tcp_time_to_recover(). Another 
thing you need is to add a min() into tcp_update_scoreboard(). Also, I don't 
think you should be touching tp->reordering at all to artificially lower 
the threshold for a period; just calculate the artificial value on the 
fly. And skb_queue_empty() is not doing what you want; in fact, I'm unsure 
what you want it to do in the first place. Instead of four, use 
tp->reordering. (I could have coded all that in a couple of minutes, in fact 
in less time than writing this mail, but it's more useful that you go to those 
places, learn and code that instead :-)).
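
To make "calculate the artificial value on the fly" concrete, here is a
minimal user-space sketch in C. It is not kernel code and not a tested
implementation: early_rexmit_thresh() and its parameters are hypothetical
names, "reordering" stands in for tp->reordering, and "can_send_new" stands
in for whatever check decides that new data may still be sent. Taking a
min() against the normal threshold means no hard-coded four is needed.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel function): number of duplicate ACKs
 * to require before entering fast retransmit.
 */
static unsigned int early_rexmit_thresh(unsigned int in_flight,
					unsigned int reordering,
					bool can_send_new)
{
	unsigned int thresh;

	/* New data would elicit further dupacks: keep the normal threshold. */
	if (can_send_new || in_flight == 0)
		return reordering;

	/* With one segment lost, at most in_flight - 1 dupacks can arrive. */
	thresh = in_flight - 1;
	if (thresh < 1)
		thresh = 1;

	/* min(): never raise the threshold above the normal one. */
	return thresh < reordering ? thresh : reordering;
}
```

With reordering at 3, two packets in flight give a threshold of 1, three
give 2, and four or more fall back to the normal 3, so nothing has to be
stored in (or restored to) tp->reordering.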

With all the cases that I know do not work with this _submitted_ version, 
I doubt that this is well tested, if at all. ...I hope you're not 
submitting somebody else's work without understanding at all what the code 
does and what it doesn't...?

Also, before starting, please go through what is written in 
Documentation/CodingStyle.

-- 
 i.

^ permalink raw reply

* Re: [PATCH][RESEND 3] IPv6: 6rd tunnel mode
From: YOSHIFUJI Hideaki @ 2009-09-23  9:43 UTC (permalink / raw)
  To: Alexandre Cassen; +Cc: yoshfuji, netdev
In-Reply-To: <20090922220251.GA22874@lnxos.staff.proxad.net>

Hello.

First of all, thank you for this work.

On Wed, 23 Sep 2009 00:02:51 +0200
Alexandre Cassen <acassen@freebox.fr> wrote:

> This patch add support to 6rd tunnel mode as described into
> draft-despres-6rd-03.
> 
> Patch history :
> * http://patchwork.ozlabs.org/patch/26870/
> * http://patchwork.ozlabs.org/patch/34026/
> * http://patchwork.ozlabs.org/patch/34045/
> 
> IPv6 rapid deployment (draft-despres-6rd-03) builds upon mechanisms

Well, I was confused.  I think draft-softwire-ipv6-6rd
is the latest one, no?

Another comment is that we should combine 6to4
and 6rd.

In fact, I've been taking care of it since I met
with Mark Townsley last week.  Here's my tentative
version for reference.

Several points:
 - based on latest version.
 - share code path with 6to4.

(If anyone can invent better bitops,
it would be a great help...)
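
As one possible take on the bitops, here is a user-space sketch in C that
pulls the embedded IPv4 bits out of a destination address through a 64-bit
window. It is only an illustration under stated assumptions: sixrd_v4_dst()
is a hypothetical name, the address is passed as four 32-bit words already
converted to host byte order, and the caller is assumed to have checked that
prefixlen + (32 - relay_prefixlen) <= 64 and that the destination matches
the 6rd prefix.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: v6[] holds the IPv6 destination as four 32-bit
 * words in host byte order; relay_prefix is the IPv4 relay prefix, also
 * in host byte order. Returns the IPv4 destination in host byte order.
 */
static uint32_t sixrd_v4_dst(const uint32_t v6[4], unsigned int prefixlen,
			     uint32_t relay_prefix,
			     unsigned int relay_prefixlen)
{
	unsigned int word = prefixlen >> 5;	/* word where the prefix ends */
	unsigned int bit = prefixlen & 0x1f;	/* bit offset inside that word */
	unsigned int nbits = 32 - relay_prefixlen; /* embedded IPv4 bits */
	uint64_t window;

	if (nbits == 0)
		return relay_prefix;

	/* Join the word holding the prefix tail with the following word. */
	window = (uint64_t)v6[word] << 32;
	if (word + 1 < 4)
		window |= v6[word + 1];

	/* Drop the prefix bits, then keep only the top nbits. */
	return relay_prefix | (uint32_t)((window << bit) >> (64 - nbits));
}
```

For plain 6to4 (prefix 2002::/16, relay_prefixlen 0), the destination
2002:c000:0201:: yields 0xc0000201, i.e. 192.0.2.1.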

Regards,

--yoshfuji

----
>From 7c82f67d361155a2e8ee831c66c9663617ae45bc Mon Sep 17 00:00:00 2001
From: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Date: Tue, 22 Sep 2009 16:29:54 +0900
Subject: [PATCH] ipv6 sit: 6rd (IPv6 Rapid Deployment) Support.

IPv6 Rapid Deployment (6rd; draft-ietf-softwire-ipv6-6rd) builds upon
mechanisms of 6to4 (RFC3056) to enable a service provider to rapidly
deploy IPv6 unicast service to IPv4 sites to which it provides
customer premise equipment.  Like 6to4, it utilizes stateless IPv6 in
IPv4 encapsulation in order to transit IPv4-only network
infrastructure.  Unlike 6to4, a 6rd service provider uses an IPv6
prefix of its own in place of the fixed 6to4 prefix.

With this option enabled, the SIT driver offers 6rd functionality by
providing an additional ioctl API to configure the IPv6 prefix used
instead of the static 2002::/16 of 6to4.

Original patch was done by Alexandre Cassen <acassen@freebox.fr>
based on old Internet-Draft.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 include/linux/if_tunnel.h |   11 ++++
 include/net/ipip.h        |   13 +++++
 net/ipv6/Kconfig          |   19 +++++++
 net/ipv6/sit.c            |  124 ++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 159 insertions(+), 8 deletions(-)

diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 5eb9b0f..cab4938 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -15,6 +15,10 @@
 #define SIOCADDPRL      (SIOCDEVPRIVATE + 5)
 #define SIOCDELPRL      (SIOCDEVPRIVATE + 6)
 #define SIOCCHGPRL      (SIOCDEVPRIVATE + 7)
+#define SIOCGET6RD      (SIOCDEVPRIVATE + 8)
+#define SIOCADD6RD      (SIOCDEVPRIVATE + 9)
+#define SIOCDEL6RD      (SIOCDEVPRIVATE + 10)
+#define SIOCCHG6RD      (SIOCDEVPRIVATE + 11)
 
 #define GRE_CSUM	__cpu_to_be16(0x8000)
 #define GRE_ROUTING	__cpu_to_be16(0x4000)
@@ -51,6 +55,13 @@ struct ip_tunnel_prl {
 /* PRL flags */
 #define	PRL_DEFAULT		0x0001
 
+struct ip_tunnel_6rd {
+	struct in6_addr		prefix;
+	__be32			relay_prefix;
+	__u16			prefixlen;
+	__u16			relay_prefixlen;
+};
+
 enum
 {
 	IFLA_GRE_UNSPEC,
diff --git a/include/net/ipip.h b/include/net/ipip.h
index 5d3036f..157be1c 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -7,6 +7,15 @@
 /* Keep error state on tunnel for 30 sec */
 #define IPTUNNEL_ERR_TIMEO	(30*HZ)
 
+/* 6rd prefix/relay information */
+struct ip_tunnel_6rd_parm
+{
+	struct in6_addr		prefix;
+	__be32			relay_prefix;
+	u16			prefixlen;
+	u16			relay_prefixlen;
+};
+
 struct ip_tunnel
 {
 	struct ip_tunnel	*next;
@@ -24,6 +33,10 @@ struct ip_tunnel
 
 	struct ip_tunnel_parm	parms;
 
+	/* for SIT */
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd_parm	ip6rd;
+#endif
 	struct ip_tunnel_prl_entry	*prl;		/* potential router list */
 	unsigned int			prl_count;	/* # of entries in PRL */
 };
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index ead6c7a..f561998 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -170,6 +170,25 @@ config IPV6_SIT
 
 	  Saying M here will produce a module called sit. If unsure, say Y.
 
+config IPV6_SIT_6RD
+	bool "IPv6: IPv6 Rapid Deployment (6RD) (EXPERIMENTAL)"
+	depends on IPV6_SIT && EXPERIMENTAL
+	default n
+	---help---
+	  IPv6 Rapid Deployment (6rd; draft-ietf-softwire-ipv6-6rd) builds upon
+	  mechanisms of 6to4 (RFC3056) to enable a service provider to rapidly
+	  deploy IPv6 unicast service to IPv4 sites to which it provides
+	  customer premise equipment.  Like 6to4, it utilizes stateless IPv6 in
+	  IPv4 encapsulation in order to transit IPv4-only network
+	  infrastructure.  Unlike 6to4, a 6rd service provider uses an IPv6
+	  prefix of its own in place of the fixed 6to4 prefix.
+
+	  With this option enabled, the SIT driver offers 6rd functionality by
+	  providing an additional ioctl API to configure the IPv6 prefix used
+	  instead of the static 2002::/16 of 6to4.
+
+	  If unsure, say N.
+
 config IPV6_NDISC_NODETYPE
 	bool
 
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0ae4f64..14bd503 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -162,6 +162,21 @@ static void ipip6_tunnel_link(struct sit_net *sitn, struct ip_tunnel *t)
 	write_unlock_bh(&ipip6_lock);
 }
 
+static void ipip6_tunnel_clone_6rd(struct ip_tunnel *t, struct sit_net *sitn)
+{
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (t->dev == sitn->fb_tunnel_dev) {
+		ipv6_addr_set(&t->ip6rd.prefix, htonl(0x20020000), 0, 0, 0);
+		t->ip6rd.relay_prefix = 0;
+		t->ip6rd.prefixlen = 16;
+		t->ip6rd.relay_prefixlen = 0;
+	} else {
+		struct ip_tunnel *t0 = netdev_priv(sitn->fb_tunnel_dev);
+		memcpy(&t->ip6rd, &t0->ip6rd, sizeof(t->ip6rd));
+	}
+#endif
+}
+
 static struct ip_tunnel * ipip6_tunnel_locate(struct net *net,
 		struct ip_tunnel_parm *parms, int create)
 {
@@ -214,6 +229,8 @@ static struct ip_tunnel * ipip6_tunnel_locate(struct net *net,
 
 	dev_hold(dev);
 
+	ipip6_tunnel_clone_6rd(nt, sitn);
+
 	ipip6_tunnel_link(sitn, nt);
 	return nt;
 
@@ -590,17 +607,41 @@ out:
 	return 0;
 }
 
-/* Returns the embedded IPv4 address if the IPv6 address
-   comes from 6to4 (RFC 3056) addr space */
-
-static inline __be32 try_6to4(struct in6_addr *v6dst)
+/*
+ * Returns the embedded IPv4 address if the IPv6 address
+ * comes from 6rd / 6to4 (RFC 3056) addr space.
+ */
+static inline
+__be32 try_6rd(struct in6_addr *v6dst, struct ip_tunnel *tunnel)
 {
 	__be32 dst = 0;
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (ipv6_prefix_equal(v6dst, &tunnel->ip6rd.prefix,
+			      tunnel->ip6rd.prefixlen)) {
+		unsigned pbw0, pbi0;
+		int pbi1;
+		u32 d;
+
+		pbw0 = tunnel->ip6rd.prefixlen >> 5;
+		pbi0 = tunnel->ip6rd.prefixlen & 0x1f;
+
+		d = (ntohl(v6dst->s6_addr32[pbw0]) << pbi0) >>
+		    tunnel->ip6rd.relay_prefixlen;
+
+		pbi1 = pbi0 - tunnel->ip6rd.relay_prefixlen;
+		if (pbi1 > 0)
+			d |= ntohl(v6dst->s6_addr32[pbw0 + 1]) >>
+			     (32 - pbi1);
+
+		dst = tunnel->ip6rd.relay_prefix | htonl(d);
+	}
+#else
 	if (v6dst->s6_addr16[0] == htons(0x2002)) {
 		/* 6to4 v6 addr has 16 bits prefix, 32 v4addr, 16 SLA, ... */
 		memcpy(&dst, &v6dst->s6_addr16[1], 4);
 	}
+#endif
 	return dst;
 }
 
@@ -658,7 +699,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 	}
 
 	if (!dst)
-		dst = try_6to4(&iph6->daddr);
+		dst = try_6rd(&iph6->daddr, tunnel);
 
 	if (!dst) {
 		struct neighbour *neigh = NULL;
@@ -851,9 +892,15 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	struct ip_tunnel *t;
 	struct net *net = dev_net(dev);
 	struct sit_net *sitn = net_generic(net, sit_net_id);
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
 
 	switch (cmd) {
 	case SIOCGETTUNNEL:
+#ifdef CONFIG_IPV6_SIT_6RD
+	case SIOCGET6RD:
+#endif
 		t = NULL;
 		if (dev == sitn->fb_tunnel_dev) {
 			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p))) {
@@ -864,9 +911,25 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		}
 		if (t == NULL)
 			t = netdev_priv(dev);
-		memcpy(&p, &t->parms, sizeof(p));
-		if (copy_to_user(ifr->ifr_ifru.ifru_data, &p, sizeof(p)))
-			err = -EFAULT;
+
+		err = -EFAULT;
+		if (cmd == SIOCGETTUNNEL) {
+			memcpy(&p, &t->parms, sizeof(p));
+			if (copy_to_user(ifr->ifr_ifru.ifru_data, &p,
+					 sizeof(p)))
+				goto done;
+#ifdef CONFIG_IPV6_SIT_6RD
+		} else {
+			ipv6_addr_copy(&ip6rd.prefix, &t->ip6rd.prefix);
+			ip6rd.relay_prefix = t->ip6rd.relay_prefix;
+			ip6rd.prefixlen = t->ip6rd.prefixlen;
+			ip6rd.relay_prefixlen = t->ip6rd.relay_prefixlen;
+			if (copy_to_user(ifr->ifr_ifru.ifru_data, &ip6rd,
+					 sizeof(ip6rd)))
+				goto done;
+#endif
+		}
+		err = 0;
 		break;
 
 	case SIOCADDTUNNEL:
@@ -987,6 +1050,51 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		netdev_state_change(dev);
 		break;
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	case SIOCADD6RD:
+	case SIOCCHG6RD:
+	case SIOCDEL6RD:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		err = -EFAULT;
+		if (copy_from_user(&ip6rd, ifr->ifr_ifru.ifru_data,
+				   sizeof(ip6rd)))
+			goto done;
+
+		t = netdev_priv(dev);
+
+		if (cmd != SIOCDEL6RD) {
+			struct in6_addr prefix;
+			__be32 relay_prefix;
+
+			err = -EINVAL;
+			if (ip6rd.relay_prefixlen > 32 ||
+			    ip6rd.prefixlen + (32 - ip6rd.relay_prefixlen) > 64)
+				goto done;
+
+			ipv6_addr_prefix(&prefix, &ip6rd.prefix,
+					 ip6rd.prefixlen);
+			if (!ipv6_addr_equal(&prefix, &ip6rd.prefix))
+				goto done;
+			relay_prefix = ip6rd.relay_prefix &
+				       htonl(0xffffffffUL <<
+					     (32 - ip6rd.relay_prefixlen));
+			if (relay_prefix != ip6rd.relay_prefix)
+				goto done;
+
+			ipv6_addr_copy(&t->ip6rd.prefix, &prefix);
+			t->ip6rd.relay_prefix = relay_prefix;
+			t->ip6rd.prefixlen = ip6rd.prefixlen;
+			t->ip6rd.relay_prefixlen = ip6rd.relay_prefixlen;
+		} else
+			ipip6_tunnel_clone_6rd(t, sitn);
+
+		err = 0;
+		break;
+#endif
+
 	default:
 		err = -EINVAL;
 	}
-- 
1.5.6.5


--yoshfuji

^ permalink raw reply related

* [PATCH] genetlink: fix netns vs. netlink table locking (2)
From: Johannes Berg @ 2009-09-23  9:34 UTC (permalink / raw)
  To: netdev

Similar to commit d136f1bd366fdb7e747ca7e0218171e7a00a98a5,
there's a bug when unregistering a generic netlink family,
which is caught by the might_sleep() added in that commit:

    BUG: sleeping function called from invalid context at net/netlink/af_netlink.c:183
    in_atomic(): 1, irqs_disabled(): 0, pid: 1510, name: rmmod
    2 locks held by rmmod/1510:
     #0:  (genl_mutex){+.+.+.}, at: [<ffffffff8138283b>] genl_unregister_family+0x2b/0x130
     #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8138270c>] __genl_unregister_mc_group+0x1c/0x120
    Pid: 1510, comm: rmmod Not tainted 2.6.31-wl #444
    Call Trace:
     [<ffffffff81044ff9>] __might_sleep+0x119/0x150
     [<ffffffff81380501>] netlink_table_grab+0x21/0x100
     [<ffffffff813813a3>] netlink_clear_multicast_users+0x23/0x60
     [<ffffffff81382761>] __genl_unregister_mc_group+0x71/0x120
     [<ffffffff81382866>] genl_unregister_family+0x56/0x130
     [<ffffffffa0007d85>] nl80211_exit+0x15/0x20 [cfg80211]
     [<ffffffffa000005a>] cfg80211_exit+0x1a/0x40 [cfg80211]

Fix in the same way by grabbing the netlink table lock
before doing rcu_read_lock().

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
---
 include/linux/netlink.h  |    1 +
 net/netlink/af_netlink.c |   19 +++++++++++--------
 net/netlink/genetlink.c  |    4 +++-
 3 files changed, 15 insertions(+), 9 deletions(-)

--- wireless-testing.orig/include/linux/netlink.h	2009-09-23 11:15:56.000000000 +0200
+++ wireless-testing/include/linux/netlink.h	2009-09-23 11:16:14.000000000 +0200
@@ -187,6 +187,7 @@ extern struct sock *netlink_kernel_creat
 extern void netlink_kernel_release(struct sock *sk);
 extern int __netlink_change_ngroups(struct sock *sk, unsigned int groups);
 extern int netlink_change_ngroups(struct sock *sk, unsigned int groups);
+extern void __netlink_clear_multicast_users(struct sock *sk, unsigned int group);
 extern void netlink_clear_multicast_users(struct sock *sk, unsigned int group);
 extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
 extern int netlink_has_listeners(struct sock *sk, unsigned int group);
--- wireless-testing.orig/net/netlink/af_netlink.c	2009-09-23 11:09:44.000000000 +0200
+++ wireless-testing/net/netlink/af_netlink.c	2009-09-23 11:14:52.000000000 +0200
@@ -1610,6 +1610,16 @@ int netlink_change_ngroups(struct sock *
 }
 EXPORT_SYMBOL(netlink_change_ngroups);
 
+void __netlink_clear_multicast_users(struct sock *ksk, unsigned int group)
+{
+	struct sock *sk;
+	struct hlist_node *node;
+	struct netlink_table *tbl = &nl_table[ksk->sk_protocol];
+
+	sk_for_each_bound(sk, node, &tbl->mc_list)
+		netlink_update_socket_mc(nlk_sk(sk), group, 0);
+}
+
 /**
  * netlink_clear_multicast_users - kick off multicast listeners
  *
@@ -1620,15 +1630,8 @@ EXPORT_SYMBOL(netlink_change_ngroups);
  */
 void netlink_clear_multicast_users(struct sock *ksk, unsigned int group)
 {
-	struct sock *sk;
-	struct hlist_node *node;
-	struct netlink_table *tbl = &nl_table[ksk->sk_protocol];
-
 	netlink_table_grab();
-
-	sk_for_each_bound(sk, node, &tbl->mc_list)
-		netlink_update_socket_mc(nlk_sk(sk), group, 0);
-
+	__netlink_clear_multicast_users(ksk, group);
 	netlink_table_ungrab();
 }
 EXPORT_SYMBOL(netlink_clear_multicast_users);
--- wireless-testing.orig/net/netlink/genetlink.c	2009-09-23 11:09:46.000000000 +0200
+++ wireless-testing/net/netlink/genetlink.c	2009-09-23 11:16:50.000000000 +0200
@@ -220,10 +220,12 @@ static void __genl_unregister_mc_group(s
 	struct net *net;
 	BUG_ON(grp->family != family);
 
+	netlink_table_grab();
 	rcu_read_lock();
 	for_each_net_rcu(net)
-		netlink_clear_multicast_users(net->genl_sock, grp->id);
+		__netlink_clear_multicast_users(net->genl_sock, grp->id);
 	rcu_read_unlock();
+	netlink_table_ungrab();
 
 	clear_bit(grp->id, mc_groups);
 	list_del(&grp->list);



^ permalink raw reply

* Re: fanotify as syscalls
From: Tvrtko Ursulin @ 2009-09-23  8:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Andreas Gruenbacher, Jamie Lokier, Eric Paris, Linus Torvalds,
	Evgeniy Polyakov, David Miller, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org, netdev@vger.kernel.org,
	viro@zeniv.linux.org.uk, alan@linux.intel.com, hch@infradead.org
In-Reply-To: <alpine.DEB.2.00.0909220836200.10460@makko.or.mcafeemobile.com>

On Tuesday 22 September 2009 17:04:44 Davide Libenzi wrote:
> On Tue, 22 Sep 2009, Andreas Gruenbacher wrote:
> > The fatal flaw of syscall interception is race conditions: you look up a
> > pathname in your interception layer; then when you call into the proper
> > syscall, the kernel again looks up the same pathname. There is no way to
> > guarantee that you end up at the same object in both lookups. The
> > security and fsnotify hooks are placed in the appropriate spots to avoid
> > exactly that.
>
> Fatal? You mean, for this corner case that the anti-malware industry lived
> with for so much time (in Linux and Windows), you're prepared in pushing
> all the logic that is currently implemented into their modules, into the
> kernel?

Lived with it because there was no other option. We used LSM while it was 
available for modules but then it was taken away. 

And not all vendors even use syscall interception, not even across platforms, 
of which you sound so sure about. You can't even scan something which is not 
in your namespace if you are at the syscall level. And you can't catch things 
like kernel nfsd. No, syscall interception is not really appropriate at all.

Tvrtko

^ permalink raw reply

* Re: [PATCH 1/3] iwmc3200top: Add Intel Wireless MultiCom 3200 top driver.
From: Johannes Berg @ 2009-09-23  7:34 UTC (permalink / raw)
  To: Tomas Winkler
  Cc: davem, linville, netdev, linux-wireless, linux-mmc, yi.zhu,
	inaky.perez-gonzalez, cindy.h.kao, guy.cohen, ron.rindjunsky
In-Reply-To: <1ba2fa240909230023v17fe2b49v4981d464dba469ed@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 510 bytes --]

On Wed, 2009-09-23 at 10:23 +0300, Tomas Winkler wrote:

> From HW perspective your assumption is not exactly correct. All the
> devices are visible on the SDIO bus but they are not operational
> (probe won't succeed) until TOP download the firmware and kicks the
> devices. From SW perspective to create another bus layer is an option.
> I'm not sure if it's not more complicated one.

Ah, ok, so it is quite different. Not sure how sdio probing works, so I
guess I can't say much here.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: [PATCH 1/3] iwmc3200top: Add Intel Wireless MultiCom 3200 top driver.
From: Tomas Winkler @ 2009-09-23  7:23 UTC (permalink / raw)
  To: Johannes Berg
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, linville-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA, yi.zhu-ral2JQCrhuEAvxtiuMwx3w,
	inaky.perez-gonzalez-ral2JQCrhuEAvxtiuMwx3w,
	cindy.h.kao-ral2JQCrhuEAvxtiuMwx3w,
	guy.cohen-ral2JQCrhuEAvxtiuMwx3w,
	ron.rindjunsky-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <1253689036.4458.22.camel-YfaajirXv2244ywRPIzf9A@public.gmane.org>

On Wed, Sep 23, 2009 at 9:57 AM, Johannes Berg
<johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> wrote:
> On Wed, 2009-09-23 at 02:38 +0300, Tomas Winkler wrote:
>
>> +config IWMC3200TOP
>> +        tristate "Intel Wireless MultiCom Top Driver"
>> +        depends on MMC && EXPERIMENTAL
>> +        select FW_LOADER
>> +     ---help---
>> +       Intel Wireless MultiCom 3200 Top driver is responsible for
>> +       for firmware load and enabled coms enumeration
>
> This seems like the wrong approach to me.
>
> To me, it seems like you have a device that contains an internal bus and
> allows bus enumeration. Typically, we would surface that bus in the
> driver/device model and allow sub-drivers to bind to that by way of
> exposing the internal bus, like e.g. drivers/ssb/.

From HW perspective your assumption is not exactly correct. All the
devices are visible on the SDIO bus but they are not operational
(probe won't succeed) until TOP download the firmware and kicks the
devices. From SW perspective to create another bus layer is an option.
I'm not sure if it's not more complicated one.

Thanks
Tomas
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/3] iwmc3200top: Add Intel Wireless MultiCom 3200 top driver.
From: Johannes Berg @ 2009-09-23  6:57 UTC (permalink / raw)
  To: Tomas Winkler
  Cc: davem, linville, netdev, linux-wireless, linux-mmc, yi.zhu,
	inaky.perez-gonzalez, cindy.h.kao, guy.cohen, ron.rindjunsky
In-Reply-To: <1253662724-16497-2-git-send-email-tomas.winkler@intel.com>

[-- Attachment #1: Type: text/plain, Size: 671 bytes --]

On Wed, 2009-09-23 at 02:38 +0300, Tomas Winkler wrote:

> +config IWMC3200TOP
> +        tristate "Intel Wireless MultiCom Top Driver"
> +        depends on MMC && EXPERIMENTAL
> +        select FW_LOADER
> +	---help---
> +	  Intel Wireless MultiCom 3200 Top driver is responsible for
> +	  for firmware load and enabled coms enumeration

This seems like the wrong approach to me.

To me, it seems like you have a device that contains an internal bus and
allows bus enumeration. Typically, we would surface that bus in the
driver/device model and allow sub-drivers to bind to that by way of
exposing the internal bus, like e.g. drivers/ssb/.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* r8169, enabling TX checksumming breaks things?
From: Denys Fedoryschenko @ 2009-09-23  6:15 UTC (permalink / raw)
  To: romieu, netdev

Hi

Is it expected that:
1) TX checksumming is off by default?
2) If I try to enable it with "ethtool -K eth0 tx on", TCP sessions on the
proxy get stuck, even though everything looks fine in tcpdump and packets
reach the destination? I don't understand the reason for the failure.
If this feature is not supposed to work, maybe the user should not be able to
turn it on?

Checksum OFF, connection established, no data received.
www.nuclearcat.com/files/r8169_tx_off.txt

Checksum ON, connection established, no data received.
www.nuclearcat.com/files/r8169_tx_on.txt

If required i can capture binary pcap files.

^ permalink raw reply

* Re: [PATCH][RESEND 3] IPv6: 6rd tunnel mode
From: Alexandre Cassen @ 2009-09-23  6:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <4AB94D2C.4000006@gmail.com>



On Wed, 23 Sep 2009, Eric Dumazet wrote:
>> +#ifdef CONFIG_IPV6_SIT_6RD
>> +	case SIOCGET6RD:
>> +		err = -EINVAL;
>> +		if (dev == sitn->fb_tunnel_dev)
>> +			goto done;
>> +		err = -ENOENT;
>> +		if (!(t = netdev_priv(dev)))
>> +			goto done;
>
>> +		memcpy(&ip6rd, &t->ip6rd_prefix, sizeof(ip6rd));
>
> Just wondering why you need a temporary ip6rd here,
> why dont you copy_to_user(ifr->ifr_ifru.ifru_data, &t->ip6rd_prefix, sizeof(ip6rd)); ?
>
>> +		if (copy_to_user(ifr->ifr_ifru.ifru_data, &ip6rd, sizeof(ip6rd)))
>> +			err = -EFAULT;
>> +		else
>> +			err = 0;
>> +		break;

agreed. will fix.

^ permalink raw reply

* pktgen: tricks
From: Stephen Hemminger @ 2009-09-23  5:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Robert Olsson; +Cc: netdev

I thought others might want to know how to get the maximum speed out of pktgen.

1. Run nothing else (even X11), just a command line
2. Make sure ethernet flow control is disabled
   ethtool -A eth0 autoneg off rx off tx off
3. Make sure clocksource is TSC.  On my old SMP Opterons I
   needed a patch, since in 2.6.30 or later the clock gurus
   decided to remove it on all non-Intel machines.  Look for the patch
   that enables "tsc=reliable"
4. Compile Ethernet drivers in; the overhead of the indirect
   function call required for modules (or cache footprint)
   slows things down.
5. Increase transmit ring size to 1000
   ethtool -G eth0 tx 1000

Result: OK: 70408581(c70405979+d2602) nsec, 100000000 (60byte,0frags)
  1420281pps 681Mb/sec (681734880bps) errors: 0
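
For anyone checking the arithmetic in that result line, here is a small C
sketch. The helper names are made up for illustration and are not part of
pktgen. Note the assumption: the pps figure only comes out right if the
elapsed time is read as microseconds, despite the "nsec" label.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: elapsed_usec is the elapsed time in microseconds. */
static uint64_t pktgen_pps(uint64_t packets, uint64_t elapsed_usec)
{
	return packets * 1000000ULL / elapsed_usec;
}

/* Hypothetical helper: bit rate counting pkt_bytes per packet. */
static uint64_t pktgen_bps(uint64_t pps, uint64_t pkt_bytes)
{
	return pps * 8 * pkt_bytes;
}
```

pktgen_pps(100000000, 70408581) gives 1420281, and pktgen_bps(1420281, 60)
gives 681734880, matching the pps and bps figures in the output.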

^ permalink raw reply

