Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: Jamie Lokier @ 2009-09-14  0:03 UTC (permalink / raw)
  To: jamal
  Cc: Eric Paris, David Miller, linux-kernel, linux-fsdevel, netdev,
	viro, alan, hch, balbir
In-Reply-To: <1252709567.25158.61.camel@dogo.mojatatu.com>

jamal wrote:
> On Fri, 2009-09-11 at 22:42 +0100, Jamie Lokier wrote:
> 
> > One of the uses of fanotify is as a security or auditing mechanism.
> > That can't tolerate gaps.
> > 
> > It's fundemantally different from inotify in one important respect:
> > inotify apps can recover from losing events by checking what they are
> > watching.
> > 
> > The fanotify application will know that it missed events, but what
> > happens to the other application which _caused_ those events?  Does it
> > get to do things it shouldn't, or hide them from the fanotify app, by
> > simply overloading the system?  Or the opposite, does it get access
> > denied - spurious file errors when the system is overloaded?
> > 
> > There's no way to handle that by dropping events.  A transport
> > mechanism can be dropped (say skbs), but the event itself has to be
> > kept, and then retried.
> > 
> >
> > Since you have to keep an event object around until it's handled,
> > there's no point tying it to an unreliable delivery mechanism which
> > you'd have to wrap a retry mechanism around.
> > 
> > In other words, that part of netlink is a poor match.  It would match
> > inotify much better.
> 
> Reliability is something that you should build in. Netlink provides you
> all the necessary tools. What you are asking for here is essentially 
> reliable multicasting.

Almost.  It's reliable multicasting plus unicast responses which must
be waited for.  That changes things.

> You dont have infinite memory, therefore there
> will be times when you will overload one of the users, and they wont
> have sufficient buffer space and then you have to retransmit.

If you have enough memory to remember _what_ to retransmit, then you
have enough memory to buffer a fixed-size message.  It just depends on
how you do the buffering.  To say netlink drops the message and you
can retry is just saying that the buffering is happening one step
earlier, before netlink.  That's what I mean by netlink being a
pointless complication for this, because you can just as easily write
code which gets to the message to userspace without going through
netlink and with no chance of it being dropped.

> Is the current proposed mechanism capable of reliably multicasting
> without need for retransmit?

Yes.  It uses positive acknowledge and flow control, because these
match naturally with what fanotify does at the next higher level.

The process generating the multicast (e.g. trying to write a file) is
blocked until the receiver gets the message, handles it and
acknowledges with a "yes you can" or "no you can't" response.

That's part of fanotify's design.  The pattern conveniently has no
issues with using unbounded memory for message, because the sending
process is blocked.

> > Speaking of skbs, how fast and compact are they for this?
> 
> They are largish relative to say if you trimmed down to basic necessity.
> But then you get a lot of the buffer management aspects for free.
> In this case, the concept of multicasting is built in so for one event
> to be sent to X users - you only need one skb.

True you only need one skb.  But netlink doesn't handle waiting for
positive acknowledge responses from every receiver, and combining
their value, does it?  You can't really take advantage of netlink's
built in multicast, because to known when it has all the responses,
the fanotify layer has to track the subscriber list itself anyway.

What I'm saying is perhaps skbs are useful for fanotify, but I don't
know that netlink's multicasting is useful.  But storing the messages
in skbs for transmission, and using parts of netlink to manage them,
and to provide some of the API, that might be useful.

> > Eric's explained that it would be normal for _every_ file operation on
> > some systems to trigger a fanotify event and possibly wait on the
> > response, or at least in major directory trees on the filesystem.
> > Even if it's just for the fanotify app to say "oh I don't care about
> > that file, carry on".
> > 
> 
> That doesnt sound very scalable. Should it not be you get nothing unless
> you register for interest in something?

You do get nothing unless you register interest.  The problem is
there's no way to register interest on just a subtree, so the fanotify
approach is let you register for events on the whole filesystem, and
let the userspace daemon filter paths.  At least it's decisions can be
cached, although I'm not sure how that works when multiple processes
want to monitor overlapping parts of the filesystem.

It doesn't sound scalable to me, either, and that's why I don't like
this part, and described a solution to monitoring subtrees - which
would also solve the problem for inotify.  (Both use fsnotify under
the hood, and that's where subtree notification would go).

Eric's mentioned interest in a way to monitor subtrees, but that
hasn't gone anywhere as far as I know.  He doesn't seem convinced by
my solution - or even that scalability will be an issue.  I think
there's a bit of vision lacking here, and I'll admit I'm more
interested in the inotify uses of fsnotify (being able to detect
changes) than the fanotify uses (being able to _block_ or _modify_
changes).  I think both inotify and fanotify ought to benefit from the
same improvements to file monitoring.

> > File performance is one of those things which really needs to be fast
> > for a good user experience - and it's not unusual to grep the odd
> > 10,000 files here or there (just think of what a kernel developer
> > does), or to replace a few thousand quickly (rpm/dpkg) and things like
> > that.
> > 
> 
> So grepping 10000 files would cause 10000 events? I am not sure how the
> scheme works; filtering of what events get delivered sounds more
> reasonable if it happens in the kernel.

I believe it would cause 10000 events, yes, even if they are files
that userspace policy is not interested in.  Eric, is that right?

However I believe after the first grep, subsequent greps' decisions
would be cached by marking the inodes.  I'm not sure what happens if
two fanotify monitors both try marking the inodes.

Arguably if a fanotify monitor is running before those files are in
page cache anyway, then I/O may dominate, and when the files are
cached, fanotify has already cached it's decisions in the kernel.
However fanotify is synchronous: each new file access involves a round
trip to the fanotify userspace and back before it can proceed, so
there's quite a lot of IPC and scheduling too.  Without testing, it's
hard to guess how it'll really perform.

> > While skbs and netlink aren't that slow, I suspect they're an order of
> > magnitude or two slower than, say, epoll or inotify at passing events
> > around.
> 
> not familiar with inotify.

inotify is like dnotify, and like a signal or epoll: a message that
something happened.  You register interest in individual files or
directories only, and inotify does not (yet) provide a way to monitor
the whole filesystem or a subtree.

fanotify is different: it provides access control, and can _refuse_
attempts to read file X, or even modify the file before permitting the
file to be read.

> Theres a difference between events which are abbreviated in the form
> "hey some read happened on fd you are listening on" vs "hey a read
> of file X for 16 bytes at offset 200 by process Y just occured while
> at the same time process Z was writting at offset 2000". The later
> (which netlink will give you) includes a lot more attribute details
> which could be filtered or can be extended to include a lot
> more. The former(what epoll will give you) is merely a signal.

Firstly, it's really hard to retain the ordering of userspace events
like that in a useful way, given the non-determinstic parallelism
going on with multiple processes doing I/O do the same file :-)

Second, you can't really pump messages with that much detail into
netlink and let _it_ filter them to userspace; that would be too much
processing.  You'd have to have some way of not generating that much
detail except when it's been requested, and preferably only for files
you want it for.

But this part is irrelevant to fanotify, because there's no plan or
intention to provide that much detail about I/O.

If you want, feel free to provide a stracenotify subsystem to track
everything in detail :-)

-- Jamie

^ permalink raw reply

* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: Jamie Lokier @ 2009-09-14  0:17 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Eric Paris, David Miller, linux-kernel, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <20090912094110.GB24709@ioremap.net>

Evgeniy Polyakov wrote:
> On Fri, Sep 11, 2009 at 05:51:42PM -0400, Eric Paris (eparis@redhat.com) wrote:
> > For some things yes, some things no.  I'd have to understand where loss
> > can happen to know if it's feasible.  If I know loss happens in the
> > sender context that's great.  If it's somewhere in the middle and the
> > sender doesn't immediately know it'll never be delivered, yes, I don't
> > think it can solve all my needs.  How many places can and skb get lost
> > between the sender and the receiver?
> 
> When queue is full or you do not have enough RAM. Both are reported at
> 'sending' time.

Can you ->poll() and wait reliably until the queue will accept an skb?
(A few spurious EAGAINs/ENOBUFs is ok, as long as it's not the norm).

-- Jamie

^ permalink raw reply

* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: Eric Paris @ 2009-09-14  1:26 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: jamal, David Miller, linux-kernel, linux-fsdevel, netdev, viro,
	alan, hch, balbir
In-Reply-To: <20090914000303.GA30621@shareable.org>

On Mon, 2009-09-14 at 01:03 +0100, Jamie Lokier wrote:
> jamal wrote:
> > On Fri, 2009-09-11 at 22:42 +0100, Jamie Lokier wrote:

First let me state the 2 main new things fanotify gives so neither is
lost.

#1 fanotify implements the same basic thing as inotify except rather
than an arbitrary number (inotify speak a watch descriptor) which
userspace has to somehow convert back into a file, fanotify gives
userspace an open file descriptor to that object.  (this is the part
that requires recv side processing)

#2 fanotify allows the userspace 'listener' or 'group' (I use group to
describe it's actions in kernel and listener to describe it's action in
userspace) to request that it be allowed to arbitrate open and access
(read) security decisions.

> > > Eric's explained that it would be normal for _every_ file operation on
> > > some systems to trigger a fanotify event and possibly wait on the
> > > response, or at least in major directory trees on the filesystem.
> > > Even if it's just for the fanotify app to say "oh I don't care about
> > > that file, carry on".
> > > 
> > 
> > That doesnt sound very scalable. Should it not be you get nothing unless
> > you register for interest in something?
> 
> You do get nothing unless you register interest.  The problem is
> there's no way to register interest on just a subtree, so the fanotify
> approach is let you register for events on the whole filesystem, and
> let the userspace daemon filter paths.  At least it's decisions can be
> cached, although I'm not sure how that works when multiple processes
> want to monitor overlapping parts of the filesystem.

fanotify provides 3 options to register.
1) this inode
2) this dir and it's children
3) all files on the whole fscking system

this patch only does #1 and #2.  After it's in I'm going to take a
serious look at #4 subtrees.

Responses to access decisions are cached and checked in kernel per
fanotify listener.  So listener 1 can ignore requests for a given inode
while listener 2 still gets notification and forces the original process
to block.

> It doesn't sound scalable to me, either, and that's why I don't like
> this part, and described a solution to monitoring subtrees - which
> would also solve the problem for inotify.  (Both use fsnotify under
> the hood, and that's where subtree notification would go).

Subtree checking hasn't seen and work from me but it is something I plan
to work on.  And one thing that makes me scared to tie myself to
syscalls when I already have something that works relatively cleanly and
easily.

> Eric's mentioned interest in a way to monitor subtrees, but that
> hasn't gone anywhere as far as I know.  He doesn't seem convinced by
> my solution - or even that scalability will be an issue.  I think
> there's a bit of vision lacking here, and I'll admit I'm more
> interested in the inotify uses of fsnotify (being able to detect
> changes) than the fanotify uses (being able to _block_ or _modify_
> changes).  I think both inotify and fanotify ought to benefit from the
> same improvements to file monitoring.

sort of agree with you here.  anything that gets added to support
subtrees would have to be in the generic code.  Although I question how
inotify could be used, as a wd is not (in my mind) a reasonable way to
tell userspace about files.  (and with subtrees it would be a wd and a
pathname....)    I think fanotify with notification only (what I'm
giving in this patch series is a much better fit for subtree
notification.

> > > File performance is one of those things which really needs to be fast
> > > for a good user experience - and it's not unusual to grep the odd
> > > 10,000 files here or there (just think of what a kernel developer
> > > does), or to replace a few thousand quickly (rpm/dpkg) and things like
> > > that.
> > > 
> > 
> > So grepping 10000 files would cause 10000 events? I am not sure how the
> > scheme works; filtering of what events get delivered sounds more
> > reasonable if it happens in the kernel.
> 
> I believe it would cause 10000 events, yes, even if they are files
> that userspace policy is not interested in.  Eric, is that right?

If fanotify wants it, yes, that's exactly what happens.

> However I believe after the first grep, subsequent greps' decisions
> would be cached by marking the inodes.  I'm not sure what happens if
> two fanotify monitors both try marking the inodes.

Each can mark individually.

> Arguably if a fanotify monitor is running before those files are in
> page cache anyway, then I/O may dominate, and when the files are
> cached, fanotify has already cached it's decisions in the kernel.
> However fanotify is synchronous: each new file access involves a round
> trip to the fanotify userspace and back before it can proceed, so
> there's quite a lot of IPC and scheduling too.  Without testing, it's
> hard to guess how it'll really perform.

As I recall my old old tests on a 32 way system showed around a 10%
performance penalty when building a kernel when making userspace
arbitrate decisions and the cache was blank.  So yes, there is a serious
performance hit to making a userspace application control access
decisions.  Then again, I'd rather not have those people who need this
system wide access controls to do it in the kernel (anti-malware
vendors)

I believe that people who chose to use this interface will have to
realize there is a severe up front performance penalty.  On a steady
state system like a web server you'd see near 0% performance (a new srcu
lock, inode->i_lock, and running a short list)  But yes, controling
access to every file on a system eats performance, that's the nature of
the beast.

> > Theres a difference between events which are abbreviated in the form
> > "hey some read happened on fd you are listening on" vs "hey a read
> > of file X for 16 bytes at offset 200 by process Y just occured while
> > at the same time process Z was writting at offset 2000". The later
> > (which netlink will give you) includes a lot more attribute details
> > which could be filtered or can be extended to include a lot
> > more. The former(what epoll will give you) is merely a signal.

> But this part is irrelevant to fanotify, because there's no plan or
> intention to provide that much detail about I/O.

We have ZERO plan to include ordering.  ZERO.  inotify sorta pretends it
deals with ordering by only dropping a notification if it is the same as
the last one in the queue.  fanotify will gladly merge events which
exist anywhere in the queue clearly throwing ordering to the wind.

We do plan to include the pid, uid, and gid of the process making the
original request.  We also plan to include the f_flags of the file in
the original process when possible.

^ permalink raw reply

* RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Xin, Xiaohui @ 2009-09-14  5:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Ira W. Snyder, netdev@vger.kernel.org,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, linux-mm@kvack.org,
	akpm@linux-foundation.org, hpa@zytor.com,
	gregory.haskins@gmail.com, Rusty Russell, s.hetze@linux-ag.com,
	avi@redhat.com
In-Reply-To: <20090913054610.GA4446@redhat.com>

>The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
>git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
>
>I expect them to be merged by 2.6.32-rc1 - right, Avi?

Michael,

I think I have the kernel patch for kvm_irqfd and kvm_ioeventfd, but missed the qemu side patch for irqfd and ioeventfd.

I met the compile error when I compiled virtio-pci.c file in qemu-kvm like this:

/root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:384: error: `KVM_IRQFD` undeclared (first use in this function)
/root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:400: error: `KVM_IOEVENTFD` undeclared (first use in this function)

Which qemu tree or patch do you use for kvm_irqfd and kvm_ioeventfd?

Thanks
Xiaohui

-----Original Message-----
From: Michael S. Tsirkin [mailto:mst@redhat.com] 
Sent: Sunday, September 13, 2009 1:46 PM
To: Xin, Xiaohui
Cc: Ira W. Snyder; netdev@vger.kernel.org; virtualization@lists.linux-foundation.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; mingo@elte.hu; linux-mm@kvack.org; akpm@linux-foundation.org; hpa@zytor.com; gregory.haskins@gmail.com; Rusty Russell; s.hetze@linux-ag.com; avi@redhat.com
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On Fri, Sep 11, 2009 at 11:17:33PM +0800, Xin, Xiaohui wrote:
> Michael,
> We are very interested in your patch and want to have a try with it.
> I have collected your 3 patches in kernel side and 4 patches in queue side.
> The patches are listed here:
> 
> PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
> PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
> PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch
> 
> PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
> PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
> PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
> PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch
> 
> I applied the kernel patches on v2.6.31-rc4 and the qemu patches on latest kvm qemu.
> But seems there are some patches are needed at least irqfd and ioeventfd patches on
> current qemu. I cannot create a kvm guest with "-net nic,model=virtio,vhost=vethX".
> 
> May you kindly advice us the patch lists all exactly to make it work?
> Thanks a lot. :-)
> 
> Thanks
> Xiaohui


The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git

I expect them to be merged by 2.6.32-rc1 - right, Avi?

-- 
MST

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-14  7:05 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: Ira W. Snyder, netdev@vger.kernel.org,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, linux-mm@kvack.org,
	akpm@linux-foundation.org, hpa@zytor.com,
	gregory.haskins@gmail.com, Rusty Russell, s.hetze@linux-ag.com,
	avi@redhat.com
In-Reply-To: <C85CEDA13AB1CF4D9D597824A86D2B9006AECB9FE7@PDSMSX501.ccr.corp.intel.com>

On Mon, Sep 14, 2009 at 01:57:06PM +0800, Xin, Xiaohui wrote:
> >The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
> >git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
> >
> >I expect them to be merged by 2.6.32-rc1 - right, Avi?
> 
> Michael,
> 
> I think I have the kernel patch for kvm_irqfd and kvm_ioeventfd, but missed the qemu side patch for irqfd and ioeventfd.
> 
> I met the compile error when I compiled virtio-pci.c file in qemu-kvm like this:
> 
> /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:384: error: `KVM_IRQFD` undeclared (first use in this function)
> /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:400: error: `KVM_IOEVENTFD` undeclared (first use in this function)
> 
> Which qemu tree or patch do you use for kvm_irqfd and kvm_ioeventfd?

I'm using the headers from upstream kernel.
I'll send a patch for that.

> Thanks
> Xiaohui
> 
> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com] 
> Sent: Sunday, September 13, 2009 1:46 PM
> To: Xin, Xiaohui
> Cc: Ira W. Snyder; netdev@vger.kernel.org; virtualization@lists.linux-foundation.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; mingo@elte.hu; linux-mm@kvack.org; akpm@linux-foundation.org; hpa@zytor.com; gregory.haskins@gmail.com; Rusty Russell; s.hetze@linux-ag.com; avi@redhat.com
> Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
> 
> On Fri, Sep 11, 2009 at 11:17:33PM +0800, Xin, Xiaohui wrote:
> > Michael,
> > We are very interested in your patch and want to have a try with it.
> > I have collected your 3 patches in kernel side and 4 patches in queue side.
> > The patches are listed here:
> > 
> > PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
> > PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
> > PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch
> > 
> > PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
> > PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
> > PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
> > PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch
> > 
> > I applied the kernel patches on v2.6.31-rc4 and the qemu patches on latest kvm qemu.
> > But seems there are some patches are needed at least irqfd and ioeventfd patches on
> > current qemu. I cannot create a kvm guest with "-net nic,model=virtio,vhost=vethX".
> > 
> > May you kindly advice us the patch lists all exactly to make it work?
> > Thanks a lot. :-)
> > 
> > Thanks
> > Xiaohui
> 
> 
> The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
> git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
> 
> I expect them to be merged by 2.6.32-rc1 - right, Avi?
> 
> -- 
> MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Or Gerlitz @ 2009-09-14  8:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paul Moore, David Miller, netdev, herbert, Or Gerlitz
In-Reply-To: <20090911053610.GA10324@redhat.com>

On 9/11/09, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Sep 11, 2009 at 12:17:27AM -0400, Paul Moore wrote:
>>> No comments on the code at this point - I'm just trying to understand the
>>> intended user right now which I'm assuming is the vhost-net bits you sent previously

> More specifically, vhost would then be patched with:
>
>  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>  index aeffb3a..b54f9d6 100644
>  --- a/drivers/vhost/net.c
>  +++ b/drivers/vhost/net.c
>  @@ -331,15 +331,26 @@ err:
>         return ERR_PTR(r);
>   }
>
>  +static struct socket *get_tun_socket(int fd)
>  +{
>  +       struct file *file = fget(fd);
>  +       if (!file)
>  +               return ERR_PTR(-EBADF);
>  +       return tun_get_socket(file);
>  +}
>  +
>   static struct socket *get_socket(int fd)

Michael,

your latest posting "[PATCHv5 3/3] vhost_net: a kernel-level virtio
server" doesn't have a function named get_socket, so I don't see how
one can really get what you are up with this snip from vhost/net.c

Or.

>   {
>         struct socket *sock;
>         sock = get_raw_socket(fd);
>
>         if (!IS_ERR(sock))
>                 return sock;
>
> +       sock = get_tun_socket(fd);
>  +       if (!IS_ERR(sock))
>  +               return sock;
>         return ERR_PTR(-ENOTSOCK);
>   }
>
>   static long vhost_net_set_socket(struct vhost_net *n, int fd)
>   {
>         struct socket *sock, *oldsock = NULL;
>

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Or Gerlitz @ 2009-09-14  8:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, m.s.tsirkin, netdev, herbert, Or Gerlitz
In-Reply-To: <20090910125929.GA32593@redhat.com>

On 9/10/09, Michael S. Tsirkin <mst@redhat.com> wrote:
>  This way, code using raw sockets to inject packets into a physical device,
>  can support injecting packets into host network stack almost without modification.
>  First user of this interface will be vhost virtualization accelerator.

vhost injects packet into physical device, what is the use case of
vhost injecting packet into the host network stack?

Or.

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Michael S. Tsirkin @ 2009-09-14  8:09 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: David Miller, netdev, herbert, Or Gerlitz
In-Reply-To: <15ddcffd0909140107m4d94f5abh5405074b654bd15d@mail.gmail.com>

On Mon, Sep 14, 2009 at 11:07:25AM +0300, Or Gerlitz wrote:
> On 9/10/09, Michael S. Tsirkin <mst@redhat.com> wrote:
> >  This way, code using raw sockets to inject packets into a physical device,
> >  can support injecting packets into host network stack almost without modification.
> >  First user of this interface will be vhost virtualization accelerator.
> 
> vhost injects packet into physical device, what is the use case of
> vhost injecting packet into the host network stack?

The case where the user requires bridging, typically.

> Or.

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Or Gerlitz @ 2009-09-14  8:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, netdev, herbert
In-Reply-To: <20090914080923.GC14030@redhat.com>

Michael S. Tsirkin wrote:
> On Mon, Sep 14, 2009 at 11:07:25AM +0300, Or Gerlitz wrote:
>   
>> vhost injects packets into physical device, what is the use case of vhost injecting packets into the host network stack?
>>     
> The case where the user requires bridging, typically.
>   
So you want to support a scheme where someone wants to attach vhost to a 
bridge? why not just telling these users to just set their vhost on top 
of veth couple, such that one veth device is added to a bridge as 
interface and a vhost instance is tied to the other veth device?


Or.


^ permalink raw reply

* [PATCH] pkt_sched: Fix qdisc_graft WRT ingress qdisc
From: Jarek Poplawski @ 2009-09-14  8:35 UTC (permalink / raw)
  To: David Miller; +Cc: Patrick McHardy, netdev

After the recent mq change using ingress qdisc overwrites dev->qdisc;
there is also a wrong old qdisc pointer passed to notify_and_destroy.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/sched/sch_api.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 3af1061..c5a6d62 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -693,13 +693,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 			if (new && i > 0)
 				atomic_inc(&new->refcnt);
 
-			qdisc_destroy(old);
+			if (!ingress)
+				qdisc_destroy(old);
 		}
 
-		notify_and_destroy(skb, n, classid, dev->qdisc, new);
-		if (new && !new->ops->attach)
-			atomic_inc(&new->refcnt);
-		dev->qdisc = new ? : &noop_qdisc;
+		if (!ingress) {
+			notify_and_destroy(skb, n, classid, dev->qdisc, new);
+			if (new && !new->ops->attach)
+				atomic_inc(&new->refcnt);
+			dev->qdisc = new ? : &noop_qdisc;
+		} else {
+			notify_and_destroy(skb, n, classid, old, new);
+		}
 
 		if (dev->flags & IFF_UP)
 			dev_activate(dev);

^ permalink raw reply related

* Re: igb bandwidth allocation configuration
From: Or Gerlitz @ 2009-09-14  8:42 UTC (permalink / raw)
  To: Simon Horman
  Cc: e1000-devel, netdev, Alexander Duyck, Kirsher, Jeffrey T,
	Or Gerlitz
In-Reply-To: <20090910081844.GA5421@verge.net.au>

On 9/10/09, Simon Horman <horms@verge.net.au> wrote:
> I have been looking into adding support the 82586's per-PF/VF bandwidth allocation to
> the igb driver. It seems that the trickiest part is working out how to expose things to
> user-space.

Please note that there are bunch (not many, but more then 1-2) of
things to configure from user space in a PF/VF scheme, and I think it
would be best if we do that through one mechanism, which may be
netlink based or extension to ethtool as you suggested, we've started
to discuss this on the "L2 switching in igb" thread

The "82576 SR-IOV Driver Companion Guide" document, section 7.6
mentions "Transmit Bandwidth Allocation to VFs... define minimum
transmit bandwidth for individual VMs".

I'm not clear if one can program rate limiter (upper bound) per VF or
actually rate guarantee per VF, even  with these being  just details
of specific device, alex, I would be happy if you can clarify that.

Or.

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Michael S. Tsirkin @ 2009-09-14  9:11 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: David Miller, netdev, herbert
In-Reply-To: <4AADFC0A.30305@voltaire.com>

On Mon, Sep 14, 2009 at 11:17:14AM +0300, Or Gerlitz wrote:
> Michael S. Tsirkin wrote:
>> On Mon, Sep 14, 2009 at 11:07:25AM +0300, Or Gerlitz wrote:
>>   
>>> vhost injects packets into physical device, what is the use case of vhost injecting packets into the host network stack?
>>>     
>> The case where the user requires bridging, typically.
>>   
> So you want to support a scheme where someone wants to attach vhost to a  
> bridge? why not just telling these users to just set their vhost on top  
> of veth couple, such that one veth device is added to a bridge as  
> interface and a vhost instance is tied to the other veth device?
>
>
> Or.

That's already possible. However virtualization users are familiar with
configuring the tun device, and tun has grown virtualization-specific
extensions, so I don't see a reason not to accomodate these uses.

-- 
MST

^ permalink raw reply

* Re: [PATCH RESEND] bonding: remap muticast addresses without using dev_close() and dev_open()
From: Moni Shoua @ 2009-09-14  9:38 UTC (permalink / raw)
  To: David Miller; +Cc: ogerlitz, fubar, jgunthorpe, netdev, bonding-devel
In-Reply-To: <20090911.123652.92869438.davem@davemloft.net>


> But unfortunately this patch doesn't apply to net-next-2.6 and
> you'll need to respin it against current sources so I can apply
> it, thanks.
The patch was written against linux-2.6 but I just pulled net-next-2.6 (git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git) and found out that besides offsets it applies well.
Is the URL above correct? Anyway, I add here the one that I took from net-next-2.6. Please let me know if it still doesn't apply. Thanks
-------------------------------------------------------------------------------

This patch fixes commit e36b9d16c6a6d0f59803b3ef04ff3c22c3844c10. The approach
there is to call dev_close()/dev_open() whenever the device type is changed in
order to remap the device IP multicast addresses to HW multicast addresses.
This approach suffers from 2 drawbacks:

*. It assumes tha the device is UP when calling dev_close(), or otherwise
   dev_close() has no affect. It is worth to mention that initscripts (Redhat)
   and sysconfig (Suse) doesn't act the same in this matter. 
*. dev_close() has other side affects, like deleting entries from the routing
   table, which might be unnecessary.

The fix here is to directly remap the IP multicast addresses to HW multicast
addresses for a bonding device that changes its type, and nothing else.
   
Reported-by:   Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Moni Shoua <monis@voltaire.com>
--
 drivers/net/bonding/bond_main.c |    9 ++++++---
 include/linux/igmp.h            |    2 ++
 include/linux/netdevice.h       |    3 ++-
 include/linux/notifier.h        |    2 ++
 include/net/addrconf.h          |    2 ++
 net/core/dev.c                  |    4 ++--
 net/ipv4/devinet.c              |    6 ++++++
 net/ipv4/igmp.c                 |   22 ++++++++++++++++++++++
 net/ipv6/addrconf.c             |   19 +++++++++++++++++++
 net/ipv6/mcast.c                |   19 +++++++++++++++++++
 10 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index a7e731f..6419cf9 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1211,7 +1211,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 			write_unlock_bh(&bond->curr_slave_lock);
 			read_unlock(&bond->lock);
 
-			netdev_bonding_change(bond->dev);
+			netdev_bonding_change(bond->dev, NETDEV_BONDING_FAILOVER);
 
 			read_lock(&bond->lock);
 			write_lock_bh(&bond->curr_slave_lock);
@@ -1469,14 +1469,17 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	 */
 	if (bond->slave_cnt == 0) {
 		if (bond_dev->type != slave_dev->type) {
-			dev_close(bond_dev);
 			pr_debug("%s: change device type from %d to %d\n",
 				bond_dev->name, bond_dev->type, slave_dev->type);
+
+			netdev_bonding_change(bond_dev, NETDEV_BONDING_OLDTYPE);
+
 			if (slave_dev->type != ARPHRD_ETHER)
 				bond_setup_by_slave(bond_dev, slave_dev);
 			else
 				ether_setup(bond_dev);
-			dev_open(bond_dev);
+
+			netdev_bonding_change(bond_dev, NETDEV_BONDING_NEWTYPE);
 		}
 	} else if (bond_dev->type != slave_dev->type) {
 		pr_err(DRV_NAME ": %s ether type (%d) is different "
diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 92fbd8c..fe158e0 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -233,6 +233,8 @@ extern void ip_mc_init_dev(struct in_device *);
 extern void ip_mc_destroy_dev(struct in_device *);
 extern void ip_mc_up(struct in_device *);
 extern void ip_mc_down(struct in_device *);
+extern void ip_mc_unmap(struct in_device *);
+extern void ip_mc_remap(struct in_device *);
 extern void ip_mc_dec_group(struct in_device *in_dev, __be32 addr);
 extern void ip_mc_inc_group(struct in_device *in_dev, __be32 addr);
 extern void ip_mc_rejoin_group(struct ip_mc_list *im);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 65ee192..f46db6c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1873,7 +1873,8 @@ extern void		__dev_addr_unsync(struct dev_addr_list **to, int *to_count, struct
 extern int		dev_set_promiscuity(struct net_device *dev, int inc);
 extern int		dev_set_allmulti(struct net_device *dev, int inc);
 extern void		netdev_state_change(struct net_device *dev);
-extern void		netdev_bonding_change(struct net_device *dev);
+extern void		netdev_bonding_change(struct net_device *dev,
+					      unsigned long event);
 extern void		netdev_features_change(struct net_device *dev);
 /* Load a device via the kmod */
 extern void		dev_load(struct net *net, const char *name);
diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 81bc252..44428d2 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -199,6 +199,8 @@ static inline int notifier_to_errno(int ret)
 #define NETDEV_FEAT_CHANGE	0x000B
 #define NETDEV_BONDING_FAILOVER 0x000C
 #define NETDEV_PRE_UP		0x000D
+#define NETDEV_BONDING_OLDTYPE  0x000E
+#define NETDEV_BONDING_NEWTYPE  0x000F
 
 #define SYS_DOWN	0x0001	/* Notify of system down */
 #define SYS_RESTART	SYS_DOWN
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 7b55ab2..0f7c378 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -143,6 +143,8 @@ extern int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr
 extern int ipv6_dev_mc_dec(struct net_device *dev, const struct in6_addr *addr);
 extern void ipv6_mc_up(struct inet6_dev *idev);
 extern void ipv6_mc_down(struct inet6_dev *idev);
+extern void ipv6_mc_unmap(struct inet6_dev *idev);
+extern void ipv6_mc_remap(struct inet6_dev *idev);
 extern void ipv6_mc_init_dev(struct inet6_dev *idev);
 extern void ipv6_mc_destroy_dev(struct inet6_dev *idev);
 extern void addrconf_dad_failure(struct inet6_ifaddr *ifp);
diff --git a/net/core/dev.c b/net/core/dev.c
index f843a0c..f6a4495 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1017,9 +1017,9 @@ void netdev_state_change(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_state_change);
 
-void netdev_bonding_change(struct net_device *dev)
+void netdev_bonding_change(struct net_device *dev, unsigned long event)
 {
-	call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev);
+	call_netdevice_notifiers(event, dev);
 }
 EXPORT_SYMBOL(netdev_bonding_change);
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 3863c3a..07336c6 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1087,6 +1087,12 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
 	case NETDEV_DOWN:
 		ip_mc_down(in_dev);
 		break;
+	case NETDEV_BONDING_OLDTYPE:
+		ip_mc_unmap(in_dev);
+		break;
+	case NETDEV_BONDING_NEWTYPE:
+		ip_mc_remap(in_dev);
+		break;
 	case NETDEV_CHANGEMTU:
 		if (inetdev_valid_mtu(dev->mtu))
 			break;
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 01b4284..d41e5de 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1298,6 +1298,28 @@ void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
 	}
 }
 
+/* Device changing type */
+
+void ip_mc_unmap(struct in_device *in_dev)
+{
+	struct ip_mc_list *i;
+
+	ASSERT_RTNL();
+
+	for (i = in_dev->mc_list; i; i = i->next)
+		igmp_group_dropped(i);
+}
+
+void ip_mc_remap(struct in_device *in_dev)
+{
+	struct ip_mc_list *i;
+
+	ASSERT_RTNL();
+
+	for (i = in_dev->mc_list; i; i = i->next)
+		igmp_group_added(i);
+}
+
 /* Device going down */
 
 void ip_mc_down(struct in_device *in_dev)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c9b3690..f216a41 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -137,6 +137,8 @@ static DEFINE_SPINLOCK(addrconf_verify_lock);
 static void addrconf_join_anycast(struct inet6_ifaddr *ifp);
 static void addrconf_leave_anycast(struct inet6_ifaddr *ifp);
 
+static void addrconf_bonding_change(struct net_device *dev,
+				    unsigned long event);
 static int addrconf_ifdown(struct net_device *dev, int how);
 
 static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags);
@@ -2582,6 +2584,10 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 				return notifier_from_errno(err);
 		}
 		break;
+	case NETDEV_BONDING_OLDTYPE:
+	case NETDEV_BONDING_NEWTYPE:
+		addrconf_bonding_change(dev, event);
+		break;
 	}
 
 	return NOTIFY_OK;
@@ -2595,6 +2601,19 @@ static struct notifier_block ipv6_dev_notf = {
 	.priority = 0
 };
 
+static void addrconf_bonding_change(struct net_device *dev, unsigned long event)
+{
+	struct inet6_dev *idev;
+	ASSERT_RTNL();
+
+	idev = __in6_dev_get(dev);
+
+	if (event == NETDEV_BONDING_NEWTYPE)
+		ipv6_mc_remap(idev);
+	else if (event == NETDEV_BONDING_OLDTYPE)
+		ipv6_mc_unmap(idev);
+}
+
 static int addrconf_ifdown(struct net_device *dev, int how)
 {
 	struct inet6_dev *idev;
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 71c3dac..f9fcf69 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -2249,6 +2249,25 @@ static void igmp6_timer_handler(unsigned long data)
 	ma_put(ma);
 }
 
+/* Device changing type */
+
+void ipv6_mc_unmap(struct inet6_dev *idev)
+{
+	struct ifmcaddr6 *i;
+
+	/* Install multicast list, except for all-nodes (already installed) */
+
+	read_lock_bh(&idev->lock);
+	for (i = idev->mc_list; i; i = i->next)
+		igmp6_group_dropped(i);
+	read_unlock_bh(&idev->lock);
+}
+
+void ipv6_mc_remap(struct inet6_dev *idev)
+{
+	ipv6_mc_up(idev);
+}
+
 /* Device going down */
 
 void ipv6_mc_down(struct inet6_dev *idev)

^ permalink raw reply related

* Re: bisect results of MSI-X related panic (help!)
From: Tejun Heo @ 2009-09-14  9:40 UTC (permalink / raw)
  To: Frans Pop; +Cc: Jesse Brandeburg, linux-kernel, netdev, Ingo Molnar
In-Reply-To: <200909120623.49764.elendil@planet.nl>

Frans Pop wrote:
> Jesse Brandeburg wrote:
>> I've bisected, here is my bisect log, problem is that the commit
>> identified is a merge commit, and *I don't know what to revert to test*.
>> It appears the parent of the merge:
>> 6e15cf04860074ad032e88c306bea656bbdd0f22 is marked good, but looks to be
>> in a possibly related area to the panic.
> 
> That merge does contain quite a few merge fixups, so it's quite possible 
> one of them is the cause of the failure.
> Maybe the simplest way to verify that is to compile both parents of the 
> merge to doublecheck that they work OK. Then, if a compile of the merge 
> itself is bad, the problem really is in the merge commit itself.
> 
> That commit is the "percpu" merge, so I've added Tejun (author of most of 
> that branch) and Ingo (merger) in CC.

Sorry, the oops doesn't ring a bell, well, not yet at least.  It would
be great if the bisection can be narrowed down more.

-- 
tejun

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Or Gerlitz @ 2009-09-14  9:43 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, netdev, herbert
In-Reply-To: <20090914091151.GE14030@redhat.com>

Michael S. Tsirkin wrote:
> That's already possible. However virtualization users are familiar with configuring the tun device, and tun has grown virtualization-specific extensions, so I don't see a reason not to accomodate these uses
Today packets are written/read from/to Qemu to/from tun device, how 
would the use case with vhost will look like?

Is this the user setting an uplink NIC + bridge + per VM tun device but 
the packets will go from/to virtio-net in the guest kernel to/from vhost 
in the host kernel and then from/to vhost to/from tun? so eventually no 
packets will be seen by the qemu process? I don't see what these scheme 
buys people, I got very much confused.

Or.

^ permalink raw reply

* Re: bisect results of MSI-X related panic (help!)
From: Tejun Heo @ 2009-09-14  9:43 UTC (permalink / raw)
  To: Frans Pop; +Cc: Jesse Brandeburg, linux-kernel, netdev, Ingo Molnar
In-Reply-To: <4AAE0F7B.5050203@kernel.org>

Tejun Heo wrote:
> Frans Pop wrote:
>> Jesse Brandeburg wrote:
>>> I've bisected, here is my bisect log, problem is that the commit
>>> identified is a merge commit, and *I don't know what to revert to test*.
>>> It appears the parent of the merge:
>>> 6e15cf04860074ad032e88c306bea656bbdd0f22 is marked good, but looks to be
>>> in a possibly related area to the panic.
>> That merge does contain quite a few merge fixups, so it's quite possible 
>> one of them is the cause of the failure.
>> Maybe the simplest way to verify that is to compile both parents of the 
>> merge to doublecheck that they work OK. Then, if a compile of the merge 
>> itself is bad, the problem really is in the merge commit itself.
>>
>> That commit is the "percpu" merge, so I've added Tejun (author of most of 
>> that branch) and Ingo (merger) in CC.
> 
> Sorry, the oops doesn't ring a bell, well, not yet at least.  It would
> be great if the bisection can be narrowed down more.

Also, building w/ debug option on, capturing more oops traces and
pasting gdb output of l *<oops address> might shed some more light.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: L2 switching in igb
From: Or Gerlitz @ 2009-09-14 10:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Kirsher, Jeffrey T, Fischer, Anna,
	netdev@vger.kernel.org, David Miller, Stephen Hemminger
In-Reply-To: <4AA937A1.9070504@intel.com>

Alexander Duyck wrote:
> You are correct, the vSwitch can basically do VEPA by disabling local 
> loopback enable bit in the DTXSWC register.  This would force all 
> traffic from the PF/VFs out the lan physical port and from the lan 
> physical port to the appropriate PF/VFs without doing any switching in 
> between PF/VFs.
To have VEPA support another bit has to be programmed... its the one 
that doesn't let the PF to forward a packet to a VF whose source mac 
matches the one in the packet (e.g multicast sender).

> add an rtnl_link_ops interface to handle vSwitch configuration that 
> could then be applied to the igb netdevs that support VEPA/vSwitch 
> technologies.  A subset of that interface could then be dedicated to 
> VF configuration to handle things such as spawning VFs, setting the 
> default mac addresses, security controls, etc.
Yes, lets do that. I'd like to suggest that a "VF programmable from user 
space" context  will contain a <mac, vlan-id, priority-bits, rate> 
tuple, such that in the absence of vlan tag, the VF driver will "sign" 
the packet (skb) with vlan-id and priority-bits assigned by the admin 
and the PF NIC will mandate that the VF originated traffic will not 
exceed the rate.

Or.


Or.


^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Michael S. Tsirkin @ 2009-09-14 10:10 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: David Miller, netdev, herbert
In-Reply-To: <4AAE1026.4090702@voltaire.com>

On Mon, Sep 14, 2009 at 12:43:02PM +0300, Or Gerlitz wrote:
> Michael S. Tsirkin wrote:
>> That's already possible. However virtualization users are familiar
>> with configuring the tun device, and tun has grown
>> virtualization-specific extensions, so I don't see a reason not to
>> accomodate these uses
> Today packets are written/read from/to Qemu to/from tun device, how  
> would the use case with vhost will look like?

- Configure bridge and tun using existing scripts
- pass tun fd to vhost via an ioctl
- vhost calls tun_get_socket
- from this point, guest networking just goes faster

> Is this the user setting an uplink NIC + bridge + per VM tun device but  
> the packets will go from/to virtio-net in the guest kernel to/from vhost  
> in the host kernel and then from/to vhost to/from tun? so eventually no  
> packets will be seen by the qemu process? I don't see what these scheme  
> buys people, I got very much confused.
>
> Or.

A lot of people have asked for tun support in vhost, because qemu
currently uses tun.  With this scheme existing code and scripts can be
used to configure both tun and bridge.  You also can utilize
virtualization-specific features in tun.

-- 
MST

^ permalink raw reply

* Re: [PATCH 13/19] uwb: convert to netdev_tx_t
From: David Vrabel @ 2009-09-14 10:47 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20090901055129.729527950-ZtmgI6mnKB3QT0dZR+AlfA@public.gmane.org>

Acked-by: David Vrabel <david.vrabel-kQvG35nSl+M@public.gmane.org>

David
-- 
David Vrabel, Senior Software Engineer, Drivers
CSR, Churchill House, Cambridge Business Park,  Tel: +44 (0)1223 692562
Cowley Road, Cambridge, CB4 0WZ                 http://www.csr.com/


Member of the CSR plc group of companies. CSR plc registered in England and Wales, registered number 4187346, registered office Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 1/4] RxRPC: Declare the security index constants symbolically
From: David Howells @ 2009-09-14 11:17 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-afs, netdev, David Howells

Declare the security index constants symbolically rather than just referring
to them numerically.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/rxrpc.h |    7 +++++++
 net/rxrpc/ar-key.c    |    4 ++--
 net/rxrpc/rxkad.c     |    6 +++---
 3 files changed, 12 insertions(+), 5 deletions(-)


diff --git a/include/linux/rxrpc.h b/include/linux/rxrpc.h
index f7b826b..a53915c 100644
--- a/include/linux/rxrpc.h
+++ b/include/linux/rxrpc.h
@@ -58,5 +58,12 @@ struct sockaddr_rxrpc {
 #define RXRPC_SECURITY_AUTH	1	/* authenticated packets */
 #define RXRPC_SECURITY_ENCRYPT	2	/* encrypted packets */
 
+/*
+ * RxRPC security indices
+ */
+#define RXRPC_SECURITY_NONE	0	/* no security protocol */
+#define RXRPC_SECURITY_RXKAD	2	/* kaserver or kerberos 4 */
+#define RXRPC_SECURITY_RXGK	4	/* gssapi-based */
+#define RXRPC_SECURITY_RXK5	5	/* kerberos 5 */
 
 #endif /* _LINUX_RXRPC_H */
diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index ad8c7a7..b3d10e7 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -122,7 +122,7 @@ static int rxrpc_instantiate(struct key *key, const void *data, size_t datalen)
 		       tsec->ticket[6], tsec->ticket[7]);
 
 	ret = -EPROTONOSUPPORT;
-	if (tsec->security_index != 2)
+	if (tsec->security_index != RXRPC_SECURITY_RXKAD)
 		goto error;
 
 	key->type_data.x[0] = tsec->security_index;
@@ -308,7 +308,7 @@ int rxrpc_get_server_data_key(struct rxrpc_connection *conn,
 	_debug("key %d", key_serial(key));
 
 	data.kver = 1;
-	data.tsec.security_index = 2;
+	data.tsec.security_index = RXRPC_SECURITY_RXKAD;
 	data.tsec.ticket_len = 0;
 	data.tsec.expiry = expiry;
 	data.tsec.kvno = 0;
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index ef8f910..acec762 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -42,7 +42,7 @@ struct rxkad_level2_hdr {
 	__be32	checksum;	/* decrypted data checksum */
 };
 
-MODULE_DESCRIPTION("RxRPC network protocol type-2 security (Kerberos)");
+MODULE_DESCRIPTION("RxRPC network protocol type-2 security (Kerberos 4)");
 MODULE_AUTHOR("Red Hat, Inc.");
 MODULE_LICENSE("GPL");
 
@@ -506,7 +506,7 @@ static int rxkad_verify_packet(const struct rxrpc_call *call,
 	if (!call->conn->cipher)
 		return 0;
 
-	if (sp->hdr.securityIndex != 2) {
+	if (sp->hdr.securityIndex != RXRPC_SECURITY_RXKAD) {
 		*_abort_code = RXKADINCONSISTENCY;
 		_leave(" = -EPROTO [not rxkad]");
 		return -EPROTO;
@@ -1122,7 +1122,7 @@ static void rxkad_clear(struct rxrpc_connection *conn)
 static struct rxrpc_security rxkad = {
 	.owner				= THIS_MODULE,
 	.name				= "rxkad",
-	.security_index			= RXKAD_VERSION,
+	.security_index			= RXRPC_SECURITY_RXKAD,
 	.init_connection_security	= rxkad_init_connection_security,
 	.prime_packet_security		= rxkad_prime_packet_security,
 	.secure_packet			= rxkad_secure_packet,


^ permalink raw reply related

* [PATCH 2/4] RxRPC: Allow key payloads to be passed in XDR form
From: David Howells @ 2009-09-14 11:17 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-afs, netdev, David Howells
In-Reply-To: <20090914111730.10233.44007.stgit@warthog.procyon.org.uk>

Allow add_key() and KEYCTL_INSTANTIATE to accept key payloads in XDR form as
described by openafs-1.4.10/src/auth/afs_token.xg.  This provides a way of
passing kaserver, Kerberos 4, Kerberos 5 and GSSAPI keys from userspace, and
allows for future expansion.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/keys/rxrpc-type.h |   55 ++++++++
 net/rxrpc/ar-internal.h   |   16 --
 net/rxrpc/ar-key.c        |  308 +++++++++++++++++++++++++++++++++++++++------
 net/rxrpc/ar-security.c   |    8 +
 net/rxrpc/rxkad.c         |   41 +++---
 5 files changed, 353 insertions(+), 75 deletions(-)


diff --git a/include/keys/rxrpc-type.h b/include/keys/rxrpc-type.h
index 7609365..c0d9121 100644
--- a/include/keys/rxrpc-type.h
+++ b/include/keys/rxrpc-type.h
@@ -21,4 +21,59 @@ extern struct key_type key_type_rxrpc;
 
 extern struct key *rxrpc_get_null_key(const char *);
 
+/*
+ * RxRPC key for Kerberos IV (type-2 security)
+ */
+struct rxkad_key {
+	u32	vice_id;
+	u32	start;			/* time at which ticket starts */
+	u32	expiry;			/* time at which ticket expires */
+	u32	kvno;			/* key version number */
+	u8	primary_flag;		/* T if key for primary cell for this user */
+	u16	ticket_len;		/* length of ticket[] */
+	u8	session_key[8];		/* DES session key */
+	u8	ticket[0];		/* the encrypted ticket */
+};
+
+/*
+ * list of tokens attached to an rxrpc key
+ */
+struct rxrpc_key_token {
+	u16	security_index;		/* RxRPC header security index */
+	struct rxrpc_key_token *next;	/* the next token in the list */
+	union {
+		struct rxkad_key *kad;
+	};
+};
+
+/*
+ * structure of raw payloads passed to add_key() or instantiate key
+ */
+struct rxrpc_key_data_v1 {
+	u32		kif_version;		/* 1 */
+	u16		security_index;
+	u16		ticket_length;
+	u32		expiry;			/* time_t */
+	u32		kvno;
+	u8		session_key[8];
+	u8		ticket[0];
+};
+
+/*
+ * AF_RXRPC key payload derived from XDR format
+ * - based on openafs-1.4.10/src/auth/afs_token.xg
+ */
+#define AFSTOKEN_LENGTH_MAX		16384	/* max payload size */
+#define AFSTOKEN_CELL_MAX		64	/* max cellname length */
+#define AFSTOKEN_MAX			8	/* max tokens per payload */
+#define AFSTOKEN_RK_TIX_MAX		12000	/* max RxKAD ticket size */
+#define AFSTOKEN_GK_KEY_MAX		64	/* max GSSAPI key size */
+#define AFSTOKEN_GK_TOKEN_MAX		16384	/* max GSSAPI token size */
+#define AFSTOKEN_K5_COMPONENTS_MAX	16	/* max K5 components */
+#define AFSTOKEN_K5_NAME_MAX		128	/* max K5 name length */
+#define AFSTOKEN_K5_REALM_MAX		64	/* max K5 realm name length */
+#define AFSTOKEN_K5_TIX_MAX		16384	/* max K5 ticket size */
+#define AFSTOKEN_K5_ADDRESSES_MAX	16	/* max K5 addresses */
+#define AFSTOKEN_K5_AUTHDATA_MAX	16	/* max K5 pieces of auth data */
+
 #endif /* _KEYS_RXRPC_TYPE_H */
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 3e7318c..46c6d88 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -402,22 +402,6 @@ struct rxrpc_call {
 };
 
 /*
- * RxRPC key for Kerberos (type-2 security)
- */
-struct rxkad_key {
-	u16	security_index;		/* RxRPC header security index */
-	u16	ticket_len;		/* length of ticket[] */
-	u32	expiry;			/* time at which expires */
-	u32	kvno;			/* key version number */
-	u8	session_key[8];		/* DES session key */
-	u8	ticket[0];		/* the encrypted ticket */
-};
-
-struct rxrpc_key_payload {
-	struct rxkad_key k;
-};
-
-/*
  * locally abort an RxRPC call
  */
 static inline void rxrpc_abort_call(struct rxrpc_call *call, u32 abort_code)
diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index b3d10e7..a3a7acb 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -17,6 +17,7 @@
 #include <linux/skbuff.h>
 #include <linux/key-type.h>
 #include <linux/crypto.h>
+#include <linux/ctype.h>
 #include <net/sock.h>
 #include <net/af_rxrpc.h>
 #include <keys/rxrpc-type.h>
@@ -55,6 +56,202 @@ struct key_type key_type_rxrpc_s = {
 };
 
 /*
+ * parse an RxKAD type XDR format token
+ * - the caller guarantees we have at least 4 words
+ */
+static int rxrpc_instantiate_xdr_rxkad(struct key *key, const __be32 *xdr,
+				       unsigned toklen)
+{
+	struct rxrpc_key_token *token;
+	size_t plen;
+	u32 tktlen;
+	int ret;
+
+	_enter(",{%x,%x,%x,%x},%u",
+	       ntohl(xdr[0]), ntohl(xdr[1]), ntohl(xdr[2]), ntohl(xdr[3]),
+	       toklen);
+
+	if (toklen <= 8 * 4)
+		return -EKEYREJECTED;
+	tktlen = ntohl(xdr[7]);
+	_debug("tktlen: %x", tktlen);
+	if (tktlen > AFSTOKEN_RK_TIX_MAX)
+		return -EKEYREJECTED;
+	if (8 * 4 + tktlen != toklen)
+		return -EKEYREJECTED;
+
+	plen = sizeof(*token) + sizeof(*token->kad) + tktlen;
+	ret = key_payload_reserve(key, key->datalen + plen);
+	if (ret < 0)
+		return ret;
+
+	plen -= sizeof(*token);
+	token = kmalloc(sizeof(*token), GFP_KERNEL);
+	if (!token)
+		return -ENOMEM;
+
+	token->kad = kmalloc(plen, GFP_KERNEL);
+	if (!token->kad) {
+		kfree(token);
+		return -ENOMEM;
+	}
+
+	token->security_index	= RXRPC_SECURITY_RXKAD;
+	token->kad->ticket_len	= tktlen;
+	token->kad->vice_id	= ntohl(xdr[0]);
+	token->kad->kvno	= ntohl(xdr[1]);
+	token->kad->start	= ntohl(xdr[4]);
+	token->kad->expiry	= ntohl(xdr[5]);
+	token->kad->primary_flag = ntohl(xdr[6]);
+	memcpy(&token->kad->session_key, &xdr[2], 8);
+	memcpy(&token->kad->ticket, &xdr[8], tktlen);
+
+	_debug("SCIX: %u", token->security_index);
+	_debug("TLEN: %u", token->kad->ticket_len);
+	_debug("EXPY: %x", token->kad->expiry);
+	_debug("KVNO: %u", token->kad->kvno);
+	_debug("PRIM: %u", token->kad->primary_flag);
+	_debug("SKEY: %02x%02x%02x%02x%02x%02x%02x%02x",
+	       token->kad->session_key[0], token->kad->session_key[1],
+	       token->kad->session_key[2], token->kad->session_key[3],
+	       token->kad->session_key[4], token->kad->session_key[5],
+	       token->kad->session_key[6], token->kad->session_key[7]);
+	if (token->kad->ticket_len >= 8)
+		_debug("TCKT: %02x%02x%02x%02x%02x%02x%02x%02x",
+		       token->kad->ticket[0], token->kad->ticket[1],
+		       token->kad->ticket[2], token->kad->ticket[3],
+		       token->kad->ticket[4], token->kad->ticket[5],
+		       token->kad->ticket[6], token->kad->ticket[7]);
+
+	/* count the number of tokens attached */
+	key->type_data.x[0]++;
+
+	/* attach the data */
+	token->next = key->payload.data;
+	key->payload.data = token;
+	if (token->kad->expiry < key->expiry)
+		key->expiry = token->kad->expiry;
+
+	_leave(" = 0");
+	return 0;
+}
+
+/*
+ * attempt to parse the data as the XDR format
+ * - the caller guarantees we have more than 7 words
+ */
+static int rxrpc_instantiate_xdr(struct key *key, const void *data, size_t datalen)
+{
+	const __be32 *xdr = data, *token;
+	const char *cp;
+	unsigned len, tmp, loop, ntoken, toklen, sec_ix;
+	int ret;
+
+	_enter(",{%x,%x,%x,%x},%zu",
+	       ntohl(xdr[0]), ntohl(xdr[1]), ntohl(xdr[2]), ntohl(xdr[3]),
+	       datalen);
+
+	if (datalen > AFSTOKEN_LENGTH_MAX)
+		goto not_xdr;
+
+	/* XDR is an array of __be32's */
+	if (datalen & 3)
+		goto not_xdr;
+
+	/* the flags should be 0 (the setpag bit must be handled by
+	 * userspace) */
+	if (ntohl(*xdr++) != 0)
+		goto not_xdr;
+	datalen -= 4;
+
+	/* check the cell name */
+	len = ntohl(*xdr++);
+	if (len < 1 || len > AFSTOKEN_CELL_MAX)
+		goto not_xdr;
+	datalen -= 4;
+	tmp = (len + 3) & ~3;
+	if (tmp > datalen)
+		goto not_xdr;
+
+	cp = (const char *) xdr;
+	for (loop = 0; loop < len; loop++)
+		if (!isprint(cp[loop]))
+			goto not_xdr;
+	if (len < tmp)
+		for (; loop < tmp; loop++)
+			if (cp[loop])
+				goto not_xdr;
+	_debug("cellname: [%u/%u] '%*.*s'",
+	       len, tmp, len, len, (const char *) xdr);
+	datalen -= tmp;
+	xdr += tmp >> 2;
+
+	/* get the token count */
+	if (datalen < 12)
+		goto not_xdr;
+	ntoken = ntohl(*xdr++);
+	datalen -= 4;
+	_debug("ntoken: %x", ntoken);
+	if (ntoken < 1 || ntoken > AFSTOKEN_MAX)
+		goto not_xdr;
+
+	/* check each token wrapper */
+	token = xdr;
+	loop = ntoken;
+	do {
+		if (datalen < 8)
+			goto not_xdr;
+		toklen = ntohl(*xdr++);
+		sec_ix = ntohl(*xdr);
+		datalen -= 4;
+		_debug("token: [%x/%zx] %x", toklen, datalen, sec_ix);
+		if (toklen < 20 || toklen > datalen)
+			goto not_xdr;
+		datalen -= (toklen + 3) & ~3;
+		xdr += (toklen + 3) >> 2;
+
+	} while (--loop > 0);
+
+	_debug("remainder: %zu", datalen);
+	if (datalen != 0)
+		goto not_xdr;
+
+	/* okay: we're going to assume it's valid XDR format
+	 * - we ignore the cellname, relying on the key to be correctly named
+	 */
+	do {
+		xdr = token;
+		toklen = ntohl(*xdr++);
+		token = xdr + ((toklen + 3) >> 2);
+		sec_ix = ntohl(*xdr++);
+		toklen -= 4;
+
+		switch (sec_ix) {
+		case RXRPC_SECURITY_RXKAD:
+			ret = rxrpc_instantiate_xdr_rxkad(key, xdr, toklen);
+			if (ret != 0)
+				goto error;
+			break;
+
+		default:
+			ret = -EPROTONOSUPPORT;
+			goto error;
+		}
+
+	} while (--ntoken > 0);
+
+	_leave(" = 0");
+	return 0;
+
+not_xdr:
+	_leave(" = -EPROTO");
+	return -EPROTO;
+error:
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
  * instantiate an rxrpc defined key
  * data should be of the form:
  *	OFFSET	LEN	CONTENT
@@ -70,8 +267,8 @@ struct key_type key_type_rxrpc_s = {
  */
 static int rxrpc_instantiate(struct key *key, const void *data, size_t datalen)
 {
-	const struct rxkad_key *tsec;
-	struct rxrpc_key_payload *upayload;
+	const struct rxrpc_key_data_v1 *v1;
+	struct rxrpc_key_token *token, **pp;
 	size_t plen;
 	u32 kver;
 	int ret;
@@ -82,6 +279,13 @@ static int rxrpc_instantiate(struct key *key, const void *data, size_t datalen)
 	if (!data && datalen == 0)
 		return 0;
 
+	/* determine if the XDR payload format is being used */
+	if (datalen > 7 * 4) {
+		ret = rxrpc_instantiate_xdr(key, data, datalen);
+		if (ret != -EPROTO)
+			return ret;
+	}
+
 	/* get the key interface version number */
 	ret = -EINVAL;
 	if (datalen <= 4 || !data)
@@ -98,53 +302,67 @@ static int rxrpc_instantiate(struct key *key, const void *data, size_t datalen)
 
 	/* deal with a version 1 key */
 	ret = -EINVAL;
-	if (datalen < sizeof(*tsec))
+	if (datalen < sizeof(*v1))
 		goto error;
 
-	tsec = data;
-	if (datalen != sizeof(*tsec) + tsec->ticket_len)
+	v1 = data;
+	if (datalen != sizeof(*v1) + v1->ticket_length)
 		goto error;
 
-	_debug("SCIX: %u", tsec->security_index);
-	_debug("TLEN: %u", tsec->ticket_len);
-	_debug("EXPY: %x", tsec->expiry);
-	_debug("KVNO: %u", tsec->kvno);
+	_debug("SCIX: %u", v1->security_index);
+	_debug("TLEN: %u", v1->ticket_length);
+	_debug("EXPY: %x", v1->expiry);
+	_debug("KVNO: %u", v1->kvno);
 	_debug("SKEY: %02x%02x%02x%02x%02x%02x%02x%02x",
-	       tsec->session_key[0], tsec->session_key[1],
-	       tsec->session_key[2], tsec->session_key[3],
-	       tsec->session_key[4], tsec->session_key[5],
-	       tsec->session_key[6], tsec->session_key[7]);
-	if (tsec->ticket_len >= 8)
+	       v1->session_key[0], v1->session_key[1],
+	       v1->session_key[2], v1->session_key[3],
+	       v1->session_key[4], v1->session_key[5],
+	       v1->session_key[6], v1->session_key[7]);
+	if (v1->ticket_length >= 8)
 		_debug("TCKT: %02x%02x%02x%02x%02x%02x%02x%02x",
-		       tsec->ticket[0], tsec->ticket[1],
-		       tsec->ticket[2], tsec->ticket[3],
-		       tsec->ticket[4], tsec->ticket[5],
-		       tsec->ticket[6], tsec->ticket[7]);
+		       v1->ticket[0], v1->ticket[1],
+		       v1->ticket[2], v1->ticket[3],
+		       v1->ticket[4], v1->ticket[5],
+		       v1->ticket[6], v1->ticket[7]);
 
 	ret = -EPROTONOSUPPORT;
-	if (tsec->security_index != RXRPC_SECURITY_RXKAD)
+	if (v1->security_index != RXRPC_SECURITY_RXKAD)
 		goto error;
 
-	key->type_data.x[0] = tsec->security_index;
-
-	plen = sizeof(*upayload) + tsec->ticket_len;
-	ret = key_payload_reserve(key, plen);
+	plen = sizeof(*token->kad) + v1->ticket_length;
+	ret = key_payload_reserve(key, plen + sizeof(*token));
 	if (ret < 0)
 		goto error;
 
 	ret = -ENOMEM;
-	upayload = kmalloc(plen, GFP_KERNEL);
-	if (!upayload)
+	token = kmalloc(sizeof(*token), GFP_KERNEL);
+	if (!token)
 		goto error;
+	token->kad = kmalloc(plen, GFP_KERNEL);
+	if (!token->kad)
+		goto error_free;
+
+	token->security_index		= RXRPC_SECURITY_RXKAD;
+	token->kad->ticket_len		= v1->ticket_length;
+	token->kad->expiry		= v1->expiry;
+	token->kad->kvno		= v1->kvno;
+	memcpy(&token->kad->session_key, &v1->session_key, 8);
+	memcpy(&token->kad->ticket, v1->ticket, v1->ticket_length);
 
 	/* attach the data */
-	memcpy(&upayload->k, tsec, sizeof(*tsec));
-	memcpy(&upayload->k.ticket, (void *)tsec + sizeof(*tsec),
-	       tsec->ticket_len);
-	key->payload.data = upayload;
-	key->expiry = tsec->expiry;
+	key->type_data.x[0]++;
+
+	pp = (struct rxrpc_key_token **)&key->payload.data;
+	while (*pp)
+		pp = &(*pp)->next;
+	*pp = token;
+	if (token->kad->expiry < key->expiry)
+		key->expiry = token->kad->expiry;
+	token = NULL;
 	ret = 0;
 
+error_free:
+	kfree(token);
 error:
 	return ret;
 }
@@ -184,7 +402,22 @@ static int rxrpc_instantiate_s(struct key *key, const void *data,
  */
 static void rxrpc_destroy(struct key *key)
 {
-	kfree(key->payload.data);
+	struct rxrpc_key_token *token;
+
+	while ((token = key->payload.data)) {
+		key->payload.data = token->next;
+		switch (token->security_index) {
+		case RXRPC_SECURITY_RXKAD:
+			kfree(token->kad);
+			break;
+		default:
+			printk(KERN_ERR "Unknown token type %x on rxrpc key\n",
+			       token->security_index);
+			BUG();
+		}
+
+		kfree(token);
+	}
 }
 
 /*
@@ -293,7 +526,7 @@ int rxrpc_get_server_data_key(struct rxrpc_connection *conn,
 
 	struct {
 		u32 kver;
-		struct rxkad_key tsec;
+		struct rxrpc_key_data_v1 v1;
 	} data;
 
 	_enter("");
@@ -308,13 +541,12 @@ int rxrpc_get_server_data_key(struct rxrpc_connection *conn,
 	_debug("key %d", key_serial(key));
 
 	data.kver = 1;
-	data.tsec.security_index = RXRPC_SECURITY_RXKAD;
-	data.tsec.ticket_len = 0;
-	data.tsec.expiry = expiry;
-	data.tsec.kvno = 0;
+	data.v1.security_index = RXRPC_SECURITY_RXKAD;
+	data.v1.ticket_length = 0;
+	data.v1.expiry = expiry;
+	data.v1.kvno = 0;
 
-	memcpy(&data.tsec.session_key, session_key,
-	       sizeof(data.tsec.session_key));
+	memcpy(&data.v1.session_key, session_key, sizeof(data.v1.session_key));
 
 	ret = key_instantiate_and_link(key, &data, sizeof(data), NULL, NULL);
 	if (ret < 0)
diff --git a/net/rxrpc/ar-security.c b/net/rxrpc/ar-security.c
index dc62920..49b3cc3 100644
--- a/net/rxrpc/ar-security.c
+++ b/net/rxrpc/ar-security.c
@@ -16,6 +16,7 @@
 #include <linux/crypto.h>
 #include <net/sock.h>
 #include <net/af_rxrpc.h>
+#include <keys/rxrpc-type.h>
 #include "ar-internal.h"
 
 static LIST_HEAD(rxrpc_security_methods);
@@ -122,6 +123,7 @@ EXPORT_SYMBOL_GPL(rxrpc_unregister_security);
  */
 int rxrpc_init_client_conn_security(struct rxrpc_connection *conn)
 {
+	struct rxrpc_key_token *token;
 	struct rxrpc_security *sec;
 	struct key *key = conn->key;
 	int ret;
@@ -135,7 +137,11 @@ int rxrpc_init_client_conn_security(struct rxrpc_connection *conn)
 	if (ret < 0)
 		return ret;
 
-	sec = rxrpc_security_lookup(key->type_data.x[0]);
+	if (!key->payload.data)
+		return -EKEYREJECTED;
+	token = key->payload.data;
+
+	sec = rxrpc_security_lookup(token->security_index);
 	if (!sec)
 		return -EKEYREJECTED;
 	conn->security = sec;
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index acec762..713ac59 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -18,6 +18,7 @@
 #include <linux/ctype.h>
 #include <net/sock.h>
 #include <net/af_rxrpc.h>
+#include <keys/rxrpc-type.h>
 #define rxrpc_debug rxkad_debug
 #include "ar-internal.h"
 
@@ -59,14 +60,14 @@ static DEFINE_MUTEX(rxkad_ci_mutex);
  */
 static int rxkad_init_connection_security(struct rxrpc_connection *conn)
 {
-	struct rxrpc_key_payload *payload;
 	struct crypto_blkcipher *ci;
+	struct rxrpc_key_token *token;
 	int ret;
 
 	_enter("{%d},{%x}", conn->debug_id, key_serial(conn->key));
 
-	payload = conn->key->payload.data;
-	conn->security_ix = payload->k.security_index;
+	token = conn->key->payload.data;
+	conn->security_ix = token->security_index;
 
 	ci = crypto_alloc_blkcipher("pcbc(fcrypt)", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(ci)) {
@@ -75,8 +76,8 @@ static int rxkad_init_connection_security(struct rxrpc_connection *conn)
 		goto error;
 	}
 
-	if (crypto_blkcipher_setkey(ci, payload->k.session_key,
-				    sizeof(payload->k.session_key)) < 0)
+	if (crypto_blkcipher_setkey(ci, token->kad->session_key,
+				    sizeof(token->kad->session_key)) < 0)
 		BUG();
 
 	switch (conn->security_level) {
@@ -110,7 +111,7 @@ error:
  */
 static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 {
-	struct rxrpc_key_payload *payload;
+	struct rxrpc_key_token *token;
 	struct blkcipher_desc desc;
 	struct scatterlist sg[2];
 	struct rxrpc_crypt iv;
@@ -123,8 +124,8 @@ static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 	if (!conn->key)
 		return;
 
-	payload = conn->key->payload.data;
-	memcpy(&iv, payload->k.session_key, sizeof(iv));
+	token = conn->key->payload.data;
+	memcpy(&iv, token->kad->session_key, sizeof(iv));
 
 	desc.tfm = conn->cipher;
 	desc.info = iv.x;
@@ -197,7 +198,7 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 					u32 data_size,
 					void *sechdr)
 {
-	const struct rxrpc_key_payload *payload;
+	const struct rxrpc_key_token *token;
 	struct rxkad_level2_hdr rxkhdr
 		__attribute__((aligned(8))); /* must be all on one page */
 	struct rxrpc_skb_priv *sp;
@@ -219,8 +220,8 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 	rxkhdr.checksum = 0;
 
 	/* encrypt from the session key */
-	payload = call->conn->key->payload.data;
-	memcpy(&iv, payload->k.session_key, sizeof(iv));
+	token = call->conn->key->payload.data;
+	memcpy(&iv, token->kad->session_key, sizeof(iv));
 	desc.tfm = call->conn->cipher;
 	desc.info = iv.x;
 	desc.flags = 0;
@@ -400,7 +401,7 @@ static int rxkad_verify_packet_encrypt(const struct rxrpc_call *call,
 				       struct sk_buff *skb,
 				       u32 *_abort_code)
 {
-	const struct rxrpc_key_payload *payload;
+	const struct rxrpc_key_token *token;
 	struct rxkad_level2_hdr sechdr;
 	struct rxrpc_skb_priv *sp;
 	struct blkcipher_desc desc;
@@ -431,8 +432,8 @@ static int rxkad_verify_packet_encrypt(const struct rxrpc_call *call,
 	skb_to_sgvec(skb, sg, 0, skb->len);
 
 	/* decrypt from the session key */
-	payload = call->conn->key->payload.data;
-	memcpy(&iv, payload->k.session_key, sizeof(iv));
+	token = call->conn->key->payload.data;
+	memcpy(&iv, token->kad->session_key, sizeof(iv));
 	desc.tfm = call->conn->cipher;
 	desc.info = iv.x;
 	desc.flags = 0;
@@ -737,7 +738,7 @@ static int rxkad_respond_to_challenge(struct rxrpc_connection *conn,
 				      struct sk_buff *skb,
 				      u32 *_abort_code)
 {
-	const struct rxrpc_key_payload *payload;
+	const struct rxrpc_key_token *token;
 	struct rxkad_challenge challenge;
 	struct rxkad_response resp
 		__attribute__((aligned(8))); /* must be aligned for crypto */
@@ -778,7 +779,7 @@ static int rxkad_respond_to_challenge(struct rxrpc_connection *conn,
 	if (conn->security_level < min_level)
 		goto protocol_error;
 
-	payload = conn->key->payload.data;
+	token = conn->key->payload.data;
 
 	/* build the response packet */
 	memset(&resp, 0, sizeof(resp));
@@ -797,13 +798,13 @@ static int rxkad_respond_to_challenge(struct rxrpc_connection *conn,
 		(conn->channels[3] ? conn->channels[3]->call_id : 0);
 	resp.encrypted.inc_nonce = htonl(nonce + 1);
 	resp.encrypted.level = htonl(conn->security_level);
-	resp.kvno = htonl(payload->k.kvno);
-	resp.ticket_len = htonl(payload->k.ticket_len);
+	resp.kvno = htonl(token->kad->kvno);
+	resp.ticket_len = htonl(token->kad->ticket_len);
 
 	/* calculate the response checksum and then do the encryption */
 	rxkad_calc_response_checksum(&resp);
-	rxkad_encrypt_response(conn, &resp, &payload->k);
-	return rxkad_send_response(conn, &sp->hdr, &resp, &payload->k);
+	rxkad_encrypt_response(conn, &resp, token->kad);
+	return rxkad_send_response(conn, &sp->hdr, &resp, token->kad);
 
 protocol_error:
 	*_abort_code = abort_code;


^ permalink raw reply related

* [PATCH 3/4] RxRPC: Allow RxRPC keys to be read
From: David Howells @ 2009-09-14 11:17 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-afs, netdev, David Howells
In-Reply-To: <20090914111730.10233.44007.stgit@warthog.procyon.org.uk>

Allow RxRPC keys to be read.  This is to allow pioctl() to be implemented in
userspace.  RxRPC keys are read out in XDR format.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-key.c |  109 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 109 insertions(+), 0 deletions(-)


diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index a3a7acb..bf4d623 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -29,6 +29,7 @@ static int rxrpc_instantiate_s(struct key *, const void *, size_t);
 static void rxrpc_destroy(struct key *);
 static void rxrpc_destroy_s(struct key *);
 static void rxrpc_describe(const struct key *, struct seq_file *);
+static long rxrpc_read(const struct key *, char __user *, size_t);
 
 /*
  * rxrpc defined keys take an arbitrary string as the description and an
@@ -40,6 +41,7 @@ struct key_type key_type_rxrpc = {
 	.match		= user_match,
 	.destroy	= rxrpc_destroy,
 	.describe	= rxrpc_describe,
+	.read		= rxrpc_read,
 };
 EXPORT_SYMBOL(key_type_rxrpc);
 
@@ -592,3 +594,110 @@ struct key *rxrpc_get_null_key(const char *keyname)
 	return key;
 }
 EXPORT_SYMBOL(rxrpc_get_null_key);
+
+/*
+ * read the contents of an rxrpc key
+ * - this returns the result in XDR form
+ */
+static long rxrpc_read(const struct key *key,
+		       char __user *buffer, size_t buflen)
+{
+	struct rxrpc_key_token *token;
+	size_t size, toksize;
+	__be32 __user *xdr;
+	u32 cnlen, tktlen, ntoks, zero;
+
+	_enter("");
+
+	/* we don't know what form we should return non-AFS keys in */
+	if (memcmp(key->description, "afs@", 4) != 0)
+		return -EOPNOTSUPP;
+	cnlen = strlen(key->description + 4);
+
+	/* AFS keys we return in XDR form, so we need to work out the size of
+	 * the XDR */
+	size = 2 * 4;	/* flags, cellname len */
+	size += (cnlen + 3) & ~3;	/* cellname */
+	size += 1 * 4;	/* token count */
+
+	ntoks = 0;
+	for (token = key->payload.data; token; token = token->next) {
+		switch (token->security_index) {
+		case RXRPC_SECURITY_RXKAD:
+			size += 2 * 4;	/* length, security index (switch ID) */
+			size += 8 * 4;	/* viceid, kvno, key*2, begin, end,
+					 * primary, tktlen */
+			size += (token->kad->ticket_len + 3) & ~3; /* ticket */
+			ntoks++;
+			break;
+
+		default: /* can't encode */
+			break;
+		}
+	}
+
+	if (!buffer || buflen < size)
+		return size;
+
+	xdr = (__be32 __user *) buffer;
+	zero = 0;
+#define ENCODE(x)				\
+	do {					\
+		__be32 y = htonl(x);		\
+		if (put_user(y, xdr++) < 0)	\
+			goto fault;		\
+	} while(0)
+
+	ENCODE(0);	/* flags */
+	ENCODE(cnlen);	/* cellname length */
+	if (copy_to_user(xdr, key->description + 4, cnlen) != 0)
+		goto fault;
+	if (cnlen & 3 &&
+	    copy_to_user((u8 *)xdr + cnlen, &zero, 4 - (cnlen & 3)) != 0)
+		goto fault;
+	xdr += (cnlen + 3) >> 2;
+	ENCODE(ntoks);	/* token count */
+
+	for (token = key->payload.data; token; token = token->next) {
+		toksize = 1 * 4;	/* sec index */
+
+		switch (token->security_index) {
+		case RXRPC_SECURITY_RXKAD:
+			toksize += 8 * 4;
+			toksize += (token->kad->ticket_len + 3) & ~3;
+			ENCODE(toksize);
+			ENCODE(token->security_index);
+			ENCODE(token->kad->vice_id);
+			ENCODE(token->kad->kvno);
+			if (copy_to_user(xdr, token->kad->session_key, 8) != 0)
+				goto fault;
+			xdr += 8 >> 2;
+			ENCODE(token->kad->start);
+			ENCODE(token->kad->expiry);
+			ENCODE(token->kad->primary_flag);
+			tktlen = token->kad->ticket_len;
+			ENCODE(tktlen);
+			if (copy_to_user(xdr, token->kad->ticket, tktlen) != 0)
+				goto fault;
+			if (tktlen & 3 &&
+			    copy_to_user((u8 *)xdr + tktlen, &zero,
+					 4 - (tktlen & 3)) != 0)
+				goto fault;
+			xdr += (tktlen + 3) >> 2;
+			break;
+
+		default:
+			break;
+		}
+	}
+
+#undef ENCODE
+
+	ASSERTCMP((char __user *) xdr - buffer, ==, size);
+	_leave(" = %zu", size);
+	return size;
+
+fault:
+	_leave(" = -EFAULT");
+	return -EFAULT;
+}


^ permalink raw reply related

* [PATCH 4/4] RxRPC: Parse security index 5 keys (Kerberos 5)
From: David Howells @ 2009-09-14 11:17 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-afs, netdev, David Howells
In-Reply-To: <20090914111730.10233.44007.stgit@warthog.procyon.org.uk>

Parse RxRPC security index 5 type keys (Kerberos 5 tokens).

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/keys/rxrpc-type.h |   52 ++++
 net/rxrpc/ar-key.c        |  577 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 589 insertions(+), 40 deletions(-)


diff --git a/include/keys/rxrpc-type.h b/include/keys/rxrpc-type.h
index c0d9121..5eb2357 100644
--- a/include/keys/rxrpc-type.h
+++ b/include/keys/rxrpc-type.h
@@ -36,6 +36,54 @@ struct rxkad_key {
 };
 
 /*
+ * Kerberos 5 principal
+ *	name/name/name@realm
+ */
+struct krb5_principal {
+	u8	n_name_parts;		/* N of parts of the name part of the principal */
+	char	**name_parts;		/* parts of the name part of the principal */
+	char	*realm;			/* parts of the realm part of the principal */
+};
+
+/*
+ * Kerberos 5 tagged data
+ */
+struct krb5_tagged_data {
+	/* for tag value, see /usr/include/krb5/krb5.h
+	 * - KRB5_AUTHDATA_* for auth data
+	 * - 
+	 */
+	int32_t		tag;
+	uint32_t	data_len;
+	u8		*data;
+};
+
+/*
+ * RxRPC key for Kerberos V (type-5 security)
+ */
+struct rxk5_key {
+	uint64_t		authtime;	/* time at which auth token generated */
+	uint64_t		starttime;	/* time at which auth token starts */
+	uint64_t		endtime;	/* time at which auth token expired */
+	uint64_t		renew_till;	/* time to which auth token can be renewed */
+	int32_t			is_skey;	/* T if ticket is encrypted in another ticket's
+						 * skey */
+	int32_t			flags;		/* mask of TKT_FLG_* bits (krb5/krb5.h) */
+	struct krb5_principal	client;		/* client principal name */
+	struct krb5_principal	server;		/* server principal name */
+	uint16_t		ticket_len;	/* length of ticket */
+	uint16_t		ticket2_len;	/* length of second ticket */
+	u8			n_authdata;	/* number of authorisation data elements */
+	u8			n_addresses;	/* number of addresses */
+	struct krb5_tagged_data	session;	/* session data; tag is enctype */
+	struct krb5_tagged_data *addresses;	/* addresses */
+	u8			*ticket;	/* krb5 ticket */
+	u8			*ticket2;	/* second krb5 ticket, if related to ticket (via
+						 * DUPLICATE-SKEY or ENC-TKT-IN-SKEY) */
+	struct krb5_tagged_data *authdata;	/* authorisation data */
+};
+
+/*
  * list of tokens attached to an rxrpc key
  */
 struct rxrpc_key_token {
@@ -43,6 +91,7 @@ struct rxrpc_key_token {
 	struct rxrpc_key_token *next;	/* the next token in the list */
 	union {
 		struct rxkad_key *kad;
+		struct rxk5_key *k5;
 	};
 };
 
@@ -64,8 +113,11 @@ struct rxrpc_key_data_v1 {
  * - based on openafs-1.4.10/src/auth/afs_token.xg
  */
 #define AFSTOKEN_LENGTH_MAX		16384	/* max payload size */
+#define AFSTOKEN_STRING_MAX		256	/* max small string length */
+#define AFSTOKEN_DATA_MAX		64	/* max small data length */
 #define AFSTOKEN_CELL_MAX		64	/* max cellname length */
 #define AFSTOKEN_MAX			8	/* max tokens per payload */
+#define AFSTOKEN_BDATALN_MAX		16384	/* max big data length */
 #define AFSTOKEN_RK_TIX_MAX		12000	/* max RxKAD ticket size */
 #define AFSTOKEN_GK_KEY_MAX		64	/* max GSSAPI key size */
 #define AFSTOKEN_GK_TOKEN_MAX		16384	/* max GSSAPI token size */
diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index bf4d623..44836f6 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -64,7 +64,7 @@ struct key_type key_type_rxrpc_s = {
 static int rxrpc_instantiate_xdr_rxkad(struct key *key, const __be32 *xdr,
 				       unsigned toklen)
 {
-	struct rxrpc_key_token *token;
+	struct rxrpc_key_token *token, **pptoken;
 	size_t plen;
 	u32 tktlen;
 	int ret;
@@ -129,13 +129,398 @@ static int rxrpc_instantiate_xdr_rxkad(struct key *key, const __be32 *xdr,
 	key->type_data.x[0]++;
 
 	/* attach the data */
-	token->next = key->payload.data;
-	key->payload.data = token;
+	for (pptoken = (struct rxrpc_key_token **)&key->payload.data;
+	     *pptoken;
+	     pptoken = &(*pptoken)->next)
+		continue;
+	*pptoken = token;
+	if (token->kad->expiry < key->expiry)
+		key->expiry = token->kad->expiry;
+
+	_leave(" = 0");
+	return 0;
+}
+
+static void rxrpc_free_krb5_principal(struct krb5_principal *princ)
+{
+	int loop;
+
+	if (princ->name_parts) {
+		for (loop = princ->n_name_parts - 1; loop >= 0; loop--)
+			kfree(princ->name_parts[loop]);
+		kfree(princ->name_parts);
+	}
+	kfree(princ->realm);
+}
+
+static void rxrpc_free_krb5_tagged(struct krb5_tagged_data *td)
+{
+	kfree(td->data);
+}
+
+/*
+ * free up an RxK5 token
+ */
+static void rxrpc_rxk5_free(struct rxk5_key *rxk5)
+{
+	int loop;
+
+	rxrpc_free_krb5_principal(&rxk5->client);
+	rxrpc_free_krb5_principal(&rxk5->server);
+	rxrpc_free_krb5_tagged(&rxk5->session);
+
+	if (rxk5->addresses) {
+		for (loop = rxk5->n_addresses - 1; loop >= 0; loop--)
+			rxrpc_free_krb5_tagged(&rxk5->addresses[loop]);
+		kfree(rxk5->addresses);
+	}
+	if (rxk5->authdata) {
+		for (loop = rxk5->n_authdata - 1; loop >= 0; loop--)
+			rxrpc_free_krb5_tagged(&rxk5->authdata[loop]);
+		kfree(rxk5->authdata);
+	}
+
+	kfree(rxk5->ticket);
+	kfree(rxk5->ticket2);
+	kfree(rxk5);
+}
+
+/*
+ * extract a krb5 principal
+ */
+static int rxrpc_krb5_decode_principal(struct krb5_principal *princ,
+				       const __be32 **_xdr,
+				       unsigned *_toklen)
+{
+	const __be32 *xdr = *_xdr;
+	unsigned toklen = *_toklen, n_parts, loop, tmp;
+
+	/* there must be at least one name, and at least #names+1 length
+	 * words */
+	if (toklen <= 12)
+		return -EINVAL;
+
+	_enter(",{%x,%x,%x},%u",
+	       ntohl(xdr[0]), ntohl(xdr[1]), ntohl(xdr[2]), toklen);
+
+	n_parts = ntohl(*xdr++);
+	toklen -= 4;
+	if (n_parts <= 0 || n_parts > AFSTOKEN_K5_COMPONENTS_MAX)
+		return -EINVAL;
+	princ->n_name_parts = n_parts;
+
+	if (toklen <= (n_parts + 1) * 4)
+		return -EINVAL;
+
+	princ->name_parts = kcalloc(sizeof(char *), n_parts, GFP_KERNEL);
+	if (!princ->name_parts)
+		return -ENOMEM;
+
+	for (loop = 0; loop < n_parts; loop++) {
+		if (toklen < 4)
+			return -EINVAL;
+		tmp = ntohl(*xdr++);
+		toklen -= 4;
+		if (tmp <= 0 || tmp > AFSTOKEN_STRING_MAX)
+			return -EINVAL;
+		if (tmp > toklen)
+			return -EINVAL;
+		princ->name_parts[loop] = kmalloc(tmp + 1, GFP_KERNEL);
+		if (!princ->name_parts[loop])
+			return -ENOMEM;
+		memcpy(princ->name_parts[loop], xdr, tmp);
+		princ->name_parts[loop][tmp] = 0;
+		tmp = (tmp + 3) & ~3;
+		toklen -= tmp;
+		xdr += tmp >> 2;
+	}
+
+	if (toklen < 4)
+		return -EINVAL;
+	tmp = ntohl(*xdr++);
+	toklen -= 4;
+	if (tmp <= 0 || tmp > AFSTOKEN_K5_REALM_MAX)
+		return -EINVAL;
+	if (tmp > toklen)
+		return -EINVAL;
+	princ->realm = kmalloc(tmp + 1, GFP_KERNEL);
+	if (!princ->realm)
+		return -ENOMEM;
+	memcpy(princ->realm, xdr, tmp);
+	princ->realm[tmp] = 0;
+	tmp = (tmp + 3) & ~3;
+	toklen -= tmp;
+	xdr += tmp >> 2;
+
+	_debug("%s/...@%s", princ->name_parts[0], princ->realm);
+
+	*_xdr = xdr;
+	*_toklen = toklen;
+	_leave(" = 0 [toklen=%u]", toklen);
+	return 0;
+}
+
+/*
+ * extract a piece of krb5 tagged data
+ */
+static int rxrpc_krb5_decode_tagged_data(struct krb5_tagged_data *td,
+					 size_t max_data_size,
+					 const __be32 **_xdr,
+					 unsigned *_toklen)
+{
+	const __be32 *xdr = *_xdr;
+	unsigned toklen = *_toklen, len;
+
+	/* there must be at least one tag and one length word */
+	if (toklen <= 8)
+		return -EINVAL;
+
+	_enter(",%zu,{%x,%x},%u",
+	       max_data_size, ntohl(xdr[0]), ntohl(xdr[1]), toklen);
+
+	td->tag = ntohl(*xdr++);
+	len = ntohl(*xdr++);
+	toklen -= 8;
+	if (len > max_data_size)
+		return -EINVAL;
+	td->data_len = len;
+
+	if (len > 0) {
+		td->data = kmalloc(len, GFP_KERNEL);
+		if (!td->data)
+			return -ENOMEM;
+		memcpy(td->data, xdr, len);
+		len = (len + 3) & ~3;
+		toklen -= len;
+		xdr += len >> 2;
+	}
+
+	_debug("tag %x len %x", td->tag, td->data_len);
+
+	*_xdr = xdr;
+	*_toklen = toklen;
+	_leave(" = 0 [toklen=%u]", toklen);
+	return 0;
+}
+
+/*
+ * extract an array of tagged data
+ */
+static int rxrpc_krb5_decode_tagged_array(struct krb5_tagged_data **_td,
+					  u8 *_n_elem,
+					  u8 max_n_elem,
+					  size_t max_elem_size,
+					  const __be32 **_xdr,
+					  unsigned *_toklen)
+{
+	struct krb5_tagged_data *td;
+	const __be32 *xdr = *_xdr;
+	unsigned toklen = *_toklen, n_elem, loop;
+	int ret;
+
+	/* there must be at least one count */
+	if (toklen < 4)
+		return -EINVAL;
+
+	_enter(",,%u,%zu,{%x},%u",
+	       max_n_elem, max_elem_size, ntohl(xdr[0]), toklen);
+
+	n_elem = ntohl(*xdr++);
+	toklen -= 4;
+	if (n_elem < 0 || n_elem > max_n_elem)
+		return -EINVAL;
+	*_n_elem = n_elem;
+	if (n_elem > 0) {
+		if (toklen <= (n_elem + 1) * 4)
+			return -EINVAL;
+
+		_debug("n_elem %d", n_elem);
+
+		td = kcalloc(sizeof(struct krb5_tagged_data), n_elem,
+			     GFP_KERNEL);
+		if (!td)
+			return -ENOMEM;
+		*_td = td;
+
+		for (loop = 0; loop < n_elem; loop++) {
+			ret = rxrpc_krb5_decode_tagged_data(&td[loop],
+							    max_elem_size,
+							    &xdr, &toklen);
+			if (ret < 0)
+				return ret;
+		}
+	}
+
+	*_xdr = xdr;
+	*_toklen = toklen;
+	_leave(" = 0 [toklen=%u]", toklen);
+	return 0;
+}
+
+/*
+ * extract a krb5 ticket
+ */
+static int rxrpc_krb5_decode_ticket(u8 **_ticket, uint16_t *_tktlen,
+				    const __be32 **_xdr, unsigned *_toklen)
+{
+	const __be32 *xdr = *_xdr;
+	unsigned toklen = *_toklen, len;
+
+	/* there must be at least one length word */
+	if (toklen <= 4)
+		return -EINVAL;
+
+	_enter(",{%x},%u", ntohl(xdr[0]), toklen);
+
+	len = ntohl(*xdr++);
+	toklen -= 4;
+	if (len > AFSTOKEN_K5_TIX_MAX)
+		return -EINVAL;
+	*_tktlen = len;
+
+	_debug("ticket len %u", len);
+
+	if (len > 0) {
+		*_ticket = kmalloc(len, GFP_KERNEL);
+		if (!*_ticket)
+			return -ENOMEM;
+		memcpy(*_ticket, xdr, len);
+		len = (len + 3) & ~3;
+		toklen -= len;
+		xdr += len >> 2;
+	}
+
+	*_xdr = xdr;
+	*_toklen = toklen;
+	_leave(" = 0 [toklen=%u]", toklen);
+	return 0;
+}
+
+/*
+ * parse an RxK5 type XDR format token
+ * - the caller guarantees we have at least 4 words
+ */
+static int rxrpc_instantiate_xdr_rxk5(struct key *key, const __be32 *xdr,
+				      unsigned toklen)
+{
+	struct rxrpc_key_token *token, **pptoken;
+	struct rxk5_key *rxk5;
+	const __be32 *end_xdr = xdr + (toklen >> 2);
+	int ret;
+
+	_enter(",{%x,%x,%x,%x},%u",
+	       ntohl(xdr[0]), ntohl(xdr[1]), ntohl(xdr[2]), ntohl(xdr[3]),
+	       toklen);
+
+	/* reserve some payload space for this subkey - the length of the token
+	 * is a reasonable approximation */
+	ret = key_payload_reserve(key, key->datalen + toklen);
+	if (ret < 0)
+		return ret;
+
+	token = kzalloc(sizeof(*token), GFP_KERNEL);
+	if (!token)
+		return -ENOMEM;
+
+	rxk5 = kzalloc(sizeof(*rxk5), GFP_KERNEL);
+	if (!rxk5) {
+		kfree(token);
+		return -ENOMEM;
+	}
+
+	token->security_index = RXRPC_SECURITY_RXK5;
+	token->k5 = rxk5;
+
+	/* extract the principals */
+	ret = rxrpc_krb5_decode_principal(&rxk5->client, &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+	ret = rxrpc_krb5_decode_principal(&rxk5->server, &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	/* extract the session key and the encoding type (the tag field ->
+	 * ENCTYPE_xxx) */
+	ret = rxrpc_krb5_decode_tagged_data(&rxk5->session, AFSTOKEN_DATA_MAX,
+					    &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	if (toklen < 4 * 8 + 2 * 4)
+		goto inval;
+	rxk5->authtime	= be64_to_cpup((const __be64 *) xdr);
+	xdr += 2;
+	rxk5->starttime	= be64_to_cpup((const __be64 *) xdr);
+	xdr += 2;
+	rxk5->endtime	= be64_to_cpup((const __be64 *) xdr);
+	xdr += 2;
+	rxk5->renew_till = be64_to_cpup((const __be64 *) xdr);
+	xdr += 2;
+	rxk5->is_skey = ntohl(*xdr++);
+	rxk5->flags = ntohl(*xdr++);
+	toklen -= 4 * 8 + 2 * 4;
+
+	_debug("times: a=%llx s=%llx e=%llx rt=%llx",
+	       rxk5->authtime, rxk5->starttime, rxk5->endtime,
+	       rxk5->renew_till);
+	_debug("is_skey=%x flags=%x", rxk5->is_skey, rxk5->flags);
+
+	/* extract the permitted client addresses */
+	ret = rxrpc_krb5_decode_tagged_array(&rxk5->addresses,
+					     &rxk5->n_addresses,
+					     AFSTOKEN_K5_ADDRESSES_MAX,
+					     AFSTOKEN_DATA_MAX,
+					     &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	ASSERTCMP((end_xdr - xdr) << 2, ==, toklen);
+
+	/* extract the tickets */
+	ret = rxrpc_krb5_decode_ticket(&rxk5->ticket, &rxk5->ticket_len,
+				       &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+	ret = rxrpc_krb5_decode_ticket(&rxk5->ticket2, &rxk5->ticket2_len,
+				       &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	ASSERTCMP((end_xdr - xdr) << 2, ==, toklen);
+
+	/* extract the typed auth data */
+	ret = rxrpc_krb5_decode_tagged_array(&rxk5->authdata,
+					     &rxk5->n_authdata,
+					     AFSTOKEN_K5_AUTHDATA_MAX,
+					     AFSTOKEN_BDATALN_MAX,
+					     &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	ASSERTCMP((end_xdr - xdr) << 2, ==, toklen);
+
+	if (toklen != 0)
+		goto inval;
+
+	/* attach the payload to the key */
+	for (pptoken = (struct rxrpc_key_token **)&key->payload.data;
+	     *pptoken;
+	     pptoken = &(*pptoken)->next)
+		continue;
+	*pptoken = token;
 	if (token->kad->expiry < key->expiry)
 		key->expiry = token->kad->expiry;
 
 	_leave(" = 0");
 	return 0;
+
+inval:
+	ret = -EINVAL;
+error:
+	rxrpc_rxk5_free(rxk5);
+	kfree(token);
+	_leave(" = %d", ret);
+	return ret;
 }
 
 /*
@@ -228,6 +613,8 @@ static int rxrpc_instantiate_xdr(struct key *key, const void *data, size_t datal
 		sec_ix = ntohl(*xdr++);
 		toklen -= 4;
 
+		_debug("TOKEN type=%u [%p-%p]", sec_ix, xdr, token);
+
 		switch (sec_ix) {
 		case RXRPC_SECURITY_RXKAD:
 			ret = rxrpc_instantiate_xdr_rxkad(key, xdr, toklen);
@@ -235,6 +622,12 @@ static int rxrpc_instantiate_xdr(struct key *key, const void *data, size_t datal
 				goto error;
 			break;
 
+		case RXRPC_SECURITY_RXK5:
+			ret = rxrpc_instantiate_xdr_rxk5(key, xdr, toklen);
+			if (ret != 0)
+				goto error;
+			break;
+
 		default:
 			ret = -EPROTONOSUPPORT;
 			goto error;
@@ -412,6 +805,10 @@ static void rxrpc_destroy(struct key *key)
 		case RXRPC_SECURITY_RXKAD:
 			kfree(token->kad);
 			break;
+		case RXRPC_SECURITY_RXK5:
+			if (token->k5)
+				rxrpc_rxk5_free(token->k5);
+			break;
 		default:
 			printk(KERN_ERR "Unknown token type %x on rxrpc key\n",
 			       token->security_index);
@@ -602,10 +999,13 @@ EXPORT_SYMBOL(rxrpc_get_null_key);
 static long rxrpc_read(const struct key *key,
 		       char __user *buffer, size_t buflen)
 {
-	struct rxrpc_key_token *token;
-	size_t size, toksize;
-	__be32 __user *xdr;
-	u32 cnlen, tktlen, ntoks, zero;
+	const struct rxrpc_key_token *token;
+	const struct krb5_principal *princ;
+	size_t size;
+	__be32 __user *xdr, *oldxdr;
+	u32 cnlen, toksize, ntoks, tok, zero;
+	u16 toksizes[AFSTOKEN_MAX];
+	int loop;
 
 	_enter("");
 
@@ -614,28 +1014,68 @@ static long rxrpc_read(const struct key *key,
 		return -EOPNOTSUPP;
 	cnlen = strlen(key->description + 4);
 
+#define RND(X) (((X) + 3) & ~3)
+
 	/* AFS keys we return in XDR form, so we need to work out the size of
 	 * the XDR */
 	size = 2 * 4;	/* flags, cellname len */
-	size += (cnlen + 3) & ~3;	/* cellname */
+	size += RND(cnlen);	/* cellname */
 	size += 1 * 4;	/* token count */
 
 	ntoks = 0;
 	for (token = key->payload.data; token; token = token->next) {
+		toksize = 4;	/* sec index */
+
 		switch (token->security_index) {
 		case RXRPC_SECURITY_RXKAD:
-			size += 2 * 4;	/* length, security index (switch ID) */
-			size += 8 * 4;	/* viceid, kvno, key*2, begin, end,
-					 * primary, tktlen */
-			size += (token->kad->ticket_len + 3) & ~3; /* ticket */
-			ntoks++;
+			toksize += 8 * 4;	/* viceid, kvno, key*2, begin,
+						 * end, primary, tktlen */
+			toksize += RND(token->kad->ticket_len);
 			break;
 
-		default: /* can't encode */
+		case RXRPC_SECURITY_RXK5:
+			princ = &token->k5->client;
+			toksize += 4 + princ->n_name_parts * 4;
+			for (loop = 0; loop < princ->n_name_parts; loop++)
+				toksize += RND(strlen(princ->name_parts[loop]));
+			toksize += 4 + RND(strlen(princ->realm));
+
+			princ = &token->k5->server;
+			toksize += 4 + princ->n_name_parts * 4;
+			for (loop = 0; loop < princ->n_name_parts; loop++)
+				toksize += RND(strlen(princ->name_parts[loop]));
+			toksize += 4 + RND(strlen(princ->realm));
+
+			toksize += 8 + RND(token->k5->session.data_len);
+
+			toksize += 4 * 8 + 2 * 4;
+
+			toksize += 4 + token->k5->n_addresses * 8;
+			for (loop = 0; loop < token->k5->n_addresses; loop++)
+				toksize += RND(token->k5->addresses[loop].data_len);
+
+			toksize += 4 + RND(token->k5->ticket_len);
+			toksize += 4 + RND(token->k5->ticket2_len);
+
+			toksize += 4 + token->k5->n_authdata * 8;
+			for (loop = 0; loop < token->k5->n_authdata; loop++)
+				toksize += RND(token->k5->authdata[loop].data_len);
 			break;
+
+		default: /* we have a ticket we can't encode */
+			BUG();
+			continue;
 		}
+
+		_debug("token[%u]: toksize=%u", ntoks, toksize);
+		ASSERTCMP(toksize, <=, AFSTOKEN_LENGTH_MAX);
+
+		toksizes[ntoks++] = toksize;
+		size += toksize + 4; /* each token has a length word */
 	}
 
+#undef RND
+
 	if (!buffer || buflen < size)
 		return size;
 
@@ -647,52 +1087,109 @@ static long rxrpc_read(const struct key *key,
 		if (put_user(y, xdr++) < 0)	\
 			goto fault;		\
 	} while(0)
+#define ENCODE_DATA(l, s)						\
+	do {								\
+		u32 _l = (l);						\
+		ENCODE(l);						\
+		if (copy_to_user(xdr, (s), _l) != 0)			\
+			goto fault;					\
+		if (_l & 3 &&						\
+		    copy_to_user((u8 *)xdr + _l, &zero, 4 - (_l & 3)) != 0) \
+			goto fault;					\
+		xdr += (_l + 3) >> 2;					\
+	} while(0)
+#define ENCODE64(x)					\
+	do {						\
+		__be64 y = cpu_to_be64(x);		\
+		if (copy_to_user(xdr, &y, 8) != 0)	\
+			goto fault;			\
+		xdr += 8 >> 2;				\
+	} while(0)
+#define ENCODE_STR(s)				\
+	do {					\
+		const char *_s = (s);		\
+		ENCODE_DATA(strlen(_s), _s);	\
+	} while(0)
 
-	ENCODE(0);	/* flags */
-	ENCODE(cnlen);	/* cellname length */
-	if (copy_to_user(xdr, key->description + 4, cnlen) != 0)
-		goto fault;
-	if (cnlen & 3 &&
-	    copy_to_user((u8 *)xdr + cnlen, &zero, 4 - (cnlen & 3)) != 0)
-		goto fault;
-	xdr += (cnlen + 3) >> 2;
-	ENCODE(ntoks);	/* token count */
+	ENCODE(0);					/* flags */
+	ENCODE_DATA(cnlen, key->description + 4);	/* cellname */
+	ENCODE(ntoks);
 
+	tok = 0;
 	for (token = key->payload.data; token; token = token->next) {
-		toksize = 1 * 4;	/* sec index */
+		toksize = toksizes[tok++];
+		ENCODE(toksize);
+		oldxdr = xdr;
+		ENCODE(token->security_index);
 
 		switch (token->security_index) {
 		case RXRPC_SECURITY_RXKAD:
-			toksize += 8 * 4;
-			toksize += (token->kad->ticket_len + 3) & ~3;
-			ENCODE(toksize);
-			ENCODE(token->security_index);
 			ENCODE(token->kad->vice_id);
 			ENCODE(token->kad->kvno);
-			if (copy_to_user(xdr, token->kad->session_key, 8) != 0)
-				goto fault;
-			xdr += 8 >> 2;
+			ENCODE_DATA(8, token->kad->session_key);
 			ENCODE(token->kad->start);
 			ENCODE(token->kad->expiry);
 			ENCODE(token->kad->primary_flag);
-			tktlen = token->kad->ticket_len;
-			ENCODE(tktlen);
-			if (copy_to_user(xdr, token->kad->ticket, tktlen) != 0)
-				goto fault;
-			if (tktlen & 3 &&
-			    copy_to_user((u8 *)xdr + tktlen, &zero,
-					 4 - (tktlen & 3)) != 0)
-				goto fault;
-			xdr += (tktlen + 3) >> 2;
+			ENCODE_DATA(token->kad->ticket_len, token->kad->ticket);
+			break;
+
+		case RXRPC_SECURITY_RXK5:
+			princ = &token->k5->client;
+			ENCODE(princ->n_name_parts);
+			for (loop = 0; loop < princ->n_name_parts; loop++)
+				ENCODE_STR(princ->name_parts[loop]);
+			ENCODE_STR(princ->realm);
+
+			princ = &token->k5->server;
+			ENCODE(princ->n_name_parts);
+			for (loop = 0; loop < princ->n_name_parts; loop++)
+				ENCODE_STR(princ->name_parts[loop]);
+			ENCODE_STR(princ->realm);
+
+			ENCODE(token->k5->session.tag);
+			ENCODE_DATA(token->k5->session.data_len,
+				    token->k5->session.data);
+
+			ENCODE64(token->k5->authtime);
+			ENCODE64(token->k5->starttime);
+			ENCODE64(token->k5->endtime);
+			ENCODE64(token->k5->renew_till);
+			ENCODE(token->k5->is_skey);
+			ENCODE(token->k5->flags);
+
+			ENCODE(token->k5->n_addresses);
+			for (loop = 0; loop < token->k5->n_addresses; loop++) {
+				ENCODE(token->k5->addresses[loop].tag);
+				ENCODE_DATA(token->k5->addresses[loop].data_len,
+					    token->k5->addresses[loop].data);
+			}
+
+			ENCODE_DATA(token->k5->ticket_len, token->k5->ticket);
+			ENCODE_DATA(token->k5->ticket2_len, token->k5->ticket2);
+
+			ENCODE(token->k5->n_authdata);
+			for (loop = 0; loop < token->k5->n_authdata; loop++) {
+				ENCODE(token->k5->authdata[loop].tag);
+				ENCODE_DATA(token->k5->authdata[loop].data_len,
+					    token->k5->authdata[loop].data);
+			}
 			break;
 
 		default:
+			BUG();
 			break;
 		}
+
+		ASSERTCMP((unsigned long)xdr - (unsigned long)oldxdr, ==,
+			  toksize);
 	}
 
+#undef ENCODE_STR
+#undef ENCODE_DATA
+#undef ENCODE64
 #undef ENCODE
 
+	ASSERTCMP(tok, ==, ntoks);
 	ASSERTCMP((char __user *) xdr - buffer, ==, size);
 	_leave(" = %zu", size);
 	return size;


^ permalink raw reply related

* more troubles with bridge in netns
From: Atis Elsts @ 2009-09-14 11:19 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: netdev
In-Reply-To: <4AA6188C.1070806@free.fr>

[-- Attachment #1: Type: text/plain, Size: 1409 bytes --]

On Tuesday 08 September 2009 11:40:44 Daniel Lezcano wrote:
> Atis Elsts wrote:
> > Trying to add bridge interface from userspace program, after moving the
> > program to a new network namespace, causes kernel to crash. I am using
> > latest kernel version from git (2.6.31-rc9).
> > The bug is easy to reproduce - just compile and run the attached C
> > program.
> >
> > I see that bridge interface has NETIF_F_NETNS_LOCAL flag, but as I
> > understand, this flag simply means that a device cannot be *moved* across
> > network namespaces, not that it cannot be *created* in other namespaces.
>
> Yep, very easy to reproduce :/
> The sysfs has not been disabled for the bridge. I will try to fix it as
> soon as I can.
>
> Thanks
>   -- Daniel

Hello,

please let me know when the sysfs patch for bridge is available. At the moment 
I managed to get it to work by just commenting out all sysfs stuff for bridge 
module. However, a new problem appears now. After running C program 
(attached) that creates a bridge in network namespace and attaches an 
interface to it, I got this message repeatedly:
 kernel:[  466.758908] unregister_netdevice: waiting for lo to become free. 
Usage count = 2

It sems pretty unlikely that my kernel changes could have caused this?

The unregister_netdevice message does not appear, however, if I uncomment this 
line in child.c:
    system("brctl setfd sim_br0 0");

--Atis

[-- Attachment #2: brtest.tgz --]
[-- Type: application/x-tgz, Size: 10767 bytes --]

^ permalink raw reply

* Re: more troubles with bridge in netns
From: Daniel Lezcano @ 2009-09-14 11:28 UTC (permalink / raw)
  To: Atis Elsts; +Cc: netdev
In-Reply-To: <200909141419.12330.atis@mikrotik.com>

Atis Elsts wrote:
> On Tuesday 08 September 2009 11:40:44 Daniel Lezcano wrote:
>   
>> Atis Elsts wrote:
>>     
>>> Trying to add bridge interface from userspace program, after moving the
>>> program to a new network namespace, causes kernel to crash. I am using
>>> latest kernel version from git (2.6.31-rc9).
>>> The bug is easy to reproduce - just compile and run the attached C
>>> program.
>>>
>>> I see that bridge interface has NETIF_F_NETNS_LOCAL flag, but as I
>>> understand, this flag simply means that a device cannot be *moved* across
>>> network namespaces, not that it cannot be *created* in other namespaces.
>>>       
>> Yep, very easy to reproduce :/
>> The sysfs has not been disabled for the bridge. I will try to fix it as
>> soon as I can.
>>
>> Thanks
>>   -- Daniel
>>     
>
> Hello,
>
> please let me know when the sysfs patch for bridge is available. At the moment 
> I managed to get it to work by just commenting out all sysfs stuff for bridge 
> module. However, a new problem appears now. After running C program 
> (attached) that creates a bridge in network namespace and attaches an 
> interface to it, I got this message repeatedly:
>  kernel:[  466.758908] unregister_netdevice: waiting for lo to become free. 
> Usage count = 2
>
> It sems pretty unlikely that my kernel changes could have caused this?
>
> The unregister_netdevice message does not appear, however, if I uncomment this 
> line in child.c:
>     system("brctl setfd sim_br0 0");
>   

I was about to send a patch to disable the bridge per namespace as it 
seems it was never tested.
Can you send me your kernel patch ?

Thanks.
  -- Daniel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox