Netdev List
* Re: [E1000-devel] [BUG 2.6.30+] e100 sometimes causes oops during resume
From: Rafael J. Wysocki @ 2009-09-16 23:11 UTC (permalink / raw)
  To: Graham, David
  Cc: Karol Lewandowski, linux-kernel@vger.kernel.org,
	e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org
In-Reply-To: <13830B75AD5A2F42848F92269B11996F5BF592C3@orsmsx509.amr.corp.intel.com>

On Wednesday 16 September 2009, Graham, David wrote:
> A v2.6.30..v2.6.31 diff shows that this is probably exposed by Rafael Wysocki's
> commit 6905b1f1, which now allows systems with e100 to sleep. If I understand
> correctly, it looks like these systems simply couldn't sleep before. Is that right Rafael?

The systems where e100 is not power manageable by any means couldn't suspend
before that commit.  For the other systems, where e100 is power manageable
either with ACPI or natively, the commit doesn't change anything. 

> I don't think it's likely that the commit is a direct cause of the problem, but that the
> suspend/resume cycle now allows us to see another issue. Maybe e100 is
> leaking memory on suspend/resume cycles, or something else is leaking memory,
> or memory is becoming fragmented and the e100 driver is improperly
> requesting and being failed on an 'atomic' memory allocation  from a heavily
> fragmented memory map. Or something else.

I have a couple of test systems with e100 that don't have any resume problems,
FWIW.

Thanks,
Rafael


* Re: fanotify as syscalls
From: Eric Paris @ 2009-09-16 22:33 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Alan Cox, Alan Cox, Linus Torvalds, Evgeniy Polyakov,
	David Miller, linux-kernel, linux-fsdevel, netdev, viro, hch
In-Reply-To: <20090916214916.GB6243@shareable.org>

On Wed, 2009-09-16 at 22:49 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> > > Alan Cox wrote:
> > > > > You can't rely on the name being non-racy, but you _can_ reliably
> > > > > invalidate application-level caches from the sequence of events
> > > > > including file writes, creates, renames, links, unlinks, mounts.  And
> > > > > revalidate such caches by the absence of pending events.
> > > > 
> > > > You can't however create the caches reliably because you've no idea if
> > > > you are referencing the right object in the first place - which is why
> > > > you want a handle in these cases. I see fanotify as a handle producing
> > > > addition to inotify, not as a replacement (plus some other bits around
> > > > open blocking for HSM etc)
> > > 
> > > There are two sets of events getting mixed up here.  Inode events -
> > > reads, writes, truncates, chmods; and directory events - renames,
> > > links, creates, unlinks.
> > 
> > My understanding of your argument is that fanotify does not yet provide
> > all inotify events, namely those of directory operations and thus is
> > not suitable to wholesale replace everything inotify can do.
> 
> Largely that, plus seeing fanotify look like it'll acquire some
> capabilities that would be useful with those inotify events and
> inotify not getting them.  Bothered by the apparent direction of
> development, really.
> 
> Btw, I'm not sure you can use inotify+fanotify together simultaneously
> in this way, which may be of benefit - caching might help the
> anti-malware-style access controls.  I'll have to think carefully
> about ordering of some events, and using fanotify and inotify
> independently at the same time loses that ordering.

As soon as you say "ordering" you lose   :)

> 
> > I've already said that working towards that goal is something I plan
> > to pursue,
> 
> Sorry, I missed that, just as I didn't find a reply to Evgeniy's "I
> need pids".  And from another mail, I thought you were stopping at the
> things with file descriptors.
> 
> > but for now, you still have inotify.
> 
> That's right.  And it sucks for subtrees, so that's why I'd like to
> absorb improvements on subtree inclusions, and exclusion nodes look
> useful too.
> 
> > The mlocate/updatedb people ask me about fanotify and it's on the todo
> > list to allow global reception of such events.  The fd you get would
> > be of the dir where the event happened.  They didn't care, and I haven't
> > decided if we would provide the path component like inotify does.  Most
> > users are perfectly happy to stat everything in the dir.
> 
> mlocate/updatedb is relatively low performance and of course wants to
> be system-wide.  It's not looking so good if a user wants an indexer
> just on their /home, and the administrator does not want everyone else
> to pay the cost.
> 
> But I think we're quite agreed on how useful subtrees would be.
> System-wide events won't be needed if we can monitor the / subtree
> to get the same effect, and that'll also sort out namespaces and chroots.
> 
> Stat'ing every entry in a dir event.  Thinking out loud:
> 
>    1. Stat'ing everything in a dir just to find out which 1 file was
>       deleted can be quite expensive for some uses (if you have a
>       large dir and it happens a lot), and is unpleasant just because
>       each change _should_ result in about O(1) work.  Taste, style,
>       elegance ;-)
>
>       For POSIX filesystems, I don't see any logical problem with
>       this, actually.  You don't need to call stat()!  It's enough to
>       call readdir() and look at d_ino to track
>       renames/links/creates/unlinks - assuming only directory-change
>       events are relevant here.
> 
>       Just an unpleasant O(n) scaling with directory size.
> 
>       (Note that I ignore mount points not returning the correct
>       d_ino, because apps can track the mount list and
>       compensate; they should be doing this anyway).

I think we generally agree here.

> 
>    2. updatedb-style indexing apps don't care about the
>       readdir/stat-all-entries cost, because they don't need to read
>       the directory after every change, they only need to do it once
>       every 24 hours if any events were received in that interval!
> 
>       (Obviously this isn't the same for pseudo-real-time indexers.)

See, that's why the mlocate/updatedb people contacted me.  They don't want it
to be a 24 hour thing.  They want pseudo real time without thrashing the
hd.  Turns out, the interface is there, and the backend just needs
struct path information available to give it to them.

>       For Samba-style caching, on the other hand, the cost of
>       rescanning a large directory when one file is being read often
>       and another file in it is changing often might be prohibitive,
>       forcing it to use heuristics to decide when to monitor a
>       directory and when not to cache it, depending on directory
>       size.  I'd rather avoid that.

This seems like you are confusing the events that happen to a directory
(mv/rename) and the events that happen to a file (read/write.)   (The
former I do not have code for, the latter I do) Which is surprising
since you so clearly delineate them later, so maybe I'm
misunderstanding.

>    3. Non-POSIX filesystems don't always have stable inode numbers.
>       You can't tell that foo was renamed to bar by reading the
>       directory and looking at d_ino, or by calling stat on each entry.
> 
>       You can assume stable inode numbers for inodes where there's an
>       open file descriptor; that *might* be just enough to squeeze
>       through the logic of a cache.  I'm not sure right now.
> 
>    4. You can't tell when file contents are changed from stat info.
> 
>       That means you have to receive an inode event, not a directory
>       event for data changes, but that's not a problem of course - the
>       name-used-for-access isn't useful for data changes anyway
>       (except for debugging perhaps).
> 
>    5. stat() doesn't tell you about xattr and ACL changes.  xattrs can
>       be large and slow to read on a whole directory.  But as point 4,
>       if attribute changes count as inode changes, there's no problem.
> 
>    6. Calling stat() pulls a lot into cache that doesn't need to be in
>       cache: all those inodes.  But as I mentioned in points 1, 4 and
>       5, provided only directory name operations pass the directory to
>       be scanned, and inode operations always pass the inode, you can
>       use readdir() and avoid stat(), so the inodes don't have to be
>       pulled into cache after all.
> 
>       Except for non-POSIX inode instability.  Would be good to work
>       out if that breaks the algorithm.
> 
> In summary, calling readdir() and maybe stat/getxattr on each entry in
> a directory might be workable, but I'd rather it was avoidable.
> Simple apps may prefer to do it anyway - and let multiple events in a
> directory be merged as a result.
> 
> While I'm here it would be nice to receive one event instead of two
> for operations which involve two paths: link, rename and bind mount.
> Having to pair up two events from inotify isn't helpful in any way.

What do you propose the format of the event should be?  Is this
precluded in what's been proposed?

> Imho an API that satisfies everything we've been talking about would let
> you specify which fields you want to receive in the event when you
> bind a listener.  Not _everything_ is selectable of course, but
> whether you want:
> 
>     For inode events (data read/write, attribute/ACL/xattr changes):
> 
>     - Open file descriptor of the affected file [Optional].

Could be added and I've agreed to take a look.  I'm just not sure
bringing back the major flaw of inotify is really moving us forward.

>     - The inode number and device number (always?).

hmmm, if you have the fd you have both, if you choose to get a wd like
inotify, I say you're still on your own to do the magic map.  I don't
want to copy tons of almost always useless data into userspace.
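
A minimal sketch of the point above, written in Python for brevity: given an
open descriptor, fstat() yields both the device and the inode number, so an
event that carries an fd does not also need to carry them. The helper name
identify() is made up for illustration and is not part of any proposed API.

```python
import os
import tempfile


def identify(fd):
    """Return (device, inode) for an already-open descriptor via fstat()."""
    st = os.fstat(fd)
    return st.st_dev, st.st_ino


with tempfile.NamedTemporaryFile() as f:
    dev, ino = identify(f.fileno())
    # While the name still exists, a path-based stat() reports the same pair.
    by_path = os.stat(f.name)
    same_object = (dev, ino) == (by_path.st_dev, by_path.st_ino)
```

The reverse does not hold: a watch descriptor alone gives neither number,
which is the "magic map" userspace has to keep for itself.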

>     - A way to identify the vfsmount (due to bind mounts making the
>       device number insufficient to identify a directory; always?).

Now you want reliable path names?  I need a vfsmount in kernel to open
the fd for userspace, but I don't see how that's translatable to
anything even remotely useable by userspace....

>     For directory events (create/unlink/link/rename/reflink/mkdir/rmdir
>     /mount/umount):
> 
>     - Same as inode above, for the object created/linked/deleted.
> 
>     - Same as inode above, for the directory containing the source name.
>     - Source name [Optional].
>     - Same as above, for the directory containing the target name
>     - Target name [Optional]
> 
>     Source and target are the two names for
>     rename/link/reflink/bind-mount operations.  Otherwise there is
>     only one name to include.
> 
> Ironically, it begins to look a bit like netlink ;-)
> 
> As you can see, I've made the open descriptors optional, and the names
> for directory events optional.  For directory events, the object
> descriptor option should be independent from the source/target
> directory descriptor option.
> 
> Add one more option: wait for Ack before file accessing process can
> proceed, or don't require Ack.  That basically distinguishes inotify
> behaviour from fsnotify behaviour.

Those 2 things are completely independent.  If you request read with
blocking you get read with blocking.  If you request just read, you get
just read.

> It's not obvious, but that option's useful for directory events too,
> if you think about it: Think like an anti-malware or other access
> control manager, and ask: what if I have to block something which
> depends on the layout of files?  Just as directory events are enough
> for caching, they are enough for complete access control of
> layout-dependent state too.  For example, some line of text is no
> problem in a random file, but might be forbidden by the access manager
> from appearing in .bash_login, including by "mv harmless .bash_login".

I'd say they can realize the bad data when .bash_login is next opened
and deny access then  :)

No, it's honestly a good idea, and one that is going to likely take
serious hook movement.  A lot of these things can go on the todo list,
but don't sound like show stoppers to me....

> > It's hopefully feasible, but it's going to take some fsnotify hook
> > movements and possibly some arguments with Al to get the information I
> > want where I want it.
> 
> That may, indeed, be a sticking point :-)

My silent comrade from suse is looking at moving some hooks as we speak
so hopefully directory events can get added shortly after a merge.  I
don't think we should wait until every conceived (but not necessarily
needed) possibility is coded.  We have things that work, meet needs, and
hopefully you'll agree leave us with a place to go in the future.  Do
you?

-Eric



* Re: fanotify as syscalls
From: Jamie Lokier @ 2009-09-16 21:49 UTC (permalink / raw)
  To: Eric Paris
  Cc: Alan Cox, Alan Cox, Linus Torvalds, Evgeniy Polyakov,
	David Miller, linux-kernel, linux-fsdevel, netdev, viro, hch
In-Reply-To: <1253116380.5213.121.camel@dhcp231-106.rdu.redhat.com>

Eric Paris wrote:
> On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> > Alan Cox wrote:
> > > > You can't rely on the name being non-racy, but you _can_ reliably
> > > > invalidate application-level caches from the sequence of events
> > > > including file writes, creates, renames, links, unlinks, mounts.  And
> > > > revalidate such caches by the absence of pending events.
> > > 
> > > You can't however create the caches reliably because you've no idea if
> > > you are referencing the right object in the first place - which is why
> > > you want a handle in these cases. I see fanotify as a handle producing
> > > addition to inotify, not as a replacement (plus some other bits around
> > > open blocking for HSM etc)
> > 
> > There are two sets of events getting mixed up here.  Inode events -
> > reads, writes, truncates, chmods; and directory events - renames,
> > links, creates, unlinks.
> 
> My understanding of your argument is that fanotify does not yet provide
> all inotify events, namely those of directory operations and thus is
> not suitable to wholesale replace everything inotify can do.

Largely that, plus seeing fanotify look like it'll acquire some
capabilities that would be useful with those inotify events and
inotify not getting them.  Bothered by the apparent direction of
development, really.

Btw, I'm not sure you can use inotify+fanotify together simultaneously
in this way, which may be of benefit - caching might help the
anti-malware-style access controls.  I'll have to think carefully
about ordering of some events, and using fanotify and inotify
independently at the same time loses that ordering.

> I've already said that working towards that goal is something I plan
> to pursue,

Sorry, I missed that, just as I didn't find a reply to Evgeniy's "I
need pids".  And from another mail, I thought you were stopping at the
things with file descriptors.

> but for now, you still have inotify.

That's right.  And it sucks for subtrees, so that's why I'd like to
absorb improvements on subtree inclusions, and exclusion nodes look
useful too.

> The mlocate/updatedb people ask me about fanotify and it's on the todo
> list to allow global reception of such events.  The fd you get would
> be of the dir where the event happened.  They didn't care, and I haven't
> decided if we would provide the path component like inotify does.  Most
> users are perfectly happy to stat everything in the dir.

mlocate/updatedb is relatively low performance and of course wants to
be system-wide.  It's not looking so good if a user wants an indexer
just on their /home, and the administrator does not want everyone else
to pay the cost.

But I think we're quite agreed on how useful subtrees would be.
System-wide events won't be needed if we can monitor the / subtree
to get the same effect, and that'll also sort out namespaces and chroots.

Stat'ing every entry in a dir event.  Thinking out loud:

   1. Stat'ing everything in a dir just to find out which 1 file was
      deleted can be quite expensive for some uses (if you have a
      large dir and it happens a lot), and is unpleasant just because
      each change _should_ result in about O(1) work.  Taste, style,
      elegance ;-)

      For POSIX filesystems, I don't see any logical problem with
      this, actually.  You don't need to call stat()!  It's enough to
      call readdir() and look at d_ino to track
      renames/links/creates/unlinks - assuming only directory-change
      events are relevant here.

      Just an unpleasant O(n) scaling with directory size.

      (Note that I ignore mount points not returning the correct
      d_ino, because apps can track the mount list and
      compensate; they should be doing this anyway).

   2. updatedb-style indexing apps don't care about the
      readdir/stat-all-entries cost, because they don't need to read
      the directory after every change, they only need to do it once
      every 24 hours if any events were received in that interval!

      (Obviously this isn't the same for pseudo-real-time indexers.)

      For Samba-style caching, on the other hand, the cost of
      rescanning a large directory when one file is being read often
      and another file in it is changing often might be prohibitive,
      forcing it to use heuristics to decide when to monitor a
      directory and when not to cache it, depending on directory
      size.  I'd rather avoid that.

   3. Non-POSIX filesystems don't always have stable inode numbers.
      You can't tell that foo was renamed to bar by reading the
      directory and looking at d_ino, or by calling stat on each entry.

      You can assume stable inode numbers for inodes where there's an
      open file descriptor; that *might* be just enough to squeeze
      through the logic of a cache.  I'm not sure right now.

   4. You can't tell when file contents are changed from stat info.

      That means you have to receive an inode event, not a directory
      event for data changes, but that's not a problem of course - the
      name-used-for-access isn't useful for data changes anyway
      (except for debugging perhaps).

   5. stat() doesn't tell you about xattr and ACL changes.  xattrs can
      be large and slow to read on a whole directory.  But as point 4,
      if attribute changes count as inode changes, there's no problem.

   6. Calling stat() pulls a lot into cache that doesn't need to be in
      cache: all those inodes.  But as I mentioned in points 1, 4 and
      5, provided only directory name operations pass the directory to
      be scanned, and inode operations always pass the inode, you can
      use readdir() and avoid stat(), so the inodes don't have to be
      pulled into cache after all.

      Except for non-POSIX inode instability.  Would be good to work
      out if that breaks the algorithm.

In summary, calling readdir() and maybe stat/getxattr on each entry in
a directory might be workable, but I'd rather it was avoidable.
Simple apps may prefer to do it anyway - and let multiple events in a
directory be merged as a result.
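
The readdir()-only rescan from point 1 can be sketched as follows. This is an
illustration under assumptions, not code from any existing tool: snapshot()
and diff() are hypothetical helpers, Python's os.scandir() stands in for
readdir(), and DirEntry.inode() is used because on POSIX systems it returns
d_ino without a stat() call. The sketch assumes stable inode numbers (so
point 3 still applies) and cannot tell a rename from an unlink followed by a
create that happens to reuse the same inode number.

```python
import os
import tempfile


def snapshot(path):
    """Map each entry name to its inode number, using only readdir() data."""
    with os.scandir(path) as entries:
        return {e.name: e.inode() for e in entries}


def diff(old, new):
    """Classify name changes between two snapshots of the same directory."""
    removed = {n: i for n, i in old.items() if new.get(n) != i}
    added = {n: i for n, i in new.items() if old.get(n) != i}
    old_inodes = set(old.values())

    removed_by_ino = {}
    for name, ino in sorted(removed.items()):
        removed_by_ino.setdefault(ino, []).append(name)

    renames, others = [], []
    for name, ino in sorted(added.items()):
        if removed_by_ino.get(ino):
            # Same inode vanished under one name and appeared under another.
            renames.append(("rename", removed_by_ino[ino].pop(0), name))
        elif ino in old_inodes:
            # Inode was already present, so this is an additional name.
            others.append(("link", name))
        else:
            others.append(("create", name))

    unlinks = [("unlink", n) for names in removed_by_ino.values() for n in names]
    return renames + unlinks + others


with tempfile.TemporaryDirectory() as d:
    for name in ("a", "b"):
        open(os.path.join(d, name), "w").close()
    before = snapshot(d)
    os.rename(os.path.join(d, "a"), os.path.join(d, "c"))
    # Create "new" while "b" still exists so it cannot reuse b's inode number.
    open(os.path.join(d, "new"), "w").close()
    os.remove(os.path.join(d, "b"))
    events = diff(before, snapshot(d))
```

The cost is one pass over the directory per rescan, which is the O(n)
scaling mentioned in point 1.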

While I'm here it would be nice to receive one event instead of two
for operations which involve two paths: link, rename and bind mount.
Having to pair up two events from inotify isn't helpful in any way.
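
For reference, the pairing that inotify currently leaves to userspace looks
roughly like this. A rename arrives as an IN_MOVED_FROM event and an
IN_MOVED_TO event that share a cookie value. The sketch below parses raw
struct inotify_event records from a read() buffer and joins them by cookie;
parse(), pair_moves() and fake_event() are hypothetical names, and
fake_event() exists only to build sample input without a real watch.

```python
import os
import struct

IN_MOVED_FROM = 0x00000040
IN_MOVED_TO = 0x00000080
# struct inotify_event header: int wd; uint32_t mask, cookie, len;
HEADER = struct.Struct("iIII")


def parse(buf):
    """Yield (wd, mask, cookie, name) for each record in a raw buffer."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        wd, mask, cookie, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        raw = buf[offset:offset + length]
        offset += length
        # The name field is NUL-terminated and NUL-padded.
        yield wd, mask, cookie, os.fsdecode(raw.split(b"\0", 1)[0])


def pair_moves(events):
    """Join MOVED_FROM/MOVED_TO halves into (source, target) tuples."""
    pending = {}
    pairs = []
    for wd, mask, cookie, name in events:
        if mask & IN_MOVED_FROM:
            pending[cookie] = (wd, name)
        elif mask & IN_MOVED_TO:
            # No matching FROM means the source was outside the watched set.
            pairs.append((pending.pop(cookie, None), (wd, name)))
    # A FROM with no TO means the target was outside the watched set.
    pairs.extend((src, None) for src in pending.values())
    return pairs


def fake_event(wd, mask, cookie, name):
    """Build one record in the kernel's layout, for demonstration only."""
    raw = os.fsencode(name) + b"\0"
    raw += b"\0" * (-len(raw) % 16)
    return HEADER.pack(wd, mask, cookie, len(raw)) + raw


buf = fake_event(1, IN_MOVED_FROM, 7, "old") + fake_event(2, IN_MOVED_TO, 7, "new")
pairs = pair_moves(parse(buf))
```

A single two-path event would make the pending table and the half-matched
cases unnecessary.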

Imho an API that satisfies everything we've been talking about would let
you specify which fields you want to receive in the event when you
bind a listener.  Not _everything_ is selectable of course, but
whether you want:

    For inode events (data read/write, attribute/ACL/xattr changes):

    - Open file descriptor of the affected file [Optional].
    - The inode number and device number (always?).
    - A way to identify the vfsmount (due to bind mounts making the
      device number insufficient to identify a directory; always?).

    For directory events (create/unlink/link/rename/reflink/mkdir/rmdir
    /mount/umount):

    - Same as inode above, for the object created/linked/deleted.

    - Same as inode above, for the directory containing the source name.
    - Source name [Optional].
    - Same as above, for the directory containing the target name
    - Target name [Optional]

    Source and target are the two names for
    rename/link/reflink/bind-mount operations.  Otherwise there is
    only one name to include.

Ironically, it begins to look a bit like netlink ;-)

As you can see, I've made the open descriptors optional, and the names
for directory events optional.  For directory events, the object
descriptor option should be independent from the source/target
directory descriptor option.
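
To make the shape of such an event concrete, here is a rough model of the
field selection described above. Every name in it (Want, ObjectId, DirEvent,
select, mount_id) is invented for illustration; nothing here corresponds to
an existing or proposed kernel interface, and mount_id is only a placeholder
for "a way to identify the vfsmount".

```python
from dataclasses import dataclass, replace
from enum import Flag, auto
from typing import Optional


class Want(Flag):
    """Hypothetical per-listener choices of optional event fields."""
    OBJECT_FD = auto()  # open descriptor for the affected object
    DIR_FD = auto()     # open descriptors for source/target directories
    NAMES = auto()      # source/target names for directory events


@dataclass(frozen=True)
class ObjectId:
    dev: int
    ino: int
    mount_id: Optional[int] = None
    fd: Optional[int] = None


@dataclass(frozen=True)
class DirEvent:
    op: str
    obj: ObjectId
    source_dir: ObjectId
    source_name: Optional[str] = None
    target_dir: Optional[ObjectId] = None
    target_name: Optional[str] = None


def _strip_fd(obj, keep):
    """Drop the descriptor from an ObjectId unless the listener asked for it."""
    return obj if obj is None or keep else replace(obj, fd=None)


def select(event, want):
    """Return the event as a listener with the given options would see it."""
    names = Want.NAMES in want
    return replace(
        event,
        obj=_strip_fd(event.obj, Want.OBJECT_FD in want),
        source_dir=_strip_fd(event.source_dir, Want.DIR_FD in want),
        target_dir=_strip_fd(event.target_dir, Want.DIR_FD in want),
        source_name=event.source_name if names else None,
        target_name=event.target_name if names else None,
    )


full = DirEvent("rename", ObjectId(8, 100, fd=5),
                ObjectId(8, 2, fd=6), "a", ObjectId(8, 3, fd=7), "b")
names_only = select(full, Want.NAMES)
fds_only = select(full, Want.OBJECT_FD | Want.DIR_FD)
```

Keeping the object descriptor and the directory descriptors as separate
flags mirrors the independence asked for in the paragraph above.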

Add one more option: wait for Ack before file accessing process can
proceed, or don't require Ack.  That basically distinguishes inotify
behaviour from fsnotify behaviour.

It's not obvious, but that option's useful for directory events too,
if you think about it: Think like an anti-malware or other access
control manager, and ask: what if I have to block something which
depends on the layout of files?  Just as directory events are enough
for caching, they are enough for complete access control of
layout-dependent state too.  For example, some line of text is no
problem in a random file, but might be forbidden by the access manager
from appearing in .bash_login, including by "mv harmless .bash_login".

The above is not a final proposal, but I'd be delighted if you'd take
a look at whether it's suitable.  I realise some things may not work
out for implementation reasons.

Meanwhile, I'll take a look at userspace code for my caching algorithm
and see how well that works out.  I think we'll get subtree monitors
out of this before the month is over...

> It's hopefully feasible, but it's going to take some fsnotify hook
> movements and possibly some arguments with Al to get the information I
> want where I want it.

That may, indeed, be a sticking point :-)

> But there is nothing about the interface that
> precludes it and it has been discussed and considered.
> 
> Am I still missing it?

No I think we're on the same wavelength now.  Thanks for being
patient.  (And thanks, Alan, for stepping in and making me describe
what I had in mind better).

-- Jamie


* Re: [E1000-devel] [BUG 2.6.30+] e100 sometimes causes oops during resume
From: Karol Lewandowski @ 2009-09-16 21:17 UTC (permalink / raw)
  To: Graham, David
  Cc: Karol Lewandowski, Rafael J. Wysocki,
	linux-kernel@vger.kernel.org, e1000-devel@lists.sourceforge.net,
	netdev@vger.kernel.org
In-Reply-To: <4AB1536F.9070205@intel.com>

On Wed, Sep 16, 2009 at 02:06:55PM -0700, Graham, David wrote:
> Karol Lewandowski wrote:
>>
>> I can test, or rather -- start testing this tomorrow, if this makes
>> sense to you.
>>
> Yes please Karol, test with 6905b1f1...removed.

Ok.


> I will continue testing  
> here too on 2.6.31, though I have not had a repro yet. And are you  
> performing a standby or hibernate, I want to be sure that I'm resuming  
> from the same state.

I don't use hibernate, I always suspend to ram.

Thanks.


* question about niu linux driver behaviour
From: Chris Friesen @ 2009-09-16 21:13 UTC (permalink / raw)
  To: netdev, David S. Miller, Santwona.Behera, Matheos Worku

Hi all,

I've got a Sun 3320 ATCA "alonso" board with Neptune network devices.
I'm running basically 2.6.27.8, and I'm seeing some odd behaviour in the
niu driver.

My understanding is that there are four ports. 0 and 1 are XMAC links,
while 10, and 11 are BMAC links.  On our board these normally map to
eth4/5/6/7.

At boot time, if the other end for the XMAC links is not up, then we see
messages that look like:

niu 0000:08:00.0: niu: Port 0 signal bits [00000000] are not [30000000]
niu 0000:08:00.0: niu: Port 0 10G/1G SERDES Link Failed
niu 0000:08:00.1: niu: Port 1 signal bits [00000000] are not [0c000000]
niu 0000:08:00.1: niu: Port 1 10G/1G SERDES Link Failed

and the BMAC links come up as eth4/eth5.

This seems to be due to the fact that we're checking ESR_INT_SIGNALS for
ESR_INT_DET0_P0 and ESR_INT_DET0_P1, and then failing because they're
not set.  (The 10G initialization behaves similarly.)
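
As a rough model of that check (not the driver's actual code): the expected
values below are copied from the log lines above, and the assumption that the
register is masked with the same bits before comparison is mine. check_port()
and EXPECTED are hypothetical names.

```python
# Expected ESR_INT_SIGNALS bits per port, taken from the boot messages.
EXPECTED = {0: 0x30000000, 1: 0x0C000000}


def check_port(port, sig):
    """Return None if the port's signal bits are all set, else a log-style message."""
    want = EXPECTED[port]
    got = sig & want
    if got != want:
        return "Port %d signal bits [%08x] are not [%08x]" % (port, got, want)
    return None


link_down = check_port(0, 0x00000000)
link_up = check_port(1, 0x0C000000)
```

With the far end down the register reads as zero, so the check fails no
matter how the local SERDES was programmed.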

Is this valid behaviour?  It seems fragile for successful serdes
initialization to depend on the other end of the link being up at driver
init time.

Chris


* Re: [BUG 2.6.30+] e100 sometimes causes oops during resume
From: Graham, David @ 2009-09-16 21:06 UTC (permalink / raw)
  To: Karol Lewandowski
  Cc: Rafael J. Wysocki, e1000-devel@lists.sourceforge.net,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20090916014448.GA1070@bizet.domek.prywatny>

Karol Lewandowski wrote:
> 
> I can test, or rather -- start testing this tomorrow, if this makes
> sense to you.
> 
Yes please Karol, test with 6905b1f1...removed. I will continue testing 
here too on 2.6.31, though I have not had a repro yet. And are you 
performing a standby or hibernate, I want to be sure that I'm resuming 
from the same state.
Thanks.
> Thanks.
> 




* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-16 21:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Michael S. Tsirkin, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB13B09.5040308@gmail.com>

On 09/16/2009 10:22 PM, Gregory Haskins wrote:
> Avi Kivity wrote:
>    
>> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>>      
>>>> If kvm can do it, others can.
>>>>
>>>>          
>>> The problem is that you seem to either hand-wave over details like this,
>>> or you give details that are pretty much exactly what vbus does already.
>>>    My point is that I've already sat down and thought about these issues
>>> and solved them in a freely available GPL'ed software package.
>>>
>>>        
>> In the kernel.  IMO that's the wrong place for it.
>>      
> 3) "in-kernel": You can do something like virtio-net to vhost to
> potentially meet some of the requirements, but not all.
>
> In order to fully meet (3), you would need to do some of that stuff you
> mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
> we need to have a facility for mapping eventfds and establishing a
> signaling mechanism (like PIO+qid), etc. KVM does this with
> IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be
> invented.
>    

irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.

> To meet performance, this stuff has to be in kernel and there has to be
> a way to manage it.

and management belongs in userspace.

> Since vbus was designed to do exactly that, this is
> what I would advocate.  You could also reinvent these concepts and put
> your own mux and mapping code in place, in addition to all the other
> stuff that vbus does.  But I am not clear why anyone would want to.
>    

Maybe they like their backward compatibility and Windows support.

> So no, the kernel is not the wrong place for it.  It's the _only_ place
> for it.  Otherwise, just use (1) and be done with it.
>
>    

I'm talking about the config stuff, not the data path.

>>   Further, if we adopt
>> vbus, we drop compatibility with existing guests or have to support both
>> vbus and virtio-pci.
>>      
> We already need to support both (at least to support Ira).  virtio-pci
> doesn't work here.  Something else (vbus, or vbus-like) is needed.
>    

virtio-ira.

>>> So the question is: is your position that vbus is all wrong and you wish
>>> to create a new bus-like thing to solve the problem?
>>>        
>> I don't intend to create anything new, I am satisfied with virtio.  If
>> it works for Ira, excellent.  If not, too bad.
>>      
> I think that about sums it up, then.
>    

Yes.  I'm all for reusing virtio, but I'm not going to switch to vbus or 
support both for this esoteric use case.

>>> If so, how is it
>>> different from what I've already done?  More importantly, what specific
>>> objections do you have to what I've done, as perhaps they can be fixed
>>> instead of starting over?
>>>
>>>        
>> The two biggest objections are:
>> - the host side is in the kernel
>>      
> As it needs to be.
>    

vhost-net somehow manages to work without the config stuff in the kernel.

> With all due respect, based on all of your comments in aggregate I
> really do not think you are truly grasping what I am actually building here.
>    

Thanks.



>>> Bingo.  So now it's a question of do you want to write this layer from
>>> scratch, or re-use my framework.
>>>
>>>        
>> You will have to implement a connector or whatever for vbus as well.
>> vbus has more layers so it's probably smaller for vbus.
>>      
> Bingo!

(addictive, isn't it)

> That is precisely the point.
>
> All the stuff for how to map eventfds, handle signal mitigation, demux
> device/function pointers, isolation, etc, are built in.  All the
> connector has to do is transport the 4-6 verbs and provide a memory
> mapping/copy function, and the rest is reusable.  The device models
> would then work in all environments unmodified, and likewise the
> connectors could use all device-models unmodified.
>    

Well, virtio has a similar abstraction on the guest side.  The host side 
abstraction is limited to signalling since all configuration is in 
userspace.  vhost-net ought to work for lguest and s390 without change.

>> It was already implemented three times for virtio, so apparently that's
>> extensible too.
>>      
> And to my point, I'm trying to commoditize as much of that process as
> possible on both the front and backends (at least for cases where
> performance matters) so that you don't need to reinvent the wheel for
> each one.
>    

Since you're interested in any-to-any connectors it makes sense to you.  
I'm only interested in kvm-host-to-kvm-guest, so reducing the already 
minor effort to implement a new virtio binding has little appeal to me.

>> You mean, if the x86 board was able to access the disks and dma into the
>> ppc board's memory?  You'd run vhost-blk on x86 and virtio-net on ppc.
>>      
> But as we discussed, vhost doesn't work well if you try to run it on the
> x86 side due to its assumptions about pagable "guest" memory, right?  So
> is that even an option?  And even still, you would still need to solve
> the aggregation problem so that multiple devices can coexist.
>    

I don't know.  Maybe it can be made to work and maybe it cannot.  It 
probably can with some determined hacking.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* pull request: wireless-next-2.6 2009-09-16
From: John W. Linville @ 2009-09-16 20:41 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev, linux-kernel

Dave,

Here is a batch of fixes for 2.6.32...nothing too controversial
AFAICT...

Please let me know if there are problems!

John

---

Individual patches are available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-next-2.6/

---

The following changes since commit 13af7a6ea502fcdd4c0e3d7de6e332b102309491:
  Dhananjay Phadke (1):
        netxen: update copyright

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6.git master

Christian Lamparter (1):
      p54usb: add Zcomax XG-705A usbid

Daniel C Halperin (1):
      iwlwifi: fix HT operation in 2.4 GHz band

Holger Schurig (2):
      cfg80211: use cfg80211_wext_freq() for freq conversion
      cfg80211: minimal error handling for wext-compat freq scanning

Johannes Berg (2):
      iwlwifi: disable powersave for 4965
      cfg80211: fix SME connect

Larry Finger (1):
      ssb: Fix error when V1 SPROM extraction is forced

Luis R. Rodriguez (1):
      wireless: default CONFIG_WLAN to y

Martin Decky (1):
      hostap: Revert a toxic part of the conversion to net_device_ops

Michael Buesch (3):
      b43: Force-wake queues on init
      ssb: Disable verbose SDIO coreswitch
      b43: Fix resume failure

Pavel Roskin (1):
      rc80211_minstrel: fix contention window calculation

Randy Dunlap (1):
      ssb/sdio: fix printk format warnings

Reinette Chatre (1):
      iwlwifi: fix potential rx buffer loss

Sujith (1):
      ath9k: Fix bug in ANI channel handling

Wey-Yi Guy (1):
      iwlwifi: find the correct first antenna

 drivers/net/wireless/Kconfig                |    1 +
 drivers/net/wireless/ath/ath9k/ani.c        |    6 ++++--
 drivers/net/wireless/b43/main.c             |    8 +++-----
 drivers/net/wireless/hostap/hostap_main.c   |    3 ++-
 drivers/net/wireless/iwlwifi/iwl-4965.c     |    1 +
 drivers/net/wireless/iwlwifi/iwl-agn-rs.c   |   10 +++++++++-
 drivers/net/wireless/iwlwifi/iwl-core.c     |    9 ++++++---
 drivers/net/wireless/iwlwifi/iwl-core.h     |    1 +
 drivers/net/wireless/iwlwifi/iwl-power.c    |    5 +++--
 drivers/net/wireless/iwlwifi/iwl-rx.c       |   24 +++++++++++++++++-------
 drivers/net/wireless/iwlwifi/iwl3945-base.c |   24 ++++++++++++++++--------
 drivers/net/wireless/p54/p54usb.c           |    1 +
 drivers/ssb/pci.c                           |    1 +
 drivers/ssb/sdio.c                          |    6 +++---
 net/mac80211/rc80211_minstrel.c             |    2 +-
 net/wireless/scan.c                         |    7 ++++++-
 net/wireless/sme.c                          |   21 +++++++++++++--------
 17 files changed, 88 insertions(+), 42 deletions(-)
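
The iwl-rx.c and iwl3945-base.c hunks below share one pattern: allocate the
skb with the queue lock dropped, then re-take the lock and re-check rx_used
before unlinking an entry.  In the old code the entry was unlinked first, so a
failed alloc_skb() left the loop with that entry on no list.  A minimal
userspace sketch of the pattern, assuming a plain flag as a stand-in for the
spinlock and malloc() as a stand-in for alloc_skb(); struct rx_buf,
struct rx_queue, q_lock(), q_unlock() and rx_allocate() are hypothetical names,
not the driver's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct rx_buf {
	struct rx_buf *next;
	void *data;
};

struct rx_queue {
	bool locked;		/* stand-in for the spinlock (assumption) */
	struct rx_buf *used;	/* entries waiting for a fresh buffer */
};

static void q_lock(struct rx_queue *q)
{
	assert(!q->locked);
	q->locked = true;
}

static void q_unlock(struct rx_queue *q)
{
	assert(q->locked);
	q->locked = false;
}

/* Attach a fresh buffer to every entry on q->used; returns how many were filled. */
static int rx_allocate(struct rx_queue *q, size_t size)
{
	int filled = 0;

	for (;;) {
		/* Peek only: do not unlink anything yet. */
		q_lock(q);
		bool empty = (q->used == NULL);
		q_unlock(q);
		if (empty)
			return filled;

		/* Allocate with the lock dropped. */
		void *data = malloc(size);
		if (!data)
			return filled;	/* nothing was unlinked, so nothing is lost */

		/* Re-check under the lock; the list may have drained meanwhile. */
		q_lock(q);
		struct rx_buf *rxb = q->used;
		if (!rxb) {
			q_unlock(q);
			free(data);
			return filled;
		}
		q->used = rxb->next;
		q_unlock(q);

		rxb->next = NULL;
		rxb->data = data;
		filled++;
	}
}
```

The point of the ordering is that an entry leaves the list only once a buffer
is already in hand.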

diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
index ad89d23..49ea9c9 100644
--- a/drivers/net/wireless/Kconfig
+++ b/drivers/net/wireless/Kconfig
@@ -5,6 +5,7 @@
 menuconfig WLAN
 	bool "Wireless LAN"
 	depends on !S390
+	default y
 	---help---
 	  This section contains all the pre 802.11 and 802.11 wireless
 	  device drivers. For a complete list of drivers and documentation
diff --git a/drivers/net/wireless/ath/ath9k/ani.c b/drivers/net/wireless/ath/ath9k/ani.c
index a7cbb07..2b49374 100644
--- a/drivers/net/wireless/ath/ath9k/ani.c
+++ b/drivers/net/wireless/ath/ath9k/ani.c
@@ -327,7 +327,8 @@ static void ath9k_hw_ani_ofdm_err_trigger(struct ath_hw *ah)
 					     aniState->firstepLevel + 1);
 		return;
 	} else {
-		if (conf->channel->band == IEEE80211_BAND_2GHZ) {
+		if ((conf->channel->band == IEEE80211_BAND_2GHZ) &&
+		    !conf_is_ht(conf)) {
 			if (!aniState->ofdmWeakSigDetectOff)
 				ath9k_hw_ani_control(ah,
 				     ATH9K_ANI_OFDM_WEAK_SIGNAL_DETECTION,
@@ -369,7 +370,8 @@ static void ath9k_hw_ani_cck_err_trigger(struct ath_hw *ah)
 			ath9k_hw_ani_control(ah, ATH9K_ANI_FIRSTEP_LEVEL,
 					     aniState->firstepLevel + 1);
 	} else {
-		if (conf->channel->band == IEEE80211_BAND_2GHZ) {
+		if ((conf->channel->band == IEEE80211_BAND_2GHZ) &&
+		    !conf_is_ht(conf)) {
 			if (aniState->firstepLevel > 0)
 				ath9k_hw_ani_control(ah,
 					     ATH9K_ANI_FIRSTEP_LEVEL, 0);
diff --git a/drivers/net/wireless/b43/main.c b/drivers/net/wireless/b43/main.c
index 7a9a3fa..e789792 100644
--- a/drivers/net/wireless/b43/main.c
+++ b/drivers/net/wireless/b43/main.c
@@ -2289,11 +2289,7 @@ static int b43_upload_microcode(struct b43_wldev *dev)
 			err = -ENODEV;
 			goto error;
 		}
-		msleep_interruptible(50);
-		if (signal_pending(current)) {
-			err = -EINTR;
-			goto error;
-		}
+		msleep(50);
 	}
 	b43_read32(dev, B43_MMIO_GEN_IRQ_REASON);	/* dummy read */
 
@@ -4287,6 +4283,8 @@ static int b43_wireless_core_init(struct b43_wldev *dev)
 	if (!dev->suspend_in_progress)
 		b43_rng_init(wl);
 
+	ieee80211_wake_queues(dev->wl->hw);
+
 	b43_set_status(dev, B43_STAT_INITIALIZED);
 
 	if (!dev->suspend_in_progress)
diff --git a/drivers/net/wireless/hostap/hostap_main.c b/drivers/net/wireless/hostap/hostap_main.c
index 6fe122f..eb57d1e 100644
--- a/drivers/net/wireless/hostap/hostap_main.c
+++ b/drivers/net/wireless/hostap/hostap_main.c
@@ -875,15 +875,16 @@ void hostap_setup_dev(struct net_device *dev, local_info_t *local,
 
 	switch(type) {
 	case HOSTAP_INTERFACE_AP:
+		dev->tx_queue_len = 0;	/* use main radio device queue */
 		dev->netdev_ops = &hostap_mgmt_netdev_ops;
 		dev->type = ARPHRD_IEEE80211;
 		dev->header_ops = &hostap_80211_ops;
 		break;
 	case HOSTAP_INTERFACE_MASTER:
-		dev->tx_queue_len = 0;	/* use main radio device queue */
 		dev->netdev_ops = &hostap_master_ops;
 		break;
 	default:
+		dev->tx_queue_len = 0;	/* use main radio device queue */
 		dev->netdev_ops = &hostap_netdev_ops;
 	}
 
diff --git a/drivers/net/wireless/iwlwifi/iwl-4965.c b/drivers/net/wireless/iwlwifi/iwl-4965.c
index 6a13bfb..ca61d37 100644
--- a/drivers/net/wireless/iwlwifi/iwl-4965.c
+++ b/drivers/net/wireless/iwlwifi/iwl-4965.c
@@ -2346,6 +2346,7 @@ struct iwl_cfg iwl4965_agn_cfg = {
 	.mod_params = &iwl4965_mod_params,
 	.use_isr_legacy = true,
 	.ht_greenfield_support = false,
+	.broken_powersave = true,
 };
 
 /* Module firmware */
diff --git a/drivers/net/wireless/iwlwifi/iwl-agn-rs.c b/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
index 40b207a..346dc06 100644
--- a/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
+++ b/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
@@ -760,6 +760,7 @@ static u32 rs_get_lower_rate(struct iwl_lq_sta *lq_sta,
 	u16 high_low;
 	u8 switch_to_legacy = 0;
 	u8 is_green = lq_sta->is_green;
+	struct iwl_priv *priv = lq_sta->drv;
 
 	/* check if we need to switch from HT to legacy rates.
 	 * assumption is that mandatory rates (1Mbps or 6Mbps)
@@ -773,7 +774,8 @@ static u32 rs_get_lower_rate(struct iwl_lq_sta *lq_sta,
 			tbl->lq_type = LQ_G;
 
 		if (num_of_ant(tbl->ant_type) > 1)
-			tbl->ant_type = ANT_A;/*FIXME:RS*/
+			tbl->ant_type =
+				first_antenna(priv->hw_params.valid_tx_ant);
 
 		tbl->is_ht40 = 0;
 		tbl->is_SGI = 0;
@@ -883,6 +885,12 @@ static void rs_tx_status(void *priv_r, struct ieee80211_supported_band *sband,
 		mac_index &= RATE_MCS_CODE_MSK;	/* Remove # of streams */
 		if (mac_index >= (IWL_RATE_9M_INDEX - IWL_FIRST_OFDM_RATE))
 			mac_index++;
+		/*
+		 * mac80211 HT index is always zero-indexed; we need to move
+		 * HT OFDM rates after CCK rates in 2.4 GHz band
+		 */
+		if (priv->band == IEEE80211_BAND_2GHZ)
+			mac_index += IWL_FIRST_OFDM_RATE;
 	}
 
 	if ((mac_index < 0) ||
diff --git a/drivers/net/wireless/iwlwifi/iwl-core.c b/drivers/net/wireless/iwlwifi/iwl-core.c
index acfd7b4..fd26c0d 100644
--- a/drivers/net/wireless/iwlwifi/iwl-core.c
+++ b/drivers/net/wireless/iwlwifi/iwl-core.c
@@ -1585,9 +1585,12 @@ int iwl_setup_mac(struct iwl_priv *priv)
 	hw->flags = IEEE80211_HW_SIGNAL_DBM |
 		    IEEE80211_HW_NOISE_DBM |
 		    IEEE80211_HW_AMPDU_AGGREGATION |
-		    IEEE80211_HW_SPECTRUM_MGMT |
-		    IEEE80211_HW_SUPPORTS_PS |
-		    IEEE80211_HW_SUPPORTS_DYNAMIC_PS;
+		    IEEE80211_HW_SPECTRUM_MGMT;
+
+	if (!priv->cfg->broken_powersave)
+		hw->flags |= IEEE80211_HW_SUPPORTS_PS |
+			     IEEE80211_HW_SUPPORTS_DYNAMIC_PS;
+
 	hw->wiphy->interface_modes =
 		BIT(NL80211_IFTYPE_STATION) |
 		BIT(NL80211_IFTYPE_ADHOC);
diff --git a/drivers/net/wireless/iwlwifi/iwl-core.h b/drivers/net/wireless/iwlwifi/iwl-core.h
index c04d2a2..7ff9ffb 100644
--- a/drivers/net/wireless/iwlwifi/iwl-core.h
+++ b/drivers/net/wireless/iwlwifi/iwl-core.h
@@ -252,6 +252,7 @@ struct iwl_cfg {
 	const u16 max_ll_items;
 	const bool shadow_ram_support;
 	const bool ht_greenfield_support;
+	const bool broken_powersave;
 };
 
 /***************************
diff --git a/drivers/net/wireless/iwlwifi/iwl-power.c b/drivers/net/wireless/iwlwifi/iwl-power.c
index 4ec6a83..60be976 100644
--- a/drivers/net/wireless/iwlwifi/iwl-power.c
+++ b/drivers/net/wireless/iwlwifi/iwl-power.c
@@ -292,8 +292,9 @@ int iwl_power_update_mode(struct iwl_priv *priv, bool force)
 	else
 		dtimper = 1;
 
-	/* TT power setting overwrites everything */
-	if (tt->state >= IWL_TI_1)
+	if (priv->cfg->broken_powersave)
+		iwl_power_sleep_cam_cmd(priv, &cmd);
+	else if (tt->state >= IWL_TI_1)
 		iwl_static_sleep_cmd(priv, &cmd, tt->tt_power_mode, dtimper);
 	else if (!enabled)
 		iwl_power_sleep_cam_cmd(priv, &cmd);
diff --git a/drivers/net/wireless/iwlwifi/iwl-rx.c b/drivers/net/wireless/iwlwifi/iwl-rx.c
index 8150c5c..b90adcb 100644
--- a/drivers/net/wireless/iwlwifi/iwl-rx.c
+++ b/drivers/net/wireless/iwlwifi/iwl-rx.c
@@ -239,26 +239,22 @@ void iwl_rx_allocate(struct iwl_priv *priv, gfp_t priority)
 	struct iwl_rx_queue *rxq = &priv->rxq;
 	struct list_head *element;
 	struct iwl_rx_mem_buffer *rxb;
+	struct sk_buff *skb;
 	unsigned long flags;
 
 	while (1) {
 		spin_lock_irqsave(&rxq->lock, flags);
-
 		if (list_empty(&rxq->rx_used)) {
 			spin_unlock_irqrestore(&rxq->lock, flags);
 			return;
 		}
-		element = rxq->rx_used.next;
-		rxb = list_entry(element, struct iwl_rx_mem_buffer, list);
-		list_del(element);
-
 		spin_unlock_irqrestore(&rxq->lock, flags);
 
 		/* Alloc a new receive buffer */
-		rxb->skb = alloc_skb(priv->hw_params.rx_buf_size + 256,
+		skb = alloc_skb(priv->hw_params.rx_buf_size + 256,
 						priority);
 
-		if (!rxb->skb) {
+		if (!skb) {
 			IWL_CRIT(priv, "Can not allocate SKB buffers\n");
 			/* We don't reschedule replenish work here -- we will
 			 * call the restock method and if it still needs
@@ -266,6 +262,20 @@ void iwl_rx_allocate(struct iwl_priv *priv, gfp_t priority)
 			break;
 		}
 
+		spin_lock_irqsave(&rxq->lock, flags);
+
+		if (list_empty(&rxq->rx_used)) {
+			spin_unlock_irqrestore(&rxq->lock, flags);
+			dev_kfree_skb_any(skb);
+			return;
+		}
+		element = rxq->rx_used.next;
+		rxb = list_entry(element, struct iwl_rx_mem_buffer, list);
+		list_del(element);
+
+		spin_unlock_irqrestore(&rxq->lock, flags);
+
+		rxb->skb = skb;
 		/* Get physical address of RB/SKB */
 		rxb->real_dma_addr = pci_map_single(
 					priv->pci_dev,
diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
index 2238c9f..0909668 100644
--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
@@ -1134,6 +1134,7 @@ static void iwl3945_rx_allocate(struct iwl_priv *priv, gfp_t priority)
 	struct iwl_rx_queue *rxq = &priv->rxq;
 	struct list_head *element;
 	struct iwl_rx_mem_buffer *rxb;
+	struct sk_buff *skb;
 	unsigned long flags;
 
 	while (1) {
@@ -1143,17 +1144,11 @@ static void iwl3945_rx_allocate(struct iwl_priv *priv, gfp_t priority)
 			spin_unlock_irqrestore(&rxq->lock, flags);
 			return;
 		}
-
-		element = rxq->rx_used.next;
-		rxb = list_entry(element, struct iwl_rx_mem_buffer, list);
-		list_del(element);
 		spin_unlock_irqrestore(&rxq->lock, flags);
 
 		/* Alloc a new receive buffer */
-		rxb->skb =
-		    alloc_skb(priv->hw_params.rx_buf_size,
-				priority);
-		if (!rxb->skb) {
+		skb = alloc_skb(priv->hw_params.rx_buf_size, priority);
+		if (!skb) {
 			if (net_ratelimit())
 				IWL_CRIT(priv, ": Can not allocate SKB buffers\n");
 			/* We don't reschedule replenish work here -- we will
@@ -1162,6 +1157,19 @@ static void iwl3945_rx_allocate(struct iwl_priv *priv, gfp_t priority)
 			break;
 		}
 
+		spin_lock_irqsave(&rxq->lock, flags);
+		if (list_empty(&rxq->rx_used)) {
+			spin_unlock_irqrestore(&rxq->lock, flags);
+			dev_kfree_skb_any(skb);
+			return;
+		}
+		element = rxq->rx_used.next;
+		rxb = list_entry(element, struct iwl_rx_mem_buffer, list);
+		list_del(element);
+		spin_unlock_irqrestore(&rxq->lock, flags);
+
+		rxb->skb = skb;
+
 		/* If radiotap head is required, reserve some headroom here.
 		 * The physical head count is a variable rx_stats->phy_count.
 		 * We reserve 4 bytes here. Plus these extra bytes, the
diff --git a/drivers/net/wireless/p54/p54usb.c b/drivers/net/wireless/p54/p54usb.c
index e44460f..17e1995 100644
--- a/drivers/net/wireless/p54/p54usb.c
+++ b/drivers/net/wireless/p54/p54usb.c
@@ -67,6 +67,7 @@ static struct usb_device_id p54u_table[] __devinitdata = {
 	{USB_DEVICE(0x0bf8, 0x1009)},   /* FUJITSU E-5400 USB D1700*/
 	{USB_DEVICE(0x0cde, 0x0006)},   /* Medion MD40900 */
 	{USB_DEVICE(0x0cde, 0x0008)},	/* Sagem XG703A */
+	{USB_DEVICE(0x0cde, 0x0015)},	/* Zcomax XG-705A */
 	{USB_DEVICE(0x0d8e, 0x3762)},	/* DLink DWL-G120 Cohiba */
 	{USB_DEVICE(0x124a, 0x4025)},	/* IOGear GWU513 (GW3887IK chip) */
 	{USB_DEVICE(0x1260, 0xee22)},	/* SMC 2862W-G version 2 */
diff --git a/drivers/ssb/pci.c b/drivers/ssb/pci.c
index f853d56..9e50896 100644
--- a/drivers/ssb/pci.c
+++ b/drivers/ssb/pci.c
@@ -600,6 +600,7 @@ static int sprom_extract(struct ssb_bus *bus, struct ssb_sprom *out,
 			ssb_printk(KERN_WARNING PFX "Unsupported SPROM"
 				   "  revision %d detected. Will extract"
 				   " v1\n", out->revision);
+			out->revision = 1;
 			sprom_extract_r123(out, in);
 		}
 	}
diff --git a/drivers/ssb/sdio.c b/drivers/ssb/sdio.c
index 1140510..65a6080 100644
--- a/drivers/ssb/sdio.c
+++ b/drivers/ssb/sdio.c
@@ -21,7 +21,7 @@
 #include "ssb_private.h"
 
 /* Define the following to 1 to enable a printk on each coreswitch. */
-#define SSB_VERBOSE_SDIOCORESWITCH_DEBUG		1
+#define SSB_VERBOSE_SDIOCORESWITCH_DEBUG		0
 
 
 /* Hardware invariants CIS tuples */
@@ -333,7 +333,7 @@ static void ssb_sdio_block_read(struct ssb_device *dev, void *buffer,
 		goto out;
 
 err_out:
-	dev_dbg(ssb_sdio_dev(bus), "%04X:%04X (width=%u, len=%u), error %d\n",
+	dev_dbg(ssb_sdio_dev(bus), "%04X:%04X (width=%u, len=%zu), error %d\n",
 		bus->sdio_sbaddr >> 16, offset, reg_width, saved_count, error);
 out:
 	sdio_release_host(bus->host_sdio);
@@ -440,7 +440,7 @@ static void ssb_sdio_block_write(struct ssb_device *dev, const void *buffer,
 		goto out;
 
 err_out:
-	dev_dbg(ssb_sdio_dev(bus), "%04X:%04X (width=%u, len=%u), error %d\n",
+	dev_dbg(ssb_sdio_dev(bus), "%04X:%04X (width=%u, len=%zu), error %d\n",
 		bus->sdio_sbaddr >> 16, offset, reg_width, saved_count, error);
 out:
 	sdio_release_host(bus->host_sdio);
diff --git a/net/mac80211/rc80211_minstrel.c b/net/mac80211/rc80211_minstrel.c
index 7c51429..6e5d68b 100644
--- a/net/mac80211/rc80211_minstrel.c
+++ b/net/mac80211/rc80211_minstrel.c
@@ -418,7 +418,7 @@ minstrel_rate_init(void *priv, struct ieee80211_supported_band *sband,
 
 			/* contention window */
 			tx_time_single += t_slot + min(cw, mp->cw_max);
-			cw = (cw + 1) << 1;
+			cw = (cw << 1) | 1;
 
 			tx_time += tx_time_single;
 			tx_time_cts += tx_time_single + mi->sp_ack_dur;
diff --git a/net/wireless/scan.c b/net/wireless/scan.c
index 4c210c2..e5f92ee 100644
--- a/net/wireless/scan.c
+++ b/net/wireless/scan.c
@@ -662,7 +662,7 @@ int cfg80211_wext_siwscan(struct net_device *dev,
 				int k;
 				int wiphy_freq = wiphy->bands[band]->channels[j].center_freq;
 				for (k = 0; k < wreq->num_channels; k++) {
-					int wext_freq = wreq->channel_list[k].m / 100000;
+					int wext_freq = cfg80211_wext_freq(wiphy, &wreq->channel_list[k]);
 					if (wext_freq == wiphy_freq)
 						goto wext_freq_found;
 				}
@@ -675,6 +675,11 @@ int cfg80211_wext_siwscan(struct net_device *dev,
 		wext_freq_not_found: ;
 		}
 	}
+	/* No channels found? */
+	if (!i) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	/* Set real number of channels specified in creq->channels[] */
 	creq->n_channels = i;
diff --git a/net/wireless/sme.c b/net/wireless/sme.c
index 6830788..7fae7ee 100644
--- a/net/wireless/sme.c
+++ b/net/wireless/sme.c
@@ -188,7 +188,7 @@ void cfg80211_conn_work(struct work_struct *work)
 	rtnl_unlock();
 }
 
-static bool cfg80211_get_conn_bss(struct wireless_dev *wdev)
+static struct cfg80211_bss *cfg80211_get_conn_bss(struct wireless_dev *wdev)
 {
 	struct cfg80211_registered_device *rdev = wiphy_to_dev(wdev->wiphy);
 	struct cfg80211_bss *bss;
@@ -205,7 +205,7 @@ static bool cfg80211_get_conn_bss(struct wireless_dev *wdev)
 			       WLAN_CAPABILITY_ESS | WLAN_CAPABILITY_PRIVACY,
 			       capa);
 	if (!bss)
-		return false;
+		return NULL;
 
 	memcpy(wdev->conn->bssid, bss->bssid, ETH_ALEN);
 	wdev->conn->params.bssid = wdev->conn->bssid;
@@ -213,14 +213,14 @@ static bool cfg80211_get_conn_bss(struct wireless_dev *wdev)
 	wdev->conn->state = CFG80211_CONN_AUTHENTICATE_NEXT;
 	schedule_work(&rdev->conn_work);
 
-	cfg80211_put_bss(bss);
-	return true;
+	return bss;
 }
 
 static void __cfg80211_sme_scan_done(struct net_device *dev)
 {
 	struct wireless_dev *wdev = dev->ieee80211_ptr;
 	struct cfg80211_registered_device *rdev = wiphy_to_dev(wdev->wiphy);
+	struct cfg80211_bss *bss;
 
 	ASSERT_WDEV_LOCK(wdev);
 
@@ -234,7 +234,10 @@ static void __cfg80211_sme_scan_done(struct net_device *dev)
 	    wdev->conn->state != CFG80211_CONN_SCAN_AGAIN)
 		return;
 
-	if (!cfg80211_get_conn_bss(wdev)) {
+	bss = cfg80211_get_conn_bss(wdev);
+	if (bss) {
+		cfg80211_put_bss(bss);
+	} else {
 		/* not found */
 		if (wdev->conn->state == CFG80211_CONN_SCAN_AGAIN)
 			schedule_work(&rdev->conn_work);
@@ -670,6 +673,7 @@ int __cfg80211_connect(struct cfg80211_registered_device *rdev,
 {
 	struct wireless_dev *wdev = dev->ieee80211_ptr;
 	struct ieee80211_channel *chan;
+	struct cfg80211_bss *bss = NULL;
 	int err;
 
 	ASSERT_WDEV_LOCK(wdev);
@@ -760,7 +764,7 @@ int __cfg80211_connect(struct cfg80211_registered_device *rdev,
 
 		/* don't care about result -- but fill bssid & channel */
 		if (!wdev->conn->params.bssid || !wdev->conn->params.channel)
-			cfg80211_get_conn_bss(wdev);
+			bss = cfg80211_get_conn_bss(wdev);
 
 		wdev->sme_state = CFG80211_SME_CONNECTING;
 		wdev->connect_keys = connkeys;
@@ -770,10 +774,11 @@ int __cfg80211_connect(struct cfg80211_registered_device *rdev,
 			wdev->conn->prev_bssid_valid = true;
 		}
 
-		/* we're good if we have both BSSID and channel */
-		if (wdev->conn->params.bssid && wdev->conn->params.channel) {
+		/* we're good if we have a matching bss struct */
+		if (bss) {
 			wdev->conn->state = CFG80211_CONN_AUTHENTICATE_NEXT;
 			err = cfg80211_conn_do_work(wdev);
+			cfg80211_put_bss(bss);
 		} else {
 			/* otherwise we'll need to scan for the AP first */
 			err = cfg80211_conn_scan(wdev);
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.
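
The rc80211_minstrel change is easy to check by hand: 802.11 contention
windows take the form 2^n - 1 (15, 31, 63, ...), and only the new expression
keeps that form when doubling.  A small sketch comparing the two expressions
from the patch; cw_next_old(), cw_next_new() and cw_fill() are illustrative
helper names, not kernel functions:

```c
#include <assert.h>
#include <stddef.h>

/* Expression removed by the patch. */
static unsigned int cw_next_old(unsigned int cw)
{
	return (cw + 1) << 1;
}

/* Expression added by the patch; a value of the form 2^n - 1 stays in that form. */
static unsigned int cw_next_new(unsigned int cw)
{
	return (cw << 1) | 1;
}

/* Fill out[0..n-1] with successive windows, starting from cw. */
static void cw_fill(unsigned int cw, unsigned int (*next)(unsigned int),
		    unsigned int *out, size_t n)
{
	assert(out != NULL);
	for (size_t i = 0; i < n; i++) {
		out[i] = cw;
		cw = next(cw);
	}
}
```

Starting from 15, the new expression yields 15, 31, 63, 127, while the old one
yields 15, 32, 66, 134.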


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-16 19:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB10B67.2050108@redhat.com>

Avi Kivity wrote:
> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>>
>>> If kvm can do it, others can.
>>>      
>> The problem is that you seem to either hand-wave over details like this,
>> or you give details that are pretty much exactly what vbus does already.
>>   My point is that I've already sat down and thought about these issues
>> and solved them in a freely available GPL'ed software package.
>>    
> 
> In the kernel.  IMO that's the wrong place for it.

In conversations with Ira, he indicated he needs kernel-to-kernel
ethernet for performance, and needs at least ethernet and console
connectivity.  You could conceivably build a solution for this system in 3
basic ways:

1) "completely" in userspace: use things like tuntap on the ppc boards,
and tunnel packets across a custom point-to-point connection formed over
the pci link to a userspace app on the x86 board.  This app then
reinjects the packets into the x86 kernel as a raw socket or tuntap,
etc.  Pretty much vanilla tuntap/vpn kind of stuff.  Advantage: very
little kernel code.  Problem: performance (citation: hopefully obvious).

2) "partially" in userspace: have an in-kernel virtio-net driver talk to
a userspace based virtio-net backend.  This is the (current, non-vhost
oriented) KVM/qemu model.  Advantage: re-uses existing kernel code.
Problem: performance (citation: see alacrityvm numbers).

3) "in-kernel": You can do something like virtio-net to vhost to
potentially meet some of the requirements, but not all.

In order to fully meet (3), you would need to do some of that stuff you
mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
we need to have a facility for mapping eventfds and establishing a
signaling mechanism (like PIO+qid), etc. KVM does this with
IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be
invented.

To meet performance, this stuff has to be in kernel and there has to be
a way to manage it.  Since vbus was designed to do exactly that, this is
what I would advocate.  You could also reinvent these concepts and put
your own mux and mapping code in place, in addition to all the other
stuff that vbus does.  But I am not clear why anyone would want to.

So no, the kernel is not the wrong place for it.  It's the _only_ place
for it.  Otherwise, just use (1) and be done with it.

>  Further, if we adopt
> vbus, we either drop compatibility with existing guests or have to support both
> vbus and virtio-pci.

We already need to support both (at least to support Ira).  virtio-pci
doesn't work here.  Something else (vbus, or vbus-like) is needed.

> 
>> So the question is: is your position that vbus is all wrong and you wish
>> to create a new bus-like thing to solve the problem?
> 
> I don't intend to create anything new, I am satisfied with virtio.  If
> it works for Ira, excellent.  If not, too bad.

I think that about sums it up, then.


>  I believe it will work without too much trouble.

AFAICT it won't, for the reasons I mentioned.

> 
>> If so, how is it
>> different from what I've already done?  More importantly, what specific
>> objections do you have to what I've done, as perhaps they can be fixed
>> instead of starting over?
>>    
> 
> The two biggest objections are:
> - the host side is in the kernel

As it needs to be.

> - the guest side is a new bus instead of reusing pci (on x86/kvm),
> making Windows support more difficult

That's a function of the vbus-connector, which is different from
vbus-core.  If you don't like it (and I know you don't), we can write
one that interfaces to qemu's pci system.  I just don't like the
limitations that imposes, nor do I think we need that complexity of
dealing with a split PCI model, so I chose to not implement vbus-kvm
this way.

With all due respect, based on all of your comments in aggregate I
really do not think you are truly grasping what I am actually building here.

> 
> I guess these two are exactly what you think are vbus' greatest
> advantages, so we'll probably have to extend our agree-to-disagree on
> this one.
> 
> I also had issues with using just one interrupt vector to service all
> events, but that's easily fixed.

Again, function of the connector.

> 
>>> There is no guest and host in this scenario.  There's a device side
>>> (ppc) and a driver side (x86).  The driver side can access configuration
>>> information on the device side.  How to multiplex multiple devices is an
>>> interesting exercise for whoever writes the virtio binding for that
>>> setup.
>>>      
>> Bingo.  So now it's a question of do you want to write this layer from
>> scratch, or re-use my framework.
>>    
> 
> You will have to implement a connector or whatever for vbus as well. 
> vbus has more layers so it's probably smaller for vbus.

Bingo! That is precisely the point.

All the stuff for how to map eventfds, handle signal mitigation, demux
device/function pointers, isolation, etc, are built in.  All the
connector has to do is transport the 4-6 verbs and provide a memory
mapping/copy function, and the rest is reusable.  The device models
would then work in all environments unmodified, and likewise the
connectors could use all device-models unmodified.

> 
>>>>>
>>>>>          
>>>> I am talking about how we would tunnel the config space for N devices
>>>> across his transport.
>>>>
>>>>        
>>> Sounds trivial.
>>>      
>> No one said it was rocket science.  But it does need to be designed and
>> implemented end-to-end, much of which I've already done in what I hope is
>> an extensible way.
>>    
> 
> It was already implemented three times for virtio, so apparently that's
> extensible too.

And to my point, I'm trying to commoditize as much of that process as
possible on both the front and backends (at least for cases where
performance matters) so that you don't need to reinvent the wheel for
each one.

> 
>>>   Write an address containing the device number and
>>> register number to one location, read or write data from another.
>>>      
>> You mean like the "u64 devh", and "u32 func" fields I have here for the
>> vbus-kvm connector?
>>
>> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64
>>
>>
>>    
> 
> Probably.
> 
> 
> 
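
The address/data scheme Avi describes could look roughly like the following
sketch.  It assumes a made-up layout (device number in the upper bits of a
32-bit address word, register number in the low 8 bits); struct cfg_mux and
the cfg_* helpers are hypothetical and are not taken from virtio or from
vbus_pci.h:

```c
#include <assert.h>
#include <stdint.h>

#define CFG_NDEV 4	/* devices behind the window (assumed) */
#define CFG_NREG 16	/* registers per device (assumed) */

struct cfg_mux {
	uint32_t addr;				/* last value written to the address location */
	uint32_t regs[CFG_NDEV][CFG_NREG];	/* stand-in for the real device registers */
};

/* Pack a device number and a register number into one address word. */
static uint32_t cfg_addr(unsigned int dev, unsigned int reg)
{
	assert(dev < CFG_NDEV && reg < CFG_NREG);
	return ((uint32_t)dev << 8) | reg;
}

/* Write to the address location: selects the target of the next data access. */
static void cfg_select(struct cfg_mux *m, unsigned int dev, unsigned int reg)
{
	m->addr = cfg_addr(dev, reg);
}

/* Write to the data location. */
static void cfg_write(struct cfg_mux *m, uint32_t val)
{
	m->regs[m->addr >> 8][m->addr & 0xff] = val;
}

/* Read from the data location. */
static uint32_t cfg_read(const struct cfg_mux *m)
{
	return m->regs[m->addr >> 8][m->addr & 0xff];
}
```

One pair of locations is enough to reach every register of every device, which
is the multiplexing both sides of the thread are talking about.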
>>>> That sounds convenient given his hardware, but it has its own set of
>>>> problems.  For one, the configuration/inventory of these boards is now
>>>> driven by the wrong side and has to be addressed.
>>>>        
>>> Why is it the wrong side?
>>>      
>> "Wrong" is probably too harsh a word when looking at ethernet.  It's
>> certainly "odd", and possibly inconvenient.  It would be like having
>> vhost in a KVM guest, and virtio-net running on the host.  You could do
>> it, but it's weird and awkward.  Where it really falls apart and enters
>> the "wrong" category is for non-symmetric devices, like disk-io.
>>
>>    
> 
> 
> It's not odd or wrong or weird or awkward.

It's weird IMO because IIUC the ppc boards are not really "NICs".  Yes,
their arrangement as bus-master PCI devices makes them look and smell
like "devices", but that is an implementation detail of its transport
(like hypercalls/PIO in KVM) and not relevant to its broader role in the
system.

They are more or less like "guests" from the KVM world.  The x86 is
providing connectivity resources to these guests, not the other way
around.  It is not a goal to make the x86 look like it has a multihomed
array of ppc based NIC adapters.  The only reason we would treat these
ppc boards like NICs is because (iiuc) that is the only way vhost can be
hacked to work with the system, not because it's the optimal design.

FWIW: There are a ton of chassis-based systems that look similar to
Ira's out there (PCI inter-connected nodes), and I would like to support
them, too.  So it's not like this is a one-off.


> An ethernet NIC is not
> symmetric, one side does DMA and issues interrupts, the other uses its
> own memory.

I never said a NIC was.  I meant the ethernet _protocol_ is symmetric.
I meant it in the sense that you can ingress/egress packets in either
direction and as long as "TX" on one side is "RX" on the other and vice
versa, it all kind of works.  You can even loop it back and it still works.

Contrast this to something like a disk-block protocol where a "read"
message is expected to actually do a read, etc.  In this case, you
cannot arbitrarily assign the location of the "driver" and "device" like
you can with ethernet.  The device should presumably be where the
storage is, and the driver should be where the consumer is.

>  That's exactly the case with Ira's setup.

See "implementation detail" comment above.


> 
> If the ppc boards were to emulate a disk controller, you'd run
> virtio-blk on x86 and vhost-blk on the ppc boards.

Agreed.

> 
>>>> Second, the role
>>>> reversal will likely not work for many models other than ethernet (e.g.
>>>> virtio-console or virtio-blk drivers running on the x86 board would be
>>>> naturally consuming services from the slave boards...virtio-net is an
>>>> exception because 802.x is generally symmetrical).
>>>>
>>>>        
>>> There is no role reversal.
>>>      
>> So if I have virtio-blk driver running on the x86 and vhost-blk device
>> running on the ppc board, I can use the ppc board as a block-device.
>> What if I really wanted to go the other way?
>>    
> 
> You mean, if the x86 board was able to access the disks and dma into the
> ppc board's memory?  You'd run vhost-blk on x86 and virtio-net on ppc.

But as we discussed, vhost doesn't work well if you try to run it on the
x86 side due to its assumptions about pageable "guest" memory, right?  So
is that even an option?  And even still, you would still need to solve
the aggregation problem so that multiple devices can coexist.

> 
> As long as you don't use the words "guest" and "host" but keep to
> "driver" and "device", it all works out.
> 
>>> The side doing dma is the device, the side
>>> accessing its own memory is the driver.  Just like that other 1e12
>>> driver/device pairs out there.
>>>      
>> IIUC, his ppc boards really can be seen as "guests" (they are linux
>> instances that are utilizing services from the x86, not the other way
>> around).
> 
> They aren't guests.  Guests don't dma into their host's memory.

That's not relevant.  They are not guests in the sense of isolated
virtualized guests like KVM.  They are guests in the sense that they are
subordinate linux instances which utilize IO resources on the x86 (host).

The way this would work is that the x86 would be driving the dma
controller on the ppc board, not the other way around.  The fact that
the controller lives on the ppc board is an implementation detail.

The way I envision this to work would be that the ppc board exports two
functions in its device:

1) a vbus-bridge like device
2) a dma-controller that accepts "gpas" as one parameter

So function (1) does the 4-6 verbs I mentioned for device addressing,
etc.  Function (2) is utilized by the x86 memctx whenever a
->copy_from() or ->copy_to() operation is invoked.  The ppc boards
would be doing their normal virtio kind of things, like
->add_buf(_pa(skb->data)).

> 
>> vhost forces the model to have the ppc boards act as IO-hosts,
>> whereas vbus would likely work in either direction due to its more
>> refined abstraction layer.
>>    
> 
> vhost=device=dma, virtio=driver=own-memory.

I agree that virtio=driver=own-memory.  The problem is vhost != dma.
vhost = hva*, and it just so happens that Ira's ppc boards support host
mapping/dma so it kind of works.

What I have been trying to say is that the extra abstraction to the
memctx gets the "vhost" side away from hva*, such that it can support
hva if that makes sense, or something else (like a custom dma engine if
it doesn't).
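
To make the memctx idea concrete, here is a minimal userspace sketch of such
an abstraction, assuming two backends: one that treats the address as directly
mappable memory (the hva-like case) and one that goes through a stand-in "dma
engine" copying in fixed-size bursts.  The names struct memctx, memctx_ops,
flat_ops and dma_ops are illustrative and do not reproduce the actual vbus
interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct memctx;

struct memctx_ops {
	int (*copy_to)(struct memctx *ctx, uint64_t dst, const void *src, size_t len);
	int (*copy_from)(struct memctx *ctx, void *dst, uint64_t src, size_t len);
};

struct memctx {
	const struct memctx_ops *ops;
	uint8_t *base;		/* start of the remote region */
	size_t size;
	unsigned int bursts;	/* transfers issued by the dma backend */
};

static int in_range(const struct memctx *ctx, uint64_t addr, size_t len)
{
	return addr <= ctx->size && len <= ctx->size - addr;
}

/* Backend 1: the region is directly addressable, so a plain memcpy() works. */
static int flat_copy_to(struct memctx *ctx, uint64_t dst, const void *src, size_t len)
{
	if (!in_range(ctx, dst, len))
		return -1;
	memcpy(ctx->base + dst, src, len);
	return 0;
}

static int flat_copy_from(struct memctx *ctx, void *dst, uint64_t src, size_t len)
{
	if (!in_range(ctx, src, len))
		return -1;
	memcpy(dst, ctx->base + src, len);
	return 0;
}

/* Backend 2: every access goes through a pretend engine moving 4 bytes per burst. */
#define DMA_BURST 4

static void dma_xfer(struct memctx *ctx, uint8_t *dst, const uint8_t *src, size_t len)
{
	while (len) {
		size_t n = len < DMA_BURST ? len : DMA_BURST;

		memcpy(dst, src, n);
		dst += n;
		src += n;
		len -= n;
		ctx->bursts++;
	}
}

static int dma_copy_to(struct memctx *ctx, uint64_t dst, const void *src, size_t len)
{
	if (!in_range(ctx, dst, len))
		return -1;
	dma_xfer(ctx, ctx->base + dst, src, len);
	return 0;
}

static int dma_copy_from(struct memctx *ctx, void *dst, uint64_t src, size_t len)
{
	if (!in_range(ctx, src, len))
		return -1;
	dma_xfer(ctx, dst, ctx->base + src, len);
	return 0;
}

static const struct memctx_ops flat_ops = { flat_copy_to, flat_copy_from };
static const struct memctx_ops dma_ops = { dma_copy_to, dma_copy_from };

/* A device model calls only these wrappers and never sees which backend is in use. */
static int memctx_copy_to(struct memctx *ctx, uint64_t dst, const void *src, size_t len)
{
	assert(ctx && ctx->ops);
	return ctx->ops->copy_to(ctx, dst, src, len);
}

static int memctx_copy_from(struct memctx *ctx, void *dst, uint64_t src, size_t len)
{
	assert(ctx && ctx->ops);
	return ctx->ops->copy_from(ctx, dst, src, len);
}
```

The design choice being argued for is visible in the last two functions: code
written against the wrappers is unchanged whichever backend sits underneath.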

> 
>>> Of course vhost is incomplete, in the same sense that Linux is
>>> incomplete.  Both require userspace.
>>>      
>> A vhost based solution to Ira's design is missing more than userspace.
>> Many of those gaps are addressed by a vbus based solution.
>>    
> 
> Maybe.  Ira can fill the gaps or use vbus.
> 
> 

Agreed.

Kind Regards,
-Greg





* Re: [Linux-ATM-General] [PATCH] atm/br2684: netif_stop_queue() when atm device busy and netif_wake_queue() when we can send packets again.
From: Philip A. Prindeville @ 2009-09-16 18:04 UTC (permalink / raw)
  To: Karl Hiramoto; +Cc: David Miller, netdev, linux-atm-general
In-Reply-To: <4AAFAB60.4080302@hiramoto.org>

On 09/15/2009 07:57 AM, Karl Hiramoto wrote:
> Karl Hiramoto wrote:
>   
>> David Miller wrote:
>>   
>>     
>>> From: Karl Hiramoto <karl@hiramoto.org>
>>> Date: Thu, 10 Sep 2009 23:30:44 +0200
>>>
>>>   
>>>     
>>>       
>>>> I'm not really sure if or how many packets the upper layers buffer.
>>>>     
>>>>       
>>>>         
>>> This is determined by ->tx_queue_len, so whatever value is being
>>> set for ATM network devices is what the core will use for backlog
>>> limiting while the device's TX queue is stopped.
>>>     
>>>       
>> I tried varying tx_queue_len by 10, 100,  and 1000x, but it didn't seem 
>> to help much.  Whenever the atm dev called netif_wake_queue() it seems 
>> like the driver still starves for packets  and still takes time to get 
>> going again.
>>
>>
>> It seems like when the driver calls netif_wake_queue() its TX hardware
>> queue is nearly full, but it has space to accept new packets.  The TX
>> hardware queue has time to empty, the device starves for packets (goes
>> idle), then finally a packet comes in from the upper networking
>> layers.  I'm not really sure at the moment where the problem behind my
>> maximum throughput dropping lies.
>>
>> I did try changing sk_sndbuf to 256K but that didn't seem to help either.
>>
>> --
>>     
> Actually I think I spoke too soon: after tuning TCP parameters and
> txqueuelen on all machines (the server, router and client), it seems my
> performance came back.
>
> --
> Karl
>   

So what size are you currently using?

Out-of-the-box build, 2.6.27.29 seems to set it to 1000.
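
For anyone following along, the backlog limiting David describes can be
modelled in userspace roughly like this.  It is only a toy sketch under
my own assumptions, not kernel code: struct txq_model, txq_xmit() and
txq_complete() are made-up names, and the real qdisc layer is more
involved.  Packets go to the hardware while it has room, are held in a
backlog of at most tx_queue_len while the queue is stopped, and are
dropped beyond that.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a stopped TX queue with a bounded backlog. */
struct txq_model {
	size_t tx_queue_len;	/* backlog limit while the queue is stopped */
	size_t backlog;		/* packets currently held back */
	size_t hw_free;		/* free slots in the hardware TX queue */
	size_t dropped;		/* packets dropped because the backlog was full */
	int stopped;		/* set where a driver would netif_stop_queue() */
};

/* Offer one packet; returns 1 if it was sent or queued, 0 if dropped. */
static int txq_xmit(struct txq_model *q)
{
	if (!q->stopped && q->hw_free > 0) {
		q->hw_free--;
		if (q->hw_free == 0)
			q->stopped = 1;	/* hardware full: stop the queue */
		return 1;
	}
	if (q->backlog < q->tx_queue_len) {
		q->backlog++;
		return 1;
	}
	q->dropped++;
	return 0;
}

/* Hardware finished n packets: wake the queue and drain the backlog. */
static void txq_complete(struct txq_model *q, size_t n)
{
	q->hw_free += n;
	q->stopped = 0;		/* where a driver would netif_wake_queue() */
	while (q->backlog > 0 && q->hw_free > 0) {
		q->backlog--;
		q->hw_free--;
	}
	if (q->hw_free == 0)
		q->stopped = 1;
}
```

In this model a larger tx_queue_len only helps if something is actually
filling the backlog while the queue is stopped, which may be why the
TCP tuning mattered as well.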

-Philip



* [RFCv4 PATCH 2/2] net: Allow protocols to provide an unlocked_recvmsg socket method
From: Arnaldo Carvalho de Melo @ 2009-09-16 17:07 UTC (permalink / raw)
  To: David Miller
  Cc: Linux Networking Development Mailing List, Caitlin Bestler,
	Chris Van Hoof, Clark Williams, Neil Horman, Nir Tzachar,
	Nivedita Singhvi, Paul Moore, Rémi Denis-Courmont,
	Steven Whitehouse

So that recvmmsg can use it. With this patch recvmmsg actually _requires_ that
socket->ops->unlocked_recvmsg exists, and that socket->sk->sk_prot->unlocked_recvmsg
is non NULL.

We may well switch back to the previous scheme where sys_recvmmsg checks if
the underlying protocol provides an unlocked version and uses it, falling
back to the locked version if there is none.
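
For illustration only, that fallback scheme amounts to something like
the userspace sketch below.  The names (struct proto_model, recv_one,
locked_impl, unlocked_impl) are hypothetical stand-ins and not the
kernel structures touched by this patch; the sketch just shows the
"prefer unlocked, else locked" selection.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-protocol method table. */
struct proto_model {
	int (*recvmsg)(void *buf, size_t len);
	int (*unlocked_recvmsg)(void *buf, size_t len);	/* may be NULL */
};

/* Use the unlocked method when provided, else fall back to the locked one. */
static int recv_one(const struct proto_model *p, void *buf, size_t len)
{
	if (p->unlocked_recvmsg)
		return p->unlocked_recvmsg(buf, len);
	if (p->recvmsg)
		return p->recvmsg(buf, len);
	return -EOPNOTSUPP;
}

/* Dummy implementations that only report which path was taken. */
static int locked_impl(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return 1;
}

static int unlocked_impl(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return 2;
}
```

A protocol with both methods takes the unlocked path, one with only
recvmsg takes the locked path, and one with neither gets -EOPNOTSUPP.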

But first let's see if this works with recvmmsg alone and what kinds of gains we
get with the unlocked_recvmsg implementation in UDP. Followup patches can
restore that behaviour if we want to use it with, say, DCCP and SCTP without a
specific unlocked version.

This should address the concerns raised by Rémi about the MSG_UNLOCKED problem.

Cc: Caitlin Bestler <caitlin.bestler@gmail.com>
Cc: Chris Van Hoof <vanhoof@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Nir Tzachar <nir.tzachar@gmail.com>
Cc: Nivedita Singhvi <niv@us.ibm.com>
Cc: Paul Moore <paul.moore@hp.com>
Cc: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Cc: Steven Whitehouse <steve@chygwyn.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 drivers/isdn/mISDN/socket.c    |    2 +
 drivers/net/pppoe.c            |    1 +
 drivers/net/pppol2tp.c         |    1 +
 include/linux/net.h            |    7 +++
 include/net/sock.h             |   13 +++++
 net/appletalk/ddp.c            |    1 +
 net/atm/pvc.c                  |    1 +
 net/atm/svc.c                  |    1 +
 net/ax25/af_ax25.c             |    1 +
 net/bluetooth/bnep/sock.c      |    1 +
 net/bluetooth/cmtp/sock.c      |    1 +
 net/bluetooth/hci_sock.c       |    1 +
 net/bluetooth/hidp/sock.c      |    1 +
 net/bluetooth/l2cap.c          |    1 +
 net/bluetooth/rfcomm/sock.c    |    1 +
 net/bluetooth/sco.c            |    1 +
 net/can/bcm.c                  |    1 +
 net/can/raw.c                  |    1 +
 net/core/sock.c                |   26 +++++++++
 net/dccp/ipv4.c                |    1 +
 net/dccp/ipv6.c                |    1 +
 net/decnet/af_decnet.c         |    1 +
 net/econet/af_econet.c         |    1 +
 net/ieee802154/af_ieee802154.c |    2 +
 net/ipv4/af_inet.c             |    3 +
 net/ipv4/udp.c                 |   52 +++++++++++++++---
 net/ipv6/af_inet6.c            |    2 +
 net/ipv6/raw.c                 |    1 +
 net/ipx/af_ipx.c               |    1 +
 net/irda/af_irda.c             |    4 ++
 net/iucv/af_iucv.c             |    1 +
 net/key/af_key.c               |    1 +
 net/llc/af_llc.c               |    1 +
 net/netlink/af_netlink.c       |    1 +
 net/netrom/af_netrom.c         |    1 +
 net/packet/af_packet.c         |    2 +
 net/phonet/socket.c            |    2 +
 net/rds/af_rds.c               |    1 +
 net/rose/af_rose.c             |    1 +
 net/rxrpc/af_rxrpc.c           |    1 +
 net/sctp/ipv6.c                |    1 +
 net/sctp/protocol.c            |    1 +
 net/socket.c                   |  112 +++++++++++++++++++++++++++++++++++----
 net/tipc/socket.c              |    3 +
 net/unix/af_unix.c             |    3 +
 net/x25/af_x25.c               |    1 +
 46 files changed, 244 insertions(+), 21 deletions(-)

diff --git a/drivers/isdn/mISDN/socket.c b/drivers/isdn/mISDN/socket.c
index c36f521..6da3a71 100644
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@@ -590,6 +590,7 @@ static const struct proto_ops data_sock_ops = {
 	.getname	= data_sock_getname,
 	.sendmsg	= mISDN_sock_sendmsg,
 	.recvmsg	= mISDN_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= datagram_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
@@ -743,6 +744,7 @@ static const struct proto_ops base_sock_ops = {
 	.getname	= sock_no_getname,
 	.sendmsg	= sock_no_sendmsg,
 	.recvmsg	= sock_no_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= sock_no_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index 7cbf6f9..cbcd3d5 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -1121,6 +1121,7 @@ static const struct proto_ops pppoe_ops = {
 	.getsockopt	= sock_no_getsockopt,
 	.sendmsg	= pppoe_sendmsg,
 	.recvmsg	= pppoe_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.ioctl		= pppox_ioctl,
 };
diff --git a/drivers/net/pppol2tp.c b/drivers/net/pppol2tp.c
index e0f9219..af6160c 100644
--- a/drivers/net/pppol2tp.c
+++ b/drivers/net/pppol2tp.c
@@ -2590,6 +2590,7 @@ static struct proto_ops pppol2tp_ops = {
 	.getsockopt	= pppol2tp_getsockopt,
 	.sendmsg	= pppol2tp_sendmsg,
 	.recvmsg	= pppol2tp_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.ioctl		= pppox_ioctl,
 };
diff --git a/include/linux/net.h b/include/linux/net.h
index d67587a..8b852de 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -186,6 +186,10 @@ struct proto_ops {
 	int		(*recvmsg)   (struct kiocb *iocb, struct socket *sock,
 				      struct msghdr *m, size_t total_len,
 				      int flags);
+	int		(*unlocked_recvmsg)(struct kiocb *iocb,
+					    struct socket *sock,
+					    struct msghdr *m,
+					    size_t total_len, int flags);
 	int		(*mmap)	     (struct file *file, struct socket *sock,
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
@@ -316,6 +320,8 @@ SOCKCALL_WRAP(name, sendmsg, (struct kiocb *iocb, struct socket *sock, struct ms
 	      (iocb, sock, m, len)) \
 SOCKCALL_WRAP(name, recvmsg, (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t len, int flags), \
 	      (iocb, sock, m, len, flags)) \
+SOCKCALL_WRAP(name, unlocked_recvmsg, (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t len, int flags), \
+	      (iocb, sock, m, len, flags)) \
 SOCKCALL_WRAP(name, mmap, (struct file *file, struct socket *sock, struct vm_area_struct *vma), \
 	      (file, sock, vma)) \
 	      \
@@ -337,6 +343,7 @@ static const struct proto_ops name##_ops = {			\
 	.getsockopt	= __lock_##name##_getsockopt,	\
 	.sendmsg	= __lock_##name##_sendmsg,	\
 	.recvmsg	= __lock_##name##_recvmsg,	\
+	.unlocked_recvmsg = __lock_##name##_unlocked_recvmsg,	\
 	.mmap		= __lock_##name##_mmap,		\
 };
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 950409d..7c62428 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -644,6 +644,11 @@ struct proto {
 					   struct msghdr *msg,
 					size_t len, int noblock, int flags, 
 					int *addr_len);
+	int			(*unlocked_recvmsg)(struct kiocb *iocb,
+						    struct sock *sk,
+						    struct msghdr *msg,
+						    size_t len, int noblock,
+						    int flags, int *addr_len);
 	int			(*sendpage)(struct sock *sk, struct page *page,
 					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk, 
@@ -998,6 +1003,11 @@ extern int                      sock_no_sendmsg(struct kiocb *, struct socket *,
 						struct msghdr *, size_t);
 extern int                      sock_no_recvmsg(struct kiocb *, struct socket *,
 						struct msghdr *, size_t, int);
+extern int			sock_no_unlocked_recvmsg(struct kiocb *iocb,
+							 struct socket *sock,
+							 struct msghdr *msg,
+							 size_t size,
+							 int flags);
 extern int			sock_no_mmap(struct file *file,
 					     struct socket *sock,
 					     struct vm_area_struct *vma);
@@ -1014,6 +1024,9 @@ extern int sock_common_getsockopt(struct socket *sock, int level, int optname,
 				  char __user *optval, int __user *optlen);
 extern int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock,
 			       struct msghdr *msg, size_t size, int flags);
+extern int sock_common_unlocked_recvmsg(struct kiocb *iocb, struct socket *sock,
+					struct msghdr *msg, size_t size,
+					int flags);
 extern int sock_common_setsockopt(struct socket *sock, int level, int optname,
 				  char __user *optval, int optlen);
 extern int compat_sock_common_getsockopt(struct socket *sock, int level,
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index 4a6ff2b..bb2e1bb 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -1847,6 +1847,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(atalk_dgram_ops) = {
 	.getsockopt	= sock_no_getsockopt,
 	.sendmsg	= atalk_sendmsg,
 	.recvmsg	= atalk_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/atm/pvc.c b/net/atm/pvc.c
index e1d22d9..5c03749 100644
--- a/net/atm/pvc.c
+++ b/net/atm/pvc.c
@@ -122,6 +122,7 @@ static const struct proto_ops pvc_proto_ops = {
 	.getsockopt =	pvc_getsockopt,
 	.sendmsg =	vcc_sendmsg,
 	.recvmsg =	vcc_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/atm/svc.c b/net/atm/svc.c
index 7b831b5..6c66ae9 100644
--- a/net/atm/svc.c
+++ b/net/atm/svc.c
@@ -644,6 +644,7 @@ static const struct proto_ops svc_proto_ops = {
 	.setsockopt =	svc_setsockopt,
 	.getsockopt =	svc_getsockopt,
 	.sendmsg =	vcc_sendmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.recvmsg =	vcc_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index da0f64f..43f4f57 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1976,6 +1976,7 @@ static const struct proto_ops ax25_proto_ops = {
 	.getsockopt	= ax25_getsockopt,
 	.sendmsg	= ax25_sendmsg,
 	.recvmsg	= ax25_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/bluetooth/bnep/sock.c b/net/bluetooth/bnep/sock.c
index e857628..0b26b3c 100644
--- a/net/bluetooth/bnep/sock.c
+++ b/net/bluetooth/bnep/sock.c
@@ -178,6 +178,7 @@ static const struct proto_ops bnep_sock_ops = {
 	.getname	= sock_no_getname,
 	.sendmsg	= sock_no_sendmsg,
 	.recvmsg	= sock_no_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= sock_no_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
diff --git a/net/bluetooth/cmtp/sock.c b/net/bluetooth/cmtp/sock.c
index 16b0fad..72a4b5d 100644
--- a/net/bluetooth/cmtp/sock.c
+++ b/net/bluetooth/cmtp/sock.c
@@ -173,6 +173,7 @@ static const struct proto_ops cmtp_sock_ops = {
 	.getname	= sock_no_getname,
 	.sendmsg	= sock_no_sendmsg,
 	.recvmsg	= sock_no_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= sock_no_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 4f9621f..bd0aace 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -603,6 +603,7 @@ static const struct proto_ops hci_sock_ops = {
 	.getname	= hci_sock_getname,
 	.sendmsg	= hci_sock_sendmsg,
 	.recvmsg	= hci_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.ioctl		= hci_sock_ioctl,
 	.poll		= datagram_poll,
 	.listen		= sock_no_listen,
diff --git a/net/bluetooth/hidp/sock.c b/net/bluetooth/hidp/sock.c
index 37c9d7d..90b40e2 100644
--- a/net/bluetooth/hidp/sock.c
+++ b/net/bluetooth/hidp/sock.c
@@ -224,6 +224,7 @@ static const struct proto_ops hidp_sock_ops = {
 	.getname	= sock_no_getname,
 	.sendmsg	= sock_no_sendmsg,
 	.recvmsg	= sock_no_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= sock_no_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
diff --git a/net/bluetooth/l2cap.c b/net/bluetooth/l2cap.c
index b030125..dc73bd4 100644
--- a/net/bluetooth/l2cap.c
+++ b/net/bluetooth/l2cap.c
@@ -3907,6 +3907,7 @@ static const struct proto_ops l2cap_sock_ops = {
 	.getname	= l2cap_sock_getname,
 	.sendmsg	= l2cap_sock_sendmsg,
 	.recvmsg	= l2cap_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= bt_sock_poll,
 	.ioctl		= bt_sock_ioctl,
 	.mmap		= sock_no_mmap,
diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index 0b85e81..00b1a41 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -1092,6 +1092,7 @@ static const struct proto_ops rfcomm_sock_ops = {
 	.getname	= rfcomm_sock_getname,
 	.sendmsg	= rfcomm_sock_sendmsg,
 	.recvmsg	= rfcomm_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.shutdown	= rfcomm_sock_shutdown,
 	.setsockopt	= rfcomm_sock_setsockopt,
 	.getsockopt	= rfcomm_sock_getsockopt,
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
index 13c27f1..fda79b8 100644
--- a/net/bluetooth/sco.c
+++ b/net/bluetooth/sco.c
@@ -984,6 +984,7 @@ static const struct proto_ops sco_sock_ops = {
 	.getname	= sco_sock_getname,
 	.sendmsg	= sco_sock_sendmsg,
 	.recvmsg	= bt_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= bt_sock_poll,
 	.ioctl		= bt_sock_ioctl,
 	.mmap		= sock_no_mmap,
diff --git a/net/can/bcm.c b/net/can/bcm.c
index 597da4f..e0aff9e 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1562,6 +1562,7 @@ static struct proto_ops bcm_ops __read_mostly = {
 	.getsockopt    = sock_no_getsockopt,
 	.sendmsg       = bcm_sendmsg,
 	.recvmsg       = bcm_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap          = sock_no_mmap,
 	.sendpage      = sock_no_sendpage,
 };
diff --git a/net/can/raw.c b/net/can/raw.c
index db3152d..b8fa610 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -730,6 +730,7 @@ static struct proto_ops raw_ops __read_mostly = {
 	.getsockopt    = raw_getsockopt,
 	.sendmsg       = raw_sendmsg,
 	.recvmsg       = raw_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap          = sock_no_mmap,
 	.sendpage      = sock_no_sendpage,
 };
diff --git a/net/core/sock.c b/net/core/sock.c
index 30d5446..6ac86d4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1653,6 +1653,13 @@ int sock_no_connect(struct socket *sock, struct sockaddr *saddr,
 }
 EXPORT_SYMBOL(sock_no_connect);
 
+int sock_no_unlocked_recvmsg(struct kiocb *iocb, struct socket *sock,
+			     struct msghdr *msg, size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL(sock_no_unlocked_recvmsg);
+
 int sock_no_socketpair(struct socket *sock1, struct socket *sock2)
 {
 	return -EOPNOTSUPP;
@@ -2014,6 +2021,25 @@ int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock,
 }
 EXPORT_SYMBOL(sock_common_recvmsg);
 
+int sock_common_unlocked_recvmsg(struct kiocb *iocb, struct socket *sock,
+				 struct msghdr *msg, size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	int addr_len = 0;
+	int err;
+
+	if (sk->sk_prot->unlocked_recvmsg == NULL)
+		return -EOPNOTSUPP;
+
+	err = sk->sk_prot->unlocked_recvmsg(iocb, sk, msg, size,
+					    flags & MSG_DONTWAIT,
+					    flags & ~MSG_DONTWAIT, &addr_len);
+	if (err >= 0)
+		msg->msg_namelen = addr_len;
+	return err;
+}
+EXPORT_SYMBOL(sock_common_unlocked_recvmsg);
+
 /*
  *	Set socket options on an inet socket.
  */
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index d01c00d..e781f01 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -974,6 +974,7 @@ static const struct proto_ops inet_dccp_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 64f011c..f530e37 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -1175,6 +1175,7 @@ static struct proto_ops inet6_dccp_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 77d4028..aa1af0b 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -2348,6 +2348,7 @@ static const struct proto_ops dn_proto_ops = {
 	.getsockopt =	dn_getsockopt,
 	.sendmsg =	dn_sendmsg,
 	.recvmsg =	dn_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/econet/af_econet.c b/net/econet/af_econet.c
index 0e0254f..7891aad 100644
--- a/net/econet/af_econet.c
+++ b/net/econet/af_econet.c
@@ -765,6 +765,7 @@ static const struct proto_ops econet_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	econet_sendmsg,
 	.recvmsg =	econet_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/ieee802154/af_ieee802154.c b/net/ieee802154/af_ieee802154.c
index cd949d5..98cf2be 100644
--- a/net/ieee802154/af_ieee802154.c
+++ b/net/ieee802154/af_ieee802154.c
@@ -195,6 +195,7 @@ static const struct proto_ops ieee802154_raw_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = ieee802154_sock_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
@@ -220,6 +221,7 @@ static const struct proto_ops ieee802154_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = ieee802154_sock_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6c30a73..4981d8e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -866,6 +866,7 @@ const struct proto_ops inet_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = tcp_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
 	.splice_read	   = tcp_splice_read,
@@ -893,6 +894,7 @@ const struct proto_ops inet_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_common_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
@@ -923,6 +925,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ebaaa7f..fcb34bd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -882,13 +882,34 @@ int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(udp_ioctl);
 
+static void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
+{
+	lock_sock(sk);
+	skb_free_datagram(sk, skb);
+	release_sock(sk);
+}
+
+static int skb_kill_datagram_locked(struct sock *sk, struct sk_buff *skb,
+                                   unsigned int flags)
+{
+	int ret;
+	lock_sock(sk);
+	ret = skb_kill_datagram(sk, skb, flags);
+	release_sock(sk);
+	return ret;
+}
+
 /*
  * 	This should be easy, if there is something there we
  * 	return it, otherwise we block.
  */
-
-int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
-		size_t len, int noblock, int flags, int *addr_len)
+static int __udp_recvmsg(struct kiocb *iocb, struct sock *sk,
+			 struct msghdr *msg, size_t len, int noblock,
+			 int flags, int *addr_len,
+			 void (*free_datagram)(struct sock *,
+					       struct sk_buff *),
+			 int  (*kill_datagram)(struct sock *,
+					       struct sk_buff *, unsigned int))
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
@@ -967,23 +988,35 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
-	skb_free_datagram(sk, skb);
-	release_sock(sk);
+	free_datagram(sk, skb);
 out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
-	if (!skb_kill_datagram(sk, skb, flags))
+	if (!kill_datagram(sk, skb, flags))
 		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	release_sock(sk);
 
 	if (noblock)
 		return -EAGAIN;
 	goto try_again;
 }
 
+int udp_recvmsg(struct kiocb *iocb, struct sock *sk,
+		struct msghdr *msg, size_t len, int noblock,
+		int flags, int *addr_len)
+{
+	return __udp_recvmsg(iocb, sk, msg, len, noblock, flags, addr_len,
+			     skb_free_datagram_locked,
+			     skb_kill_datagram_locked);
+}
+
+int udp_unlocked_recvmsg(struct kiocb *iocb, struct sock *sk,
+			 struct msghdr *msg, size_t len, int noblock,
+			 int flags, int *addr_len)
+{
+	return __udp_recvmsg(iocb, sk, msg, len, noblock, flags, addr_len,
+			     skb_free_datagram, skb_kill_datagram);
+}
 
 int udp_disconnect(struct sock *sk, int flags)
 {
@@ -1580,6 +1613,7 @@ struct proto udp_prot = {
 	.getsockopt	   = udp_getsockopt,
 	.sendmsg	   = udp_sendmsg,
 	.recvmsg	   = udp_recvmsg,
+	.unlocked_recvmsg  = udp_unlocked_recvmsg,
 	.sendpage	   = udp_sendpage,
 	.backlog_rcv	   = __udp_queue_rcv_skb,
 	.hash		   = udp_lib_hash,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index a123a32..b72c518 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -518,6 +518,7 @@ const struct proto_ops inet6_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = tcp_sendmsg,		/* ok		*/
 	.recvmsg	   = sock_common_recvmsg,	/* ok		*/
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
 	.splice_read	   = tcp_splice_read,
@@ -544,6 +545,7 @@ const struct proto_ops inet6_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet_sendmsg,		/* ok		*/
 	.recvmsg	   = sock_common_recvmsg,	/* ok		*/
+	.unlocked_recvmsg  = sock_common_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 7d675b8..d17db28 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1325,6 +1325,7 @@ static const struct proto_ops inet6_sockraw_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet_sendmsg,		/* ok		*/
 	.recvmsg	   = sock_common_recvmsg,	/* ok		*/
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipx/af_ipx.c b/net/ipx/af_ipx.c
index f1118d9..45048a0 100644
--- a/net/ipx/af_ipx.c
+++ b/net/ipx/af_ipx.c
@@ -1953,6 +1953,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(ipx_dgram_ops) = {
 	.getsockopt	= ipx_getsockopt,
 	.sendmsg	= ipx_sendmsg,
 	.recvmsg	= ipx_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index 50b43c5..7e97581 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -2489,6 +2489,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(irda_stream_ops) = {
 	.getsockopt =	irda_getsockopt,
 	.sendmsg =	irda_sendmsg,
 	.recvmsg =	irda_recvmsg_stream,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -2513,6 +2514,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(irda_seqpacket_ops) = {
 	.getsockopt =	irda_getsockopt,
 	.sendmsg =	irda_sendmsg,
 	.recvmsg =	irda_recvmsg_dgram,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -2537,6 +2539,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(irda_dgram_ops) = {
 	.getsockopt =	irda_getsockopt,
 	.sendmsg =	irda_sendmsg_dgram,
 	.recvmsg =	irda_recvmsg_dgram,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -2562,6 +2565,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(irda_ultra_ops) = {
 	.getsockopt =	irda_getsockopt,
 	.sendmsg =	irda_sendmsg_ultra,
 	.recvmsg =	irda_recvmsg_dgram,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 49c15b4..c208622 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -1693,6 +1693,7 @@ static struct proto_ops iucv_sock_ops = {
 	.getname	= iucv_sock_getname,
 	.sendmsg	= iucv_sock_sendmsg,
 	.recvmsg	= iucv_sock_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.poll		= iucv_sock_poll,
 	.ioctl		= sock_no_ioctl,
 	.mmap		= sock_no_mmap,
diff --git a/net/key/af_key.c b/net/key/af_key.c
index 4e98193..3ef1f26 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -3636,6 +3636,7 @@ static const struct proto_ops pfkey_ops = {
 	.getsockopt	=	sock_no_getsockopt,
 	.mmap		=	sock_no_mmap,
 	.sendpage	=	sock_no_sendpage,
+	.unlocked_recvmsg =	sock_no_unlocked_recvmsg,
 
 	/* Now the operations that really occur. */
 	.release	=	pfkey_release,
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index c45eee1..d948caf 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -1115,6 +1115,7 @@ static const struct proto_ops llc_ui_ops = {
 	.getsockopt  = llc_ui_getsockopt,
 	.sendmsg     = llc_ui_sendmsg,
 	.recvmsg     = llc_ui_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap	     = sock_no_mmap,
 	.sendpage    = sock_no_sendpage,
 };
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d0ff382..0d1b446 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2036,6 +2036,7 @@ static const struct proto_ops netlink_ops = {
 	.getsockopt =	netlink_getsockopt,
 	.sendmsg =	netlink_sendmsg,
 	.recvmsg =	netlink_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index ce1a34b..3550d34 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1395,6 +1395,7 @@ static const struct proto_ops nr_proto_ops = {
 	.getsockopt	=	nr_getsockopt,
 	.sendmsg	=	nr_sendmsg,
 	.recvmsg	=	nr_recvmsg,
+	.unlocked_recvmsg =	sock_no_unlocked_recvmsg,
 	.mmap		=	sock_no_mmap,
 	.sendpage	=	sock_no_sendpage,
 };
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d3d52c6..d987e23 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2338,6 +2338,7 @@ static const struct proto_ops packet_ops_spkt = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	packet_sendmsg_spkt,
 	.recvmsg =	packet_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -2359,6 +2360,7 @@ static const struct proto_ops packet_ops = {
 	.getsockopt =	packet_getsockopt,
 	.sendmsg =	packet_sendmsg,
 	.recvmsg =	packet_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		packet_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index 7a4ee39..248e8b2 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -327,6 +327,7 @@ const struct proto_ops phonet_dgram_ops = {
 #endif
 	.sendmsg	= pn_socket_sendmsg,
 	.recvmsg	= sock_common_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
@@ -352,6 +353,7 @@ const struct proto_ops phonet_stream_ops = {
 #endif
 	.sendmsg	= pn_socket_sendmsg,
 	.recvmsg	= sock_common_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 108ed2e..1f2e8db 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -376,6 +376,7 @@ static struct proto_ops rds_proto_ops = {
 	.getsockopt =	rds_getsockopt,
 	.sendmsg =	rds_sendmsg,
 	.recvmsg =	rds_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index e5f478c..a64c623 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1532,6 +1532,7 @@ static struct proto_ops rose_proto_ops = {
 	.getsockopt	=	rose_getsockopt,
 	.sendmsg	=	rose_sendmsg,
 	.recvmsg	=	rose_recvmsg,
+	.unlocked_recvmsg =	sock_no_unlocked_recvmsg,
 	.mmap		=	sock_no_mmap,
 	.sendpage	=	sock_no_sendpage,
 };
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index bfe493e..bf4c38a 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -766,6 +766,7 @@ static const struct proto_ops rxrpc_rpc_ops = {
 	.getsockopt	= sock_no_getsockopt,
 	.sendmsg	= rxrpc_sendmsg,
 	.recvmsg	= rxrpc_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage,
 };
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 6a4b190..b68d9f9 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -918,6 +918,7 @@ static const struct proto_ops inet6_seqpacket_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_sock_common_setsockopt,
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 60093be..8caedcb 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -895,6 +895,7 @@ static const struct proto_ops inet_seqpacket_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = sock_common_recvmsg,
+	.unlocked_recvmsg  = sock_no_unlocked_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/socket.c b/net/socket.c
index 32db56a..dc5b976 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -690,6 +690,32 @@ static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
 	return err ?: __sock_recvmsg_nosec(iocb, sock, msg, size, flags);
 }
 
+static inline int __sock_unlocked_recvmsg_nosec(struct kiocb *iocb,
+						struct socket *sock,
+						struct msghdr *msg,
+						size_t size, int flags)
+{
+	struct sock_iocb *si = kiocb_to_siocb(iocb);
+
+	si->sock = sock;
+	si->scm = NULL;
+	si->msg = msg;
+	si->size = size;
+	si->flags = flags;
+
+	return sock->ops->unlocked_recvmsg(iocb, sock, msg, size, flags);
+}
+
+static inline int __sock_unlocked_recvmsg(struct kiocb *iocb,
+					  struct socket *sock,
+					  struct msghdr *msg, size_t size,
+					  int flags)
+{
+	int err = security_socket_recvmsg(sock, msg, size, flags);
+
+	return err ?: __sock_unlocked_recvmsg_nosec(iocb, sock, msg, size, flags);
+}
+
 int sock_recvmsg(struct socket *sock, struct msghdr *msg,
 		 size_t size, int flags)
 {
@@ -720,6 +746,58 @@ static int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
 	return ret;
 }
 
+static int sock_unlocked_recvmsg(struct socket *sock, struct msghdr *msg,
+				 size_t size, int flags)
+{
+	struct kiocb iocb;
+	struct sock_iocb siocb;
+	int ret;
+
+	init_sync_kiocb(&iocb, NULL);
+	iocb.private = &siocb;
+	ret = __sock_unlocked_recvmsg(&iocb, sock, msg, size, flags);
+	if (-EIOCBQUEUED == ret)
+		ret = wait_on_sync_kiocb(&iocb);
+	return ret;
+}
+
+static int sock_unlocked_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
+				       size_t size, int flags)
+{
+	struct kiocb iocb;
+	struct sock_iocb siocb;
+	int ret;
+
+	init_sync_kiocb(&iocb, NULL);
+	iocb.private = &siocb;
+	ret = __sock_unlocked_recvmsg_nosec(&iocb, sock, msg, size, flags);
+	if (-EIOCBQUEUED == ret)
+		ret = wait_on_sync_kiocb(&iocb);
+	return ret;
+}
+
+enum sock_recvmsg_security {
+	SOCK_RECVMSG_SEC = 0,
+	SOCK_RECVMSG_NOSEC,
+};
+
+enum sock_recvmsg_locking {
+	SOCK_LOCKED_RECVMSG = 0,
+	SOCK_UNLOCKED_RECVMSG,
+};
+
+static int (*sock_recvmsg_table[2][2])(struct socket *sock, struct msghdr *msg,
+				       size_t size, int flags) = {
+	[SOCK_RECVMSG_SEC] = {
+		[SOCK_LOCKED_RECVMSG]	= sock_recvmsg, /* The old one */
+		[SOCK_UNLOCKED_RECVMSG] = sock_unlocked_recvmsg,
+	},
+	[SOCK_RECVMSG_NOSEC] = {
+		[SOCK_LOCKED_RECVMSG]	= sock_recvmsg_nosec,
+		[SOCK_UNLOCKED_RECVMSG] = sock_unlocked_recvmsg_nosec,
+	},
+};
+
 int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
 		   struct kvec *vec, size_t num, size_t size, int flags)
 {
@@ -1984,7 +2062,9 @@ out:
 }
 
 static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
-			 struct msghdr *msg_sys, unsigned flags, int nosec)
+			 struct msghdr *msg_sys, unsigned flags,
+			 enum sock_recvmsg_security security,
+			 enum sock_recvmsg_locking locking)
 {
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
@@ -2044,8 +2124,8 @@ static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
 
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
-	err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys,
-							  total_len, flags);
+	err = sock_recvmsg_table[security][locking](sock, msg_sys,
+						    total_len, flags);
 	if (err < 0)
 		goto out_freeiov;
 	len = err;
@@ -2092,7 +2172,8 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 	if (!sock)
 		goto out;
 
-	err = __sys_recvmsg(sock, msg, &msg_sys, flags, 0);
+	err = __sys_recvmsg(sock, msg, &msg_sys, flags,
+			    SOCK_RECVMSG_SEC, SOCK_LOCKED_RECVMSG);
 
 	fput_light(sock->file, fput_needed);
 out:
@@ -2111,6 +2192,7 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	struct mmsghdr __user *entry;
 	struct msghdr msg_sys;
 	struct timespec end_time;
+	enum sock_recvmsg_security security;
 
 	if (timeout &&
 	    poll_select_set_timeout(&end_time, timeout->tv_sec,
@@ -2123,20 +2205,25 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	if (!sock)
 		return err;
 
+	lock_sock(sock->sk);
+
 	err = sock_error(sock->sk);
 	if (err)
 		goto out_put;
 
 	entry = mmsg;
 
+	security = SOCK_RECVMSG_SEC;
 	while (datagrams < vlen) {
-		/*
-		 * No need to ask LSM for more than the first datagram.
-		 */
 		err = __sys_recvmsg(sock, (struct msghdr __user *)entry,
-				    &msg_sys, flags, datagrams);
+				    &msg_sys, flags, security,
+				    SOCK_UNLOCKED_RECVMSG);
 		if (err < 0)
 			break;
+		/*
+		 * No need to ask LSM for more than the first datagram.
+		 */
+		security = SOCK_RECVMSG_NOSEC;
 		err = put_user(err, &entry->msg_len);
 		if (err)
 			break;
@@ -2165,9 +2252,8 @@ out_put:
 	fput_light(sock->file, fput_needed);
 
 	if (err == 0)
-		return datagrams;
-
-	if (datagrams != 0) {
+		err = datagrams;
+	else if (datagrams != 0) {
 		/*
 		 * We may return less entries than requested (vlen) if the
 		 * sock is non block and there aren't enough datagrams...
@@ -2182,9 +2268,11 @@ out_put:
 			sock->sk->sk_err = -err;
 		}
 
-		return datagrams;
+		err = datagrams;
 	}
 
+	release_sock(sock->sk);
+
 	return err;
 }
 
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index e8254e8..97b3f05 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1797,6 +1797,7 @@ static const struct proto_ops msg_ops = {
 	.getsockopt	= getsockopt,
 	.sendmsg	= send_msg,
 	.recvmsg	= recv_msg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage
 };
@@ -1818,6 +1819,7 @@ static const struct proto_ops packet_ops = {
 	.getsockopt	= getsockopt,
 	.sendmsg	= send_packet,
 	.recvmsg	= recv_msg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage
 };
@@ -1839,6 +1841,7 @@ static const struct proto_ops stream_ops = {
 	.getsockopt	= getsockopt,
 	.sendmsg	= send_stream,
 	.recvmsg	= recv_stream,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap		= sock_no_mmap,
 	.sendpage	= sock_no_sendpage
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 51ab497..9e7aa9a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -521,6 +521,7 @@ static const struct proto_ops unix_stream_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	unix_stream_sendmsg,
 	.recvmsg =	unix_stream_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -542,6 +543,7 @@ static const struct proto_ops unix_dgram_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	unix_dgram_sendmsg,
 	.recvmsg =	unix_dgram_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
@@ -563,6 +565,7 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	unix_seqpacket_sendmsg,
 	.recvmsg =	unix_dgram_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 5e6c072..7c20b26 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1620,6 +1620,7 @@ static const struct proto_ops SOCKOPS_WRAPPED(x25_proto_ops) = {
 	.getsockopt =	x25_getsockopt,
 	.sendmsg =	x25_sendmsg,
 	.recvmsg =	x25_recvmsg,
+	.unlocked_recvmsg = sock_no_unlocked_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
-- 
1.5.5.1


^ permalink raw reply related

* [RFCv4 PATCH 1/2] net: Introduce recvmmsg socket syscall
From: Arnaldo Carvalho de Melo @ 2009-09-16 17:07 UTC (permalink / raw)
  To: David Miller
  Cc: Caitlin Bestler, Chris Van Hoof, Clark Williams, Neil Horman,
	Nir Tzachar, Nivedita Singhvi, Paul Moore,
	Rémi Denis-Courmont, Steven Whitehouse,
	Linux Networking Development Mailing List

Meaning "receive multiple messages", this reduces the number of syscalls and
net stack entry/exit operations.

The next patches will introduce a mechanism by which protocols that want to
optimize this operation can provide an unlocked_recvmsg operation.

This takes into account comments made by:

. Paul Moore: sock_recvmsg is called only for the first datagram,
  sock_recvmsg_nosec is used for the rest.

. Caitlin Bestler: recvmmsg now has a struct timespec timeout that
  works in the same fashion as the ppoll one.

  If the underlying protocol returns a datagram with MSG_OOB set, this
  will make recvmmsg return right away with as many datagrams (+ the OOB
  one) as it has received so far.

. Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
  datagrams and then recvmsg returns an error, recvmmsg will return
  the successfully received datagrams, store the error and return it
  in the next call.
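
The timeout item above can be pictured with a small userspace sketch. In the
patch the loop calls ktime_get_ts() and timespec_sub() against end_time after
each datagram; the helper below, remaining_timeout(), is a hypothetical name
used only for illustration and assumes normalized timespec values. It computes
what is left of the timeout, clamps it at zero, and reports whether the batch
should stop:

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper (not part of the patch): left = end - now,
 * clamped at zero.  Returns 1 when the timeout has expired, i.e.
 * when the batch should stop with fewer than vlen datagrams.
 */
static int remaining_timeout(const struct timespec *end,
			     const struct timespec *now,
			     struct timespec *left)
{
	assert(end && now && left);

	left->tv_sec = end->tv_sec - now->tv_sec;
	left->tv_nsec = end->tv_nsec - now->tv_nsec;
	if (left->tv_nsec < 0) {
		/* borrow one second to keep tv_nsec in [0, 1e9) */
		left->tv_sec--;
		left->tv_nsec += 1000000000L;
	}
	if (left->tv_sec < 0) {
		/* already past the deadline: report zero time left */
		left->tv_sec = 0;
		left->tv_nsec = 0;
	}
	return left->tv_sec == 0 && left->tv_nsec == 0;
}
```

As with ppoll, the caller gets the remaining time written back, so it can tell
how much of the timeout was consumed by the batch.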

This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
where we will be able to acquire the lock only at batch start and end, not at
every underlying recvmsg call.
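
From userspace, the interface described above could be exercised roughly as
follows. This is only a sketch under stated assumptions: it goes through
syscall(2) because no libc wrapper is assumed, it relies on SYS_recvmmsg being
defined by the installed kernel headers, and struct my_mmsghdr, recv_batch(),
BATCH and BUFSZ are hypothetical names; the struct simply mirrors the mmsghdr
layout added to include/linux/socket.h by this patch.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH	8	/* max datagrams per call (vlen) */
#define BUFSZ	256	/* bytes of payload per datagram */

/* Local mirror of the mmsghdr layout introduced by this patch. */
struct my_mmsghdr {
	struct msghdr	msg_hdr;
	unsigned int	msg_len;	/* filled in by the kernel */
};

/*
 * Hypothetical helper: receive up to BATCH datagrams in one syscall.
 * Returns the number of datagrams received, or -1 with errno set.
 */
static int recv_batch(int fd, char bufs[BATCH][BUFSZ],
		      unsigned int lens[BATCH])
{
	struct my_mmsghdr msgs[BATCH];
	struct iovec iov[BATCH];
	long n;
	int i;

	assert(fd >= 0);

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = BUFSZ;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* No timeout; MSG_DONTWAIT so a short batch returns right away. */
	n = syscall(SYS_recvmmsg, (long)fd, msgs, (long)BATCH,
		    (long)MSG_DONTWAIT, NULL);

	for (i = 0; i < n; i++)
		lens[i] = msgs[i].msg_len;
	return (int)n;
}
```

If fewer than BATCH datagrams are queued, the call returns the ones that were
available, matching the "N < vlen" behaviour described above.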

Cc: Caitlin Bestler <caitlin.bestler@gmail.com>
Cc: Chris Van Hoof <vanhoof@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Nir Tzachar <nir.tzachar@gmail.com>
Cc: Nivedita Singhvi <niv@us.ibm.com>
Cc: Paul Moore <paul.moore@hp.com>
Cc: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Cc: Steven Whitehouse <steve@chygwyn.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 arch/alpha/kernel/systbls.S            |    1 +
 arch/arm/kernel/calls.S                |    1 +
 arch/avr32/kernel/syscall_table.S      |    1 +
 arch/blackfin/mach-common/entry.S      |    1 +
 arch/ia64/kernel/entry.S               |    1 +
 arch/microblaze/kernel/syscall_table.S |    1 +
 arch/mips/kernel/scall32-o32.S         |    1 +
 arch/mips/kernel/scall64-64.S          |    1 +
 arch/mips/kernel/scall64-n32.S         |    1 +
 arch/mips/kernel/scall64-o32.S         |    1 +
 arch/sh/kernel/syscalls_64.S           |    1 +
 arch/sparc/kernel/systbls_64.S         |    4 +-
 arch/x86/ia32/ia32entry.S              |    1 +
 arch/x86/include/asm/unistd_32.h       |    1 +
 arch/x86/include/asm/unistd_64.h       |    2 +
 arch/x86/kernel/syscall_table_32.S     |    1 +
 arch/xtensa/include/asm/unistd.h       |    4 +-
 include/linux/net.h                    |    1 +
 include/linux/socket.h                 |   10 ++
 include/linux/syscalls.h               |    4 +
 include/net/compat.h                   |    8 +
 kernel/sys_ni.c                        |    2 +
 net/compat.c                           |   33 +++++-
 net/socket.c                           |  225 ++++++++++++++++++++++++++------
 24 files changed, 259 insertions(+), 48 deletions(-)

diff --git a/arch/alpha/kernel/systbls.S b/arch/alpha/kernel/systbls.S
index 95c9aef..cda6b8b 100644
--- a/arch/alpha/kernel/systbls.S
+++ b/arch/alpha/kernel/systbls.S
@@ -497,6 +497,7 @@ sys_call_table:
 	.quad sys_signalfd
 	.quad sys_ni_syscall
 	.quad sys_eventfd
+	.quad sys_recvmmsg
 
 	.size sys_call_table, . - sys_call_table
 	.type sys_call_table, @object
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index f776e72..43995f6 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -374,6 +374,7 @@
 		CALL(sys_pwritev)
 		CALL(sys_rt_tgsigqueueinfo)
 		CALL(sys_perf_counter_open)
+		CALL(sys_recvmmsg)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/avr32/kernel/syscall_table.S b/arch/avr32/kernel/syscall_table.S
index 7ee0057..e76bad1 100644
--- a/arch/avr32/kernel/syscall_table.S
+++ b/arch/avr32/kernel/syscall_table.S
@@ -295,4 +295,5 @@ sys_call_table:
 	.long	sys_signalfd
 	.long	sys_ni_syscall		/* 280, was sys_timerfd */
 	.long	sys_eventfd
+	.long	sys_recvmmsg
 	.long	sys_ni_syscall		/* r8 is saturated at nr_syscalls */
diff --git a/arch/blackfin/mach-common/entry.S b/arch/blackfin/mach-common/entry.S
index fb1795d..e4d3d0f 100644
--- a/arch/blackfin/mach-common/entry.S
+++ b/arch/blackfin/mach-common/entry.S
@@ -1612,6 +1612,7 @@ ENTRY(_sys_call_table)
 	.long _sys_pwritev
 	.long _sys_rt_tgsigqueueinfo
 	.long _sys_perf_counter_open
+	.long _sys_recvmmsg
 
 	.rept NR_syscalls-(.-_sys_call_table)/4
 	.long _sys_ni_syscall
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index d0e7d37..d75b872 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1806,6 +1806,7 @@ sys_call_table:
 	data8 sys_preadv
 	data8 sys_pwritev			// 1320
 	data8 sys_rt_tgsigqueueinfo
+	data8 sys_recvmmsg
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
 #endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
diff --git a/arch/microblaze/kernel/syscall_table.S b/arch/microblaze/kernel/syscall_table.S
index 4572160..623dbf1 100644
--- a/arch/microblaze/kernel/syscall_table.S
+++ b/arch/microblaze/kernel/syscall_table.S
@@ -371,3 +371,4 @@ ENTRY(sys_call_table)
 	.long sys_ni_syscall
 	.long sys_rt_tgsigqueueinfo	/* 365 */
 	.long sys_perf_counter_open
+	.long sys_recvmmsg
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index b570821..b92aa3e 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -655,6 +655,7 @@ einval:	li	v0, -ENOSYS
 	sys	sys_rt_tgsigqueueinfo	4
 	sys	sys_perf_counter_open	5
 	sys	sys_accept4		4
+	sys     sys_recvmmsg            5
 	.endm
 
 	/* We pre-compute the number of _instruction_ bytes needed to
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index 3d866f2..d3384d8 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -492,4 +492,5 @@ sys_call_table:
 	PTR	sys_rt_tgsigqueueinfo
 	PTR	sys_perf_counter_open
 	PTR	sys_accept4
+	PTR     sys_recvmmsg
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index 1a6ae12..15412f7 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -418,4 +418,5 @@ EXPORT(sysn32_call_table)
 	PTR	compat_sys_rt_tgsigqueueinfo	/* 5295 */
 	PTR	sys_perf_counter_open
 	PTR	sys_accept4
+	PTR     compat_sys_recvmmsg
 	.size	sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index cd31087..ba1e940 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -538,4 +538,5 @@ sys_call_table:
 	PTR	compat_sys_rt_tgsigqueueinfo
 	PTR	sys_perf_counter_open
 	PTR	sys_accept4
+	PTR     compat_sys_recvmmsg
 	.size	sys_call_table,.-sys_call_table
diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
index bf420b6..056e0a7 100644
--- a/arch/sh/kernel/syscalls_64.S
+++ b/arch/sh/kernel/syscalls_64.S
@@ -391,3 +391,4 @@ sys_call_table:
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo
 	.long sys_perf_counter_open
+	.long sys_recvmmsg		/* 365 */
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index 91b06b7..ff40e70 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -83,7 +83,7 @@ sys_call_table32:
 /*310*/	.word compat_sys_utimensat, compat_sys_signalfd, sys_timerfd_create, sys_eventfd, compat_sys_fallocate
 	.word compat_sys_timerfd_settime, compat_sys_timerfd_gettime, compat_sys_signalfd4, sys_eventfd2, sys_epoll_create1
 /*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, compat_sys_preadv
-	.word compat_sys_pwritev, compat_sys_rt_tgsigqueueinfo, sys_perf_counter_open
+	.word compat_sys_pwritev, compat_sys_rt_tgsigqueueinfo, sys_perf_counter_open, compat_sys_recvmmsg
 
 #endif /* CONFIG_COMPAT */
 
@@ -158,4 +158,4 @@ sys_call_table:
 /*310*/	.word sys_utimensat, sys_signalfd, sys_timerfd_create, sys_eventfd, sys_fallocate
 	.word sys_timerfd_settime, sys_timerfd_gettime, sys_signalfd4, sys_eventfd2, sys_epoll_create1
 /*320*/	.word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv
-	.word sys_pwritev, sys_rt_tgsigqueueinfo, sys_perf_counter_open
+	.word sys_pwritev, sys_rt_tgsigqueueinfo, sys_perf_counter_open, sys_recvmmsg
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index ba331bf..94dc323 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -832,4 +832,5 @@ ia32_sys_call_table:
 	.quad compat_sys_pwritev
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_counter_open
+	.quad compat_sys_recvmmsg
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 8deaada..665ea30 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_recvmmsg		337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index b9f3c60..f837601 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_counter_open			298
 __SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
+#define __NR_recvmmsg				299
+__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d51321d..4881b14 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long sys_recvmmsg
diff --git a/arch/xtensa/include/asm/unistd.h b/arch/xtensa/include/asm/unistd.h
index c092c8f..4e55dc7 100644
--- a/arch/xtensa/include/asm/unistd.h
+++ b/arch/xtensa/include/asm/unistd.h
@@ -681,8 +681,10 @@ __SYSCALL(304, sys_signalfd, 3)
 __SYSCALL(305, sys_ni_syscall, 0)
 #define __NR_eventfd				306
 __SYSCALL(306, sys_eventfd, 1)
+#define __NR_recvmmsg				307
+__SYSCALL(307, sys_recvmmsg, 5)
 
-#define __NR_syscall_count			307
+#define __NR_syscall_count			308
 
 /*
  * sysxtensa syscall handler
diff --git a/include/linux/net.h b/include/linux/net.h
index 4fc2ffd..d67587a 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -41,6 +41,7 @@
 #define SYS_SENDMSG	16		/* sys_sendmsg(2)		*/
 #define SYS_RECVMSG	17		/* sys_recvmsg(2)		*/
 #define SYS_ACCEPT4	18		/* sys_accept4(2)		*/
+#define SYS_RECVMMSG	19		/* sys_recvmmsg(2)		*/
 
 typedef enum {
 	SS_FREE = 0,			/* not allocated		*/
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 3b461df..c192bf8 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,12 @@ struct msghdr {
 	unsigned	msg_flags;
 };
 
+/* For recvmmsg/sendmmsg */
+struct mmsghdr {
+	struct msghdr   msg_hdr;
+	unsigned        msg_len;
+};
+
 /*
  *	POSIX 1003.1g - ancillary data object information
  *	Ancillary data consits of a sequence of pairs of
@@ -327,6 +333,10 @@ extern int move_addr_to_user(struct sockaddr *kaddr, int klen, void __user *uadd
 extern int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr *kaddr);
 extern int put_cmsg(struct msghdr*, int level, int type, int len, void *data);
 
+struct timespec;
+
+extern int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
+			  unsigned int flags, struct timespec *timeout);
 #endif
 #endif /* not kernel and not glibc */
 #endif /* _LINUX_SOCKET_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a8e3782..b21818c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -25,6 +25,7 @@ struct linux_dirent64;
 struct list_head;
 struct msgbuf;
 struct msghdr;
+struct mmsghdr;
 struct msqid_ds;
 struct new_utsname;
 struct nfsctl_arg;
@@ -686,6 +687,9 @@ asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
 asmlinkage long sys_recvfrom(int, void __user *, size_t, unsigned,
 				struct sockaddr __user *, int __user *);
 asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags);
+asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
+			     unsigned int vlen, unsigned flags,
+			     struct timespec __user *timeout);
 asmlinkage long sys_socket(int, int, int);
 asmlinkage long sys_socketpair(int, int, int, int __user *);
 asmlinkage long sys_socketcall(int call, unsigned long __user *args);
diff --git a/include/net/compat.h b/include/net/compat.h
index 5bbf8bf..96c38d8 100644
--- a/include/net/compat.h
+++ b/include/net/compat.h
@@ -18,6 +18,11 @@ struct compat_msghdr {
 	compat_uint_t	msg_flags;
 };
 
+struct compat_mmsghdr {
+	struct compat_msghdr msg_hdr;
+	compat_uint_t        msg_len;
+};
+
 struct compat_cmsghdr {
 	compat_size_t	cmsg_len;
 	compat_int_t	cmsg_level;
@@ -35,6 +40,9 @@ extern int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *);
 extern int verify_compat_iovec(struct msghdr *, struct iovec *, struct sockaddr *, int);
 extern asmlinkage long compat_sys_sendmsg(int,struct compat_msghdr __user *,unsigned);
 extern asmlinkage long compat_sys_recvmsg(int,struct compat_msghdr __user *,unsigned);
+extern asmlinkage long compat_sys_recvmmsg(int, struct compat_mmsghdr __user *,
+					   unsigned, unsigned,
+					   struct timespec __user *);
 extern asmlinkage long compat_sys_getsockopt(int, int, int, char __user *, int __user *);
 extern int put_cmsg_compat(struct msghdr*, int, int, int, void *);
 
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 68320f6..f581fb0 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,7 +48,9 @@ cond_syscall(sys_shutdown);
 cond_syscall(sys_sendmsg);
 cond_syscall(compat_sys_sendmsg);
 cond_syscall(sys_recvmsg);
+cond_syscall(sys_recvmmsg);
 cond_syscall(compat_sys_recvmsg);
+cond_syscall(compat_sys_recvmmsg);
 cond_syscall(sys_socketcall);
 cond_syscall(sys_futex);
 cond_syscall(compat_sys_futex);
diff --git a/net/compat.c b/net/compat.c
index 12728b1..ee57376 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -727,10 +727,10 @@ EXPORT_SYMBOL(compat_mc_getsockopt);
 
 /* Argument list sizes for compat_sys_socketcall */
 #define AL(x) ((x) * sizeof(u32))
-static unsigned char nas[19]={AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
+static unsigned char nas[20]={AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
 				AL(3),AL(3),AL(4),AL(4),AL(4),AL(6),
 				AL(6),AL(2),AL(5),AL(5),AL(3),AL(3),
-				AL(4)};
+				AL(4),AL(5)};
 #undef AL
 
 asmlinkage long compat_sys_sendmsg(int fd, struct compat_msghdr __user *msg, unsigned flags)
@@ -755,13 +755,36 @@ asmlinkage long compat_sys_recvfrom(int fd, void __user *buf, size_t len,
 	return sys_recvfrom(fd, buf, len, flags | MSG_CMSG_COMPAT, addr, addrlen);
 }
 
+asmlinkage long compat_sys_recvmmsg(int fd, struct compat_mmsghdr __user *mmsg,
+				    unsigned vlen, unsigned int flags,
+				    struct timespec __user *timeout)
+{
+	int datagrams;
+	struct timespec ktspec;
+	struct compat_timespec __user *utspec =
+			(struct compat_timespec __user *)timeout;
+
+	if (get_user(ktspec.tv_sec, &utspec->tv_sec) ||
+	    get_user(ktspec.tv_nsec, &utspec->tv_nsec))
+		return -EFAULT;
+
+	datagrams = __sys_recvmmsg(fd, (struct mmsghdr __user *)mmsg, vlen,
+				   flags | MSG_CMSG_COMPAT, &ktspec);
+	if (datagrams > 0 &&
+	    (put_user(ktspec.tv_sec, &utspec->tv_sec) ||
+	     put_user(ktspec.tv_nsec, &utspec->tv_nsec)))
+		datagrams = -EFAULT;
+
+	return datagrams;
+}
+
 asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
 {
 	int ret;
 	u32 a[6];
 	u32 a0, a1;
 
-	if (call < SYS_SOCKET || call > SYS_ACCEPT4)
+	if (call < SYS_SOCKET || call > SYS_RECVMMSG)
 		return -EINVAL;
 	if (copy_from_user(a, args, nas[call]))
 		return -EFAULT;
@@ -823,6 +846,10 @@ asmlinkage long compat_sys_socketcall(int call, u32 __user *args)
 	case SYS_RECVMSG:
 		ret = compat_sys_recvmsg(a0, compat_ptr(a1), a[2]);
 		break;
+	case SYS_RECVMMSG:
+		ret = compat_sys_recvmmsg(a0, compat_ptr(a1), a[2], a[3],
+					  compat_ptr(a[4]));
+		break;
 	case SYS_ACCEPT4:
 		ret = sys_accept4(a0, compat_ptr(a1), compat_ptr(a[2]), a[3]);
 		break;
diff --git a/net/socket.c b/net/socket.c
index 6d47165..32db56a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -668,10 +668,9 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 
 EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
 
-static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
-				 struct msghdr *msg, size_t size, int flags)
+static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
+				       struct msghdr *msg, size_t size, int flags)
 {
-	int err;
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 
 	si->sock = sock;
@@ -680,13 +679,17 @@ static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
 	si->size = size;
 	si->flags = flags;
 
-	err = security_socket_recvmsg(sock, msg, size, flags);
-	if (err)
-		return err;
-
 	return sock->ops->recvmsg(iocb, sock, msg, size, flags);
 }
 
+static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
+				 struct msghdr *msg, size_t size, int flags)
+{
+	int err = security_socket_recvmsg(sock, msg, size, flags);
+
+	return err ?: __sock_recvmsg_nosec(iocb, sock, msg, size, flags);
+}
+
 int sock_recvmsg(struct socket *sock, struct msghdr *msg,
 		 size_t size, int flags)
 {
@@ -702,6 +705,21 @@ int sock_recvmsg(struct socket *sock, struct msghdr *msg,
 	return ret;
 }
 
+static int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
+			      size_t size, int flags)
+{
+	struct kiocb iocb;
+	struct sock_iocb siocb;
+	int ret;
+
+	init_sync_kiocb(&iocb, NULL);
+	iocb.private = &siocb;
+	ret = __sock_recvmsg_nosec(&iocb, sock, msg, size, flags);
+	if (-EIOCBQUEUED == ret)
+		ret = wait_on_sync_kiocb(&iocb);
+	return ret;
+}
+
 int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
 		   struct kvec *vec, size_t num, size_t size, int flags)
 {
@@ -1965,22 +1983,15 @@ out:
 	return err;
 }
 
-/*
- *	BSD recvmsg interface
- */
-
-SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
-		unsigned int, flags)
+static int __sys_recvmsg(struct socket *sock, struct msghdr __user *msg,
+			 struct msghdr *msg_sys, unsigned flags, int nosec)
 {
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
-	struct socket *sock;
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
-	struct msghdr msg_sys;
 	unsigned long cmsg_ptr;
 	int err, iov_size, total_len, len;
-	int fput_needed;
 
 	/* kernel mode address */
 	struct sockaddr_storage addr;
@@ -1990,27 +2001,23 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 	int __user *uaddr_len;
 
 	if (MSG_CMSG_COMPAT & flags) {
-		if (get_compat_msghdr(&msg_sys, msg_compat))
+		if (get_compat_msghdr(msg_sys, msg_compat))
 			return -EFAULT;
 	}
-	else if (copy_from_user(&msg_sys, msg, sizeof(struct msghdr)))
+	else if (copy_from_user(msg_sys, msg, sizeof(struct msghdr)))
 		return -EFAULT;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
-	if (!sock)
-		goto out;
-
 	err = -EMSGSIZE;
-	if (msg_sys.msg_iovlen > UIO_MAXIOV)
-		goto out_put;
+	if (msg_sys->msg_iovlen > UIO_MAXIOV)
+		goto out;
 
 	/* Check whether to allocate the iovec area */
 	err = -ENOMEM;
-	iov_size = msg_sys.msg_iovlen * sizeof(struct iovec);
-	if (msg_sys.msg_iovlen > UIO_FASTIOV) {
+	iov_size = msg_sys->msg_iovlen * sizeof(struct iovec);
+	if (msg_sys->msg_iovlen > UIO_FASTIOV) {
 		iov = sock_kmalloc(sock->sk, iov_size, GFP_KERNEL);
 		if (!iov)
-			goto out_put;
+			goto out;
 	}
 
 	/*
@@ -2018,46 +2025,47 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 	 *      kernel msghdr to use the kernel address space)
 	 */
 
-	uaddr = (__force void __user *)msg_sys.msg_name;
+	uaddr = (__force void __user *)msg_sys->msg_name;
 	uaddr_len = COMPAT_NAMELEN(msg);
 	if (MSG_CMSG_COMPAT & flags) {
-		err = verify_compat_iovec(&msg_sys, iov,
+		err = verify_compat_iovec(msg_sys, iov,
 					  (struct sockaddr *)&addr,
 					  VERIFY_WRITE);
 	} else
-		err = verify_iovec(&msg_sys, iov,
+		err = verify_iovec(msg_sys, iov,
 				   (struct sockaddr *)&addr,
 				   VERIFY_WRITE);
 	if (err < 0)
 		goto out_freeiov;
 	total_len = err;
 
-	cmsg_ptr = (unsigned long)msg_sys.msg_control;
-	msg_sys.msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
+	cmsg_ptr = (unsigned long)msg_sys->msg_control;
+	msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
 
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
-	err = sock_recvmsg(sock, &msg_sys, total_len, flags);
+	err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys,
+							  total_len, flags);
 	if (err < 0)
 		goto out_freeiov;
 	len = err;
 
 	if (uaddr != NULL) {
 		err = move_addr_to_user((struct sockaddr *)&addr,
-					msg_sys.msg_namelen, uaddr,
+					msg_sys->msg_namelen, uaddr,
 					uaddr_len);
 		if (err < 0)
 			goto out_freeiov;
 	}
-	err = __put_user((msg_sys.msg_flags & ~MSG_CMSG_COMPAT),
+	err = __put_user((msg_sys->msg_flags & ~MSG_CMSG_COMPAT),
 			 COMPAT_FLAGS(msg));
 	if (err)
 		goto out_freeiov;
 	if (MSG_CMSG_COMPAT & flags)
-		err = __put_user((unsigned long)msg_sys.msg_control - cmsg_ptr,
+		err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr,
 				 &msg_compat->msg_controllen);
 	else
-		err = __put_user((unsigned long)msg_sys.msg_control - cmsg_ptr,
+		err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr,
 				 &msg->msg_controllen);
 	if (err)
 		goto out_freeiov;
@@ -2066,21 +2074,150 @@ SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
 out_freeiov:
 	if (iov != iovstack)
 		sock_kfree_s(sock->sk, iov, iov_size);
-out_put:
+out:
+	return err;
+}
+
+/*
+ *	BSD recvmsg interface
+ */
+
+SYSCALL_DEFINE3(recvmsg, int, fd, struct msghdr __user *, msg,
+		unsigned int, flags)
+{
+	int fput_needed, err;
+	struct msghdr msg_sys;
+	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);
+
+	if (!sock)
+		goto out;
+
+	err = __sys_recvmsg(sock, msg, &msg_sys, flags, 0);
+
 	fput_light(sock->file, fput_needed);
 out:
 	return err;
 }
 
-#ifdef __ARCH_WANT_SYS_SOCKETCALL
+/*
+ *     Linux recvmmsg interface
+ */
+
+int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
+		   unsigned int flags, struct timespec *timeout)
+{
+	int fput_needed, err, datagrams;
+	struct socket *sock;
+	struct mmsghdr __user *entry;
+	struct msghdr msg_sys;
+	struct timespec end_time;
+
+	if (timeout &&
+	    poll_select_set_timeout(&end_time, timeout->tv_sec,
+				    timeout->tv_nsec))
+		return -EINVAL;
+
+	datagrams = 0;
+
+	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	if (!sock)
+		return err;
+
+	err = sock_error(sock->sk);
+	if (err)
+		goto out_put;
+
+	entry = mmsg;
+
+	while (datagrams < vlen) {
+		/*
+		 * No need to ask LSM for more than the first datagram.
+		 */
+		err = __sys_recvmsg(sock, (struct msghdr __user *)entry,
+				    &msg_sys, flags, datagrams);
+		if (err < 0)
+			break;
+		err = put_user(err, &entry->msg_len);
+		if (err)
+			break;
+		++entry;
+		++datagrams;
+
+		if (timeout) {
+			ktime_get_ts(timeout);
+			*timeout = timespec_sub(end_time, *timeout);
+			if (timeout->tv_sec < 0) {
+				timeout->tv_sec = timeout->tv_nsec = 0;
+				break;
+			}
+
+			/* Timeout, return less than vlen datagrams */
+			if (timeout->tv_nsec == 0 && timeout->tv_sec == 0)
+				break;
+		}
+
+		/* Out of band data, return right away */
+		if (msg_sys.msg_flags & MSG_OOB)
+			break;
+	}
+
+out_put:
+	fput_light(sock->file, fput_needed);
 
+	if (err == 0)
+		return datagrams;
+
+	if (datagrams != 0) {
+		/*
+		 * We may return less entries than requested (vlen) if the
+		 * sock is non block and there aren't enough datagrams...
+		 */
+		if (err != -EAGAIN) {
+			/*
+			 * ... or  if recvmsg returns an error after we
+			 * received some datagrams, where we record the
+			 * error to return on the next call or if the
+			 * app asks about it using getsockopt(SO_ERROR).
+			 */
+			sock->sk->sk_err = -err;
+		}
+
+		return datagrams;
+	}
+
+	return err;
+}
+
+SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
+		unsigned int, vlen, unsigned int, flags,
+		struct timespec __user *, timeout)
+{
+	int datagrams;
+	struct timespec timeout_sys;
+
+	if (!timeout)
+		return __sys_recvmmsg(fd, mmsg, vlen, flags, NULL);
+
+	if (copy_from_user(&timeout_sys, timeout, sizeof(timeout_sys)))
+		return -EFAULT;
+
+	datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys);
+
+	if (datagrams > 0 &&
+	    copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
+		datagrams = -EFAULT;
+
+	return datagrams;
+}
+
+#ifdef __ARCH_WANT_SYS_SOCKETCALL
 /* Argument list sizes for sys_socketcall */
 #define AL(x) ((x) * sizeof(unsigned long))
-static const unsigned char nargs[19]={
+static const unsigned char nargs[20] = {
 	AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
 	AL(3),AL(3),AL(4),AL(4),AL(4),AL(6),
 	AL(6),AL(2),AL(5),AL(5),AL(3),AL(3),
-	AL(4)
+	AL(4),AL(5)
 };
 
 #undef AL
@@ -2099,7 +2236,7 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
 	unsigned long a0, a1;
 	int err;
 
-	if (call < 1 || call > SYS_ACCEPT4)
+	if (call < 1 || call > SYS_RECVMMSG)
 		return -EINVAL;
 
 	/* copy_from_user should be SMP safe. */
@@ -2173,6 +2310,10 @@ SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
 	case SYS_RECVMSG:
 		err = sys_recvmsg(a0, (struct msghdr __user *)a1, a[2]);
 		break;
+	case SYS_RECVMMSG:
+		err = sys_recvmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3],
+				   (struct timespec __user *)a[4]);
+		break;
 	case SYS_ACCEPT4:
 		err = sys_accept4(a0, (struct sockaddr __user *)a1,
 				  (int __user *)a[2], a[3]);
-- 
1.5.5.1
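
The remaining-time bookkeeping in the receive loop above can be sketched in userspace as follows. This is only an illustration under assumptions: ts_sub() and remaining_timeout() are hypothetical helper names, not kernel functions, and the caller passes the current time instead of calling ktime_get_ts().

```c
#define _POSIX_C_SOURCE 200809L	/* struct timespec under strict -std modes */
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper: a - b, normalised so 0 <= tv_nsec < NSEC_PER_SEC. */
static struct timespec ts_sub(struct timespec a, struct timespec b)
{
	struct timespec r;

	r.tv_sec = a.tv_sec - b.tv_sec;
	r.tv_nsec = a.tv_nsec - b.tv_nsec;
	if (r.tv_nsec < 0) {
		r.tv_sec--;
		r.tv_nsec += NSEC_PER_SEC;
	}
	return r;
}

/*
 * Hypothetical helper mirroring the loop's timeout check: store the time
 * left until end_time in *left, clamped to zero. Returns 1 when the
 * deadline has been reached (stop receiving), 0 otherwise.
 */
static int remaining_timeout(struct timespec end_time, struct timespec now,
			     struct timespec *left)
{
	*left = ts_sub(end_time, now);
	if (left->tv_sec < 0) {
		left->tv_sec = 0;
		left->tv_nsec = 0;
		return 1;
	}
	return left->tv_sec == 0 && left->tv_nsec == 0;
}
```

As in the patch, an overshoot is reported to the caller as a zero timeout rather than as a negative value.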



^ permalink raw reply related

* [RFCv4 PATCH 0/2] net: Introduce recvmmsg socket syscall
From: Arnaldo Carvalho de Melo @ 2009-09-16 17:07 UTC (permalink / raw)
  To: David Miller
  Cc: Caitlin Bestler, Chris Van Hoof, Clark Williams, Neil Horman,
	Nir Tzachar, Nivedita Singhvi, Paul Moore,
	Rémi Denis-Courmont, Steven Whitehouse,
	Linux Networking Development Mailing List

Hi,

	Nir, can you please test with this patchset and check if latency
numbers improved? They should, I think :-)

	New perf callgraphs here:

http://oops.ghostprotocols.net:81/acme/perf.recvmsg.step2.cg.data.txt.bz2

versus

http://oops.ghostprotocols.net:81/acme/perf.recvmmsg.step2.cg.data.txt.bz2

	Look at what appears now on the radar, it's not locking :-)
	
	Of course, I need to do more tests, but it looks promising, please give
it a go and report back here if you can!

- Arnaldo

# Samples: 761074
#
# Overhead   Command             Shared Object  Symbol
# ........  ........  ........................  ......
#
     6.54%  recvmmsg  [kernel]                  [k] skb_set_owner_r
                |
                |--99.43%-- sock_queue_rcv_skb
                |          __udp_queue_rcv_skb
                |          sk_backlog_rcv
                |          release_sock
                |          __sys_recvmmsg
                |          sys_recvmmsg
                |          system_call_fastpath
                |          syscall
                |          |
                |           --12.76%-- main
                |                     __libc_start_main
                |
                 --0.57%-- __udp_queue_rcv_skb
                           sk_backlog_rcv
                           release_sock
                           __sys_recvmmsg
                           sys_recvmmsg
                           system_call_fastpath
                           syscall
                           |
                            --10.84%-- main
                                      __libc_start_main

     5.88%  recvmmsg  [kernel]                  [k] _spin_lock_irqsave
                |
                |--47.58%-- skb_queue_tail
                |          sock_queue_rcv_skb
                |          __udp_queue_rcv_skb
                |          sk_backlog_rcv
                |          release_sock
                |          __sys_recvmmsg
                |          sys_recvmmsg
                |          system_call_fastpath
                |          syscall
                |          |
                |           --12.56%-- main
                |                     __libc_start_main
                |
                |--41.85%-- __skb_recv_datagram
                |          __udp_recvmsg
                |          udp_unlocked_recvmsg
                |          sock_common_unlocked_recvmsg
                |          __sock_unlocked_recvmsg_nosec
                |          |
                |          |--98.41%-- sock_unlocked_recvmsg_nosec
                |          |          __sys_recvmsg
                |          |          __sys_recvmmsg
                |          |          sys_recvmmsg
                |          |          system_call_fastpath
                |          |          syscall
                |          |          |
                |          |           --12.82%-- main
                |          |                     __libc_start_main
                |          |
                |           --1.59%-- sock_unlocked_recvmsg
                |                     __sys_recvmsg
                |                     __sys_recvmmsg
                |                     sys_recvmmsg


- Arnaldo
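
For anyone wanting to try the syscall from userspace, a minimal sketch of a batched receive might look like the following. It assumes a libc that provides a recvmmsg() wrapper and struct mmsghdr (glibc does so under _GNU_SOURCE in later releases; with only this patchset applied you would have to go through syscall() instead). recv_batch(), VLEN and BUFSZ are made-up names for illustration.

```c
#define _GNU_SOURCE		/* recvmmsg() and struct mmsghdr in glibc */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

#define VLEN	8	/* max datagrams per call (illustrative) */
#define BUFSZ	256	/* per-datagram buffer size (illustrative) */

/*
 * Receive up to VLEN datagrams from fd with one system call.
 * lens[i] is set to the length of datagram i. Returns the number of
 * datagrams received, or -1 with errno set. MSG_DONTWAIT makes the
 * call return whatever is already queued instead of blocking.
 */
static int recv_batch(int fd, char bufs[VLEN][BUFSZ], unsigned int lens[VLEN],
		      struct timespec *timeout)
{
	struct mmsghdr msgs[VLEN];
	struct iovec iov[VLEN];
	int i, n;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < VLEN; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = BUFSZ;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	n = recvmmsg(fd, msgs, VLEN, MSG_DONTWAIT, timeout);
	for (i = 0; i < n; i++)
		lens[i] = msgs[i].msg_len;	/* filled in per entry */
	return n;
}
```

A return value smaller than VLEN is normal: it just means fewer datagrams were queued, or the timeout expired, before the vector was full.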

^ permalink raw reply

* Re: ipv4 regression in 2.6.31 ?
From: Stephen Hemminger @ 2009-09-16 17:00 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Stephan von Krawczynski, Eric Dumazet, linux-kernel,
	Linux Netdev List
In-Reply-To: <20090916052304.GA4894@ff.dom.local>

On Wed, 16 Sep 2009 05:23:04 +0000
Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Tue, Sep 15, 2009 at 03:57:19PM -0700, Stephen Hemminger wrote:
> > On Tue, 15 Sep 2009 08:13:55 +0000
> > Jarek Poplawski <jarkao2@gmail.com> wrote:
> > 
> > > On 14-09-2009 18:31, Stephen Hemminger wrote:
> > > > On Mon, 14 Sep 2009 17:55:05 +0200
> > > > Stephan von Krawczynski <skraw@ithnet.com> wrote:
> > > > 
> > > >> On Mon, 14 Sep 2009 15:57:03 +0200
> > > >> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > >>
> > > > >>> Stephan von Krawczynski wrote:
> > > >>>> Hello all,
> > > ...
> > > >>> rp_filter - INTEGER
> > > >>>         0 - No source validation.
> > > >>>         1 - Strict mode as defined in RFC3704 Strict Reverse Path
> > > >>>             Each incoming packet is tested against the FIB and if the interface
> > > >>>             is not the best reverse path the packet check will fail.
> > > >>>             By default failed packets are discarded.
> > > >>>         2 - Loose mode as defined in RFC3704 Loose Reverse Path
> > > >>>             Each incoming packet's source address is also tested against the FIB
> > > >>>             and if the source address is not reachable via any interface
> > > >>>             the packet check will fail.
> > > ...
> > > > > RP filter did not work correctly in 2.6.30. The code added to the loose
> > > > mode caused a bug; the rp_filter value was being computed as:
> > > >   rp_filter = interface_value & all_value;
> > > > So in order to get reverse path filter both would have to be set.
> > > > 
> > > > > In 2.6.31 this was changed to:
> > > >    rp_filter = max(interface_value, all_value);
> > > > 
> > > > This was the intended behaviour, if user asks all interfaces to have rp
> > > > filtering turned on, then set /proc/sys/net/ipv4/conf/all/rp_filter = 1
> > > > or to turn on just one interface, set it for just that interface.
> > > 
> > > Alas this max() formula handles also cases where both values are set
> > > and it doesn't look very natural/"user friendly" to me. Especially
> > > with something like this: all_value = 2; interface_value = 1
> > > Why would anybody care to bother with interface_value in such a case?
> > > 
> > > "All" suggests "default" in this context, so I'd rather expect
> > > something like:
> > >     rp_filter = interface_value ? : all_value;
> > > which gives "the intended behaviour" too, plus more...
> > > 
> > > We'd only need to add e.g.:
> > >  0 - Default ("all") validation. (No source validation if "all" is 0).
> > >  3 - No source validation on this interface.
> > 
> > More values == more confusion.
> > I chose the maxconf() method to make rp_filter consistent with other
> > multi valued variables (arp_announce and arp_ignore).
> 
> This additional value is not necessary (it'd give us superpowers).
> Max seems logical to me only when values are sorted (especially if
> max is the strictest).

The values had to be unsorted because of the requirement to retain
interface compatibility with older releases.
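
To make the difference between the three formulas in this thread concrete, here is a small sketch. The function names are invented for illustration and none of this is the actual kernel code; it only evaluates the expressions quoted above.

```c
#include <assert.h>

/* 2.6.30 behaviour as described above: bitwise AND of the two settings. */
static int rp_filter_and(int iface, int all)
{
	return iface & all;
}

/* 2.6.31 behaviour: the larger of the two settings wins. */
static int rp_filter_max(int iface, int all)
{
	return iface > all ? iface : all;
}

/* Jarek's suggestion: "all" is only a default for interfaces left at 0. */
static int rp_filter_default(int iface, int all)
{
	return iface ? iface : all;
}
```

With all = 2 and interface = 1, the AND form yields 0 (no filtering), max() yields 2 (loose) and the default form yields 1 (strict), which is the case Jarek is pointing at.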
-- 

^ permalink raw reply

* Re: igb bandwidth allocation configuration
From: Nelson, Shannon @ 2009-09-16 16:10 UTC (permalink / raw)
  To: Simon Horman, Or Gerlitz
  Cc: e1000-devel@lists.sourceforge.net, netdev, Kirsher, Jeffrey T,
	Alexander Duyck
In-Reply-To: <20090916070443.GB22495@verge.net.au>

Simon Horman wrote:
>On Wed, Sep 16, 2009 at 09:47:28AM +0300, Or Gerlitz wrote:
>> also is there
>> 82599 (Niantic) documentation which is publicly avail and I can look
>> at? specifically, I would love taking a look on the equivalent of
>> the "Intel 82576 SR-IOV Driver Companion Guide"
>
>Sorry, I don't know anything about the 82599. But I am only working
>with publicly available documentation.

The "82599 Developer Manual" is available at http://sourceforge.net/projects/e1000/files/.

sln

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-16 16:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Gregory Haskins, Avi Kivity, Ira W. Snyder, netdev,
	virtualization, kvm, linux-kernel, mingo, linux-mm, akpm, hpa,
	Rusty Russell, s.hetze, alacrityvm-devel
In-Reply-To: <200909161722.37606.arnd@arndb.de>

On Wed, Sep 16, 2009 at 05:22:37PM +0200, Arnd Bergmann wrote:
> On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > > > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> > > 
> > > This might have portability issues. On x86 it should work, but if the
> > > host is powerpc or similar, you cannot reliably access PCI I/O memory
> > > through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> > > calls, which don't work on user pointers.
> > > 
> > > Specifically on powerpc, copy_from_user cannot access unaligned buffers
> > > if they are on an I/O mapping.
> > > 
> > We are talking about doing this in userspace, not in kernel.
> 
> Ok, that's fine then. I thought the idea was to use the vhost_net driver

It's a separate issue. We were talking generally about configuration
and setup. Gregory implemented it in kernel, Avi wants it
moved to userspace, with only fastpath in kernel.

> to access the user memory, which would be a really cute hack otherwise,
> as you'd only need to provide the eventfds from a hardware specific
> driver and could use the regular virtio_net on the other side.
> 
> 	Arnd <><

To do that, maybe copy to user on ppc can be fixed, or wrapped
around in a arch specific macro, so that everyone else
does not have to go through abstraction layers.

-- 
MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-16 15:59 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Michael S. Tsirkin, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB0F1EF.5050102@gmail.com>

On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>
>> If kvm can do it, others can.
>>      
> The problem is that you seem to either hand-wave over details like this,
> or you give details that are pretty much exactly what vbus does already.
>   My point is that I've already sat down and thought about these issues
> and solved them in a freely available GPL'ed software package.
>    

In the kernel.  IMO that's the wrong place for it.  Further, if we adopt 
vbus, we either drop compatibility with existing guests or have to support 
both vbus and virtio-pci.

> So the question is: is your position that vbus is all wrong and you wish
> to create a new bus-like thing to solve the problem?

I don't intend to create anything new, I am satisfied with virtio.  If 
it works for Ira, excellent.  If not, too bad.  I believe it will work 
without too much trouble.

> If so, how is it
> different from what I've already done?  More importantly, what specific
> objections do you have to what I've done, as perhaps they can be fixed
> instead of starting over?
>    

The two biggest objections are:
- the host side is in the kernel
- the guest side is a new bus instead of reusing pci (on x86/kvm), 
making Windows support more difficult

I guess these two are exactly what you think are vbus' greatest 
advantages, so we'll probably have to extend our agree-to-disagree on 
this one.

I also had issues with using just one interrupt vector to service all 
events, but that's easily fixed.

>> There is no guest and host in this scenario.  There's a device side
>> (ppc) and a driver side (x86).  The driver side can access configuration
>> information on the device side.  How to multiplex multiple devices is an
>> interesting exercise for whoever writes the virtio binding for that setup.
>>      
> Bingo.  So now it's a question of do you want to write this layer from
> scratch, or re-use my framework.
>    

You will have to implement a connector or whatever for vbus as well.  
vbus has more layers so it's probably smaller for vbus.

>>>>
>>>>          
>>> I am talking about how we would tunnel the config space for N devices
>>> across his transport.
>>>
>>>        
>> Sounds trivial.
>>      
> No one said it was rocket science.  But it does need to be designed and
> implemented end-to-end, much of which I've already done in what I hope is
> an extensible way.
>    

It was already implemented three times for virtio, so apparently that's 
extensible too.

>>   Write an address containing the device number and
>> register number to one location, read or write data from another.
>>      
> You mean like the "u64 devh", and "u32 func" fields I have here for the
> vbus-kvm connector?
>
> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64
>
>    

Probably.
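
For illustration, the address/data scheme described above could be sketched like this. Everything here is hypothetical (struct cfg_window, NDEV, NREG and the helper names are invented, and the device side is simulated with a plain array); it is not how vbus or any virtio binding actually lays things out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDEV	4	/* hypothetical number of devices behind the window */
#define NREG	16	/* hypothetical registers per device */

struct cfg_window {
	uint32_t addr;			/* last selected (device << 16) | reg */
	uint32_t regs[NDEV][NREG];	/* simulated per-device config space */
};

/* Step 1: latch the device and register number in the address location. */
static void cfg_select(struct cfg_window *w, uint16_t dev, uint16_t reg)
{
	w->addr = ((uint32_t)dev << 16) | reg;
}

/* Decode the latched address; NULL if it names no existing register. */
static uint32_t *cfg_slot(struct cfg_window *w)
{
	uint32_t dev = w->addr >> 16, reg = w->addr & 0xffff;

	return (dev < NDEV && reg < NREG) ? &w->regs[dev][reg] : NULL;
}

/* Step 2: access the data location; it acts on the selected register. */
static int cfg_write(struct cfg_window *w, uint32_t val)
{
	uint32_t *slot = cfg_slot(w);

	if (!slot)
		return -1;
	*slot = val;
	return 0;
}

static int cfg_read(struct cfg_window *w, uint32_t *val)
{
	uint32_t *slot = cfg_slot(w);

	if (!slot)
		return -1;
	*val = *slot;
	return 0;
}
```

Two locations are enough to reach the config space of N devices, at the cost of needing some serialisation between select and access if more than one context uses the window.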



>>> That sounds convenient given his hardware, but it has its own set of
>>> problems.  For one, the configuration/inventory of these boards is now
>>> driven by the wrong side and has to be addressed.
>>>        
>> Why is it the wrong side?
>>      
> "Wrong" is probably too harsh a word when looking at ethernet.  It's
> certainly "odd", and possibly inconvenient.  It would be like having
> vhost in a KVM guest, and virtio-net running on the host.  You could do
> it, but it's weird and awkward.  Where it really falls apart and enters
> the "wrong" category is for non-symmetric devices, like disk-io.
>
>    


It's not odd or wrong or weird or awkward.  An ethernet NIC is not 
symmetric, one side does DMA and issues interrupts, the other uses its 
own memory.  That's exactly the case with Ira's setup.

If the ppc boards were to emulate a disk controller, you'd run 
virtio-blk on x86 and vhost-blk on the ppc boards.

>>> Second, the role
>>> reversal will likely not work for many models other than ethernet (e.g.
>>> virtio-console or virtio-blk drivers running on the x86 board would be
>>> naturally consuming services from the slave boards...virtio-net is an
>>> exception because 802.x is generally symmetrical).
>>>
>>>        
>> There is no role reversal.
>>      
> So if I have virtio-blk driver running on the x86 and vhost-blk device
> running on the ppc board, I can use the ppc board as a block-device.
> What if I really wanted to go the other way?
>    

You mean, if the x86 board was able to access the disks and dma into the 
ppc board's memory?  You'd run vhost-blk on x86 and virtio-net on ppc.

As long as you don't use the words "guest" and "host" but keep to 
"driver" and "device", it all works out.

>> The side doing dma is the device, the side
>> accessing its own memory is the driver.  Just like the other 1e12
>> driver/device pairs out there.
>>      
> IIUC, his ppc boards really can be seen as "guests" (they are linux
> instances that are utilizing services from the x86, not the other way
> around).

They aren't guests.  Guests don't dma into their host's memory.

> vhost forces the model to have the ppc boards act as IO-hosts,
> whereas vbus would likely work in either direction due to its more
> refined abstraction layer.
>    

vhost=device=dma, virtio=driver=own-memory.

>> Of course vhost is incomplete, in the same sense that Linux is
>> incomplete.  Both require userspace.
>>      
> A vhost based solution to Ira's design is missing more than userspace.
> Many of those gaps are addressed by a vbus based solution.
>    

Maybe.  Ira can fill the gaps or use vbus.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply

* Re: igb bandwidth allocation configuration
From: Alexander Duyck @ 2009-09-16 15:53 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Simon Horman, e1000-devel@lists.sourceforge.net,
	netdev@vger.kernel.org, Alexander Duyck, Kirsher, Jeffrey T
In-Reply-To: <4AB0F1BD.4020206@voltaire.com>

Or Gerlitz wrote:
> Alexander Duyck wrote:
>> The interface for all of this would make sense as part of a virtual 
>> ethernet switch control which is the way I am currently leaning on all 
>> this.
> Yes, you can say that out of the per VF <mac, vlan-id, priority, rate> 
> tuple I mentioned, except for the mac, the other parameters actually 
> belong to the egress flow of the virtual switch port this VF is 
> connected to. So the vswitch actually signs the packet with vlan+pbits 
> and enforces the rate. Now vswitch can be software based, or hardware 
> NIC based.

Even something such as MAC address would make sense for a virtual 
ethernet switch configuration in that you could restrict unicast ingress 
traffic for the VF to a specific address much like you would do on a 
regular L2 switch.

> Now, I assume there may be NICs which will let you configure the 
> <vlan-id, priority, rate> as part of the their virtual switch config, 
> but others, e.g
> the 82576 as an example, and following our discussion, will let you do 
> that for the VF, in the VF driver which as you said may run the guest OS 
> where we can't control it...

I think you may be a bit confused.  The configuration for the VFs would 
be part of the PF via the virtual ethernet switch control.  As a result 
it is only the PF which needs to be running on the host.

Thanks,

Alex

^ permalink raw reply

* [PATCH] add vif using local interface index instead of IP
From: Ilia K. @ 2009-09-16 15:53 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 1426 bytes --]

When routing daemon wants to enable forwarding of multicast traffic it
performs something like:

       struct vifctl vc = {
               .vifc_vifi  = 1,
               .vifc_flags = 0,
               .vifc_threshold = 1,
               .vifc_rate_limit = 0,
               /* IP address of the physical interface, e.g. eth0 */
               .vifc_lcl_addr = ip,
               .vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
         };
       setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));

In the kernel this ends up in vif_add(), which looks up the (physical)
device by the assigned IP address:
       dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);

The current API (struct vifctl) offers no way to specify an
interface other than by its IP address, and if more than one
interface has the specified IP, only the first one will be found.

The attached patch (against 2.6.30.4) allows an interface to be
specified by its index instead of by IP address:

       struct vifctl vc = {
               .vifc_vifi  = 1,
               .vifc_flags = VIFF_USE_IFINDEX,   /* NEW */
               .vifc_threshold = 1,
               .vifc_rate_limit = 0,
               .vifc_lcl_ifindex = if_nametoindex("eth0"),   /* NEW */
               .vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
         };
       setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));


Signed-off-by: Ilia K. <mail4ilia@gmail.com>

[-- Attachment #2: vif_add.patch --]
[-- Type: text/x-diff, Size: 1634 bytes --]

=== modified file 'include/linux/mroute.h'
--- old/include/linux/mroute.h	2009-08-10 11:17:32 +0000
+++ new/include/linux/mroute.h	2009-09-08 06:58:46 +0000
@@ -59,13 +59,18 @@
 	unsigned char vifc_flags;	/* VIFF_ flags */
 	unsigned char vifc_threshold;	/* ttl limit */
 	unsigned int vifc_rate_limit;	/* Rate limiter values (NI) */
-	struct in_addr vifc_lcl_addr;	/* Our address */
+	union {
+		struct in_addr vifc_lcl_addr;     /* Local interface address */
+		int            vifc_lcl_ifindex;  /* Local interface index   */
+	};
 	struct in_addr vifc_rmt_addr;	/* IPIP tunnel addr */
 };
 
-#define VIFF_TUNNEL	0x1	/* IPIP tunnel */
-#define VIFF_SRCRT	0x2	/* NI */
-#define VIFF_REGISTER	0x4	/* register vif	*/
+#define VIFF_TUNNEL		0x1	/* IPIP tunnel */
+#define VIFF_SRCRT		0x2	/* NI */
+#define VIFF_REGISTER		0x4	/* register vif	*/
+#define VIFF_USE_IFINDEX	0x8	/* use vifc_lcl_ifindex instead of
+					   vifc_lcl_addr to find an interface */
 
 /*
  *	Cache manipulation structures for mrouted and PIMd

=== modified file 'net/ipv4/ipmr.c'
--- old/net/ipv4/ipmr.c	2009-08-10 11:17:32 +0000
+++ new/net/ipv4/ipmr.c	2009-09-08 06:34:21 +0000
@@ -470,8 +470,18 @@
 			return err;
 		}
 		break;
+
+	case VIFF_USE_IFINDEX:
 	case 0:
-		dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
+		if (vifc->vifc_flags == VIFF_USE_IFINDEX) {
+			dev = dev_get_by_index(net, vifc->vifc_lcl_ifindex);
+			if (dev && dev->ip_ptr == NULL) {
+				dev_put(dev);
+				return -EADDRNOTAVAIL;
+			}
+		} else
+			dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
+
 		if (!dev)
 			return -EADDRNOTAVAIL;
 		err = dev_set_allmulti(dev, 1);


^ permalink raw reply

* Re: fanotify as syscalls
From: Eric Paris @ 2009-09-16 15:53 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Alan Cox, Alan Cox, Linus Torvalds, Evgeniy Polyakov,
	David Miller, linux-kernel, linux-fsdevel, netdev, viro, hch
In-Reply-To: <20090916125658.GF29359@shareable.org>

On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> Alan Cox wrote:
> > > You can't rely on the name being non-racy, but you _can_ reliably
> > > invalidate application-level caches from the sequence of events
> > > including file writes, creates, renames, links, unlinks, mounts.  And
> > > revalidate such caches by the absence of pending events.
> > 
> > You can't however create the caches reliably because you've no idea if
> > you are referencing the right object in the first place - which is why
> > you want a handle in these cases. I see fanotify as a handle producing
> > addition to inotify, not as a replacement (plus some other bits around
> > open blocking for HSM etc)
> 
> There are two sets of events getting mixed up here.  Inode events -
> reads, writes, truncates, chmods; and directory events - renames,
> links, creates, unlinks.

My understanding of your argument is that fanotify does not yet provide
all inotify events, namely those of directory operations, and thus is
not suitable to wholesale replace everything inotify can do.  I've
already said that working towards that goal is something I plan to
pursue, but for now, you still have inotify.

The mlocate/updatedb people ask me about fanotify and it's on the todo
list to allow global reception of such events.  The fd you get would
be of the dir where the event happened.  They didn't care, and I haven't
decided if we would provide the path component like inotify does.  Most
users are perfectly happy to stat everything in the dir.

It's hopefully feasible, but it's going to take some fsnotify hook
movements and possibly some arguments with Al to get the information I
want where I want it.  But there is nothing about the interface that
precludes it and it has been discussed and considered.

Am I still missing it?

-Eric


^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Arnd Bergmann @ 2009-09-16 15:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Avi Kivity, Ira W. Snyder, netdev,
	virtualization, kvm, linux-kernel, mingo, linux-mm, akpm, hpa,
	Rusty Russell, s.hetze, alacrityvm-devel
In-Reply-To: <20090916151329.GC5513@redhat.com>

On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> > On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> > 
> > This might have portability issues. On x86 it should work, but if the
> > host is powerpc or similar, you cannot reliably access PCI I/O memory
> > through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> > calls, which don't work on user pointers.
> > 
> > Specifically on powerpc, copy_from_user cannot access unaligned buffers
> > if they are on an I/O mapping.
> > 
> We are talking about doing this in userspace, not in kernel.

Ok, that's fine then. I thought the idea was to use the vhost_net driver
to access the user memory, which would be a really cute hack otherwise,
as you'd only need to provide the eventfds from a hardware specific
driver and could use the regular virtio_net on the other side.

	Arnd <><

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-16 15:13 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Gregory Haskins, Avi Kivity, Ira W. Snyder, netdev,
	virtualization, kvm, linux-kernel, mingo, linux-mm, akpm, hpa,
	Rusty Russell, s.hetze, alacrityvm-devel
In-Reply-To: <200909161657.42628.arnd@arndb.de>

On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> 
> This might have portability issues. On x86 it should work, but if the
> host is powerpc or similar, you cannot reliably access PCI I/O memory
> through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> calls, which don't work on user pointers.
> 
> Specifically on powerpc, copy_from_user cannot access unaligned buffers
> if they are on an I/O mapping.
> 
> 	Arnd <><

We are talking about doing this in userspace, not in kernel.

-- 
MST

^ permalink raw reply

* Re: bonding...
From: Jay Vosburgh @ 2009-09-16 15:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20090916.014459.86856955.davem@davemloft.net>

David Miller <davem@davemloft.net> wrote:

>
>Jay, there is quite a backlog of bonding patches in the queue right
>now.
>
>I'd like to know when you'll get to processing them because it's more
>than two weeks (!!) for some of them and these folks are going to miss
>the merge window out of no fault of their own.

	I'll get through them today.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Arnd Bergmann @ 2009-09-16 14:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Avi Kivity, Ira W. Snyder, netdev,
	virtualization, kvm, linux-kernel, mingo, linux-mm, akpm, hpa,
	Rusty Russell, s.hetze, alacrityvm-devel
In-Reply-To: <20090915212545.GC27954@redhat.com>

On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> Userspace in x86 maps a PCI region, uses it for communication with ppc?

This might have portability issues. On x86 it should work, but if the
host is powerpc or similar, you cannot reliably access PCI I/O memory
through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
calls, which don't work on user pointers.

Specifically on powerpc, copy_from_user cannot access unaligned buffers
if they are on an I/O mapping.

	Arnd <><

^ permalink raw reply

* [patch 4/7] [PATCH] af_iucv: fix race in __iucv_sock_wait()
From: Ursula Braun @ 2009-09-16 14:37 UTC (permalink / raw)
  To: davem, netdev, linux-s390
  Cc: schwidefsky, heiko.carstens, Hendrik Brueckner, Ursula Braun
In-Reply-To: <20090916143721.863799000@linux.vnet.ibm.com>

[-- Attachment #1: 604-af_iucv-sock-wait-race.diff --]
[-- Type: text/plain, Size: 969 bytes --]

From: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>

Move prepare_to_wait() before the condition check to avoid a race between
schedule_timeout() and the wake-up.
The race can appear during iucv_sock_connect() and iucv_callback_connack().

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
---

 net/iucv/af_iucv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-uschi/net/iucv/af_iucv.c
===================================================================
--- linux-2.6-uschi.orig/net/iucv/af_iucv.c
+++ linux-2.6-uschi/net/iucv/af_iucv.c
@@ -59,8 +59,8 @@ do {									\
 	DEFINE_WAIT(__wait);						\
 	long __timeo = timeo;						\
 	ret = 0;							\
+	prepare_to_wait(sk->sk_sleep, &__wait, TASK_INTERRUPTIBLE);	\
 	while (!(condition)) {						\
-		prepare_to_wait(sk->sk_sleep, &__wait, TASK_INTERRUPTIBLE); \
 		if (!__timeo) {						\
 			ret = -EAGAIN;					\
 			break;						\


^ permalink raw reply

