* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Andy Lutomirski @ 2018-04-23 21:14 UTC
To: Eric Dumazet, David S. Miller
Cc: netdev, linux-kernel, Soheil Hassas Yeganeh, Eric Dumazet,
linux-mm, Linux API
On 04/20/2018 08:55 AM, Eric Dumazet wrote:
> This patch series provides a new mmap_hook for filesystems that want to
> grab a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>
> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
> is held), and improve multi-threading scalability.
>
I think that the right solution is to rework mmap() on TCP sockets a
bit. The current approach in net-next is very strange for a few reasons:
1. It uses mmap() as an operation that has side effects besides just
creating a mapping. If nothing else, it's surprising, since mmap()
doesn't usually do that. But it's also causing problems like what
you're seeing.
2. The performance is worse than it needs to be. mmap() is slow, and I
doubt you'll find many mm developers who consider this particular abuse
of mmap() to be a valid thing to optimize for.
3. I'm not at all convinced the accounting is sane. As far as I can
tell, you're allowing unprivileged users to increment the count on
network-owned pages, limited only by available virtual memory, without
obviously charging it to the socket buffer limits. It looks like a
program that simply forgot to call munmap() would cause the system to
run out of memory, and I see no reason to expect the OOM killer to have
any real chance of killing the right task.
4. Error handling sucks. If I try to mmap() a large range (which is the
whole point -- using a small range will kill performance) and not quite
all of it can be mapped, then I waste a bunch of time in the kernel and
get *none* of the range mapped.
I would suggest that you rework the interface a bit. First a user would
call mmap() on a TCP socket, which would create an empty VMA. (It would
set vm_ops to point to tcp_vm_ops or similar so that the TCP code could
recognize it, but it would have no effect whatsoever on the TCP state
machine. Reading the VMA would get SIGBUS.) Then a user would call a
new ioctl() or setsockopt() function and pass something like:
struct tcp_zerocopy_receive {
        void *address;
        size_t length;
};
The kernel would verify that [address, address+length) is entirely
inside a single TCP VMA and then would do the vm_insert_range magic. On
success, length is changed to the length that actually got mapped. The
kernel could do this while holding mmap_sem for *read*, and it could get
the lock ordering right. If and when mm range locks ever get merged, it
could switch to using a range lock.
Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the
part of the mapping that you're done with.
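Very roughly -- and hedging heavily, since the TCP_ZEROCOPY_RECEIVE
request name, its calling convention, and consume() below are all made
up at this point -- the user-side loop could look like:

/* Hypothetical sketch of the proposed flow; nothing here is a real
   API yet. */
struct tcp_zerocopy_receive zc;
size_t map_len = 16 * 1024 * 1024;

/* One-time setup: creates an empty, SIGBUS-on-read VMA and has no
   effect on the TCP state machine. */
void *addr = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);

for (;;) {
        zc.address = addr;
        zc.length = map_len;

        /* Map whatever page-aligned payload is queued; on return,
           zc.length says how much actually got mapped. */
        if (ioctl(fd, TCP_ZEROCOPY_RECEIVE, &zc) < 0)
                break;

        consume(zc.address, zc.length);

        /* Done with these pages: zap them so the kernel can
           reclaim the underlying network buffers. */
        madvise(zc.address, zc.length, MADV_DONTNEED);
}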
Does this seem reasonable? It should involve very little code change,
it will run faster, it will scale better, and it is much less weird IMO.
* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Eric Dumazet @ 2018-04-23 21:38 UTC
To: Andy Lutomirski, Eric Dumazet, David S. Miller
Cc: netdev, linux-kernel, Soheil Hassas Yeganeh, Eric Dumazet,
linux-mm, Linux API
Hi Andy
On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
> On 04/20/2018 08:55 AM, Eric Dumazet wrote:
>> This patch series provides a new mmap_hook for filesystems that want to
>> grab a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>>
>> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
>> is held), and improve multi-threading scalability.
>>
>
> I think that the right solution is to rework mmap() on TCP sockets a bit. The current approach in net-next is very strange for a few reasons:
>
> 1. It uses mmap() as an operation that has side effects besides just creating a mapping. If nothing else, it's surprising, since mmap() doesn't usually do that. But it's also causing problems like what you're seeing.
>
> 2. The performance is worse than it needs to be. mmap() is slow, and I doubt you'll find many mm developers who consider this particular abuse of mmap() to be a valid thing to optimize for.
>
> 3. I'm not at all convinced the accounting is sane. As far as I can tell, you're allowing unprivileged users to increment the count on network-owned pages, limited only by available virtual memory, without obviously charging it to the socket buffer limits. It looks like a program that simply forgot to call munmap() would cause the system to run out of memory, and I see no reason to expect the OOM killer to have any real chance of killing the right task.
>
> 4. Error handling sucks. If I try to mmap() a large range (which is the whole point -- using a small range will kill performance) and not quite all of it can be mapped, then I waste a bunch of time in the kernel and get *none* of the range mapped.
>
> I would suggest that you rework the interface a bit. First a user would call mmap() on a TCP socket, which would create an empty VMA. (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine. Reading the VMA would get SIGBUS.) Then a user would call a new ioctl() or setsockopt() function and pass something like:
>
> struct tcp_zerocopy_receive {
>         void *address;
>         size_t length;
> };
>
> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.
I have no idea what the proper API for that would be.
Where would the TCP VMA(s) be stored?
In the TCP socket, or in the MM layer?
And I am not sure why the error handling would be better (point 4),
unless perhaps we can return a smaller @length than requested?
Also, how would the VMA space be accounted for (point 3) when creating
an empty VMA (no pages in it yet)?
> On success, length is changed to the length that actually got mapped. The kernel could do this while holding mmap_sem for *read*, and it could get the lock ordering right. If and when mm range locks ever get merged, it could switch to using a range lock.
>
> Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the part of the mapping that you're done with.
>
> Does this seem reasonable? It should involve very little code change, it will run faster, it will scale better, and it is much less weird IMO.
Maybe, although I do not see the 'little code change' yet.
But at least this seems like a pretty nice idea, especially if it could allow us to fill the mmap()ed area later, when packets are received.
* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Andy Lutomirski @ 2018-04-24 2:04 UTC
To: Eric Dumazet
Cc: Andy Lutomirski, Eric Dumazet, David S. Miller, netdev,
linux-kernel, Soheil Hassas Yeganeh, linux-mm, Linux API
On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Hi Andy
>
> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
>> I would suggest that you rework the interface a bit. First a user would call mmap() on a TCP socket, which would create an empty VMA. (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine. Reading the VMA would get SIGBUS.) Then a user would call a new ioctl() or setsockopt() function and pass something like:
>
>
>>
>> struct tcp_zerocopy_receive {
>>         void *address;
>>         size_t length;
>> };
>>
>> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.
>
> I have no idea what the proper API for that would be.
> Where would the TCP VMA(s) be stored?
> In the TCP socket, or in the MM layer?
MM layer. I haven't tested this at all, and the error handling is
totally wrong, but I think you'd do something like:
len = get_user(...);

down_read(&current->mm->mmap_sem);

vma = find_vma(mm, start);
if (!vma || vma->vm_start > start)
        return -EFAULT;

/* This is buggy.  You also need to check that the file is a socket.
   This is probably trivial. */
if (vma->vm_file->private_data != sock)
        return -EINVAL;

if (len > vma->vm_end - start)
        return -EFAULT;  /* too big a request. */
and now you'd do the vm_insert_page() dance, except that you don't
have to abort the whole procedure if you discover that something isn't
aligned right. Instead you'd just stop and tell the caller that you
didn't map the full requested size. You might also need to add some
code to charge the caller for the pages that get pinned, but that's an
orthogonal issue.
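Sketching that dance (next_rcvq_page() is a made-up helper standing in
for the real walk over queued skbs and their frags, and @arg is the
user pointer to struct tcp_zerocopy_receive):

/* Sketch only: insert pages until we run out of queued data or hit
   a chunk that is not page-aligned, instead of failing outright. */
unsigned long mapped = 0;

while (mapped < len) {
        unsigned int offset;
        struct page *page = next_rcvq_page(sk, &offset);

        if (!page || offset)    /* no data, or misaligned: stop */
                break;
        if (vm_insert_page(vma, start + mapped, page))
                break;
        mapped += PAGE_SIZE;
}

/* Report partial success instead of an all-or-nothing error. */
put_user(mapped, &arg->length);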
You also need to provide some way for user programs to signal that
they're done with the page in question. MADV_DONTNEED might be
sufficient.
In the mmap() helper, you might want to restrict the mapped size to
something reasonable. And it might be nice to hook mremap() to
prevent user code from causing too much trouble.
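Something like this, assuming the existing vm_operations_struct mremap
hook is usable here (tcp_vm_mremap and tcp_vm_ops are made-up names):

/* Refuse mremap() on these VMAs so user code cannot grow or move
   them behind TCP's back. */
static int tcp_vm_mremap(struct vm_area_struct *vma)
{
        return -EPERM;
}

static const struct vm_operations_struct tcp_vm_ops = {
        /* no .fault handler: touching unmapped parts gets SIGBUS */
        .mremap = tcp_vm_mremap,
};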
With my x86-writer-of-TLB-code hat on, I expect the major performance
costs to be the generic costs of mmap() and munmap() (which only
happen once per socket instead of once per read if you like my idea),
the cost of a TLB miss when the data gets read (really not so bad on
modern hardware), and the cost of the TLB invalidation when user code
is done with the buffers. The latter is awful, especially in
multithreaded programs. In fact, it's so bad that it might be worth
mentioning in the documentation for this code that it just shouldn't
be used in multithreaded processes. (Also, on non-PCID hardware,
there's an annoying situation in which a recently-migrated thread that
removes a mapping sends an IPI to the CPU that the thread used to be
on. I thought I had a clever idea to get rid of that IPI once, but it
turned out to be wrong.)
Architectures like ARM that have superior TLB handling primitives will
not be hurt as badly if this is used by a multithreaded program.
>
>
> And I am not sure why the error handling would be better (point 4),
> unless perhaps we can return a smaller @length than requested?
Exactly. If I request 10 MB mapped and only the first 9 MB are aligned
right, I still want the first 9 MB.
>
> Also, how would the VMA space be accounted for (point 3) when creating
> an empty VMA (no pages in it yet)?
There's nothing to account. It's the same as mapping /dev/null or
similar -- the mm core should take care of it for you.
* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Eric Dumazet @ 2018-04-24 4:30 UTC
To: Andy Lutomirski, Eric Dumazet
Cc: Eric Dumazet, David S. Miller, netdev, linux-kernel,
Soheil Hassas Yeganeh, linux-mm, Linux API
On 04/23/2018 07:04 PM, Andy Lutomirski wrote:
> On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> [...]
>
> There's nothing to account. It's the same as mapping /dev/null or
> similar -- the mm core should take care of it for you.
>
Thanks Andy, I am working on all this, and the initial patch looks sane enough.
 include/uapi/linux/tcp.h |   7 +
 net/ipv4/tcp.c           | 175 +++++++++++++++++++++++------------------------
 2 files changed, 93 insertions(+), 89 deletions(-)
I will test all this before sending it for review ASAP.
(I have not done the compat code yet; this can be done later, I guess.)