From: Eric Dumazet <eric.dumazet@gmail.com>
To: Andy Lutomirski <luto@kernel.org>, Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>,
"David S . Miller" <davem@davemloft.net>,
netdev <netdev@vger.kernel.org>,
linux-kernel <linux-kernel@vger.kernel.org>,
Soheil Hassas Yeganeh <soheil@google.com>,
linux-mm <linux-mm@kvack.org>,
Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
Date: Mon, 23 Apr 2018 21:30:34 -0700 [thread overview]
Message-ID: <e494c904-bdd0-8c6c-0592-c4960dcf2865@gmail.com> (raw)
In-Reply-To: <CALCETrWOLU+P_jVpuOUQT2e_5ZShAP3OM0yJZMbC=pv5La9Cvg@mail.gmail.com>
On 04/23/2018 07:04 PM, Andy Lutomirski wrote:
> On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Hi Andy
>>
>> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
>
>>> I would suggest that you rework the interface a bit. First a user would call mmap() on a TCP socket, which would create an empty VMA. (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine. Reading the VMA would get SIGBUS.) Then a user would call a new ioctl() or setsockopt() function and pass something like:
>>
>>
>>>
>>> struct tcp_zerocopy_receive {
>>> void *address;
>>> size_t length;
>>> };
>>>
>>> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.
>>
>> I have no idea what is the proper API for that.
>> Where the TCP VMA(s) would be stored ?
>> In TCP socket, or MM layer ?
>
> MM layer. I haven't tested this at all, and the error handling is
> totally wrong, but I think you'd do something like:
>
> len = get_user(...);
>
> down_read(¤t->mm->mmap_sem);
>
> vma = find_vma(mm, start);
> if (!vma || vma->vm_start > start)
> return -EFAULT;
>
> /* This is buggy. You also need to check that the file is a socket.
> This is probably trivial. */
> if (vma->vm_file->private_data != sock)
> return -EINVAL;
>
> if (len > vma->vm_end - start)
> return -EFAULT; /* too big a request. */
>
> and now you'd do the vm_insert_page() dance, except that you don't
> have to abort the whole procedure if you discover that something isn't
> aligned right. Instead you'd just stop and tell the caller that you
> didn't map the full requested size. You might also need to add some
> code to charge the caller for the pages that get pinned, but that's an
> orthogonal issue.
>
> You also need to provide some way for user programs to signal that
> they're done with the page in question. MADV_DONTNEED might be
> sufficient.
>
> In the mmap() helper, you might want to restrict the mapped size to
> something reasonable. And it might be nice to hook mremap() to
> prevent user code from causing too much trouble.
>
> With my x86-writer-of-TLB-code hat on, I expect the major performance
> costs to be the generic costs of mmap() and munmap() (which only
> happen once per socket instead of once per read if you like my idea),
> the cost of a TLB miss when the data gets read (really not so bad on
> modern hardware), and the cost of the TLB invalidation when user code
> is done with the buffers. The latter is awful, especially in
> multithreaded programs. In fact, it's so bad that it might be worth
> mentioning in the documentation for this code that it just shouldn't
> be used in multithreaded processes. (Also, on non-PCID hardware,
> there's an annoying situation in which a recently-migrated thread that
> removes a mapping sends an IPI to the CPU that the thread used to be
> on. I thought I had a clever idea to get rid of that IPI once, but it
> turned out to be wrong.)
>
> Architectures like ARM that have superior TLB handling primitives will
> not be hurt as badly if this is used my a multithreaded program.
>
>>
>>
>> And I am not sure why the error handling would be better (point 4), unless we can return smaller @length than requested maybe ?
>
> Exactly. If I request 10MB mapped and only the first 9MB are aligned
> right, I still want the first 9 MB.
>
>>
>> Also how the VMA space would be accounted (point 3) when creating an empty VMA (no pages in there yet)
>
> There's nothing to account. It's the same as mapping /dev/null or
> similar -- the mm core should take care of it for you.
>
Thanks Andy, I am working on all this, and initial patch looks sane enough.
include/uapi/linux/tcp.h | 7 +
net/ipv4/tcp.c | 175 +++++++++++++++++++++++------------------------
2 files changed, 93 insertions(+), 89 deletions(-)
I will test all this before sending for review asap.
( I have not done the compat code yet, this can be done later I guess)
prev parent reply other threads:[~2018-04-24 4:30 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20180420155542.122183-1-edumazet@google.com>
2018-04-21 9:07 ` [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue Christoph Hellwig
2018-04-21 16:55 ` Eric Dumazet
2018-04-23 21:14 ` Andy Lutomirski
2018-04-23 21:38 ` Eric Dumazet
2018-04-24 2:04 ` Andy Lutomirski
2018-04-24 4:30 ` Eric Dumazet [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e494c904-bdd0-8c6c-0592-c4960dcf2865@gmail.com \
--to=eric.dumazet@gmail.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=soheil@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).