From: Jason Baron <jbaron@akamai.com>
To: David Miller <davem@davemloft.net>
Cc: Jason Baron <jbaron@akamai.com>,
netdev@vger.kernel.org, eric.dumazet@gmail.com
Subject: Re: [PATCH] tcp: set SOCK_NOSPACE under memory presure
Date: Thu, 30 Apr 2015 10:34:17 -0400 [thread overview]
Message-ID: <55423D69.7000907@akamai.com> (raw)
In-Reply-To: <5537AF76.1020108@akamai.com>
On 04/22/2015 10:25 AM, Jason Baron wrote:
> On 04/21/2015 05:33 PM, David Miller wrote:
>> From: Jason Baron <jbaron@akamai.com>
>> Date: Mon, 20 Apr 2015 20:05:13 +0000 (GMT)
>>
>>> Under tcp memory pressure, calling epoll_wait() in edge triggered
>>> mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
>>> even when there is suffcient memory available to continue making
>>> progress. The problem is that __sk_mem_schedule() can return 0,
>>> under memory pressure without having set the SOCK_NOSPACE flag. Thus,
>>> even though all the outstanding packets have been acked, we never
>>> get the EPOLLOUT that we are expecting from epoll_wait().
>>>
>>> This issue is currently limited to epoll when used in edge trigger
>>> mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
>>> This is sufficient for poll()/select() and epoll() in level trigger
>>> mode. However, in edge trigger mode, epoll() is relying on the write
>>> path to set SOCK_NOSPACE. So I view this patch as bringing us into
>>> sync with poll()/select() and epoll() level trigger behavior.
>> Can you explain exactly how epoll in edge trigger mode is
>> depending upon SOCK_NOSPACE being set in this way? I tried
>> to read the epoll code and it just seems to call ->poll()
>> in the normal way when returning event state.
> In edge trigger mode, when we receive a wakeup event we
> call ->poll() in the normal way, *but* we do not leave the event
> as still pending. (Specifically, in the epoll() code we are not
> re-adding it (fs/eventpoll.c:ep_send_events_proc())). This is
> because we are only interested in the 'edge' or the event
> going high. In level trigger mode, we do leave the event
> pending if its 'high', such that it will re-trigger again for us on
> the next epoll_wait().
>
> EPOLL(7) is clear that in edge-trigger mode we can only do
> epoll_wait() after read/write return -EAGAIN. Thus, in the
> case of the socket write, we are relying on the fact that
> tcp_sendmsg()/network layer is going to issue a wakeup
> for us at some point in the future when we get -EAGAIN.
>
> This all works fine in the case you pointed out where we
> have exceeded the sk->sndbuf and set SOCK_NOSPACE.
> However, when we return -EAGAIN from the write path
> b/c we are over the tcp memory limits and not b/c we are
> over the sndbuf, we are never going to get another wakeup
> (since SOCK_NOSPACE is not set in this case). Level trigger
> avoids this since the subsequent epoll_wait() is going to
> re-try the ->poll() (and set SOCK_NOSPACE if it fails).
>
> Now, in the memory failure case, we are not really
> waiting for the buffer to empty, but rather for there to be
> memory more generally available. So it could be argued
> that we need to implement a wakeup here based on memory
> being available as opposed to the write queue emptying.
> That is one potential option here.
>
> I think the other one is the route I was proposing, which was
> to treat the out of memory case, in the same way as the
> sk->sndbuf queue full case, as select(), poll() and epoll() level
> trigger are currently doing. And potentially add maybe an
> -ENOSPC return if the write queue really is empty...I thought
> that approach made sense b/c even under memory pressure
> (over sk_prot_mem_limits(sk, 1), but not over
> sk_prot_mem_limits(sk, 2)) we continue to guarantee a
> minimum sndbuf size (implying we can keep making progress).
> That said, there is a case, over sk_prot_mem_limits(sk, 2),
> where we do not guarantee the minimum buffer size, but I
> think in practice that is very hard to hit (since we are reducing
> usage over sk_prot_mem_limits(sk, 1) aggressively).
>
> There is also the case where we actually are out of memory
> on the system, ie kmalloc() etc. are failing, in which case
> we could maybe return -ENOSPC, or else we would potentially
> need a larger change to wait on memory being available
> as opposed to the buffer emptying.
>
> Thanks,
>
> -Jason
Hi,
Just curious if anybody had any further reaction on this issue.
I think making the epoll edge trigger case, as least match what
we are seeing for poll()/select()/epoll() level trigger seems
reasonable here.
Thanks,
-Jason
next prev parent reply other threads:[~2015-04-30 14:34 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-20 20:05 [PATCH] tcp: set SOCK_NOSPACE under memory presure Jason Baron
2015-04-21 21:33 ` David Miller
2015-04-22 14:25 ` Jason Baron
2015-04-30 14:34 ` Jason Baron [this message]
2015-04-30 15:44 ` David Miller
2015-05-04 23:12 ` David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55423D69.7000907@akamai.com \
--to=jbaron@akamai.com \
--cc=davem@davemloft.net \
--cc=eric.dumazet@gmail.com \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).