netdev.vger.kernel.org archive mirror
* SOCK_MEMALLOC vs loopback
@ 2015-03-04 18:38 Ilya Dryomov
  2015-03-04 20:04 ` Mel Gorman
  0 siblings, 1 reply; 6+ messages in thread
From: Ilya Dryomov @ 2015-03-04 18:38 UTC
  To: ceph-devel, Eric Dumazet
  Cc: Sage Weil, Mike Christie, Mel Gorman, NeilBrown, netdev

Hello,

A short while ago Mike added a patch to libceph to set SOCK_MEMALLOC on
libceph sockets and PF_MEMALLOC around send/receive paths (commit
89baaa570ab0, "libceph: use memalloc flags for net IO").  rbd is much
like nbd and is susceptible to all the same memory allocation
deadlocks, so it seemed like a step in the right direction.

However, that turned out not to play nicely with loopback.  Even a
workload as simple as 'dd if=/dev/zero of=/dev/rbd0 bs=4M' now locks up
in no time if one or more ceph-osd (think nbd-server) processes are
running on the same box.  As soon as memory gets tight, __alloc_skb()
dips into PF_MEMALLOC reserves and marks the skb as pfmemalloc, and
packets start being dropped on the receiving side:

int sk_filter(struct sock *sk, struct sk_buff *skb)
{
        ...

        /*
         * If the skb was allocated from pfmemalloc reserves, only
         * allow SOCK_MEMALLOC sockets to use it as this socket is
         * helping free memory
         */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
                return -ENOMEM;

as the receiving ceph-osd socket is not a SOCK_MEMALLOC socket.
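
To make the failure mode concrete, here is a minimal userspace sketch of
the check above applied to a single loopback hop.  It is a toy model, not
kernel code: toy_skb, toy_sock, toy_sk_filter() and toy_loopback_deliver()
are hypothetical names, and the rule that an skb is reserve-backed only
when memory is tight and the sender is a SOCK_MEMALLOC socket is
a simplifying assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct toy_skb {
        bool pfmemalloc;        /* buffer came from emergency reserves */
};

struct toy_sock {
        bool memalloc;          /* models sock_flag(sk, SOCK_MEMALLOC) */
};

/*
 * Models the quoted check: a reserve-backed skb is only delivered
 * to a socket that is itself marked SOCK_MEMALLOC.
 */
static int toy_sk_filter(const struct toy_sock *sk, const struct toy_skb *skb)
{
        assert(sk != NULL && skb != NULL);

        if (skb->pfmemalloc && !sk->memalloc)
                return -ENOMEM;

        return 0;
}

/*
 * Models one loopback hop: the sender allocates an skb, the receiver
 * filters it.  Assumption (hypothetical simplification): the skb is
 * reserve-backed only if memory is tight and the sender is memalloc.
 */
static int toy_loopback_deliver(bool memory_tight,
                                const struct toy_sock *sender,
                                const struct toy_sock *receiver)
{
        struct toy_skb skb = {
                .pfmemalloc = memory_tight && sender->memalloc,
        };

        return toy_sk_filter(receiver, &skb);
}
```

With a memalloc sender (rbd) and a regular receiver (ceph-osd),
toy_loopback_deliver() returns 0 while memory is plentiful and -ENOMEM
once memory_tight is set, which is the drop described above.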

The motivation behind this check is clear, but it makes loopback rbd
just plain unusable.  While we never recommended loopback to our users
and in fact advised against it, we have had a few "it worked for us for
more than a year" kind of reports.  It's also very useful for testing.

Some googling revealed that I'm not the first one to hit this.  SUSE
guys carried (are carrying?) a patch to sk_filter() to allow pfmemalloc
skbs through to make up for GPFS's misuse of PF_MEMALLOC [1].  This was
mentioned tangentially by Eric in [2], and he suggested a possible fix
in [3]:

"When I discussed with David on this issue, I said that one possibility
would be to accept a pfmemalloc skb on regular skb if no other packet is
in a receive queue, to get a chance to make progress (and limit memory
consumption to no more than one skb per TCP socket)"
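
Read literally, that suggestion could be modelled as follows.  Again this
is only a userspace sketch under stated assumptions: relaxed_skb,
relaxed_sock, relaxed_filter() and relaxed_consume() are hypothetical
names, the receive queue is reduced to a counter, and nothing here
reflects an actual kernel patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; NOT the real kernel definitions. */
struct relaxed_skb {
        bool pfmemalloc;        /* buffer came from emergency reserves */
};

struct relaxed_sock {
        bool memalloc;          /* models sock_flag(sk, SOCK_MEMALLOC) */
        size_t rcv_queue_len;   /* hypothetical: packets awaiting the reader */
};

/*
 * Relaxed policy: a regular socket may take a pfmemalloc skb only when
 * its receive queue is empty, so it holds at most one such skb at a
 * time and can still make progress.
 */
static int relaxed_filter(struct relaxed_sock *sk,
                          const struct relaxed_skb *skb)
{
        assert(sk != NULL && skb != NULL);

        if (skb->pfmemalloc && !sk->memalloc && sk->rcv_queue_len > 0)
                return -ENOMEM;

        sk->rcv_queue_len++;    /* packet accepted and queued */
        return 0;
}

/* Models the application reading one packet off the queue. */
static void relaxed_consume(struct relaxed_sock *sk)
{
        assert(sk != NULL);

        if (sk->rcv_queue_len > 0)
                sk->rcv_queue_len--;
}
```

In this model a regular socket with an empty queue accepts one
pfmemalloc skb, rejects further ones with -ENOMEM until the queue has
drained, and always accepts ordinary skbs.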

Eric, was there any progress on this front?  We would like to work on
fixing this, but need some mm and net input.

(I also CC'ed Neil as he did the NFS loopback series recently and this
may touch on swap-on-nfs.)

[1] https://gitorious.org/opensuse/kernel-source/commit/a78bfd6
[2] http://article.gmane.org/gmane.linux.kernel/1418791
[3] http://article.gmane.org/gmane.linux.kernel.stable/46128

Thanks,

                Ilya

Thread overview: 6+ messages
2015-03-04 18:38 SOCK_MEMALLOC vs loopback Ilya Dryomov
2015-03-04 20:04 ` Mel Gorman
2015-03-05  4:03   ` Mike Christie
2015-03-05  4:13     ` Mike Christie
2015-03-05  7:09       ` Ilya Dryomov
2015-03-05  9:50     ` Mel Gorman
