* SOCK_MEMALLOC vs loopback
From: Ilya Dryomov @ 2015-03-04 18:38 UTC
To: ceph-devel, Eric Dumazet
Cc: Sage Weil, Mike Christie, Mel Gorman, NeilBrown, netdev
Hello,
A short while ago Mike added a patch to libceph to set SOCK_MEMALLOC on
libceph sockets and PF_MEMALLOC around send/receive paths (commit
89baaa570ab0, "libceph: use memalloc flags for net IO"). rbd is much
like nbd and is susceptible to all the same memory allocation
deadlocks, so it seemed like a step in the right direction.
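For reference, the pattern that commit introduced looks roughly like
this (a simplified sketch, not the exact diff):

/* at connect time, mark the socket so it may use pfmemalloc reserves */
sk_set_memalloc(con->sock->sk);

/* and around each send/receive path, e.g. */
static int ceph_tcp_recvmsg(struct socket *sock, void *buf, size_t len)
{
        struct kvec iov = { buf, len };
        struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL };
        unsigned long pflags = current->flags;
        int r;

        current->flags |= PF_MEMALLOC;  /* allow dipping into reserves */
        r = kernel_recvmsg(sock, &msg, &iov, 1, len, msg.msg_flags);
        tsk_restore_flags(current, pflags, PF_MEMALLOC);
        return r;
}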
However, that turned out not to play nice with loopback: a workload as
simple as 'dd if=/dev/zero of=/dev/rbd0 bs=4M' would now lock up in no
time if one or more ceph-osd (think nbd-server) processes are running
on the same box. As soon as memory gets tight and __alloc_skb() dips
into the PF_MEMALLOC reserves and marks the skb as pfmemalloc, packets
start being dropped on the receiving side:
int sk_filter(struct sock *sk, struct sk_buff *skb)
{
        ...
        /*
         * If the skb was allocated from pfmemalloc reserves, only
         * allow SOCK_MEMALLOC sockets to use it as this socket is
         * helping free memory
         */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
                return -ENOMEM;
as the receiving ceph-osd socket is not a SOCK_MEMALLOC socket.
The motivation behind this is clear, but it makes loopback rbd just
plain unusable. While we never recommended loopback to our users and
in fact advised against it, we had a few "it worked for us for more
than a year" kind of reports, and it's also very useful for testing.
Some googling revealed that I'm not the first one to hit this. SUSE
guys carried (are carrying?) a patch to sk_filter() to let pfmemalloc
skbs through, to make up for GPFS's misuse of PF_MEMALLOC [1]. This was
mentioned tangentially by Eric in [2], and he suggested a possible fix
in [3]:
"When I discussed with David on this issue, I said that one possibility
would be to accept a pfmemalloc skb on regular skb if no other packet is
in a receive queue, to get a chance to make progress (and limit memory
consumption to no more than one skb per TCP socket)"
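If I'm reading that right, the sk_filter() check above would become
something like this (untested, just to illustrate the idea; locking
and memory accounting around the queue check are hand-waved):

        /*
         * Accept a pfmemalloc skb on a regular socket only if the
         * socket's receive queue is empty, so the receiver can still
         * make progress while reserve usage stays bounded to one skb
         * per socket.
         */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC) &&
            !skb_queue_empty(&sk->sk_receive_queue))
                return -ENOMEM;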
Eric, was there any progress on this front? We would like to work on
fixing this, but need some mm and net input.
(I also CC'ed Neil as he did the NFS loopback series recently and this
may touch on swap-on-nfs.)
[1] https://gitorious.org/opensuse/kernel-source/commit/a78bfd6
[2] http://article.gmane.org/gmane.linux.kernel/1418791
[3] http://article.gmane.org/gmane.linux.kernel.stable/46128
Thanks,
Ilya
* Re: SOCK_MEMALLOC vs loopback
From: Mel Gorman @ 2015-03-04 20:04 UTC
To: Ilya Dryomov
Cc: ceph-devel, Eric Dumazet, Sage Weil, Mike Christie, NeilBrown,
netdev
On Wed, Mar 04, 2015 at 09:38:48PM +0300, Ilya Dryomov wrote:
> Hello,
>
> A short while ago Mike added a patch to libceph to set SOCK_MEMALLOC on
> libceph sockets and PF_MEMALLOC around send/receive paths (commit
> 89baaa570ab0, "libceph: use memalloc flags for net IO"). rbd is much
> like nbd and is susceptible to all the same memory allocation
> deadlocks, so it seemed like a step in the right direction.
>
The contract for SOCK_MEMALLOC is that it would only be used for temporary
allocations that were necessary for the system to make forward progress. In
the case of swap-over-NFS, it would only be used for transmitting
buffers that were necessary to write data to swap when there were no
other options. If that contract is not met, then using it can deadlock the
system. It's the same for PF_MEMALLOC -- activating it outside such a
context is a recipe for deadlock due to memory exhaustion.
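For illustration, swap-over-NFS only flips the flag while a swapfile
is active on the transport, roughly this (simplified from
net/sunrpc/xprtsock.c, details elided):

static int xs_swapper(struct rpc_xprt *xprt, int enable)
{
        struct sock_xprt *transport = container_of(xprt, struct sock_xprt,
                                                   xprt);

        if (enable)
                sk_set_memalloc(transport->inet);       /* swapon */
        else if (transport->inet)
                sk_clear_memalloc(transport->inet);     /* swapoff */
        return 0;
}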
--
Mel Gorman
SUSE Labs
* Re: SOCK_MEMALLOC vs loopback
From: Mike Christie @ 2015-03-05 4:03 UTC
To: Mel Gorman, Ilya Dryomov
Cc: ceph-devel, Eric Dumazet, Sage Weil, NeilBrown, netdev
On 03/04/2015 02:04 PM, Mel Gorman wrote:
> On Wed, Mar 04, 2015 at 09:38:48PM +0300, Ilya Dryomov wrote:
>> Hello,
>>
>> A short while ago Mike added a patch to libceph to set SOCK_MEMALLOC on
>> libceph sockets and PF_MEMALLOC around send/receive paths (commit
>> 89baaa570ab0, "libceph: use memalloc flags for net IO"). rbd is much
>> like nbd and is susceptible to all the same memory allocation
>> deadlocks, so it seemed like a step in the right direction.
>>
>
> The contract for SOCK_MEMALLOC is that it would only be used for temporary
> allocations that were necessary for the system to make forward progress. In
> the case of swap-over-NFS, it would only be used for transmitting
> buffers that were necessary to write data to swap when there were no
Are upper layers like NFS/iSCSI/NBD/RBD supposed to know or track when
there are no other options (for example if a GFP_ATOMIC allocation
fails, then set the flags and retry the operation), or are they supposed
to be able to set the flags, send IO and let the network layer handle it?
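In other words, is the expectation something like this (hypothetical,
just to spell out the first option):

        struct sk_buff *skb;

        skb = alloc_skb(len, GFP_ATOMIC);
        if (unlikely(!skb))
                /* no other options left: dip into the reserves and retry */
                skb = alloc_skb(len, GFP_ATOMIC | __GFP_MEMALLOC);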
> other options. If that contract is not met, then using it can deadlock the
> system. It's the same for PF_MEMALLOC -- activating it outside such a
> context is a recipe for deadlock due to memory exhaustion.
For rbd and iscsi's SOCK_MEMALLOC/PF_MEMALLOC use, I copied what you did
for nbd in commit 7f338fe4540b1d0600b02314c7d885fd358e9eca which always
sets those flags and seems to rely on the network layer to do the right
thing. Are they all incorrect?
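For reference, as far as I can tell that commit boils down to marking
the socket unconditionally when the receive thread starts, roughly:

static int nbd_do_it(struct nbd_device *nbd)
{
        ...
        /* nbd sockets are always treated as memalloc sockets */
        sk_set_memalloc(nbd->sock->sk);
        ...
}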
* Re: SOCK_MEMALLOC vs loopback
From: Mike Christie @ 2015-03-05 4:13 UTC
To: Mel Gorman, Ilya Dryomov
Cc: ceph-devel, Eric Dumazet, Sage Weil, NeilBrown, netdev
On 03/04/2015 10:03 PM, Mike Christie wrote:
> On 03/04/2015 02:04 PM, Mel Gorman wrote:
>> On Wed, Mar 04, 2015 at 09:38:48PM +0300, Ilya Dryomov wrote:
>>> Hello,
>>>
>>> A short while ago Mike added a patch to libceph to set SOCK_MEMALLOC on
>>> libceph sockets and PF_MEMALLOC around send/receive paths (commit
>>> 89baaa570ab0, "libceph: use memalloc flags for net IO"). rbd is much
>>> like nbd and is susceptible to all the same memory allocation
>>> deadlocks, so it seemed like a step in the right direction.
>>>
>>
>> The contract for SOCK_MEMALLOC is that it would only be used for temporary
>> allocations that were necessary for the system to make forward progress. In
>> the case of swap-over-NFS, it would only be used for transmitting
>> buffers that were necessary to write data to swap when there were no
> Are upper layers like NFS/iSCSI/NBD/RBD supposed to know or track when
> there are no other options (for example if a GFP_ATOMIC allocation
> fails, then set the flags and retry the operation), or are they supposed
> to be able to set the flags, send IO and let the network layer handle it?
>
Oh yeah, maybe I misunderstood you. Were you just saying we should not
be using it for the configuration we are hitting the problem on?
* Re: SOCK_MEMALLOC vs loopback
From: Ilya Dryomov @ 2015-03-05 7:09 UTC
To: Mike Christie
Cc: Mel Gorman, ceph-devel, Eric Dumazet, Sage Weil, NeilBrown,
netdev
On Thu, Mar 5, 2015 at 7:13 AM, Mike Christie <mchristi@redhat.com> wrote:
> On 03/04/2015 10:03 PM, Mike Christie wrote:
>> On 03/04/2015 02:04 PM, Mel Gorman wrote:
>>> On Wed, Mar 04, 2015 at 09:38:48PM +0300, Ilya Dryomov wrote:
>>>> Hello,
>>>>
>>>> A short while ago Mike added a patch to libceph to set SOCK_MEMALLOC on
>>>> libceph sockets and PF_MEMALLOC around send/receive paths (commit
>>>> 89baaa570ab0, "libceph: use memalloc flags for net IO"). rbd is much
>>>> like nbd and is susceptible to all the same memory allocation
>>>> deadlocks, so it seemed like a step in the right direction.
>>>>
>>>
>>> The contract for SOCK_MEMALLOC is that it would only be used for temporary
>>> allocations that were necessary for the system to make forward progress. In
>>> the case of swap-over-NFS, it would only be used for transmitting
>>> buffers that were necessary to write data to swap when there were no
>> Are upper layers like NFS/iSCSI/NBD/RBD supposed to know or track when
>> there are no other options (for example if a GFP_ATOMIC allocation
>> fails, then set the flags and retry the operation), or are they supposed
>> to be able to set the flags, send IO and let the network layer handle it?
>>
>
> Oh yeah, maybe I misunderstood you. Were you just saying we should not
> be using it for the configuration we are hitting the problem on?
NFS seems to be a bit of a special case: its SOCK_MEMALLOC is set only
for swap sockets, and it's a filesystem. Mel's patch sets SOCK_MEMALLOC
on all nbd sockets unconditionally, but AFAICT there was a distinct
effort to make loopback nbd work (commit 48cf6061b302, "NBD: allow nbd
to be used locally"), so I suspect it's currently broken in the same way.
Thanks,
Ilya
* Re: SOCK_MEMALLOC vs loopback
From: Mel Gorman @ 2015-03-05 9:50 UTC
To: Mike Christie
Cc: Ilya Dryomov, ceph-devel, Eric Dumazet, Sage Weil, NeilBrown,
netdev
On Wed, Mar 04, 2015 at 10:03:51PM -0600, Mike Christie wrote:
> On 03/04/2015 02:04 PM, Mel Gorman wrote:
> > other options. If that contract is not met, then using it can deadlock the
> > system. It's the same for PF_MEMALLOC -- activating it outside such a
> > context is a recipe for deadlock due to memory exhaustion.
>
> For rbd and iscsi's SOCK_MEMALLOC/PF_MEMALLOC use, I copied what you did
> for nbd in commit 7f338fe4540b1d0600b02314c7d885fd358e9eca which always
> sets those flags and seems to rely on the network layer to do the right
> thing. Are they all incorrect?
NBD is a poor example and, if it comes to that, I would suggest removing
it and letting NBD easily deadlock like it used to. NBD has other failure
cases, such as the client getting paged out if -swap is not specified.
The same commit notes that NBD may still deadlock and that
min_free_kbytes may have to be increased.
--
Mel Gorman
SUSE Labs