* nbd, nbdkit, loopback mounts and memory management
From: Pavel Machek @ 2019-02-15 19:19 UTC
To: rjones, kernel list, Andrew Morton

Hi!

I watched fosdem talk about
nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
of warning: I'm not sure using it read-write on localhost is safe.

In particular, user application could create a lot of dirty data
quickly. If there's not enough memory for nbdkit (or nbd-client or
nbd-server), you might get a deadlock.

Also note that nbd.txt in Documentation/blockdev/ points to
sourceforge; it should probably point to
https://github.com/NetworkBlockDevice/nbd ?

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-02-15 22:41 UTC
To: Pavel Machek; +Cc: kernel list, Andrew Morton, Wouter Verhelst, nbd

On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> Hi!
>
> I watched fosdem talk about
> nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> of warning: I'm not sure using it read-write on localhost is safe.
>
> In particular, user application could create a lot of dirty data
> quickly. If there's not enough memory for nbdkit (or nbd-client or
> nbd-server), you might get a deadlock.

Thanks for the kind words about the talk.  I've added Wouter Verhelst
& the NBD mailing list to CC.  Although I did the talk because the
subject is interesting, how I actually use nbdkit / NBD is to talk to
qemu and that's where I have most experience and where we (Red Hat)
use it in production systems.

However in January I spent a lot of time exercising the NBD loop-mount
+ nbdkit case using fio in order to find contention / bottlenecks in
our use of threads and locks.  I didn't notice any particular problems
then, but it's possible my testing wasn't thorough enough.  Or that
fio only creates small numbers of dirty pages (because of locality in
its access patterns I guess?)

When you say it's not safe, what could happen?  What would we observe
if it was going wrong?

> Also note that nbd.txt in Documentation/blockdev/ points to
> sourceforge; it should probably point to
> https://github.com/NetworkBlockDevice/nbd ?

Wouter should be able to say what the correct link should be.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-02-15 22:53 UTC
To: Pavel Machek; +Cc: kernel list, Andrew Morton, Wouter Verhelst, nbd

On Fri, Feb 15, 2019 at 10:41:26PM +0000, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> >
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
>
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
>
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
>
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

Reading more carefully I see you said we'd observe a deadlock.  I
didn't see that, but again my testing of this wouldn't have been very
thorough.  When I have some time I'll try creating / spooling huge
files into an NBD loop mount to see if I can cause a deadlock.

Thanks,

Rich.

> > Also note that nbd.txt in Documentation/blockdev/ points to
> > sourceforge; it should probably point to
> > https://github.com/NetworkBlockDevice/nbd ?
>
> Wouter should be able to say what the correct link should be.
>
> Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
* Re: nbd, nbdkit, loopback mounts and memory management
From: Wouter Verhelst @ 2019-02-16 8:16 UTC
To: Richard W.M. Jones; +Cc: Pavel Machek, kernel list, Andrew Morton, nbd

Hi,

On Fri, Feb 15, 2019 at 10:53:32PM +0000, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 10:41:26PM +0000, Richard W.M. Jones wrote:
> > On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > > Hi!
> > >
> > > I watched fosdem talk about
> > > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > > of warning: I'm not sure using it read-write on localhost is safe.
> > >
> > > In particular, user application could create a lot of dirty data
> > > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > > nbd-server), you might get a deadlock.
> >
> > Thanks for the kind words about the talk.  I've added Wouter Verhelst
> > & the NBD mailing list to CC.  Although I did the talk because the
> > subject is interesting, how I actually use nbdkit / NBD is to talk to
> > qemu and that's where I have most experience and where we (Red Hat)
> > use it in production systems.
> >
> > However in January I spent a lot of time exercising the NBD loop-mount
> > + nbdkit case using fio in order to find contention / bottlenecks in
> > our use of threads and locks.  I didn't notice any particular problems
> > then, but it's possible my testing wasn't thorough enough.  Or that
> > fio only creates small numbers of dirty pages (because of locality in
> > its access patterns I guess?)
> >
> > When you say it's not safe, what could happen?  What would we observe
> > if it was going wrong?
>
> Reading more carefully I see you said we'd observe a deadlock.  I
> didn't see that, but again my testing of this wouldn't have been very
> thorough.  When I have some time I'll try creating / spooling huge
> files into an NBD loop mount to see if I can cause a deadlock.

While it's of course impossible to fully exclude the possibility of
deadlock when clearing dirty pages to the network, since Mel Gorman's
work that resulted in commit 7f338fe4540b1d0600b02314c7d885fd358e9eca
this should be extremely unlikely, and swapping over the network (NBD
or NFS or whatnot) should be reasonably safe, as well as clearing
dirty pages etc.

Additionally, nbd-client when called with -s calls mlockall() at an
appropriate moment, so that it should not be swapped out.

That only leaves the server side. Personally I haven't been able to
deadlock a reasonably recent machine using NBD, but of course YMMV.

> > > Also note that nbd.txt in Documentation/blockdev/ points to
> > > sourceforge; it should probably point to
> > > https://github.com/NetworkBlockDevice/nbd ?
> >
> > Wouter should be able to say what the correct link should be.

The sourceforge project is still active, and is where I do the
official file releases. I also push the git repository there. For
people who just want a released version of the NBD utilities, pointing
to sourceforge isn't wrong, I would say. GitHub is indeed used mostly
for development, though.

It might be nice to rethink all that, now that we don't have a
mailinglist running at sourceforge anymore, but I don't think it's
very urgent.

-- 
To the thief who stole my anti-depressants: I hope you're happy

  -- seen somewhere on the Internet on a photo of a billboard
* Re: nbd, nbdkit, loopback mounts and memory management
From: Pavel Machek @ 2019-02-15 22:55 UTC
To: Richard W.M. Jones; +Cc: kernel list, Andrew Morton, Wouter Verhelst, nbd

On Fri 2019-02-15 22:41:26, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> >
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
>
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
>
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
>
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

I'm not saying I've seen it happen, or have a test. But my
understanding of memory management says it could deadlock... if nbd
tried allocating memory while memory was "full" of dirty data.

Dunno, something like ... take 1GB block device with 1GB RAM machine.
Create memory pressure so that nbdkit (etc) is dropped from memory.
Then quickly make all the data on the block device dirty.

I believe that scenario is something that can not happen on a system
without NBD in loopback configuration.

Situation may be made worse if nbdkit needs to allocate memory, e.g.
for compression buffers or something like that.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-11-17 16:58 UTC
To: Pavel Machek; +Cc: kernel list, Andrew Morton, Wouter Verhelst, nbd

FWIW to follow up to this old thread, I made the change in nbdkit:

https://github.com/libguestfs/nbdkit/commit/acc37af9989aae708e8acad3535e1491931e6bdb

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-02-17 8:44 UTC
To: Pavel Machek; +Cc: kernel list, Andrew Morton

So not to dispute that this could be a bug, but I couldn't cause a
deadlock.  I wonder if you can see something wrong with my method?

*** Set up ***

 - kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
 - nbd-client 3.19-1.fc30
 - nbdkit 1.11.5 (git commit ef9d1978ce28)

Baremetal machine was booted with mem=2G to artificially limit the
RAM.  The machine has 64G of swap.

  # free -m
                total        used        free      shared  buff/cache   available
  Mem:           1806         329        1383           0          93        1357
  Swap:         65535         179       65356

*** Method ***

I started nbdkit as a 4G RAM disk:

  ./nbdkit memory size=4G

This is implemented as a sparse array with a 2 level page table, and
should allocate (using malloc) every time a new area of the disk is
written to.  Exact implementation is here:
https://github.com/libguestfs/nbdkit/tree/master/common/sparse

I started nbd-client using the -swap option which uses
mlockall(MCL_FUTURE) to lock the client into RAM.

  nbd-client -b 512 -swap localhost /dev/nbd0

I then created a filesystem on the RAM disk, mounted it, and copied a
3G file into it.  I tried this various ways, but the variation I was
eventually happy with was:

  mke2fs /dev/nbd0
  mount /dev/nbd0 /tmp/mnt

  dd if=/dev/zero of=/tmp/big bs=1M count=3000
  cp /tmp/big /tmp/mnt/big

I couldn't get any kind of deadlock or failure in this test.

(Note that if you repeat the test several times, in theory you could
delete the file and fstrim the filesystem, but when I was testing it
to be sure I unmounted everything and killed and restarted nbdkit
between each test.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
* Re: nbd, nbdkit, loopback mounts and memory management
From: Pavel Machek @ 2019-02-17 23:15 UTC
To: Richard W.M. Jones, smcdowell; +Cc: kernel list, Andrew Morton

Hi!

> So not to dispute that this could be a bug, but I couldn't cause a
> deadlock.  I wonder if you can see something wrong with my method?
>
> *** Set up ***
>
>  - kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
>  - nbd-client 3.19-1.fc30
>  - nbdkit 1.11.5 (git commit ef9d1978ce28)
>
> Baremetal machine was booted with mem=2G to artificially limit the
> RAM.  The machine has 64G of swap.
>
>   # free -m
>                 total        used        free      shared  buff/cache   available
>   Mem:           1806         329        1383           0          93        1357
>   Swap:         65535         179       65356
>
> *** Method ***
>
> I started nbdkit as a 4G RAM disk:
>
>   ./nbdkit memory size=4G
>
> This is implemented as a sparse array with a 2 level page table, and
> should allocate (using malloc) every time a new area of the disk is
> written to.  Exact implementation is here:
> https://github.com/libguestfs/nbdkit/tree/master/common/sparse
>
> I started nbd-client using the -swap option which uses
> mlockall(MCL_FUTURE) to lock the client into RAM.
>
>   nbd-client -b 512 -swap localhost /dev/nbd0
>
> I then created a filesystem on the RAM disk, mounted it, and copied a
> 3G file into it.  I tried this various ways, but the variation I was
> eventually happy with was:
>
>   mke2fs /dev/nbd0
>   mount /dev/nbd0 /tmp/mnt
>
>   dd if=/dev/zero of=/tmp/big bs=1M count=3000
>   cp /tmp/big /tmp/mnt/big
>
> I couldn't get any kind of deadlock or failure in this test.
>
> (Note that if you repeat the test several times, in theory you could
> delete the file and fstrim the filesystem, but when I was testing it
> to be sure I unmounted everything and killed and restarted nbdkit
> between each test.)

This looks like quite a good try. I'd try to use mmap() to dirty
memory very quickly.

But Shaun reported it happened somehow often for them, so he might
have a practical test case... better than my theories :-).

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
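To make the mmap() idea concrete, here is a minimal sketch of dirtying
pages as quickly as possible through a shared mapping of a file on the
NBD-backed mount. It is an illustration only, not something tested in
this thread; the /tmp/mnt/big path, the 3G size and the 4096-byte page
size are assumptions carried over from the test setup above.

  /* dirty-fast.c - hypothetical sketch: touch every page of a large
   * file on the NBD-backed filesystem so dirty pages accumulate much
   * faster than writeback can drain them. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t len = (size_t)3 << 30;        /* 3G, assumed > RAM */
      int fd = open("/tmp/mnt/big", O_RDWR | O_CREAT, 0644);
      if (fd == -1 || ftruncate(fd, len) == -1) {
          perror("open/ftruncate");
          return 1;
      }
      char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      /* Writing one byte per page dirties the whole mapping with very
       * little work per page. */
      for (size_t off = 0; off < len; off += 4096)
          p[off] = 1;
      /* Force writeback through nbd-client -> nbdkit while the system
       * is already under memory pressure. */
      msync(p, len, MS_SYNC);
      munmap(p, len);
      close(fd);
      return 0;
  }

Under the mem=2G setup above, the write loop should outrun writeback
and pile up dirty pages much faster than cp of a file does.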
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-02-17 23:51 UTC
To: Pavel Machek; +Cc: smcdowell, kernel list, Andrew Morton

On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
> But Shaun reported it happened somehow often for them, so he might
> have a practical test case... better than my theories :-).

Yes, certainly not saying this isn't a problem.

I think the good news is the fix seems quite easy, ie. to add mlockall
and adjust the OOM killer score, as is done currently in the client:

https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219

For now I have added a note in the TODO file to follow up in case we
get a test case or reports of a problem:

https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
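For readers who do not want to chase the links, the following is a
rough standalone sketch of the two-part hardening described above:
pin the server's memory with mlockall() and lower its OOM-killer
score. It is not a copy of the linked nbd-client code; the function
name and the -1000 value are choices made here for illustration.

  /* Hypothetical sketch of server-side hardening for a userland NBD
   * server used in a loopback configuration. */
  #include <stdio.h>
  #include <sys/mman.h>

  static void harden_against_memory_pressure(void)
  {
      /* Pin current and future allocations so the server itself cannot
       * be swapped out while the kernel is writing dirty pages to it. */
      if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
          perror("mlockall");

      /* Make the process an unattractive target for the OOM killer;
       * -1000 disables OOM killing of this process entirely. */
      FILE *fp = fopen("/proc/self/oom_score_adj", "w");
      if (fp) {
          fputs("-1000\n", fp);
          fclose(fp);
      } else {
          perror("open /proc/self/oom_score_adj");
      }
  }

  int main(void)
  {
      harden_against_memory_pressure();
      /* ... serve NBD requests ... */
      return 0;
  }

Both steps normally need elevated privileges: mlockall() requires
CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK, and lowering
oom_score_adj below its current value requires CAP_SYS_RESOURCE.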
[parent not found: <CAM1OiDOKJ3SGHABNooQPFfx3KMYepYmSPxwyZZjZERc_y9v1WA@mail.gmail.com>]
* Re: nbd, nbdkit, loopback mounts and memory management
From: Shaun McDowell @ 2019-03-12 16:14 UTC
To: Richard W.M. Jones; +Cc: Pavel Machek, kernel list, Andrew Morton, linux-xfs

We have a good example of deadlock when using an XFS filesystem on an
nbd loopback device, regardless of mlockall and MCL_FUTURE. Below I'll
paste the hung task traces, but I'll try to summarize the steps that
caused the problem.

The setup is an nbd loopback device with an XFS filesystem that is
then mounted on the local server (ubuntu 18.04 bionic, kernel 4.15).
Heavy load on the server puts the kernel in a state where it has
exhausted its free pages list, so new page allocation requests first
attempt to free allocated pages -- mm/vmscan.c try_to_free_pages().
XFS registers a pair of functions in the super_operations struct
(nr_cached_objects and free_cached_objects) that cause
try_to_free_pages() to perform a synchronous write of dirty inodes to
the block device -- fs/xfs/xfs_icache.c xfs_reclaim_inodes_nr().

A resource loop deadlock can occur if the nbd userland process needs
to allocate memory to make forward progress and the allocation causes
the kernel to call try_to_free_pages(), which then leads to XFS
issuing a blocking synchronous write of a dirty inode page destined
back to the nbd loopback device. A more common deadlock we see (traces
below) is when another process on the system is blocked in
try_to_free_pages() waiting for the nbd device to perform the
synchronous write, and then the nbd userland process also triggers
try_to_free_pages() and blocks on a mutex waiting for the first
process to complete its synchronous write.

In a kernel-implemented device driver, which pages may be freed during
a page allocation can be controlled using the GFP_ flags (GFP_NOIO,
GFP_NOFS). From userland (as far as I can tell) we do not have a way
to control which pages the kernel is allowed to free in
try_to_free_pages().

Below are some traces where we have hit this deadlock:

0) A process on the server needs to allocate a page and triggers the
   kernel to attempt to free pages, and XFS issues a synchronous write
   to the nbd loopback device;

1) within the nbd loopback userland process, a memory allocation
   request triggers a malloc arena pool to expand its heap and call
   mprotect, which causes the kernel to try to allocate pages, call
   try_to_free_pages(), and proceed down the XFS reclaim inodes path,
   where it blocks trying to acquire the mutex that is held by the
   process from trace 0, which is itself waiting for us to make
   progress;

2) an example of another thread within the nbd userland process
   hitting the same problem when attempting to write to a tcp socket,
   also blocking on the XFS reclaim inodes mutex.

So far the problem has been avoided by using ext4 instead of XFS, as
ext4 does not provide the super_operations struct functions
(nr_cached_objects, free_cached_objects) and does not issue writes in
the try_to_free_pages() code path.

Ideally, the nbd userland process could set a process flag to tell the
kernel to use GFP_NOIO or GFP_NOFS for its allocations to avoid this
resource deadlock.
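For comparison, this is roughly what an in-kernel driver can do that a
userland server cannot: wrap its allocations in a GFP_NOIO scope so
that reclaim never recurses into filesystem or block-layer writeback.
The snippet is a hypothetical illustration of the mechanism, not code
from nbd, nbdkit or XFS.

  /* Hypothetical kernel-side snippet: stop memory reclaim from
   * recursing into filesystem/block writeback while allocating on the
   * I/O path. */
  #include <linux/gfp.h>
  #include <linux/sched/mm.h>
  #include <linux/slab.h>

  static void *driver_alloc_on_io_path(size_t len)
  {
      /* Every allocation in this scope is treated as GFP_NOIO, so
       * try_to_free_pages() will reclaim only clean pages and will not
       * call back into the filesystem or the block layer. */
      unsigned int noio_flags = memalloc_noio_save();
      void *buf = kmalloc(len, GFP_KERNEL);
      memalloc_noio_restore(noio_flags);
      return buf;
  }

For what it's worth, later kernels (5.6 and newer) added
prctl(PR_SET_IO_FLUSHER) to give userspace storage daemons a similar
allocation context, but that did not exist at the time of this thread.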
Kernel hung tasks:

0) A process somewhere on the server has triggered a syscall that
causes the kernel to attempt to free pages and eventually call
xfs_reclaim_inodes, acquire the xfs inode reclaim mutex, and wait for
a write to the nbd loopback device:

Mar 06 04:54:14 ip-172-16-9-203 kernel: INFO: task glfs_epoll000:4535 blocked for more than 30 seconds.
Mar 06 04:54:14 ip-172-16-9-203 kernel: Not tainted 4.15.0-1032-aws #34-Ubuntu
Mar 06 04:54:14 ip-172-16-9-203 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 06 04:54:14 ip-172-16-9-203 kernel: glfs_epoll000 D 0 4535 1 0x00000000
Mar 06 04:54:14 ip-172-16-9-203 kernel: Call Trace:
Mar 06 04:54:14 ip-172-16-9-203 kernel: __schedule+0x291/0x8a0
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule+0x2c/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule_timeout+0x1cf/0x350
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? _xfs_buf_ioapply+0x396/0x4e0 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? radix_tree_gang_lookup+0xc6/0x110
Mar 06 04:54:14 ip-172-16-9-203 kernel: wait_for_completion+0xba/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? wake_up_q+0x80/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? xfs_bwrite+0x24/0x60 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_buf_submit_wait+0x81/0x210 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_bwrite+0x24/0x60 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inode+0x31d/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_ag+0x1e6/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? radix_tree_gang_lookup_tag+0xd9/0x160
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __list_lru_walk_one.isra.5+0x37/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? iput+0x220/0x220
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: super_cache_scan+0x165/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab.part.51+0x1e7/0x3e0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab+0x29/0x30
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_node+0x11e/0x300
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_try_to_free_pages+0xc9/0x330
Mar 06 04:54:14 ip-172-16-9-203 kernel: try_to_free_pages+0xee/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_slowpath+0x3fc/0xe20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? ena_com_prepare_tx+0x811/0xa70 [ena]
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_nodemask+0x263/0x280
Mar 06 04:54:14 ip-172-16-9-203 kernel: alloc_pages_vma+0x88/0x1f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: handle_pte_fault+0x3bf/0xd30
Mar 06 04:54:14 ip-172-16-9-203 kernel: __handle_mm_fault+0x478/0x5c0
Mar 06 04:54:14 ip-172-16-9-203 kernel: handle_mm_fault+0xb1/0x1f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __do_page_fault+0x250/0x4d0
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_page_fault+0x2e/0xe0
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_async_page_fault+0x51/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: async_page_fault+0x25/0x50

1) The nbd loopback process attempts to allocate memory and triggers a
malloc heap to grow, which in turn calls mprotect, allocates pages,
tries to free pages, and blocks trying to acquire the xfs reclaim
inodes mutex that is held waiting on our device:

Mar 06 04:54:14 ip-172-16-9-203 kernel: Not tainted 4.15.0-1032-aws #34-Ubuntu
Mar 06 04:54:14 ip-172-16-9-203 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 06 04:54:14 ip-172-16-9-203 kernel: cbdkit D 0 4405 4399 0x00000000
Mar 06 04:54:14 ip-172-16-9-203 kernel: Call Trace:
Mar 06 04:54:14 ip-172-16-9-203 kernel: __schedule+0x291/0x8a0
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule+0x2c/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule_preempt_disabled+0xe/0x10
Mar 06 04:54:14 ip-172-16-9-203 kernel: __mutex_lock.isra.2+0x18c/0x4d0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __mutex_lock_slowpath+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __mutex_lock_slowpath+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: mutex_lock+0x2f/0x40
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_ag+0x2ca/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? radix_tree_gang_lookup_tag+0xd9/0x160
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __list_lru_walk_one.isra.5+0x37/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? iput+0x220/0x220
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: super_cache_scan+0x165/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab.part.51+0x1e7/0x3e0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab+0x29/0x30
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_node+0x11e/0x300
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_try_to_free_pages+0xc9/0x330
Mar 06 04:54:14 ip-172-16-9-203 kernel: try_to_free_pages+0xee/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_slowpath+0x3fc/0xe20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __slab_free+0x14d/0x2c0
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __slab_free+0x14d/0x2c0
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __switch_to_asm+0x40/0x70
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_nodemask+0x263/0x280
Mar 06 04:54:14 ip-172-16-9-203 kernel: alloc_pages_vma+0x88/0x1f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: handle_pte_fault+0x3bf/0xd30
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? unix_stream_read_generic+0x703/0x900
Mar 06 04:54:14 ip-172-16-9-203 kernel: __handle_mm_fault+0x478/0x5c0
Mar 06 04:54:14 ip-172-16-9-203 kernel: handle_mm_fault+0xb1/0x1f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __get_user_pages+0x1ee/0x720
Mar 06 04:54:14 ip-172-16-9-203 kernel: populate_vma_page_range+0x71/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: mprotect_fixup+0x29d/0x380
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_mprotect_pkey+0x1e6/0x300
Mar 06 04:54:14 ip-172-16-9-203 kernel: SyS_mprotect+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_syscall_64+0x73/0x130
Mar 06 04:54:14 ip-172-16-9-203 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2

2) Another thread within the nbd loopback process attempting to write
to a socket and also being blocked by the xfs reclaim inodes mutex:

Mar 06 04:54:14 ip-172-16-9-203 kernel: INFO: task cbdkit:4413 blocked for more than 30 seconds.
Mar 06 04:54:14 ip-172-16-9-203 kernel: Not tainted 4.15.0-1032-aws #34-Ubuntu
Mar 06 04:54:14 ip-172-16-9-203 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 06 04:54:14 ip-172-16-9-203 kernel: cbdkit D 0 4413 4399 0x00000000
Mar 06 04:54:14 ip-172-16-9-203 kernel: Call Trace:
Mar 06 04:54:14 ip-172-16-9-203 kernel: __schedule+0x291/0x8a0
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule+0x2c/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule_preempt_disabled+0xe/0x10
Mar 06 04:54:14 ip-172-16-9-203 kernel: __mutex_lock.isra.2+0x18c/0x4d0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __mutex_lock_slowpath+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __mutex_lock_slowpath+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: mutex_lock+0x2f/0x40
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_ag+0x2ca/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? radix_tree_gang_lookup_tag+0xd9/0x160
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __list_lru_walk_one.isra.5+0x37/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? iput+0x220/0x220
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: super_cache_scan+0x165/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab.part.51+0x1e7/0x3e0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab+0x29/0x30
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_node+0x11e/0x300
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_try_to_free_pages+0xc9/0x330
Mar 06 04:54:14 ip-172-16-9-203 kernel: try_to_free_pages+0xee/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_slowpath+0x3fc/0xe20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __kmalloc_node_track_caller+0x227/0x2d0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_nodemask+0x263/0x280
Mar 06 04:54:14 ip-172-16-9-203 kernel: alloc_pages_current+0x6a/0xe0
Mar 06 04:54:14 ip-172-16-9-203 kernel: skb_page_frag_refill+0xcc/0xf0
Mar 06 04:54:14 ip-172-16-9-203 kernel: sk_page_frag_refill+0x1d/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: tcp_sendmsg_locked+0x24d/0xe70
Mar 06 04:54:14 ip-172-16-9-203 kernel: tcp_sendmsg+0x2c/0x50
Mar 06 04:54:14 ip-172-16-9-203 kernel: inet_sendmsg+0x2e/0xb0
Mar 06 04:54:14 ip-172-16-9-203 kernel: sock_sendmsg+0x3e/0x50
Mar 06 04:54:14 ip-172-16-9-203 kernel: ___sys_sendmsg+0x2a0/0x2f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? aa_sk_perm+0x48/0x1a0
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? wake_up_q+0x44/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? futex_wake+0x8f/0x180
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? do_futex+0x300/0x500
Mar 06 04:54:14 ip-172-16-9-203 kernel: __sys_sendmsg+0x54/0x90
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __sys_sendmsg+0x54/0x90
Mar 06 04:54:14 ip-172-16-9-203 kernel: SyS_sendmsg+0x12/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_syscall_64+0x73/0x130
Mar 06 04:54:14 ip-172-16-9-203 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2

On Sun, Feb 17, 2019 at 8:02 PM Shaun McDowell <smcdowell@cloudbd.io> wrote:
>
> Servers we have experienced deadlocks on with nbd loopback are AWS EC2
> instances that do NOT have swap (latest ubuntu trusty, xenial, bionic,
> and amzn1/2 images). It could be that with swap space available this
> is a non issue.
>
> On Sun, Feb 17, 2019 at 6:51 PM Richard W.M. Jones <rjones@redhat.com> wrote:
>>
>> On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
>> > But Shaun reported it happened somehow often for them, so he might
>> > have a practical test case... better than my theories :-).
>>
>> Yes, certainly not saying this isn't a problem.
>>
>> I think the good news is the fix seems quite easy, ie. to add mlockall
>> and adjust the OOM killer score, as is done currently in the client:
>>
>> https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
>> https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219
>>
>> For now I have added a note in the TODO file to follow up in case we
>> get a test case or reports of a problem:
>>
>> https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d
>>
>> Thanks,
>>
>> Rich.
>>
>> --
>> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
>> Read my programming and virtualization blog: http://rwmj.wordpress.com
>> libguestfs lets you edit virtual machines.  Supports shell scripting,
>> bindings from many languages.  http://libguestfs.org
Thread overview: 10+ messages
2019-02-15 19:19 nbd, nbdkit, loopback mounts and memory management Pavel Machek
2019-02-15 22:41 ` Richard W.M. Jones
2019-02-15 22:53 ` Richard W.M. Jones
2019-02-16 8:16 ` Wouter Verhelst
2019-02-15 22:55 ` Pavel Machek
2019-11-17 16:58 ` Richard W.M. Jones
2019-02-17 8:44 ` Richard W.M. Jones
2019-02-17 23:15 ` Pavel Machek
2019-02-17 23:51 ` Richard W.M. Jones
[not found] ` <CAM1OiDOKJ3SGHABNooQPFfx3KMYepYmSPxwyZZjZERc_y9v1WA@mail.gmail.com>
2019-03-12 16:14 ` Shaun McDowell