* nbd, nbdkit, loopback mounts and memory management
From: Pavel Machek @ 2019-02-15 19:19 UTC
To: rjones, kernel list, Andrew Morton

Hi!

I watched fosdem talk about
nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
of warning: I'm not sure using it read-write on localhost is safe.

In particular, user application could create a lot of dirty data
quickly. If there's not enough memory for nbdkit (or nbd-client or
nbd-server), you might get a deadlock.

Also note that nbd.txt in Documentation/blockdev/ points to
sourceforge; it should probably point to
https://github.com/NetworkBlockDevice/nbd ?

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-02-15 22:41 UTC
To: Pavel Machek; +Cc: kernel list, Andrew Morton, Wouter Verhelst, nbd

On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> Hi!
>
> I watched fosdem talk about
> nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> of warning: I'm not sure using it read-write on localhost is safe.
>
> In particular, user application could create a lot of dirty data
> quickly. If there's not enough memory for nbdkit (or nbd-client or
> nbd-server), you might get a deadlock.

Thanks for the kind words about the talk.  I've added Wouter Verhelst
& the NBD mailing list to CC.  Although I did the talk because the
subject is interesting, how I actually use nbdkit / NBD is to talk to
qemu and that's where I have most experience and where we (Red Hat)
use it in production systems.

However in January I spent a lot of time exercising the NBD loop-mount
+ nbdkit case using fio in order to find contention / bottlenecks in
our use of threads and locks.  I didn't notice any particular problems
then, but it's possible my testing wasn't thorough enough.  Or that
fio only creates small numbers of dirty pages (because of locality in
its access patterns I guess?)

When you say it's not safe, what could happen?  What would we observe
if it was going wrong?

> Also note that nbd.txt in Documentation/blockdev/ points to
> sourceforge; it should probably point to
> https://github.com/NetworkBlockDevice/nbd ?

Wouter should be able to say what the correct link should be.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-02-15 22:53 UTC
To: Pavel Machek; +Cc: kernel list, Andrew Morton, Wouter Verhelst, nbd

On Fri, Feb 15, 2019 at 10:41:26PM +0000, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> >
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
>
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
>
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
>
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

Reading more carefully I see you said we'd observe a deadlock.  I
didn't see that, but again my testing of this wouldn't have been very
thorough.  When I have some time I'll try creating / spooling huge
files into an NBD loop mount to see if I can cause a deadlock.

Thanks,

Rich.

> > Also note that nbd.txt in Documentation/blockdev/ points to
> > sourceforge; it should probably point to
> > https://github.com/NetworkBlockDevice/nbd ?
>
> Wouter should be able to say what the correct link should be.
>
> Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
* Re: nbd, nbdkit, loopback mounts and memory management
From: Wouter Verhelst @ 2019-02-16 8:16 UTC
To: Richard W.M. Jones; +Cc: Pavel Machek, kernel list, Andrew Morton, nbd

Hi,

On Fri, Feb 15, 2019 at 10:53:32PM +0000, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 10:41:26PM +0000, Richard W.M. Jones wrote:
> > On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > > Hi!
> > >
> > > I watched fosdem talk about
> > > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > > of warning: I'm not sure using it read-write on localhost is safe.
> > >
> > > In particular, user application could create a lot of dirty data
> > > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > > nbd-server), you might get a deadlock.
> >
> > Thanks for the kind words about the talk.  I've added Wouter Verhelst
> > & the NBD mailing list to CC.  Although I did the talk because the
> > subject is interesting, how I actually use nbdkit / NBD is to talk to
> > qemu and that's where I have most experience and where we (Red Hat)
> > use it in production systems.
> >
> > However in January I spent a lot of time exercising the NBD loop-mount
> > + nbdkit case using fio in order to find contention / bottlenecks in
> > our use of threads and locks.  I didn't notice any particular problems
> > then, but it's possible my testing wasn't thorough enough.  Or that
> > fio only creates small numbers of dirty pages (because of locality in
> > its access patterns I guess?)
> >
> > When you say it's not safe, what could happen?  What would we observe
> > if it was going wrong?
>
> Reading more carefully I see you said we'd observe a deadlock.  I
> didn't see that, but again my testing of this wouldn't have been very
> thorough.  When I have some time I'll try creating / spooling huge
> files into an NBD loop mount to see if I can cause a deadlock.

While it's of course impossible to fully exclude the possibility of
deadlock when clearing dirty pages to the network, since Mel Gorman's
work that resulted in commit 7f338fe4540b1d0600b02314c7d885fd358e9eca
this should be extremely unlikely, and swapping over the network (NBD
or NFS or whatnot) should be reasonably safe, as well as clearing
dirty pages etc.

Additionally, nbd-client when called with -s calls mlockall() at an
appropriate moment, so that it should not be swapped out.

That only leaves the server side. Personally I haven't been able to
deadlock a reasonably recent machine using NBD, but of course YMMV.

> > > Also note that nbd.txt in Documentation/blockdev/ points to
> > > sourceforge; it should probably point to
> > > https://github.com/NetworkBlockDevice/nbd ?
> >
> > Wouter should be able to say what the correct link should be.

The sourceforge project is still active, and is where I do the
official file releases. I also push the git repository there. For
people who just want a released version of the NBD utilities, pointing
to sourceforge isn't wrong, I would say. GitHub is indeed used mostly
for development, though.

It might be nice to rethink all that, now that we don't have a
mailinglist running at sourceforge anymore, but I don't think it's
very urgent.

-- 
To the thief who stole my anti-depressants: I hope you're happy

  -- seen somewhere on the Internet on a photo of a billboard
* Re: nbd, nbdkit, loopback mounts and memory management
From: Pavel Machek @ 2019-02-15 22:55 UTC
To: Richard W.M. Jones; +Cc: kernel list, Andrew Morton, Wouter Verhelst, nbd

On Fri 2019-02-15 22:41:26, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> >
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
>
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
>
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
>
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

I'm not saying I've seen it happen, or have a test. But my
understanding of memory management says it could deadlock... if nbd
tried allocating memory while memory was "full" of dirty data.

Dunno, something like ... take 1GB block device with 1GB RAM machine.
Create memory pressure so that nbdkit (etc) is dropped from memory.
Then quickly make all the data on the block device dirty.

I believe that scenario is something that can not happen on a system
without NBD in loopback configuration.

Situation may be made worse if nbdkit needs to allocate memory, e.g.
for compression buffers or something like that.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-11-17 16:58 UTC
To: Pavel Machek; +Cc: kernel list, Andrew Morton, Wouter Verhelst, nbd

FWIW to follow up to this old thread, I made the change in nbdkit:

https://github.com/libguestfs/nbdkit/commit/acc37af9989aae708e8acad3535e1491931e6bdb

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-02-17 8:44 UTC
To: Pavel Machek; +Cc: kernel list, Andrew Morton

So not to dispute that this could be a bug, but I couldn't cause a
deadlock.  I wonder if you can see something wrong with my method?

*** Set up ***

 - kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
 - nbd-client 3.19-1.fc30
 - nbdkit 1.11.5 (git commit ef9d1978ce28)

Baremetal machine was booted with mem=2G to artificially limit the
RAM.  The machine has 64G of swap.

  # free -m
                total        used        free      shared  buff/cache   available
  Mem:           1806         329        1383           0          93        1357
  Swap:         65535         179       65356

*** Method ***

I started nbdkit as a 4G RAM disk:

  ./nbdkit memory size=4G

This is implemented as a sparse array with a 2 level page table, and
should allocate (using malloc) every time a new area of the disk is
written to.  Exact implementation is here:
https://github.com/libguestfs/nbdkit/tree/master/common/sparse

I started nbd-client using the -swap option which uses
mlockall(MCL_FUTURE) to lock the client into RAM.

  nbd-client -b 512 -swap localhost /dev/nbd0

I then created a filesystem on the RAM disk, mounted it, and copied a
3G file into it.  I tried this various ways, but the variation I was
eventually happy with was:

  mke2fs /dev/nbd0
  mount /dev/nbd0 /tmp/mnt

  dd if=/dev/zero of=/tmp/big bs=1M count=3000
  cp /tmp/big /tmp/mnt/big

I couldn't get any kind of deadlock or failure in this test.

(Note that if you repeat the test several times, in theory you could
delete the file and fstrim the filesystem, but when I was testing it
to be sure I unmounted everything and killed and restarted nbdkit
between each test.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
* Re: nbd, nbdkit, loopback mounts and memory management
From: Pavel Machek @ 2019-02-17 23:15 UTC
To: Richard W.M. Jones, smcdowell; +Cc: kernel list, Andrew Morton

Hi!

> So not to dispute that this could be a bug, but I couldn't cause a
> deadlock.  I wonder if you can see something wrong with my method?
>
> *** Set up ***
>
>  - kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
>  - nbd-client 3.19-1.fc30
>  - nbdkit 1.11.5 (git commit ef9d1978ce28)
>
> Baremetal machine was booted with mem=2G to artificially limit the
> RAM.  The machine has 64G of swap.
>
>   # free -m
>                 total        used        free      shared  buff/cache   available
>   Mem:           1806         329        1383           0          93        1357
>   Swap:         65535         179       65356
>
> *** Method ***
>
> I started nbdkit as a 4G RAM disk:
>
>   ./nbdkit memory size=4G
>
> This is implemented as a sparse array with a 2 level page table, and
> should allocate (using malloc) every time a new area of the disk is
> written to.  Exact implementation is here:
> https://github.com/libguestfs/nbdkit/tree/master/common/sparse
>
> I started nbd-client using the -swap option which uses
> mlockall(MCL_FUTURE) to lock the client into RAM.
>
>   nbd-client -b 512 -swap localhost /dev/nbd0
>
> I then created a filesystem on the RAM disk, mounted it, and copied a
> 3G file into it.  I tried this various ways, but the variation I was
> eventually happy with was:
>
>   mke2fs /dev/nbd0
>   mount /dev/nbd0 /tmp/mnt
>
>   dd if=/dev/zero of=/tmp/big bs=1M count=3000
>   cp /tmp/big /tmp/mnt/big
>
> I couldn't get any kind of deadlock or failure in this test.
>
> (Note that if you repeat the test several times, in theory you could
> delete the file and fstrim the filesystem, but when I was testing it
> to be sure I unmounted everything and killed and restarted nbdkit
> between each test.)

This looks like quite a good try. I'd try to use mmap() to dirty
memory very quickly.

But Shaun reported it happened somehow often for them, so he might
have a practical test case... better than my theories :-).

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
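To make the mmap() idea concrete, here is a minimal sketch of dirtying
pages as quickly as possible through a shared mapping of a file on the
NBD-backed mount. It is an illustration only, not something tested in
this thread; the /tmp/mnt/big path, the 3G size and the 4096-byte page
size are assumptions carried over from the test setup above.

  /* dirty-fast.c - hypothetical sketch: touch every page of a large
   * file on the NBD-backed filesystem so dirty pages accumulate much
   * faster than writeback can drain them. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t len = (size_t)3 << 30;        /* 3G, assumed > RAM */
      int fd = open("/tmp/mnt/big", O_RDWR | O_CREAT, 0644);
      if (fd == -1 || ftruncate(fd, len) == -1) {
          perror("open/ftruncate");
          return 1;
      }
      char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      /* Writing one byte per page dirties the whole mapping with very
       * little work per page. */
      for (size_t off = 0; off < len; off += 4096)
          p[off] = 1;
      /* Force writeback through nbd-client -> nbdkit while the system
       * is already under memory pressure. */
      msync(p, len, MS_SYNC);
      munmap(p, len);
      close(fd);
      return 0;
  }

Under the mem=2G setup above, the write loop should outrun writeback
and pile up dirty pages much faster than cp of a file does.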
* Re: nbd, nbdkit, loopback mounts and memory management
From: Richard W.M. Jones @ 2019-02-17 23:51 UTC
To: Pavel Machek; +Cc: smcdowell, kernel list, Andrew Morton

On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
> But Shaun reported it happened somehow often for them, so he might
> have a practical test case... better than my theories :-).

Yes, certainly not saying this isn't a problem.

I think the good news is the fix seems quite easy, ie. to add mlockall
and adjust the OOM killer score, as is done currently in the client:

https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219

For now I have added a note in the TODO file to follow up in case we
get a test case or reports of a problem:

https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
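For readers who do not want to chase the links, the following is a
rough standalone sketch of the two-part hardening described above:
pin the server's memory with mlockall() and lower its OOM-killer
score. It is not a copy of the linked nbd-client code; the function
name and the -1000 value are choices made here for illustration.

  /* Hypothetical sketch of server-side hardening for a userland NBD
   * server used in a loopback configuration. */
  #include <stdio.h>
  #include <sys/mman.h>

  static void harden_against_memory_pressure(void)
  {
      /* Pin current and future allocations so the server itself cannot
       * be swapped out while the kernel is writing dirty pages to it. */
      if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
          perror("mlockall");

      /* Make the process an unattractive target for the OOM killer;
       * -1000 disables OOM killing of this process entirely. */
      FILE *fp = fopen("/proc/self/oom_score_adj", "w");
      if (fp) {
          fputs("-1000\n", fp);
          fclose(fp);
      } else {
          perror("open /proc/self/oom_score_adj");
      }
  }

  int main(void)
  {
      harden_against_memory_pressure();
      /* ... serve NBD requests ... */
      return 0;
  }

Both steps normally need elevated privileges: mlockall() requires
CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK, and lowering
oom_score_adj below its current value requires CAP_SYS_RESOURCE.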
[parent not found: <CAM1OiDOKJ3SGHABNooQPFfx3KMYepYmSPxwyZZjZERc_y9v1WA@mail.gmail.com>]
* Re: nbd, nbdkit, loopback mounts and memory management
From: Shaun McDowell @ 2019-03-12 16:14 UTC
To: Richard W.M. Jones; +Cc: Pavel Machek, kernel list, Andrew Morton, linux-xfs

We have a good example of deadlock when using an XFS filesystem on an
nbd loopback device, regardless of mlockall and MCL_FUTURE. Below I'll
paste the hung task traces, but I'll try to summarize the steps that
caused the problem.

The setup is an nbd loopback device with an XFS filesystem that is
then mounted on the local server (ubuntu 18.04 bionic, kernel 4.15).
Heavy load on the server puts the kernel in a state where it has
exhausted its free pages list, so new page allocation requests first
attempt to free allocated pages -- mm/vmscan.c try_to_free_pages().
XFS registers a pair of functions in the super_operations struct
(nr_cached_objects and free_cached_objects) that cause
try_to_free_pages() to perform a synchronous write of dirty inodes to
the block device -- fs/xfs/xfs_icache.c xfs_reclaim_inodes_nr().

A resource loop deadlock can occur if the nbd userland process needs
to allocate memory to make forward progress and the allocation causes
the kernel to call try_to_free_pages(), which then leads to XFS
issuing a blocking synchronous write of a dirty inode page destined
back to the nbd loopback device. A more common deadlock we see (traces
below) is when another process on the system is blocked in
try_to_free_pages() waiting for the nbd device to perform the
synchronous write, and then the nbd userland process also triggers
try_to_free_pages() and blocks on a mutex waiting for the first
process to complete its synchronous write.

In a kernel-implemented device driver, which pages may be freed during
a page allocation can be controlled using the GFP_ flags (GFP_NOIO,
GFP_NOFS). From userland (as far as I can tell) we do not have a way
to control which pages the kernel is allowed to free in
try_to_free_pages().

Below are some traces where we have hit this deadlock:

0) A process on the server needs to allocate a page and triggers the
   kernel to attempt to free pages, and XFS issues a synchronous write
   to the nbd loopback device;

1) within the nbd loopback userland process, a memory allocation
   request triggers a malloc arena pool to expand its heap and call
   mprotect, which causes the kernel to try to allocate pages, call
   try_to_free_pages(), and proceed down the XFS reclaim inodes path,
   where it blocks trying to acquire the mutex that is held by the
   process from trace 0, which is itself waiting for us to make
   progress;

2) an example of another thread within the nbd userland process
   hitting the same problem when attempting to write to a tcp socket,
   also blocking on the XFS reclaim inodes mutex.

So far the problem has been avoided by using ext4 instead of XFS, as
ext4 does not provide the super_operations struct functions
(nr_cached_objects, free_cached_objects) and does not issue writes in
the try_to_free_pages() code path.

Ideally, the nbd userland process could set a process flag to tell the
kernel to use GFP_NOIO or GFP_NOFS for its allocations to avoid this
resource deadlock.
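For comparison, this is roughly what an in-kernel driver can do that a
userland server cannot: wrap its allocations in a GFP_NOIO scope so
that reclaim never recurses into filesystem or block-layer writeback.
The snippet is a hypothetical illustration of the mechanism, not code
from nbd, nbdkit or XFS.

  /* Hypothetical kernel-side snippet: stop memory reclaim from
   * recursing into filesystem/block writeback while allocating on the
   * I/O path. */
  #include <linux/gfp.h>
  #include <linux/sched/mm.h>
  #include <linux/slab.h>

  static void *driver_alloc_on_io_path(size_t len)
  {
      /* Every allocation in this scope is treated as GFP_NOIO, so
       * try_to_free_pages() will reclaim only clean pages and will not
       * call back into the filesystem or the block layer. */
      unsigned int noio_flags = memalloc_noio_save();
      void *buf = kmalloc(len, GFP_KERNEL);
      memalloc_noio_restore(noio_flags);
      return buf;
  }

For what it's worth, later kernels (5.6 and newer) added
prctl(PR_SET_IO_FLUSHER) to give userspace storage daemons a similar
allocation context, but that did not exist at the time of this thread.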
Kernel hung tasks:

0) A process somewhere on the server has triggered a syscall that
causes the kernel to attempt to free pages and eventually call
xfs_reclaim_inodes, acquire the xfs inode reclaim mutex, and wait for
a write to the nbd loopback device:

Mar 06 04:54:14 ip-172-16-9-203 kernel: INFO: task glfs_epoll000:4535 blocked for more than 30 seconds.
Mar 06 04:54:14 ip-172-16-9-203 kernel: Not tainted 4.15.0-1032-aws #34-Ubuntu
Mar 06 04:54:14 ip-172-16-9-203 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 06 04:54:14 ip-172-16-9-203 kernel: glfs_epoll000 D 0 4535 1 0x00000000
Mar 06 04:54:14 ip-172-16-9-203 kernel: Call Trace:
Mar 06 04:54:14 ip-172-16-9-203 kernel: __schedule+0x291/0x8a0
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule+0x2c/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule_timeout+0x1cf/0x350
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? _xfs_buf_ioapply+0x396/0x4e0 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? radix_tree_gang_lookup+0xc6/0x110
Mar 06 04:54:14 ip-172-16-9-203 kernel: wait_for_completion+0xba/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? wake_up_q+0x80/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? xfs_bwrite+0x24/0x60 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_buf_submit_wait+0x81/0x210 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_bwrite+0x24/0x60 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inode+0x31d/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_ag+0x1e6/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? radix_tree_gang_lookup_tag+0xd9/0x160
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __list_lru_walk_one.isra.5+0x37/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? iput+0x220/0x220
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: super_cache_scan+0x165/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab.part.51+0x1e7/0x3e0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab+0x29/0x30
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_node+0x11e/0x300
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_try_to_free_pages+0xc9/0x330
Mar 06 04:54:14 ip-172-16-9-203 kernel: try_to_free_pages+0xee/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_slowpath+0x3fc/0xe20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? ena_com_prepare_tx+0x811/0xa70 [ena]
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_nodemask+0x263/0x280
Mar 06 04:54:14 ip-172-16-9-203 kernel: alloc_pages_vma+0x88/0x1f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: handle_pte_fault+0x3bf/0xd30
Mar 06 04:54:14 ip-172-16-9-203 kernel: __handle_mm_fault+0x478/0x5c0
Mar 06 04:54:14 ip-172-16-9-203 kernel: handle_mm_fault+0xb1/0x1f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __do_page_fault+0x250/0x4d0
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_page_fault+0x2e/0xe0
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_async_page_fault+0x51/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: async_page_fault+0x25/0x50

1) The nbd loopback process attempts to allocate memory and triggers a
malloc heap to grow, which in turn calls mprotect, allocates pages,
tries to free pages, and blocks trying to acquire the xfs reclaim
inodes mutex that is held waiting on our device:

Mar 06 04:54:14 ip-172-16-9-203 kernel: Not tainted 4.15.0-1032-aws #34-Ubuntu
Mar 06 04:54:14 ip-172-16-9-203 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 06 04:54:14 ip-172-16-9-203 kernel: cbdkit D 0 4405 4399 0x00000000
Mar 06 04:54:14 ip-172-16-9-203 kernel: Call Trace:
Mar 06 04:54:14 ip-172-16-9-203 kernel: __schedule+0x291/0x8a0
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule+0x2c/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule_preempt_disabled+0xe/0x10
Mar 06 04:54:14 ip-172-16-9-203 kernel: __mutex_lock.isra.2+0x18c/0x4d0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __mutex_lock_slowpath+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __mutex_lock_slowpath+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: mutex_lock+0x2f/0x40
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_ag+0x2ca/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? radix_tree_gang_lookup_tag+0xd9/0x160
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __list_lru_walk_one.isra.5+0x37/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? iput+0x220/0x220
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: super_cache_scan+0x165/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab.part.51+0x1e7/0x3e0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab+0x29/0x30
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_node+0x11e/0x300
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_try_to_free_pages+0xc9/0x330
Mar 06 04:54:14 ip-172-16-9-203 kernel: try_to_free_pages+0xee/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_slowpath+0x3fc/0xe20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __slab_free+0x14d/0x2c0
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __slab_free+0x14d/0x2c0
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __switch_to_asm+0x40/0x70
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_nodemask+0x263/0x280
Mar 06 04:54:14 ip-172-16-9-203 kernel: alloc_pages_vma+0x88/0x1f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: handle_pte_fault+0x3bf/0xd30
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? unix_stream_read_generic+0x703/0x900
Mar 06 04:54:14 ip-172-16-9-203 kernel: __handle_mm_fault+0x478/0x5c0
Mar 06 04:54:14 ip-172-16-9-203 kernel: handle_mm_fault+0xb1/0x1f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __get_user_pages+0x1ee/0x720
Mar 06 04:54:14 ip-172-16-9-203 kernel: populate_vma_page_range+0x71/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: mprotect_fixup+0x29d/0x380
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_mprotect_pkey+0x1e6/0x300
Mar 06 04:54:14 ip-172-16-9-203 kernel: SyS_mprotect+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_syscall_64+0x73/0x130
Mar 06 04:54:14 ip-172-16-9-203 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2

2) Another thread within the nbd loopback process attempting to write
to a socket and also being blocked by the xfs reclaim inodes mutex:

Mar 06 04:54:14 ip-172-16-9-203 kernel: INFO: task cbdkit:4413 blocked for more than 30 seconds.
Mar 06 04:54:14 ip-172-16-9-203 kernel: Not tainted 4.15.0-1032-aws #34-Ubuntu
Mar 06 04:54:14 ip-172-16-9-203 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 06 04:54:14 ip-172-16-9-203 kernel: cbdkit D 0 4413 4399 0x00000000
Mar 06 04:54:14 ip-172-16-9-203 kernel: Call Trace:
Mar 06 04:54:14 ip-172-16-9-203 kernel: __schedule+0x291/0x8a0
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule+0x2c/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: schedule_preempt_disabled+0xe/0x10
Mar 06 04:54:14 ip-172-16-9-203 kernel: __mutex_lock.isra.2+0x18c/0x4d0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __mutex_lock_slowpath+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __mutex_lock_slowpath+0x13/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: mutex_lock+0x2f/0x40
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_ag+0x2ca/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? radix_tree_gang_lookup_tag+0xd9/0x160
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __list_lru_walk_one.isra.5+0x37/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? iput+0x220/0x220
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel: super_cache_scan+0x165/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab.part.51+0x1e7/0x3e0
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_slab+0x29/0x30
Mar 06 04:54:14 ip-172-16-9-203 kernel: shrink_node+0x11e/0x300
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_try_to_free_pages+0xc9/0x330
Mar 06 04:54:14 ip-172-16-9-203 kernel: try_to_free_pages+0xee/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_slowpath+0x3fc/0xe20
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __kmalloc_node_track_caller+0x227/0x2d0
Mar 06 04:54:14 ip-172-16-9-203 kernel: __alloc_pages_nodemask+0x263/0x280
Mar 06 04:54:14 ip-172-16-9-203 kernel: alloc_pages_current+0x6a/0xe0
Mar 06 04:54:14 ip-172-16-9-203 kernel: skb_page_frag_refill+0xcc/0xf0
Mar 06 04:54:14 ip-172-16-9-203 kernel: sk_page_frag_refill+0x1d/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: tcp_sendmsg_locked+0x24d/0xe70
Mar 06 04:54:14 ip-172-16-9-203 kernel: tcp_sendmsg+0x2c/0x50
Mar 06 04:54:14 ip-172-16-9-203 kernel: inet_sendmsg+0x2e/0xb0
Mar 06 04:54:14 ip-172-16-9-203 kernel: sock_sendmsg+0x3e/0x50
Mar 06 04:54:14 ip-172-16-9-203 kernel: ___sys_sendmsg+0x2a0/0x2f0
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? aa_sk_perm+0x48/0x1a0
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? wake_up_q+0x44/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? futex_wake+0x8f/0x180
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? do_futex+0x300/0x500
Mar 06 04:54:14 ip-172-16-9-203 kernel: __sys_sendmsg+0x54/0x90
Mar 06 04:54:14 ip-172-16-9-203 kernel: ? __sys_sendmsg+0x54/0x90
Mar 06 04:54:14 ip-172-16-9-203 kernel: SyS_sendmsg+0x12/0x20
Mar 06 04:54:14 ip-172-16-9-203 kernel: do_syscall_64+0x73/0x130
Mar 06 04:54:14 ip-172-16-9-203 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2

On Sun, Feb 17, 2019 at 8:02 PM Shaun McDowell <smcdowell@cloudbd.io> wrote:
>
> Servers we have experienced deadlocks on with nbd loopback are AWS EC2
> instances that do NOT have swap (latest ubuntu trusty, xenial, bionic,
> and amzn1/2 images). It could be that with swap space available this
> is a non issue.
>
> On Sun, Feb 17, 2019 at 6:51 PM Richard W.M. Jones <rjones@redhat.com> wrote:
>>
>> On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
>> > But Shaun reported it happened somehow often for them, so he might
>> > have a practical test case... better than my theories :-).
>>
>> Yes, certainly not saying this isn't a problem.
>>
>> I think the good news is the fix seems quite easy, ie. to add mlockall
>> and adjust the OOM killer score, as is done currently in the client:
>>
>> https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
>> https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219
>>
>> For now I have added a note in the TODO file to follow up in case we
>> get a test case or reports of a problem:
>>
>> https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d
>>
>> Thanks,
>>
>> Rich.
>>
>> --
>> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
>> Read my programming and virtualization blog: http://rwmj.wordpress.com
>> libguestfs lets you edit virtual machines.  Supports shell scripting,
>> bindings from many languages.  http://libguestfs.org
Thread overview: 10+ messages
2019-02-15 19:19 nbd, nbdkit, loopback mounts and memory management Pavel Machek
2019-02-15 22:41 ` Richard W.M. Jones
2019-02-15 22:53 ` Richard W.M. Jones
2019-02-16 8:16 ` Wouter Verhelst
2019-02-15 22:55 ` Pavel Machek
2019-11-17 16:58 ` Richard W.M. Jones
2019-02-17 8:44 ` Richard W.M. Jones
2019-02-17 23:15 ` Pavel Machek
2019-02-17 23:51 ` Richard W.M. Jones
[not found] ` <CAM1OiDOKJ3SGHABNooQPFfx3KMYepYmSPxwyZZjZERc_y9v1WA@mail.gmail.com>
2019-03-12 16:14 ` Shaun McDowell