fscache recursive hang -- similar to loopback NFS issues

linux-fsdevel.vger.kernel.org archive mirror

* fscache recursive hang -- similar to loopback NFS issues
@ 2014-07-19 20:20 Milosz Tanski
  2014-07-19 20:31 ` Milosz Tanski
  2014-07-21  6:40 ` NeilBrown
  0 siblings, 2 replies; 12+ messages in thread
From: Milosz Tanski @ 2014-07-19 20:20 UTC (permalink / raw)
  To: neilb; +Cc: linux-fsdevel@vger.kernel.org, ceph-devel,
	linux-cachefs@redhat.com

Neil,

I saw your recent patchset for improving the wait_on_bit interface
(in particular: SCHED: allow wait_on_bit_action functions to support a
timeout). I'm looking for some guidance on leveraging that work to
solve another recursive lock hang in fscache.

I've run into issues similar to the ones you're trying to solve with
loopback NFS, but in the fscache code. This happens under heavy VM
pressure when the kernel is aggressively trying to trim the page cache.

The hang is caused by this series of events:
1. cachefiles_write_page - cachefiles (the fscache backend, sitting on
ext4) tries to write a page to disk
2. ext4 tries to allocate a page in writeback (without GFP_NOFS and
with the wait flag)
3. due to VM pressure the kernel tries to free up pages
4. this causes ceph's releasepage to be called
5. the selected page is the cached page being written out (from step #1)
6. fscache_wait_on_page_write hangs forever
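
The chain above hinges on the allocation flags in step 2. As a rough
user-space model only (the MODEL_* names and flag values are invented for
illustration and are not the kernel's definitions, and the I/O bit is left
out), the hang needs both an allocation that may sleep and re-enter
filesystem code, and reclaim picking the very page the writer is storing:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's __GFP_WAIT and __GFP_FS bits;
 * the numeric values are invented for this sketch. */
#define MODEL_GFP_WAIT 0x1u	/* allocation may sleep */
#define MODEL_GFP_FS   0x2u	/* reclaim may call into filesystem code */

#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_WAIT)

/* True when direct reclaim triggered by this allocation could end up
 * sleeping inside a filesystem's releasepage hook (steps 3 to 6). */
static bool model_reclaim_may_block_in_fs(unsigned int gfp)
{
	return (gfp & MODEL_GFP_WAIT) && (gfp & MODEL_GFP_FS);
}

/* The hang needs two things at once: the writer's own allocation can block
 * in filesystem reclaim, and reclaim picks the page that writer is storing. */
static bool model_can_self_deadlock(unsigned int gfp, int page_being_written,
				    int page_picked_by_reclaim)
{
	assert(page_being_written >= 0 && page_picked_by_reclaim >= 0);
	return model_reclaim_may_block_in_fs(gfp) &&
	       page_being_written == page_picked_by_reclaim;
}
```

In this model an allocation made with MODEL_GFP_NOFS can never reach step 6,
which is why the missing GFP_NOFS in step 2 matters.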

Do you have a patch for NFS that implements the timeout, which I could
use as a template? I'm not familiar with that piece of the code base.

Best,
- Milosz

INFO: task kworker/u30:7:28375 blocked for more than 120 seconds.
      Not tainted 3.15.0-virtual #74
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u30:7   D 0000000000000000     0 28375      2 0x00000000
Workqueue: fscache_operation fscache_op_work_func [fscache]
 ffff88000b147148 0000000000000046 0000000000000000 ffff88000b1471c8
 ffff8807aa031820 0000000000014040 ffff88000b147fd8 0000000000014040
 ffff880f0c50c860 ffff8807aa031820 ffff88000b147158 ffff88007be59cd0
Call Trace:
 [<ffffffff815930e9>] schedule+0x29/0x70
 [<ffffffffa018bed5>] __fscache_wait_on_page_write+0x55/0x90 [fscache]
 [<ffffffff810a4350>] ? __wake_up_sync+0x20/0x20
 [<ffffffffa018c135>] __fscache_maybe_release_page+0x65/0x1e0 [fscache]
 [<ffffffffa02ad813>] ceph_releasepage+0x83/0x100 [ceph]
 [<ffffffff811635b0>] ? anon_vma_fork+0x130/0x130
 [<ffffffff8112cdd2>] try_to_release_page+0x32/0x50
 [<ffffffff81140096>] shrink_page_list+0x7e6/0x9d0
 [<ffffffff8113f278>] ? isolate_lru_pages.isra.73+0x78/0x1e0
 [<ffffffff81140932>] shrink_inactive_list+0x252/0x4c0
 [<ffffffff811412b1>] shrink_lruvec+0x3e1/0x670
 [<ffffffff8114157f>] shrink_zone+0x3f/0x110
 [<ffffffff81141b06>] do_try_to_free_pages+0x1d6/0x450
 [<ffffffff8114a939>] ? zone_statistics+0x99/0xc0
 [<ffffffff81141e44>] try_to_free_pages+0xc4/0x180
 [<ffffffff81136982>] __alloc_pages_nodemask+0x6b2/0xa60
 [<ffffffff811c1d4e>] ? __find_get_block+0xbe/0x250
 [<ffffffff810a405e>] ? wake_up_bit+0x2e/0x40
 [<ffffffff811740c3>] alloc_pages_current+0xb3/0x180
 [<ffffffff8112cf07>] __page_cache_alloc+0xb7/0xd0
 [<ffffffff8112da6c>] grab_cache_page_write_begin+0x7c/0xe0
 [<ffffffff81214072>] ? ext4_mark_inode_dirty+0x82/0x220
 [<ffffffff81214a89>] ext4_da_write_begin+0x89/0x2d0
 [<ffffffff8112c6ee>] generic_perform_write+0xbe/0x1d0
 [<ffffffff811a96b1>] ? update_time+0x81/0xc0
 [<ffffffff811ad4c2>] ? mnt_clone_write+0x12/0x30
 [<ffffffff8112e80e>] __generic_file_aio_write+0x1ce/0x3f0
 [<ffffffff8112ea8e>] generic_file_aio_write+0x5e/0xe0
 [<ffffffff8120b94f>] ext4_file_write+0x9f/0x410
 [<ffffffff8120af56>] ? ext4_file_open+0x66/0x180
 [<ffffffff8118f0da>] do_sync_write+0x5a/0x90
 [<ffffffffa025c6c9>] cachefiles_write_page+0x149/0x430 [cachefiles]
 [<ffffffff812cf439>] ? radix_tree_gang_lookup_tag+0x89/0xd0
 [<ffffffffa018c512>] fscache_write_op+0x222/0x3b0 [fscache]
 [<ffffffffa018b35a>] fscache_op_work_func+0x3a/0x100 [fscache]
 [<ffffffff8107bfe9>] process_one_work+0x179/0x4a0
 [<ffffffff8107d47b>] worker_thread+0x11b/0x370
 [<ffffffff8107d360>] ? manage_workers.isra.21+0x2e0/0x2e0
 [<ffffffff81083d69>] kthread+0xc9/0xe0
 [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
 [<ffffffff81083ca0>] ? flush_kthread_worker+0xb0/0xb0
 [<ffffffff8159eefc>] ret_from_fork+0x7c/0xb0
 [<ffffffff81083ca0>] ? flush_kthread_worker+0xb0/0xb0

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com


* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-19 20:20 fscache recursive hang -- similar to loopback NFS issues Milosz Tanski
@ 2014-07-19 20:31 ` Milosz Tanski
  2014-07-21  6:40 ` NeilBrown
  1 sibling, 0 replies; 12+ messages in thread
From: Milosz Tanski @ 2014-07-19 20:31 UTC (permalink / raw)
  To: neilb
  Cc: ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com, David Howells

I forgot to mention this: David Howells attempted to fix a similar
issue with NFS and fscache on ext4 last year:
http://www.redhat.com/archives/linux-cachefs/2013-May/msg00003.html
The problem is that ext4, in its wisdom, tries to allocate a page without
using GFP_NOFS (in the code at ext4/inode.c:2678), so the fix that David
added is not going to do anything for us.

        /*
         * grab_cache_page_write_begin() can take a long time if the
         * system is thrashing due to memory pressure, or if the page
         * is being written back.  So grab it first before we start
         * the transaction handle.  This also allows us to allocate
         * the page (if needed) without using GFP_NOFS.
         */
retry_grab:
        page = grab_cache_page_write_begin(mapping, index, flags);
        if (!page)



* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-19 20:20 fscache recursive hang -- similar to loopback NFS issues Milosz Tanski
  2014-07-19 20:31 ` Milosz Tanski
@ 2014-07-21  6:40 ` NeilBrown
  2014-07-21 11:42   ` Milosz Tanski
  2014-07-29 16:12   ` David Howells
  1 sibling, 2 replies; 12+ messages in thread
From: NeilBrown @ 2014-07-21  6:40 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com, David Howells


On Sat, 19 Jul 2014 16:20:01 -0400 Milosz Tanski <milosz@adfin.com> wrote:

> Neil,
>
> I saw your recent patchset for improving the wait_on_bit interface
> (in particular: SCHED: allow wait_on_bit_action functions to support a
> timeout). I'm looking for some guidance on leveraging that work to
> solve another recursive lock hang in fscache.
>
> I've run into issues similar to the ones you're trying to solve with
> loopback NFS, but in the fscache code. This happens under heavy VM
> pressure when the kernel is aggressively trying to trim the page cache.
>
> The hang is caused by this series of events:
> 1. cachefiles_write_page - cachefiles (the fscache backend, sitting on
> ext4) tries to write a page to disk
> 2. ext4 tries to allocate a page in writeback (without GFP_NOFS and
> with the wait flag)
> 3. due to VM pressure the kernel tries to free up pages
> 4. this causes ceph's releasepage to be called
> 5. the selected page is the cached page being written out (from step #1)
> 6. fscache_wait_on_page_write hangs forever
>
> Do you have a patch for NFS that implements the timeout, which I could
> use as a template? I'm not familiar with that piece of the code base.

It looks like the comment in __fscache_maybe_release_page

	/* We will wait here if we're allowed to, but that could deadlock the
	 * allocator as the work threads writing to the cache may all end up
	 * sleeping on memory allocation, so we may need to impose a timeout
	 * too. */

is correct when it says "we may need to impose a timeout".
The __fscache_wait_on_page_write() call that follows needs to time out.

However, that doesn't use wait_on_bit(); it just has a simple wait_event.
So something like this should fix it (or should at least move the problem
along a bit).

NeilBrown



diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index ed70714503fa..58035024c5cf 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -43,6 +43,13 @@ void __fscache_wait_on_page_write(struct fscache_cookie *cookie, struct page *pa
 }
 EXPORT_SYMBOL(__fscache_wait_on_page_write);
 
+void __fscache_wait_on_page_write_timeout(struct fscache_cookie *cookie, struct page *page, unsigned long timeout)
+{
+	wait_queue_head_t *wq = bit_waitqueue(&cookie->flags, 0);
+
+	wait_event_timeout(*wq, !__fscache_check_page_write(cookie, page), timeout);
+}
+
 /*
  * decide whether a page can be released, possibly by cancelling a store to it
  * - we're allowed to sleep if __GFP_WAIT is flagged
@@ -115,7 +122,7 @@ page_busy:
 	}
 
 	fscache_stat(&fscache_n_store_vmscan_wait);
-	__fscache_wait_on_page_write(cookie, page);
+	__fscache_wait_on_page_write_timeout(cookie, page, HZ);
 	gfp &= ~__GFP_WAIT;
 	goto try_again;
 }
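
The effect of the second hunk can be sketched in user space. This is only a
model under stated assumptions: struct model_page, the tick counter and the
MODEL_* names are hypothetical, and only the wait / clear-flag / retry
sequence is taken from the patch. The bounded wait follows the documented
return convention of wait_event_timeout(): 0 if the timeout elapsed with the
condition still false, otherwise the remaining time (at least 1).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_GFP_WAIT 0x1u	/* hypothetical stand-in for __GFP_WAIT */
#define MODEL_NEVER    ULONG_MAX	/* store that never completes (stuck writer) */

struct model_page {
	/* Ticks until the in-flight store finishes; 0 means no store is
	 * in flight, MODEL_NEVER means the writer is stuck. */
	unsigned long ticks_to_finish;
};

/* Bounded wait: sleep for at most `timeout` ticks. Returns 0 on timeout,
 * otherwise the ticks left over (at least 1). */
static unsigned long model_wait_on_page_write_timeout(struct model_page *page,
						      unsigned long timeout)
{
	unsigned long left;

	if (page->ticks_to_finish == 0)
		return timeout ? timeout : 1;
	if (page->ticks_to_finish == MODEL_NEVER)
		return 0;
	if (page->ticks_to_finish > timeout) {
		page->ticks_to_finish -= timeout;
		return 0;
	}
	/* The store completes within the timeout. */
	left = timeout - page->ticks_to_finish;
	page->ticks_to_finish = 0;
	return left ? left : 1;
}

/* Model of the page_busy path after the patch: wait once with a timeout,
 * then clear the "may sleep" bit and re-check, so a page that is still
 * busy makes the function give up instead of sleeping again. */
static bool model_maybe_release_page(struct model_page *page, unsigned int gfp,
				     unsigned long timeout)
{
	assert(timeout > 0);
try_again:
	if (page->ticks_to_finish == 0)
		return true;		/* nothing in flight: page can go */
	if (!(gfp & MODEL_GFP_WAIT))
		return false;		/* busy and not allowed to sleep */
	(void)model_wait_on_page_write_timeout(page, timeout);
	gfp &= ~MODEL_GFP_WAIT;
	goto try_again;
}
```

With an untimed wait, a page whose writer is stuck (MODEL_NEVER) would never
come back; with the bounded wait the caller gets false after one timeout and
reclaim can move on to another page.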





* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-21  6:40 ` NeilBrown
@ 2014-07-21 11:42   ` Milosz Tanski
  2014-07-29 16:12   ` David Howells
  1 sibling, 0 replies; 12+ messages in thread
From: Milosz Tanski @ 2014-07-21 11:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-fsdevel@vger.kernel.org, ceph-devel,
	linux-cachefs@redhat.com

Neil,

That's the exact same fix I started testing on Saturday. I found that
there already is a wait_event_timeout (even without your recent changes).
The thing I'm not quite sure about is what timeout it should use. A quick
search through references on LXR shows that it isn't used anywhere in fs
code except for debug cases (btrfs) and network filesystems.

- Milosz



* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-21  6:40 ` NeilBrown
  2014-07-21 11:42   ` Milosz Tanski
@ 2014-07-29 16:12   ` David Howells
  2014-07-29 21:17     ` NeilBrown
  1 sibling, 1 reply; 12+ messages in thread
From: David Howells @ 2014-07-29 16:12 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: NeilBrown, ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com

Milosz Tanski <milosz@adfin.com> wrote:

> That's the exact same fix I started testing on Saturday. I found that
> there already is a wait_event_timeout (even without your recent changes). The
> thing I'm not quite sure about is what timeout it should use.

That's probably something to make an external tuning knob for.

David


* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-29 16:12   ` David Howells
@ 2014-07-29 21:17     ` NeilBrown
  2014-07-30  1:48       ` Milosz Tanski
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2014-07-29 21:17 UTC (permalink / raw)
  To: David Howells
  Cc: Milosz Tanski, ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com


On Tue, 29 Jul 2014 17:12:34 +0100 David Howells <dhowells@redhat.com> wrote:

> Milosz Tanski <milosz@adfin.com> wrote:
> 
> > That's the exact same fix I started testing on Saturday. I found that
> > there already is a wait_event_timeout (even without your recent changes). The
> > thing I'm not quite sure about is what timeout it should use.
> 
> That's probably something to make an external tuning knob for.
> 
> David

Ugg.  External tuning knobs should be avoided wherever possible, and always
come with detailed instructions on how to tune them  </rant>

In this case I think it very nearly doesn't matter *at all* what value is
used.

If you set it a bit too high, then on the very, very rare occasion that it
would currently deadlock, you get a longer-than-necessary wait.  So just make
sure that it is short enough that by the time the sysadmin notices and starts
looking for the problem, it will be gone.

And if you set it a bit too low, then it will loop around to find another
page to deal with before that one is finished being written out, and so maybe
do a little bit more work than is needed (though it'll be needed eventually).

So the perfect number is somewhere between the typical response time for
storage, and the typical response time for the sys-admin.  Anywhere between
100ms and 10sec would do.  1 second is the geo-mean.
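
To make the arithmetic concrete: the geometric mean of 100ms and 10sec is
sqrt(100 * 10000) ms = 1000 ms, i.e. the one second that `HZ` stands for in
the patch. A small sketch, assuming a hypothetical `hz` parameter in place of
the kernel's compile-time HZ and a round-up conversion so that a non-zero
delay never becomes zero ticks (the helper names are made up for this
example):

```c
#include <assert.h>

/* Convert milliseconds to ticks at a given tick rate, rounding up.
 * `hz` is a hypothetical parameter standing in for the kernel's HZ. */
static unsigned long model_msecs_to_ticks(unsigned long ms, unsigned long hz)
{
	assert(hz > 0);
	return (ms * hz + 999) / 1000;
}

/* Integer geometric mean of two bounds: the largest g with g * g <= lo * hi.
 * A simple linear search is enough for millisecond-sized inputs. */
static unsigned long model_geo_mean(unsigned long lo, unsigned long hi)
{
	unsigned long product = lo * hi;
	unsigned long g = 0;

	while ((g + 1) * (g + 1) <= product)
		g++;
	return g;
}
```

For example, at a tick rate of 250 the two ends of the suggested range come
out as 25 and 2500 ticks, and the geometric mean as 250 ticks.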

(sorry I didn't reply earlier - I missed your email somehow).

NeilBrown


* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-29 21:17     ` NeilBrown
@ 2014-07-30  1:48       ` Milosz Tanski
  2014-07-30  2:19         ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Milosz Tanski @ 2014-07-30  1:48 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Howells, ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com

I would vote for the lower end of the spectrum by default (closer to
100ms), since I imagine anybody deploying this in a production
environment would likely be using SSD drives for the caching. And in
my tests on spinning disks there was little to no benefit outside of
reducing network traffic.

- Milosz


* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-30  1:48       ` Milosz Tanski
@ 2014-07-30  2:19         ` NeilBrown
  2014-07-30 16:06           ` Milosz Tanski
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2014-07-30  2:19 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: David Howells, ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com


On Tue, 29 Jul 2014 21:48:34 -0400 Milosz Tanski <milosz@adfin.com> wrote:

> I would vote for the lower end of the spectrum by default (closer to
> 100ms), since I imagine anybody deploying this in a production
> environment would likely be using SSD drives for the caching. And in
> my tests on spinning disks there was little to no benefit outside of
> reducing network traffic.

Maybe I'm confused......

I thought the whole point of this patch was to avoid deadlocks.
Now you seem to be talking about a performance benefit.
What did I miss?

NeilBrown



* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-30  2:19         ` NeilBrown
@ 2014-07-30 16:06           ` Milosz Tanski
  2014-08-05  4:12             ` Milosz Tanski
  2014-08-05 14:32             ` David Howells
  0 siblings, 2 replies; 12+ messages in thread
From: Milosz Tanski @ 2014-07-30 16:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Howells, ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com

I don't think that fixing a deadlock should impose a somewhat
unexplainably high latency on the end user (or system
admin). With old drives such latencies (a second or more) were not
unexpected.

- Milosz


* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-30 16:06           ` Milosz Tanski
@ 2014-08-05  4:12             ` Milosz Tanski
  2014-08-05  4:49               ` NeilBrown
  2014-08-05 14:32             ` David Howells
  1 sibling, 1 reply; 12+ messages in thread
From: Milosz Tanski @ 2014-08-05  4:12 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Howells, ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com

I was away for a few days but I did think about this some more and how
to avoid a tunable while still having a sensible default.

FSCache already tracks statistics about how long writes and reads take
(at least if you enable that option). With those stats in hand we
should be able to generate a default timeout value that works well and
avoid a tunable.

Myself, I'm thinking of something like the 90th percentile time for a page
write... whatever the value may be, this should be a decent way of
auto-optimizing this timeout.
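
One way this could look, as a sketch only: the histogram layout, the bucket
width and the model_* / MODEL_* names are assumptions for illustration, not
how FSCache actually stores its statistics. The clamp uses the 100ms to 10sec
range mentioned earlier in the thread, and the 1 second fallback for the
no-samples case is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_TIMEOUT_MIN_MS      100UL		/* lower bound from the thread */
#define MODEL_TIMEOUT_MAX_MS      10000UL	/* upper bound from the thread */
#define MODEL_TIMEOUT_FALLBACK_MS 1000UL	/* assumed default with no samples */

/* hist[i] counts page writes that took about i * bucket_ms milliseconds.
 * Returns a timeout in ms: the 90th-percentile write latency, clamped to
 * [MODEL_TIMEOUT_MIN_MS, MODEL_TIMEOUT_MAX_MS]. */
static unsigned long model_p90_timeout_ms(const unsigned long *hist,
					  size_t nbuckets,
					  unsigned long bucket_ms)
{
	unsigned long total = 0, seen = 0, p90_ms;
	size_t i;

	assert(hist != NULL && bucket_ms > 0);

	for (i = 0; i < nbuckets; i++)
		total += hist[i];
	if (total == 0)
		return MODEL_TIMEOUT_FALLBACK_MS;

	/* Find the first bucket at which at least 90% of the samples are
	 * covered; comparing seen * 10 with total * 9 avoids floating point.
	 * The loop always breaks, since the last bucket covers 100%. */
	for (i = 0; i < nbuckets; i++) {
		seen += hist[i];
		if (seen * 10 >= total * 9)
			break;
	}
	p90_ms = (unsigned long)i * bucket_ms;

	if (p90_ms < MODEL_TIMEOUT_MIN_MS)
		return MODEL_TIMEOUT_MIN_MS;
	if (p90_ms > MODEL_TIMEOUT_MAX_MS)
		return MODEL_TIMEOUT_MAX_MS;
	return p90_ms;
}
```

With 100 ms buckets holding 0, 50, 30, 15 and 5 samples, 95% of the writes
are covered by the fourth bucket, so the sketch picks 300 ms.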

- M

On Wed, Jul 30, 2014 at 12:06 PM, Milosz Tanski <milosz@adfin.com> wrote:
> I don't think that fixing a dead lock should impose a somewhat
> un-explainable high latency for the for the end user (or system
> admin). With old drives such latencies (second plus) were not
> unexpected.
>
> - Milosz
>
> On Tue, Jul 29, 2014 at 10:19 PM, NeilBrown <neilb@suse.de> wrote:
>> On Tue, 29 Jul 2014 21:48:34 -0400 Milosz Tanski <milosz@adfin.com> wrote:
>>
>>> I would vote on the lower end of the spectrum by default (closer to
>>> 100ms) since I imagine anybody deploying this in production
>>> environment would likely be using SSD drives for the caching. And in
>>> my tests on spinning disks there was little to no benefit outside of
>>> reducing network traffic.
>>
>> Maybe I'm confused......
>>
>> I thought the whole point of this patch was to avoid deadlocks.
>> Now you seem to be talking about a performance benefit.
>> What did I miss?
>>
>> NeilBrown
>>
>>
>>>
>>> - Milosz
>>>
>>> On Tue, Jul 29, 2014 at 5:17 PM, NeilBrown <neilb@suse.de> wrote:
>>> > On Tue, 29 Jul 2014 17:12:34 +0100 David Howells <dhowells@redhat.com> wrote:
>>> >
>>> >> Milosz Tanski <milosz@adfin.com> wrote:
>>> >>
>>> >> > That's the same thing exact fix I started testing on Saturday. I found that
>>> >> > there already is a wait_event_timeout (even without your recent changes). The
>>> >> > thing I'm not quite sure is what timeout it should use?
>>> >>
>>> >> That's probably something to make an external tuning knob for.
>>> >>
>>> >> David
>>> >
>>> > Ugg.  External tuning knobs should be avoided wherever possible, and always
>>> > come with detailed instructions on how to tune them  </rant>
>>> >
>>> > In this case I think it very nearly doesn't matter *at all* what value is
>>> > used.
>>> >
>>> > If you set it a bit too high, then on the very very rare occasion that it
>>> > would currently deadlock, you get a longer-than-necessary wait.  So just make
>>> > sure that is short enough that by the time the sysadmin notices and starts
>>> > looking for the problem, it will be gone.
>>> >
>>> > And if you set it a bit too low, then it will loop around to find another
>>> > page to deal with before that one is finished being written out, and so maybe
>>> > do a little bit more work than is needed (though it'll be needed eventually).
>>> >
>>> > So the perfect number is somewhere between the typical response time for
>>> > storage, and the typical response time for the sys-admin.  Anywhere between
>>> > 100ms and 10sec would do.  1 second is the geo-mean.
>>> >
>>> > (sorry I didn't reply earlier - I missed your email somehow).
>>> >
>>> > NeilBrown
>>>
>>>
>>>
>>
>
>
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@adfin.com



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-08-05  4:12             ` Milosz Tanski
@ 2014-08-05  4:49               ` NeilBrown
  0 siblings, 0 replies; 12+ messages in thread
From: NeilBrown @ 2014-08-05  4:49 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: David Howells, ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com

[-- Attachment #1: Type: text/plain, Size: 4824 bytes --]

On Tue, 5 Aug 2014 00:12:25 -0400 Milosz Tanski <milosz@adfin.com> wrote:

> I was away for a few days but I did think about this some more and how
> to avoid a tunable while still having a sensible default.
> 
> FSCache already tracks statistics about how long writes and reads take
> (at least if you enable that option). With those stats in hand we
> should be able to generate a default timeout value that works well and
> avoid a tunable.
> 
> Myself, I'm thinking of something like the 90th percentile time for a page
> write... whatever the value may be, this should be a decent way of
> auto-optimizing this timeout.

Sounds like it could be a good approach, though if stats aren't enabled we
need a sensible fallback.
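
A minimal userspace sketch of that idea, not fscache code: take the 90th
percentile of recorded page-write latencies and use it as the timeout, falling
back to a fixed default when no samples exist (for example because stats are
disabled). The names pick_timeout_ms and DEFAULT_TIMEOUT_MS are hypothetical,
the 100ms default is simply the low-end figure floated in this thread, and the
nearest-rank percentile method is an assumption chosen for simplicity.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fallback used when no latency samples are available. */
#define DEFAULT_TIMEOUT_MS 100u

/* qsort comparator for unsigned int, ascending. */
static int cmp_u32(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return (x > y) - (x < y);
}

/*
 * Return the 90th percentile (nearest-rank method) of the write
 * latencies in samples_ms[0..n-1], or DEFAULT_TIMEOUT_MS when there
 * are no samples or the working copy cannot be allocated.
 */
unsigned int pick_timeout_ms(const unsigned int *samples_ms, size_t n)
{
	unsigned int *sorted;
	unsigned int result;
	size_t rank;

	if (samples_ms == NULL || n == 0)
		return DEFAULT_TIMEOUT_MS;

	/* Sort a copy so the caller's sample buffer is left untouched. */
	sorted = malloc(n * sizeof(*sorted));
	if (sorted == NULL)
		return DEFAULT_TIMEOUT_MS;
	memcpy(sorted, samples_ms, n * sizeof(*sorted));
	qsort(sorted, n, sizeof(*sorted), cmp_u32);

	/* 1-based rank = ceil(0.9 * n), computed in integer arithmetic. */
	rank = (n * 9 + 9) / 10;
	result = sorted[rank - 1];

	free(sorted);
	return result;
}
```

With ten samples of 1..10 ms this picks the ninth-smallest value; with no
samples at all it returns the 100ms default. A real implementation would
more likely keep a histogram than sort raw samples.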

What is the actual cost of having the timeout too small?  I guess it might be
unnecessary writeouts, but I haven't done any analysis.
If the cost is expected to be quite small, a smaller timeout might be very
appropriate.

One statistic that might be interesting is how long that wait typically takes
at present, and how often it deadlocks.

Mind you, we might be trying too hard.  Maybe just go for 100ms.

When you suggested that, I wasn't really objecting to your choice of a
number.  I was surprised because you seemed to justify it as a performance
concern, and I didn't think deadlocks would happen often enough for that to
be a valid concern.  When deadlocks do happen, I presume the system is
temporarily under high memory pressure so lots of things are probably going
slowly, so a delay of a second might not be noticed.
But there are way too many "probably"s and "might"s.  If you can present
anything that looks like real data, it'll certainly trump all my hypotheses.

Thanks,
NeilBrown

> 
> - M
> 
> On Wed, Jul 30, 2014 at 12:06 PM, Milosz Tanski <milosz@adfin.com> wrote:
> > I don't think that fixing a deadlock should impose a somewhat
> > unexplainable high latency for the end user (or system
> > admin). With old drives such latencies (second plus) were not
> > unexpected.
> >
> > - Milosz
> >
> > On Tue, Jul 29, 2014 at 10:19 PM, NeilBrown <neilb@suse.de> wrote:
> >> On Tue, 29 Jul 2014 21:48:34 -0400 Milosz Tanski <milosz@adfin.com> wrote:
> >>
> >>> I would vote on the lower end of the spectrum by default (closer to
> >>> 100ms) since I imagine anybody deploying this in production
> >>> environment would likely be using SSD drives for the caching. And in
> >>> my tests on spinning disks there was little to no benefit outside of
> >>> reducing network traffic.
> >>
> >> Maybe I'm confused......
> >>
> >> I thought the whole point of this patch was to avoid deadlocks.
> >> Now you seem to be talking about a performance benefit.
> >> What did I miss?
> >>
> >> NeilBrown
> >>
> >>
> >>>
> >>> - Milosz
> >>>
> >>> On Tue, Jul 29, 2014 at 5:17 PM, NeilBrown <neilb@suse.de> wrote:
> >>> > On Tue, 29 Jul 2014 17:12:34 +0100 David Howells <dhowells@redhat.com> wrote:
> >>> >
> >>> >> Milosz Tanski <milosz@adfin.com> wrote:
> >>> >>
> >>> >> > That's the exact same fix I started testing on Saturday. I found that
> >>> >> > there already is a wait_event_timeout (even without your recent changes). The
> >>> >> > thing I'm not quite sure about is what timeout it should use.
> >>> >>
> >>> >> That's probably something to make an external tuning knob for.
> >>> >>
> >>> >> David
> >>> >
> >>> > Ugg.  External tuning knobs should be avoided wherever possible, and always
> >>> > come with detailed instructions on how to tune them  </rant>
> >>> >
> >>> > In this case I think it very nearly doesn't matter *at all* what value is
> >>> > used.
> >>> >
> >>> > If you set it a bit too high, then on the very very rare occasion that it
> >>> > would currently deadlock, you get a longer-than-necessary wait.  So just make
> >>> > sure that is short enough that by the time the sysadmin notices and starts
> >>> > looking for the problem, it will be gone.
> >>> >
> >>> > And if you set it a bit too low, then it will loop around to find another
> >>> > page to deal with before that one is finished being written out, and so maybe
> >>> > do a little bit more work than is needed (though it'll be needed eventually).
> >>> >
> >>> > So the perfect number is somewhere between the typical response time for
> >>> > storage, and the typical response time for the sys-admin.  Anywhere between
> >>> > 100ms and 10sec would do.  1 second is the geo-mean.
> >>> >
> >>> > (sorry I didn't reply earlier - I missed your email somehow).
> >>> >
> >>> > NeilBrown
> >>>
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > Milosz Tanski
> > CTO
> > 16 East 34th Street, 15th floor
> > New York, NY 10016
> >
> > p: 646-253-9055
> > e: milosz@adfin.com
> 
> 
> 


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fscache recursive hang -- similar to loopback NFS issues
  2014-07-30 16:06           ` Milosz Tanski
  2014-08-05  4:12             ` Milosz Tanski
@ 2014-08-05 14:32             ` David Howells
  1 sibling, 0 replies; 12+ messages in thread
From: David Howells @ 2014-08-05 14:32 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: NeilBrown, ceph-devel, linux-fsdevel@vger.kernel.org,
	linux-cachefs@redhat.com

Milosz Tanski <milosz@adfin.com> wrote:

> FSCache already tracks statistics about how long writes and reads take
> (at least if you enable that option). With those stats in hand we
> should be able to generate a default timeout value that works well and
> avoid a tunable.

If stat generation is going to be on all the time, it should probably use
something other than atomic ops.  NFS uses a set of ints on each CPU which are
added together when needed.
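
A rough userspace sketch of the counting scheme described above, not the NFS
or fscache code: each CPU increments its own plain integer slot, and the slots
are only summed when someone reads the statistic. The struct, the function
names and NR_SLOTS are hypothetical; a kernel version would use the per-CPU
variable helpers rather than an array indexed by a caller-supplied CPU number.

```c
#include <stddef.h>

/* Hypothetical number of CPUs covered by this sketch. */
#define NR_SLOTS 4

struct stat_counter {
	/* One plain integer per CPU; each CPU only writes its own slot. */
	unsigned long slot[NR_SLOTS];
};

/* Hot path: bump the calling CPU's slot, no atomic operation needed. */
static inline void stat_inc(struct stat_counter *c, unsigned int cpu)
{
	c->slot[cpu % NR_SLOTS]++;
}

/* Slow path: add the per-CPU slots together when the total is wanted. */
unsigned long stat_read(const struct stat_counter *c)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < NR_SLOTS; i++)
		sum += c->slot[i];
	return sum;
}
```

The trade-off is that the total read this way is only approximately
consistent while updates are in flight, which is usually acceptable for
statistics.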

David

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-08-05 14:32 UTC | newest]

Thread overview: 12+ messages
2014-07-19 20:20 fscache recursive hang -- similar to loopback NFS issues Milosz Tanski
2014-07-19 20:31 ` Milosz Tanski
2014-07-21  6:40 ` NeilBrown
2014-07-21 11:42   ` Milosz Tanski
2014-07-29 16:12   ` David Howells
2014-07-29 21:17     ` NeilBrown
2014-07-30  1:48       ` Milosz Tanski
2014-07-30  2:19         ` NeilBrown
2014-07-30 16:06           ` Milosz Tanski
2014-08-05  4:12             ` Milosz Tanski
2014-08-05  4:49               ` NeilBrown
2014-08-05 14:32             ` David Howells
