* [PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart
From: Matt Helsley @ 2009-07-09 19:22 UTC (permalink / raw)
  To: Serge Hallyn; +Cc: Containers

The robust futex test can hang if the kernel fails to properly set the robust
list pointer. This currently happens during restart. The test should not
hang and instead should report failure.

Use a timeout to ensure that hangs are caught and reported as failure.
When the timeout expires, futex() fails with ETIMEDOUT. The timeout also
bounds how long checkpoint/restart may take before the test gives up, so
choosing a suitable value is essential here.

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reported-by: Serge Hallyn <serue-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
Still needs testing.
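
For reference, one way a test could check the robust list pointer directly
instead of waiting for a hang is sketched below. This is not part of the
patch. The helper names current_robust_list() and register_robust_list() are
made up for illustration; only the get_robust_list and set_robust_list system
calls and struct robust_list_head from <linux/futex.h> are real. It assumes a
Linux system where both system calls are available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the robust list head the kernel currently
 * has registered for the calling thread (pid 0), or NULL on error.
 */
static struct robust_list_head *current_robust_list(void)
{
	struct robust_list_head *head = NULL;
	size_t len = 0;

	if (syscall(SYS_get_robust_list, 0, &head, &len) != 0)
		return NULL;
	return head;
}

/*
 * Hypothetical helper: initialise an empty robust list and register it
 * for the calling thread. Returns 0 on success, -1 on failure.
 */
static int register_robust_list(struct robust_list_head *head)
{
	assert(head != NULL);

	head->list.next = &head->list;	/* empty list points at itself */
	head->futex_offset = 0;
	head->list_op_pending = NULL;

	/* The kernel rejects any length other than sizeof(*head). */
	return syscall(SYS_set_robust_list, head, sizeof(*head)) == 0 ? 0 : -1;
}
```

A checkpoint/restart test could record the value of current_robust_list()
before the checkpoint and compare it after restart; a mismatch would point at
the kernel not restoring the pointer, without having to wait on a futex at all.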

diff --git a/futex/robust.c b/futex/robust.c
index a52f638..4cda4f7 100644
--- a/futex/robust.c
+++ b/futex/robust.c
@@ -103,6 +103,10 @@ void add_rfutex(struct futex *rf)
 
 void acquire_rfutex(struct futex *rf, pid_t tid)
 {
+	struct timespec timeout = {
+		.tv_sec = 5,
+		.tv_nsec = 0
+	};
 	int val = 0;
 
 	rlist.list_op_pending = &rf->rlist; /* ARCH TODO make sure this assignment is atomic */
@@ -125,7 +129,7 @@ void acquire_rfutex(struct futex *rf, pid_t tid)
 		val = __sync_or_and_fetch(&rf->tid.counter, FUTEX_WAITERS);
 		log("INFO", "futex(FUTEX_WAIT, %x)\n", val);
 		if (futex(&rf->tid.counter, FUTEX_WAIT, val,
-			  NULL, NULL, 0) == 0)
+			  &timeout, NULL, 0) == 0)
 			break;
 		log("INFO", "futex returned with errno %d (%s).\n", errno, strerror(errno));
 		switch(errno) {
@@ -139,8 +143,9 @@ void acquire_rfutex(struct futex *rf, pid_t tid)
 				log("WARN", "EINTR while sleeping on futex\n");
 				continue;
 			case ETIMEDOUT:
-				log("WARN", "ETIMEDOUT while sleeping on futex\n");
-				continue;
+				log("FAIL", "ETIMEDOUT while sleeping on futex.\n");
+				fail++;
+				return;
 			case EACCES:
 				log("FAIL", "FUTEX_WAIT EACCES - no read access to futex memory\n");
 				fail++;
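
The behaviour the patch relies on can be sketched in isolation as follows.
This is only an illustration, not the test's code: wait_on_futex() is a
made-up name, and the test suite uses its own futex() wrapper. It assumes a
Linux system where SYS_futex is defined. Note that the FUTEX_WAIT timeout is
relative, so retrying after EINTR (as the test loop also does) waits for up
to the full timeout again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: sleep on *uaddr while it still holds 'expected',
 * for at most 'rel_timeout' per attempt.
 *
 * Returns 0 if woken by FUTEX_WAKE, otherwise the errno value:
 * ETIMEDOUT if nobody woke us in time, EAGAIN if *uaddr no longer
 * held 'expected' when the kernel checked it.
 */
static int wait_on_futex(int *uaddr, int expected,
			 const struct timespec *rel_timeout)
{
	assert(uaddr != NULL && rel_timeout != NULL);

	for (;;) {
		if (syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
			    rel_timeout, NULL, 0) == 0)
			return 0;
		if (errno != EINTR)
			return errno;
		/* Interrupted by a signal: retry with a fresh timeout. */
	}
}
```

A caller that treats ETIMEDOUT as a test failure, as the patch does, would
bump its failure counter and return instead of looping again.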


* Re: [PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart
From: Serge E. Hallyn @ 2009-07-09 20:00 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Serge Hallyn, Containers

Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> The robust futex test can hang if the kernel fails to properly set the robust
> list pointer. This currently happens during restart. The test should not
> hang and instead should report failure.
> 
> Use a timeout to ensure that hangs are caught and reported as failure.
> The timeout should return ETIMEDOUT. This limits the total amount of time
> checkpoint/restart can take so a suitable timeout is essential here.
> 
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Reported-by: Serge Hallyn <serue-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Hey Matt,

last month you sent out a (short) kernel patch for robust futexes.
Was that supposed to be enough to fully support c/r of robust futexes?

-serge


* Re: [PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart
From: Matt Helsley @ 2009-07-09 20:06 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers, Serge Hallyn

On Thu, Jul 09, 2009 at 03:00:40PM -0500, Serge E. Hallyn wrote:
> Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > The robust futex test can hang if the kernel fails to properly set the robust
> > list pointer. This currently happens during restart. The test should not
> > hang and instead should report failure.
> > 
> > Use a timeout to ensure that hangs are caught and reported as failure.
> > The timeout should return ETIMEDOUT. This limits the total amount of time
> > checkpoint/restart can take so a suitable timeout is essential here.
> > 
> > Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> > Reported-by: Serge Hallyn <serue-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> 
> Hey Matt,
> 
> last month you sent out a (short) kernel patch for robust futexes.
> Was that supposed to be enough to fully support c/r of robust futexes?
> 
> -serge

Yup. I need to get the update of that patch sent out but the old version
should still work I think.

Cheers,
	-Matt Helsley


* Re: [PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart
From: Oren Laadan @ 2009-07-09 20:14 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Containers, Serge Hallyn



Matt Helsley wrote:
> On Thu, Jul 09, 2009 at 03:00:40PM -0500, Serge E. Hallyn wrote:
>> Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
>>> The robust futex test can hang if the kernel fails to properly set the robust
>>> list pointer. This currently happens during restart. The test should not
>>> hang and instead should report failure.
>>>
>>> Use a timeout to ensure that hangs are caught and reported as failure.
>>> The timeout should return ETIMEDOUT. This limits the total amount of time
>>> checkpoint/restart can take so a suitable timeout is essential here.
>>>
>>> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
>>> Reported-by: Serge Hallyn <serue-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>> Hey Matt,
>>
>> last month you sent out a (short) kernel patch for robust futexes.
>> Was that supposed to be enough to fully support c/r of robust futexes?
>>
>> -serge
> 
> Yup. I need to get the update of that patch sent out but the old version
> should still work I think.
> 

Right. Let's get it in for v17.

Oren.


* Re: [PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart
From: Serge E. Hallyn @ 2009-07-09 20:58 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Containers

Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> The robust futex test can hang if the kernel fails to properly set the robust
> list pointer. This currently happens during restart. The test should not
> hang and instead should report failure.
> 
> Use a timeout to ensure that hangs are caught and reported as failure.

Doesn't seem to work though :)  The test still hangs on restart.

Not sure it's worth worrying about this, versus just getting the robust
futex restart fix into the kernel :)

thanks,
-serge


* Re: [PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart
From: Sukadev Bhattiprolu @ 2009-07-10  0:21 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers

Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
| Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
| > The robust futex test can hang if the kernel fails to properly set the robust
| > list pointer. This currently happens during restart. The test should not
| > hang and instead should report failure.
| > 
| > Use a timeout to ensure that hangs are caught and reported as failure.
| 
| Doesn't seem to work though :)  The test still hangs on restart.

I got a hang on restart, with the following backtrace (ckpt-v17-rc1 plus a
couple of bug fixes):

mktree        S f6a4bbe0     0 25126  25124 0x00000000
 f6589b00 00000086 00000001 f6a4bbe0 f6a4bd74 c3190160 f5e17e1c 011a6d85
 00000000 c302f680 ffffffea 007ee140 f5e17e1c 00000000 00000001 00000000
 c15fdbfc f5e17e00 f5e17e00 00000000 c1041af6 00000000 f5e17e00 00000000
Call Trace:
 [<c1041af6>] ? futex_wait_queue_me+0x94/0xa5
 [<c1041bfd>] ? futex_wait+0xf6/0x1e9
 [<c106300b>] ? generic_file_buffered_write+0x169/0x257
 [<c1042dd7>] ? do_futex+0x93/0xa01
 [<c101d867>] ? enqueue_entity+0xe/0x7e
 [<c1081787>] ? cache_alloc_refill+0x54/0x43e
 [<c106274a>] ? find_get_page+0x1d/0x7a
 [<c1064407>] ? filemap_fault+0xbb/0x320
 [<c107296c>] ? __do_fault+0x319/0x352
 [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
 [<c1073f6e>] ? handle_mm_fault+0x24e/0x508
 [<c1043846>] ? sys_futex+0x101/0x116
 [<c1351f46>] ? do_page_fault+0x1ff/0x27b
 [<c10027e8>] ? sysenter_do_call+0x12/0x26
mktree        S f642b750     0 25127  25124 0x00000000
 f6589b00 00000086 c15fcd3c f642b750 f642b8e4 c3170160 c1041e2f 011a6d7f
 ffffffff f6589b00 000005da 00000000 00000001 00000000 00000000 00000000
 f6500000 00000008 f66d5e7c f66d5f9c c108a797 00000000 f642b750 c1037c5c
Call Trace:
 [<c1041e2f>] ? futex_wake+0xb9/0xc3
 [<c108a797>] ? pipe_wait+0x4b/0x62
 [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
 [<c108afdf>] ? pipe_read+0x2c0/0x32d
 [<c1066aad>] ? get_page_from_freelist+0x284/0x2de
 [<c1084d7e>] ? do_sync_read+0xbf/0x100
 [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
 [<c10798ca>] ? page_add_new_anon_rmap+0x20/0x3b
 [<c1073ef8>] ? handle_mm_fault+0x1d8/0x508
 [<c1139499>] ? security_file_permission+0xc/0xd
 [<c1084cbf>] ? do_sync_read+0x0/0x100
 [<c10853f7>] ? vfs_read+0x81/0x102
 [<c1085787>] ? sys_read+0x3c/0x63
 [<c10027e8>] ? sysenter_do_call+0x12/0x26

| 
| Not sure it's worth worrying about this, versus just getting the robust
| futex restart fix into the kernel :)
| 
| thanks,
| -serge


* Re: [PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart
From: Matt Helsley @ 2009-07-10 23:34 UTC (permalink / raw)
  To: Sukadev Bhattiprolu; +Cc: Containers

On Thu, Jul 09, 2009 at 05:21:44PM -0700, Sukadev Bhattiprolu wrote:
> Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
> | Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> | > The robust futex test can hang if the kernel fails to properly set the robust
> | > list pointer. This currently happens during restart. The test should not
> | > hang and instead should report failure.
> | > 
> | > Use a timeout to ensure that hangs are caught and reported as failure.
> | 
> | Doesn't seem to work though :)  The test still hangs on restart.
> 
> I got a hang on restart, with following backtrace (ckpt-v17-rc1 plus couple
> of bug fixes)

Sorry, which fixes?

Perhaps this is the same problem that Serge was seeing.

> 
> mktree        S f6a4bbe0     0 25126  25124 0x00000000
>  f6589b00 00000086 00000001 f6a4bbe0 f6a4bd74 c3190160 f5e17e1c 011a6d85
>  00000000 c302f680 ffffffea 007ee140 f5e17e1c 00000000 00000001 00000000
>  c15fdbfc f5e17e00 f5e17e00 00000000 c1041af6 00000000 f5e17e00 00000000
> Call Trace:
>  [<c1041af6>] ? futex_wait_queue_me+0x94/0xa5
>  [<c1041bfd>] ? futex_wait+0xf6/0x1e9
>  [<c106300b>] ? generic_file_buffered_write+0x169/0x257
>  [<c1042dd7>] ? do_futex+0x93/0xa01
>  [<c101d867>] ? enqueue_entity+0xe/0x7e
>  [<c1081787>] ? cache_alloc_refill+0x54/0x43e
>  [<c106274a>] ? find_get_page+0x1d/0x7a
>  [<c1064407>] ? filemap_fault+0xbb/0x320
>  [<c107296c>] ? __do_fault+0x319/0x352
>  [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
>  [<c1073f6e>] ? handle_mm_fault+0x24e/0x508

This is what it should look like when a task on a futex is being woken. The
fault just means that the page backing the futex was paged out between
checkpoint and restart. In theory that's not a problem for the futex
code -- it's designed to handle faults. However, of course, it should
not cause a stack dump.

>  [<c1043846>] ? sys_futex+0x101/0x116
>  [<c1351f46>] ? do_page_fault+0x1ff/0x27b
>  [<c10027e8>] ? sysenter_do_call+0x12/0x26
> mktree        S f642b750     0 25127  25124 0x00000000
>  f6589b00 00000086 c15fcd3c f642b750 f642b8e4 c3170160 c1041e2f 011a6d7f
>  ffffffff f6589b00 000005da 00000000 00000001 00000000 00000000 00000000
>  f6500000 00000008 f66d5e7c f66d5f9c c108a797 00000000 f642b750 c1037c5c
> Call Trace:
>  [<c1041e2f>] ? futex_wake+0xb9/0xc3
>  [<c108a797>] ? pipe_wait+0x4b/0x62
>  [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
>  [<c108afdf>] ? pipe_read+0x2c0/0x32d
>  [<c1066aad>] ? get_page_from_freelist+0x284/0x2de
>  [<c1084d7e>] ? do_sync_read+0xbf/0x100
>  [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
>  [<c10798ca>] ? page_add_new_anon_rmap+0x20/0x3b
>  [<c1073ef8>] ? handle_mm_fault+0x1d8/0x508
>  [<c1139499>] ? security_file_permission+0xc/0xd
>  [<c1084cbf>] ? do_sync_read+0x0/0x100
>  [<c10853f7>] ? vfs_read+0x81/0x102
>  [<c1085787>] ? sys_read+0x3c/0x63
>  [<c10027e8>] ? sysenter_do_call+0x12/0x26

The first place to look, of course, is the futex restart blocks.

Thanks for the report. I'm kind of swamped with little things at the
moment so I'm going to have to put off deeper analysis of this for now.

Cheers,
	-Matt


* Re: [PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart
From: Sukadev Bhattiprolu @ 2009-07-11 20:56 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Containers

Matt Helsley [matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
| On Thu, Jul 09, 2009 at 05:21:44PM -0700, Sukadev Bhattiprolu wrote:
| > Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
| > | Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
| > | > The robust futex test can hang if the kernel fails to properly set the robust
| > | > list pointer. This currently happens during restart. The test should not
| > | > hang and instead should report failure.
| > | > 
| > | > Use a timeout to ensure that hangs are caught and reported as failure.
| > | 
| > | Doesn't seem to work though :)  The test still hangs on restart.
| > 
| > I got a hang on restart, with following backtrace (ckpt-v17-rc1 plus couple
| > of bug fixes)
| 
| Sorry, which fixes?

I was referring to informal versions of these two commits

	linux-cr: 3c60cd06509ae7b12db3176dabc5a8baff45341a
	user-cr: 768ee7c3a407b31f8b7202ce5395163dfe79893e

that I added on top of ckpt-v17-rc1.

| 
| Perhaps this is the same problem that Serge was seeing..

Ok.

