* On netfront accelerator add/remove watches
@ 2008-07-29  3:32 BVK Chaitanya
  2008-07-29 11:09 ` Neil Turton
  0 siblings, 1 reply; 5+ messages in thread
From: BVK Chaitanya @ 2008-07-29  3:32 UTC (permalink / raw)
  To: Xen-devel

Hi,


I see that the netfront_accelerator_add_watch and
netfront_accelerator_remove_watch functions are _not_ protected by
accelerator_mutex in accel.c.  Is there any specific reason for this?

I see that they sometimes get called twice (and result in BUG_ON) in 
very fast (20ms) domain suspend-resume cycles and I couldn't figure out 
how it is possible :-(


--
bvk-chaitanya

* Re: On netfront accelerator add/remove watches
  2008-07-29  3:32 On netfront accelerator add/remove watches BVK Chaitanya
@ 2008-07-29 11:09 ` Neil Turton
  2008-07-30  4:58   ` BVK Chaitanya
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Turton @ 2008-07-29 11:09 UTC (permalink / raw)
  To: BVK Chaitanya; +Cc: Xen-devel

Hi,

BVK Chaitanya wrote:
> I see that the netfront_accelerator_add_watch and
> netfront_accelerator_remove_watch functions are _not_ protected by
> accelerator_mutex in accel.c.  Is there any specific reason for this?

Yes.  These functions need to be synchronised by the callers.  Adding a
mutex here would ensure that they didn't execute at the same time, but
wouldn't impose any order on the calls.  This matters because add
followed by remove is different from remove followed by add.  The
callers need to decide which order they should be executed in.
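
To make the ordering point concrete, here is a minimal sketch in C.  It is
a toy model, not the netfront code: struct toy_vif and the toy_vif_*
helpers are hypothetical names invented for illustration, and the watch is
reduced to a single flag.  Both sequences run one call at a time, exactly
as they would under a mutex, yet they finish in different states:

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-device state; not the real netfront struct. */
struct toy_vif {
	bool watch_registered;
};

/* Toy versions of add/remove: each simply sets or clears the flag. */
static void toy_vif_add_watch(struct toy_vif *vif)
{
	vif->watch_registered = true;
}

static void toy_vif_remove_watch(struct toy_vif *vif)
{
	vif->watch_registered = false;
}

/* Add first, then remove: the watch ends up unregistered. */
static bool toy_vif_add_then_remove(void)
{
	struct toy_vif vif = { .watch_registered = false };

	toy_vif_add_watch(&vif);
	toy_vif_remove_watch(&vif);
	return vif.watch_registered;
}

/* Remove first, then add: the watch ends up registered. */
static bool toy_vif_remove_then_add(void)
{
	struct toy_vif vif = { .watch_registered = false };

	toy_vif_remove_watch(&vif);
	toy_vif_add_watch(&vif);
	return vif.watch_registered;
}
```

A lock inside the two helpers would not change either result, so it cannot
tell you whether the watch ends up registered.  Only the caller knows which
order is intended.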

The relevant call chains are as follows:

xenbus otherend_changed callback
  -> backend_changed [netfront.c]
    -> network_connect
      -> talk_to_backend
        -> netfront_accelerator_add_watch

xenbus suspend_cancel callback
  -> netfront_suspend_cancel
    -> netfront_accelerator_suspend_cancel
      -> netfront_accelerator_add_watch

xenbus suspend callback
  -> netfront_suspend
    -> netfront_accelerator_suspend
      -> netfront_accelerator_remove_watch

xenbus remove callback
  -> netfront_remove
    -> netfront_accelerator_call_remove
      -> netfront_accelerator_remove_watch

So the watch is only added/removed from a xenbus callback.  I think
these callbacks should be synchronised by xenbus.  Can someone confirm that?

> I see that they sometimes get called twice (and result in BUG_ON) in 
> very fast (20ms) domain suspend-resume cycles and I couldn't figure out 
> how it is possible :-(

Is that the BUG_ON in netfront_accelerator_add_watch?  One possible
explanation is that suspend_cancel is called and then otherend_changed
is called.  Can you add a printk to netfront_suspend_cancel to see if it
gets called just before the BUG_ON gets triggered?

Cheers, Neil.

* Re: On netfront accelerator add/remove watches
  2008-07-29 11:09 ` Neil Turton
@ 2008-07-30  4:58   ` BVK Chaitanya
  2008-07-30 15:17     ` Kieran Mansley
  0 siblings, 1 reply; 5+ messages in thread
From: BVK Chaitanya @ 2008-07-30  4:58 UTC (permalink / raw)
  To: Neil Turton; +Cc: Xen-devel

Neil Turton wrote:
> Hi,
> 
> BVK Chaitanya wrote:
>> I see that the netfront_accelerator_add_watch and
>> netfront_accelerator_remove_watch functions are _not_ protected by
>> accelerator_mutex in accel.c.  Is there any specific reason for this?
> 
> Yes.  These functions need to be synchronised by the callers.  Adding a
> mutex here would ensure that they didn't execute at the same time, but
> wouldn't impose any order on the calls.  This matters because add
> followed by remove is different from remove followed by add.  The
> callers need to decide which order they should be executed in.
> 
> So the watch is only added/removed from a xenbus callback.  I think
> these callbacks should be synchronised by xenbus.  Can someone confirm that?

OK. I understand it now.

> 
>> I see that they sometimes get called twice (and result in BUG_ON) in 
>> very fast (20ms) domain suspend-resume cycles and I couldn't figure out 
>> how it is possible :-(
> 
> Is that the BUG_ON in netfront_accelerator_add_watch?  One possible
> explanation is that suspend_cancel is called and then otherend_changed
> is called.  Can you add a printk to netfront_suspend_cancel to see if it
> gets called just before the BUG_ON gets triggered?
> 

Yes, the BUG_ON was from the netfront_accelerator_add_watch function.  I
think I found the problem: xen_suspend, which calls suspend_cancel, is not
serialized properly.

Under heavy load and very rapid suspend-resume cycles, multiple
suspend_cancel instances can be running simultaneously.


regards,
--
bvk-chaitanya

* Re: On netfront accelerator add/remove watches
  2008-07-30  4:58   ` BVK Chaitanya
@ 2008-07-30 15:17     ` Kieran Mansley
  2008-07-31 12:44       ` BVK Chaitanya
  0 siblings, 1 reply; 5+ messages in thread
From: Kieran Mansley @ 2008-07-30 15:17 UTC (permalink / raw)
  To: BVK Chaitanya; +Cc: Xen-devel, Neil Turton

[-- Attachment #1: Type: text/plain, Size: 1909 bytes --]

On Wed, 2008-07-30 at 10:28 +0530, BVK Chaitanya wrote:
> Neil Turton wrote:
> > Is that the BUG_ON in netfront_accelerator_add_watch?  One possible
> > explanation is that suspend_cancel is called and then otherend_changed
> > is called.  Can you add a printk to netfront_suspend_cancel to see if it
> > gets called just before the BUG_ON gets triggered?
> > 
> 
> Yes, the BUG_ON was from the netfront_accelerator_add_watch function.  I
> think I found the problem: xen_suspend, which calls suspend_cancel, is not
> serialized properly.
>
> Under heavy load and very rapid suspend-resume cycles, multiple
> suspend_cancel instances can be running simultaneously.

I'd be very surprised if that were the case; a lot more would go wrong if
suspend_cancel were running more than once simultaneously for the same
domain.

We think the bug is due to the suspend being called before the frontend
has reached XenbusStateConnected, then suspend_cancel restoring the
watch that wasn't there before, and then the frontend moving to
XenbusStateConnected and trying to set the watch again.
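
That sequence can be sketched as a small toy model in C.  This is only an
illustration under stated assumptions, not the driver code: struct toy_dev,
enum toy_state and the toy_dev_* functions are hypothetical names, the
model counts a double add where the real code would hit BUG_ON, and it
assumes that removing a watch which was never set is harmless.  The
check_state flag selects whether suspend_cancel looks at the device state
before re-adding the watch.

```c
#include <stdbool.h>

/* Hypothetical, simplified device states; not the real xenbus enum. */
enum toy_state { TOY_INITIALISING, TOY_CONNECTED };

struct toy_dev {
	enum toy_state state;
	bool watch_set;
	int double_adds;	/* where the real code would BUG_ON */
};

static void toy_dev_add_watch(struct toy_dev *dev)
{
	if (dev->watch_set) {
		dev->double_adds++;	/* watch added twice */
		return;
	}
	dev->watch_set = true;
}

static void toy_dev_remove_watch(struct toy_dev *dev)
{
	dev->watch_set = false;
}

/* Suspend always drops the watch, even if none was set yet. */
static void toy_dev_suspend(struct toy_dev *dev)
{
	toy_dev_remove_watch(dev);
}

/* Cancelled suspend: restore the watch, optionally only when connected. */
static void toy_dev_suspend_cancel(struct toy_dev *dev, bool check_state)
{
	if (!check_state || dev->state == TOY_CONNECTED)
		toy_dev_add_watch(dev);
}

/* The frontend reaches the connected state and sets its watch. */
static void toy_dev_connect(struct toy_dev *dev)
{
	dev->state = TOY_CONNECTED;
	toy_dev_add_watch(dev);
}

/* Replay: suspend before connect, cancel, then connect. */
static int toy_dev_run(bool check_state)
{
	struct toy_dev dev = { TOY_INITIALISING, false, 0 };

	toy_dev_suspend(&dev);
	toy_dev_suspend_cancel(&dev, check_state);
	toy_dev_connect(&dev);
	return dev.double_adds;
}
```

In this model toy_dev_run(false) reports one double add, while
toy_dev_run(true) reports none, because the cancelled suspend no longer
restores a watch that was never set.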

Here's a patch that should fix that problem.  Could you test and see if
it solves the problem you're seeing?  I've not been able to check it
myself as I'm unable to get a recent xen-unstable.hg that will build for
one reason or another today.

Keir: I don't know if you're tagging a linux-2.6.18-xen.hg tree for the
3.3.0 and 3.2.2 releases, but this fix should probably go into both if
you are.

Thanks

Kieran


diff -r 1d647ef26f3f drivers/xen/netfront/accel.c
--- a/drivers/xen/netfront/accel.c
+++ b/drivers/xen/netfront/accel.c
@@ -709,8 +709,9 @@ int netfront_accelerator_suspend_cancel(
 	 * accelerator, so no need to call accelerator_probe_new_vif()
 	 * directly here
 	 */
-	netfront_accelerator_add_watch(np);
- 	return 0;
+	if (dev->state == XenbusStateConnected)
+		netfront_accelerator_add_watch(np);
+	return 0;
 }
  
  


[-- Attachment #2: accel_watch_suspend_cancel --]
[-- Type: text/plain, Size: 510 bytes --]

Don't set accel watch on suspend_cancel unless we're in a state where it has meaning


* Re: On netfront accelerator add/remove watches
  2008-07-30 15:17     ` Kieran Mansley
@ 2008-07-31 12:44       ` BVK Chaitanya
  0 siblings, 0 replies; 5+ messages in thread
From: BVK Chaitanya @ 2008-07-31 12:44 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Xen-devel, Neil Turton

Kieran Mansley wrote:
> On Wed, 2008-07-30 at 10:28 +0530, BVK Chaitanya wrote:
>> Under heavy load and very rapid suspend-resume cycles, multiple
>> suspend_cancel instances can be running simultaneously.
> 
> I'd be very surprised if that were the case; a lot more would go wrong if
> suspend_cancel were running more than once simultaneously for the same
> domain.
> 
> We think the bug is due to the suspend being called before the frontend
> has reached XenbusStateConnected, then suspend_cancel restoring the
> watch that wasn't there before, and then the frontend moving to
> XenbusStateConnected and trying to set the watch again.
> 
> Here's a patch that should fix that problem.  Could you test and see if
> it solves the problem you're seeing?  I've not been able to check it
> myself as I'm unable to get a recent xen-unstable.hg that will build for
> one reason or another today.


Yeah, I will test with your patch and let you know.



--
bvk-chaitanya
