Distributed Replicated Block Device (DRBD) development
* [Drbd-dev] too small timeout in drbdsetup
@ 2008-07-14 17:24 syrius.ml
  2008-08-04 18:37 ` syrius.ml
  0 siblings, 1 reply; 9+ messages in thread
From: syrius.ml @ 2008-07-14 17:24 UTC
  To: drbd-dev


Hi,

As previously reported at
http://thread.gmane.org/gmane.linux.kernel.drbd.devel/330, I also get
the error message.

Looking at
http://git.drbd.org/?p=drbd-8.0.git;a=blob;f=user/drbdsetup.c;h=0bca7c1c773bcbd1c2ed6781062396ed15e77e9c;hb=HEAD#l1919
and
http://git.drbd.org/?p=drbd-8.2.git;a=blob;f=user/drbdsetup.c;h=3868f1a18f4cda80cad5b0b05aa6f2348755dedd;hb=HEAD

it seems the timeout is still too low (at least for me).

I fixed my problem by increasing the timeout to 5 s.

To reproduce the bug, I ran several "drbdsetup disk" commands one after the
other in a script.

(In fact, the bug was first triggered by the heartbeat DRBD OCF script.)

What do you think would be the best change to make?
Increase the timeout?
Why not use NL_TIME (12000), as the other DRBD calls do?

Thanks

-- 

* Re: [Drbd-dev] too small timeout in drbdsetup
  2008-07-14 17:24 [Drbd-dev] too small timeout in drbdsetup syrius.ml
@ 2008-08-04 18:37 ` syrius.ml
  2008-08-05 11:35   ` Philipp Reisner
  2008-08-05 12:54   ` Graham, Simon
  0 siblings, 2 replies; 9+ messages in thread
From: syrius.ml @ 2008-08-04 18:37 UTC
  To: drbd-dev

Sorry to insist, but 8.0.13 is on its way and you haven't responded on
this subject.
Many people have to make the change by hand and recompile, and
distributions might add their own patch before releasing. What
do you think about this?

TIA

-- 

* Re: [Drbd-dev] too small timeout in drbdsetup
  2008-08-04 18:37 ` syrius.ml
@ 2008-08-05 11:35   ` Philipp Reisner
  2008-08-05 17:15     ` syrius.ml
  2008-08-12  9:58     ` Nikola Ciprich
  2008-08-05 12:54   ` Graham, Simon
  1 sibling, 2 replies; 9+ messages in thread
From: Philipp Reisner @ 2008-08-05 11:35 UTC
  To: drbd-dev

Hi,

You have an issue with the 500 ms timeout in this function, right?

void ensure_drbd_driver_is_present(void)
{
	struct drbd_tag_list *tl;
	char buffer[4096];
	int sk_nl, rr;

	sk_nl = open_cn();
	/* Might print:
	   Missing privileges? You should run this as root.
	   Connector module not loaded? try 'modprobe cn'. */
	if (sk_nl < 0) exit(20);

	tl = create_tag_list(2);
	add_tag(tl, TT_END, NULL, 0); // close the tag list

	tl->drbd_p_header->packet_type = P_get_state;
	tl->drbd_p_header->drbd_minor = 0;
	tl->drbd_p_header->flags = 0;

	rr = call_drbd(sk_nl, tl, (struct nlmsghdr*)buffer, 4096, 500);
	/* Might print: (after 500ms)
	   No response from the DRBD driver! Is the module loaded? */
	close_cn(sk_nl);
	if (rr == -2) exit(20);
}

We do not experience any issues with the 500 ms in our setups, and reports
of such an issue are rather rare. Could you give a more detailed description
of the conditions under which you can trigger this?

I guess we will add an option to drbdsetup then, and make it settable
in the global section of drbd.conf.

I do not want to increase it for all users, since it seems to affect only
a very small part of our user base.
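
A minimal sketch of how such an option value might be parsed, assuming the
timeout is passed to drbdsetup as a string in milliseconds. The names
driver_check_timeout_ms() and DRIVER_CHECK_DEFAULT_MS are hypothetical and do
not exist in drbdsetup; only the 500 ms default is taken from the function
above.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical name; 500 ms is the value hard-coded in the check above. */
#define DRIVER_CHECK_DEFAULT_MS 500

/*
 * Parse a user-supplied timeout in milliseconds.
 * Returns the parsed value, or DRIVER_CHECK_DEFAULT_MS when the argument is
 * missing, not a plain decimal number, negative, or out of range for int.
 */
int driver_check_timeout_ms(const char *arg)
{
	char *end;
	long val;

	if (arg == NULL || *arg == '\0')
		return DRIVER_CHECK_DEFAULT_MS;

	errno = 0;
	val = strtol(arg, &end, 10);
	if (errno != 0 || *end != '\0' || val < 0 || val > INT_MAX)
		return DRIVER_CHECK_DEFAULT_MS;

	return (int)val;
}
```

The result could then replace the literal 500 in the call_drbd() call, so
that existing setups keep today's behaviour unless they opt in to a longer
wait.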

-Phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

* RE: [Drbd-dev] too small timeout in drbdsetup
  2008-08-04 18:37 ` syrius.ml
  2008-08-05 11:35   ` Philipp Reisner
@ 2008-08-05 12:54   ` Graham, Simon
  1 sibling, 0 replies; 9+ messages in thread
From: Graham, Simon @ 2008-08-05 12:54 UTC
  To: Philipp Reisner, drbd-dev

> You have an issue with the 500 ms timeout in this function, right?
>
> void ensure_drbd_driver_is_present(void)
> ...
> We do not experience any issues with the 500 ms in our setups, and
> reports of such an issue are rather rare. Could you give a more detailed
> description of the conditions under which you can trigger this?
>

FWIW, we see this occasionally in our environment too - usually when the
system is very busy; we're running DRBD in Dom0 of a Xen environment and
when the system is busy, the hypervisor can 'steal' time from Dom0 -
this translates to wall time moving forward with no chance to actually
execute code which can cause the 500ms timer to expire even when the
DRBD module is present.
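
The effect described above can be illustrated with a small sketch that
compares elapsed wall-clock time with the CPU time a process actually
consumed. It is only an analogy: a sleep stands in for the time taken away
by the hypervisor, and the names struct elapsed and measure_idle_gap() are
made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* Elapsed wall-clock and CPU time, both in milliseconds. */
struct elapsed {
	long long wall_ms;
	long long cpu_ms;
};

/* Difference b - a in whole milliseconds. */
static long long diff_ms(const struct timespec *a, const struct timespec *b)
{
	long long ns = (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
		     + (b->tv_nsec - a->tv_nsec);
	return ns / 1000000LL;
}

/*
 * Sleep for sleep_ms and report how much wall-clock time passed versus how
 * much CPU time this process consumed in the meantime. The sleep is only a
 * stand-in for a period in which the process gets no chance to run.
 */
struct elapsed measure_idle_gap(long sleep_ms)
{
	struct timespec w0, w1, c0, c1;
	struct timespec req = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };
	struct elapsed e;

	clock_gettime(CLOCK_MONOTONIC, &w0);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);
	nanosleep(&req, NULL);
	clock_gettime(CLOCK_MONOTONIC, &w1);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);

	e.wall_ms = diff_ms(&w0, &w1);
	e.cpu_ms = diff_ms(&c0, &c1);
	return e;
}
```

A timeout measured against the wall clock keeps counting through such a gap,
which is why a 500 ms limit can expire although almost no code has run.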

We see this timeout once or twice a day (in about 2500hrs of total run
time) so it's not terribly prevalent but it's enough to cause issues. 

I must admit (-blush-) that I simply commented out the call to
ensure_drbd_driver_is_present() in our version since I _know_ that the
module is always loaded. I think having a command line option to
drbdsetup that lengthens the timeout or disables the check altogether
would be good.

BTW: There is one other slightly annoying side effect of the check --
every time drbdsetup is run, you get an unexpected event reported if you
are running 'drbdsetup monitor' -- not terribly bad but it confuses
folks debugging issues...

Simon


* Re: [Drbd-dev] too small timeout in drbdsetup
  2008-08-05 11:35   ` Philipp Reisner
@ 2008-08-05 17:15     ` syrius.ml
  2008-08-06  8:53       ` Lars Ellenberg
  2008-08-12  9:58     ` Nikola Ciprich
  1 sibling, 1 reply; 9+ messages in thread
From: syrius.ml @ 2008-08-05 17:15 UTC
  To: Philipp Reisner; +Cc: drbd-dev

Philipp Reisner <philipp.reisner@linbit.com> writes:

> We do not experience any issues with the 500 ms in our setups, and reports
> of such an issue are rather rare. Could you give a more detailed description
> of the conditions under which you can trigger this?

I have several DRBD devices (4 at the moment).
Most of the time it is triggered by the heartbeat RA scripts when the
devices are set up one after the other.
It can also happen if I run "echo r1 r2 r3 r4 | xargs -n 1 drbdadm
primary", for example.

> I guess we will add an option to drbdsetup then, and make it settable
> in the global section of drbd.conf.

Sounds good.

> I do not want to increase it for all users, since it seems to affect only
> a very small part of our user base.

OK.

-- 

* Re: [Drbd-dev] too small timeout in drbdsetup
  2008-08-05 17:15     ` syrius.ml
@ 2008-08-06  8:53       ` Lars Ellenberg
  0 siblings, 0 replies; 9+ messages in thread
From: Lars Ellenberg @ 2008-08-06  8:53 UTC
  To: drbd-dev

On Tue, Aug 05, 2008 at 07:15:20PM +0200, syrius.ml@no-log.org wrote:
> Philipp Reisner <philipp.reisner@linbit.com> writes:
>
> > We do not experience any issues with the 500 ms in our setups, and reports
> > of such an issue are rather rare. Could you give a more detailed description
> > of the conditions under which you can trigger this?
>
> I have several DRBD devices (4 at the moment).
> Most of the time it is triggered by the heartbeat RA scripts when the
> devices are set up one after the other.
> It can also happen if I run "echo r1 r2 r3 r4 | xargs -n 1 drbdadm
> primary", for example.
>
> > I guess we will add an option to drbdsetup then, and make it settable
> > in the global section of drbd.conf.
>
> Sounds good.
>
> > I do not want to increase it for all users, since it seems to affect only
> > a very small part of our user base.

It may well affect all users.
Sometimes we do something that involves I/O from the context of the
cqueue thread. We probably should not.

If there are too few of those threads (the number typically depends on the
number of cores), and all of them are busy (for a second or so), any
communication attempt with a 500 ms timeout will time out while they are
still processing the previous command.
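
To illustrate why a late reply looks the same as a missing driver, here is a
minimal sketch of a bounded wait on a file descriptor. It assumes the timeout
is enforced with something like poll(); how call_drbd() actually waits is not
shown in this thread. wait_for_reply() and demo_late_reply() are hypothetical
names, and a pipe stands in for the netlink socket.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if data is ready, 0 on timeout, -1 on error.
 */
int wait_for_reply(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc < 0)
		return -1;
	if (rc == 0)
		return 0;	/* nothing arrived in time */
	return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * The first wait times out because the "reply" has not been written yet;
 * the second succeeds once it has. Returns 0 if both waits behaved that way.
 */
int demo_late_reply(void)
{
	int p[2];
	int first, second;

	if (pipe(p) != 0)
		return -1;

	first = wait_for_reply(p[0], 20);	/* sender is still "busy" */
	if (write(p[1], "x", 1) != 1) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	second = wait_for_reply(p[0], 20);	/* reply is now pending */

	close(p[0]);
	close(p[1]);
	return (first == 0 && second == 1) ? 0 : -1;
}
```

From the caller's side the first wait cannot tell "no driver" apart from
"driver busy"; only a longer timeout or a different kind of check can.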

I'd suggest replacing that check function with a simple stat on
/proc/drbd. The netlink path will be exercised by the next command
anyway, with a much larger timeout. The extra call with a short timeout
at the beginning was just a convenience, to avoid running into the full
timeout and only then realizing that the module is missing.
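
A minimal sketch of such a check, assuming that the mere existence of the
proc entry is taken as "module loaded". The function name
driver_seems_loaded() is hypothetical; the path is passed in so the helper
can be tried against any file.

```c
#include <sys/stat.h>

/*
 * Returns 1 if proc_path exists (e.g. "/proc/drbd"), 0 otherwise.
 * No netlink traffic is involved, so no timeout can expire here.
 */
int driver_seems_loaded(const char *proc_path)
{
	struct stat sb;

	return stat(proc_path, &sb) == 0;
}
```

A caller would use driver_seems_loaded("/proc/drbd") and exit with the usual
"Is the module loaded?" message when it returns 0.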

The only drawback I can see to this approach is that if someone now loads
a pre-DRBD-8 module (0.7, 0.6) but uses the DRBD-8 userland, the driver
would appear to be loaded, and drbdsetup would then still run into the
full timeout of the respective commands.
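
If that case ever mattered, one possible mitigation would be to read the
first line of /proc/drbd and compare the major version. The sketch below
assumes that line has the form "version: 8.0.12 (api:86/proto:86)"; the
format for older modules is an assumption, and both function names are
hypothetical.

```c
#include <stdio.h>

/*
 * Extract the major version from a line such as
 * "version: 8.0.12 (api:86/proto:86)".
 * Returns -1 if the line does not match that form.
 */
int drbd_major_from_version_line(const char *line)
{
	int major;

	if (line == NULL || sscanf(line, "version: %d.", &major) != 1)
		return -1;
	return major;
}

/* Returns 1 if the module's major version equals userland_major, else 0. */
int module_matches_userland(const char *line, int userland_major)
{
	return drbd_major_from_version_line(line) == userland_major;
}
```

The line itself could be obtained with fgets() on /proc/drbd; that part is
left out so the parsing can be tried without a loaded module.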

Yes, stupid things do happen.
But how much do we care?

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :

* Re: [Drbd-dev] too small timeout in drbdsetup
  2008-08-05 11:35   ` Philipp Reisner
  2008-08-05 17:15     ` syrius.ml
@ 2008-08-12  9:58     ` Nikola Ciprich
  1 sibling, 0 replies; 9+ messages in thread
From: Nikola Ciprich @ 2008-08-12  9:58 UTC
  To: Philipp Reisner; +Cc: nikola.ciprich, drbd-dev

Hi Philipp,
we are experiencing this problem as well (also with multiple DRBD devices
operated by heartbeat), so I also vote for either increasing the timeout
value or making it optional :-)
Thanks a lot!
nik

-- 
-------------------------------------
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------

* [Drbd-dev] too small timeout in drbdsetup
@ 2008-08-14  7:57 Jerome Martin
  2008-08-16 15:27 ` Lars Ellenberg
  0 siblings, 1 reply; 9+ messages in thread
From: Jerome Martin @ 2008-08-14  7:57 UTC
  To: drbd-dev

Hello,

I use DRBD in an HA cluster environment, where at times several
drbdadm/drbdsetup commands run in parallel (notably for monitoring
DRBD resource status, but also during cluster startup, node reboots,
etc.).

The DRBD minor count on each node ranges from 2 to 10, and on
all nodes with more than 3 DRBD resources I am experiencing very
annoying issues with that timeout.

Please note that it took me some time to realize that the fault was not
coming from my resource agents or from somewhere else, and I expect it to
be the same for many other users after me if this does not get changed
and/or documented in the upstream DRBD packages. Of course, I now have to
maintain a custom DRBD package in my local repository just to get the
trivial patch below in.

Philipp, Lars, please consider this request seriously and give it some
thought. I really think this is an issue that needs to be addressed,
because:

1/ It concerns a use case that is squarely within your target user base
(HA / clustering).
2/ It is an issue that can stay dormant for a very long time until a
user decides to add more DRBD devices to his setup, and then bite him
really badly, even though it is now reported and trivial to fix (and even
though I believe one should always test at a bigger scale than one's
actual initial need before going to production).
3/ When it is faced by a cluster admin who is still learning and who at
the same time needs to deal with many other issues, it can be very hard
to debug what is wrong when it gets triggered, even though it seems
trivial.
4/ It adds robustness to DRBD in many usage scenarios, and I believe
this is what DRBD is about: being robust. I'd be disappointed not to see
DRBD go for the "safest and most robust" choices, as I guess a large
part of the user community would be.

I know you might feel I am over-emphasizing this tiny little detail, but
after many talks about this on the linux-ha IRC channels, several
support sessions given to users/developers of my own clustering OSS
project, and reports I had from different people in different
usage scenarios unrelated to the above, I really feel this might have a
larger impact than one would initially imagine when looking at it
coldly. I also fear that some of the affected people are not reporting
their experience because they do not understand what is actually
happening, and might turn their back on a DRBD-based solution for the
wrong reasons (the "never got it to run stable, so dumped it" type of
story).

Note: the actual timeout value that I feel is required to solve 99.9% of
the occurrences of the issue at hand is 1 s, but I feel that for consistency
it should be set to NL_TIME and be made configurable. But YMMV,
and I see no reason not to stay on the safe side with this one.

Thanks for your most valuable time,
Jerome

diff -Naurd drbd8-8.0.12~/user/drbdsetup.c drbd8-8.0.12/user/drbdsetup.c
--- drbd8-8.0.12~/user/drbdsetup.c	2008-04-08 21:05:56.000000000 +0200
+++ drbd8-8.0.12/user/drbdsetup.c	2008-08-04 16:11:11.000000000 +0200
@@ -1839,8 +1839,8 @@
 	tl->drbd_p_header->drbd_minor = 0;
 	tl->drbd_p_header->flags = 0;
 
-	rr = call_drbd(sk_nl, tl, (struct nlmsghdr*)buffer, 4096, 500);
-	/* Might print: (after 500ms)
+	rr = call_drbd(sk_nl, tl, (struct nlmsghdr*)buffer, 4096, NL_TIME);
+	/* Might print: (after NL_TIME)
 	   No response from the DRBD driver! Is the module loaded? */
 	close_cn(sk_nl);
 	if (rr == -2) exit(20);

Best Regards,
-- 
Jérôme Martin | LongPhone
Responsable Architecture Réseau
122, rue la Boetie | 75008 Paris
Tel :  +33 (0)1 56 26 28 44
Fax : +33 (0)1 56 26 28 45
Mail : jmartin@longphone.fr
Web : www.longphone.com <http://www.longphone.com>


* Re: [Drbd-dev] too small timeout in drbdsetup
  2008-08-14  7:57 Jerome Martin
@ 2008-08-16 15:27 ` Lars Ellenberg
  0 siblings, 0 replies; 9+ messages in thread
From: Lars Ellenberg @ 2008-08-16 15:27 UTC
  To: drbd-dev

On Thu, Aug 14, 2008 at 09:57:33AM +0200, Jerome Martin wrote:
> Philipp, Lars, please consider this request seriously and give it some
> thought. I really think this is an issue that needs to be addressed,
> because:

[a few valid points]

> I know you might feel I am over-emphasizing this tiny little detail, but

Absolutely not.

I had always just used a "stat" on /proc/drbd there,
and that is what we will do from now on.

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list   --   I'm subscribed
