* [DRIVER SUBMISSION] DRBD wants to go mainline
@ 2007-07-21 20:38 Lars Ellenberg
2007-07-21 21:17 ` Jan Engelhardt
` (4 more replies)
0 siblings, 5 replies; 42+ messages in thread
From: Lars Ellenberg @ 2007-07-21 20:38 UTC (permalink / raw)
To: Jens Axboe, Andrew Morton, lkml; +Cc: drbd-user
DRBD wants to go mainline.
please have a look at the "for-linus" branch of
git://git.drbd.org/home/git/linux-drbd.git.
Longer story:
Hi there,
for lkml readers who don't know about DRBD yet:
DRBD is the distributed replicated block device,
a stacked block device driver as developed mainly
by Lars Ellenberg and Philipp Reisner
both employed at Linbit.
We implement shared-disk semantics in a shared-nothing cluster.
yes, we even support cluster file systems (OCFS2, GFS2)
[support for this is somewhat clumsy, still,
mainly userland/setup/convenience issues.
but that will improve.]
to actually use it you will need some userland tools
(drbdadm, drbdmeta, drbdsetup).
most unfortunate current limitations:
we support only 2 nodes directly;
but we are working on that, too :)
I'm going to give a presentation/BoF about it
at linux-conf.eu in September.
Currently we are at "stable version 8.0.y" (y == 4 right now,
8.0.5 to be released within a few days.
we have a project home page http://www.drbd.org
we have some mention on the linux-HA project http://www.linux-ha.org/DRBD
(both are slightly out-dated, unfortunately...)
our subversion repository (yeah, sorry, there has been no git back then)
is at http://svn.drbd.org/drbd/branches/drbd-8.0
and, we have a user base...
[drbd-user readers who are in a position to do so:
please help us get this into mainline! ]
Now, finally, we have a git repository of the kernel module part, too:
git://git.drbd.org/home/git/linux-drbd.git
it has two branches currently,
master, which tracks ...torvalds/linux-2.6.git, and
for-linus, which tracks the cleaned up patch meant for inclusin in
mainline.
so if interessted please do
git fetch git://git.drbd.org/home/git/linux-drbd.git +for-linus:drbd-8.0
to get a local branch named drbd-8.0 where you can inspect it.
see below for a diffstat of master..for-linus.
Jens, Andrew, anyone: please review,
and give me advice how to proceed from here.
technically, would you prefer
frequent rebase of the "for-linus" branch,
or rather merges back and forth?
Cheers,
Lars
The only not drbd specific files touched
are listed explicitly here:
=====================
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index a4a3119..76f84bc 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -272,6 +272,8 @@ config BLK_DEV_CRYPTOLOOP
instead, which can be configured to be on-disk compatible with the
cryptoloop device.
+source "drivers/block/drbd/Kconfig"
+
config BLK_DEV_NBD
tristate "Network block device support"
depends on NET
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 819c829..b27e16a 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_BLK_DEV_UB) += ub.o
obj-$(CONFIG_XEN_BLKDEV_FRONTEND) += xen-blkfront.o
obj-$(CONFIG_LGUEST_GUEST) += lguest_blk.o
+obj-$(CONFIG_BLK_DEV_DRBD) += drbd/
diff --git a/include/linux/major.h b/include/linux/major.h
index 0cb9805..2dbd7c1 100644
--- a/include/linux/major.h
+++ b/include/linux/major.h
@@ -145,6 +145,8 @@
#define UNIX98_PTY_MAJOR_COUNT 8
#define UNIX98_PTY_SLAVE_MAJOR (UNIX98_PTY_MASTER_MAJOR+UNIX98_PTY_MAJOR_COUNT)
+#define DRBD_MAJOR 147
+
#define RTF_MAJOR 150
#define RAW_MAJOR 162
=====================
(yes, block major 147 is assigned by LANANA,
as documented in Documentation/devices.txt already)
git-diff master..for-linus | diffstat
drivers/block/Kconfig | 2
drivers/block/Makefile | 1
drivers/block/drbd/Kconfig | 32
drivers/block/drbd/Makefile | 7
drivers/block/drbd/drbd_actlog.c | 1456 ++++++++++++++++
drivers/block/drbd/drbd_bitmap.c | 1184 +++++++++++++
drivers/block/drbd/drbd_buildtag.c | 6
drivers/block/drbd/drbd_int.h | 1894 ++++++++++++++++++++
drivers/block/drbd/drbd_main.c | 3249 +++++++++++++++++++++++++++++++++++
drivers/block/drbd/drbd_nl.c | 1847 ++++++++++++++++++++
drivers/block/drbd/drbd_proc.c | 267 ++
drivers/block/drbd/drbd_receiver.c | 3356 +++++++++++++++++++++++++++++++++++++
drivers/block/drbd/drbd_req.c | 1169 ++++++++++++
drivers/block/drbd/drbd_req.h | 320 +++
drivers/block/drbd/drbd_strings.c | 106 +
drivers/block/drbd/drbd_worker.c | 1046 +++++++++++
drivers/block/drbd/drbd_wrappers.h | 144 +
drivers/block/drbd/lru_cache.c | 370 ++++
drivers/block/drbd/lru_cache.h | 147 +
include/linux/drbd.h | 284 +++
include/linux/drbd_config.h | 55
include/linux/drbd_limits.h | 124 +
include/linux/drbd_nl.h | 105 +
include/linux/drbd_tag_magic.h | 78
include/linux/major.h | 2
25 files changed, 17251 insertions(+)
--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
^ permalink raw reply related [flat|nested] 42+ messages in thread* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 20:38 [DRIVER SUBMISSION] DRBD wants to go mainline Lars Ellenberg @ 2007-07-21 21:17 ` Jan Engelhardt 2007-07-21 22:43 ` Lars Ellenberg 2007-07-21 21:34 ` Jesper Juhl ` (3 subsequent siblings) 4 siblings, 1 reply; 42+ messages in thread From: Jan Engelhardt @ 2007-07-21 21:17 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andrew Morton, lkml, drbd-user On Jul 21 2007 22:38, Lars Ellenberg wrote: > >We implement shared-disk semantics in a shared-nothing cluster. If nothing is shared, the disk is not shared, but got shared-disk semantics? A little confusing. Jan -- ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 21:17 ` Jan Engelhardt @ 2007-07-21 22:43 ` Lars Ellenberg 2007-07-22 9:06 ` Jan Engelhardt 2007-07-27 18:46 ` Pavel Machek 0 siblings, 2 replies; 42+ messages in thread From: Lars Ellenberg @ 2007-07-21 22:43 UTC (permalink / raw) To: Jan Engelhardt; +Cc: Jens Axboe, Andrew Morton, lkml On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote: > > On Jul 21 2007 22:38, Lars Ellenberg wrote: > > > >We implement shared-disk semantics in a shared-nothing cluster. > > If nothing is shared, the disk is not shared, but got shared-disk > semantics? A little confusing. Think of it as RAID1 over TCP. Typically you have one Node in Primary, the other as Secondary, replication target only. But you can also have both Active, for use with a cluster file system. So the semantics are like you have two (to come: N) nodes and a shared disk. only that there is not one shared disk, but two (N) replicated ones. btw, regarding linux kernel CodingStyle issues. scripts/checkpatch.pl reports 2000+ lines of complaints. I'm working on it now, it is mostly whitespace (lame excuse: phil grew up with emacs and gnu coding style, sorry for that). -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 22:43 ` Lars Ellenberg @ 2007-07-22 9:06 ` Jan Engelhardt 2007-07-22 14:03 ` Lars Ellenberg 2007-07-27 18:46 ` Pavel Machek 1 sibling, 1 reply; 42+ messages in thread From: Jan Engelhardt @ 2007-07-22 9:06 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andrew Morton, lkml On Jul 22 2007 00:43, Lars Ellenberg wrote: >On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote: >> On Jul 21 2007 22:38, Lars Ellenberg wrote: >> > >> >We implement shared-disk semantics in a shared-nothing cluster. >> >> If nothing is shared, the disk is not shared, but got shared-disk >> semantics? A little confusing. > >Think of it as RAID1 over TCP. And what does it do better than raid1-over-NBD? (Which is already N-disk, and, logically, seems to support cluster filesystems) >Typically you have one Node in Primary, the other as Secondary, >replication target only. >But you can also have both Active, for use with a cluster file system. > >So the semantics are like you have >two (to come: N) nodes and a shared disk. >only that there is not one shared disk, >but two (N) replicated ones. > >btw, regarding linux kernel CodingStyle issues. >scripts/checkpatch.pl reports 2000+ lines of complaints. >I'm working on it now, it is mostly whitespace (lame excuse: phil grew >up with emacs and gnu coding style, sorry for that). > Jan -- ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 9:06 ` Jan Engelhardt @ 2007-07-22 14:03 ` Lars Ellenberg 0 siblings, 0 replies; 42+ messages in thread From: Lars Ellenberg @ 2007-07-22 14:03 UTC (permalink / raw) To: Jan Engelhardt; +Cc: Jens Axboe, Andrew Morton, lkml On Sun, Jul 22, 2007 at 11:06:59AM +0200, Jan Engelhardt wrote: > > On Jul 22 2007 00:43, Lars Ellenberg wrote: > >On Sat, Jul 21, 2007 at 11:17:43PM +0200, Jan Engelhardt wrote: > >> On Jul 21 2007 22:38, Lars Ellenberg wrote: > >> > > >> >We implement shared-disk semantics in a shared-nothing cluster. > >> > >> If nothing is shared, the disk is not shared, but got shared-disk > >> semantics? A little confusing. > > > >Think of it as RAID1 over TCP. > > And what does it do better than raid1-over-NBD? (Which is already N-disk, > and, logically, seems to support cluster filesystems) DRBD has built-in logic to track which copy of the data is the most recent. DRBD has resource-level-fencing in place (though we can improve on that). DRBD deals with concurrent writes of the participating nodes correctly. "N-disk" is not the question, you can stack drbd on top of md, lvm, anything. N-nodes is the interessting thing. -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 22:43 ` Lars Ellenberg 2007-07-22 9:06 ` Jan Engelhardt @ 2007-07-27 18:46 ` Pavel Machek 2007-07-30 19:35 ` Lars Ellenberg 1 sibling, 1 reply; 42+ messages in thread From: Pavel Machek @ 2007-07-27 18:46 UTC (permalink / raw) To: Lars Ellenberg, Jan Engelhardt, Jens Axboe, Andrew Morton, lkml Hi! > > >We implement shared-disk semantics in a shared-nothing cluster. > > > > If nothing is shared, the disk is not shared, but got shared-disk > > semantics? A little confusing. > > Think of it as RAID1 over TCP. > Typically you have one Node in Primary, the other as Secondary, > replication target only. I guess TCP means people should not swap over it? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-27 18:46 ` Pavel Machek @ 2007-07-30 19:35 ` Lars Ellenberg 2007-07-30 19:41 ` Jan Engelhardt 0 siblings, 1 reply; 42+ messages in thread From: Lars Ellenberg @ 2007-07-30 19:35 UTC (permalink / raw) To: Pavel Machek; +Cc: Jan Engelhardt, Jens Axboe, Andrew Morton, lkml On Fri, Jul 27, 2007 at 06:46:17PM +0000, Pavel Machek wrote: > Hi! > > > > >We implement shared-disk semantics in a shared-nothing cluster. > > > > > > If nothing is shared, the disk is not shared, but got shared-disk > > > semantics? A little confusing. > > > > Think of it as RAID1 over TCP. > > Typically you have one Node in Primary, the other as Secondary, > > replication target only. > > I guess TCP means people should not swap over it? people should not swap over DRBD, because it would not be useful. DRBD is to have applicaction data available for more than one node, without a single point of failure; when the node the app currently runs on crashes, the data is there so some other node can take over from there. what would you do with the swap of a crashed node, apart from, well, crash analysis? you don't need it to be highly available for that. besides, yes, when you have network io in the block io path, with linux (and probably most other OSes), there is the posibility of vm starvation or even deadlock. situation is improving, though - iirc, there has been talk to have a emergency memory pool very low level in the network stack, and some special "I am doing block-io" socket flag; what is the status of that, anyone? I belive DRBD behaves very good even in oom situations. we considered these things from the very beginning. that said, I did not see a DRBD cluster hanging hard in OOM, yet. and we do operate quite a few busy database/mail/web/file/application/iSCSI/whatever clusters. Lars ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-30 19:35 ` Lars Ellenberg @ 2007-07-30 19:41 ` Jan Engelhardt 0 siblings, 0 replies; 42+ messages in thread From: Jan Engelhardt @ 2007-07-30 19:41 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Pavel Machek, Jens Axboe, Andrew Morton, lkml On Jul 30 2007 21:35, Lars Ellenberg wrote: >On Fri, Jul 27, 2007 at 06:46:17PM +0000, Pavel Machek wrote: >> Hi! >> >> > > >We implement shared-disk semantics in a shared-nothing cluster. >> > > >> > > If nothing is shared, the disk is not shared, but got shared-disk >> > > semantics? A little confusing. >> > >> > Think of it as RAID1 over TCP. Getting back at that... How does it compare (in terms of "do we need it?") to RAIF1 over NFS? Jan -- ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 20:38 [DRIVER SUBMISSION] DRBD wants to go mainline Lars Ellenberg 2007-07-21 21:17 ` Jan Engelhardt @ 2007-07-21 21:34 ` Jesper Juhl 2007-07-21 23:50 ` Andi Kleen ` (2 subsequent siblings) 4 siblings, 0 replies; 42+ messages in thread From: Jesper Juhl @ 2007-07-21 21:34 UTC (permalink / raw) To: Jens Axboe, Andrew Morton, lkml, drbd-user On 21/07/07, Lars Ellenberg <lars@linbit.com> wrote: > > DRBD wants to go mainline. > please have a look at the "for-linus" branch of > git://git.drbd.org/home/git/linux-drbd.git. > I just fetched yourt branch and had a (very) quick look. Some comments. Try running your patches through scripts/checkpatch.pl - it shows a lot of style problems. Fixing those up would probably be a good step towards mainline. Remember, checkpatch.pl is not the law - in some situations what it complains about can be totally valid, but usually what it highlights is stuff that it is prefered to clean up (I'd say especially now prior to inclusion so we don't have to do it post inclusion). A few of your files suffer from trailing whitespace at the end of lines. It's interresting to build your code with -W. It shows up quite a few signed vs unsigned comparisons, unused parameters, expressions that are always false, etc. Although I doubt anyone is going to complain too loudly about stuff that only shows up with "-W", cleaning it up can't hurt. -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 20:38 [DRIVER SUBMISSION] DRBD wants to go mainline Lars Ellenberg 2007-07-21 21:17 ` Jan Engelhardt 2007-07-21 21:34 ` Jesper Juhl @ 2007-07-21 23:50 ` Andi Kleen 2007-07-22 5:52 ` Jens Axboe 2007-07-22 6:09 ` Lars Ellenberg 2007-07-23 1:32 ` Kyle Moffett 2007-07-23 20:59 ` Jesper Juhl 4 siblings, 2 replies; 42+ messages in thread From: Andi Kleen @ 2007-07-21 23:50 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andrew Morton, lkml, drbd-user Lars Ellenberg <lars@linbit.com> writes: > > Jens, Andrew, anyone: please review, > and give me advice how to proceed from here. The standard procedure would be to post all the source code in logical pieces on the list for review. Then iterate until all comments are addressed. -Andi ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 23:50 ` Andi Kleen @ 2007-07-22 5:52 ` Jens Axboe 2007-07-22 13:58 ` Lars Ellenberg 2007-07-22 6:09 ` Lars Ellenberg 1 sibling, 1 reply; 42+ messages in thread From: Jens Axboe @ 2007-07-22 5:52 UTC (permalink / raw) To: Andi Kleen; +Cc: Lars Ellenberg, Andrew Morton, lkml, drbd-user On Sun, Jul 22 2007, Andi Kleen wrote: > Lars Ellenberg <lars@linbit.com> writes: > > > > Jens, Andrew, anyone: please review, > > and give me advice how to proceed from here. > > The standard procedure would be to post all the source code in logical > pieces on the list for review. Then iterate until all comments are > addressed. Yep, cleanup the style issues (that make sense) from checkpatch and then psot as a series of patches that can be reviewed. Linking to a git tree wont get you very far. -- Jens Axboe ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 5:52 ` Jens Axboe @ 2007-07-22 13:58 ` Lars Ellenberg 2007-07-22 14:49 ` Sam Ravnborg ` (2 more replies) 0 siblings, 3 replies; 42+ messages in thread From: Lars Ellenberg @ 2007-07-22 13:58 UTC (permalink / raw) To: Jens Axboe; +Cc: Andi Kleen, Andrew Morton, lkml On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote: > On Sun, Jul 22 2007, Andi Kleen wrote: > > Lars Ellenberg <lars@linbit.com> writes: > > > > > > Jens, Andrew, anyone: please review, > > > and give me advice how to proceed from here. > > > > The standard procedure would be to post all the source code in logical > > pieces on the list for review. Then iterate until all comments are > > addressed. > > Yep, cleanup the style issues (that make sense) from checkpatch and then > psot as a series of patches that can be reviewed. Linking to a git tree > wont get you very far. it got me far enough, for the first try, anyways :-) I did not spam the lkml with patches, and still got some very useful advice (no idea how I could overlook the checkpatch.pl complaints). If each patch of a series needs to compile and work, there will probably only one 17kB patch... it is difficult to split one module into a series of patches. Or am I missing something? may I bother you again to comment on this, please: I am now down to 5 CHECK: memory barrier without comment these are directly connected with our homegrown kernel thread stuff. will vanish as soon as we convert to kthread.h API. 4 CHECK: spinlock_t definition without comment 3 CHECK: Use #include <linux/uaccess.h> instead of <asm/uaccess.h> 3 CHECK: if this code is redundant consider removing it 2 CHECK: Use #include <linux/types.h> instead of <asm/types.h> need to check those, still. 72 ERROR: braces {} are not necessary for single statement blocks one branch needs them, the othe does not. what now? CodingStyle and checkpatch.pl disagree. 13 ERROR: no space between function name and open parenthesis '(' this is about our ERR_IF() macro... suggestions, anyone? do need I to explicitly write these out? 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop 1 ERROR: Macros with complex values should be enclosed in parenthesis these are "netlink packet definition language" in drbd_nl.h, which is sort of clean, and preprocessor magic in drbd_nl.c. suggestions how to handle this cleanly, without making it more ugly? autogenerate code by other means? write it out by hand, and lose the nice and clean drbd_nl.h? 1 ERROR: Don't use kernel_thread(): see Documentation/feature-removal-schedule.txt yes. working on that. will take some days, though. 1 ERROR: Missing Signed-off-by: line(s) 94 WARNING: declaring multiple variables together should be avoided int snr, enr; does this really need to become two lines? 33 WARNING: line over 80 characters hmmm. get more ugly... probably need some helper functions and temp variables? 4 WARNING: multiple assignments should be avoided x = y = 0; is sometimes a convenient initialization. you don't like it? -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : __ please use the "List-Reply" function of your email client. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 13:58 ` Lars Ellenberg @ 2007-07-22 14:49 ` Sam Ravnborg 2007-07-22 14:56 ` Andi Kleen 2007-07-22 15:31 ` Satyam Sharma 2 siblings, 0 replies; 42+ messages in thread From: Sam Ravnborg @ 2007-07-22 14:49 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andi Kleen, Andrew Morton, lkml On Sun, Jul 22, 2007 at 03:58:14PM +0200, Lars Ellenberg wrote: > On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote: > > On Sun, Jul 22 2007, Andi Kleen wrote: > > > Lars Ellenberg <lars@linbit.com> writes: > > > > > > > > Jens, Andrew, anyone: please review, > > > > and give me advice how to proceed from here. > > > > > > The standard procedure would be to post all the source code in logical > > > pieces on the list for review. Then iterate until all comments are > > > addressed. > > > > Yep, cleanup the style issues (that make sense) from checkpatch and then > > psot as a series of patches that can be reviewed. Linking to a git tree > > wont get you very far. > > it got me far enough, for the first try, anyways :-) > I did not spam the lkml with patches, and still got some very useful > advice (no idea how I could overlook the checkpatch.pl complaints). > > If each patch of a series needs to compile and work, > there will probably only one 17kB patch... Thats not needed for reviewing.. > 33 WARNING: line over 80 characters > hmmm. get more ugly... > probably need some helper functions and temp variables? Several people question this check. It gets ugly on text-mode but for the price of readability for the rest. So I suggest concentrate on other matters, but keep an eye for the "many small functions that do exactly _one_ thing and few local variables" approach. Sam ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 13:58 ` Lars Ellenberg 2007-07-22 14:49 ` Sam Ravnborg @ 2007-07-22 14:56 ` Andi Kleen 2007-07-22 15:31 ` Satyam Sharma 2 siblings, 0 replies; 42+ messages in thread From: Andi Kleen @ 2007-07-22 14:56 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andi Kleen, Andrew Morton, lkml On Sun, Jul 22, 2007 at 03:58:14PM +0200, Lars Ellenberg wrote: > On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote: > > On Sun, Jul 22 2007, Andi Kleen wrote: > > > Lars Ellenberg <lars@linbit.com> writes: > > > > > > > > Jens, Andrew, anyone: please review, > > > > and give me advice how to proceed from here. > > > > > > The standard procedure would be to post all the source code in logical > > > pieces on the list for review. Then iterate until all comments are > > > addressed. > > > > Yep, cleanup the style issues (that make sense) from checkpatch and then > > psot as a series of patches that can be reviewed. Linking to a git tree > > wont get you very far. > > it got me far enough, for the first try, anyways :-) > I did not spam the lkml with patches, and still got some very useful > advice (no idea how I could overlook the checkpatch.pl complaints). > > If each patch of a series needs to compile and work, That's not needed for review. The final submission can then be in a single patch. It's just impossible to read a really large single patch. Ideally you space the posting also out over time.to not tax the review resources too much. Another standard trick for compileable patch series is to only add the Kconfig at the end. > 13 ERROR: no space between function name and open parenthesis '(' > this is about our ERR_IF() macro... > suggestions, anyone? > do need I to explicitly write these out? Preferable yes. > > 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop > 1 ERROR: Macros with complex values should be enclosed in parenthesis > these are "netlink packet definition language" in drbd_nl.h, > which is sort of clean, and preprocessor magic in drbd_nl.c. > suggestions how to handle this cleanly, > without making it more ugly? > autogenerate code by other means? > write it out by hand, and lose the nice and clean drbd_nl.h? No, you can ignore it then. > 94 WARNING: declaring multiple variables together should be avoided > int snr, enr; > does this really need to become two lines? Probably not. > > 33 WARNING: line over 80 characters > hmmm. get more ugly... > probably need some helper functions and temp variables? Yes smaller functions are better in general. -Andi ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 13:58 ` Lars Ellenberg 2007-07-22 14:49 ` Sam Ravnborg 2007-07-22 14:56 ` Andi Kleen @ 2007-07-22 15:31 ` Satyam Sharma 2007-07-22 15:50 ` Satyam Sharma 2007-07-22 16:13 ` Stefan Richter 2 siblings, 2 replies; 42+ messages in thread From: Satyam Sharma @ 2007-07-22 15:31 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andi Kleen, Andrew Morton, lkml On 7/22/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > On Sun, Jul 22, 2007 at 07:52:36AM +0200, Jens Axboe wrote: > > [...] > > Yep, cleanup the style issues (that make sense) from checkpatch and then > > psot as a series of patches that can be reviewed. Linking to a git tree > > wont get you very far. > > it got me far enough, for the first try, anyways :-) > I did not spam the lkml with patches, and still got some very useful > advice (no idea how I could overlook the checkpatch.pl complaints). > > If each patch of a series needs to compile and work, > there will probably only one 17kB patch... > it is difficult to split one module into a series of patches. > Or am I missing something? If not too much bother, you could even present the patchset in a way that reflects how the driver code evolved to its present state in v8.0.4. > may I bother you again to comment on this, please: > > I am now down to > 5 CHECK: memory barrier without comment > these are directly connected with our homegrown kernel thread stuff. > will vanish as soon as we convert to kthread.h API. > > 4 CHECK: spinlock_t definition without comment > 3 CHECK: Use #include <linux/uaccess.h> instead of <asm/uaccess.h> > 3 CHECK: if this code is redundant consider removing it > 2 CHECK: Use #include <linux/types.h> instead of <asm/types.h> > need to check those, still. > > 72 ERROR: braces {} are not necessary for single statement blocks > one branch needs them, the othe does not. > what now? CodingStyle and checkpatch.pl disagree. Submit a patch to checkpatch.pl (or preferably CodingStyle, I dislike with the way it blatantly disregards K&R convention in this matter :-) > 13 ERROR: no space between function name and open parenthesis '(' > this is about our ERR_IF() macro... > suggestions, anyone? There shouldn't be any space between function/macro name and open parenthesis. However, there *must* be a space between C language keyword (if, while, for) and open parenthesis. > do need I to explicitly write these out? Yup, do consider that. Also consider making them functions. Macros are _generally_ evil and *always* horrible. > 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop You don't want to break if-else constructs. > 1 ERROR: Macros with complex values should be enclosed in parenthesis I'm not sure know when / why checkpatch.pl reports this. I don't want to pull your tree so can't check for myself, but note some obvious rules: 1. Macro definition must always include parenthesis around the entire definition (unless it's a do { somehting; } while (0) kind of macro). 2. Use parenthesis around wherever it uses the passed argument(s). > these are "netlink packet definition language" in drbd_nl.h, > which is sort of clean, and preprocessor magic in drbd_nl.c. > suggestions how to handle this cleanly, > without making it more ugly? > autogenerate code by other means? > write it out by hand, and lose the nice and clean drbd_nl.h? 1. Open-code it if it is trivial enough and there's no point obfuscating anything. 2. Consider making the macro a function. Probably inline, most probably not. 3. Macros must NOT evaluate the passed argument twice -- consider the possible damages of someone who doesn't know it's a macro and therefore passing in an expression that has side-effects (*boom, crash*). > 1 ERROR: Don't use kernel_thread(): see Documentation/feature-removal-schedule.txt > yes. working on that. > will take some days, though. > > 1 ERROR: Missing Signed-off-by: line(s) Patches submitted to LKML should conform to guidelines in: * Documentation/SubmittingPatches in kernel sources * http://www.zipworld.com.au/~akpm/linux/patches/stuff/tpp.txt > 94 WARNING: declaring multiple variables together should be avoided > int snr, enr; > does this really need to become two lines? No, if these are some unimportant temporary/automatic variables in some function. Yes, if they are members of some struct (also comment them in this case -- in fact give full kernel-doc style comments). > 33 WARNING: line over 80 characters > hmmm. get more ugly... > probably need some helper functions and temp variables? Yes, do consider breaking functions that go significantly beyond 80 lines into smaller ones. If it's just 4-5 columns beyond 80, and breaking the line would actually hurt readability, don't bother. > 4 WARNING: multiple assignments should be avoided > x = y = 0; > is sometimes a convenient initialization. > you don't like it? Yes, we don't appreciate such usage at all. Please fix this. Satyam ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 15:31 ` Satyam Sharma @ 2007-07-22 15:50 ` Satyam Sharma 2007-07-22 16:13 ` Stefan Richter 1 sibling, 0 replies; 42+ messages in thread From: Satyam Sharma @ 2007-07-22 15:50 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andi Kleen, Andrew Morton, lkml On 7/22/07, Satyam Sharma <satyam.sharma@gmail.com> wrote: > On 7/22/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > [...] > > 8 ERROR: Macros with multiple statements should be enclosed in a do - while loop > > You don't want to break if-else constructs. > > > 1 ERROR: Macros with complex values should be enclosed in parenthesis > > I'm not sure know when / why checkpatch.pl reports this. I don't want to > pull your tree so can't check for myself, but note some obvious rules: > > 1. Macro definition must always include parenthesis around the entire > definition (unless it's a do { somehting; } while (0) kind of macro). > 2. Use parenthesis around wherever it uses the passed argument(s). > > > these are "netlink packet definition language" in drbd_nl.h, > > which is sort of clean, and preprocessor magic in drbd_nl.c. > > suggestions how to handle this cleanly, > > without making it more ugly? > > autogenerate code by other means? > > write it out by hand, and lose the nice and clean drbd_nl.h? > > 1. Open-code it if it is trivial enough and there's no point > obfuscating anything. > 2. Consider making the macro a function. Probably inline, most probably not. > 3. Macros must NOT evaluate the passed argument twice -- consider the > possible damages of someone who doesn't know it's a macro and therefore > passing in an expression that has side-effects (*boom, crash*). Some more macro tips for you: If the macro contains multiple statements but must also return a value to the caller (i.e. say it can appear on RHS of an assignment), use the: #define foo ({ something; lastthing; }) kind of idiom. So the return value of the last statement ("lastthing;" above) in there (obviously) becomes the return value of the whole macro. So if you need to define an internal variable inside the macro, and the return value of the macro should be that internal variable itself, use the: #define bar ({ int __mvar; __mvar = something; otherstuff; __mvar; }) kind of idiom. Of course, if you really try to pull off the above stunts, we'll probably ask you to go back and make that a proper function in the first place :-) ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 15:31 ` Satyam Sharma 2007-07-22 15:50 ` Satyam Sharma @ 2007-07-22 16:13 ` Stefan Richter 1 sibling, 0 replies; 42+ messages in thread From: Stefan Richter @ 2007-07-22 16:13 UTC (permalink / raw) To: Satyam Sharma; +Cc: Lars Ellenberg, Jens Axboe, Andi Kleen, Andrew Morton, lkml Satyam Sharma wrote: > On 7/22/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: >> 94 WARNING: declaring multiple variables together should be avoided >> int snr, enr; >> does this really need to become two lines? > > No, if these are some unimportant temporary/automatic variables in > some function. > > Yes, if they are members of some struct (also comment them in > this case -- in fact give full kernel-doc style comments). Don't overdo comments. Comments eventually get out of date and are then dangerously misleading. Comment API elements. Comment on the Why if it isn't obvious. It shouldn't be necessary to comment on the How, because that should be coded in an obvious way in the first place. -- Stefan Richter -=====-=-=== -=== =-==- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 23:50 ` Andi Kleen 2007-07-22 5:52 ` Jens Axboe @ 2007-07-22 6:09 ` Lars Ellenberg 2007-07-22 8:52 ` Sam Ravnborg 2007-07-22 9:00 ` Pekka Enberg 1 sibling, 2 replies; 42+ messages in thread From: Lars Ellenberg @ 2007-07-22 6:09 UTC (permalink / raw) To: Andi Kleen; +Cc: Jens Axboe, Andrew Morton, lkml, drbd-user On Sun, Jul 22, 2007 at 01:50:09AM +0200, Andi Kleen wrote: > Lars Ellenberg <lars@linbit.com> writes: > > > > Jens, Andrew, anyone: please review, > > and give me advice how to proceed from here. > > The standard procedure would be to post all the source code in logical > pieces on the list for review. Then iterate until all comments are addressed. > > -Andi well. there is just one logical piece, the drbd module... it is all in its own subdirectory. it does not really touch anything else. I don't think it is feasible to split it further. Lars ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 6:09 ` Lars Ellenberg @ 2007-07-22 8:52 ` Sam Ravnborg 2007-07-22 9:05 ` Pekka Enberg 2007-07-22 9:00 ` Pekka Enberg 1 sibling, 1 reply; 42+ messages in thread From: Sam Ravnborg @ 2007-07-22 8:52 UTC (permalink / raw) To: Andi Kleen, Jens Axboe, Andrew Morton, lkml, drbd-user > there is just one logical piece, the drbd module... > it is all in its own subdirectory. For review purposes the split is often file based. Something along the lines of: Makefile + Kconfig include/linux/drdb* <= why 5 files for one module? drdb_main.c drdb_req.{c+h} etc.. So essentially smaller pieces that can be revied but when ready for inclusion it should go in as _one_ commit to avoid breaking bisect. But please finish off all the CodingStyle issue before submission so the review can concentrate on the actual content. Quickly browsing drdb_int.h shows that there are a few issues yet to sort out. typedef usage, funny macros, volatile etc. Sam ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 8:52 ` Sam Ravnborg @ 2007-07-22 9:05 ` Pekka Enberg 0 siblings, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2007-07-22 9:05 UTC (permalink / raw) To: Sam Ravnborg; +Cc: Andi Kleen, Jens Axboe, Andrew Morton, lkml, drbd-user Hi Sam, On 7/22/07, Sam Ravnborg <sam@ravnborg.org> wrote: > So essentially smaller pieces that can be revied but when ready for inclusion > it should go in as _one_ commit to avoid breaking bisect. Or alternatively, put all Kconfig and Makefile changes in the last patch. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-22 6:09 ` Lars Ellenberg 2007-07-22 8:52 ` Sam Ravnborg @ 2007-07-22 9:00 ` Pekka Enberg 1 sibling, 0 replies; 42+ messages in thread From: Pekka Enberg @ 2007-07-22 9:00 UTC (permalink / raw) To: Andi Kleen, Jens Axboe, Andrew Morton, lkml, drbd-user Hi Lars, On 7/22/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > there is just one logical piece, the drbd module... > it is all in its own subdirectory. > it does not really touch anything else. > I don't think it is feasible to split it further. I don't know your code at all but the diffstat suggests that "lru cache", "worker", "receiver", and "bitmap" could possibly be made independent patches with all Kconfig and Makefile changes appearing as a final separate patch (so it's git-bisectable). This is 17 KLOC of new code. You really want to make it easy to review if you're serious about getting it merged. Pekka ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 20:38 [DRIVER SUBMISSION] DRBD wants to go mainline Lars Ellenberg ` (2 preceding siblings ...) 2007-07-21 23:50 ` Andi Kleen @ 2007-07-23 1:32 ` Kyle Moffett 2007-07-23 8:49 ` Christoph Hellwig 2007-07-23 13:32 ` Lars Ellenberg 2007-07-23 20:59 ` Jesper Juhl 4 siblings, 2 replies; 42+ messages in thread From: Kyle Moffett @ 2007-07-23 1:32 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andrew Morton, lkml, drbd-user Ok, I didn't have a chance to get through anywhere near all of it, but here's my comments so far. I didn't really go through things in any particular order but most of these comments are about your drbd_int.h header file. Hopefully a lot of the comments will be useful to fix similar/identical problems in your C files. First of all, if you could break this up into chunks (even if they aren't useful individually) just to make it easier to review. Divisions like "This code does on-disk bitmap management", and "This code does network protocol encoding/decoding" would be extremely helpful when digging through this stuff. I ended up only really doing a cursory review for low-level/style issues, since without the bigger picture (and without the fixed style issues and macro cleanups) it's very hard to really give a good high-level review. Cheers, Kyle Moffett +drbd-objs := drbd_buildtag.o drbd_bitmap.o drbd_proc.o \ + drbd_worker.o drbd_receiver.o drbd_req.o drbd_actlog.o \ + lru_cache.o drbd_main.o drbd_strings.o drbd_nl.o + +obj-$(CONFIG_BLK_DEV_DRBD) += drbd.o Don't use foo-objs, use foo-y instead. +#undef DEVICE_NAME +#define DEVICE_NAME "drbd" This is never actually defined/redefined anywhere else. Just hardcode the "drbd" in your printk strings and save yourself 5-6 characters per use. +/* I don't remember why XCPU ... + * This is used to wake the asender, + * and to interrupt sending the sending task + * on disconnect. + */ +#define DRBD_SIG SIGXCPU + +/* This is used to stop/restart our threads. + * Cannot use SIGTERM nor SIGKILL, since these + * are sent out by init on runlevel changes + * I choose SIGHUP for now. + * + * FIXME btw, we should register some reboot notifier. + */ +#define DRBD_SIGKILL SIGHUP Don't use signals between kernel threads, use proper primitives like notifiers and waitqueues, which means you should also probably switch away from kernel_thread() to the kthread_*() APIs. Also you should fix this FIXME or remove it if it no longer applies:-D. +#ifdef PARANOIA +# define PARANOIA_BUG_ON(x) BUG_ON(x) +#else +# define PARANOIA_BUG_ON(x) +#endif This is only ever used in one place for a simple != test: + PARANOIA_BUG_ON(w != &mdev->resync_work); Just delete PARANOIA_BUG_ON and convert the above to a straight BUG_ON() +#define STATIC +#define STATIC static These two lines are found in different files, but the symbol "STATIC" isn't used anywhere. Just get rid of it. +/* handy macro: DUMPP(somepointer) */ +#define DUMPP(A) ERR( #A " = %p in %s:%d\n", (A), __FILE__, __LINE__); [...] +/* Info: do not remove the spaces around the "," before ## + * Otherwise this is not portable from gcc-2.95 to gcc-3.3 */ +#define PRINTK(level, fmt, args...) \ + printk(level DEVICE_NAME "%d: " fmt, \ + mdev->minor , ##args) + +#define ALERT(fmt, args...) PRINTK(KERN_ALERT, fmt , ##args) [...] No more custom debugging macros please, we have plenty of standardized ones in the kernel already (and all togther too many nonstandardized ones). Please take a good look at dev_printk, etc to make some of your printk()s shorter, but don't add more icky macros. Also gcc < 3.1 is now unsupported, so please remove gcc-2.95 portability comments/cruft (although in this case the code itself doesn't need changing, just the comments). +/* see kernel/printk.c:printk_ratelimit + * macro, so it is easy do have independend rate limits at different locations + * "initializer element not constant ..." with kernel 2.4 :( + * so I initialize toks to something large + */ +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \ Any particular reason you can't just use printk_ratelimit for this? Also you should remove any linux 2.4-related code/comments as they won't apply for code submitted to 2.6 mainline. +#ifdef DBG_ASSERTS +extern void drbd_assert_breakpoint(struct drbd_conf *, char *, char *, int ); +# define D_ASSERT(exp) if (!(exp)) \ + drbd_assert_breakpoint(mdev, #exp, __FILE__, __LINE__) +#else +# define D_ASSERT(exp) if (!(exp)) \ + ERR("ASSERT( " #exp " ) in %s:%d\n", __FILE__, __LINE__) +#endif +#define ERR_IF(exp) if (({ \ + int _b = (exp) != 0; \ + if (_b) ERR("%s: (" #exp ") in %s:%d\n", \ + __func__, __FILE__, __LINE__); \ + _b; \ + })) Yuck, more debugging macros. +/* Defines to control fault insertion */ +enum { + DRBD_FAULT_MD_WR = 0, /* meta data write */ + DRBD_FAULT_MD_RD, /* read */ + DRBD_FAULT_RS_WR, /* resync */ + DRBD_FAULT_RS_RD, + DRBD_FAULT_DT_WR, /* data */ + DRBD_FAULT_DT_RD, + DRBD_FAULT_DT_RA, /* data read ahead */ + + DRBD_FAULT_MAX, +}; We have some existing failure-injection code, any chance you could rip out your custom logic and just plug that? I haven't looked over it, though, so I can't really offer any useful suggestions about it. +#include <linux/stringify.h> Put the include at the top of the file, please? +/* integer division, round _UP_ to the next integer */ +#define div_ceil(A, B) ( (A)/(B) + ((A)%(B) ? 1 : 0) ) +/* usual integer division */ +#define div_floor(A, B) ( (A)/(B) ) Your div_ceil function looks OK but I think we already have a standard one somewhere. You might remove that div_floor function, though, as it isn't used anywhere. +/* + * Compatibility Section + *************************/ + +#define LOCK_SIGMASK(task, flags) spin_lock_irqsave(&task->sighand->siglock, flags) +#define UNLOCK_SIGMASK(task, flags) spin_unlock_irqrestore(&task->sighand->siglock, flags) +#define RECALC_SIGPENDING() recalc_sigpending(); These aren't used anywhere, remove please? +#if defined(DBG_SPINLOCKS) && defined(__SMP__) +# define MUST_HOLD(lock) if (!spin_is_locked(lock)) { ERR("Not holding lock! in %s\n", __FUNCTION__ ); } +#else +# define MUST_HOLD(lock) +#endif Why not just open-code "WARN_ON_SMP(!spin_is_locked(lock))" in the few places this is used? Alternatively if you need this to actually BUG(), you might either add a BUG_ON_SMP() to include/asm-generic/bug.h or add a helper to include/linux/spinlock.h: #if defined(CONFIG_SMP) && defined(CONFIG_DEBUG_SPINLOCK) # define spin_lock_musthold(lock) BUG_ON(!spin_is_locked(lock)) #else # define spin_lock_musthold(lock) do { } while(0) #endif +#define SET_MDEV_MAGIC(x) \ + ({ typecheck(struct drbd_conf*, x); \ + (x)->magic = (long)(x) ^ DRBD_MAGIC; }) +#define IS_VALID_MDEV(x) \ + ( typecheck(struct drbd_conf*, x) && \ + ((x) ? (((x)->magic ^ DRBD_MAGIC) == (long)(x)):0)) SET_MDEV_MAGIC is only used in one place, can't you open-code it there? Even if it were used lots of places a little inline function would handle the "typecheck" for you and be easier to read. The IS_VALID_MDEV() isn't used anywhere, so delete it. +enum Drbd_thread_state { + None, + Running, + Exiting, + Restarting +}; + +struct Drbd_thread { + spinlock_t t_lock; + struct task_struct *task; + struct completion startstop; + enum Drbd_thread_state t_state; + int (*function) (struct Drbd_thread *); + struct drbd_conf *mdev; +}; As somebody already mentioned, you need to switch from kernel_thread to kthread_* helpers. +/* + * Having this as the first member of a struct provides sort of "inheritance". + * "derived" structs can be "drbd_queue_work()"ed. + * The callback should know and cast back to the descendant struct. + * drbd_request and Tl_epoch_entry are descendants of drbd_work. + */ +struct drbd_work; +typedef int (*drbd_work_cb)(struct drbd_conf *, struct drbd_work *, int cancel); +struct drbd_work { + struct list_head list; + drbd_work_cb cb; +}; + +struct drbd_barrier; +struct drbd_request { + struct drbd_work w; Eww, please don't do opencoded casting of these. You should use the proper container_of() macro instead: > struct foo { > int a; > }; > struct bar { > int b; > struct foo thefoo; > }; > void mycallback(struct foo *myfoo) > { > struct bar *mybar = container_of(myfoo, struct bar, thefoo); > process_a_bar(mybar); > } It's not *much* better typechecking, but it's at least a little. It also lets you put a struct drbd_work at a non-zero offset inside of some other struct, as container_of() does internal subtraction of the offsetof(). +/* THINK maybe we actually want to use the default "event/%s" worker threads + * or similar in linux 2.6, which uses per cpu data and threads. + * + * To be general, this might need a spin_lock member. + * For now, please use the mdev->req_lock to protect list_head, + * see drbd_queue_work below. + */ +struct drbd_work_queue { + struct list_head q; + struct semaphore s; /* producers up it, worker down()s it */ + spinlock_t q_lock; /* to protect the list. */ +}; Umm, how about fixing this to actually use proper workqueues or something instead of this open-coded mess? +/* for sync_conf and other types... */ +#define PACKET(name, number, fields) struct name { fields }; +#define INTEGER(pn, pr, member) int member; +#define INT64(pn, pr, member) __u64 member; +#define BIT(pn, pr, member) unsigned member : 1; +#define STRING(pn, pr, member, len) \ + unsigned char member[len]; int member ## _len; +#include "linux/drbd_nl.h" The only light at the end of this tunnel is the greedily burning fires of Macro Hell(TM). Isn't there any better way to do this than all those horrid macros? I can sorta see what drbd_nl.h is trying to do, but the way it's included multiple times with all those insane macros defined makes it really hard to mentally parse. +#ifdef PARANOIA + long magic; +#endif Any particular reason you can't just unconditionally use a magic number? It's little more than an extra word per device and a couple simple integer tests. +/* returns 1 if it was successfull, + * returns 0 if there was no data socket. + * so wherever you are going to use the data.socket, e.g. do + * if (!drbd_get_data_sock(mdev)) + * return 0; + * CODE(); + * drbd_put_data_sock(mdev); + */ +static inline int drbd_get_data_sock(struct drbd_conf *mdev) +{ + down(&mdev->data.mutex); + /* drbd_disconnect() could have called drbd_free_sock() + * while we were waiting in down()... */ + if (unlikely(mdev->data.socket == NULL)) { + up(&mdev->data.mutex); + return 0; + } + return 1; +} Any reason this can't be open-coded? The extra unlock logic should just be moved into the cleanup handlers for the appropriate function. It looks like you are using the semaphore as a simple binary mutex, so please switch this to "struct mutex" as well. +#if BITS_PER_LONG == 32 +#define LN2_BPL 5 +#define cpu_to_lel(A) cpu_to_le32(A) +#define lel_to_cpu(A) le32_to_cpu(A) +#elif BITS_PER_LONG == 64 +#define LN2_BPL 6 +#define cpu_to_lel(A) cpu_to_le64(A) +#define lel_to_cpu(A) le64_to_cpu(A) +#else +#error "LN2 of BITS_PER_LONG unknown!" +#endif Hrm, any reason you can't just make the bitmap an array of __u64 and always use the cpu_to_le64()/le64_to_cpu() functions? +/* I want the packet to fit within one page + * THINK maybe use a special bitmap header, + * including offset and compression scheme and whatnot + * Do not use PAGE_SIZE here! Use a architecture agnostic constant! + */ +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long)) Yuck. Definitely use PAGE_SIZE here, so at least if it's broken on an arch with multiple page sizes, somebody can grep for PAGE_SIZE to fix it. It also means that on archs/configs with 8k or 64k pages you won't waste a bunch of memory. +/* Dynamic tracing framework */ +#ifdef ENABLE_DYNAMIC_TRACE + +extern int trace_type; +extern int trace_devs; [...] AGH!! Not another dynamic tracing framework! Please use one of the existing solutions like kprobes or something. Maybe define some static tracepoints if that would be useful? +static inline void drbd_tcp_cork(struct socket *sock) +{ +#if 1 + mm_segment_t oldfs = get_fs(); + int val = 1; + + set_fs(KERNEL_DS); + tcp_setsockopt(sock->sk, SOL_TCP, TCP_CORK, (char *)&val, sizeof(val) ); + set_fs(oldfs); +#else + tcp_sk(sock->sk)->nonagle |= TCP_NAGLE_CORK; +#endif +} Yuck, why'd you do it this way? Probably because your tcp_sk(sock->sk) stuff doesn't have proper locking, I'll bet. You can avoid all the extra wrapper crap by just looking in "do_tcp_setsockopt" and taking the appropriate lock: static inline void drbd_tcp_cork(struct socket *sock) { struct sock *sk = sock->sk; lock_sock(sk); tcp_sk(sk)->nonagle |= TCP_NAGLE_CORK; release-sock(sk); } On the other hand, none of the currently in-tree network block devices or network filesystems seem to feel the need to try to TCP_CORK their output, why does drbd? +#define peer_mask role_mask +#define pdsk_mask disk_mask +#define susp_mask 1 +#define user_isp_mask 1 +#define aftr_isp_mask 1 + +#define NS(T, S) \ + ({ union drbd_state_t mask; mask.i = 0; mask.T = T##_mask; mask; }), \ + ({ union drbd_state_t val; val.i = 0; val.T = (S); val; }) +#define NS2(T1, S1, T2, S2) \ + ({ union drbd_state_t mask; mask.i = 0; mask.T1 = T1##_mask; \ + mask.T2 = T2##_mask; mask; }), \ + ({ union drbd_state_t val; val.i = 0; val.T1 = (S1); \ + val.T2 = (S2); val; }) +#define NS3(T1, S1, T2, S2, T3, S3) \ + ({ union drbd_state_t mask; mask.i = 0; mask.T1 = T1##_mask; \ + mask.T2 = T2##_mask; mask.T3 = T3##_mask; mask; }), \ + ({ union drbd_state_t val; val.i = 0; val.T1 = (S1); \ + val.T2 = (S2); val.T3 = (S3); val; }) + +#define _NS(D, T, S) \ + D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T = (S); ns; }) +#define _NS2(D, T1, S1, T2, S2) \ + D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T1 = (S1); \ + ns.T2 = (S2); ns; }) +#define _NS3(D, T1, S1, T2, S2, T3, S3) \ + D, ({ union drbd_state_t ns; ns.i = D->state.i; ns.T1 = (S1); \ + ns.T2 = (S2); ns.T3 = (S3); ns; }) Grumble. When I earlier said I thought I was in macro hell, well, I was wrong. *THIS* is macro hell. What the fsck is that supposed to do? And it doesn't even include a *SINGLE* comment!!! +static inline void drbd_state_lock(struct drbd_conf *mdev) +{ + wait_event(mdev->misc_wait, + !test_and_set_bit(CLUSTER_ST_CHANGE, &mdev->flags)); +} Umm, what's wrong with a mutex here? +static inline int semaphore_is_locked(struct semaphore *s) +{ + if (!down_trylock(s)) { + up(s); + return 0; + } + return 1; +} This would be better implemented in the mutex/semaphore generic code, instead of stuffed in a network blockdev. On the other hand, since the only user is this: + D_ASSERT(semaphore_is_locked(&mdev->md_io_mutex)); Can't you just either get rid of the check or open-code it? It would also give you a chance to substitute D_ASSERT(foo) for a proper BUG_ON(!foo) :-D +static inline void +_drbd_queue_work(struct drbd_work_queue *q, struct drbd_work *w) [...] Yeah, this stuff should really be fixed to use generic workqueues or something. +#define ERR_IF_CNT_IS_NEGATIVE(which) \ + if (atomic_read(&mdev->which) < 0) \ + ERR("in %s:%d: " #which " = %d < 0 !\n", \ + __func__ , __LINE__ , \ + atomic_read(&mdev->which)) Another macro with an implicit parameter (mdev). Please fix all of these +#define dec_unacked(mdev) do { \ + typecheck(struct drbd_conf *, mdev); \ + atomic_dec(&mdev->unacked_cnt); \ + ERR_IF_CNT_IS_NEGATIVE(unacked_cnt); } while (0) [...] More macros that should be inline functions. + if (!have_net_conf) dec_net(mdev); There's lots of coding-style violations like these in there. You should put the "dec_net(mdev);" indented on the line *AFTER* the if statememt. Please read linux/Documentation/CodingStyle for full details. + int mxb = 1000000; /* arbitrary limit on open requests */ Arbitrary limits are usually bad. Derive it from system properties, make it user-configurable, etc. +/* I'd like to use wait_event_lock_irq, + * but I'm not sure when it got introduced, + * and not sure when it has 3 or 4 arguments */ +static inline void inc_ap_bio(struct drbd_conf *mdev) [...] + spin_lock_irq(&mdev->req_lock); + while (!__inc_ap_bio_cond(mdev)) { + prepare_to_wait(&mdev->misc_wait, &wait, TASK_UNINTERRUPTIBLE); + spin_unlock_irq(&mdev->req_lock); + schedule(); + finish_wait(&mdev->misc_wait, &wait); + spin_lock_irq(&mdev->req_lock); + } + spin_unlock_irq(&mdev->req_lock); Since you're submitting for upstream inclusion you can just delete the extra backwards-compatible opencoded stuff and use the generic helper. +#define ARRY_SIZE(A) (sizeof(A)/sizeof(A[0])) There's a generic ARRAY_SIZE(foo) somewhere in the kernel sources. Use that one please. +/* this is developers aid only! */ +#define PARANOIA_ENTRY() BUG_ON(test_and_set_bit(__LC_PARANOIA, &lc->flags)) +#define PARANOIA_LEAVE() do { clear_bit(__LC_PARANOIA, &lc->flags); smp_mb__after_clear_bit(); } while (0) +#define RETURN(x...) do { PARANOIA_LEAVE(); return x ; } while (0) What is the PARANOIA flag intended to verify? This needs better commenting or removal. Don't use macros which take implicit arguments (IE: the lc value is used in the macro without it being passed as a parameter). Since it's debugging code you could also replace the open-coded bit trylock with a blocking spinlock. If something hangs, just turn on lockdep and it will report *MUCH* more useful information about the deadlock. Macros which do a "return" are discouraged, but since this one is called RETURN() it might be OK. On the other hand, you should probably rename it as PARANOIA_RETURN(). +int _drbd_md_sync_page_io(struct drbd_conf *mdev, + struct drbd_backing_dev *bdev, + struct page *page, sector_t sector, + int rw, int size) +{ [...] + ok = (bio_add_page(bio, page, size, 0) == size); + if (!ok) goto out; Ewwww, can somebody say "CodingStyle"? Don't put an if() and its result on a single line, and just use a straight if instead: > if (bio_add_page(bio, page, size, 0) != size) > goto out; + mask = ( hardsect / MD_HARDSECT ) - 1; + D_ASSERT( mask == 1 || mask == 3 || mask == 7 ); + D_ASSERT( hardsect == (mask+1) * MD_HARDSECT ); Problematic spaces and parentheses. Again, see Documentation/CodingStyle for the preferred forms. On the other hand, sometimes little code-style things like that are left to the maintainer if they really strongly prefer it one particular way. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 1:32 ` Kyle Moffett @ 2007-07-23 8:49 ` Christoph Hellwig 2007-07-23 9:00 ` Sam Ravnborg 2007-07-23 13:32 ` Lars Ellenberg 1 sibling, 1 reply; 42+ messages in thread From: Christoph Hellwig @ 2007-07-23 8:49 UTC (permalink / raw) To: Kyle Moffett; +Cc: Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd-user On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > First of all, if you could break this up into chunks (even if they > aren't useful individually) just to make it easier to review. No, please don't do that I have no idea why people say that except for drinking too much "small patches" koolaid. Small patches are the most optimal form if you can split patches logically. There is absolutely no point in splitting up a large driver into multiple patches. It just means you have to download and apply all of them to get the global picture needed for a proper review again. If a driver is too large for lkml a pointer to a large patch is much much better for a proper review. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 8:49 ` Christoph Hellwig @ 2007-07-23 9:00 ` Sam Ravnborg 2007-07-23 9:01 ` Christoph Hellwig 0 siblings, 1 reply; 42+ messages in thread From: Sam Ravnborg @ 2007-07-23 9:00 UTC (permalink / raw) To: Christoph Hellwig, Kyle Moffett, Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd-user On Mon, Jul 23, 2007 at 09:49:03AM +0100, Christoph Hellwig wrote: > On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > > First of all, if you could break this up into chunks (even if they > > aren't useful individually) just to make it easier to review. > > No, please don't do that I have no idea why people say that except for > drinking too much "small patches" koolaid. It is much more preferred to have code available in the mailer than being forced to copy it from some big patch somewhere when you want to point out something. Obviously the big picture is needed too. But the introductory mail describing the overall design and functionality should provide most of this. This submission obviously also failed to provide such an overall design intro. > If a driver is too large for lkml a > pointer to a large patch is much much better for a proper review. Then we should request both... Sam ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 9:00 ` Sam Ravnborg @ 2007-07-23 9:01 ` Christoph Hellwig 2007-07-23 9:19 ` Sam Ravnborg 0 siblings, 1 reply; 42+ messages in thread From: Christoph Hellwig @ 2007-07-23 9:01 UTC (permalink / raw) To: Sam Ravnborg Cc: Christoph Hellwig, Kyle Moffett, Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd-user On Mon, Jul 23, 2007 at 11:00:38AM +0200, Sam Ravnborg wrote: > On Mon, Jul 23, 2007 at 09:49:03AM +0100, Christoph Hellwig wrote: > > On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > > > First of all, if you could break this up into chunks (even if they > > > aren't useful individually) just to make it easier to review. > > > > No, please don't do that I have no idea why people say that except for > > drinking too much "small patches" koolaid. > > It is much more preferred to have code available in the mailer than > being forced to copy it from some big patch somewhere when you > want to point out something. While that's a valid point I totally agree on for small patches it just doesn't matter for a complex driver where you need to sit down for an hour or more to actually do a useful review. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 9:01 ` Christoph Hellwig @ 2007-07-23 9:19 ` Sam Ravnborg 2007-07-23 11:08 ` Lars Ellenberg 0 siblings, 1 reply; 42+ messages in thread From: Sam Ravnborg @ 2007-07-23 9:19 UTC (permalink / raw) To: Christoph Hellwig, Kyle Moffett, Lars Ellenberg, Jens Axboe, Andrew Morton, lkml, drbd-user On Mon, Jul 23, 2007 at 10:01:01AM +0100, Christoph Hellwig wrote: > On Mon, Jul 23, 2007 at 11:00:38AM +0200, Sam Ravnborg wrote: > > On Mon, Jul 23, 2007 at 09:49:03AM +0100, Christoph Hellwig wrote: > > > On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > > > > First of all, if you could break this up into chunks (even if they > > > > aren't useful individually) just to make it easier to review. > > > > > > No, please don't do that I have no idea why people say that except for > > > drinking too much "small patches" koolaid. > > > > It is much more preferred to have code available in the mailer than > > being forced to copy it from some big patch somewhere when you > > want to point out something. > > While that's a valid point I totally agree on for small patches it just > doesn't matter for a complex driver where you need to sit down for an > hour or more to actually do a useful review. We do review on different levels. I for once will look at the Makefile + Kconfig bits and some trivial CodingStyle issues. Here it will help. You on the other hand I assume will look at the functionality and overall design and then for you the full patch is the best way. Well I could try to do the same as you but that would take me maybe 10 hours or more just to come up with some limited input on the overall design. So therefore instead of backing out I give inputs on the smaller scale which I know is less usefull but in the hope that this can be resolved so when you or others take your time to review at least the majority of the simple CodingStyle things are fixed so you do not get sidetracked and can concentrate on the important bits. The message is quite clear. For a driver like this do something like: - Describe the functionality / the problem is solved - Describe the overall design (maybe as a file in Documentation/?) - Make a single patch available - Submit mails in smaller chunks for on-the-list reviews This will then be useable bot for the "triviality" reviewers like me and for the "knowledge able" reviewers like you. Sam ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 9:19 ` Sam Ravnborg @ 2007-07-23 11:08 ` Lars Ellenberg 0 siblings, 0 replies; 42+ messages in thread From: Lars Ellenberg @ 2007-07-23 11:08 UTC (permalink / raw) To: Sam Ravnborg Cc: Christoph Hellwig, Kyle Moffett, Jens Axboe, Andrew Morton, lkml, drbd-user On Mon, Jul 23, 2007 at 11:19:40AM +0200, Sam Ravnborg wrote: > The message is quite clear. > For a driver like this do something like: > - Describe the functionality / the problem is solved > - Describe the overall design (maybe as a file in Documentation/?) I'm currently writing the paper for linux-conf.eu, and will make that available to you asap. > - Make a single patch available will do. > - Submit mails in smaller chunks for on-the-list reviews > > This will then be useable bot for the "triviality" reviewers like me > and for the "knowledge able" reviewers like you. will do. now working on the remaining "triviality" issues already mentioned. Lars ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 1:32 ` Kyle Moffett 2007-07-23 8:49 ` Christoph Hellwig @ 2007-07-23 13:32 ` Lars Ellenberg 2007-07-23 13:37 ` Jens Axboe ` (2 more replies) 1 sibling, 3 replies; 42+ messages in thread From: Lars Ellenberg @ 2007-07-23 13:32 UTC (permalink / raw) To: Kyle Moffett; +Cc: Jens Axboe, Andrew Morton, lkml On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > Ok, I didn't have a chance to get through anywhere near all of it, but > here's my comments so far. I didn't really go through things in any > particular order but most of these comments are about your drbd_int.h > header file. Hopefully a lot of the comments will be useful to fix > similar/identical problems in your C files. Thank you for your time. I want to give some lame excuses, still: > +#undef DEVICE_NAME > +#define DEVICE_NAME "drbd" a left-over. first prototypes of drbd have been around in linux 2.0 time... "in the old days", as a block device, you had to define this, and then include some header after that. will clean that up. > +/* I don't remember why XCPU ... > + * This is used to wake the asender, > + * and to interrupt sending the sending task > + * on disconnect. > + */ > +#define DRBD_SIG SIGXCPU > Don't use signals between kernel threads, use proper primitives like > notifiers and waitqueues, which means you should also probably switch away > from kernel_thread() to the kthread_*() APIs. Also you should fix this > FIXME or remove it if it no longer applies:-D. right. but how to I tell a network thread in tcp_recvmsg to stop early, without using signals? > +/* see kernel/printk.c:printk_ratelimit > + * macro, so it is easy do have independend rate limits at different locations > + * "initializer element not constant ..." with kernel 2.4 :( > + * so I initialize toks to something large > + */ > +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \ > > Any particular reason you can't just use printk_ratelimit for this? I want to be able to do a rate-limit per specific message/code fragment, without affecting other messages or execution paths. > +/* > + * Compatibility Section > + *************************/ > + > +#define LOCK_SIGMASK(task, flags) spin_lock_irqsave(&task->sighand->siglock, flags) > +#define UNLOCK_SIGMASK(task, flags) spin_unlock_irqrestore(&task->sighand->siglock, flags) > +#define RECALC_SIGPENDING() recalc_sigpending(); > > These aren't used anywhere, I just recently cleaned up the use of them. when the first drbd prototypes have been developed it was linux 2.0 times... along the way many things have changed in incompatible ways, so we had to abstract it in macros. > remove please? yes. > +/* THINK maybe we actually want to use the default "event/%s" worker threads > + * or similar in linux 2.6, which uses per cpu data and threads. > + * > + * To be general, this might need a spin_lock member. > + * For now, please use the mdev->req_lock to protect list_head, > + * see drbd_queue_work below. > + */ > +struct drbd_work_queue { > + struct list_head q; > + struct semaphore s; /* producers up it, worker down()s it */ > + spinlock_t q_lock; /* to protect the list. */ > +}; > > Umm, how about fixing this to actually use proper workqueues or something > instead of this open-coded mess? unlikely to happen "right now". but it is on our todo list... > +/* I want the packet to fit within one page > + * THINK maybe use a special bitmap header, > + * including offset and compression scheme and whatnot > + * Do not use PAGE_SIZE here! Use a architecture agnostic constant! > + */ > +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long)) > > Yuck. Definitely use PAGE_SIZE here, so at least if it's broken on an arch > with multiple page sizes, somebody can grep for PAGE_SIZE to fix it. It also > means that on archs/configs with 8k or 64k pages you won't waste a bunch of > memory. No. This is not to allocate anything, but defines the chunk size with which we transmit the bitmap, when we have to. We need to be able to talk from one arch (say i586) to some other (say s390, or sparc, or whatever). The receiving side has a one-page buffer, from which it may or may not to endian-conversion. The hardcoded 4096 is the minimal common denominator here. > +/* Dynamic tracing framework */ guess we have to explain first what this is for. it probably is not exactly what you think it is... but maybe I am just too ignorant about what you suggest here. we'll have to defer discussion about this for when I'm done doing the trivial fix-ups, and provided an overall design overview. > +static inline void drbd_tcp_cork(struct socket *sock) ... > On the other hand, none of the currently in-tree network block devices or > network filesystems seem to feel the need to try to TCP_CORK their output, > why does drbd? it had performance improvements somewhen. I doubt it has still. maybe we just remove this. > +#define peer_mask role_mask > +#define pdsk_mask disk_mask > +#define susp_mask 1 > +#define user_isp_mask 1 > +#define aftr_isp_mask 1 > +#define NS(T, S) \ > +#define NS2(T1, S1, T2, S2) \ > +#define NS3(T1, S1, T2, S2, T3, S3) \ > +#define _NS(D, T, S) \ > +#define _NS2(D, T1, S1, T2, S2) \ > +#define _NS3(D, T1, S1, T2, S2, T3, S3) \ > > Grumble. When I earlier said I thought I was in macro hell, well, I was > wrong. *THIS* is macro hell. What the fsck is that supposed to do? And it > doesn't even include a *SINGLE* comment!!! uhm. basically you are right. Phil will take over to explain why it is done that way... > + int mxb = 1000000; /* arbitrary limit on open requests */ > > Arbitrary limits are usually bad. Derive it from system properties, make it > user-configurable, etc. it is user configurable. it is just the upper cap on user configurability. most other points I just snipped off of this mail will be handled as you suggested asap. Lars ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 13:32 ` Lars Ellenberg @ 2007-07-23 13:37 ` Jens Axboe 2007-07-23 21:13 ` Lars Ellenberg 2007-07-23 13:40 ` Satyam Sharma 2007-07-24 0:48 ` Kyle Moffett 2 siblings, 1 reply; 42+ messages in thread From: Jens Axboe @ 2007-07-23 13:37 UTC (permalink / raw) To: Lars Ellenberg, Kyle Moffett, Andrew Morton, lkml On Mon, Jul 23 2007, Lars Ellenberg wrote: > > +/* THINK maybe we actually want to use the default "event/%s" worker threads > > + * or similar in linux 2.6, which uses per cpu data and threads. > > + * > > + * To be general, this might need a spin_lock member. > > + * For now, please use the mdev->req_lock to protect list_head, > > + * see drbd_queue_work below. > > + */ > > +struct drbd_work_queue { > > + struct list_head q; > > + struct semaphore s; /* producers up it, worker down()s it */ > > + spinlock_t q_lock; /* to protect the list. */ > > +}; > > > > Umm, how about fixing this to actually use proper workqueues or something > > instead of this open-coded mess? > > unlikely to happen "right now". > but it is on our todo list... But stuff like that is definitely a merge show stopper, jfyi. -- Jens Axboe ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 13:37 ` Jens Axboe @ 2007-07-23 21:13 ` Lars Ellenberg 0 siblings, 0 replies; 42+ messages in thread From: Lars Ellenberg @ 2007-07-23 21:13 UTC (permalink / raw) To: Jens Axboe; +Cc: Kyle Moffett, Andrew Morton, lkml On Mon, Jul 23, 2007 at 03:37:04PM +0200, Jens Axboe wrote: > On Mon, Jul 23 2007, Lars Ellenberg wrote: > > > +/* THINK maybe we actually want to use the default "event/%s" worker threads > > > + * or similar in linux 2.6, which uses per cpu data and threads. > > > + * > > > + * To be general, this might need a spin_lock member. > > > + * For now, please use the mdev->req_lock to protect list_head, > > > + * see drbd_queue_work below. > > > + */ > > > +struct drbd_work_queue { > > > + struct list_head q; > > > + struct semaphore s; /* producers up it, worker down()s it */ > > > + spinlock_t q_lock; /* to protect the list. */ > > > +}; > > > > > > Umm, how about fixing this to actually use proper workqueues or something > > > instead of this open-coded mess? > > > > unlikely to happen "right now". > > but it is on our todo list... > > But stuff like that is definitely a merge show stopper, jfyi. I see. but it is not that easy to do away with kernel threads (the open-coded mess) in favor of "proper workqueues or something". Lars ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 13:32 ` Lars Ellenberg 2007-07-23 13:37 ` Jens Axboe @ 2007-07-23 13:40 ` Satyam Sharma 2007-07-23 21:19 ` Lars Ellenberg 2007-07-24 0:48 ` Kyle Moffett 2 siblings, 1 reply; 42+ messages in thread From: Satyam Sharma @ 2007-07-23 13:40 UTC (permalink / raw) To: Lars Ellenberg, Kyle Moffett, Jens Axboe, Andrew Morton, lkml On 7/23/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > [...] > > Don't use signals between kernel threads, use proper primitives like > > notifiers and waitqueues, which means you should also probably switch away > > from kernel_thread() to the kthread_*() APIs. Also you should fix this > > FIXME or remove it if it no longer applies:-D. > > right. > but how to I tell a network thread in tcp_recvmsg to stop early, > without using signals? Yup, kthreads API cannot handle (properly stop) kernel threads that want to sleep on possibly-blocking-forever-till-signalled-functions such as tcp_recvmsg or skb_recv_datagram etc etc. There are two workarounds: 1. Use sk_rcvtimeo and related while-continue logic 2. force_sig(SIGKILL) to your kernel thread just before kthread_stop (note that you don't need to allow / write code to handle / etc signals in your kthread code -- force_sig will work automatically) > > +/* THINK maybe we actually want to use the default "event/%s" worker threads > > + * or similar in linux 2.6, which uses per cpu data and threads. > > + * > > + * To be general, this might need a spin_lock member. > > + * For now, please use the mdev->req_lock to protect list_head, > > + * see drbd_queue_work below. > > + */ > > +struct drbd_work_queue { > > + struct list_head q; > > + struct semaphore s; /* producers up it, worker down()s it */ > > + spinlock_t q_lock; /* to protect the list. */ > > +}; > > > > Umm, how about fixing this to actually use proper workqueues or something > > instead of this open-coded mess? > > unlikely to happen "right now". > but it is on our todo list... It should be easier to do it now (if you defer it for later, the code will only grow more and more complex). Also, removing this gunk from your driver will clearly make it smaller, and easier for us to review :-) > > +#define peer_mask role_mask > > +#define pdsk_mask disk_mask > > +#define susp_mask 1 > > +#define user_isp_mask 1 > > +#define aftr_isp_mask 1 > > +#define NS(T, S) \ > > +#define NS2(T1, S1, T2, S2) \ > > +#define NS3(T1, S1, T2, S2, T3, S3) \ > > +#define _NS(D, T, S) \ > > +#define _NS2(D, T1, S1, T2, S2) \ > > +#define _NS3(D, T1, S1, T2, S2, T3, S3) \ > > > > Grumble. When I earlier said I thought I was in macro hell, well, I was > > wrong. *THIS* is macro hell. What the fsck is that supposed to do? And it > > doesn't even include a *SINGLE* comment!!! > > uhm. basically you are right. > Phil will take over to explain why it is done that way... The simplistic problems I initially thought your driver suffered from turned out to be not so true, actually -- like Kyle mentioned, your driver is in proper macro hell. Please get rid of as many as possible, and replace with functions, as applicable. Satyam ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 13:40 ` Satyam Sharma @ 2007-07-23 21:19 ` Lars Ellenberg 2007-07-24 7:36 ` Jens Axboe 2007-07-24 23:11 ` Satyam Sharma 0 siblings, 2 replies; 42+ messages in thread From: Lars Ellenberg @ 2007-07-23 21:19 UTC (permalink / raw) To: Satyam Sharma; +Cc: Kyle Moffett, Jens Axboe, Andrew Morton, lkml On Mon, Jul 23, 2007 at 07:10:58PM +0530, Satyam Sharma wrote: > On 7/23/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > >On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > >[...] > >> Don't use signals between kernel threads, use proper primitives like > >> notifiers and waitqueues, which means you should also probably switch > >away > >> from kernel_thread() to the kthread_*() APIs. Also you should fix this > >> FIXME or remove it if it no longer applies:-D. > > > >right. > >but how to I tell a network thread in tcp_recvmsg to stop early, > >without using signals? > > > Yup, kthreads API cannot handle (properly stop) kernel threads that want > to sleep on possibly-blocking-forever-till-signalled-functions such as > tcp_recvmsg or skb_recv_datagram etc etc. > > There are two workarounds: > 1. Use sk_rcvtimeo and related while-continue logic > 2. force_sig(SIGKILL) to your kernel thread just before kthread_stop > (note that you don't need to allow / write code to handle / etc signals > in your kthread code -- force_sig will work automatically) this is not only at stop time. for example our "drbd_asender" thread does receive as well as send, and the sending latency is crucial to performance, while the recv will not timeout for the next few seconds. > >> +/* THINK maybe we actually want to use the default "event/%s" worker > >threads > >> + * or similar in linux 2.6, which uses per cpu data and threads. > >> + * > >> + * To be general, this might need a spin_lock member. > >> + * For now, please use the mdev->req_lock to protect list_head, > >> + * see drbd_queue_work below. > >> + */ > >> +struct drbd_work_queue { > >> + struct list_head q; > >> + struct semaphore s; /* producers up it, worker down()s it */ > >> + spinlock_t q_lock; /* to protect the list. */ > >> +}; > >> > >> Umm, how about fixing this to actually use proper workqueues or something > >> instead of this open-coded mess? > > > >unlikely to happen "right now". > >but it is on our todo list... > > It should be easier to do it now (if you defer it for later, the code will > only grow more and more complex). Also, removing this gunk from > your driver will clearly make it smaller, and easier for us to review :-) and will poison the generic work queues with stuff that might block somewhere deep in the tcp stack. and where we are not able to cancel it. not exactly desirable, either. but maybe I am missing something? Lars ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 21:19 ` Lars Ellenberg @ 2007-07-24 7:36 ` Jens Axboe 2007-07-24 23:11 ` Satyam Sharma 1 sibling, 0 replies; 42+ messages in thread From: Jens Axboe @ 2007-07-24 7:36 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Satyam Sharma, Kyle Moffett, Andrew Morton, lkml On Mon, Jul 23 2007, Lars Ellenberg wrote: > On Mon, Jul 23, 2007 at 07:10:58PM +0530, Satyam Sharma wrote: > > On 7/23/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > > >On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > > >[...] > > >> Don't use signals between kernel threads, use proper primitives like > > >> notifiers and waitqueues, which means you should also probably switch > > >away > > >> from kernel_thread() to the kthread_*() APIs. Also you should fix this > > >> FIXME or remove it if it no longer applies:-D. > > > > > >right. > > >but how to I tell a network thread in tcp_recvmsg to stop early, > > >without using signals? > > > > > > Yup, kthreads API cannot handle (properly stop) kernel threads that want > > to sleep on possibly-blocking-forever-till-signalled-functions such as > > tcp_recvmsg or skb_recv_datagram etc etc. > > > > There are two workarounds: > > 1. Use sk_rcvtimeo and related while-continue logic > > 2. force_sig(SIGKILL) to your kernel thread just before kthread_stop > > (note that you don't need to allow / write code to handle / etc signals > > in your kthread code -- force_sig will work automatically) > > this is not only at stop time. > for example our "drbd_asender" thread > does receive as well as send, and the sending > latency is crucial to performance, while the recv > will not timeout for the next few seconds. > > > >> +/* THINK maybe we actually want to use the default "event/%s" worker > > >threads > > >> + * or similar in linux 2.6, which uses per cpu data and threads. > > >> + * > > >> + * To be general, this might need a spin_lock member. > > >> + * For now, please use the mdev->req_lock to protect list_head, > > >> + * see drbd_queue_work below. > > >> + */ > > >> +struct drbd_work_queue { > > >> + struct list_head q; > > >> + struct semaphore s; /* producers up it, worker down()s it */ > > >> + spinlock_t q_lock; /* to protect the list. */ > > >> +}; > > >> > > >> Umm, how about fixing this to actually use proper workqueues or something > > >> instead of this open-coded mess? > > > > > >unlikely to happen "right now". > > >but it is on our todo list... > > > > It should be easier to do it now (if you defer it for later, the code will > > only grow more and more complex). Also, removing this gunk from > > your driver will clearly make it smaller, and easier for us to review :-) > > and will poison the generic work queues with stuff that might block > somewhere deep in the tcp stack. and where we are not able to cancel it. > not exactly desirable, either. > but maybe I am missing something? Create your own (single threaded) work queue using create_singlethread_workqueue(). -- Jens Axboe ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 21:19 ` Lars Ellenberg 2007-07-24 7:36 ` Jens Axboe @ 2007-07-24 23:11 ` Satyam Sharma 2007-07-25 9:46 ` Lars Ellenberg 1 sibling, 1 reply; 42+ messages in thread From: Satyam Sharma @ 2007-07-24 23:11 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Kyle Moffett, Jens Axboe, Andrew Morton, lkml Hi Lars, On 7/24/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > On Mon, Jul 23, 2007 at 07:10:58PM +0530, Satyam Sharma wrote: > > On 7/23/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > > >On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > > >[...] > > >> Don't use signals between kernel threads, use proper primitives like > > >> notifiers and waitqueues, which means you should also probably switch > > >away > > >> from kernel_thread() to the kthread_*() APIs. Also you should fix this > > >> FIXME or remove it if it no longer applies:-D. > > > > > >right. > > >but how to I tell a network thread in tcp_recvmsg to stop early, > > >without using signals? > > > > Yup, kthreads API cannot handle (properly stop) kernel threads that want > > to sleep on possibly-blocking-forever-till-signalled-functions such as > > tcp_recvmsg or skb_recv_datagram etc etc. > > > > There are two workarounds: > > 1. Use sk_rcvtimeo and related while-continue logic > > 2. force_sig(SIGKILL) to your kernel thread just before kthread_stop > > (note that you don't need to allow / write code to handle / etc signals > > in your kthread code -- force_sig will work automatically) > > this is not only at stop time. > for example our "drbd_asender" thread > does receive as well as send, That's normal -- in fact it would've been surprising if your kthread only did recvs but no sends! But where does the "send" come into the picture over here -- a send won't block forever, so I don't foresee any issues whatsoever w.r.t. kthreads conversion for that. [ BTW I hope you're *not* using any signals-based interface for your kernel thread _at all_. Kthreads disallow (ignore) all signals by default, as they should, and you really shouldn't need to write any logic to handle or do-certain-things-on-seeing a signal in a well designed kernel thread. ] > and the sending > latency is crucial to performance, while the recv > will not timeout for the next few seconds. Again, I don't see what sending latency has to do with a kernel_thread to kthread conversion. Or with signals, for that matter. Anyway, as Kyle Moffett mentioned elsewhere, you could probably look at other examples (say cifs_demultiplexer_thread() in fs/cifs/connect.c). [ I didn't really want to give that example, because I get a nervous breakdown when looking at that code myself, and would actively like to save other fellow developers from a similar fate. To know what I'm talking about, set your xterm to display 40 rows, and then look at the line numbers 3139-3218 in that file, especially 3190-3212. Yes, what you see there is a map of Sulawesi [1] subliminally hidden in Linux kernel code :-) ] Anyway, cifs_demultiplexer_thread() is just your normal kthread that: (1) Ignores all signals (2) Calls perma-blocking-till-signalled functions such as tcp_recvmsg (via kernel_recvmsg) (3) Calls send-to-socket kind of functions Hence, it could get into trouble when the umount(2) code wants to stop it with kthread_stop() and it happens to be blocked in tcp_recvmsg() with noblock = 0 (hence sk_rcvtimeo == MAX_SCHEDULE_TIMEOUT), thus would handle the wake_up_process() internally, and not break out, hence not check kthread_should_stop() which it should -- all this ensuring that the kthread never gets killed, kthread_stop() hangs, and the umount(2) from userspace never returns ... But they've solved it as follows (as I suggested earlier): (1) First, set sock->sk_rcvtimeo to some "magical value" in your code that sets up the socket params after socket->proto_ops->connect(). See ipv4_connect(), f.e. in CIFS they've set it up to 7 seconds. But that's arbitrarily chosen -- this'll ensure your tcp_recvmsg() isn't perma-blocking in the first place, but will unblock/return every 7 secs, and thus get a chance to check kthread_should_stop(). (2) From the code that wants to kill/stop the kthread (module exit, or umount(2) most probably), just ensure you make a call to force_sig() before kthread_stop() on that kthread -- see cifs_umount() in the same file. This'll ensure that even if the kthread is currently sleeping in tcp_recvmsg(), it'll be signalled to break out from there, and thus check kthread_should_stop(). (3) Note that not a single line of code needs to be written extra in the kthread itself for this to work -- nothing to allow / handle signals ... Just this, should be enough for a smooth conversion to kthreads, IMHO. > > >> +/* THINK maybe we actually want to use the default "event/%s" worker > > >threads > > >> + * or similar in linux 2.6, which uses per cpu data and threads. > > >> + * > > >> + * To be general, this might need a spin_lock member. > > >> + * For now, please use the mdev->req_lock to protect list_head, > > >> + * see drbd_queue_work below. > > >> + */ > > >> +struct drbd_work_queue { > > >> + struct list_head q; > > >> + struct semaphore s; /* producers up it, worker down()s it */ > > >> + spinlock_t q_lock; /* to protect the list. */ > > >> +}; > > >> > > >> Umm, how about fixing this to actually use proper workqueues or something > > >> instead of this open-coded mess? > > > > > >unlikely to happen "right now". > > >but it is on our todo list... > > > > It should be easier to do it now (if you defer it for later, the code will > > only grow more and more complex). Also, removing this gunk from > > your driver will clearly make it smaller, and easier for us to review :-) > > and will poison the generic work queues You could create your own workqueue as Jens Axboe suggested. > with stuff that might block > somewhere deep in the tcp stack. and where we are not able to cancel it. > not exactly desirable, either. > but maybe I am missing something? What would that workqueue be for, in the first place? Are we talking of the same (which is presently the) kernel_thread() here? Sorry, but I'm only looking at the struct / snippet above, and reading words like "worker" / "producer" and although you're calling that struct a drbd_work_queue, it isn't quite obvious to me you're actually /using/ it as one. Frankly, a high-level design document is a must, here (with lower level implementation details in the individual changelogs of the patches you finally post to this list). Working that out from 17 kloc would otherwise be too difficult for any reviewer. Thanks, Satyam [1] Sulawesi is the oddly-shaped island in the Indonesian archipelago: http://upload.wikimedia.org/wikipedia/commons/1/16/Sulawesi_map.PNG ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-24 23:11 ` Satyam Sharma @ 2007-07-25 9:46 ` Lars Ellenberg 2007-07-25 12:12 ` Satyam Sharma 0 siblings, 1 reply; 42+ messages in thread From: Lars Ellenberg @ 2007-07-25 9:46 UTC (permalink / raw) To: Satyam Sharma; +Cc: Kyle Moffett, Jens Axboe, Andrew Morton, lkml On Wed, Jul 25, 2007 at 04:41:53AM +0530, Satyam Sharma wrote: > Hi Lars, > > > On 7/24/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > >On Mon, Jul 23, 2007 at 07:10:58PM +0530, Satyam Sharma wrote: > >> On 7/23/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > >> >On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: > >> >[...] > >> >> Don't use signals between kernel threads, use proper primitives like > >> >> notifiers and waitqueues, which means you should also probably switch > >> >away > >> >> from kernel_thread() to the kthread_*() APIs. Also you should fix > >this > >> >> FIXME or remove it if it no longer applies:-D. > >> > > >> >right. > >> >but how to I tell a network thread in tcp_recvmsg to stop early, > >> >without using signals? > >> > >> Yup, kthreads API cannot handle (properly stop) kernel threads that want > >> to sleep on possibly-blocking-forever-till-signalled-functions such as > >> tcp_recvmsg or skb_recv_datagram etc etc. > >> > >> There are two workarounds: > >> 1. Use sk_rcvtimeo and related while-continue logic > >> 2. force_sig(SIGKILL) to your kernel thread just before kthread_stop > >> (note that you don't need to allow / write code to handle / etc signals > >> in your kthread code -- force_sig will work automatically) > > > >this is not only at stop time. > >for example our "drbd_asender" thread > >does receive as well as send, > > That's normal -- in fact it would've been surprising if your kthread only > did recvs but no sends! > > But where does the "send" come into the picture over here -- a send > won't block forever, so I don't foresee any issues whatsoever w.r.t. > kthreads conversion for that. [ BTW I hope you're *not* using any > signals-based interface for your kernel thread _at all_. Kthreads > disallow (ignore) all signals by default, as they should, and you really > shouldn't need to write any logic to handle or do-certain-things-on-seeing > a signal in a well designed kernel thread. ] > > >and the sending > >latency is crucial to performance, while the recv > >will not timeout for the next few seconds. > > Again, I don't see what sending latency has to do with a kernel_thread > to kthread conversion. Or with signals, for that matter. Anyway, as > Kyle Moffett mentioned elsewhere, you could probably look at other > examples (say cifs_demultiplexer_thread() in fs/cifs/connect.c). the basic problem, and what we use signals for, is: it is waiting in recv, waiting for the peer to say something. but I want it to stop recv, and go send something "right now". I don't want to have two threads for that. yes we have timeo in place, anyways: we need to detect a failed peer node in time. we even aim for "sub-second failover" sometimes (which is not exactly feasible; but failover times of 15 seconds and less are requirement for useable HA-iSCSI deployments). but that does not cut it, timeo is seconds. you don't want seconds latency for IO operations. so I signal it, it breaks out of recv, then sends, and goes back to recv. in-kernel epoll would probably solve this. I don't know how to do that properly, though. > >> only grow more and more complex). Also, removing this gunk from > >> your driver will clearly make it smaller, and easier for us to review :-) > > > >and will poison the generic work queues > > You could create your own workqueue as Jens Axboe suggested. will do. I was not aware of the "create_singlethread_workqueue", it does fit our useage good enough. > Frankly, a high-level design document is a must, here (with lower > level implementation details in the individual changelogs of the > patches you finally post to this list). Working that out from 17 kloc > would otherwise be too difficult for any reviewer. sure. will be available soon. this was just a "bust in and see what lkml does about it", I don't expect to be merged within days :) I think it is realistic to be merged this year, though... [as chrismas present, maybe :-)] -- : Lars Ellenberg Tel +43-1-8178292-0 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com : ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-25 9:46 ` Lars Ellenberg @ 2007-07-25 12:12 ` Satyam Sharma 2007-07-26 2:03 ` david 0 siblings, 1 reply; 42+ messages in thread From: Satyam Sharma @ 2007-07-25 12:12 UTC (permalink / raw) To: Lars Ellenberg, Satyam Sharma, Kyle Moffett, Jens Axboe, Andrew Morton, lkml On 7/25/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: > On Wed, Jul 25, 2007 at 04:41:53AM +0530, Satyam Sharma wrote: > > [...] > > > > But where does the "send" come into the picture over here -- a send > > won't block forever, so I don't foresee any issues whatsoever w.r.t. > > kthreads conversion for that. [ BTW I hope you're *not* using any > > signals-based interface for your kernel thread _at all_. Kthreads > > disallow (ignore) all signals by default, as they should, and you really > > shouldn't need to write any logic to handle or do-certain-things-on-seeing > > a signal in a well designed kernel thread. ] > > > > >and the sending > > >latency is crucial to performance, while the recv > > >will not timeout for the next few seconds. > > > > Again, I don't see what sending latency has to do with a kernel_thread > > to kthread conversion. Or with signals, for that matter. Anyway, as > > Kyle Moffett mentioned elsewhere, you could probably look at other > > examples (say cifs_demultiplexer_thread() in fs/cifs/connect.c). > > the basic problem, and what we use signals for, is: > > it is waiting in recv, waiting for the peer to say something. > but I want it to stop recv, and go send something "right now". That's ... weird. Most (all?) communication between any two parties would follow a protocol where someone recv's stuff, does something with it, and sends it back ... what would you send "right now" if you didn't receive anything? > I don't want to have two threads for that. I really think you should -- you clearly should. From the above, it does appear that you're mixing in multiple kinds of stuff into a single thread, and thus mucking up the entire design (and implementation). > yes we have timeo in place, anyways: we need to detect a failed peer > node in time. we even aim for "sub-second failover" sometimes (which is > not exactly feasible; but failover times of 15 seconds and less are > requirement for useable HA-iSCSI deployments). > but that does not cut it, timeo is seconds. > you don't want seconds latency for IO operations. Ok, that's a reasonable goal. > so I signal it, it breaks out of recv, then sends, and goes back to recv. > > in-kernel epoll would probably solve this. > I don't know how to do that properly, though. Hmm, probably I don't understand what you're doing, how you're doing etc. Will wait for the design & implementation docs. Thanks, Satyam ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-25 12:12 ` Satyam Sharma @ 2007-07-26 2:03 ` david 2007-07-26 3:43 ` Kyle Moffett 0 siblings, 1 reply; 42+ messages in thread From: david @ 2007-07-26 2:03 UTC (permalink / raw) To: Satyam Sharma Cc: Lars Ellenberg, Kyle Moffett, Jens Axboe, Andrew Morton, lkml On Wed, 25 Jul 2007, Satyam Sharma wrote: > On 7/25/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: >> On Wed, Jul 25, 2007 at 04:41:53AM +0530, Satyam Sharma wrote: >> > [...] >> > >> > But where does the "send" come into the picture over here -- a send >> > won't block forever, so I don't foresee any issues whatsoever w.r.t. >> > kthreads conversion for that. [ BTW I hope you're *not* using any >> > signals-based interface for your kernel thread _at all_. Kthreads >> > disallow (ignore) all signals by default, as they should, and you really >> > shouldn't need to write any logic to handle or >> > do-certain-things-on-seeing >> > a signal in a well designed kernel thread. ] >> > >> > > and the sending >> > > latency is crucial to performance, while the recv >> > > will not timeout for the next few seconds. >> > >> > Again, I don't see what sending latency has to do with a kernel_thread >> > to kthread conversion. Or with signals, for that matter. Anyway, as >> > Kyle Moffett mentioned elsewhere, you could probably look at other >> > examples (say cifs_demultiplexer_thread() in fs/cifs/connect.c). >> >> the basic problem, and what we use signals for, is: >> >> it is waiting in recv, waiting for the peer to say something. >> but I want it to stop recv, and go send something "right now". > > That's ... weird. Most (all?) communication between any two parties > would follow a protocol where someone recv's stuff, does something > with it, and sends it back ... what would you send "right now" if you > didn't receive anything? becouse even though you didn't receive anything you now have something important to send. remember that both sides can be sitting in receive mode. this puts them both in a position to respond to the other if the other has something to say. David Lang ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-26 2:03 ` david @ 2007-07-26 3:43 ` Kyle Moffett 2007-07-26 9:17 ` Evgeniy Polyakov 0 siblings, 1 reply; 42+ messages in thread From: Kyle Moffett @ 2007-07-26 3:43 UTC (permalink / raw) To: david Cc: Satyam Sharma, Lars Ellenberg, Jens Axboe, Andrew Morton, LKML Kernel, netdev On Jul 25, 2007, at 22:03:37, david@lang.hm wrote: > On Wed, 25 Jul 2007, Satyam Sharma wrote: >> On 7/25/07, Lars Ellenberg <lars.ellenberg@linbit.com> wrote: >>> On Wed, Jul 25, 2007 at 04:41:53AM +0530, Satyam Sharma wrote: >>>> [...] >>>> But where does the "send" come into the picture over here -- a >>>> send won't block forever, so I don't foresee any issues >>>> whatsoever w.r.t. kthreads conversion for that. [ BTW I hope >>>> you're *not* using any signals-based interface for your kernel >>>> thread _at all_. Kthreads disallow (ignore) all signals by >>>> default, as they should, and you really shouldn't need to write >>>> any logic to handle or > do-certain-things-on-seeing a signal >>>> in a well designed kernel thread. ] and the sending latency is >>>> crucial to performance, while the recv will not timeout for the >>>> next few seconds. Again, I don't see what sending latency has >>>> to do with a kernel_thread to kthread conversion. Or with >>>> signals, for that matter. Anyway, as Kyle Moffett mentioned >>>> elsewhere, you could probably look at other examples (say >>>> cifs_demultiplexer_thread() in fs/cifs/connect.c). >>> >>> the basic problem, and what we use signals for, is: it is >>> waiting in recv, waiting for the peer to say something. but I >>> want it to stop recv, and go send something "right now". >> >> That's ... weird. Most (all?) communication between any two >> parties would follow a protocol where someone recv's stuff, does >> something with it, and sends it back ... what would you send >> "right now" if you didn't receive anything? > > becouse even though you didn't receive anything you now have > something important to send. > > remember that both sides can be sitting in receive mode. this puts > them both in a position to respond to the other if the other has > something to say. Why not just have 2 threads, one for "sending" and one for "receiving". When your receiving thread gets data it takes appropriate locks and processes it, then releases the locks and goes back to waiting for packets. Your sending thread would take appropriate locks, generate data to send, release locks, and transmit packets. You don't have to interrupt the receive thread to send packets, so where's the latency problem, exactly? If I were writing that in userspace I would have: (A) The pool of IO-generating threads (IE: What would ordinarily be userspace) (B) One or a small number of data-reception threads. (C) One or a small number of data-transmission threads. When you get packets to process in your network-reception thread(s), you queue appropriate disk IOs and any appropriate responses with your transmission thread(s). You can basically just sit in a loop on tcp_recvmsg=>demultiplex=>do-stuff. When your IO-generators actually make stuff to send you queue such data for disk IO, then packetize it and hand it off to your data-transmission threads. If you made all your sockets and inter-thread pipes nonblocking then in userspace you would just epoll_wait() on the sockets and pipes and be easily able to react to any IO from anywhere. In kernel space there are similar nonblocking interfaces, although it would probably be easier just to use a couple threads. Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-26 3:43 ` Kyle Moffett @ 2007-07-26 9:17 ` Evgeniy Polyakov 0 siblings, 0 replies; 42+ messages in thread From: Evgeniy Polyakov @ 2007-07-26 9:17 UTC (permalink / raw) To: Kyle Moffett Cc: david, Satyam Sharma, Lars Ellenberg, Jens Axboe, Andrew Morton, LKML Kernel, netdev Hi Kyle. On Wed, Jul 25, 2007 at 11:43:38PM -0400, Kyle Moffett (mrlinuxman@mac.com) wrote: > If you made all your sockets and inter-thread pipes nonblocking then > in userspace you would just epoll_wait() on the sockets and pipes and > be easily able to react to any IO from anywhere. > > In kernel space there are similar nonblocking interfaces, although it > would probably be easier just to use a couple threads. There are no such interfaces in kernel - one must create own state machine on top of ->poll() callback, or use sys_poll()/epoll, but likely it is not what one wants for high-performance in-kernel event processing engine. Having two threads does not solve the problem - eventually one needs to send a header before receiving data. So, the solution would either to use always-blocking mode like in NBD, or create own state machine using ->poll() callbacks. > Cheers, > Kyle Moffett -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-23 13:32 ` Lars Ellenberg 2007-07-23 13:37 ` Jens Axboe 2007-07-23 13:40 ` Satyam Sharma @ 2007-07-24 0:48 ` Kyle Moffett 2 siblings, 0 replies; 42+ messages in thread From: Kyle Moffett @ 2007-07-24 0:48 UTC (permalink / raw) To: Lars Ellenberg; +Cc: Jens Axboe, Andrew Morton, LKML Kernel, netdev For the guys on netdev, would you please look at the tcp_recvmsg- threading and TCP_NAGLE_CORK issues below and give opinions on the best way to proceed? One thing to remember, you don't necessarily have to merge every feature right away. As long as the new code is configured "off" by default with an "(EXPERIMENTAL)" warning, you can start getting the core parts and the cleanups upstream before you have to resolve all the issues with low-latency, dynamic-tracing-frameworks, etc. On Jul 23, 2007, at 09:32:02, Lars Ellenberg wrote: > On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote: >> +/* I don't remember why XCPU ... >> + * This is used to wake the asender, >> + * and to interrupt sending the sending task >> + * on disconnect. >> + */ >> +#define DRBD_SIG SIGXCPU > >> Don't use signals between kernel threads, use proper primitives >> like notifiers and waitqueues, which means you should also >> probably switch away from kernel_thread() to the kthread_*() >> APIs. Also you should fix this FIXME or remove it if it no longer >> applies:-D. > > right. > but how to I tell a network thread in tcp_recvmsg to stop early, > without using signals? I'm not really a kernel-networking guy, so I can't answer this definitively, but I'm pretty sure the problem has been solved in many network filesystems and such, so I've added a netdev CC. The way I'd do it in userspace is with nonblocking IO and epoll(), that way I don't actually have to "stop" or "signal" the thread, I can just add a socket to epoll fd when I want to pay attention to it, and remove it from my epoll fd when I'm done with it. I'd assume there's some equivalent way in kernelspace based around the "struct kiocb *iocb" and "int nonblock" parameters to the tcp_recvmsg() kernel function. >> +/* see kernel/printk.c:printk_ratelimit >> + * macro, so it is easy do have independend rate limits at >> different locations >> + * "initializer element not constant ..." with kernel 2.4 :( >> + * so I initialize toks to something large >> + */ >> +#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \ >> >> Any particular reason you can't just use printk_ratelimit for this? > > I want to be able to do a rate-limit per specific message/code > fragment, without affecting other messages or execution paths. Ok, so could you change your patch to modify __printk_ratelimit() to also accept a "struct printk_rate" datastructure and make printk_ratelimit() call "__printk_ratelimit(&global_printk_rate);"?? Typically if $KERNEL_FEATURE is insufficient for your needs you should fix $KERNEL_FEATURE instead of duplicating a replacement in your driver. This applies to basically all of the things I'm talking about, kernel-threads, workqueues (BTW: I believe you can make your own custom workqueue thread(s) instead of using the default "events/ *" ones), debugging macros, fault-insertion, integer math, lock- checking, dynamic tracing, etc. If you find some reason that some generic code won't work for you, please try to fix it first so we can all benefit from it. >> Umm, how about fixing this to actually use proper workqueues or >> something instead of this open-coded mess? > > unlikely to happen "right now". but it is on our todo list... Unfortunately problems like these need to be fixed before a mainline merge. Merging duplicated code is a big no-no, and historically there have been problems with people who merge code and never properly maintain it once it's in tree. As a result the rule is your code has to be easily maintainable before anybody will even *consider* merging it. >> +/* I want the packet to fit within one page >> + * THINK maybe use a special bitmap header, >> + * including offset and compression scheme and whatnot >> + * Do not use PAGE_SIZE here! Use a architecture agnostic constant! >> + */ >> +#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof >> (long)) >> >> Yuck. Definitely use PAGE_SIZE here, so at least if it's broken >> on an arch with multiple page sizes, somebody can grep for >> PAGE_SIZE to fix it. It also means that on archs/configs with 8k >> or 64k pages you won't waste a bunch of memory. > > No. This is not to allocate anything, but defines the chunk size > with which we transmit the bitmap, when we have to. We need to be > able to talk from one arch (say i586) to some other (say s390, or > sparc, or whatever). The receiving side has a one-page buffer, > from which it may or may not to endian-conversion. The hardcoded > 4096 is the minimal common denominator here. Ahhh. Please replace the constant "4096" with: /* This is the maximum amount of bitmap we will send per packet */ # define MAX_BITMAP_CHUNK_SIZE 4096 # define BM_PACKET_WORDS \ ((MAX_BITMAP_CHUNK_SIZE - sizeof(struct Drbd_Header))/sizeof(long)) It's more text but dramatically improves the readability by eliminating more magic numbers. This is a much milder case than I've seen in the past, so it's not that big of a deal. >> +/* Dynamic tracing framework */ > guess we have to explain first what this is for. it probably is > not exactly what you think it is... but maybe I am just too > ignorant about what you suggest here. > > we'll have to defer discussion about this for when I'm done doing > the trivial fix-ups, and provided an overall design overview. From just poking at the code it looks vaguely similar to blktrace or something. We do have a lot of things in the kernel which fall under the name of "Dynamic tracing framework", so if you could look those over and see if you can't reuse them (or just remove the code for now and merge it later), that would be really useful. >> +static inline void drbd_tcp_cork(struct socket *sock) >> +{ >> +#if 1 >> + mm_segment_t oldfs = get_fs(); >> + int val = 1; >> + >> + set_fs(KERNEL_DS); >> + tcp_setsockopt(sock->sk, SOL_TCP, TCP_CORK, (char *)&val, >> sizeof(val) ); >> + set_fs(oldfs); >> +#else >> + tcp_sk(sock->sk)->nonagle |= TCP_NAGLE_CORK; >> +#endif >> +} >> >> Yuck, why'd you do it this way? Probably because your tcp_sk(sock- >> >sk) stuff >> doesn't have proper locking, I'll bet. You can avoid all the >> extra wrapper >> crap by just looking in "do_tcp_setsockopt" and taking the >> appropriate lock: >> >> static inline void drbd_tcp_cork(struct socket *sock) >> { >> struct sock *sk = sock->sk; >> >> lock_sock(sk); >> tcp_sk(sk)->nonagle |= TCP_NAGLE_CORK; >> release-sock(sk); >> } > > it had performance improvements somewhen. I doubt it has still. > maybe we just remove this. NOTE: netdev guys, any chance you could comment on whether or not a RAID1-over-IP block device should be using TCP_NAGLE_CORK? If not, what should they be using instead? >> +#define peer_mask role_mask >> +#define pdsk_mask disk_mask >> +#define susp_mask 1 >> +#define user_isp_mask 1 >> +#define aftr_isp_mask 1 >> +#define NS(T, S) \ >> +#define NS2(T1, S1, T2, S2) \ >> +#define NS3(T1, S1, T2, S2, T3, S3) \ >> +#define _NS(D, T, S) \ >> +#define _NS2(D, T1, S1, T2, S2) \ >> +#define _NS3(D, T1, S1, T2, S2, T3, S3) \ >> >> Grumble. When I earlier said I thought I was in macro hell, well, >> I was >> wrong. *THIS* is macro hell. What the fsck is that supposed to >> do? And it >> doesn't even include a *SINGLE* comment!!! > > uhm. basically you are right. Phil will take over to explain why > it is done that way... Thanks, I'm looking forward to a good bout of hysterical laughter ;-) :-D. > most other points I just snipped off of this mail will be handled > as you suggested asap. Great! Thanks. I'll take another deep dive into your git tree in a week or so when I've got more free time, but it would really help to have some smaller-ish patches which do any necessary preparatory cleanups. Good luck with your submission! Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline 2007-07-21 20:38 [DRIVER SUBMISSION] DRBD wants to go mainline Lars Ellenberg ` (3 preceding siblings ...) 2007-07-23 1:32 ` Kyle Moffett @ 2007-07-23 20:59 ` Jesper Juhl 4 siblings, 0 replies; 42+ messages in thread From: Jesper Juhl @ 2007-07-23 20:59 UTC (permalink / raw) To: Jens Axboe, Andrew Morton, lkml, drbd-user On 21/07/07, Lars Ellenberg <lars@linbit.com> wrote: > > DRBD wants to go mainline. > please have a look at the "for-linus" branch of > git://git.drbd.org/home/git/linux-drbd.git. > A few more comments. In drbd_main.c::after_state_ch() there's this code : ... if ( mdev->p_uuid ) { kfree(mdev->p_uuid); mdev->p_uuid = NULL; } ... Since kfree(NULL) if a noop, that should just become : ... kfree(mdev->p_uuid); mdev->p_uuid = NULL; ... Same for these bits in drbd_main.c::drbd_cleanup() ... if(mdev->app_reads_hash) { kfree(mdev->app_reads_hash); mdev->app_reads_hash = NULL; } if ( mdev->p_uuid ) { kfree(mdev->p_uuid); mdev->p_uuid = NULL; } ... They should just become ... kfree(mdev->app_reads_hash); mdev->app_reads_hash = NULL; kfree(mdev->p_uuid); mdev->p_uuid = NULL; ... And then there's drbd_main.c::drbd_new_device() : ... if(mdev->app_reads_hash) kfree(mdev->app_reads_hash); ... That should just be ... kfree(mdev->app_reads_hash); ... Btw; You shouldn't put multiple statements on the same line. That is seen all over the place - like mdev = kzalloc(sizeof(drbd_dev),GFP_KERNEL); if(!mdev) goto Enomem; That should be mdev = kzalloc(sizeof(drbd_dev),GFP_KERNEL); if(!mdev) goto Enomem; There are lots of other examples. Here are a few more pointless NULL checks before kfree() : drbd_nl.c::drbd_nl_net_conf() : ... if(new_tl_hash) { if (mdev->tl_hash) kfree(mdev->tl_hash); mdev->tl_hash_s = mdev->net_conf->max_epoch_size/8; mdev->tl_hash = new_tl_hash; } if(new_ee_hash) { if (mdev->ee_hash) kfree(mdev->ee_hash); mdev->ee_hash_s = mdev->net_conf->max_buffers/8; mdev->ee_hash = new_ee_hash; } if ( mdev->cram_hmac_tfm ) { crypto_free_hash(mdev->cram_hmac_tfm); } mdev->cram_hmac_tfm = tfm; ... should become : ... if(new_tl_hash) { kfree(mdev->tl_hash); mdev->tl_hash_s = mdev->net_conf->max_epoch_size/8; mdev->tl_hash = new_tl_hash; } if(new_ee_hash) { kfree(mdev->ee_hash); mdev->ee_hash_s = mdev->net_conf->max_buffers/8; mdev->ee_hash = new_ee_hash; } crypto_free_hash(mdev->cram_hmac_tfm); mdev->cram_hmac_tfm = tfm; ... still in drbd_nl.c::drbd_nl_net_conf() : ... fail: if (tfm) crypto_free_hash(tfm); if (new_tl_hash) kfree(new_tl_hash); if (new_ee_hash) kfree(new_ee_hash); if (new_conf) kfree(new_conf); ... should become ... fail: crypto_free_hash(tfm); kfree(new_tl_hash); kfree(new_ee_hash); kfree(new_conf); ... drbd_nl.c::ensure_mdev() : ... if(mdev->app_reads_hash) kfree(mdev->app_reads_hash); .. becomes ... kfree(mdev->app_reads_hash); ... In drbd_receiver.c::receive_uuids() we find this : ... if ( mdev->p_uuid ) kfree(mdev->p_uuid); ... which should be just ... kfree(mdev->p_uuid); ... There is also drbd_receiver.c::drbd_disconnect() - where ... if ( mdev->p_uuid ) { kfree(mdev->p_uuid); mdev->p_uuid = NULL; } ... needs to become just ... kfree(mdev->p_uuid); mdev->p_uuid = NULL; ... And in drbd_receiver.c::drbd_do_auth() ... fail: if(peers_ch) kfree(peers_ch); if(response) kfree(response); if(right_response) kfree(right_response); ... needs to be turned into ... fail: kfree(peers_ch); kfree(response); kfree(right_response); ... drbd_req.c::drbd_make_request_common() also needs to get rid of ... if (b) kfree(b); /* if someone else has beaten us to it... */ ... fail_and_free_req: if (b) kfree(b); ... and just turn it into ... kfree(b); /* if someone else has beaten us to it... */ ... fail_and_free_req: kfree(b); ... -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [DRIVER SUBMISSION] DRBD wants to go mainline @ 2007-07-22 9:54 Tomasz Chmielewski 0 siblings, 0 replies; 42+ messages in thread From: Tomasz Chmielewski @ 2007-07-22 9:54 UTC (permalink / raw) To: LKML; +Cc: jengelh Jan Engelhardt wrote > On Jul 22 2007 00:43, Lars Ellenberg wrote: >>Think of it as RAID1 over TCP. > > And what does it do better than raid1-over-NBD? (Which is already N-disk, > and, logically, seems to support cluster filesystems) I don't know about DRDB, but NBD doesn't handle network disconnections at all (well, almost). So basically, disconnect a NBD-connected system for a while (switch, cabling problem, operator error etc.), and you need lots of effort, perhaps restarts, to get the things to a functioning state (devices offlined, kicked out etc.). I wouldn't call such raid-over-NBD setup reliable. A better question would be: what does it do better than raid1-over-iSCSI? iSCSI can recover from disconnections very well when configured properly; but when a disconnection is in place, most of the system will just "hang/freeze" (that is, from the user perspective - the system will be waiting for the I/O to complete, until the systems are connected again). A brief reading of "official DRBD FAQ" didn't give me an answer to that problem. -- Tomasz Chmielewski http://wpkg.org ^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2007-07-30 20:12 UTC | newest] Thread overview: 42+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-07-21 20:38 [DRIVER SUBMISSION] DRBD wants to go mainline Lars Ellenberg 2007-07-21 21:17 ` Jan Engelhardt 2007-07-21 22:43 ` Lars Ellenberg 2007-07-22 9:06 ` Jan Engelhardt 2007-07-22 14:03 ` Lars Ellenberg 2007-07-27 18:46 ` Pavel Machek 2007-07-30 19:35 ` Lars Ellenberg 2007-07-30 19:41 ` Jan Engelhardt 2007-07-21 21:34 ` Jesper Juhl 2007-07-21 23:50 ` Andi Kleen 2007-07-22 5:52 ` Jens Axboe 2007-07-22 13:58 ` Lars Ellenberg 2007-07-22 14:49 ` Sam Ravnborg 2007-07-22 14:56 ` Andi Kleen 2007-07-22 15:31 ` Satyam Sharma 2007-07-22 15:50 ` Satyam Sharma 2007-07-22 16:13 ` Stefan Richter 2007-07-22 6:09 ` Lars Ellenberg 2007-07-22 8:52 ` Sam Ravnborg 2007-07-22 9:05 ` Pekka Enberg 2007-07-22 9:00 ` Pekka Enberg 2007-07-23 1:32 ` Kyle Moffett 2007-07-23 8:49 ` Christoph Hellwig 2007-07-23 9:00 ` Sam Ravnborg 2007-07-23 9:01 ` Christoph Hellwig 2007-07-23 9:19 ` Sam Ravnborg 2007-07-23 11:08 ` Lars Ellenberg 2007-07-23 13:32 ` Lars Ellenberg 2007-07-23 13:37 ` Jens Axboe 2007-07-23 21:13 ` Lars Ellenberg 2007-07-23 13:40 ` Satyam Sharma 2007-07-23 21:19 ` Lars Ellenberg 2007-07-24 7:36 ` Jens Axboe 2007-07-24 23:11 ` Satyam Sharma 2007-07-25 9:46 ` Lars Ellenberg 2007-07-25 12:12 ` Satyam Sharma 2007-07-26 2:03 ` david 2007-07-26 3:43 ` Kyle Moffett 2007-07-26 9:17 ` Evgeniy Polyakov 2007-07-24 0:48 ` Kyle Moffett 2007-07-23 20:59 ` Jesper Juhl -- strict thread matches above, loose matches on Subject: below -- 2007-07-22 9:54 Tomasz Chmielewski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox