* [PATCH md 0 of 2] Introduction
From: NeilBrown @ 2004-09-03  2:27 UTC
To: Andrew Morton; +Cc: linux-raid

Two patches for md in 2.6.9-rc1-mm2.

The first adds support for notifying user-space of events in md.  The
mechanism is very simple: a reader of /proc/mdstat can select for
"exceptional" events.  When select/poll indicates one of these, the
reader should re-read /proc/mdstat from the top looking for changes,
and then select again.

If the reader also opens the file for write (O_RDWR; only root can do
this), it indicates that it is prepared to take remedial action.  This
is currently only relevant for multipath: on last-drive failure, if
there is a reader of /proc/mdstat that has it open for write, multipath
will wait for that reader to add a new drive or take other action
before resubmitting the failed requests.

The second patch just fixes a counter in raid5/6 which could get out
by one and so produce confusing messages.

(I think I sent a single item of junk just before this; sorry about
that.  Please ignore it.)

NeilBrown
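For concreteness, a passive monitor consuming this interface might look
like the sketch below.  It is illustrative only, not part of the
patches, and assumes nothing beyond the poll/re-read semantics
described above.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        struct pollfd pfd;
        ssize_t n;
        int fd;

        /* O_RDWR (root only) would mark this process as an event
         * handler; O_RDONLY is enough for passive monitoring. */
        fd = open("/proc/mdstat", O_RDONLY);
        if (fd < 0) {
                perror("/proc/mdstat");
                return 1;
        }
        pfd.fd = fd;
        pfd.events = POLLPRI;

        for (;;) {
                /* Re-read from the top; a complete pass over the file
                 * is what marks the current event count as seen. */
                lseek(fd, 0, SEEK_SET);
                while ((n = read(fd, buf, sizeof(buf))) > 0)
                        fwrite(buf, 1, n, stdout);

                /* Block until the md event count increases again. */
                if (poll(&pfd, 1, -1) < 0) {
                        perror("poll");
                        return 1;
                }
        }
}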
* [PATCH md 2 of 2] Correct "working_disk" counts for raid5 and raid6
From: NeilBrown @ 2004-09-03  2:27 UTC
To: Andrew Morton; +Cc: linux-raid

This error only affects two messages (and sysadmin heart-rate).
It does not risk data.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/raid5.c     |    2 +-
 ./drivers/md/raid6main.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~       2004-09-03 11:34:26.000000000 +1000
+++ ./drivers/md/raid5.c        2004-09-03 11:37:56.000000000 +1000
@@ -477,8 +477,8 @@ static void error(mddev_t *mddev, mdk_rd

        if (!rdev->faulty) {
                mddev->sb_dirty = 1;
-               conf->working_disks--;
                if (rdev->in_sync) {
+                       conf->working_disks--;
                        mddev->degraded++;
                        conf->failed_disks++;
                        rdev->in_sync = 0;

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~   2004-09-03 11:34:26.000000000 +1000
+++ ./drivers/md/raid6main.c    2004-09-03 11:37:56.000000000 +1000
@@ -498,8 +498,8 @@ static void error(mddev_t *mddev, mdk_rd

        if (!rdev->faulty) {
                mddev->sb_dirty = 1;
-               conf->working_disks--;
                if (rdev->in_sync) {
+                       conf->working_disks--;
                        mddev->degraded++;
                        conf->failed_disks++;
                        rdev->in_sync = 0;
* [PATCH md 1 of 2] Add interface to md driver for userspace monitoring of events.
From: NeilBrown @ 2004-09-03  2:27 UTC
To: Andrew Morton; +Cc: linux-raid

Every interesting md event (device failure, rebuild, add, remove, etc.)
gets treated as 'urgent data' on /proc/mdstat: it causes select to
return when waiting for exceptions, and poll to return when waiting
for PRIority data.

To collect an event, the program must re-read /proc/mdstat from start
to finish, and then select/poll on the file descriptor again (or close
it).

Events aren't returned as a list of individual events, only as a
notification that something might have happened; reading /proc/mdstat
should show what it was.

If a program opens /proc/mdstat with O_RDWR, it signals that it
intends to handle events.  In some cases the md driver might want to
wait for an event to be handled before deciding what to do next.  For
example, when the last path of a multipath fails, the md driver could
either fail all requests, or wait for a working path to be provided.
It can do this by waiting for the failure event to be handled, and
then making the decision.  A program that is handling events should
read /proc/mdstat to determine the new state, and then handle any
events before calling select/poll again.  By doing this, or by
closing the file, the program indicates that it has done all that it
plans to do about the event.  (A sketch of such a handler follows the
patch.)

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c                |  125 +++++++++++++++++++++++++++++++++++++--
 ./drivers/md/multipath.c         |   81 +++++++++++++++----------
 ./include/linux/raid/md_k.h      |    4 +
 ./include/linux/raid/multipath.h |    1 
 4 files changed, 173 insertions(+), 38 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2004-09-03 11:34:26.000000000 +1000
+++ ./drivers/md/md.c   2004-09-03 11:37:25.000000000 +1000
@@ -37,6 +37,7 @@
 #include <linux/devfs_fs_kernel.h>
 #include <linux/buffer_head.h> /* for invalidate_bdev */
 #include <linux/suspend.h>
+#include <linux/poll.h>

 #include <linux/init.h>
@@ -124,6 +125,48 @@ static ctl_table raid_root_table[] = {

 static struct block_device_operations md_fops;

+
+/*
+ * We have a system wide 'event count' that is incremented
+ * on any 'interesting' event, and readers of /proc/mdstat
+ * can use 'poll' or 'select' to find out when the event
+ * count increases
+ *
+ * Events are:
+ *  start array, stop array, error, add device, remove device,
+ *  start build, activate spare
+ */
+DECLARE_WAIT_QUEUE_HEAD(md_event_waiters);
+EXPORT_SYMBOL(md_event_waiters);
+static atomic_t md_event_count;
+int md_new_event(void)
+{
+       atomic_inc(&md_event_count);
+       wake_up(&md_event_waiters);
+       return atomic_read(&md_event_count);
+}
+EXPORT_SYMBOL(md_new_event);
+struct mdstat_info {
+       struct list_head list;  /* all active files linked together */
+       unsigned long seen, reading, acknowledged;
+};
+static LIST_HEAD(readers);
+static spinlock_t readers_lock = SPIN_LOCK_UNLOCKED;
+
+int md_event_reached(unsigned long ev)
+{
+       /* returns true when all readers have acknowledged event 'ev' */
+       struct mdstat_info *mi;
+       int rv = 1;
+       spin_lock(&readers_lock);
+       list_for_each_entry(mi, &readers, list)
+               if (mi->reading > 0 && mi->acknowledged < ev)
+                       rv = 0;
+       spin_unlock(&readers_lock);
+       return rv;
+}
+EXPORT_SYMBOL(md_event_reached);
+
 /*
  * Enables to iterate over all existing md arrays
  * all_mddevs_lock protects this list.
@@ -1717,6 +1760,7 @@ static int do_md_run(mddev_t * mddev)
        mddev->queue->issue_flush_fn = md_flush_all;

        mddev->changed = 1;
+       md_new_event();
        return 0;
 }
@@ -1821,6 +1865,7 @@ static int do_md_stop(mddev_t * mddev, i
                        printk(KERN_INFO "md: %s switched to read-only mode.\n",
                                mdname(mddev));
                err = 0;
+               md_new_event();
 out:
        return err;
 }
@@ -2235,6 +2280,7 @@ static int hot_remove_disk(mddev_t * mdd

        kick_rdev_from_array(rdev);
        md_update_sb(mddev);
+       md_new_event();

        return 0;
 busy:
@@ -2325,7 +2371,7 @@ static int hot_add_disk(mddev_t * mddev,
         */
        set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
        md_wakeup_thread(mddev->thread);
-
+       md_new_event();
        return 0;

 abort_unbind_export:
@@ -2941,6 +2987,7 @@ void md_error(mddev_t *mddev, mdk_rdev_t
        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
        set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
        md_wakeup_thread(mddev->thread);
+       md_new_event();
 }

 /* seq_file implementation /proc/mdstat */
@@ -3087,6 +3134,7 @@ static int md_seq_show(struct seq_file *
        sector_t size;
        struct list_head *tmp2;
        mdk_rdev_t *rdev;
+       struct mdstat_info *mi = seq->private;
        int i;

        if (v == (void*)1) {
@@ -3097,11 +3145,13 @@ static int md_seq_show(struct seq_file *
                                seq_printf(seq, "[%s] ", pers[i]->name);

                spin_unlock(&pers_lock);
-               seq_printf(seq, "\n");
+               mi->reading = atomic_read(&md_event_count);
+               seq_printf(seq, "\nEvent: %-20ld\n", mi->reading);
                return 0;
        }
        if (v == (void*)2) {
                status_unused(seq);
+               mi->seen = mi->reading;
                return 0;
        }
@@ -3160,19 +3210,74 @@ static struct seq_operations md_seq_ops
        .show  = md_seq_show,
 };

+
 static int md_seq_open(struct inode *inode, struct file *file)
 {
        int error;
+       struct mdstat_info *mi = kmalloc(sizeof(*mi), GFP_KERNEL);
+       if (mi == NULL)
+               return -ENOMEM;

        error = seq_open(file, &md_seq_ops);
+       if (error)
+               kfree(mi);
+       else {
+               struct seq_file *p = file->private_data;
+               p->private = mi;
+               mi->acknowledged = mi->seen = mi->reading = 0;
+               if (file->f_mode & FMODE_WRITE) {
+                       spin_lock(&readers_lock);
+                       list_add(&mi->list, &readers);
+                       spin_unlock(&readers_lock);
+               } else
+                       INIT_LIST_HEAD(&mi->list);
+       }
        return error;
 }

+static int md_seq_release(struct inode *inode, struct file *file)
+{
+       struct seq_file *m = (struct seq_file*)file->private_data;
+       struct mdstat_info *mi = m->private;
+
+       m->private = NULL;
+       if (!list_empty(&mi->list)) {
+               spin_lock(&readers_lock);
+               list_del(&mi->list);
+               spin_unlock(&readers_lock);
+       }
+       kfree(mi);
+       wake_up(&md_event_waiters);
+       return seq_release(inode, file);
+}
+
+static unsigned int
+mdstat_poll(struct file *filp, poll_table *wait)
+{
+       struct seq_file *m = (struct seq_file*)filp->private_data;
+       struct mdstat_info *mi = m->private;
+       int mask;
+
+       if (mi->acknowledged != mi->seen) {
+               mi->acknowledged = mi->seen;
+               wake_up(&md_event_waiters);
+       }
+       poll_wait(filp, &md_event_waiters, wait);
+
+       /* always allow read */
+       mask = POLLIN | POLLRDNORM;
+
+       if (mi->seen < atomic_read(&md_event_count))
+               mask |= POLLPRI;
+       return mask;
+}
+
 static struct file_operations md_seq_fops = {
        .open           = md_seq_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
-       .release        = seq_release,
+       .release        = md_seq_release,
+       .poll           = mdstat_poll,
 };

 int register_md_personality(int pnum, mdk_personality_t *p)
@@ -3418,6 +3523,11 @@ static void md_do_sync(mddev_t *mddev)
                atomic_add(sectors, &mddev->recovery_active);
                j += sectors;
                if (j>1) mddev->curr_resync = j;
+               if (last_check==0)
+                       /* This is the earliest that rebuild will be
+                        * visible in /proc/mdstat
+                        */
+                       md_new_event();

                if (last_check + window > j || j == max_sectors)
                        continue;
@@ -3569,6 +3679,7 @@ void md_check_recovery(mddev_t *mddev)
                        /* flag recovery needed just to double check */
                        set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                        wake_up(&resync_wait);
+                       md_new_event();
                        goto unlock;
                }
                if (mddev->recovery) {
@@ -3595,9 +3706,10 @@ void md_check_recovery(mddev_t *mddev)
                        ITERATE_RDEV(mddev,rdev,rtmp)
                                if (rdev->raid_disk < 0
                                    && !rdev->faulty) {
-                                       if (mddev->pers->hot_add_disk(mddev,rdev))
+                                       if (mddev->pers->hot_add_disk(mddev,rdev)) {
                                                spares++;
-                                       else
+                                               md_new_event();
+                                       } else
                                                break;
                                }
                }
@@ -3622,6 +3734,7 @@ void md_check_recovery(mddev_t *mddev)
                        } else {
                                md_wakeup_thread(mddev->sync_thread);
                        }
+                       md_new_event();
                }
 unlock:
                mddev_unlock(mddev);
@@ -3664,7 +3777,7 @@ static void md_geninit(void)

        dprintk("md: sizeof(mdp_super_t) = %d\n", (int)sizeof(mdp_super_t));

-       p = create_proc_entry("mdstat", S_IRUGO, NULL);
+       p = create_proc_entry("mdstat", S_IRUGO|S_IWUSR, NULL);
        if (p)
                p->proc_fops = &md_seq_fops;
 }

diff ./drivers/md/multipath.c~current~ ./drivers/md/multipath.c
--- ./drivers/md/multipath.c~current~   2004-09-03 11:34:26.000000000 +1000
+++ ./drivers/md/multipath.c    2004-09-03 11:34:38.000000000 +1000
@@ -54,15 +54,20 @@ static void mp_pool_free(void *mpb, void
        kfree(mpb);
 }

-static int multipath_map (multipath_conf_t *conf)
+static int multipath_map (multipath_conf_t *conf, int retry)
 {
        int i, disks = conf->raid_disks;
+       int failed = -1;

        /*
-        * Later we do read balancing on the read side
-        * now we use the first available disk.
+        * We cannot tell the difference between a bad device and a bad block.
+        * So, while we always prefer a path that hasn't seen a failure,
+        * if there is no other option we accept a device which has seen
+        * a failure, but only on the first attempt, never on a retry.
+        * Normally, devices which have failed are removed from the array,
+        * but this does not happen for the last device.
         */
-
+ retry:
        spin_lock_irq(&conf->device_lock);
        for (i = 0; i < disks; i++) {
                mdk_rdev_t *rdev = conf->multipaths[i].rdev;
@@ -71,9 +76,23 @@ static int multipath_map (multipath_conf
                        spin_unlock_irq(&conf->device_lock);
                        return i;
                }
+               if (rdev) failed = i;
+       }
+       if (!retry && failed >= 0) {
+               mdk_rdev_t *rdev = conf->multipaths[failed].rdev;
+               atomic_inc(&rdev->nr_pending);
+               spin_unlock_irq(&conf->device_lock);
+               return failed;
        }
        spin_unlock_irq(&conf->device_lock);

+       /* sync with any monitoring daemon */
+       wait_event(md_event_waiters,
+                  md_event_reached(conf->last_fail_event) ||
+                  conf->working_disks >= 1);
+       if (conf->working_disks)
+               goto retry;
+
        printk(KERN_ERR "multipath_map(): no more operational IO paths?\n");
        return (-1);
 }
@@ -186,7 +205,7 @@ static int multipath_make_request (reque
                disk_stat_add(mddev->gendisk, read_sectors, bio_sectors(bio));
        }

-       mp_bh->path = multipath_map(conf);
+       mp_bh->path = multipath_map(conf, 0);
        if (mp_bh->path < 0) {
                bio_endio(bio, bio->bi_size, -EIO);
                mempool_free(mp_bh, conf->pool);
@@ -250,32 +269,23 @@ static void multipath_error (mddev_t *md
 {
        multipath_conf_t *conf = mddev_to_conf(mddev);

-       if (conf->working_disks <= 1) {
-               /*
-                * Uh oh, we can do nothing if this is our last path, but
-                * first check if this is a queued request for a device
-                * which has just failed.
-                */
-               printk(KERN_ALERT
-                      "multipath: only one IO path left and IO error.\n");
-               /* leave it active... it's all we have */
-       } else {
-               /*
-                * Mark disk as unusable
-                */
-               if (!rdev->faulty) {
-                       char b[BDEVNAME_SIZE];
-                       rdev->in_sync = 0;
-                       rdev->faulty = 1;
-                       mddev->sb_dirty = 1;
-                       conf->working_disks--;
-                       printk(KERN_ALERT "multipath: IO failure on %s,"
-                              " disabling IO path.\n Operation continuing"
-                              " on %d IO paths.\n",
-                              bdevname (rdev->bdev,b),
-                              conf->working_disks);
-               }
+       conf->last_fail_event = md_new_event();
+       /*
+        * Mark disk as unusable
+        */
+       if (!rdev->faulty) {
+               char b[BDEVNAME_SIZE];
+               rdev->in_sync = 0;
+               rdev->faulty = 1;
+               mddev->sb_dirty = 1;
+               conf->working_disks--;
+               printk(KERN_ALERT "multipath: IO failure on %s,"
+                      " disabling IO path.\n Operation continuing"
+                      " on %d IO paths.\n",
+                      bdevname (rdev->bdev,b),
+                      conf->working_disks);
        }
+
 }

 static void print_multipath_conf (multipath_conf_t *conf)
@@ -347,10 +357,17 @@ static int multipath_remove_disk(mddev_t
        print_multipath_conf(conf);

        spin_lock_irq(&conf->device_lock);
+       if (conf->working_disks == 0)
+               /* don't remove devices when none seem to be working.
+                * We want to have somewhere to keep trying in case it
+                * is really a device error
+                */
+               goto abort;
        if (p->rdev) {
                if (p->rdev->in_sync ||
                    atomic_read(&p->rdev->nr_pending)) {
-                       printk(KERN_ERR "hot-remove-disk, slot %d is identified" " but is still operational!\n", number);
+                       printk(KERN_ERR "hot-remove-disk, slot %d is identified"
+                              " but is still operational!\n", number);
                        err = -EBUSY;
                        goto abort;
                }
@@ -397,7 +414,7 @@ static void multipathd (mddev_t *mddev)
                bio = &mp_bh->bio;
                bio->bi_sector = mp_bh->master_bio->bi_sector;

-               if ((mp_bh->path = multipath_map (conf))<0) {
+               if ((mp_bh->path = multipath_map(conf, 1))<0) {
                        printk(KERN_ALERT "multipath: %s: unrecoverable IO read"
                               " error for block %llu\n",
                               bdevname(bio->bi_bdev,b),

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~        2004-09-03 11:34:26.000000000 +1000
+++ ./include/linux/raid/md_k.h 2004-09-03 11:34:38.000000000 +1000
@@ -360,5 +360,9 @@ do {                                                    \
        __wait_event_lock_irq(wq, condition, lock, cmd);        \
 } while (0)

+extern int md_new_event(void);
+extern wait_queue_head_t md_event_waiters;
+extern int md_event_reached(unsigned long ev);
+
 #endif

diff ./include/linux/raid/multipath.h~current~ ./include/linux/raid/multipath.h
--- ./include/linux/raid/multipath.h~current~   2004-09-03 11:34:26.000000000 +1000
+++ ./include/linux/raid/multipath.h    2004-09-03 11:34:38.000000000 +1000
@@ -13,6 +13,7 @@ struct multipath_private_data {
        int                     raid_disks;
        int                     working_disks;
        spinlock_t              device_lock;
+       unsigned long           last_fail_event;

        mempool_t               *pool;
 };
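A sketch of the handler side described above, under these assumptions:
mdstat marks failed devices with "(F)", and the mdadm command and
device names are hypothetical placeholders for whatever remedial
policy the daemon implements.  This is illustrative only, not part of
the patch.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
        char buf[8192];
        fd_set xfds;
        ssize_t n, total;
        int fd;

        /* Opening O_RDWR (root only) tells md that we will handle
         * events; on last-path failure multipath waits for us. */
        fd = open("/proc/mdstat", O_RDWR);
        if (fd < 0) {
                perror("/proc/mdstat");
                return 1;
        }

        for (;;) {
                /* Read the whole file to learn the new state. */
                lseek(fd, 0, SEEK_SET);
                total = 0;
                while ((n = read(fd, buf + total,
                                 sizeof(buf) - 1 - total)) > 0)
                        total += n;
                buf[total] = '\0';

                /* Take remedial action before waiting again, e.g.
                 * re-add a known-good path (hypothetical names). */
                if (strstr(buf, "(F)"))
                        system("mdadm /dev/md0 --add /dev/sdb1");

                /* Selecting for an exceptional condition again tells
                 * md that the previous event has been handled. */
                FD_ZERO(&xfds);
                FD_SET(fd, &xfds);
                if (select(fd + 1, NULL, NULL, &xfds, NULL) < 0) {
                        perror("select");
                        return 1;
                }
        }
}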
* Re: [PATCH md 1 of 2] Add interface to md driver for userspace monitoring of events.
From: Andrew Morton @ 2004-09-03 10:19 UTC
To: NeilBrown; +Cc: linux-raid, Christoph Hellwig

NeilBrown <neilb@cse.unsw.edu.au> wrote:
>
> Every interesting md event (device failure, rebuild, add, remove, etc.)
> gets treated as 'urgent data' on /proc/mdstat: it causes select to
> return when waiting for exceptions, and poll to return when waiting
> for PRIority data.
>
> To collect an event, the program must re-read /proc/mdstat from start
> to finish, and then select/poll on the file descriptor again (or close
> it).
>
> Events aren't returned as a list of individual events, only as a
> notification that something might have happened; reading /proc/mdstat
> should show what it was.
>
> If a program opens /proc/mdstat with O_RDWR, it signals that it
> intends to handle events.  In some cases the md driver might want to
> wait for an event to be handled before deciding what to do next.  For
> example, when the last path of a multipath fails, the md driver could
> either fail all requests, or wait for a working path to be provided.
> It can do this by waiting for the failure event to be handled, and
> then making the decision.  A program that is handling events should
> read /proc/mdstat to determine the new state, and then handle any
> events before calling select/poll again.  By doing this, or by
> closing the file, the program indicates that it has done all that it
> plans to do about the event.

Christoph points out that this is fairly wild procfs abuse.  We want to
be moving away from that sort of thing, not adding to it.

Is it possible to use rml's new event stuff from rc1-mm3's
kernel-sysfs-events-layer.patch?  Or a bare netlink interface?  Or
raidfs?
* Re: [PATCH md 1 of 2] Add interface to md driver for userspace monitoring of events.
From: Neil Brown @ 2004-09-06  1:05 UTC
To: Andrew Morton; +Cc: linux-raid, Christoph Hellwig

On Friday September 3, akpm@osdl.org wrote:
> Christoph points out that this is fairly wild procfs abuse.  We want to
> be moving away from that sort of thing, not adding to it.

I guess it depends on what you mean by "wild procfs abuse"...
Given that /proc/mdstat already exists, it doesn't seem too
unreasonable to add a little functionality to it.  How much does it
hurt?

> Is it possible to use rml's new event stuff from rc1-mm3's
> kernel-sysfs-events-layer.patch?  Or a bare netlink interface?  Or
> raidfs?

sysfs:
  Probably.  I would like to improve sysfs support for md, but I
  haven't taken the plunge yet to figure out how it all works.
  kevents may well be ok, but as you may need to handle
  multipath-failed events in times of high memory pressure, and need
  to handle them before the pressure can be relieved, I prefer to
  minimise the number of kmallocs needed to get the event to
  userspace.

bare netlink:
  Probably prefer sysfs...  Funny, but I remember reading a comment in
  the code (2.4 I think, ages ago) about netlink being deprecated or
  something like that, so I never bothered looking at it.  I wonder
  what it meant.

raidfs:
  I've thought about that a couple of times, but I don't think it
  would really gain us anything.  I'm not a big fan of "lots of little
  filesystems"; it sounds like an admin's nightmare.

mdadm already uses this (if it is available), and Red Hat ship it in
their 2.4 kernel (particularly for multipath support).  I know that
isn't a very strong argument, but given that the abuse already exists
(in that mdstat exists), is it sufficient argument to justify a small
extension to that abuse?

NeilBrown
* Re: [PATCH md 1 of 2] Add interface to md driver for userspace monitoring of events.
From: Christoph Hellwig @ 2004-09-06 14:02 UTC
To: Neil Brown; +Cc: Andrew Morton, linux-raid, Christoph Hellwig

On Mon, Sep 06, 2004 at 11:05:54AM +1000, Neil Brown wrote:
> > Christoph points out that this is fairly wild procfs abuse.  We want
> > to be moving away from that sort of thing, not adding to it.
>
> I guess it depends on what you mean by "wild procfs abuse"...
> Given that /proc/mdstat already exists, it doesn't seem too
> unreasonable to add a little functionality to it.  How much does it
> hurt?

poll on procfs is definitely abuse, even by the usual standards of
procfs use.

> > Is it possible to use rml's new event stuff from rc1-mm3's
> > kernel-sysfs-events-layer.patch?  Or a bare netlink interface?  Or
> > raidfs?
>
> sysfs:
>   Probably.  I would like to improve sysfs support for md, but I
>   haven't taken the plunge yet to figure out how it all works.
>   kevents may well be ok, but as you may need to handle
>   multipath-failed events in times of high memory pressure, and need
>   to handle them before the pressure can be relieved, I prefer to
>   minimise the number of kmallocs needed to get the event to
>   userspace.
>
> bare netlink:
>   Probably prefer sysfs...  Funny, but I remember reading a comment in
>   the code (2.4 I think, ages ago) about netlink being deprecated or
>   something like that, so I never bothered looking at it.  I wonder
>   what it meant.

netlink isn't deprecated; netlink_dev (aka netlink as character-device
nodes) is.
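For comparison, a bare-netlink listener of the kind being suggested
looks roughly like the sketch below.  It assumes the
NETLINK_KOBJECT_UEVENT protocol and the broadcast group used by the
kernel events layer (the constant is defined here in case the headers
at hand lack it); details in rc1-mm3 may differ.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

#ifndef NETLINK_KOBJECT_UEVENT
#define NETLINK_KOBJECT_UEVENT 15       /* assumed protocol number */
#endif

int main(void)
{
        struct sockaddr_nl nl;
        char buf[1024];
        ssize_t n;
        int fd;

        fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
        if (fd < 0) {
                perror("socket");
                return 1;
        }
        memset(&nl, 0, sizeof(nl));
        nl.nl_family = AF_NETLINK;
        nl.nl_pid = getpid();
        nl.nl_groups = 1;               /* kernel broadcast group */
        if (bind(fd, (struct sockaddr *)&nl, sizeof(nl)) < 0) {
                perror("bind");
                return 1;
        }

        /* Each datagram is one event, e.g. "add@/block/md0". */
        while ((n = recv(fd, buf, sizeof(buf) - 1, 0)) > 0) {
                buf[n] = '\0';
                printf("event: %s\n", buf);
        }
        return 0;
}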
* Re: [PATCH md 1 of 2] Add interface to md driver for userspace monitoring of events.
From: Neil Brown @ 2004-09-06 22:53 UTC
To: Christoph Hellwig; +Cc: Andrew Morton, linux-raid

On Monday September 6, hch@lst.de wrote:
> poll on procfs is definitely abuse, even by the usual standards of
> procfs use.

Oh well..  If you gain more from the patch not going in than I gain
from it going in, so be it.  I'll leave it up to Andrew/Linus.

> netlink isn't deprecated; netlink_dev (aka netlink as character-device
> nodes) is.

Aha.. that makes sense.  Thanks for clarifying that for me.

NeilBrown
* Re: [PATCH md 1 of 2] Add interface to md driver for userspace monitoring of events.
From: Andrew Morton @ 2004-09-06 23:07 UTC
To: Neil Brown; +Cc: hch, linux-raid

Neil Brown <neilb@cse.unsw.edu.au> wrote:
> > poll on procfs is definitely abuse, even by the usual standards of
> > procfs use.
>
> Oh well..  If you gain more from the patch not going in than I gain
> from it going in, so be it.

That's not the point.  Sure, the code works and is useful.  But it
takes a crappy interface and makes it crappier.

You know where we'd end up if we took that attitude all the time, yes?

Could we explore a medium-term plan to tidy all this up?
* Re: [PATCH md 1 of 2] Add interface to md driver for userspace monitoring of events.
From: Neil Brown @ 2004-09-07  0:53 UTC
To: Andrew Morton; +Cc: hch, linux-raid

On Monday September 6, akpm@osdl.org wrote:
> That's not the point.  Sure, the code works and is useful.  But it
> takes a crappy interface and makes it crappier.

I'm sorry, but I don't see the "crap".

I want a way to tell user-space "it is worth reading from this file
again", and "select" seems to be the most obvious interface to use
for that.  It is true that the usage is not exactly the same as with
a normal file, but things in /proc (or sysfs) simply are not normal
files.  One man's fish is another man's poisson???

Why is sending a message over a netlink socket intrinsically better
than returning from a select?

Looking at the comment with the sysfs event layer, I see that it has
recently been trimmed down from sending a payload to sending a single
word in the context of a particular sysfs file.  This is starting to
look suspiciously like dnotify (or maybe inotify).  So why a new
event layer?  Why not leverage dnotify concepts?

And back to the sysfs events: what values can the "signal" word have?
Is it a state, or a change, or a request?  If I have a situation that
needs attention, do I say so once, or every tick, or every HZ?  How
do I know when there is a new listener?  How do I respond and say
"that is dealt with"?  Can I send a matching signal down the netlink,
or would it have to be completely out-of-band?

I thought about this when I did the select-on-/proc/mdstat thing.  I
don't think having the kernel say "what happened" is a good thing.
It is too easy to lose those messages, and it isn't clear what sort
of messages they should be.  It is much simpler and, I think, more
reliable to have the kernel say "something happened over here", and
have the listener take a look to see what of interest has changed.

Maybe I'm saying that dnotify/inotify/sysfs-events should be
discarded in favour of something that encompasses them all: whenever
anything happens to an object on which events have been requested, a
message with just the name of the object gets sent on some netlink
socket.  (It would help if I could find some documentation on
netlink.  Any pointers?)

> You know where we'd end up if we took that attitude all the time,
> yes?

Obviously you don't want to accept "crap".  But who can say where the
"crap" is?  Obviously we need a design authority, and equally
obviously, that is you and Linus at the moment.  Which is why I
explicitly said I would accept your decision.

> Could we explore a medium-term plan to tidy all this up?

If "all this" is poor design in user-kernel interfaces, then you need
to say what is acceptable and what isn't, and set a flag day/version
when all the non-acceptable stuff will be removed.  Until you do
that, there will be mess.

If by "all this" you mean signalling changes in the state of md
devices, then I'm happy for you to reject the patch if you don't like
it.  Red Hat can patch their kernels or drop it too as they choose.
I can live without it.

Meanwhile I'll try to learn a bit more about sysfs and netlink and
see if I can come up with something that I am happy with and that
seems to fit the current preferred approach.

NeilBrown
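The dnotify mechanism mentioned above works on exactly that
"something happened over here, go and look" model, though it only
watches directories.  A minimal sketch follows; it is an analogy
only (the watched path is hypothetical, and procfs itself does not
generate dnotify events).

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t changed;

static void on_notify(int sig, siginfo_t *si, void *ctx)
{
        changed = 1;    /* si->si_fd identifies the watched directory */
}

int main(void)
{
        struct sigaction sa;
        int fd = open("/some/dir", O_RDONLY);   /* hypothetical path */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        sa.sa_sigaction = on_notify;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGRTMIN, &sa, NULL);

        /* Ask for a real-time signal on every modification, forever. */
        fcntl(fd, F_SETSIG, SIGRTMIN);
        fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_MULTISHOT);

        for (;;) {
                pause();
                if (changed) {
                        changed = 0;
                        printf("directory changed; re-scan to see what\n");
                }
        }
}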
* Re: [PATCH md 1 of 2] Add interface to md driver for userspace monitoring of events.
From: Christoph Hellwig @ 2004-09-07 10:45 UTC
To: Neil Brown; +Cc: Andrew Morton, hch, linux-raid

On Tue, Sep 07, 2004 at 10:53:39AM +1000, Neil Brown wrote:
> On Monday September 6, akpm@osdl.org wrote:
> > That's not the point.  Sure, the code works and is useful.  But it
> > takes a crappy interface and makes it crappier.
>
> I'm sorry, but I don't see the "crap".

If you want poll/select, do a character device.