* [PATCH md ] When resizing an array, we need to update resync_max_sectors as well as size. [not found] <20050717182650.24540.patches@notabene> @ 2005-07-17 8:27 ` NeilBrown 2005-07-17 12:10 ` Found a new bug! djani22 0 siblings, 1 reply; 20+ messages in thread From: NeilBrown @ 2005-07-17 8:27 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-raid Another md patch against 2.6.13-rc2-mm2, suitable for 2.6.13. Thanks, NeilBrown ### Comments for Changeset Without this, an attempt to 'grow' an array will claim to have synced the extra part without actually having done anything. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> ### Diffstat output ./drivers/md/raid1.c | 1 + ./drivers/md/raid5.c | 1 + ./drivers/md/raid6main.c | 1 + 3 files changed, 3 insertions(+) diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c --- ./drivers/md/raid1.c~current~ 2005-07-17 18:25:47.000000000 +1000 +++ ./drivers/md/raid1.c 2005-07-17 17:18:13.000000000 +1000 @@ -1467,6 +1467,7 @@ static int raid1_resize(mddev_t *mddev, set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); } mddev->size = mddev->array_size; + mddev->resync_max_sectors = sectors; return 0; } diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2005-07-17 18:25:47.000000000 +1000 +++ ./drivers/md/raid5.c 2005-07-17 18:25:52.000000000 +1000 @@ -1931,6 +1931,7 @@ static int raid5_resize(mddev_t *mddev, set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); } mddev->size = sectors /2; + mddev->resync_max_sectors = sectors; return 0; } diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c --- ./drivers/md/raid6main.c~current~ 2005-07-17 18:25:47.000000000 +1000 +++ ./drivers/md/raid6main.c 2005-07-17 17:19:04.000000000 +1000 @@ -2095,6 +2095,7 @@ static int raid6_resize(mddev_t *mddev, set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); } mddev->size = sectors /2; + mddev->resync_max_sectors = sectors; return 0; } ^ permalink raw reply [flat|nested] 20+ messages in thread
* Found a new bug! 2005-07-17 8:27 ` [PATCH md ] When resizing an array, we need to update resync_max_sectors as well as size NeilBrown @ 2005-07-17 12:10 ` djani22 2005-07-17 22:13 ` Neil Brown 0 siblings, 1 reply; 20+ messages in thread From: djani22 @ 2005-07-17 12:10 UTC (permalink / raw) To: linux-raid

Hi all!

I think I found a new bug in the kernel! (or mdadm?)

First I tried this:

mkraid --configfile /etc/raidtab.nw /dev/md0 -R
DESTROYING the contents of /dev/md0 in 5 seconds, Ctrl-C if unsure!
handling MD device /dev/md0
analyzing super-block
couldn't get device size for /dev/md31 -- File too large
mkraid: aborted.
(In addition to the above messages, see the syslog and /proc/mdstat as well for potential clues.)

Next I tried this:

./create_linear
mdadm: /dev/md31 appears to be part of a raid array:
    level=0 devices=1 ctime=Sun Jul 17 13:30:27 2005
Continue creating array? y
./create_linear: line 1: 2853 Segmentation fault mdadm --create /dev/md0 --chunk=32 --level=linear --force --raid-devices=1 /dev/md31

After this little script, half of the raid subsystem hangs: raidtools does nothing, mdadm does nothing either, AND even cat /proc/mdstat hangs! But the /dev/md31 device is still working.

mdstat from the previous 2s (watch cat /proc/mdstat):

Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [faulty]
md31 : active raid0 md4[3] md3[2] md2[1] md1[0]
      7814332928 blocks 32k chunks
md4 : active raid1 nbd3[0]
      1953583296 blocks [2/1] [U_]
md3 : active raid1 nbd2[0]
      1953583296 blocks [2/1] [U_]
md2 : active raid1 nbd1[0]
      1953583296 blocks [2/1] [U_]
md1 : active raid1 nbd0[0]
      1953583296 blocks [2/1] [U_]
unused devices: <none>

Kernel: 2.6.13-rc3
raidtools-1.00.3
mdadm-1.12.0

The background: I am trying to build a big array, ~8TB. I use 5 PCs for this: 4 as "disk nodes" with nbd and 1 as the "concentrator". (from a previous idea on this list ;) In the concentrator, the first raid level (md1-4) is there for the ability to back up and swap the disk nodes (node-spare). The next level (md31) is for performance. ;) And the last level (md0 linear) is for scalability.

Why not use LVM for the last level? Well, I tried that, but cat /dev/.../LV >/dev/null can do only 15-16 MB/s, while cat /dev/md31 >/dev/null can do 34-38 MB/s. (the network is G-Ethernet, but only 32bit/33MHz PCI!)

Thanks
Janos

----- Original Message -----
From: "NeilBrown" <neilb@cse.unsw.edu.au>
To: "Andrew Morton" <akpm@osdl.org>
Cc: <linux-raid@vger.kernel.org>
Sent: Sunday, July 17, 2005 10:27 AM
Subject: [PATCH md ] When resizing an array, we need to update resync_max_sectors as well as size.

> Another md patch against 2.6.13-rc2-mm2, suitable for 2.6.13.
> Thanks,
> NeilBrown
>
> ### Comments for Changeset
>
> Without this, and attempt to 'grow' an array will claim to have synced
> the extra part without actually having done anything.
> > Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> > > ### Diffstat output > ./drivers/md/raid1.c | 1 + > ./drivers/md/raid5.c | 1 + > ./drivers/md/raid6main.c | 1 + > 3 files changed, 3 insertions(+) > > diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c > --- ./drivers/md/raid1.c~current~ 2005-07-17 18:25:47.000000000 +1000 > +++ ./drivers/md/raid1.c 2005-07-17 17:18:13.000000000 +1000 > @@ -1467,6 +1467,7 @@ static int raid1_resize(mddev_t *mddev, > set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > } > mddev->size = mddev->array_size; > + mddev->resync_max_sectors = sectors; > return 0; > } > > > diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c > --- ./drivers/md/raid5.c~current~ 2005-07-17 18:25:47.000000000 +1000 > +++ ./drivers/md/raid5.c 2005-07-17 18:25:52.000000000 +1000 > @@ -1931,6 +1931,7 @@ static int raid5_resize(mddev_t *mddev, > set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > } > mddev->size = sectors /2; > + mddev->resync_max_sectors = sectors; > return 0; > } > > > diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c > --- ./drivers/md/raid6main.c~current~ 2005-07-17 18:25:47.000000000 +1000 > +++ ./drivers/md/raid6main.c 2005-07-17 17:19:04.000000000 +1000 > @@ -2095,6 +2095,7 @@ static int raid6_resize(mddev_t *mddev, > set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > } > mddev->size = sectors /2; > + mddev->resync_max_sectors = sectors; > return 0; > } > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Found a new bug! 2005-07-17 12:10 ` Found a new bug! djani22 @ 2005-07-17 22:13 ` Neil Brown 2005-07-17 22:31 ` djani22 2005-08-14 22:38 ` djani22 0 siblings, 2 replies; 20+ messages in thread From: Neil Brown @ 2005-07-17 22:13 UTC (permalink / raw) To: djani22; +Cc: linux-raid On Sunday July 17, djani22@dynamicweb.hu wrote: > Hi all! > > I think I found a new bug in the kernel ! (or mdadm?) Yes. With the current code you cannot have components of a 'linear' which are larger than 2^32 sectors. I'll try to put together a fix for this in the next day or so. NeilBrown ^ permalink raw reply [flat|nested] 20+ messages in thread
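To make the limit concrete, here is a minimal userspace sketch of the failure mode (the sizes below are made up, and a plain division by a u32 stands in for the kernel's sector_div()/do_div(), whose divisor is only 32 bits wide on the affected systems). Once a component's size no longer fits in 32 bits, the division in linear's which_dev() silently truncates the divisor and the hash lookup picks the wrong slot:

/* Userspace sketch, not kernel code: the component and offset are
 * invented values; a u32 divisor models sector_div()'s 32-bit divisor. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t component = 6ULL << 30;   /* a ~6TB component, in 1K blocks as md counts them */
	uint64_t block     = 5ULL << 30;   /* an offset that should land in hash slot 0 */
	uint32_t truncated = (uint32_t)component;   /* what a 32-bit divisor actually sees */

	printf("correct hash index:   %llu\n", (unsigned long long)(block / component));
	printf("truncated hash index: %llu\n", (unsigned long long)(block / truncated));
	return 0;
}

With these numbers the correct index is 0, but the truncated divisor yields 2, so the lookup walks off into the wrong part of the hash table.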
* Re: Found a new bug! 2005-07-17 22:13 ` Neil Brown @ 2005-07-17 22:31 ` djani22 0 siblings, 0 replies; 20+ messages in thread From: djani22 @ 2005-07-17 22:31 UTC (permalink / raw) To: linux-raid

----- Original Message -----
From: "Neil Brown" <neilb@cse.unsw.edu.au>
To: <djani22@dynamicweb.hu>
Cc: <linux-raid@vger.kernel.org>
Sent: Monday, July 18, 2005 12:13 AM
Subject: Re: Found a new bug!

> On Sunday July 17, djani22@dynamicweb.hu wrote:
> > Hi all!
> >
> > I think I found a new bug in the kernel ! (or mdadm?)
>
> Yes. With the current code you cannot have components of a 'linear'
> which are larger than 2^32 sectors. I'll try to put together a fix
> for this in the next day or so.
>
> NeilBrown

Thanks for the help!

One more question: I didn't find a usable way for me, but my system must start anyway.... I have created the XFS directly on the 8TB raid0 (/dev/md31), and the copy is now running... Is it possible in the future to convert it to be part of the planned linear array without backing up all the data?

Thanks.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Found a new bug! 2005-07-17 22:13 ` Neil Brown 2005-07-17 22:31 ` djani22 @ 2005-08-14 22:38 ` djani22 2005-08-15 1:21 ` Neil Brown 1 sibling, 1 reply; 20+ messages in thread From: djani22 @ 2005-08-14 22:38 UTC (permalink / raw) To: linux-raid

Hello list, Neil!

Is there any news on the 2TB raid component problem? Sooner or later, I will need to join two 8TB arrays into one big 16TB one. :-)

Thanks,
Janos

----- Original Message -----
From: "Neil Brown" <neilb@cse.unsw.edu.au>
To: <djani22@dynamicweb.hu>
Cc: <linux-raid@vger.kernel.org>
Sent: Monday, July 18, 2005 12:13 AM
Subject: Re: Found a new bug!

> On Sunday July 17, djani22@dynamicweb.hu wrote:
> > Hi all!
> >
> > I think I found a new bug in the kernel ! (or mdadm?)
>
> Yes. With the current code you cannot have components of a 'linear'
> which are larger than 2^32 sectors. I'll try to put together a fix
> for this in the next day or so.
>
> NeilBrown
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Found a new bug! 2005-08-14 22:38 ` djani22 @ 2005-08-15 1:21 ` Neil Brown 2005-08-15 10:50 ` djani22 0 siblings, 1 reply; 20+ messages in thread From: Neil Brown @ 2005-08-15 1:21 UTC (permalink / raw) To: djani22; +Cc: linux-raid On Monday August 15, djani22@dynamicweb.hu wrote: > Hello list, Neil! > > Is there something news with the 2TB raid-input problem? > Sooner or later, I will need to join two 8TB array to one big 16TB. :-) Thanks for the reminder. The following patch should work, but my test machine won't boot the current -mm kernels :-( so it is hard to test properly. Let me know the results if you are able to test it. Thanks, NeilBrown --------------------------------- Support md/linear array with components greater than 2 terabytes. linear currently uses division by the size of the smallest componenet device to find which device a request goes to. If that smallest device is larger than 2 terabytes, then the division will not work on some systems. So we introduce a pre-shift, and take care not to make the hash table too large, much like the code in raid0. Also get rid of conf->nr_zones, which is not needed. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> ### Diffstat output ./drivers/md/linear.c | 99 ++++++++++++++++++++++++++++-------------- ./include/linux/raid/linear.h | 4 - 2 files changed, 70 insertions(+), 33 deletions(-) diff ./drivers/md/linear.c~current~ ./drivers/md/linear.c --- ./drivers/md/linear.c~current~ 2005-08-15 11:18:21.000000000 +1000 +++ ./drivers/md/linear.c 2005-08-15 11:18:27.000000000 +1000 @@ -38,7 +38,8 @@ static inline dev_info_t *which_dev(mdde /* * sector_div(a,b) returns the remainer and sets a to a/b */ - (void)sector_div(block, conf->smallest->size); + block >>= conf->preshift; + (void)sector_div(block, conf->hash_spacing); hash = conf->hash_table[block]; while ((sector>>1) >= (hash->size + hash->offset)) @@ -47,7 +48,7 @@ static inline dev_info_t *which_dev(mdde } /** - * linear_mergeable_bvec -- tell bio layer if a two requests can be merged + * linear_mergeable_bvec -- tell bio layer if two requests can be merged * @q: request queue * @bio: the buffer head that's been built up so far * @biovec: the request that could be merged to it. @@ -116,7 +117,7 @@ static int linear_run (mddev_t *mddev) dev_info_t **table; mdk_rdev_t *rdev; int i, nb_zone, cnt; - sector_t start; + sector_t min_spacing; sector_t curr_offset; struct list_head *tmp; @@ -127,11 +128,6 @@ static int linear_run (mddev_t *mddev) memset(conf, 0, sizeof(*conf) + mddev->raid_disks*sizeof(dev_info_t)); mddev->private = conf; - /* - * Find the smallest device. - */ - - conf->smallest = NULL; cnt = 0; mddev->array_size = 0; @@ -159,8 +155,6 @@ static int linear_run (mddev_t *mddev) disk->size = rdev->size; mddev->array_size += rdev->size; - if (!conf->smallest || (disk->size < conf->smallest->size)) - conf->smallest = disk; cnt++; } if (cnt != mddev->raid_disks) { @@ -168,6 +162,36 @@ static int linear_run (mddev_t *mddev) goto out; } + min_spacing = mddev->array_size; + sector_div(min_spacing, PAGE_SIZE/sizeof(struct dev_info *)); + + /* min_spacing is the minimum spacing that will fit the hash + * table in one PAGE. This may be much smaller than needed. 
+ * We find the smallest non-terminal set of consecutive devices + * that is larger than min_spacing as use the size of that as + * the actual spacing + */ + conf->hash_spacing = mddev->array_size; + for (i=0; i < cnt-1 ; i++) { + sector_t sz = 0; + int j; + for (j=i; i<cnt-1 && sz < min_spacing ; j++) + sz += conf->disks[j].size; + if (sz >= min_spacing && sz < conf->hash_spacing) + conf->hash_spacing = sz; + } + + /* hash_spacing may be too large for sector_div to work with, + * so we might need to pre-shift + */ + conf->preshift = 0; + if (sizeof(sector_t) > sizeof(u32)) { + sector_t space = conf->hash_spacing; + while (space > (sector_t)(~(u32)0)) { + space >>= 1; + conf->preshift++; + } + } /* * This code was restructured to work around a gcc-2.95.3 internal * compiler error. Alter it with care. @@ -177,39 +201,52 @@ static int linear_run (mddev_t *mddev) unsigned round; unsigned long base; - sz = mddev->array_size; - base = conf->smallest->size; + sz = mddev->array_size >> conf->preshift; + sz += 1; /* force round-up */ + base = conf->hash_spacing >> conf->preshift; round = sector_div(sz, base); - nb_zone = conf->nr_zones = sz + (round ? 1 : 0); + nb_zone = sz + (round ? 1 : 0); } - - conf->hash_table = kmalloc (sizeof (dev_info_t*) * nb_zone, + BUG_ON(nb_zone > PAGE_SIZE / sizeof(struct dev_info *)); + + conf->hash_table = kmalloc (sizeof (struct dev_info *) * nb_zone, GFP_KERNEL); if (!conf->hash_table) goto out; /* * Here we generate the linear hash table + * First calculate the device offsets. */ + conf->disks[0].offset = 0; + for (i=1; i<mddev->raid_disks; i++) + conf->disks[i].offset = + conf->disks[i-1].offset + + conf->disks[i-1].size; + table = conf->hash_table; - start = 0; curr_offset = 0; - for (i = 0; i < cnt; i++) { - dev_info_t *disk = conf->disks + i; - - disk->offset = curr_offset; - curr_offset += disk->size; - - /* 'curr_offset' is the end of this disk - * 'start' is the start of table + i = 0; + for (curr_offset = 0; + curr_offset < mddev->array_size; + curr_offset += conf->hash_spacing) { + + while (i < mddev->raid_disks-1 && + curr_offset >= conf->disks[i+1].offset) + i++; + + *table ++ = conf->disks + i; + } + + if (conf->preshift) { + conf->hash_spacing >>= conf->preshift; + /* round hash_spacing up so that when we divide by it, + * we err on the side of "too-low", which is safest. */ - while (start < curr_offset) { - *table++ = disk; - start += conf->smallest->size; - } + conf->hash_spacing++; } - if (table-conf->hash_table != nb_zone) - BUG(); + + BUG_ON(table - conf->hash_table > nb_zone); blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec); mddev->queue->unplug_fn = linear_unplug; @@ -299,7 +336,7 @@ static void linear_status (struct seq_fi sector_t s = 0; seq_printf(seq, " "); - for (j = 0; j < conf->nr_zones; j++) + for (j = 0; j < mddev->raid_disks; j++) { char b[BDEVNAME_SIZE]; s += conf->smallest_size; diff ./include/linux/raid/linear.h~current~ ./include/linux/raid/linear.h --- ./include/linux/raid/linear.h~current~ 2005-08-15 11:18:21.000000000 +1000 +++ ./include/linux/raid/linear.h 2005-08-15 09:13:55.000000000 +1000 @@ -14,8 +14,8 @@ typedef struct dev_info dev_info_t; struct linear_private_data { dev_info_t **hash_table; - dev_info_t *smallest; - int nr_zones; + sector_t hash_spacing; + int preshift; /* shift before dividing by hash_spacing */ dev_info_t disks[0]; }; ^ permalink raw reply [flat|nested] 20+ messages in thread
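The heart of the patch is the pre-shift before the 32-bit division. A rough userspace model of the idea (a sketch, not the kernel code; div32() is a hypothetical stand-in for sector_div(), and the spacing/offset values are invented):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical stand-in for sector_div(): a division whose divisor
 * must fit in 32 bits. */
static uint64_t div32(uint64_t n, uint32_t base)
{
	return n / base;
}

static uint64_t hash_index(uint64_t block, uint64_t hash_spacing, int *preshift)
{
	uint32_t spacing;

	*preshift = 0;
	while ((hash_spacing >> *preshift) >= 0xffffffffULL)
		(*preshift)++;             /* shift until the divisor fits in 32 bits */

	spacing = (uint32_t)(hash_spacing >> *preshift);
	if (*preshift)
		spacing++;                 /* round up, so we only ever err on the low side */

	return div32(block >> *preshift, spacing);
}

int main(void)
{
	uint64_t spacing = 6ULL << 30;     /* made-up hash_spacing, too big for a u32 */
	uint64_t block   = 13ULL << 30;
	int preshift;
	uint64_t idx = hash_index(block, spacing, &preshift);

	printf("preshift=%d index=%llu (exact: %llu)\n", preshift,
	       (unsigned long long)idx, (unsigned long long)(block / spacing));
	return 0;
}

Because the shifted divisor is rounded up, the computed index can only come out too low, which is why which_dev() then walks forward until the sector really falls inside the hashed disk.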
* Re: Found a new bug! 2005-08-15 1:21 ` Neil Brown @ 2005-08-15 10:50 ` djani22 2005-08-16 13:54 ` perfomance question djani22 2005-08-18 4:34 ` Found a new bug! Neil Brown 0 siblings, 2 replies; 20+ messages in thread From: djani22 @ 2005-08-15 10:50 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Thanks, I will test it, when I can... In this moment, my system is an working online system, and now only one 8TB space what I can use... Thats right, maybe I can built linear array from only one soure device,but: My first problem is, on my 8TB device is already exists XFS filesystem, with valuable data, what I can't backup. It is still OK, but I can't insert one raid layer, because the raid's superblock, and the XFS is'nt shrinkable. :-( The only one way (I think) to plug in another raw device, and build an array from 8TB-device + new small device, to get much space to FS. But it is too risky for me! Do you think it is safe? Currently I use 2.6.13-rc3. This patch is good for this version, or only the last version? Witch is the last? 2.6.13-rc6 or rc6-git7, or 2.6.14 -git cvs? :) Thanks, Janos ----- Original Message ----- From: "Neil Brown" <neilb@cse.unsw.edu.au> To: <djani22@dynamicweb.hu> Cc: <linux-raid@vger.kernel.org> Sent: Monday, August 15, 2005 3:21 AM Subject: Re: Found a new bug! > On Monday August 15, djani22@dynamicweb.hu wrote: > > Hello list, Neil! > > > > Is there something news with the 2TB raid-input problem? > > Sooner or later, I will need to join two 8TB array to one big 16TB. :-) > > Thanks for the reminder. > > The following patch should work, but my test machine won't boot the > current -mm kernels :-( so it is hard to test properly. > > Let me know the results if you are able to test it. > > Thanks, > NeilBrown > > --------------------------------- > Support md/linear array with components greater than 2 terabytes. > > linear currently uses division by the size of the smallest componenet > device to find which device a request goes to. > If that smallest device is larger than 2 terabytes, then the division > will not work on some systems. > > So we introduce a pre-shift, and take care not to make the hash table > too large, much like the code in raid0. > > Also get rid of conf->nr_zones, which is not needed. > > Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> > > ### Diffstat output > ./drivers/md/linear.c | 99 ++++++++++++++++++++++++++++-------------- > ./include/linux/raid/linear.h | 4 - > 2 files changed, 70 insertions(+), 33 deletions(-) > > diff ./drivers/md/linear.c~current~ ./drivers/md/linear.c > --- ./drivers/md/linear.c~current~ 2005-08-15 11:18:21.000000000 +1000 > +++ ./drivers/md/linear.c 2005-08-15 11:18:27.000000000 +1000 > @@ -38,7 +38,8 @@ static inline dev_info_t *which_dev(mdde > /* > * sector_div(a,b) returns the remainer and sets a to a/b > */ > - (void)sector_div(block, conf->smallest->size); > + block >>= conf->preshift; > + (void)sector_div(block, conf->hash_spacing); > hash = conf->hash_table[block]; > > while ((sector>>1) >= (hash->size + hash->offset)) > @@ -47,7 +48,7 @@ static inline dev_info_t *which_dev(mdde > } > > /** > - * linear_mergeable_bvec -- tell bio layer if a two requests can be merged > + * linear_mergeable_bvec -- tell bio layer if two requests can be merged > * @q: request queue > * @bio: the buffer head that's been built up so far > * @biovec: the request that could be merged to it. 
> @@ -116,7 +117,7 @@ static int linear_run (mddev_t *mddev) > dev_info_t **table; > mdk_rdev_t *rdev; > int i, nb_zone, cnt; > - sector_t start; > + sector_t min_spacing; > sector_t curr_offset; > struct list_head *tmp; > > @@ -127,11 +128,6 @@ static int linear_run (mddev_t *mddev) > memset(conf, 0, sizeof(*conf) + mddev->raid_disks*sizeof(dev_info_t)); > mddev->private = conf; > > - /* > - * Find the smallest device. > - */ > - > - conf->smallest = NULL; > cnt = 0; > mddev->array_size = 0; > > @@ -159,8 +155,6 @@ static int linear_run (mddev_t *mddev) > disk->size = rdev->size; > mddev->array_size += rdev->size; > > - if (!conf->smallest || (disk->size < conf->smallest->size)) > - conf->smallest = disk; > cnt++; > } > if (cnt != mddev->raid_disks) { > @@ -168,6 +162,36 @@ static int linear_run (mddev_t *mddev) > goto out; > } > > + min_spacing = mddev->array_size; > + sector_div(min_spacing, PAGE_SIZE/sizeof(struct dev_info *)); > + > + /* min_spacing is the minimum spacing that will fit the hash > + * table in one PAGE. This may be much smaller than needed. > + * We find the smallest non-terminal set of consecutive devices > + * that is larger than min_spacing as use the size of that as > + * the actual spacing > + */ > + conf->hash_spacing = mddev->array_size; > + for (i=0; i < cnt-1 ; i++) { > + sector_t sz = 0; > + int j; > + for (j=i; i<cnt-1 && sz < min_spacing ; j++) > + sz += conf->disks[j].size; > + if (sz >= min_spacing && sz < conf->hash_spacing) > + conf->hash_spacing = sz; > + } > + > + /* hash_spacing may be too large for sector_div to work with, > + * so we might need to pre-shift > + */ > + conf->preshift = 0; > + if (sizeof(sector_t) > sizeof(u32)) { > + sector_t space = conf->hash_spacing; > + while (space > (sector_t)(~(u32)0)) { > + space >>= 1; > + conf->preshift++; > + } > + } > /* > * This code was restructured to work around a gcc-2.95.3 internal > * compiler error. Alter it with care. > @@ -177,39 +201,52 @@ static int linear_run (mddev_t *mddev) > unsigned round; > unsigned long base; > > - sz = mddev->array_size; > - base = conf->smallest->size; > + sz = mddev->array_size >> conf->preshift; > + sz += 1; /* force round-up */ > + base = conf->hash_spacing >> conf->preshift; > round = sector_div(sz, base); > - nb_zone = conf->nr_zones = sz + (round ? 1 : 0); > + nb_zone = sz + (round ? 1 : 0); > } > - > - conf->hash_table = kmalloc (sizeof (dev_info_t*) * nb_zone, > + BUG_ON(nb_zone > PAGE_SIZE / sizeof(struct dev_info *)); > + > + conf->hash_table = kmalloc (sizeof (struct dev_info *) * nb_zone, > GFP_KERNEL); > if (!conf->hash_table) > goto out; > > /* > * Here we generate the linear hash table > + * First calculate the device offsets. 
> */ > + conf->disks[0].offset = 0; > + for (i=1; i<mddev->raid_disks; i++) > + conf->disks[i].offset = > + conf->disks[i-1].offset + > + conf->disks[i-1].size; > + > table = conf->hash_table; > - start = 0; > curr_offset = 0; > - for (i = 0; i < cnt; i++) { > - dev_info_t *disk = conf->disks + i; > - > - disk->offset = curr_offset; > - curr_offset += disk->size; > - > - /* 'curr_offset' is the end of this disk > - * 'start' is the start of table > + i = 0; > + for (curr_offset = 0; > + curr_offset < mddev->array_size; > + curr_offset += conf->hash_spacing) { > + > + while (i < mddev->raid_disks-1 && > + curr_offset >= conf->disks[i+1].offset) > + i++; > + > + *table ++ = conf->disks + i; > + } > + > + if (conf->preshift) { > + conf->hash_spacing >>= conf->preshift; > + /* round hash_spacing up so that when we divide by it, > + * we err on the side of "too-low", which is safest. > */ > - while (start < curr_offset) { > - *table++ = disk; > - start += conf->smallest->size; > - } > + conf->hash_spacing++; > } > - if (table-conf->hash_table != nb_zone) > - BUG(); > + > + BUG_ON(table - conf->hash_table > nb_zone); > > blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec); > mddev->queue->unplug_fn = linear_unplug; > @@ -299,7 +336,7 @@ static void linear_status (struct seq_fi > sector_t s = 0; > > seq_printf(seq, " "); > - for (j = 0; j < conf->nr_zones; j++) > + for (j = 0; j < mddev->raid_disks; j++) > { > char b[BDEVNAME_SIZE]; > s += conf->smallest_size; > > diff ./include/linux/raid/linear.h~current~ ./include/linux/raid/linear.h > --- ./include/linux/raid/linear.h~current~ 2005-08-15 11:18:21.000000000 +1000 > +++ ./include/linux/raid/linear.h 2005-08-15 09:13:55.000000000 +1000 > @@ -14,8 +14,8 @@ typedef struct dev_info dev_info_t; > struct linear_private_data > { > dev_info_t **hash_table; > - dev_info_t *smallest; > - int nr_zones; > + sector_t hash_spacing; > + int preshift; /* shift before dividing by hash_spacing */ > dev_info_t disks[0]; > }; > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* perfomance question. 2005-08-15 10:50 ` djani22 @ 2005-08-16 13:54 ` djani22 2005-08-16 14:30 ` RAID6 Query Colonel Hell 2005-08-18 4:59 ` perfomance question Neil Brown 0 siblings, 2 replies; 20+ messages in thread From: djani22 @ 2005-08-16 13:54 UTC (permalink / raw) To: linux-raid

Hello list,

I have a performance problem. (again) :-)

What chunk size is better in raid5 and raid0? Lots of small chunks, or fewer bigger ones? I know it depends on the FS, but I am asking only about the raid code! Which is better for reads, and which for writes?

Thanks
Janos

^ permalink raw reply [flat|nested] 20+ messages in thread
* RAID6 Query 2005-08-16 13:54 ` perfomance question djani22 @ 2005-08-16 14:30 ` Colonel Hell 2005-08-16 15:40 ` dean gaudet 0 siblings, 1 reply; 20+ messages in thread From: Colonel Hell @ 2005-08-16 14:30 UTC (permalink / raw) To: linux-raid

Hi,

I just went through a couple of papers describing RAID6. I don't know how relevant this discussion group is for the query ... but here I go :) ... I couldn't figure out why the P+Q configuration is better than P+Q' where Q' == P. What I mean is, instead of calculating a new checksum (through a lot of GF theory etc.), just store the parity block (P) again. In that case don't we have the same amount of fault tolerance? :-s ...

Let me know; here are the links which I went through.
http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
http://www.intel.com/design/storage/papers/308122.htm

Regards,
Amritanshu.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: RAID6 Query 2005-08-16 14:30 ` RAID6 Query Colonel Hell @ 2005-08-16 15:40 ` dean gaudet 2005-08-16 16:44 ` Colonel Hell 0 siblings, 1 reply; 20+ messages in thread From: dean gaudet @ 2005-08-16 15:40 UTC (permalink / raw) To: Colonel Hell; +Cc: linux-raid On Tue, 16 Aug 2005, Colonel Hell wrote: > I just went thru a couple of papers describing RAID6. > I dunno how relevant this discussion grp is for the qry ...but here I go :) ... > I couldnt figure out why is P+Q configuration better over P+q' where > q' == P. What I mean is instead of calculating a new checksum (thru a > lot of GF theory etc) just store the parity block (P)again. In this > case as well we have the same amount of fault tolerance or not > :-s ... this is no better than raid5 at surviving a two disk failure. i.e. consider the case of two data blocks missing -- you can't reconstruct if all you have is parity. -dean ^ permalink raw reply [flat|nested] 20+ messages in thread
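A toy illustration of the point (a hedged sketch with made-up data bytes, not the kernel's optimized raid6 code; the arithmetic assumes the usual GF(2^8) field with the 0x11d polynomial and generator {02} described in the papers linked above). P plus a Reed-Solomon Q give two independent equations, so two missing data blocks can be solved for; a second copy of P would only repeat the first equation:

#include <stdio.h>
#include <stdint.h>

/* GF(2^8) multiply, reducing by the 0x11d polynomial (0x1d once the
 * overflow bit is dropped). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t p = 0;
	while (b) {
		if (b & 1)
			p ^= a;
		b >>= 1;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
	}
	return p;
}

static uint8_t gf_pow(uint8_t a, int n)
{
	uint8_t r = 1;
	while (n--)
		r = gf_mul(r, a);
	return r;
}

static uint8_t gf_inv(uint8_t a)   /* brute force; fine for a demo */
{
	int i;
	for (i = 1; i < 256; i++)
		if (gf_mul(a, (uint8_t)i) == 1)
			return (uint8_t)i;
	return 0;
}

int main(void)
{
	uint8_t d[4] = { 0xde, 0xad, 0xbe, 0xef };   /* one byte per data disk */
	uint8_t P = 0, Q = 0;
	int i, x = 1, y = 3;                         /* pretend disks 1 and 3 die */
	uint8_t Pxy, Qxy, gx, gy, Dx, Dy;

	for (i = 0; i < 4; i++) {
		P ^= d[i];                           /* P = sum D_i        */
		Q ^= gf_mul(gf_pow(2, i), d[i]);     /* Q = sum g^i * D_i  */
	}

	Pxy = P;                                     /* fold in the survivors */
	Qxy = Q;
	for (i = 0; i < 4; i++) {
		if (i == x || i == y)
			continue;
		Pxy ^= d[i];
		Qxy ^= gf_mul(gf_pow(2, i), d[i]);
	}

	/* Two equations, two unknowns:  Dx ^ Dy = Pxy,  g^x*Dx ^ g^y*Dy = Qxy.
	 * With Q' == P we would only have the first equation, twice. */
	gx = gf_pow(2, x);
	gy = gf_pow(2, y);
	Dy = gf_mul(gf_inv(gx ^ gy), (uint8_t)(gf_mul(gx, Pxy) ^ Qxy));
	Dx = Pxy ^ Dy;

	printf("recovered %02x %02x (expected %02x %02x)\n", Dx, Dy, d[x], d[y]);
	return 0;
}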
* Re: RAID6 Query 2005-08-16 15:40 ` dean gaudet @ 2005-08-16 16:44 ` Colonel Hell 0 siblings, 0 replies; 20+ messages in thread From: Colonel Hell @ 2005-08-16 16:44 UTC (permalink / raw) To: dean gaudet; +Cc: linux-raid thanks and sorry for a stupid qry suffering from foot-in-the-mouth disease :P On 8/16/05, dean gaudet <dean-list-linux-raid@arctic.org> wrote: > On Tue, 16 Aug 2005, Colonel Hell wrote: > > > I just went thru a couple of papers describing RAID6. > > I dunno how relevant this discussion grp is for the qry ...but here I go :) ... > > I couldnt figure out why is P+Q configuration better over P+q' where > > q' == P. What I mean is instead of calculating a new checksum (thru a > > lot of GF theory etc) just store the parity block (P)again. In this > > case as well we have the same amount of fault tolerance or not > > :-s ... > > this is no better than raid5 at surviving a two disk failure. i.e. > consider the case of two data blocks missing -- you can't reconstruct if > all you have is parity. > > -dean > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: perfomance question. 2005-08-16 13:54 ` perfomance question djani22 2005-08-16 14:30 ` RAID6 Query Colonel Hell @ 2005-08-18 4:59 ` Neil Brown 2005-08-18 15:20 ` djani22 1 sibling, 1 reply; 20+ messages in thread From: Neil Brown @ 2005-08-18 4:59 UTC (permalink / raw) To: djani22; +Cc: linux-raid

On Tuesday August 16, djani22@dynamicweb.hu wrote:
> Hello list,
>
> I have performance problem. (again) :-)
>
> What chunk size is better in raid5, and raid0?
> The lot of small chunks, or some bigger?

This is highly dependent on workload and hardware performance. The best thing to do is develop a test that simulates your real workload, run it with various stripe sizes, and see which one wins.

I suspect there would be very little gain in going to very small chunk sizes (<16k). Anywhere between there and 1Meg is worth trying.

mdadm uses a default of 64k, which is probably not too bad for most situations, but I cannot promise that it is optimal for any.

Sorry I cannot be more helpful.

Your performance problem may not be chunk-size related. Maybe increasing the readahead (with blockdev) would help...

NeilBrown

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: perfomance question. 2005-08-18 4:59 ` perfomance question Neil Brown @ 2005-08-18 15:20 ` djani22 0 siblings, 0 replies; 20+ messages in thread From: djani22 @ 2005-08-18 15:20 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid

Thanks for trying to help me!

My problem is solved, it looks like. It was a kernel problem. (I think...) When I switched to 2.6.13-rc6 (from rc3), the problem was gone!

It is very interesting! I use SWRAID to distribute the load equally across the nodes. (raid0, chunk size 32k) On my system with 2.6.13-rc3, "node-3" got many more (4x - 5x) read requests, but I don't know why, don't ask! :-)

First I thought the XFS log was somehow always on the 3rd chunk. I sent this question to the XFS list too, and got this answer: "The XFS log is always write, except recoverying." - That's right! The next idea was to break the 32k chunks up further, and that is why I sent the previous letter here. But I had more problems (a network-layer bug) with 13-rc3, tried the newer kernel, and the problem was gone. :-) It looks like some network issue.

Thanks
Janos

----- Original Message -----
From: "Neil Brown" <neilb@cse.unsw.edu.au>
To: <djani22@dynamicweb.hu>
Cc: <linux-raid@vger.kernel.org>
Sent: Thursday, August 18, 2005 6:59 AM
Subject: Re: perfomance question.

> On Tuesday August 16, djani22@dynamicweb.hu wrote:
> > Hello list,
> >
> > I have performance problem. (again) :-)
> >
> > What chunk size is better in raid5, and raid0?
> > The lot of small chunks, or some bigger?
>
> This is highly dependant one workload and hardware performance.
> The best thing to do is develop a test that simulates your real
> workload and run it with various stripe sizes, and see which one wins.
>
> I suspect there would be very little gain in going to very small chunk
> sizes (<16k). Anywhere between there and 1Meg is worth trying.
>
> mdadm uses a default of 64k which is probably not too bad for most
> situations, but I cannot promise it being optimal for any.
>
> Sorry I cannot be more helpful.
>
> Your performance problem may not be chunk-size related. Maybe
> increasing the readahead (with blockdev) would help...
>
> NeilBrown
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Found a new bug! 2005-08-15 10:50 ` djani22 2005-08-16 13:54 ` perfomance question djani22 @ 2005-08-18 4:34 ` Neil Brown 2005-08-18 15:39 ` djani22 1 sibling, 1 reply; 20+ messages in thread From: Neil Brown @ 2005-08-18 4:34 UTC (permalink / raw) To: djani22; +Cc: linux-raid On Monday August 15, djani22@dynamicweb.hu wrote: > Thanks, I will test it, when I can... > > In this moment, my system is an working online system, and now only one 8TB > space what I can use... > Thats right, maybe I can built linear array from only one soure device,but: > My first problem is, on my 8TB device is already exists XFS filesystem, with > valuable data, what I can't backup. > It is still OK, but I can't insert one raid layer, because the raid's > superblock, and the XFS is'nt shrinkable. :-( > > The only one way (I think) to plug in another raw device, and build an array > from 8TB-device + new small device, to get much space to FS. > > But it is too risky for me! Yes, I wouldn't bother just for testing. I've managed to put together some huge devices with spare files and multi-layer linear arrays (ext3 won't allow files as big as 2TB) and I am happy that the patch works. Longer term, I have been thinking of enhancing mdadm so that when you create a linear array, it copies the few blocks from the end that will be over written by the superblock onto the start of the second device. This would allow a single device to be extended into a linear array without loss. (I also have patches to hot-add devices to the end of a linear array which I really should dust-off and get into mainline). > > Do you think it is safe? > > Currently I use 2.6.13-rc3. > This patch is good for this version, or only the last version? > > Witch is the last? 2.6.13-rc6 or rc6-git7, or 2.6.14 -git cvs? :) The patch should be good against any reasonable recent version of 2.6. I always work against the latest -mm, but this code has been largely untouched for a while so there shouldn't be any patch conflicts. NeilBrown ^ permalink raw reply [flat|nested] 20+ messages in thread
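For reference, a small sketch of where that superblock sits on a component, which is exactly the region mdadm would have to copy to the start of the second device (this assumes the classic 0.90 layout, i.e. a superblock written at the start of the last 64KiB-aligned 64KiB of the device; the device size used here is just md31's block count from the earlier report, converted to sectors):

#include <stdio.h>
#include <stdint.h>

#define MD_RESERVED_BYTES   (64 * 1024)
#define MD_RESERVED_SECTORS (MD_RESERVED_BYTES / 512)

/* Start of the 64K region reserved for a 0.90 superblock, in 512-byte
 * sectors (an assumption based on the usual 0.90 placement, not copied
 * from the md sources). */
static uint64_t sb_offset_sectors(uint64_t dev_sectors)
{
	return (dev_sectors & ~(uint64_t)(MD_RESERVED_SECTORS - 1))
		- MD_RESERVED_SECTORS;
}

int main(void)
{
	uint64_t dev = 7814332928ULL * 2;   /* md31: 1K blocks -> sectors */

	printf("superblock region starts at sector %llu of %llu\n",
	       (unsigned long long)sb_offset_sectors(dev),
	       (unsigned long long)dev);
	return 0;
}

Whatever the filesystem keeps in that final region is what would have to be relocated onto the new member, so the filesystem still sees one contiguous device.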
* Re: Found a new bug! 2005-08-18 4:34 ` Found a new bug! Neil Brown @ 2005-08-18 15:39 ` djani22 2005-08-20 9:55 ` Oops in raid1? djani22 0 siblings, 1 reply; 20+ messages in thread From: djani22 @ 2005-08-18 15:39 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid ----- Original Message ----- From: "Neil Brown" <neilb@cse.unsw.edu.au> To: <djani22@dynamicweb.hu> Cc: <linux-raid@vger.kernel.org> Sent: Thursday, August 18, 2005 6:34 AM Subject: Re: Found a new bug! > On Monday August 15, djani22@dynamicweb.hu wrote: > > Thanks, I will test it, when I can... > > > > In this moment, my system is an working online system, and now only one 8TB > > space what I can use... > > Thats right, maybe I can built linear array from only one soure device,but: > > My first problem is, on my 8TB device is already exists XFS filesystem, with > > valuable data, what I can't backup. > > It is still OK, but I can't insert one raid layer, because the raid's > > superblock, and the XFS is'nt shrinkable. :-( > > > > The only one way (I think) to plug in another raw device, and build an array > > from 8TB-device + new small device, to get much space to FS. > > > > But it is too risky for me! > > Yes, I wouldn't bother just for testing. I've managed to put together > some huge devices with spare files and multi-layer linear arrays (ext3 > won't allow files as big as 2TB) and I am happy that the patch works. > > Longer term, I have been thinking of enhancing mdadm so that when you > create a linear array, it copies the few blocks from the end that will > be over written by the superblock onto the start of the second > device. This would allow a single device to be extended into a linear > array without loss. (I also have patches to hot-add devices to the > end of a linear array which I really should dust-off and get into > mainline). Yes! This is very good idea! I can do that manually with dd, but some people can't. This, and sometimes reverse of this is a usefull options! In my case: I add some small HDD to my big array, to try the patch. Thats ok. But later, when I try to change the small to another big, there is no easy way, to do this. When I copy the small drive with dd or cat to 2nd big array, the superblock is wrong placed. (or not?) > > > > Do you think it is safe? > > > > Currently I use 2.6.13-rc3. > > This patch is good for this version, or only the last version? > > > > Witch is the last? 2.6.13-rc6 or rc6-git7, or 2.6.14 -git cvs? :) > > The patch should be good against any reasonable recent version of > 2.6. I always work against the latest -mm, but this code has been > largely untouched for a while so there shouldn't be any patch > conflicts. Thanks, I will try it! But in the last month my system's downtime is almost more than uptime, and now I try to fix this very bad stat. :-) Janos ^ permalink raw reply [flat|nested] 20+ messages in thread
* Oops in raid1? 2005-08-18 15:39 ` djani22 @ 2005-08-20 9:55 ` djani22 2005-08-20 15:53 ` Pallai Roland 0 siblings, 1 reply; 20+ messages in thread From: djani22 @ 2005-08-20 9:55 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 3895 bytes --] Hello list, Neil! I found this, bud don't know what is this exactly... It is not look like the *NBD's deadlock. :-/ Neil! It is the "original" 2.6.13-rc6, not with your patch! Only with two mods, what I get from netdev list, and attached to this letter.... Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] Unable to handle kernel paging request at virtual address a014d7a5 Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] printing eip: Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] c0118cee Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] *pde = f7bedd02 Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] Oops: 0000 [#1] Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] SMP Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] Modules linked in: netconsole gnbd Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] CPU: 0 Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] EIP: 0060:[<c0118cee>] Not tainted VLI Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] EFLAGS: 00010296 (2.6.13-rc6) Aug 20 01:07:23 192.168.2.50 kernel: [42992885.040000] EIP is at kmap+0x1e/0x54 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] eax: 00000246 ebx: a014d7a5 ecx: c11ef260 edx: cabbc400 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] esi: 00008000 edi: 00000001 ebp: f6c7fe00 esp: f6c7fdf4 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] ds: 007b es: 007b ss: 0068 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] Process md3_raid1 (pid: 2769, threadinfo=f6c7e000 task=f7eef020) Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] Stack: c0577800 00000006 f5f93cfc f6c7fe54 f895a9cc a014d7a5 00000001 c f793000 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] 00001000 00004000 d3fc3180 f73e9bf0 f895e718 cabbc400 007ea037 0 1000000 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] d4175a4c f895e6f0 65000000 00f03d8d 00100000 d4175a4c f895e6f0 f 895e700 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] Call Trace: Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0103ca2>] show_stack+0x9a/0xd0 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0103e6d>] show_registers+0x175/0x209 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c010408c>] die+0xfa/0x17c Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0117b68>] do_page_fault+0x269/0x7bd Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c01038d7>] error_code+0x4f/0x54 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<f895a9cc>] __gnbd_send_req+0x196/0x28d [gnbd] Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<f895af12>] do_gnbd_request+0xe5/0x198 [gnbd] Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0383a0d>] __generic_unplug_device+0x28/0x2e Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c038150f>] __elv_add_request+0xaa/0xac Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0384e5b>] __make_request+0x20d/0x512 Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0385490>] generic_make_request+0xb2/0x27a Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c04748a2>] raid1d+0xbf/0x2cb Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c04825c9>] md_thread+0x134/0x16f Aug 20 01:07:24 192.168.2.50 kernel: 
[42992885.040000] [<c01010d5>] kernel_thread_helper+0x5/0xb Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] Code: 89 c1 81 e1 ff ff 0f 00 eb b0 90 90 90 55 89 e5 53 83 ec 08 8b 5d 08 c7 44 24 04 06 00 00 00 c7 04 24 00 78 57 c0 e8 72 47 00 00 <8b> 03 c1 e8 1e 8b 14 85 14 db 73 c0 8b 82 0c 04 00 00 05 00 09 Aug 20 01:07:24 192.168.2.50 Fatal exception: panic in 5 seconds Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] <0>Fatal exception: panic in 5 seconds Aug 20 01:07:27 192.168.2.50 [42992890.060000] Kernel panic - not syncing: Fatal exception Janos [-- Attachment #2: p.txt --] [-- Type: text/plain, Size: 567 bytes --] diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1474,6 +1474,10 @@ static void tcp_mark_head_lost(struct so int cnt = packets; BUG_TRAP(cnt <= tp->packets_out); + if (unlikely(cnt > tp->packets_out)) { + printk("packets_out = %d, fackets_out = %d, reordering = %d, sack_ok = 0x%x, mss_cache=%d\n", tp->packets_out, tp->fackets_out, tp->reordering, tp->rx_opt.sack_ok, tp->mss_cache); + dump_stack(); + } sk_stream_for_retrans_queue(skb, sk) { cnt -= tcp_skb_pcount(skb); [-- Attachment #3: fix.txt --] [-- Type: text/plain, Size: 854 bytes --] diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1370,15 +1370,21 @@ int tcp_retransmit_skb(struct sock *sk, if (skb->len > cur_mss) { int old_factor = tcp_skb_pcount(skb); - int new_factor; + int diff; if (tcp_fragment(sk, skb, cur_mss, cur_mss)) return -ENOMEM; /* We'll try again later. */ /* New SKB created, account for it. */ - new_factor = tcp_skb_pcount(skb); - tp->packets_out -= old_factor - new_factor; - tp->packets_out += tcp_skb_pcount(skb->next); + diff = old_factor - tcp_skb_pcount(skb) - + tcp_skb_pcount(skb->next); + tp->packets_out -= diff; + + if (diff > 0) { + tp->fackets_out -= diff; + if ((int)tp->fackets_out < 0) + tp->fackets_out = 0; + } } /* Collapse two adjacent packets if worthwhile and we can. */ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Oops in raid1? 2005-08-20 9:55 ` Oops in raid1? djani22 @ 2005-08-20 15:53 ` Pallai Roland 2005-08-20 16:26 ` djani22 0 siblings, 1 reply; 20+ messages in thread From: Pallai Roland @ 2005-08-20 15:53 UTC (permalink / raw) To: djani22; +Cc: linux-raid Hi, On Sat, 2005-08-20 at 11:55 +0200, djani22@dynamicweb.hu wrote: > I found this, bud don't know what is this exactly... > It is not look like the *NBD's deadlock. :-/ it's exactly a GNBD bug, imho > [...] > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] Process md3_raid1 > (pid: 2769, threadinfo=f6c7e000 task=f7eef020) > [...] > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0117b68>] > do_page_fault+0x269/0x7bd > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c01038d7>] > error_code+0x4f/0x54 > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<f895a9cc>] > __gnbd_send_req+0x196/0x28d [gnbd] > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<f895af12>] > do_gnbd_request+0xe5/0x198 [gnbd] > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0383a0d>] > __generic_unplug_device+0x28/0x2e > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c038150f>] > __elv_add_request+0xaa/0xac > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0384e5b>] > __make_request+0x20d/0x512 > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0385490>] > generic_make_request+0xb2/0x27a > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c04748a2>] > raid1d+0xbf/0x2cb > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c04825c9>] > md_thread+0x134/0x16f > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c01010d5>] > kernel_thread_helper+0x5/0xb -- dap ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Oops in raid1? 2005-08-20 15:53 ` Pallai Roland @ 2005-08-20 16:26 ` djani22 2005-08-20 16:50 ` Pallai Roland 0 siblings, 1 reply; 20+ messages in thread From: djani22 @ 2005-08-20 16:26 UTC (permalink / raw) To: Pallai Roland; +Cc: linux-raid ----- Original Message ----- From: "Pallai Roland" <dap@mail.index.hu> To: <djani22@dynamicweb.hu> Cc: <linux-raid@vger.kernel.org> Sent: Saturday, August 20, 2005 5:53 PM Subject: Re: Oops in raid1? > > Hi, > > On Sat, 2005-08-20 at 11:55 +0200, djani22@dynamicweb.hu wrote: > > I found this, bud don't know what is this exactly... > > It is not look like the *NBD's deadlock. :-/ > it's exactly a GNBD bug, imho Hmmm. Possibly... I get this message, when high upload. (disk write) But the GNBD generates this, in that situation, and thats why I think, this is something else... Jul 17 23:05:10 dy-base kernel: ------------[ cut here ]------------ Jul 17 23:05:10 dy-base kernel: kernel BUG at mm/highmem.c:183! Jul 17 23:05:10 dy-base kernel: invalid operand: 0000 [#1] Jul 17 23:05:10 dy-base kernel: PREEMPT SMP Jul 17 23:05:10 dy-base kernel: Modules linked in: gnbd Jul 17 23:05:10 dy-base kernel: CPU: 0 Jul 17 23:05:10 dy-base kernel: EIP: 0060:[<c0155aff>] Tainted: G B VLI Jul 17 23:05:10 dy-base kernel: EFLAGS: 00010246 (2.6.13-rc3-plus-NFS) Jul 17 23:05:10 dy-base kernel: EIP is at kunmap_high+0x1f/0xa0 Jul 17 23:05:10 dy-base kernel: eax: 00000000 ebx: c1a98cc0 ecx: c1a98cc0 edx: 00000202 Jul 17 23:05:10 dy-base kernel: esi: dc9f0900 edi: 00000000 ebp: d5a1e600 esp: ee6c3e74 Jul 17 23:05:10 dy-base kernel: ds: 007b es: 007b ss: 0068 Jul 17 23:05:10 dy-base kernel: Process md4_raid1 (pid: 15185, threadinfo=ee6c2000 task=d224e020) Jul 17 23:05:10 dy-base kernel: Stack: c1a98cc0 00001000 f883fa7e c1a98cc0 00000001 c009c000 00001000 00000000 Jul 17 23:05:10 dy-base kernel: 40e38500 003d2431 007ea037 01000000 e593104c ee6c2000 5d000000 0000faff Jul 17 23:05:10 dy-base kernel: 00200000 c055b2c6 e593104c f8842d08 f8842d18 f6e7abf0 f884001b f8842d08 Jul 17 23:05:10 dy-base kernel: Call Trace: ---> Jul 17 23:05:10 dy-base kernel: [<f883fa7e>] __gnbd_send_req+0x15e/0x280 [gnbd] Jul 17 23:05:10 dy-base kernel: [<c055b2c6>] preempt_schedule+0x56/0x80 Jul 17 23:05:10 dy-base kernel: [<f884001b>] do_gnbd_request+0xeb/0x1a0 [gnbd] Jul 17 23:05:10 dy-base kernel: [<c03788f6>] __generic_unplug_device+0x36/0x40 Jul 17 23:05:10 dy-base kernel: [<c037891e>] generic_unplug_device+0x1e/0x30 Jul 17 23:05:10 dy-base kernel: [<c0461018>] unplug_slaves+0xe8/0x100 Jul 17 23:05:10 dy-base kernel: [<c0462405>] raid1d+0x205/0x2a0 Jul 17 23:05:10 dy-base kernel: [<c0470919>] md_thread+0x159/0x1a0 Jul 17 23:05:10 dy-base kernel: [<c0137370>] autoremove_wake_function+0x0/0x60 Jul 17 23:05:10 dy-base kernel: [<c01030d2>] ret_from_fork+0x6/0x14 Jul 17 23:05:10 dy-base kernel: [<c0137370>] autoremove_wake_function+0x0/0x60 Jul 17 23:05:10 dy-base kernel: [<c04707c0>] md_thread+0x0/0x1a0 Jul 17 23:05:10 dy-base kernel: [<c0101205>] kernel_thread_helper+0x5/0x10 Jul 17 23:05:10 dy-base kernel: Code: ff 8d 74 26 00 8d bc 27 00 00 00 00 83 ec 08 89 5c 24 04 89 c3 b8 80 10 6d c0 e8 0d 68 4 Jul 17 23:05:10 dy-base kernel: <6>note: md4_raid1[15185] exited with preempt_count 1 Thanks Janos > > > [...] > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] Process md3_raid1 > > (pid: 2769, threadinfo=f6c7e000 task=f7eef020) > > [...] 
> > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0117b68>] > > do_page_fault+0x269/0x7bd > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c01038d7>] > > error_code+0x4f/0x54 > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<f895a9cc>] > > __gnbd_send_req+0x196/0x28d [gnbd] > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<f895af12>] > > do_gnbd_request+0xe5/0x198 [gnbd] > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0383a0d>] > > __generic_unplug_device+0x28/0x2e > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c038150f>] > > __elv_add_request+0xaa/0xac > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0384e5b>] > > __make_request+0x20d/0x512 > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c0385490>] > > generic_make_request+0xb2/0x27a > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c04748a2>] > > raid1d+0xbf/0x2cb > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c04825c9>] > > md_thread+0x134/0x16f > > Aug 20 01:07:24 192.168.2.50 kernel: [42992885.040000] [<c01010d5>] > > kernel_thread_helper+0x5/0xb > > > -- > dap > > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Oops in raid1? 2005-08-20 16:26 ` djani22 @ 2005-08-20 16:50 ` Pallai Roland 2005-08-20 16:57 ` djani22 0 siblings, 1 reply; 20+ messages in thread From: Pallai Roland @ 2005-08-20 16:50 UTC (permalink / raw) To: djani22; +Cc: linux-raid

On Sat, 2005-08-20 at 18:26 +0200, djani22@dynamicweb.hu wrote:
> I get this message, when high upload. (disk write)
> But the GNBD generates this, in that situation, and thats why I think, this
> is something else...

Yes, it seems like it's another bug in GNBD, but the backtrace is clear in the first case too: it is the request to the underlying device that generated that panic, not raid1d's own.

All in all, try disabling preempt mode, that may help..

--
dap

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Oops in raid1? 2005-08-20 16:50 ` Pallai Roland @ 2005-08-20 16:57 ` djani22 0 siblings, 0 replies; 20+ messages in thread From: djani22 @ 2005-08-20 16:57 UTC (permalink / raw) To: Pallai Roland; +Cc: linux-raid ----- Original Message ----- From: "Pallai Roland" <dap@mail.index.hu> To: <djani22@dynamicweb.hu> Cc: <linux-raid@vger.kernel.org> Sent: Saturday, August 20, 2005 6:50 PM Subject: Re: Oops in raid1? > > On Sat, 2005-08-20 at 18:26 +0200, djani22@dynamicweb.hu wrote: > > I get this message, when high upload. (disk write) > > But the GNBD generates this, in that situation, and thats why I think, this > > is something else... > yes, seems like it's an another bug in the GNBD, but the backtrace is > clear in the first case too, the request to the underlying device what's > generated that panic, not the raid1d's own. OK, I understand. :-) In this case, I'll send it to RedHat's list... > > all in all, try to disable the preempt mode, that may help.. Yes, I know it! :-) The preempt-kernel is much older! :-) Thanks for help! Janos > > > -- > dap ^ permalink raw reply [flat|nested] 20+ messages in thread
Thread overview: 20+ messages
[not found] <20050717182650.24540.patches@notabene>
2005-07-17 8:27 ` [PATCH md ] When resizing an array, we need to update resync_max_sectors as well as size NeilBrown
2005-07-17 12:10 ` Found a new bug! djani22
2005-07-17 22:13 ` Neil Brown
2005-07-17 22:31 ` djani22
2005-08-14 22:38 ` djani22
2005-08-15 1:21 ` Neil Brown
2005-08-15 10:50 ` djani22
2005-08-16 13:54 ` perfomance question djani22
2005-08-16 14:30 ` RAID6 Query Colonel Hell
2005-08-16 15:40 ` dean gaudet
2005-08-16 16:44 ` Colonel Hell
2005-08-18 4:59 ` perfomance question Neil Brown
2005-08-18 15:20 ` djani22
2005-08-18 4:34 ` Found a new bug! Neil Brown
2005-08-18 15:39 ` djani22
2005-08-20 9:55 ` Oops in raid1? djani22
2005-08-20 15:53 ` Pallai Roland
2005-08-20 16:26 ` djani22
2005-08-20 16:50 ` Pallai Roland
2005-08-20 16:57 ` djani22