Linux RAID subsystem development
* Re: filesystem-level tool to validate array
From: Gordon Henderson @ 2011-05-30  8:20 UTC
  To: linux-raid
In-Reply-To: <BANLkTi=9mp0v4kn+3ZZP7CtBxE88Ox3_rA@mail.gmail.com>

On Sun, 29 May 2011, Michael Stumpf wrote:

> I'm looking for a filesystem-level tool to perform something similar
> to what badblocks does at the drive level.  I can certainly write it
> on my own (I'd build it as a Perl or Python script), but if someone's
> already invented this..
>
> (The intended purpose is to validate that there are no quirks/bugs in
> the overall fs.)

If you can take the partition offline, then fsck -fC might work, although
it'll depend on the filesystem type... And fsck doesn't actually read the
file blocks (that I'm aware of).

For something crude, you can use find to descend a hierarchy and copy each
file, or maybe even something like

   cd /top-level/dir/
   fgrep -r "wumpus" .

that'll perform a read of every file - well, mostly as some might be in 
the filesystem cache.

But if you want to make sure every file block belongs to a file, and the 
structure (directory) integrity is there, then fsck is probably the best 
bet...

Another way might be to recursively compute md5 checksums for all files - 
then do it again and compare.. (at a later date?)

You might want to look at something like tripwire to automate this though.

(Obviously won't work if you get the same error at the same place every 
time though!)
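
The checksum-and-compare idea could be sketched in Python roughly as below.
This is only a minimal sketch: the names checksum_tree and diff_manifests
are made up for illustration, and it only notices content that differs
between two runs, not an error that repeats identically.

```python
import hashlib
import os


def checksum_tree(top):
    """Return {relative_path: md5 hex digest} for every regular file under top."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, device nodes
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                # Read in 1 MiB pieces so large files do not fill memory.
                for block in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(block)
            manifest[os.path.relpath(path, top)] = digest.hexdigest()
    return manifest


def diff_manifests(old, new):
    """Return sorted paths whose checksum changed, appeared, or vanished."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Run checksum_tree once, keep the result, run it again later and pass both
results to diff_manifests. As noted above, reads may be served from the
page cache unless the files are larger than RAM or the cache is dropped.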

One of the burn-in tools I have is a script that writes a file of random 
numbers - md5's it. Then copies this file to n+1, then copies n+1 to n+2, 
then n+2 to n+3 and so on, then md5's the final file. The file-sizes are 
typically double RAM size to negate the effects of cache (same idea as 
bonnie)... However if there's a failure, then it's not clear where
the issue is - memory, PCI bus, SATA cable, disk platter?
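
For illustration, a copy-chain burn-in of that kind might look something
like the following. This is not the actual script, just a sketch under
assumptions: the file names burn.N and the function names are invented,
and size_bytes would be set to roughly twice RAM to defeat the cache.

```python
import hashlib
import os
import shutil


def md5_of(path):
    """Return the md5 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def burn_in(workdir, size_bytes, copies):
    """Write random data, copy it along a chain, compare first and last md5."""
    first = os.path.join(workdir, "burn.0")
    remaining = size_bytes
    with open(first, "wb") as handle:
        while remaining > 0:
            block = os.urandom(min(remaining, 1 << 20))
            handle.write(block)
            remaining -= len(block)
    expected = md5_of(first)
    previous = first
    for n in range(1, copies + 1):
        current = os.path.join(workdir, "burn.%d" % n)
        shutil.copyfile(previous, current)  # copy n-1 to n, then n to n+1, ...
        previous = current
    # True means the last copy still matches the original data.
    return expected == md5_of(previous)
```

A False result only says that something between the first write and the
last read went wrong; as described above, it does not say which component.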

Of course, in a RAID array, looking at it from the filesystem level (or
even the block level) isn't going to read all platters of all disks - you 
need to use the /sys/block/mdX/md/sync_action mechanism.
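
A small Python sketch of driving that mechanism is below. It assumes the
usual md sysfs files sync_action and mismatch_cnt; the function names and
the sysfs_root parameter (there only so it can be tried against a scratch
directory) are my own. Writing to sync_action needs root.

```python
import os


def start_check(md_name, sysfs_root="/sys/block"):
    """Ask md to read and compare all member devices of e.g. 'md0'."""
    path = os.path.join(sysfs_root, md_name, "md", "sync_action")
    with open(path, "w") as handle:
        handle.write("check\n")


def check_status(md_name, sysfs_root="/sys/block"):
    """Return (current sync_action, mismatch_cnt) for the array."""
    md_dir = os.path.join(sysfs_root, md_name, "md")
    with open(os.path.join(md_dir, "sync_action")) as handle:
        action = handle.read().strip()
    with open(os.path.join(md_dir, "mismatch_cnt")) as handle:
        mismatches = int(handle.read().strip())
    return action, mismatches
```

After start_check("md0"), polling check_status("md0") until the action
goes back to "idle" and then looking at the mismatch count is one way to
use it.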

Gordon


* Re: Optimizing small IO with md RAID
From: Stan Hoeppner @ 2011-05-30 10:43 UTC
  To: fibreraid@gmail.com; +Cc: linux-raid
In-Reply-To: <BANLkTi=236kncpunzodSci-1K33u_FBkPA@mail.gmail.com>

On 5/30/2011 2:14 AM, fibreraid@gmail.com wrote:
> Hi all,
> 
> I am looking to optimize md RAID performance as much as possible.
> 
> I've managed to get some rather strong large 4M IOps performance, but
> small 4K IOps are still rather subpar, given the hardware.
> 
> CPU: 2 x Intel Westmere 6-core 2.4GHz
> RAM: 24GB DDR3 1066
> SAS controllers: 3 x LSI SAS2008 (6 Gbps SAS)
> Drives: 24 x SSD's
> Kernel: 2.6.38 x64 kernel (home-grown)
> Benchmarking Tool: fio 1.54
> 
> Here are the results. I used the following commands to perform these benchmarks:
> 
> 4K READ: fio --bs=4k --direct=1 --rw=read --ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4K WRITE: fio --bs=4k --direct=1 --rw=write --ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4M READ: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0
> 4M WRITE: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0

Did you test with buffered IO?  Unless you're running Oracle or a custom
app that only uses O_DIRECT, you should probably be testing buffered IO
as well, as it's a more realistic test case most of the time.

> In each case below, the md chunk size was 64K. In RAID 5 and RAID 6,
> one hot-spare was specified.

IOPS and throughput tuning often traditionally have an inverse
relationship.  It may prove difficult to tune maximum performance for
both cases.

>            raid0 24 x SSD   raid5 23 x SSD   raid6 23 x SSD   raid0 (2 * (raid5 x 11 SSD))
> 4K read    179,923 IO/s     93,503 IO/s      116,866 IO/s     75,782 IO/s
> 4K write   168,027 IO/s     108,408 IO/s     120,477 IO/s     90,954 IO/s
> 4M read    4,576.7 MB/s     4,406.7 MB/s     4,052.2 MB/s     3,566.6 MB/s
> 4M write   3,146.8 MB/s     1,337.2 MB/s     1,259.9 MB/s     1,856.4 MB/s

> Note that each individual SSD tests out as follows:
> 
> 4k read: 56,342 IO/s
> 4k write: 33,792 IO/s
> 4M read: 231 MB/s
> 4M write: 130 MB/s

This looks like a filesystem limitation.

> My concerns:
> 
> 1. Given the above individual SSD performance, 24 SSD's in an md array
> is at best getting 4K read/write performance of 2-3 drives, which
> seems very low. I would expect significantly better linear scaling.
> 2. On the other hand, 4M read/write are performing more like 10-15
> drives, which is much better, though still seems like it could get
> better.
> 3. 4k read/write looks good for RAID 0, but drop off by over 40% with
> RAID 5. While somewhat understandable on writes, why such a
> significant hit on reads?
> 4. RAID 5 4M writes take a big hit compared to RAID 0, from 3146 MB/s
> to 1337 MB/s. Despite the RAID 5 overhead, that still seems huge given
> the CPU's at hand. Why?
> 5. Using a RAID 0 across two 11-SSD RAID 5's gives better RAID 5 4M
> write performance, but worse in reads and significantly worse in 4K
> reads/writes. Why?
> 
> Any thoughts would be greatly appreciated, especially patch ideas for
> tweaking options. Thanks!

Your filesystem interaction with mdraid levels (stripe/chunk meshing)
may be limiting your performance.  FIO does test files IIRC, not direct
block IO.  Are you using EXT3/4?  XFS?

I suggest you try the following.  Create an md raid *linear* array of
all 24 SSDs using a 4KB chunk size.  Format the resulting md device with
XFS specifying 24 allocation groups, no other options.  Something like:

~# mdadm -C /dev/md0 -n 24 -c 4 -l linear /dev/sd[a-x]
~# mdadm -A /dev/md0 /dev/sd[a-x]
~# mkfs.xfs -d agcount=24 /dev/md0

This setup will parallelize the IO load at the file level instead of at
the stripe or chunk level of the md RAID layer.  Each file in the test
will be wholly written to and read from only one SSD, but you'll get 24
parallel streams, one to/from each SSD.  (You can do the same thing with
RAID 10, 6, etc, but files will get striped across multiple drives,
which doesn't work well for small files)

Simply specify agcount=[number of actual data devices], not including
devices, or space, consumed by redundancy.  For example, in a 10 disk
RAID 10 you'd use agcount=5.  For a 10 disk RAID 6, agcount=8, and so on.
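
As a rough illustration of that rule of thumb, here is a small Python
helper. The function name and the level strings are made up for this
sketch, and the raid10 case assumes two copies of every block.

```python
def data_devices(level, disks):
    """Number of member devices that hold data rather than redundancy."""
    if level in ("linear", "raid0"):
        return disks
    if level == "raid10":
        return disks // 2      # assumes two copies of every block
    if level == "raid5":
        return disks - 1       # one device's worth of parity
    if level == "raid6":
        return disks - 2       # two devices' worth of parity
    raise ValueError("unsupported level: %r" % level)


for level, disks in (("linear", 24), ("raid10", 10), ("raid6", 10)):
    print("%s with %d disks: agcount=%d" % (level, disks, data_devices(level, disks)))
```

The result is what would be passed as agcount to mkfs.xfs under the
suggestion above.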

Since you're using 2.6.38 you'll want to enable XFS delayed logging,
which speeds up large metadata write loads substantially.  To do so,
simply add 'delaylog' to your fstab mount options, such as:

/dev/md0       /test           xfs     defaults,delaylog

I'm interested to see what kind of performance increase you get with
this setup.

-- 
Stan


* Re: Optimizing small IO with md RAID
From: David Brown @ 2011-05-30 11:20 UTC
  To: linux-raid
In-Reply-To: <BANLkTi=236kncpunzodSci-1K33u_FBkPA@mail.gmail.com>

On 30/05/2011 09:14, fibreraid@gmail.com wrote:
> [...]

(This is in addition to what Stan said about filesystems, etc.)

If my mental calculations are correct, writing 4M to this raid5/raid6 
setup takes about 1.5 stripes.  Typically that will mean two partial 
stripe writes (or even two partials and one full).  Partial stripe 
writes on raid5/6 means reading in most of the old stripe, calculating 
the new parity, and writing out the new data and parity.  When you tried 
with a raid0 of two raid5 groups, this effect was less because more of 
the writes were full stripes.
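
The stripe arithmetic can be checked with a few lines of Python. This is a
sketch under stated assumptions: a 64K chunk as in the original post, and
a 23-member RAID 5 with the hot spare counted separately, so 22 data
chunks per stripe (1408K). The function names are invented here, and the
model only classifies stripes as fully or partly covered by one write; it
does not model how md actually schedules the I/O.

```python
def stripes_per_write(write_kib, data_chunks, chunk_kib):
    """How many full-stripe widths one write of write_kib covers."""
    return write_kib / (data_chunks * chunk_kib)


def split_write(offset_kib, length_kib, stripe_kib):
    """Return (full_stripes, partial_stripes) touched by a single write."""
    end = offset_kib + length_kib
    first = offset_kib // stripe_kib
    last = (end - 1) // stripe_kib
    full = partial = 0
    for stripe in range(first, last + 1):
        start = stripe * stripe_kib
        if offset_kib <= start and start + stripe_kib <= end:
            full += 1       # the write covers this stripe completely
        else:
            partial += 1    # old data or parity must be read back first
    return full, partial


print(stripes_per_write(4096, 22, 64))   # RAID 5, 22 data chunks
print(split_write(0, 4096, 22 * 64))     # first sequential 4M write
print(split_write(4096, 4096, 22 * 64))  # second sequential 4M write
```

With these assumptions a 4M write covers a little under three stripe
widths, and every sequential 4M write touches at least one partial stripe.
A stripe width that divides 4M evenly would give full stripes only.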

With SSDs, you have very low latency between a read system call and the 
data being accessed - that's what gives it a high IOps.  But it also 
means that layers of indirection such as more complex raid or layered 
raid have more effect.

Try your measurements with a raid10,far setup.  It costs more on data 
space, but should, I think, be quite a bit faster.



* Re: Optimizing small IO with md RAID
From: John Robinson @ 2011-05-30 11:57 UTC
  To: Linux RAID
In-Reply-To: <irvuj1$rc3$1@dough.gmane.org>

On 30/05/2011 12:20, David Brown wrote:
> (This is in addition to what Stan said about filesystems, etc.)
[...]
> Try your measurements with a raid10,far setup. It costs more on data
> space, but should, I think, be quite a bit faster.

I'd also be interested in what performance is like with RAID60, e.g. 4 
6-drive RAID6 sets, combined into one RAID0. I suggest this arrangement 
because it gives slightly better data space (33% better than the RAID10 
arrangement), better redundancy (if that's a consideration[1]), and 
would keep all your stripe widths in powers of two, e.g. 64K chunk on 
the RAID6s would give a 256K stripe width and end up with an overall 
stripe width of 1M at the RAID0.
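
The numbers for this layout can be reproduced with a short Python sketch.
The function name is invented, and the overall figure assumes the RAID0
chunk is set equal to one RAID6 stripe width.

```python
def raid60_layout(sets, drives_per_set, chunk_kib):
    """Usable drives and stripe widths for a RAID0 over several RAID6 sets."""
    data_per_set = drives_per_set - 2           # RAID6 spends two drives on parity
    set_stripe_kib = data_per_set * chunk_kib   # one full RAID6 stripe
    return {
        "usable_drives": sets * data_per_set,
        "set_stripe_kib": set_stripe_kib,
        # Assumes the RAID0 chunk equals one RAID6 stripe width.
        "overall_stripe_kib": sets * set_stripe_kib,
    }


layout = raid60_layout(sets=4, drives_per_set=6, chunk_kib=64)
raid10_usable = 24 // 2                         # two copies of every block
print(layout)
print("space vs RAID10: %.2fx" % (layout["usable_drives"] / raid10_usable))
```

For 4 sets of 6 drives this gives 16 usable drives against 12 for RAID10,
a 256K stripe per RAID6 set and 1024K overall.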

You will probably always have relatively poor small write performance 
with any parity RAID for reasons both David and Stan already pointed 
out, though the above might be the least worst, if you see what I mean.

You could also try 3 8-drive RAID6s or 2 12-drive RAID6s but you'd 
definitely have to be careful - as Stan says - with your filesystem 
configuration because of the stripe widths, and the bigger your parity 
RAIDs the worse your small write and degraded performance becomes.

Cheers,

John.

[1] RAID6 lets you get away with sector errors while rebuilding after a 
disc failure. In addition, as it happens, setting up this arrangement 
with two drives on each controller for each of the RAID6s would mean you 
could tolerate a controller failure, albeit with horrible performance 
and you would have no redundancy left. You could configure smaller 
RAID6s or RAID10 to tolerate a controller failure too.



* [PATCH] md: use proper little-endian bitops
From: Akinobu Mita @ 2011-05-30 12:56 UTC
  To: linux-kernel; +Cc: Akinobu Mita, NeilBrown, linux-raid

Calls to __test_and_{set,clear}_bit_le() that ignore the return value
can be replaced with __{set,clear}_bit_le().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
---
 drivers/md/bitmap.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 70bd738..52fa6cf 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -856,7 +856,7 @@ static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
 		if (bitmap->flags & BITMAP_HOSTENDIAN)
 			set_bit(bit, kaddr);
 		else
-			__test_and_set_bit_le(bit, kaddr);
+			__set_bit_le(bit, kaddr);
 		kunmap_atomic(kaddr, KM_USER0);
 		PRINTK("set file bit %lu page %lu\n", bit, page->index);
 	}
@@ -1228,7 +1228,7 @@ void bitmap_daemon_work(mddev_t *mddev)
 						clear_bit(file_page_offset(bitmap, j),
 							  paddr);
 					else
-						__test_and_clear_bit_le(file_page_offset(bitmap, j),
+						__clear_bit_le(file_page_offset(bitmap, j),
 							       paddr);
 					kunmap_atomic(paddr, KM_USER0);
 				} else
-- 
1.7.4.4


* Re: Optimizing small IO with md RAID
From: David Brown @ 2011-05-30 13:08 UTC
  To: linux-raid
In-Reply-To: <4DE38628.1050201@anonymous.org.uk>

On 30/05/2011 13:57, John Robinson wrote:
> On 30/05/2011 12:20, David Brown wrote:
>> (This is in addition to what Stan said about filesystems, etc.)
> [...]
>> Try your measurements with a raid10,far setup. It costs more on data
>> space, but should, I think, be quite a bit faster.
>
> I'd also be interested in what performance is like with RAID60, e.g. 4
> 6-drive RAID6 sets, combined into one RAID0. I suggest this arrangement
> because it gives slightly better data space (33% better than the RAID10
> arrangement), better redundancy (if that's a consideration[1]), and
> would keep all your stripe widths in powers of two, e.g. 64K chunk on
> the RAID6s would give a 256K stripe width and end up with an overall
> stripe width of 1M at the RAID0.
>

Power-of-two stripe widths may be better for xfs than non-power-of-two 
widths - perhaps Stan can answer that (he seems to know lots about xfs 
on raid).  But you have to be careful when testing and benchmarking - 
with power-of-two stripe widths, it's easy to get great 4 MB performance 
but terrible 5 MB performance.


As for the redundancy of raid6 (or 60) vs. raid10, the redundancy is 
different but not necessarily better, depending on your failure types 
and requirements.  raid6 will tolerate any two drives failing, while 
raid10 will tolerate up to half the drives failing as long as you don't 
lose both halves of a pair.  Depending on the chances of a random disk 
failing, if you have enough disks then the chances of two disks in a 
pair failing are less than the chances of three disks in a raid6 setup 
failing.  And raid10 suffers much less from running in degraded mode 
than raid6, and recovery is faster and less stressful.  So which is 
"better" depends on the user.

Of course, there is no question about the differences in space 
efficiency - that's easy to calculate.
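
To put rough numbers on the redundancy comparison, the failure cases can
simply be enumerated. The Python sketch below assumes 24 drives, a RAID10
made of 12 fixed mirror pairs, and a RAID60 of 4 six-drive RAID6 sets; it
treats every combination of failed drives as equally likely, which real
failures need not be. The function name is invented for this example.

```python
from itertools import combinations


def fatal_fraction(groups, tolerance, failures):
    """Fraction of all `failures`-drive combinations that lose data.

    groups: list of lists of drive ids; the array is lost as soon as any
    one group has more than `tolerance` failed drives.
    """
    member = {drive: i for i, group in enumerate(groups) for drive in group}
    fatal = total = 0
    for failed in combinations(sorted(member), failures):
        total += 1
        lost = [0] * len(groups)
        for drive in failed:
            lost[member[drive]] += 1
        if any(count > tolerance for count in lost):
            fatal += 1
    return fatal / total


raid10 = [[2 * i, 2 * i + 1] for i in range(12)]           # 12 mirror pairs
raid60 = [list(range(6 * i, 6 * i + 6)) for i in range(4)]  # 4 RAID6 sets
raid6 = [list(range(24))]                                   # one big RAID6

print("raid10, 2 failures:", fatal_fraction(raid10, 1, 2))
print("raid10, 3 failures:", fatal_fraction(raid10, 1, 3))
print("raid60, 3 failures:", fatal_fraction(raid60, 2, 3))
print("raid6,  3 failures:", fatal_fraction(raid6, 2, 3))
```

Which layout comes out ahead depends on how many simultaneous failures are
considered plausible, which is the point made above.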

For greater paranoia, you can always go for raid15 or even raid16...

> You will probably always have relatively poor small write performance
> with any parity RAID for reasons both David and Stan already pointed
> out, though the above might be the least worst, if you see what I mean.
>
> You could also try 3 8-drive RAID6s or 2 12-drive RAID6s but you'd
> definitely have to be careful - as Stan says - with your filesystem
> configuration because of the stripe widths, and the bigger your parity
> RAIDs the worse your small write and degraded performance becomes.
>
> Cheers,
>
> John.
>
> [1] RAID6 lets you get away with sector errors while rebuilding after a
> disc failure. In addition, as it happens, setting up this arrangement
> with two drives on each controller for each of the RAID6s would mean you
> could tolerate a controller failure, albeit with horrible performance
> and you would have no redundancy left. You could configure smaller
> RAID6s or RAID10 to tolerate a controller failure too.
>



* Re: Optimizing small IO with md RAID
From: fibreraid @ 2011-05-30 15:24 UTC
  To: David Brown; +Cc: linux-raid
In-Reply-To: <is04s2$2rn$1@dough.gmane.org>

Hi All,

I appreciate the feedback, but most of it seems to be about file system
recommendations or changing to parity-less RAID, like RAID 10. In my
tests, there is no file system; I am testing the raw block device as I
want to establish the best numbers there before layering on the file
system.

-Tommy


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Optimizing small IO with md RAID
From: David Brown @ 2011-05-30 16:56 UTC
  To: linux-raid
In-Reply-To: <BANLkTikw9cqfhHAVaxZ2T2EErroCMT5Zow@mail.gmail.com>

On 30/05/11 17:24, fibreraid@gmail.com wrote:
> Hi All,
>
> I appreciate the feedback but most of it seems around File System
> recommendations or to change to parity-less RAID, like RAID 10. In my
> tests, there is no file system; I am testing the raw block device as I
> want to establish best-numbers there before layering on the file
> system.
>

I understand about testing the low-level speed before adding filesystem 
(and possibly lvm) layers, but what's wrong with parity-less RAID? 
RAID10,far has lower space efficiency than RAID5 or RAID6, but typically 
has performance close to RAID0, and it sounded like you were judging 
performance to be the most important factor.

Best regards,

David





* Re: Optimizing small IO with md RAID
From: Stan Hoeppner @ 2011-05-30 21:21 UTC
  To: fibreraid@gmail.com; +Cc: David Brown, linux-raid
In-Reply-To: <BANLkTikw9cqfhHAVaxZ2T2EErroCMT5Zow@mail.gmail.com>

On 5/30/2011 10:24 AM, fibreraid@gmail.com wrote:
> Hi All,
> 
> I appreciate the feedback but most of it seems around File System
> recommendations or to change to parity-less RAID, like RAID 10. In my
> tests, there is no file system; I am testing the raw block device as I
> want to establish best-numbers there before layering on the file
> system.

You're not performing a valid test case.  You will always have a
filesystem in production.  The performance of every md raid level plus
filesystem plus hardware combination will be different, and thus they
must be tuned together, not each in isolation, especially in the case of
SSDs.   Case in point:  slap EXT2 on your current array setup and then
XFS.  Test each with file based IO.  You'll see XFS has radically
superior parallel IO performance compared to EXT2.  Tweaking the array
setup will not yield significant EXT2 speedup for parallel IO.

Disk striping was invented 2+ decades ago to increase performance of
slow spindles for large file reads and writes, but the performance is
very low for small file IO due to partial stripe width operations taking
many of your spindles out of play, decreasing parallelism.  Adding
parity to the striping exacerbates this problem.  This is the classic
trade off between performance and redundancy.

SSDs have no moving parts, and natively have extremely high IOPS and
throughput rates, each SSD having on the order of 150x the seek rate of
a mech drive, and 2-3x the streaming throughput rate.  Thus, striping is
irrelevant to SSD performance, and, as you've seen, will degrade small
file performance due to partial width writes etc.

If you truly want to maximize real world performance of those 24 SSDs,
take one of your striped RAID configurations and format it with XFS
using the defaults.  Then run FIO with highly parallel file based IO
tests, i.e. two to four worker threads per CPU core.  Then delete the
array and create the linear setup I previously recommended and run the
same tests.  When comparing the  results I think you'll begin to see why
I recommend this setup for both highly parallel small and large file IO.
 Your large file IO numbers may be a little smaller with this setup, but
you should be able to play with chunk size to achieve the best balance
with both small and large file IO.

Regardless of chunk size, you should still see better overall parallel
IOPS and throughput than with striping, especially parity striping.  If
you need redundancy and maximum parallel performance, and can afford to
'waste' SSDs, create 12 RAID1 devices and make a linear array of the 12,
giving XFS 12 allocation groups.  For parallel small file workloads this
will yield better performance than RAID10 for the same cost of device
space.  Large file parallel performance should be similar.
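
To make the mirrored layout concrete, here is a Python sketch that only
builds the command lines for it and runs nothing. The device names, the
md numbering and the function name are assumptions for illustration; the
commands should be reviewed before being run on real hardware.

```python
def mirrored_linear_commands(devices, linear_md="/dev/md100"):
    """Build commands for RAID1 pairs joined into one linear array plus XFS."""
    if len(devices) % 2:
        raise ValueError("need an even number of devices")
    commands = []
    mirrors = []
    for index in range(0, len(devices), 2):
        md = "/dev/md%d" % (index // 2)
        mirrors.append(md)
        commands.append(["mdadm", "--create", md, "--level=1",
                         "--raid-devices=2", devices[index], devices[index + 1]])
    # Concatenate the mirrors; one XFS allocation group per mirror.
    commands.append(["mdadm", "--create", linear_md, "--level=linear",
                     "--raid-devices=%d" % len(mirrors)] + mirrors)
    commands.append(["mkfs.xfs", "-d", "agcount=%d" % len(mirrors), linear_md])
    return commands


ssds = ["/dev/sd%s" % letter for letter in "abcdefghijklmnopqrstuvwx"]
for command in mirrored_linear_commands(ssds):
    print(" ".join(command))
```

For 24 SSDs this prints 12 RAID1 creations, one linear array over them,
and an mkfs.xfs call with agcount=12.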

-- 
Stan


* Re: md question re: max_hw_sectors_kb
From: fibreraid @ 2011-05-31  3:06 UTC
  To: Martin K. Petersen
  Cc: NeilBrown, Michael Reed, linux-raid, Jeremy Higdon,
	Hannes Reinecke
In-Reply-To: <yq1oc38poej.fsf@sermon.lab.mkp.net>

Hi Mike,

Thank you for the patch. I've tested it on 2.6.38, though, and do not
see any significant change in md RAID 5 write performance compared to
the unpatched kernel. Can you clarify how you may have achieved a
demonstrable improvement?

-Tommy


On Wed, May 11, 2011 at 8:51 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:
>
> Neil,
>
>>> Your fix is functionally correct. However, another case just popped
>>> up this week where we need to distinguish between stacking driver and
>>> LLD defaults.
>
> Neil> What case is this?
>
> This particular case involved the need to set different defaults for
> discard depending on whether it was a stacking or a low level driver.
>
>
> Neil> If you have FS -> DM -> MD, then any change that MD makes to
> Neil> max_hw_sectors_kb will not be visible to the FS.  So adding and
> Neil> activating a hot spare with smaller max_hw_sectors_kb could cause
> Neil> it to receive requests that are too big.
>
> Yeah, this issue pops up occasionally. Alasdair and I were discussing it
> just a couple of weeks ago.
>
>
> Neil> So we really need a proper resolution to this problem first.
> Neil> i.e. A way for 'dm' to notice when 'md' changes its parameters -
> Neil> or in general any stacking device to find out when an underlying
> Neil> device changes in any way.
>
> Neil> I would implement this by having blkdev_get{,_by_path,_by_dev}
> Neil> take an extra arg which is a pointer to a struct of functions.  In
> Neil> the first instance there would be just one which tells the claimer
> Neil> that something in queue.limits has changed.  Later we could add
> Neil> other calls to help with size changes.
>
> I agree we need a way to propagate queue limit and capacity changes up
> the stack. I'll put it on my todo list.
>
> --
> Martin K. Petersen      Oracle Linux Engineering

* Re: Optimizing small IO with md RAID
From: Stefan /*St0fF*/ Hübner @ 2011-05-31  3:23 UTC
  To: fibreraid@gmail.com; +Cc: linux-raid
In-Reply-To: <BANLkTi=236kncpunzodSci-1K33u_FBkPA@mail.gmail.com>

Hi,

are those LSISAS2008 in IR or in IT mode?  Software RAID performance on
those controllers is really bad with a high throw-out in IR mode, as the
IR mode is made for those "integrated RAID" types like RAID0, RAID1,
RAID1E and RAID10.

We've seen much better software RAID performance on this controller in IT
(IniTiator) mode.  See the downloads section at
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas9211-8i/index.html#Product%20Brief

If your controller BIOS already says "SAS2008-IT" or "LSI 9211-IT" on
boot-up, then you already have IT firmware on it.  That would be the
moment I'd start thinking about a 9265 controller and not software RAID.

I mean, with a Westmere board and CPU ... you spend enough money on the
hardware, but you want to save on the real bottleneck?  Sounds a bit
irrational to me...

Cheers,
Stefan

Am 30.05.2011 09:14, schrieb fibreraid@gmail.com:
> Hi all,
> 
> I am looking to optimize md RAID performance as much as possible.
> 
> I've managed to get some rather strong large 4M IOps performance, but
> small 4K IOps are still rather subpar, given the hardware.
> 
> CPU: 2 x Intel Westmere 6-core 2.4GHz
> RAM: 24GB DDR3 1066
> SAS controllers: 3 x LSI SAS2008 (6 Gbps SAS)
> Drives: 24 x SSD's
> Kernel: 2.6.38 x64 kernel (home-grown)
> Benchmarking Tool: fio 1.54
> 
> Here are the results.I used the following commands to perform these benchmarks:
> 
> 4K READ: fio --bs=4k --direct=1 --rw=read --ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4K WRITE: fio --bs=4k --direct=1 --rw=write--ioengine=libaio
> --iodepth=512 --runtime=60 --name=/dev/md0
> 4M READ: fio --bs=4m --direct=1 --rw=read --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0
> 4M WRITE: fio --bs=4m --direct=1 --rw=write --ioengine=libaio
> --iodepth=64 --runtime=60 --name=/dev/md0
> 
> In each case below, the md chunk size was 64K. In RAID 5 and RAID 6,
> one hot-spare was specified.
> 
>            raid0 24 x SSD  raid5 23 x SSD  raid6 23 x SSD  raid0 (2 * (raid5 x 11 SSD))
> 4K read    179,923 IO/s    93,503 IO/s     116,866 IO/s    75,782 IO/s
> 4K write   168,027 IO/s    108,408 IO/s    120,477 IO/s    90,954 IO/s
> 4M read    4,576.7 MB/s    4,406.7 MB/s    4,052.2 MB/s    3,566.6 MB/s
> 4M write   3,146.8 MB/s    1,337.2 MB/s    1,259.9 MB/s    1,856.4 MB/s
> 
> Note that each individual SSD tests out as follows:
> 
> 4k read: 56,342 IO/s
> 4k write: 33,792 IO/s
> 4M read: 231 MB/s
> 4M write: 130 MB/s
> 
> 
> My concerns:
> 
> 1. Given the above individual SSD performance, 24 SSD's in an md array
> is at best getting 4K read/write performance of 2-3 drives, which
> seems very low. I would expect significantly better linear scaling.
> 2. On the other hand, 4M read/write are performing more like 10-15
> drives, which is much better, though still seems like it could get
> better.
> 3. 4k read/write looks good for RAID 0, but drops off by over 40% with
> RAID 5. While somewhat understandable on writes, why such a
> significant hit on reads?
> 4. RAID 5 4M writes take a big hit compared to RAID 0, from 3146 MB/s
> to 1337 MB/s. Despite the RAID 5 overhead, that still seems huge given
> the CPU's at hand. Why?
> 5. Using a RAID 0 across two 11-SSD RAID 5's gives better RAID 5 4M
> write performance, but worse in reads and significantly worse in 4K
> reads/writes. Why?
> 
> 
> Any thoughts would be greatly appreciated, especially patch ideas for
> tweaking options. Thanks!
> 
> Best,
> Tommy


^ permalink raw reply

* Re: Optimizing small IO with md RAID
From: Joe Landman @ 2011-05-31  3:48 UTC (permalink / raw)
  To: fibreraid@gmail.com; +Cc: linux-raid
In-Reply-To: <BANLkTi=236kncpunzodSci-1K33u_FBkPA@mail.gmail.com>

On 05/30/2011 03:14 AM, fibreraid@gmail.com wrote:
> Hi all,
>
> I am looking to optimize md RAID performance as much as possible.
>
> I've managed to get some rather strong large 4M IOps performance, but
> small 4K IOps are still rather subpar, given the hardware.

Understand that much of what passes for realistic test cases for SSDs 
are ... well ... not that good.  Write something other than zeros, and 
turn off write caching on the SSDs.  Then you get similar results to 
what you see.

[...]

> In each case below, the md chunk size was 64K. In RAID 5 and RAID 6,
> one hot-spare was specified.
>
>            raid0 24 x SSD  raid5 23 x SSD  raid6 23 x SSD  raid0 (2 * (raid5 x 11 SSD))
> 4K read    179,923 IO/s    93,503 IO/s     116,866 IO/s    75,782 IO/s
> 4K write   168,027 IO/s    108,408 IO/s    120,477 IO/s    90,954 IO/s

A 4k random read/write? Or a sequential?  The 4k sequential reads/writes 
will be merged into a larger size.

A 4k write is going to result in a read-modify-write cycle for this 
config.
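
To illustrate the read-modify-write cycle, here is a minimal sketch of the 
parity arithmetic for a single-parity (RAID 5 style) stripe.  It is not md's 
code; the function names and the tiny chunk size are made up for the 
example.  The point is that a small write has to read the old data and old 
parity before it can write the new data and new parity.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_parity(old_data, new_data, old_parity):
    """Parity after overwriting one chunk: P' = P xor D_old xor D_new."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


def full_parity(chunks):
    """Parity recomputed from scratch over every data chunk in the stripe."""
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return parity
```

So one small logical write turns into two reads plus two writes on the 
member devices, which is where the extra cost on the write side comes from.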

These results suggest about 7k IOPS per drive for 4k writes, and about 
7.5k IOPS per drive for 4k reads.  Are these Intel drives?  These numbers 
are in line with what I've measured for them.
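
The arithmetic behind those figures can be sketched as below, assuming they 
come from dividing the quoted RAID 0 totals evenly across the 24 drives 
(that split is an assumption; the helper name is made up).

```python
def per_drive_iops(total_iops, drives):
    """Average IOPS each member contributes if load is spread evenly."""
    return total_iops / drives


# RAID 0 totals from the quoted table, 24 SSDs in the array.
raid0_read = per_drive_iops(179923, 24)    # roughly 7.5k per drive
raid0_write = per_drive_iops(168027, 24)   # roughly 7.0k per drive

print("4k read : %.0f IOPS per drive" % raid0_read)
print("4k write: %.0f IOPS per drive" % raid0_write)
```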

> 4M read    4,576.7 MB/s    4,406.7 MB/s    4,052.2 MB/s    3,566.6 MB/s
> 4M write   3,146.8 MB/s    1,337.2 MB/s    1,259.9 MB/s    1,856.4 MB/s
>
> Note that each individual SSD tests out as follows:
>
> 4k read: 56,342 IO/s
> 4k write: 33,792 IO/s
> 4M read: 231 MB/s
> 4M write: 130 MB/s

Is write caching on in this case but not the other?

>
>
> My concerns:
>
> 1. Given the above individual SSD performance, 24 SSD's in an md array
> is at best getting 4K read/write performance of 2-3 drives, which
> seems very low. I would expect significantly better linear scaling.

You've got lots of RMW cycles going on for the write side, I wouldn't 
expect million IOP performance out of a system like this.


> 2. On the other hand, 4M read/write are performing more like 10-15
> drives, which is much better, though still seems like it could get
> better.

These controllers are often on PCIe-x8 gen 2 ports.  That's 4 GB/s maximum 
in each direction.  After the overhead on the bus, you get about 86% of 
that bandwidth, which is roughly 3.4 GB/s.  So your 4+ GB/s results are 
either the result of caching, or multiple controllers.  Since I see the 
direct=1, I am guessing multiple controllers.  Unless you have a single 
controller in a PCIe-x16 gen 2 slot ...
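
A rough sketch of that bandwidth estimate follows.  The 5 GT/s lane rate 
and 8b/10b encoding are the standard PCIe gen 2 figures; the 86% efficiency 
is the number quoted above, and the function name is made up for the 
example.

```python
def pcie_gen2_gbytes_per_s(lanes, efficiency=1.0):
    """Approximate one-direction PCIe gen 2 bandwidth in GB/s."""
    transfers_per_s = 5.0e9      # 5 GT/s per lane
    encoding = 8 / 10            # 8b/10b: 8 payload bits per 10 on the wire
    bytes_per_lane = transfers_per_s * encoding / 8
    return lanes * bytes_per_lane * efficiency / 1e9


raw_x8 = pcie_gen2_gbytes_per_s(8)            # 4.0 GB/s
usable_x8 = pcie_gen2_gbytes_per_s(8, 0.86)   # about 3.4 GB/s

print("x8 raw:    %.2f GB/s" % raw_x8)
print("x8 usable: %.2f GB/s" % usable_x8)
```

Anything measured well above the usable figure on a single x8 slot would 
point at caching or at more than one controller carrying the traffic.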



-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply

* RE: creating degraded raid1 with imsm metadata
From: Jiang, Dave @ 2011-05-31 14:36 UTC (permalink / raw)
  To: FDi; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <20110528171354.GA9767@r00t3d.com>

> -----Original Message-----
> From: FDi [mailto:fld@r00t3d.com]
> Sent: Saturday, May 28, 2011 10:14 AM
> To: Jiang, Dave
> Cc: linux-raid@vger.kernel.org
> Subject: Re: creating degraded raid1 with imsm metadata
> 
> On Thu, May 26, 2011 at 09:40:16AM -0700, Jiang, Dave wrote:
> > > -----Original Message-----
> > > From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> > > owner@vger.kernel.org] On Behalf Of FDi
> > > Sent: Thursday, May 26, 2011 12:43 AM
> > > To: linux-raid@vger.kernel.org
> > > Subject: creating degraded raid1 with imsm metadata
> > >
> > > Hello *,
> > >
> > > Since Intel's Matrix Storage Manager option ROM doesn't support
> > > creating of degraded arrays I was wondering if I could use mdadm to
> > > make one? I had a very hard time finding documentation about how
> > > mdadm is supposed to work with imsm.
> > >
> > > The plan is to make a 2x1TB raid1 with one device missing and then
> > > later add the other disk in once all the data has been copied to the
> > > degraded array. So a typical raid1 migration scenario, which Intel
> > > oddly enough doesn't seem to support with their option ROM.
> >
> > Not sure if that's possible, but have you looked at the Linux RAID wiki on
> > IMSM information?
> > https://raid.wiki.kernel.org/index.php/RAID_setup#External_Metadata
> I wasn't able to figure out how to do what I wanted based on the wiki, but
> after lots of googling I found the exact commands:
> 
> mdadm --create --force -v -e imsm --level=container -n 1 /dev/md/imsm /dev/sdb
> 
> mdadm --create -v --level raid1 -n 2 /dev/md/myraid /dev/sdb missing
> 
> However, I also learned that these commands have to be run on the target
> machine while it's running with RAID mode selected in the BIOS. Otherwise
> you will get this warning:
> 
> mdadm: imsm unable to enumerate platform support
>     array may not be compatible with hardware/firmware
> 	 Continue creating array?
> 
> And indeed, if that warning is displayed during the create, Intel's option ROM
> won't see a working array on the device. I'm kinda curious why exactly that is.
> What kind of information does mdadm use from the controller running in RAID
> mode?

It queries the table exported by the OROM that describes the OROM's capabilities. Limits such as chunk size, level, and number of devices are checked. This ensures that the OROM can recognize the RAID volume you created on Linux. You can also bypass the warning by running: "export IMSM_NO_PLATFORM=1"
 
> When I created my array on the target machine using the commands from
> above it worked correctly and Intel option rom saw the array and was able to
> boot from the MBR I installed on the array as a test. Haven't tested rebuilding
> yet.

^ permalink raw reply

* Re: Storage device enumeration script
From: Simon McNair @ 2011-05-31 18:51 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid
In-Reply-To: <4DDE40F7.1090908@turmel.org>

Hi Phil,
Thanks for doing this work off your own bat, I'm sure it'll help a lot 
of people.

My 2p would be to ask if you could put a comment at the start of the 
file naming the prerequisite applications/packages; it may help the 
people it doesn't work for (initially).  The other thing is that 
the VirtualBox install of Ubuntu 11.04 that I just completed did not 
contain the LVM2 package.

regards
Simon

On 26/05/2011 13:00, Phil Turmel wrote:
> On 05/26/2011 04:24 AM, CoolCold wrote:
> [...]
>> On Debian Lenny produces error:
>> root@gamma2:/tmp# python lsdrv
>> Traceback (most recent call last):
>>    File "lsdrv", line 17, in <module>
>>      import os, io, re
>> ImportError: No module named io
> Huh.  The 'io' module is v2.6 and above.  Another reason to make a helper function for reading these files.
>
> Phil
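
For what it's worth, a helper of the kind Phil mentions could look roughly 
like the sketch below.  This is not lsdrv's actual code; the name read_text 
and the default argument are made up here.  It only uses the builtin 
open(), so it does not need the 'io' module that is missing on older 
Python versions.

```python
def read_text(path, default=""):
    """Return the stripped contents of a small text file.

    Falls back to `default` if the file cannot be opened, which is
    common for optional sysfs attributes.
    """
    try:
        f = open(path)
    except (IOError, OSError):
        return default
    try:
        return f.read().strip()
    finally:
        f.close()
```

Something like read_text("/sys/block/sda/size", "0") would then replace 
each open/read/close sequence (the path is only an example).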

^ permalink raw reply

* Re: Storage device enumeration script
From: CoolCold @ 2011-05-31 21:21 UTC (permalink / raw)
  To: simonmcnair; +Cc: Phil Turmel, linux-raid
In-Reply-To: <4DE538C2.5050504@gmail.com>

On Tue, May 31, 2011 at 10:51 PM, Simon McNair <simonmcnair@gmail.com> wrote:
> Hi Phil,
> Thanks for doing this work off your own back, I'm sure it'll help a lot of
> people.
>
> My 2p would be to ask if you could have a comment at the start of the file
> with the prerequisite applications/packages named it may help the people
> that it doesn't work for (initially).  The other thing is that the
> virtualbox install of Ubuntu 11.04 that I just completed did not contain the
> LVM2 package.
There is Debian packaging for this tool; check out
https://github.com/pturmel/lsdrv
You should be able to build your package with proper dependencies.


>
> regards
> Simon
>
> On 26/05/2011 13:00, Phil Turmel wrote:
>>
>> On 05/26/2011 04:24 AM, CoolCold wrote:
>> [...]
>>>
>>> On Debian Lenny produces error:
>>> root@gamma2:/tmp# python lsdrv
>>> Traceback (most recent call last):
>>>   File "lsdrv", line 17, in <module>
>>>     import os, io, re
>>> ImportError: No module named io
>>
>> Huh.  The 'io' module is v2.6 and above.  Another reason to make a helper
>> function for reading these files.
>>
>> Phil



-- 
Best regards,
[COOLCOLD-RIPN]

^ permalink raw reply

* rebuild after reboot
From: Pol Hallen @ 2011-05-31 21:55 UTC (permalink / raw)
  To: linux-raid

Hi folks :-)

Yesterday I replaced a failed disk; after the recovery completed, my RAID
was fine. I rebooted, and at start-up mdadm was in recovery mode again
(one disk failed).

Now I wonder: do I have to run mdadm --examine --scan >>
/etc/mdadm/mdadm.conf every time I replace a disk, or only when I create a
new RAID?

Why does mdadm recover the array again after a reboot?

What is the problem?

thanks! :-)

Pol


^ permalink raw reply

* Re: Storage device enumeration script
From: Phil Turmel @ 2011-06-01  3:43 UTC (permalink / raw)
  To: John Robinson; +Cc: linux-raid
In-Reply-To: <4DDF8997.5090109@turmel.org>

Hi John,

On 05/27/2011 07:23 AM, Phil Turmel wrote:
> On 05/27/2011 05:44 AM, John Robinson wrote:
>> On 27/05/2011 10:15, John Robinson wrote:
>> [...]
>>> I'm not entirely sure where dev.ID_ etc are supposed to be coming
>>> from, but if it's that `blkid -p -o udev /dev/block/8:0` then I'm
>>> afraid CentOS 5's blkid doesn't understand the -p or -o udev options,
>>> it doesn't produce any output for whole drives with partition tables,
>>> and there isn't a /dev/block directory. It's blkid 1.0.0 from
>>> e2fsprogs 1.39-23.el5_5.1.
>>>
>>> If that knocks CentOS 5 support on the head then so be it...
>>
>> Hmm, udevinfo might be of some use. Still doesn't say it's found a DOS
>> partition table, but it does get you e.g. ID_FS_TYPE=linux_raid_member and perhaps `file -s` will tell you there's DOS partition table (sort of).
> 
> I'll look into this when I have a new CentOS 5 VM installed on my laptop.  I do want lsdrv to work with all of the CentOS 5 releases.

I've been playing with lsdrv in a CentOS 5 VM, and found a number of items to address.  The result has been pushed to github, with a detailed description.  Please give it a whirl.

https://github.com/pturmel/lsdrv

(Further bug reports should be posted there...  trying to keep the noise down on linux-raid.)

Phil

^ permalink raw reply

* Re: Storage device enumeration script
From: Phil Turmel @ 2011-06-01  3:58 UTC (permalink / raw)
  To: CoolCold; +Cc: simonmcnair, linux-raid
In-Reply-To: <BANLkTimxnJaS0J6ZWLfZUo1GHiahMA+4gQ@mail.gmail.com>

Hi Simon,

On 05/31/2011 05:21 PM, CoolCold wrote:
> On Tue, May 31, 2011 at 10:51 PM, Simon McNair <simonmcnair@gmail.com> wrote:
>> Hi Phil,
>> Thanks for doing this work off your own back, I'm sure it'll help a lot of
>> people.
>>
>> My 2p would be to ask if you could have a comment at the start of the file
>> with the prerequisite applications/packages named it may help the people
>> that it doesn't work for (initially).  The other thing is that the
>> virtualbox install of Ubuntu 11.04 that I just completed did not contain the
>> LVM2 package.
> There is Debian packaging for this tool; check out
> https://github.com/pturmel/lsdrv
> You should be able to build your package with proper dependencies.

The latest version (this evening) reports on the utilities that didn't work, so you can look for what you need.  Of course, if you have LVM volumes, but not the utilities, your system won't actually be able to start up those devices.

If you don't have LVM volumes, you can safely ignore the warning.  Hmmm.  Maybe I should suppress that if no LVM PVs are found in the first pass....

G'night!

Phil

^ permalink raw reply

* Raid1 info (not active degraded raid)
From: Pol Hallen @ 2011-06-01 16:32 UTC (permalink / raw)
  To: linux-raid

Hi folks :-)

I built a raid1 with mdadm. As a test I disconnected a disk, and the raid1 
ran perfectly.

After a reboot, /dev/md1 (raid1) doesn't appear in cat /proc/mdstat

(I also have a raid6 on /dev/md0)

mdadm -E /dev/sdh (a disk of the raid1) shows me the correct RAID

How do I start this RAID?

The raid6 works, so mdadm runs.

Thanks, and sorry for my bad English :-@
 
Pol

^ permalink raw reply

* Re: Raid1 info (not active degraded raid)
From: Phil Turmel @ 2011-06-01 17:34 UTC (permalink / raw)
  Cc: linux-raid
In-Reply-To: <201106011832.51688.raid2@fuckaround.org>

On 06/01/2011 12:32 PM, Pol Hallen wrote:
> Hi folks :-)
> 
> I built a raid1 with mdadm. As a test I disconnected a disk, and the raid1 
> ran perfectly.
> 
> After a reboot, /dev/md1 (raid1) doesn't appear in cat /proc/mdstat
> 
> (I also have a raid6 on /dev/md0)
> 
> mdadm -E /dev/sdh (a disk of the raid1) shows me the correct RAID
> 
> How do I start this RAID?

Look at kernel parameter "md-mod.start_dirty_degraded=1"

> raid6 works so mdadm runs.
> 
> thanks and sorry for my bad english :-@

Your English is fine.  Your return address is not so nice.  I'm somewhat surprised it got through my spam filters.  It won't again.

HTH,

Phil

^ permalink raw reply

* About monitor raid event using poll/select
From: majianpeng @ 2011-06-02  2:34 UTC (permalink / raw)
  To: linux-raid

I recently worked on a project that monitors RAID events on a system.
I did it this way:
while(1){
	open(/proc/mdstat);
	poll();
    do-something
	close
}
I want to catch every event as it happens. Suppose the events are A --10s--> B --1 hour--> C, and do-something takes 30s.
First I fetch event A. But do-something takes 30s, so I miss event B.
Since the interval between B and C is 1 hour, I only detect event B an hour later.

I can modify do-something to make it faster, but in theory I can still miss an event.
So I am thinking of modifying the kernel function mdstat_poll:
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 39b27c4..aaecc03 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6539,8 +6539,10 @@ static unsigned int mdstat_poll(struct file *filp, poll_table *wait)
        /* always allow read */
        mask = POLLIN | POLLRDNORM;
 
-   if (mi->event != atomic_read(&md_event_count))
+ if (mi->event != atomic_read(&md_event_count)){
+              mi->event = atomic_read(&md_event_count);
                mask |= POLLERR | POLLPRI;
+ }
        return mask;
 }
 The user-space code would then look like this:
open(/proc/mdstat)
while(1){
	poll
	do-something
}
close

2011-06-02 



majianpeng 


^ permalink raw reply related

* Re: About monitor raid event using poll/select
From: NeilBrown @ 2011-06-02  2:50 UTC (permalink / raw)
  To: majianpeng; +Cc: linux-raid
In-Reply-To: <201106021034490930721@gmail.com>

On Thu, 2 Jun 2011 10:34:52 +0800 "majianpeng" <majianpeng@gmail.com> wrote:

> I recently worked on a project that monitors RAID events on a system.
> I did it this way:
> while(1){
> 	open(/proc/mdstat);
> 	poll();
>     do-something
> 	close
> }

This is the wrong way to use 'poll' - no matter what sort of file descriptor
you have.

You need to open /proc/mdstat and keep it open.

You then read /proc/mdstat and possibly act upon the state that you find
there.
Then you call 'poll'.
When poll tells you that something has happened, you seek back to the start
of the file and read it again.  If something has changed, you can act on that
change.
Then you can call 'poll' again.

'poll' tells you if something has changed since the last time you read the
file.
When you open the file and use poll straight away it should tell you that the
file has data ready for you.. maybe it doesn't but it should.  Because there
is data there for you to read.
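
As a rough user-space sketch of the sequence described above (open once, 
read, poll, seek back, read again), something like the following should 
work.  The function names, the callback, and the timeout/max_wakeups 
arguments are made up for illustration.  Waiting on POLLPRI | POLLERR 
matches the mask that mdstat_poll sets when the event count changes.

```python
import select


def read_state(f):
    """Seek back to the start of an already-open file and read it all."""
    f.seek(0)
    return f.read()


def watch(path="/proc/mdstat", on_change=print, timeout_ms=None,
          max_wakeups=None):
    """Keep `path` open and call on_change() whenever its content changes."""
    with open(path, "r") as f:
        # Read first, so poll() reports changes since this read.
        state = read_state(f)
        on_change(state)

        poller = select.poll()
        poller.register(f.fileno(), select.POLLPRI | select.POLLERR)

        wakeups = 0
        while max_wakeups is None or wakeups < max_wakeups:
            if not poller.poll(timeout_ms):
                break                      # timed out, nothing happened
            wakeups += 1
            new_state = read_state(f)      # re-reading re-arms the poll
            if new_state != state:
                state = new_state
                on_change(state)
```

Calling watch() with no arguments would block and print /proc/mdstat each 
time it changes.  If the handler is slow, events that arrive in the 
meantime are not lost: the next poll returns at once and the re-read shows 
the latest state.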

NeilBrown





> I want to catch every event as it happens. Suppose the events are A --10s--> B --1 hour--> C, and do-something takes 30s.
> First I fetch event A. But do-something takes 30s, so I miss event B.
> Since the interval between B and C is 1 hour, I only detect event B an hour later.
> 
> I can modify do-something to make it faster, but in theory I can still miss an event.
> So I am thinking of modifying the kernel function mdstat_poll:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 39b27c4..aaecc03 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6539,8 +6539,10 @@ static unsigned int mdstat_poll(struct file *filp, poll_table *wait)
>         /* always allow read */
>         mask = POLLIN | POLLRDNORM;
>  
> -   if (mi->event != atomic_read(&md_event_count))
> + if (mi->event != atomic_read(&md_event_count)){
> +              mi->event = atomic_read(&md_event_count);
>                 mask |= POLLERR | POLLPRI;
> + }
>         return mask;
>  }
>  The user-space code would then look like this:
> open(/proc/mdstat)
> while(1){
> 	poll
> 	do-something
> }
> close
> 
> 2011-06-02 
> 
> 
> 
> majianpeng 
> 


^ permalink raw reply

* [PATCH 1/2] md: convert double atomic_inc() to atomic_add()
From: Namhyung Kim @ 2011-06-02  4:53 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, linux-kernel

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/md.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index aa640a85bb21..f210e42a56ca 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -396,8 +396,7 @@ static void submit_flushes(struct work_struct *ws)
 			 * we reclaim rcu_read_lock
 			 */
 			struct bio *bi;
-			atomic_inc(&rdev->nr_pending);
-			atomic_inc(&rdev->nr_pending);
+			atomic_add(2, &rdev->nr_pending);
 			rcu_read_unlock();
 			bi = bio_alloc_mddev(GFP_KERNEL, 0, mddev);
 			bi->bi_end_io = md_end_flush;
-- 
1.7.5.2

^ permalink raw reply related

* [PATCH 2/2] md: check ->hot_remove_disk when removing disk
From: Namhyung Kim @ 2011-06-02  4:53 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, linux-kernel, stable
In-Reply-To: <1306990383-14133-1-git-send-email-namhyung@gmail.com>

Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store()
during disk removal. The linear personality only has ->hot_add_disk and
no ->hot_remove_disk, so removing a disk from the array resulted in the
following kernel BUG:

$ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3]
$ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot
 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<          (null)>]           (null)
 PGD c9f5d067 PUD 8575a067 PMD 0
 Oops: 0010 [#1] SMP
 CPU 2
 Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg

 Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO
 RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
 RSP: 0018:ffff880085757df0  EFLAGS: 00010282
 RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e
 RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000
 RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a
 R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff
 R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000
 FS:  00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000)
 Stack:
  ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000
  ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90
  ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98
 Call Trace:
  [<ffffffff8138496a>] ? slot_store+0xaa/0x265
  [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8
  [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144
  [<ffffffff81106b87>] vfs_write+0xb1/0x10d
  [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135
  [<ffffffff81106cac>] sys_write+0x4d/0x77
  [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b
 Code:  Bad RIP value.
 RIP  [<          (null)>]           (null)
  RSP <ffff880085757df0>
 CR2: 0000000000000000
 ---[ end trace ba5fc64319a826fb ]---

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: stable@kernel.org
---
 drivers/md/md.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f210e42a56ca..3db106b7b245 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2461,7 +2461,7 @@ slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 		if (rdev->raid_disk == -1)
 			return -EEXIST;
 		/* personality does all needed checks */
-		if (rdev->mddev->pers->hot_add_disk == NULL)
+		if (rdev->mddev->pers->hot_remove_disk == NULL)
 			return -EINVAL;
 		err = rdev->mddev->pers->
 			hot_remove_disk(rdev->mddev, rdev->raid_disk);
-- 
1.7.5.2

^ permalink raw reply related

* Re: [PATCH 1/2] md: convert double atomic_inc() to atomic_add()
From: NeilBrown @ 2011-06-02  5:47 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: linux-raid, linux-kernel
In-Reply-To: <1306990383-14133-1-git-send-email-namhyung@gmail.com>

On Thu,  2 Jun 2011 13:53:02 +0900 Namhyung Kim <namhyung@gmail.com> wrote:

> Signed-off-by: Namhyung Kim <namhyung@gmail.com>
> ---
>  drivers/md/md.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index aa640a85bb21..f210e42a56ca 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -396,8 +396,7 @@ static void submit_flushes(struct work_struct *ws)
>  			 * we reclaim rcu_read_lock
>  			 */
>  			struct bio *bi;
> -			atomic_inc(&rdev->nr_pending);
> -			atomic_inc(&rdev->nr_pending);
> +			atomic_add(2, &rdev->nr_pending);
>  			rcu_read_unlock();
>  			bi = bio_alloc_mddev(GFP_KERNEL, 0, mddev);
>  			bi->bi_end_io = md_end_flush;

Thanks, but I don't think I want this patch.
I'm happy having two separate 'atomic_inc' calls. I think it makes it a bit
clearer what is happening, and it is easier to search for all atomic_inc
calls.

NeilBrown

^ permalink raw reply

