* What are these reads in what should be simply a full-stripe write?
@ 2012-02-28 18:47 John Adams
2012-02-28 23:14 ` linbloke
2012-02-29 5:52 ` Doug Dumitru
0 siblings, 2 replies; 4+ messages in thread
From: John Adams @ 2012-02-28 18:47 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
For some years I've been working on some niche filesystems which serve
workflows involving lots of video. Lately, I have had occasion to
investigate the behavior of md as a possible raid solution (2.6.32
kernel).
As part of that, we looked at some fio based loads in the buffered and
O_DIRECT cases and noticed some reading that we didn't understand when
using O_DIRECT. We were led to this comparison by incorrect
information from a vendor. (We were trying to repro some reported
performance and were initially told that O_DIRECT had been used).
We are aware of the problems discussed concerning O_DIRECT. As fs
guys, we're accustomed to worrying about copies and such, so it wasn't
immediately obvious to us that O_DIRECT would be a mistake in our
case. This is essentially an embedded system with a single process
owning a group of disks with no filesystem. There is no possibility
of a race with another process.
Anyway, I am curious about this reading behavior and I would be
grateful for any comments.
I tried writing single stripes under both scenarios. To give the
barest possible summary: I used a dd command like the one below, with
and without oflag=direct. It was driven from a script that sets up
blktrace and ftrace, waits an appropriate time in the buffered case,
etc.
dd oflag=direct if=/dev/zero of=/dev/md0 seek=0 bs=1M count=1
8+2 128k strip
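As a sanity check on the geometry (a sketch, not part of the original test script): with 8 data disks and a 128 KiB strip, one full stripe holds 8 x 128 KiB = 1 MiB of data, so a dd with bs=1M, count=1, seek=0 should cover exactly one aligned stripe. The function and variable names are made up for illustration.

```python
KIB = 1024

def full_stripe_bytes(data_disks: int, strip_kib: int) -> int:
    """Data bytes in one full stripe (parity strips are not counted)."""
    return data_disks * strip_kib * KIB

# Geometry from this post: 8 data + 2 parity disks, 128 KiB strip.
stripe = full_stripe_bytes(data_disks=8, strip_kib=128)

# dd parameters from this post: bs=1M count=1 seek=0.
bs, count, seek = 1024 * KIB, 1, 0
write_start = seek * bs
write_len = bs * count

# A full-stripe write starts on a stripe boundary and is one stripe long.
covers_one_stripe = (write_start % stripe == 0) and (write_len == stripe)
print(stripe, covers_one_stripe)
```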
[physical disk completions via blkparse]
Buffered:
Reads Completed: 2, 5KiB Writes Completed: 4, 258KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 2, 8KiB Writes Completed: 3, 130KiB
Reads Completed: 6, 24KiB Writes Completed: 4, 258KiB
Direct Example 1:
Reads Completed: 2, 5KiB Writes Completed: 20, 258KiB
Reads Completed: 9, 36KiB Writes Completed: 14, 130KiB
Reads Completed: 32, 128KiB Writes Completed: 14, 130KiB
Reads Completed: 1, 4KiB Writes Completed: 16, 130KiB
Reads Completed: 32, 128KiB Writes Completed: 12, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 8, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 8, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 8, 130KiB
Reads Completed: 2, 8KiB Writes Completed: 8, 130KiB
Reads Completed: 6, 24KiB Writes Completed: 19, 258KiB
Direct Example 2:
Reads Completed: 4, 133KiB Writes Completed: 3, 130KiB
Reads Completed: 11, 164KiB Writes Completed: 3, 130KiB
Reads Completed: 34, 256KiB Writes Completed: 3, 130KiB
Reads Completed: 2, 132KiB Writes Completed: 3, 130KiB
Reads Completed: 33, 256KiB Writes Completed: 3, 130KiB
Reads Completed: 3, 136KiB Writes Completed: 3, 130KiB
Reads Completed: 7, 152KiB Writes Completed: 3, 130KiB
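For comparing runs, the per-disk summary lines above can be totalled mechanically. This is a minimal sketch that assumes the lines look exactly like the ones quoted in this post; the helper names are hypothetical and it is not a general blkparse parser.

```python
import re

# Matches lines such as:
#   Reads Completed: 2, 5KiB Writes Completed: 4, 258KiB
SUMMARY_RE = re.compile(
    r"Reads Completed:\s*(\d+),\s*(\d+)KiB\s+"
    r"Writes Completed:\s*(\d+),\s*(\d+)KiB"
)

def parse_summary(line):
    """Return a dict of counts and KiB for one summary line, or None."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    reads, read_kib, writes, write_kib = map(int, m.groups())
    return {"reads": reads, "read_kib": read_kib,
            "writes": writes, "write_kib": write_kib}

def total_read_kib(lines):
    """Sum the KiB read across all summary lines that parse."""
    parsed = (parse_summary(line) for line in lines)
    return sum(p["read_kib"] for p in parsed if p is not None)

# The three buffered-case disks that reported any reads.
BUFFERED_SAMPLE = [
    "Reads Completed: 2, 5KiB Writes Completed: 4, 258KiB",
    "Reads Completed: 2, 8KiB Writes Completed: 3, 130KiB",
    "Reads Completed: 6, 24KiB Writes Completed: 4, 258KiB",
]
print(total_read_kib(BUFFERED_SAMPLE))
```

Run over a whole capture, this gives a single read total per scenario, which makes the buffered and direct cases easier to compare.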
I was able to gain a little bit of insight through blktrace and
ftrace. Our initial assumption was that maybe things were being
broken up differently such that md thought it needed to do an RMW.
But as I dug into the blktrace output, that did not seem to be the
case (reads are coming after what is obviously the strip write). I
used ftrace to show me the path down to md_make_request in the
O_DIRECT and buffered cases. This showed me some calls referring to
readahead in the direct case.
<...>-14859 [001] 510340.525310: md_make_request
<...>-14859 [001] 510340.525311: <stack trace>
=> generic_make_request
=> submit_bio
=> submit_bh
=> block_read_full_page
=> blkdev_readpage
=> __do_page_cache_readahead
=> force_page_cache_readahead
=> page_cache_sync_readahead
So is this read ahead I'm observing? Why does it occur only in the
direct case?
I noticed that blktrace sometimes identifies what I assume to be the
instigator of the I/O. So I can sometimes see dd or md_raid6 there,
as in [dd] or [md0_raid6]:
8,16 1 115 0.042000000 2910 D W 2256 + 48 [md0_raid6]
These unexplained reads either mention blkid or [0] or [(null)].
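To see which originator each read is attributed to, the bracketed name can be pulled out of every event line. The sketch below assumes the default blkparse column order (device, CPU, sequence, timestamp, PID, action, RWBS, sector + count, [process]), as in the sample line above; the function names are hypothetical.

```python
import re
from collections import Counter

EVENT_RE = re.compile(
    r"^\s*(?P<major>\d+),(?P<minor>\d+)\s+(?P<cpu>\d+)\s+(?P<seq>\d+)\s+"
    r"(?P<time>\d+\.\d+)\s+(?P<pid>\d+)\s+(?P<action>\S+)\s+(?P<rwbs>\S+)\s+"
    r"(?P<sector>\d+)\s+\+\s+(?P<nsectors>\d+)\s+\[(?P<process>[^\]]*)\]"
)

def parse_event(line):
    """Return the fields of one blkparse event line as a dict, or None."""
    m = EVENT_RE.match(line)
    return m.groupdict() if m else None

def reads_by_process(lines):
    """Count read events (an 'R' in the RWBS field) per originator."""
    counts = Counter()
    for line in lines:
        ev = parse_event(line)
        if ev and "R" in ev["rwbs"]:
            counts[ev["process"]] += 1
    return counts

sample = "8,16 1 115 0.042000000 2910 D W 2256 + 48 [md0_raid6]"
event = parse_event(sample)
print(event["process"], event["rwbs"], int(event["nsectors"]) * 512)
```

Lines without a trailing [process] field, such as some completion events, are skipped by this pattern.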
It isn't clear to me whether the unexpected read behavior is due to a
tuning problem in the O_DIRECT case or simply the way things work.
Thank you for any comments.
* Re: What are these reads in what should be simply a full-stripe write?
2012-02-28 18:47 What are these reads in what should be simply a full-stripe write? John Adams
@ 2012-02-28 23:14 ` linbloke
2012-03-05 16:29 ` John Adams
2012-02-29 5:52 ` Doug Dumitru
1 sibling, 1 reply; 4+ messages in thread
From: linbloke @ 2012-02-28 23:14 UTC (permalink / raw)
To: John Adams; +Cc: linux-raid@vger.kernel.org
G'day John,
You need to give us more detail about your md raid setup. Besides a
reference to md_raid6, there are no other details about your array.
How about sending:
mdadm -V
uname -a
mdadm -Dvv /dev/mdarray
mdadm -Evv /dev/arraycomponentdevices - for all of them
Good luck in the hunt,
J
* Re: What are these reads in what should be simply a full-stripe write?
2012-02-28 18:47 What are these reads in what should be simply a full-stripe write? John Adams
2012-02-28 23:14 ` linbloke
@ 2012-02-29 5:52 ` Doug Dumitru
1 sibling, 0 replies; 4+ messages in thread
From: Doug Dumitru @ 2012-02-29 5:52 UTC (permalink / raw)
To: John Adams; +Cc: linux-raid@vger.kernel.org
Mr. Adams,
Raid 5/6 exports a parameter called "optimal_io_size". You should
find this in /sys/block/mdx/queue/optimal_io_size.
This is the size of a single stripe. In theory, if you write exactly
this size aligned blocks to raid 5/6, then the stripe cache should
handle the IO perfectly and you should see zero reads. If you miss
the boundaries, most of the time, raid 5/6 will cache the writes in
the stripe cache and you will still get zero reads. Unfortunately, a
small percentage of the time, a read/modify/write will get scheduled
in between two inbound write requests.
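A minimal sketch of reading that value from an application. The sysfs path is the one given above; the helper name and the sysfs_root parameter (added only so the function can be tried against a scratch directory) are assumptions.

```python
from pathlib import Path

def read_optimal_io_size(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return optimal_io_size in bytes for a block device such as 'md0'."""
    path = Path(sysfs_root) / device / "queue" / "optimal_io_size"
    return int(path.read_text().strip())

# Example call on a machine that has the array (not executed here):
#   stripe = read_optimal_io_size("md0")
```

For the 8+2 array with 128 KiB chunks described in this thread, the expected value would be 8 x 128 KiB = 1048576.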
To make this somewhat more complicated, there is also a limit to how
large a single request can be. This is limited globally to "#define
BIO_MAX_PAGES 256" or 1MB (as of 3.1.7). With raid 5/6 arrays, with
64KB chunks, this lets you have 16 active drives. At least so goes
the theory. I seem to remember some other limit at 1023 sectors,
which then limits you to 511KB or 7 active drives.
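The arithmetic behind those drive counts can be written out as follows. This is only a worked example; the 4 KiB page size and 512-byte sector size are assumptions (typical for x86_64), and the helper name is made up.

```python
PAGE_SIZE = 4096       # assumption: 4 KiB pages
SECTOR_SIZE = 512      # assumption: 512-byte sectors
BIO_MAX_PAGES = 256    # value quoted above

def max_data_drives(request_bytes: int, chunk_bytes: int) -> int:
    """Whole chunks (one per data drive) that fit in a single request."""
    return request_bytes // chunk_bytes

bio_limit = BIO_MAX_PAGES * PAGE_SIZE   # 1048576 bytes = 1 MiB
sector_limit = 1023 * SECTOR_SIZE       # 523776 bytes, about 511.5 KiB

chunk_64k = 64 * 1024
print(max_data_drives(bio_limit, chunk_64k))     # 16
print(max_data_drives(sector_limit, chunk_64k))  # 7

# With the 128 KiB chunks used in the original post, a 1 MiB request
# spans exactly the 8 data drives of the 8+2 array.
print(max_data_drives(bio_limit, 128 * 1024))    # 8
```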
If you need to drive this from an application, then the application
has to hit "optimal_io_size" exactly, both in terms of size and
alignment. You can test this with 'dd'. If you miss the alignment,
then you will get a small number of reads.
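One way to check a planned write against that rule is to count how many whole stripes it covers and how many bytes fall outside them. This is a sketch with a hypothetical helper; 'stripe' stands for the optimal_io_size value, and the leftover bytes are the part that may trigger reads.

```python
def classify_write(offset: int, length: int, stripe: int):
    """Return (full_stripes, partial_bytes) for a write of `length` at `offset`."""
    end = offset + length
    first_full = -(-offset // stripe) * stripe  # offset rounded up to a boundary
    last_full = (end // stripe) * stripe        # end rounded down to a boundary
    full_stripes = max(0, (last_full - first_full) // stripe)
    partial_bytes = length - full_stripes * stripe
    return full_stripes, partial_bytes

MIB = 1024 * 1024
print(classify_write(0, MIB, MIB))        # aligned, exactly one stripe
print(classify_write(4096, MIB, MIB))     # right size, wrong alignment
print(classify_write(0, 5 * MIB // 2, MIB))
```

Only a result with partial_bytes equal to zero matches the "exactly this size, aligned" case described above.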
If you want to drive this from user space, then O_DIRECT will work.
Ideally, you want multiple outstanding IOs so that the drives can
stream. This implies AIO (which sucked the last time I tried it), or
else you need to hack something inside of kernel space.
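As a user-space illustration of keeping several stripe-sized O_DIRECT writes in flight, here is a sketch that uses a thread pool with pwrite(). This is a substitute technique, not AIO and not something from this thread; the function name and parameters are hypothetical. Pointing it at a real md device overwrites that device's contents.

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

def write_stripes(path, stripe_bytes, count, workers=4, direct=True):
    """Write `count` stripe-sized, stripe-aligned blocks of zeros to `path`.

    Returns the total number of bytes written.
    """
    flags = os.O_WRONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    # An anonymous mmap is page-aligned and zero-filled, which satisfies
    # the buffer alignment that O_DIRECT requires.
    buf = mmap.mmap(-1, stripe_bytes)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(os.pwrite, fd, buf, i * stripe_bytes)
                for i in range(count)
            ]
            return sum(f.result() for f in futures)
    finally:
        buf.close()
        os.close(fd)

# Example call (destructive, not executed here):
#   write_stripes("/dev/md0", 1024 * 1024, count=64, workers=8)
```

Each worker blocks in its own pwrite(), so up to `workers` requests can be outstanding at once. Whether that is enough to keep the drives streaming would have to be measured.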
Now why raid 5/6 tends to miss and schedule read/modify/write at
inopportune times seems to just be a design trade-off inside of raid.
I stared at the code for a long time, and never did find any type of
specific timing for how long to wait before scheduling a RMW, so it
looks like you are just at the mercy of where clock ticks happen.
All in all, the raid 5/6 code is really elegant, but it would be nice
if the kernel in general allowed for longer atomic requests. 1MB (or
512KB, or 511KB, depending on where you look) is just too short for
some "high bandwidth" applications.
Doug Dumitru
EasyCo LLC
* Re: What are these reads in what should be simply a full-stripe write?
2012-02-28 23:14 ` linbloke
@ 2012-03-05 16:29 ` John Adams
0 siblings, 0 replies; 4+ messages in thread
From: John Adams @ 2012-03-05 16:29 UTC (permalink / raw)
To: linbloke; +Cc: linux-raid@vger.kernel.org
Thank you for commenting. Here is the info you requested.
mdadm - v3.2.2 - 17th June 2011
Linux jadams-sbb 2.6.32-131.12.1.hulk.avidnl.1.x86_64.debug #2 SMP Thu Feb 16 11:32:09 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
/dev/md0:
Version : 1.2
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Array Size : 838860800 (800.00 GiB 858.99 GB)
Used Dev Size : 104857600 (100.00 GiB 107.37 GB)
Raid Devices : 10
Total Devices : 10
Persistence : Superblock is persistent
Update Time : Mon Mar 5 12:35:11 2012
State : clean
Active Devices : 10
Working Devices : 10
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
Name : jadams-sbb:0 (local to host jadams-sbb)
UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Events : 62
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
2 8 48 2 active sync /dev/sdd
3 8 64 3 active sync /dev/sde
4 8 80 4 active sync /dev/sdf
5 8 96 5 active sync /dev/sdg
6 8 112 6 active sync /dev/sdh
7 8 128 7 active sync /dev/sdi
8 8 144 8 active sync /dev/sdj
9 8 160 9 active sync /dev/sdk
/dev/sdb:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 26854371:b12034ac:d4119d00:5d1b8f1a
Update Time : Mon Mar 5 12:35:42 2012
Checksum : c3dfa790 - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 0
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdc:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 04d4a4aa:dd376906:c3e479a0:cfbae1bb
Update Time : Mon Mar 5 12:35:42 2012
Checksum : 98a57ffd - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 1
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdd:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 67d2b514:3fd5f360:73222322:97888b7b
Update Time : Mon Mar 5 12:35:42 2012
Checksum : 9e94273a - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 2
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sde:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 81c96b5b:2a944dd0:fb069b36:b1fb4660
Update Time : Mon Mar 5 12:35:42 2012
Checksum : 4dd734e3 - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 3
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdf:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 4741b8b0:2e0e0c73:bc4ff323:5f653d70
Update Time : Mon Mar 5 12:35:42 2012
Checksum : 4330d91d - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 4
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdg:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 4ff9f259:2781af76:b427d4c4:af50651a
Update Time : Mon Mar 5 12:35:42 2012
Checksum : 3b17c767 - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 5
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdh:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 3c04b3f7:df628911:b47e2dec:a9eabecd
Update Time : Mon Mar 5 12:35:42 2012
Checksum : 4e64a508 - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 6
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdi:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 13c8ac51:5fd71796:7b385c45:8c31990e
Update Time : Mon Mar 5 12:35:42 2012
Checksum : c6f5de08 - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 7
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdj:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 9be300f7:97329a54:3c86ec36:9dedd667
Update Time : Mon Mar 5 12:35:42 2012
Checksum : 759a5e9c - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 8
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdk:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
Name : jadams-sbb:0 (local to host jadams-sbb)
Creation Time : Fri Feb 17 19:39:25 2012
Raid Level : raid6
Raid Devices : 10
Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
Array Size : 1677721600 (800.00 GiB 858.99 GB)
Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : ada0d14e:ad9c3d6b:4ba27221:fe6ea679
Update Time : Mon Mar 5 12:35:42 2012
Checksum : e0642334 - correct
Events : 62
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 9
Array State : AAAAAAAAAA ('A' == active, '.' == missing)
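The figures in this output can be cross-checked with a little arithmetic. This is a sketch; it assumes mdadm -D reports sizes in KiB and mdadm -E reports them in 512-byte sectors, which is consistent with the GiB values printed next to them. The helper name is made up.

```python
KIB = 1024
SECTOR = 512

def raid6_array_kib(raid_devices: int, used_dev_kib: int) -> int:
    """Usable RAID6 capacity: every device except the two parity devices."""
    return (raid_devices - 2) * used_dev_kib

# Values from the mdadm -D output above (assumed to be in KiB).
array_kib = raid6_array_kib(raid_devices=10, used_dev_kib=104857600)
print(array_kib == 838860800)

# Values from the mdadm -E output above (assumed to be in sectors).
print(1677721600 * SECTOR == 838860800 * KIB)

# Data Offset is 2048 sectors (1 MiB), a multiple of the 128 KiB chunk,
# so the data area on each member starts on a chunk boundary.
print((2048 * SECTOR) % (128 * KIB) == 0)

# Full-stripe size for this array: 8 data devices x 128 KiB = 1 MiB.
print((10 - 2) * 128 * KIB)
```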