What are these reads in what should be simply a full-stripe write?

* What are these reads in what should be simply a full-stripe write?
@ 2012-02-28 18:47 John Adams
  2012-02-28 23:14 ` linbloke
  2012-02-29  5:52 ` Doug Dumitru
  0 siblings, 2 replies; 4+ messages in thread
From: John Adams @ 2012-02-28 18:47 UTC
  To: linux-raid@vger.kernel.org

For some years I've been working on some niche filesystems which serve
workflows involving lots of video.  Lately, I have had occasion to
investigate the behavior of md as a possible raid solution (2.6.32
kernel).

As part of that, we looked at some fio based loads in the buffered and
O_DIRECT cases and noticed some reading that we didn't understand when
using O_DIRECT.  We were led to this comparison by incorrect
information from a vendor. (We were trying to repro some reported
performance and were initially told that O_DIRECT had been used).

We are aware of the problems discussed concerning O_DIRECT.  As fs
guys, we're accustomed to worrying about copies and such, so it wasn't
immediately obvious to us that O_DIRECT would be a mistake in our
case.  This is essentially an embedded system with a single process
owning a group of disks with no filesystem.  There is no possibility
of a race with another process.

Anyway, I am curious about this reading behavior and I would be grateful for any
comments.

I tried writing single stripes under both scenarios.  To give the
barest possible summary, I used a dd command like this, with
oflag=direct either present or omitted.  This was driven from a script that
sets up some blktrace and ftrace things, waits an appropriate time in
the buffered case etc.

dd oflag=direct if=/dev/zero of=/dev/md0 seek=0 bs=1M count=1

8+2 128k strip

[physical disk completions via blkparse]

Buffered:

 Reads Completed:        2,        5KiB  Writes Completed:        4,      258KiB
 Reads Completed:        0,        0KiB  Writes Completed:        3,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        3,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        3,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        3,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        3,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        3,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        3,      130KiB
 Reads Completed:        2,        8KiB  Writes Completed:        3,      130KiB
 Reads Completed:        6,       24KiB  Writes Completed:        4,      258KiB

Direct Example 1:

 Reads Completed:        2,        5KiB  Writes Completed:       20,      258KiB
 Reads Completed:        9,       36KiB  Writes Completed:       14,      130KiB
 Reads Completed:       32,      128KiB  Writes Completed:       14,      130KiB
 Reads Completed:        1,        4KiB  Writes Completed:       16,      130KiB
 Reads Completed:       32,      128KiB  Writes Completed:       12,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        8,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        8,      130KiB
 Reads Completed:        0,        0KiB  Writes Completed:        8,      130KiB
 Reads Completed:        2,        8KiB  Writes Completed:        8,      130KiB
 Reads Completed:        6,       24KiB  Writes Completed:       19,      258KiB

Direct Example 2:

 Reads Completed:        4,      133KiB  Writes Completed:        3,      130KiB
 Reads Completed:       11,      164KiB  Writes Completed:        3,      130KiB
 Reads Completed:       34,      256KiB  Writes Completed:        3,      130KiB
 Reads Completed:        2,      132KiB  Writes Completed:        3,      130KiB
 Reads Completed:       33,      256KiB  Writes Completed:        3,      130KiB
 Reads Completed:        3,      136KiB  Writes Completed:        3,      130KiB
 Reads Completed:        7,      152KiB  Writes Completed:        3,      130KiB
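
One way to compare the runs is to total the per-device summary lines. The following is only a sketch: the regular expression assumes exactly the "Reads Completed / Writes Completed" layout pasted above, and the function name is made up for illustration.

```python
import re

# Matches one per-device summary line in the layout shown above, e.g.
# " Reads Completed:        2,        5KiB  Writes Completed:        4,      258KiB"
SUMMARY_RE = re.compile(
    r"Reads Completed:\s*(\d+),\s*(\d+)KiB\s+"
    r"Writes Completed:\s*(\d+),\s*(\d+)KiB"
)


def total_completions(text):
    """Sum read/write request counts and KiB over every summary line in text."""
    totals = {"reads": 0, "read_kib": 0, "writes": 0, "write_kib": 0}
    for match in SUMMARY_RE.finditer(text):
        reads, read_kib, writes, write_kib = map(int, match.groups())
        totals["reads"] += reads
        totals["read_kib"] += read_kib
        totals["writes"] += writes
        totals["write_kib"] += write_kib
    return totals
```

Fed the tables above, this gives 37KiB read against 1556KiB written for the buffered run, and 333KiB read against the same 1556KiB written for Direct Example 1.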

I was able to gain a little bit of insight through blktrace and
ftrace.  Our initial assumption was that maybe things were being
broken up differently such that md thought it needed to do an rmw
(read-modify-write).

But as I dug into the blktrace output, that did not seem to be the
case (reads are coming after what is obviously the strip write).  I
used ftrace to show me the path down to md_make_request in the
O_DIRECT and buffered cases.  This showed me some calls referring to
read_ahead in the direct case.

           <...>-14859 [001] 510340.525310: md_make_request
           <...>-14859 [001] 510340.525311: <stack trace>
 => generic_make_request
 => submit_bio
 => submit_bh
 => block_read_full_page
 => blkdev_readpage
 => __do_page_cache_readahead
 => force_page_cache_readahead
 => page_cache_sync_readahead

So is this read ahead I'm observing?  Why does it occur only in the
direct case?

I noticed that blktrace sometimes identifies what I assume to be the
instigator of the io.  So I can sometimes see dd or md_raid6 there.
As in [dd] or [md0_raid6]:

 8,16   1      115     0.042000000  2910  D   W 2256 + 48 [md0_raid6]

These unexplained reads either mention blkid or [0] or [(null)].
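
To pick such reads out of a long trace, the per-event lines can be split into fields and filtered by process name. This is a rough sketch that assumes the default blkparse event layout seen in the sample line above (device, cpu, sequence, time, pid, action, RWBS, sector + blocks, [process]); the function names and the list of "expected" process names are illustrative only.

```python
def parse_trace_line(line):
    """Parse one blkparse event line; return None if it does not fit the layout."""
    parts = line.split()
    # Expect: dev cpu seq time pid action rwbs sector + blocks [process]
    if len(parts) < 11 or parts[8] != "+":
        return None
    try:
        return {
            "device": parts[0],
            "cpu": int(parts[1]),
            "seq": int(parts[2]),
            "time": float(parts[3]),
            "pid": int(parts[4]),
            "action": parts[5],
            "rwbs": parts[6],
            "sector": int(parts[7]),
            "blocks": int(parts[9]),
            "process": parts[10].strip("[]"),
        }
    except ValueError:
        return None


def unexplained_reads(lines, expected=("dd", "md0_raid6")):
    """Yield read events whose process name is not one of the expected ones."""
    for line in lines:
        event = parse_trace_line(line)
        if event and "R" in event["rwbs"] and event["process"] not in expected:
            yield event
```

Lines that are not events (headers, summaries) simply parse to None and are skipped.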

It isn't clear to me whether the unexpected read behavior is due to a
tuning problem in the O_DIRECT case or simply the way things work.

Thank you for any comments.

* Re: What are these reads in what should be simply a full-stripe write?
  2012-02-28 18:47 What are these reads in what should be simply a full-stripe write? John Adams
@ 2012-02-28 23:14 ` linbloke
  2012-03-05 16:29   ` John Adams
  2012-02-29  5:52 ` Doug Dumitru
  1 sibling, 1 reply; 4+ messages in thread
From: linbloke @ 2012-02-28 23:14 UTC
  To: John Adams; +Cc: linux-raid@vger.kernel.org


G'day John,

You need to give us more detail about your md raid setup. Besides a
reference to md_raid6, there are no other details about your array.
How about sending:

mdadm -V
uname -a
mdadm -Dvv /dev/mdarray
mdadm -Evv /dev/arraycomponentdevices - for all of them


Good luck in the hunt,
J





* Re: What are these reads in what should be simply a full-stripe write?
  2012-02-28 23:14 ` linbloke
@ 2012-03-05 16:29   ` John Adams
  0 siblings, 0 replies; 4+ messages in thread
From: John Adams @ 2012-03-05 16:29 UTC
  To: linbloke; +Cc: linux-raid@vger.kernel.org

Thank you for commenting.  Here is the info you requested.

mdadm - v3.2.2 - 17th June 2011
Linux jadams-sbb 2.6.32-131.12.1.hulk.avidnl.1.x86_64.debug #2 SMP Thu Feb 16 11:32:09 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
/dev/md0:
        Version : 1.2
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
     Array Size : 838860800 (800.00 GiB 858.99 GB)
  Used Dev Size : 104857600 (100.00 GiB 107.37 GB)
   Raid Devices : 10
  Total Devices : 10
    Persistence : Superblock is persistent

    Update Time : Mon Mar  5 12:35:11 2012
          State : clean 
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           Name : jadams-sbb:0  (local to host jadams-sbb)
           UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
         Events : 62

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
       3       8       64        3      active sync   /dev/sde
       4       8       80        4      active sync   /dev/sdf
       5       8       96        5      active sync   /dev/sdg
       6       8      112        6      active sync   /dev/sdh
       7       8      128        7      active sync   /dev/sdi
       8       8      144        8      active sync   /dev/sdj
       9       8      160        9      active sync   /dev/sdk
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 26854371:b12034ac:d4119d00:5d1b8f1a

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : c3dfa790 - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 0
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 04d4a4aa:dd376906:c3e479a0:cfbae1bb

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : 98a57ffd - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 1
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 67d2b514:3fd5f360:73222322:97888b7b

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : 9e94273a - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 2
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 81c96b5b:2a944dd0:fb069b36:b1fb4660

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : 4dd734e3 - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 3
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 4741b8b0:2e0e0c73:bc4ff323:5f653d70

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : 4330d91d - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 4
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdg:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 4ff9f259:2781af76:b427d4c4:af50651a

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : 3b17c767 - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 5
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdh:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 3c04b3f7:df628911:b47e2dec:a9eabecd

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : 4e64a508 - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 6
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdi:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 13c8ac51:5fd71796:7b385c45:8c31990e

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : c6f5de08 - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 7
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdj:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 9be300f7:97329a54:3c86ec36:9dedd667

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : 759a5e9c - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 8
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)
/dev/sdk:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3ac47570:b9b222e0:8e09eb62:4071de0a
           Name : jadams-sbb:0  (local to host jadams-sbb)
  Creation Time : Fri Feb 17 19:39:25 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953523120 (931.51 GiB 1000.20 GB)
     Array Size : 1677721600 (800.00 GiB 858.99 GB)
  Used Dev Size : 209715200 (100.00 GiB 107.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : ada0d14e:ad9c3d6b:4ba27221:fe6ea679

    Update Time : Mon Mar  5 12:35:42 2012
       Checksum : e0642334 - correct
         Events : 62

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 9
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)




* Re: What are these reads in what should be simply a full-stripe write?
  2012-02-28 18:47 What are these reads in what should be simply a full-stripe write? John Adams
  2012-02-28 23:14 ` linbloke
@ 2012-02-29  5:52 ` Doug Dumitru
  1 sibling, 0 replies; 4+ messages in thread
From: Doug Dumitru @ 2012-02-29  5:52 UTC
  To: John Adams; +Cc: linux-raid@vger.kernel.org

Mr. Adams,

Raid 5/6 exports a parameter called "optimal_io_size".  You should
find this in /sys/block/mdx/queue/optimal_io_size.
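
For reference, a small sketch of reading that value from a script. The sysfs_root parameter is an assumption added so the function can be pointed at a test directory; the function name is made up for illustration.

```python
from pathlib import Path


def optimal_io_size(md_name, sysfs_root="/sys/block"):
    """Return optimal_io_size in bytes for a block device such as 'md0'."""
    path = Path(sysfs_root) / md_name / "queue" / "optimal_io_size"
    return int(path.read_text().strip())
```

For an 8+2 array with 128K chunks, one would expect this to report 1048576 (8 x 128K), though nobody in this thread posted the actual value.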

This is the size of a single stripe.  In theory, if you write aligned
blocks of exactly this size to raid 5/6, then the stripe cache should
handle the IO perfectly and you should see zero reads.  If you miss
the boundaries, most of the time, raid 5/6 will cache the writes in
the stripe cache and you will still get zero reads.  Unfortunately, a
small percentage of the time, a read/modify/write will get scheduled
in between two inbound write requests.
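
The "exactly this size, aligned" condition can be written down as a small check. This is a sketch only: it takes the stripe size as data drives times chunk size, and the function names are illustrative.

```python
def stripe_size(data_disks, chunk_bytes):
    """Bytes of user data in one full stripe (parity chunks not counted)."""
    return data_disks * chunk_bytes


def is_full_stripe_write(offset, length, data_disks, chunk_bytes):
    """True if the write starts on a stripe boundary and covers whole stripes."""
    stripe = stripe_size(data_disks, chunk_bytes)
    return length > 0 and offset % stripe == 0 and length % stripe == 0
```

For the 8+2, 128K array in the original post, stripe_size(8, 128 * 1024) is 1048576, so a 1M write at offset 0 passes the check, while the same write shifted by 4K does not.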

To make this somewhat more complicated, there is also a limit to how
large a single request can be.  This is limited globally to "#define
BIO_MAX_PAGES 256" or 1MB (as of 3.1.7).  With raid 5/6 arrays, with
64KB chunks, this lets you have 16 active drives.  At least so goes
the theory.  I seem to remember some other limit at 1023 sectors,
which then limits you to 511KB or 7 active drives.
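
The arithmetic behind those drive counts, as a sketch. It assumes 4K pages and 512-byte sectors, and it counts only whole chunks that fit in a single request; the names are illustrative.

```python
PAGE_SIZE = 4096     # assumed 4 KiB pages
SECTOR_SIZE = 512    # assumed 512-byte sectors


def max_data_drives(max_request_bytes, chunk_bytes):
    """Whole chunks (one per data drive) that fit in a single request."""
    return max_request_bytes // chunk_bytes


bio_limit = 256 * PAGE_SIZE         # BIO_MAX_PAGES pages: 1048576 bytes (1MB)
sector_limit = 1023 * SECTOR_SIZE   # 523776 bytes, just under 512KB

drives_at_bio_limit = max_data_drives(bio_limit, 64 * 1024)        # 16
drives_at_sector_limit = max_data_drives(sector_limit, 64 * 1024)  # 7
```

With the 128K chunks from the original post, the same function gives 8 data drives for the 1MB limit and 3 for the 1023-sector limit.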

If you need to drive this from an application, then the application
has to hit "optimal_io_size" exactly, both in terms of size and
alignment.  You can test this with 'dd'.  If you miss the alignment,
then you will get a small number of reads.
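
A sketch of building such a dd test from a script. It only assembles the argument list; the function name and parameters are made up for illustration. Since dd's seek= counts in units of bs, setting bs to the stripe size keeps every write stripe-aligned.

```python
def full_stripe_dd_args(device, stripe_bytes, stripe_index=0, count=1, direct=True):
    """Build a dd command that writes `count` whole stripes of zeros.

    seek= is expressed in bs-sized blocks, so the write begins at
    stripe_index * stripe_bytes, i.e. on a stripe boundary.
    """
    args = [
        "dd",
        "if=/dev/zero",
        f"of={device}",
        f"bs={stripe_bytes}",
        f"seek={stripe_index}",
        f"count={count}",
    ]
    if direct:
        args.append("oflag=direct")
    return args
```

The list can be handed to subprocess.run(). Running it overwrites whatever is on the target device, so it belongs on a scratch array only.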

If you want to drive this from user space, then O_DIRECT will work.
Ideally, you want multiple outstanding IOs so that the drives can
stream.  This implies AIO (which sucked the last time I tried it), or
else you need to hack something inside of kernel space.

Now why raid 5/6 tends to miss and schedule read/modify/write at
inopportune times seems to just be a design trade-off inside of raid.
I stared at the code for a long time, and never did find any type of
specific timing for how long to wait before scheduling a RMW, so it
looks like you are just at the mercy of where clock ticks happen.

All in all, the raid 5/6 code is really elegant, but it would be nice
if the kernel in general allowed for longer atomic requests.  1MB (or
512KB, or 511KB depending on where you look), is just too short for
some "high bandwidth" applications.

Doug Dumitru
EasyCo LLC

