From: Brett Russ <icycle+lkml@gmail.com>
To: linux-kernel@vger.kernel.org
Subject: Re: O_DIRECT reads appear to be cached on block device partition file?
Date: Fri, 17 Sep 2010 18:22:28 -0400
Message-ID: <i70pn6$otu$1@dough.gmane.org>
In-Reply-To: <20100914073643.GB15695@dastard>

Dave Chinner wrote:
> On Mon, Sep 13, 2010 at 11:49:32PM -0400, Brett Russ wrote:
>> If I run the above on the monitoring blade, then sync an update to
>> the sector in question from another blade, then re-run the above
>> code on the monitoring blade, believe it or not I appear to be
>> reading stale data.  If I use dd with iflag=direct, reading the same
>> sector offset at the /dev/sdX3 partition file, I see the same stale
>> data as seen from the code above.  If, however, I instead access
>> this sector offset from the /dev/sdX device file using the (offset
>> of partition 3 + offset of the sector) I see the intended data,
>> which makes me believe some caching occurred locally for /dev/sdX3.
>
> What does blktrace tell you?

Thanks Dave for the pointer to blktrace.  I'd not used this before.

The short answer is that I now trust O_DIRECT.  What sent me down this
path to begin with was a stale cache in our application.

The longer answer of how my dd double-check could have gone wrong follows:

I've discovered that the start-of-partition LBA does not *always* agree 
between the kernel (reported by blktrace and sysfs) and utilities such 
as {fdisk|sfdisk}.  This means that my experiment of accessing the 
sector within the partition via the parent device may have been invalid, 
since I was trusting fdisk to determine the correct sector offset of the 
partition.
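
For anyone who wants to check this without trusting fdisk, here is a
rough Python sketch that reads the kernel's view of the partition start
from sysfs and compares it with a value taken from the partition table.
The function names and the sysfs_root parameter are mine, purely for
illustration; the only thing assumed about the system is the
/sys/block/<disk>/<partition>/start file shown further down.

```python
from pathlib import Path


def kernel_partition_start(disk, partition, sysfs_root="/sys/block"):
    """Return the partition start sector as the kernel reports it.

    Reads <sysfs_root>/<disk>/<partition>/start, which holds the start
    offset in 512-byte sectors.
    """
    start_file = Path(sysfs_root) / disk / partition / "start"
    return int(start_file.read_text().strip())


def start_matches(disk, partition, table_start, sysfs_root="/sys/block"):
    """True if the kernel's start sector equals the partition-table value."""
    return kernel_partition_start(disk, partition, sysfs_root) == table_start
```

On the drive discussed here this would be called as
start_matches("sdbk", "sdbk3", 1197742140), with the last argument being
the start sector copied by hand from the fdisk output.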

> spu0103# fdisk -l -u /dev/sdbk
...
> Units = sectors of 1 * 512 = 512 bytes
>
>     Device Boot      Start         End      Blocks  Id System
...
> /dev/sdbk3      1197742140  1944780704   373519282+ 83 Linux

> spu0103# sfdisk -uS -l /dev/sdbk
...
> Units = sectors of 512 bytes, counting from 0
>
>    Device Boot    Start       End   #sectors  Id  System
...
> /dev/sdbk3     1197742140 1944780704  747038565  83  Linux

> spu0103# cat /sys/block/sdbk/sdbk3/start
> 1197934920
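
To put a number on the mismatch, a quick sketch using only the two start
values shown above (the variable names are just for illustration):

```python
SECTOR_BYTES = 512  # fdisk reports "Units = sectors of 1 * 512 = 512 bytes"

fdisk_start = 1197742140   # start of sdbk3 according to fdisk/sfdisk
sysfs_start = 1197934920   # start of sdbk3 according to sysfs

# How far apart the two views of the partition start are.
delta_sectors = sysfs_start - fdisk_start
delta_bytes = delta_sectors * SECTOR_BYTES

print(delta_sectors, "sectors,", delta_bytes, "bytes")
```

That works out to 192780 sectors, or 98703360 bytes (roughly 94 MiB),
so a read at the fdisk offset lands well short of where the kernel
thinks the partition begins.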

The above discrepancy was also shown with blktrace:

> spu0103# blkparse -q 1
> Input file 1.blktrace.5 added
> Input file 1.blktrace.6 added
> Input file 1.blktrace.7 added
>

This command:

> spu0103# dd-7.1  if=/dev/sdbk3 bs=512 count=1 iflag=direct |hexdump -C

Produced this trace:

>  67,224  5        1     0.000000000 29726  A   R 1197934920 + 1 <- (67,227) 0

Note the kernel remapped the access to sdbk3 (offset 0) to sdbk (offset
1197934920) (see the major:minor numbers listed after the trace), which
is quite different from the partition start shown in fdisk of 1197742140.

>  67,224  5        2     0.000000564 29726  Q   R 1197934920 + 1 [dd-7.1]
>  67,224  5        3     0.000004032 29726  G   R 1197934920 + 1 [dd-7.1]
>  67,224  5        4     0.000006223 29726  P   N [dd-7.1]
>  67,224  5        5     0.000008152 29726  I   R 1197934920 + 1 [dd-7.1]
>  67,224  5        6     0.000009916 29726  U   N [dd-7.1] 1
>  67,224  5        7     0.000012286 29726  D   R 1197934920 + 1 [dd-7.1]
>  67,224  7        1     0.006802504     0  C   R 1197934920 + 1 [0]
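
If you want to pull the remap target out of blkparse output rather than
eyeball it, something along these lines should do.  This is only a
sketch: the regex is written against the single remap (A) line in the
trace above, so the field layout is an assumption based on that one
sample, and parse_remap is a made-up name.

```python
import re

# Matches a blkparse remap ("A") line of the form seen in the trace:
#   maj,min cpu seq timestamp pid A rwbs sector + count <- (maj,min) src_sector
# An optional leading "> " quote marker is tolerated.
REMAP_RE = re.compile(
    r"^\s*(?:>\s*)?(?P<maj>\d+),(?P<min>\d+)\s+"
    r"(?P<cpu>\d+)\s+(?P<seq>\d+)\s+(?P<ts>\d+\.\d+)\s+(?P<pid>\d+)\s+"
    r"A\s+(?P<rwbs>\S+)\s+"
    r"(?P<sector>\d+)\s+\+\s+(?P<count>\d+)\s+<-\s+"
    r"\((?P<src_maj>\d+),(?P<src_min>\d+)\)\s+(?P<src_sector>\d+)\s*$"
)


def parse_remap(line):
    """Parse one remap line; return None for any other kind of line."""
    m = REMAP_RE.match(line)
    if m is None:
        return None
    sector = int(m["sector"])
    src_sector = int(m["src_sector"])
    return {
        "device": (int(m["maj"]), int(m["min"])),
        "sector": sector,
        "source_device": (int(m["src_maj"]), int(m["src_min"])),
        "source_sector": src_sector,
        # Assumes src_sector is the offset within the partition, so the
        # difference is where the kernel believes the partition starts.
        "implied_start": sector - src_sector,
    }
```

Fed the A line from the trace, implied_start is the same 1197934920
that sysfs reports.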

And this command (accessing the start of partition using fdisk sector 
offset):

> spu0103# dd-7.1  if=/dev/sdbk skip=1197742140 bs=512 count=1 iflag=direct |hexdump -C

Produced this trace (as expected):

>  67,224  7        2    75.330506824 29924  Q   R 1197742140 + 1 [dd-7.1]
>  67,224  7        3    75.330509804 29924  G   R 1197742140 + 1 [dd-7.1]
>  67,224  7        4    75.330511985 29924  P   N [dd-7.1]
>  67,224  7        5    75.330513836 29924  I   R 1197742140 + 1 [dd-7.1]
>  67,224  7        6    75.330515495 29924  U   N [dd-7.1] 1
>  67,224  7        7    75.330517901 29924  D   R 1197742140 + 1 [dd-7.1]
>  67,224  6        1    75.340722638     0  C   R 1197742140 + 1 [0]

The aforementioned major/minor numbers:

> spu0103# ls -l /dev/|grep 67|grep '22[47]'
> brw-rw-rw-    1 root     root      67, 224 Sep 15 11:59 sdbk
> brw-rw-rw-    1 root     root      67, 227 Sep 15 11:59 sdbk3

*All* other drives in my system that I tested do show a match between 
the 3 methods above (fdisk, sfdisk, sysfs).

I don't know how this discrepancy with the partition start could have 
been introduced, but it is most likely a byproduct of my testing.

Thanks,
Brett

