From: Brett Russ <icycle+lkml@gmail.com>
To: linux-kernel@vger.kernel.org
Subject: Re: O_DIRECT reads appear to be cached on block device partition file?
Date: Fri, 17 Sep 2010 18:22:28 -0400
Message-ID: <i70pn6$otu$1@dough.gmane.org>
In-Reply-To: <20100914073643.GB15695@dastard>
Dave Chinner wrote:
> On Mon, Sep 13, 2010 at 11:49:32PM -0400, Brett Russ wrote:
>> If I run the above on the monitoring blade, then sync an update to
>> the sector in question from another blade, then re-run the above
>> code on the monitoring blade, believe it or not I appear to be
>> reading stale data. If I use dd with iflag=direct, reading the same
>> sector offset at the /dev/sdX3 partition file, I see the same stale
>> data as seen from the code above. If, however, I instead access
>> this sector offset from the /dev/sdX device file using the (offset
>> of partition 3 + offset of the sector) I see the intended data,
>> which makes me believe some caching occurred locally for /dev/sdX3.
>
> What does blktrace tell you?
Thanks Dave for the pointer to blktrace. I'd not used this before.
The short answer is that I now trust O_DIRECT. What sent me down this
path to begin with was a stale cache in our application.
The longer answer of how my dd double-check could have gone wrong follows:
I've discovered that the start-of-partition LBA does not *always* agree
between the kernel (reported by blktrace and sysfs) and utilities such
as {fdisk|sfdisk}. This means that my experiment of accessing the
sector within the partition via the parent device may have been invalid,
since I was trusting fdisk to determine the correct sector offset of the
partition.
> spu0103# fdisk -l -u /dev/sdbk
...
> Units = sectors of 1 * 512 = 512 bytes
>
> Device Boot Start End Blocks Id System
...
> /dev/sdbk3 1197742140 1944780704 373519282+ 83 Linux
> spu0103# sfdisk -uS -l /dev/sdbk
...
> Units = sectors of 512 bytes, counting from 0
>
> Device Boot Start End #sectors Id System
...
> /dev/sdbk3 1197742140 1944780704 747038565 83 Linux
> spu0103# cat /sys/block/sdbk/sdbk3/start
> 1197934920
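To put a number on the mismatch, here is a minimal sketch that uses only
the values quoted above and assumes the 512-byte sector size fdisk
reports. The helper name parent_sector is hypothetical, chosen for
illustration only.

```python
SECTOR_SIZE = 512  # bytes, per the "Units" line in the fdisk output above

FDISK_START = 1197742140  # start of sdbk3 according to fdisk and sfdisk
SYSFS_START = 1197934920  # start of sdbk3 according to /sys/block/sdbk/sdbk3/start


def parent_sector(partition_start, sector_in_partition):
    """Sector on the parent device for a sector offset within the partition."""
    return partition_start + sector_in_partition


delta_sectors = SYSFS_START - FDISK_START
delta_bytes = delta_sectors * SECTOR_SIZE

print(delta_sectors)  # 192780
print(delta_bytes)    # 98703360
```

So a read of the parent device at the fdisk-derived offset lands 192780
sectors (about 94 MiB) short of where the kernel places sector 0 of the
partition.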
The above discrepancy was also shown with blktrace:
> spu0103# blkparse -q 1
> Input file 1.blktrace.5 added
> Input file 1.blktrace.6 added
> Input file 1.blktrace.7 added
>
This command:
> spu0103# dd-7.1 if=/dev/sdbk3 bs=512 count=1 iflag=direct |hexdump -C
Produced this trace:
> 67,224 5 1 0.000000000 29726 A R 1197934920 + 1 <- (67,227) 0
Note that the kernel remapped the access to sdbk3 (offset 0) to sdbk
(offset 1197934920); see the major:minor numbers listed after the trace.
That is 192780 sectors past the partition start of 1197742140 shown by
fdisk.
> 67,224 5 2 0.000000564 29726 Q R 1197934920 + 1 [dd-7.1]
> 67,224 5 3 0.000004032 29726 G R 1197934920 + 1 [dd-7.1]
> 67,224 5 4 0.000006223 29726 P N [dd-7.1]
> 67,224 5 5 0.000008152 29726 I R 1197934920 + 1 [dd-7.1]
> 67,224 5 6 0.000009916 29726 U N [dd-7.1] 1
> 67,224 5 7 0.000012286 29726 D R 1197934920 + 1 [dd-7.1]
> 67,224 7 1 0.006802504 0 C R 1197934920 + 1 [0]
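The remap ("A") record at the top of this trace can be picked apart
mechanically. The sketch below is illustrative only: the field layout is
an assumption inferred from the trace lines in this message, not taken
from the blkparse documentation, and the function names are hypothetical.

```python
import re

# Layout assumed from the trace lines in this message:
#   maj,min cpu seq timestamp pid action rwbs sector + count <- (maj,min) sector
REMAP_RE = re.compile(
    r"^\s*(?P<maj>\d+),(?P<min>\d+)\s+\d+\s+\d+\s+[\d.]+\s+\d+\s+A\s+\S+\s+"
    r"(?P<sector>\d+)\s+\+\s+(?P<count>\d+)\s+<-\s+"
    r"\((?P<src_maj>\d+),(?P<src_min>\d+)\)\s+(?P<src_sector>\d+)\s*$"
)


def parse_remap(line):
    """Return remap details as a dict, or None if the line is not an 'A' record."""
    m = REMAP_RE.match(line)
    if m is None:
        return None
    return {
        "dev": (int(m["maj"]), int(m["min"])),
        "sector": int(m["sector"]),
        "src_dev": (int(m["src_maj"]), int(m["src_min"])),
        "src_sector": int(m["src_sector"]),
    }


def implied_partition_start(remap):
    """Partition start the kernel used: target sector minus source sector."""
    return remap["sector"] - remap["src_sector"]


line = "67,224 5 1 0.000000000 29726 A R 1197934920 + 1 <- (67,227) 0"
remap = parse_remap(line)
print(remap["src_dev"], "->", remap["dev"], implied_partition_start(remap))
```

For the record above, the implied start is 1197934920, which agrees with
sysfs rather than with fdisk.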
And this command (accessing the start of the partition using the fdisk
sector offset):
> spu0103# dd-7.1 if=/dev/sdbk skip=1197742140 bs=512 count=1 iflag=direct |hexdump -C
Produced this trace (as expected):
> 67,224 7 2 75.330506824 29924 Q R 1197742140 + 1 [dd-7.1]
> 67,224 7 3 75.330509804 29924 G R 1197742140 + 1 [dd-7.1]
> 67,224 7 4 75.330511985 29924 P N [dd-7.1]
> 67,224 7 5 75.330513836 29924 I R 1197742140 + 1 [dd-7.1]
> 67,224 7 6 75.330515495 29924 U N [dd-7.1] 1
> 67,224 7 7 75.330517901 29924 D R 1197742140 + 1 [dd-7.1]
> 67,224 6 1 75.340722638 0 C R 1197742140 + 1 [0]
The aforementioned major/minor numbers:
> spu0103# ls -l /dev/|grep 67|grep '22[47]'
> brw-rw-rw- 1 root root 67, 224 Sep 15 11:59 sdbk
> brw-rw-rw- 1 root root 67, 227 Sep 15 11:59 sdbk3
*All* other drives in my system that I tested do show a match between
the 3 methods above (fdisk, sfdisk, sysfs).
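A check like this could be scripted along the following lines. This is a
rough sketch, not what was actually run: the row layout is assumed from
the fdisk/sfdisk rows quoted earlier, the sysfs path is the one shown
above, and all function names are hypothetical.

```python
import os


def table_start(row):
    """Start sector from a partition row like the fdisk/sfdisk rows quoted above.

    Assumes the layout 'device [*] start end ...', where '*' is the
    optional boot flag.
    """
    fields = row.split()
    if fields[1] == "*":
        return int(fields[2])
    return int(fields[1])


def sysfs_start(disk, part, sysfs_root="/sys/block"):
    """Start sector the kernel reports for a partition, read from sysfs."""
    path = os.path.join(sysfs_root, disk, part, "start")
    with open(path) as f:
        return int(f.read().strip())


def starts_agree(row, disk, part, sysfs_root="/sys/block"):
    """True if the partition-table start matches the kernel's start."""
    return table_start(row) == sysfs_start(disk, part, sysfs_root)
```

Called as starts_agree(row, "sdbk", "sdbk3") with the sfdisk row for
sdbk3, this would return False on the system described here, given the
values shown above.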
I don't know how this discrepancy with the partition start could have
been introduced, but it is most likely a byproduct of my testing.
Thanks,
Brett