From: Brett Russ <icycle+lkml@gmail.com>
To: linux-kernel@vger.kernel.org
Subject: Re: O_DIRECT reads appear to be cached on block device partition file?
Date: Fri, 17 Sep 2010 18:22:28 -0400
Message-ID: <i70pn6$otu$1@dough.gmane.org>
In-Reply-To: <20100914073643.GB15695@dastard>

Dave Chinner wrote:
> On Mon, Sep 13, 2010 at 11:49:32PM -0400, Brett Russ wrote:
>> If I run the above on the monitoring blade, then sync an update to
>> the sector in question from another blade, then re-run the above
>> code on the monitoring blade, believe it or not I appear to be
>> reading stale data.  If I use dd with iflag=direct, reading the same
>> sector offset at the /dev/sdX3 partition file, I see the same stale
>> data as seen from the code above.  If, however, I instead access
>> this sector offset from the /dev/sdX device file using the (offset
>> of partition 3 + offset of the sector) I see the intended data,
>> which makes me believe some caching occurred locally for /dev/sdX3.
>
> What does blktrace tell you?

Thanks Dave for the pointer to blktrace.  I'd not used this before.

The short answer is that I now trust O_DIRECT.  What sent me down this
path to begin with was a stale cache in our application.

The longer answer, explaining how my dd double-check could have gone
wrong, follows:

I've discovered that the start-of-partition LBA does not *always* agree 
between the kernel (reported by blktrace and sysfs) and utilities such 
as {fdisk|sfdisk}.  This means that my experiment of accessing the 
sector within the partition via the parent device may have been invalid, 
since I was trusting fdisk to determine the correct sector offset of the 
partition.

> spu0103# fdisk -l -u /dev/sdbk
...
> Units = sectors of 1 * 512 = 512 bytes
>
>     Device Boot      Start         End      Blocks  Id System
...
> /dev/sdbk3      1197742140  1944780704   373519282+ 83 Linux

> spu0103# sfdisk -uS -l /dev/sdbk
...
> Units = sectors of 512 bytes, counting from 0
>
>    Device Boot    Start       End   #sectors  Id  System
...
> /dev/sdbk3     1197742140 1944780704  747038565  83  Linux

> spu0103# cat /sys/block/sdbk/sdbk3/start
> 1197934920
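
The comparison above can also be scripted. This is only a minimal
sketch: the function names are made up for illustration, and it assumes
the sysfs "start" attribute is expressed in 512-byte sectors, the same
unit fdisk and sfdisk were asked to print.

```python
from pathlib import Path


def sysfs_partition_start(disk: str, part: str,
                          sysfs_root: str = "/sys/block") -> int:
    """Return the partition start, in 512-byte sectors, as the kernel sees it.

    Reads <sysfs_root>/<disk>/<part>/start, e.g. /sys/block/sdbk/sdbk3/start.
    """
    return int(Path(sysfs_root, disk, part, "start").read_text().strip())


def start_discrepancy(kernel_start: int, table_start: int) -> int:
    """Sectors by which the kernel's start differs from the table's start."""
    return kernel_start - table_start


if __name__ == "__main__":
    # Values copied from the sysfs and fdisk output quoted above.
    print(start_discrepancy(1197934920, 1197742140))  # prints 192780
```

A non-zero result means an offset computed from fdisk output will not
land on the sector the kernel uses as the start of the partition.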

The above discrepancy was also shown with blktrace:

> spu0103# blkparse -q 1
> Input file 1.blktrace.5 added
> Input file 1.blktrace.6 added
> Input file 1.blktrace.7 added
>

This command:

> spu0103# dd-7.1  if=/dev/sdbk3 bs=512 count=1 iflag=direct |hexdump -C

Produced this trace:

>  67,224  5        1     0.000000000 29726  A   R 1197934920 + 1 <- (67,227) 0

Note that the kernel remapped the access to sdbk3 (offset 0) to sdbk
(offset 1197934920); see the major:minor numbers listed after the
traces.  That is 192780 sectors (about 94 MiB) past the partition start
of 1197742140 shown by fdisk.

>  67,224  5        2     0.000000564 29726  Q   R 1197934920 + 1 [dd-7.1]
>  67,224  5        3     0.000004032 29726  G   R 1197934920 + 1 [dd-7.1]
>  67,224  5        4     0.000006223 29726  P   N [dd-7.1]
>  67,224  5        5     0.000008152 29726  I   R 1197934920 + 1 [dd-7.1]
>  67,224  5        6     0.000009916 29726  U   N [dd-7.1] 1
>  67,224  5        7     0.000012286 29726  D   R 1197934920 + 1 [dd-7.1]
>  67,224  7        1     0.006802504     0  C   R 1197934920 + 1 [0]
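
For anyone wanting to pull the kernel's idea of the partition start out
of a trace automatically, the remap (A) line can be parsed along these
lines. This is a rough sketch: the regular expression is written against
the default blkparse line layout shown in this mail only, and the Remap
type and parse_remap name are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Remap(NamedTuple):
    dev: Tuple[int, int]      # (major, minor) the I/O was remapped to
    sector: int               # sector on that device
    nsectors: int             # length of the request in sectors
    src_dev: Tuple[int, int]  # (major, minor) the I/O was issued against
    src_sector: int           # sector relative to the source device

    @property
    def implied_partition_start(self) -> int:
        """Start of the source partition on the target device, in sectors."""
        return self.sector - self.src_sector


# Fields: dev cpu seq timestamp pid action rwbs sector + n <- (maj,min) sector
_REMAP_RE = re.compile(
    r"^\s*(\d+),(\d+)\s+\d+\s+\d+\s+[\d.]+\s+\d+\s+A\s+\S+\s+"
    r"(\d+)\s+\+\s+(\d+)\s+<-\s+\((\d+),(\d+)\)\s+(\d+)\s*$"
)


def parse_remap(line: str) -> Optional[Remap]:
    """Parse one blkparse remap (A) line; return None for any other line."""
    m = _REMAP_RE.match(line)
    if m is None:
        return None
    maj, mnr, sector, n, smaj, smin, ssector = map(int, m.groups())
    return Remap((maj, mnr), sector, n, (smaj, smin), ssector)
```

Fed the A line above, this yields an implied partition start of
1197934920 for device (67,227), which matches sysfs rather than fdisk.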

And this command (accessing the start of the partition using the fdisk
sector offset):

> spu0103# dd-7.1  if=/dev/sdbk skip=1197742140 bs=512 count=1 iflag=direct |hexdump -C

Produced this trace (as expected):

>  67,224  7        2    75.330506824 29924  Q   R 1197742140 + 1 [dd-7.1]
>  67,224  7        3    75.330509804 29924  G   R 1197742140 + 1 [dd-7.1]
>  67,224  7        4    75.330511985 29924  P   N [dd-7.1]
>  67,224  7        5    75.330513836 29924  I   R 1197742140 + 1 [dd-7.1]
>  67,224  7        6    75.330515495 29924  U   N [dd-7.1] 1
>  67,224  7        7    75.330517901 29924  D   R 1197742140 + 1 [dd-7.1]
>  67,224  6        1    75.340722638     0  C   R 1197742140 + 1 [0]
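
The same double-check can be done without dd. Below is a minimal sketch
of an O_DIRECT single-sector read, assuming Linux, Python 3.7 or later,
and a 512-byte logical sector size; the helper names are invented for
illustration. The anonymous mmap is used only because it gives a
page-aligned buffer, which O_DIRECT requires.

```python
import mmap
import os

SECTOR_SIZE = 512  # assumed logical sector size, as in the dd commands above


def byte_offset_on_parent(part_start: int, sector_in_part: int,
                          sector_size: int = SECTOR_SIZE) -> int:
    """Byte offset on the whole-disk device of a sector inside a partition."""
    return (part_start + sector_in_part) * sector_size


def read_sector_direct(path: str, sector: int,
                       sector_size: int = SECTOR_SIZE) -> bytes:
    """Read one sector from a block device, bypassing the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, sector_size)  # anonymous mapping, page-aligned
        try:
            n = os.preadv(fd, [buf], sector * sector_size)
            return bytes(buf[:n])
        finally:
            buf.close()
    finally:
        os.close(fd)
```

With the partition start taken from sysfs instead of fdisk,
read_sector_direct("/dev/sdbk", 1197934920) and
read_sector_direct("/dev/sdbk3", 0) should address the same sector.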

The aforementioned major/minor numbers:

> spu0103# ls -l /dev/|grep 67|grep '22[47]'
> brw-rw-rw-    1 root     root      67, 224 Sep 15 11:59 sdbk
> brw-rw-rw-    1 root     root      67, 227 Sep 15 11:59 sdbk3
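
As an alternative to grepping ls output, the numbers can be read
directly from the device node. A small sketch, with hypothetical helper
names, using only os.stat and the os.major/os.minor helpers:

```python
import os
import stat
from typing import Tuple


def split_rdev(rdev: int) -> Tuple[int, int]:
    """Split a raw device number into (major, minor)."""
    return os.major(rdev), os.minor(rdev)


def block_dev_numbers(path: str) -> Tuple[int, int]:
    """Return (major, minor) for a block device node such as /dev/sdbk3."""
    st = os.stat(path)
    if not stat.S_ISBLK(st.st_mode):
        raise ValueError(f"{path} is not a block device")
    return split_rdev(st.st_rdev)
```

On the system above, block_dev_numbers("/dev/sdbk3") would be expected
to return (67, 227), the source device named in the remap line.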

*All* other drives in my system that I tested do show a match between 
the 3 methods above (fdisk, sfdisk, sysfs).

I don't know how this discrepancy with the partition start could have 
been introduced, but it is most likely a byproduct of my testing.

Thanks,
Brett


Thread overview (3+ messages):
2010-09-14  3:49 O_DIRECT reads appear to be cached on block device partition file? Brett Russ
2010-09-14  7:36 ` Dave Chinner
2010-09-17 22:22   ` Brett Russ [this message]
