From: Chris Mason <chris.mason@oracle.com>
To: jim owens <owens6336@gmail.com>
Cc: Josef Bacik <josef@redhat.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH V2] Btrfs: Full direct I/O and AIO read implementation.
Date: Tue, 16 Feb 2010 11:01:07 -0500 [thread overview]
Message-ID: <20100216160107.GF3497@think> (raw)
In-Reply-To: <4B799DF7.9090008@gmail.com>
On Mon, Feb 15, 2010 at 02:18:15PM -0500, jim owens wrote:
> Chris Mason wrote:
> >Thanks to both Jim and Josef for their time on this one. More comments
> >below.
> >
> >On Sat, Feb 13, 2010 at 08:30:07PM -0500, jim owens wrote:
> >>Josef Bacik wrote:
> >>>On Wed, Feb 10, 2010 at 01:53:50PM -0500, jim owens wrote:
> >>>>+ /* traditional 512-byte device sector alignment is the
> >>The problem that I did not explain in my comment is that other fs's
> >>get the blocksize from 1 device bdi, which usually will be 512. But
> >>the btrfs bdi covers all volumes and gives the btrf logical 4096 size.
> >>If we do what is done in direct-io.c then we only allow access on 4k
> >>(or in the future larger logical) boundaries.
> >
> >We can record the smallest alignment while scanning the devices and just
> >use that. In general 4K is good enough, although some other database
> >company might be more worried about smaller numbers.
>
> So you are saying at mount time we will set a new field in a
> btrfs data structure that we can use here instead of the hard-code.
> With today's disks this will usually be 512.
Yes please.
[ use i_mutex for reads? ]
> >
> >But, we're making the performance of the common case (parallel reads)
> >dramatically slower to close a small race. Reads and writes are allowed
> >to race, and the results are allowed to be undefined. The only
> >requirement is that we not return garbage (zeros, or good data only).
>
> I agree with both why we don't want the mutex and that we don't care
> about the read returning old or new data.
>
> But in spite of what you said above, I found lock_extent() was
> not enough and I needed the i_mutex.
>
> The problem I had was that if btrfs_wait_ordered_range() does not
> run on the new data, that data can still be on disk but the btree
> checksum is wrong. I need the checksum forced into the btree so
> I don't fail the comparison.
>
> Even an overwrite when nodatacow is set will cause the same problem.
>
> I can try to code something like btrfs_page_mkwrite() does and see
> if that works, though I really question if that will be a dramatic
> improvement in parallel read because now we have to take the lock
> and search the ordered tree.
But, we already need the code that btrfs_page_mkwrite uses. It should
be enough to wait for the ordered extents and have the extent range
locked.
> To me this will be a net loss of
> performance on files that are not being modified as the i_mutex
> does not slow down parallel readers except when the area they
> are reading needs flushed.
The cost of i_mutex on parallel readers + high speed devices can be
surprising.
> And if the file is being modified in
> the spot we are reading, without i_mutex we may need to flush
> it multiple times, endlessly. And unlike btrfs_page_mkwrite()
> this code is not doing just 1 page.
If the file is being modified by buffered writes, they get what they
deserve. That's not what O_DIRECT is for.
>
> Or am I getting something wrong?
>
> >I think the temp buffer setup is the best place to start for
> >O_DIRECT + compressed.
>
> Not sure what you mean by that.
I meant I think your way is right ;)
-chris
next prev parent reply other threads:[~2010-02-16 16:01 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-02-10 18:53 [PATCH V2] Btrfs: Full direct I/O and AIO read implementation jim owens
2010-02-12 19:28 ` Josef Bacik
2010-02-14 1:30 ` jim owens
2010-02-15 16:42 ` Chris Mason
2010-02-15 19:18 ` jim owens
2010-02-16 16:01 ` Chris Mason [this message]
2010-02-16 17:09 ` jim owens
2010-02-15 21:58 ` Christoph Hellwig
2010-02-15 22:26 ` jim owens
2010-02-15 22:32 ` Christoph Hellwig
2010-02-15 22:40 ` jim owens
2010-02-16 15:49 ` Chris Mason
2010-02-15 22:01 ` rk
2010-02-15 22:31 ` jim owens
2010-02-16 19:28 ` jim owens
2010-02-16 19:39 ` Josef Bacik
2010-03-03 18:54 ` jim owens
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100216160107.GF3497@think \
--to=chris.mason@oracle.com \
--cc=josef@redhat.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=owens6336@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox