All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Mason <chris.mason@oracle.com>
To: jim owens <owens6336@gmail.com>
Cc: Josef Bacik <josef@redhat.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH V2] Btrfs: Full direct I/O and AIO read implementation.
Date: Tue, 16 Feb 2010 11:01:07 -0500	[thread overview]
Message-ID: <20100216160107.GF3497@think> (raw)
In-Reply-To: <4B799DF7.9090008@gmail.com>

On Mon, Feb 15, 2010 at 02:18:15PM -0500, jim owens wrote:
> Chris Mason wrote:
> >Thanks to both Jim and Josef for their time on this one.  More comments
> >below.
> >
> >On Sat, Feb 13, 2010 at 08:30:07PM -0500, jim owens wrote:
> >>Josef Bacik wrote:
> >>>On Wed, Feb 10, 2010 at 01:53:50PM -0500, jim owens wrote:
> >>>>+	/* traditional 512-byte device sector alignment is the
> >>The problem that I did not explain in my comment is that other fs's
> >>get the blocksize from 1 device bdi, which usually will be 512.  But
> >>the btrfs bdi covers all volumes and gives the btrf logical 4096 size.
> >>If we do what is done in direct-io.c then we only allow access on 4k
> >>(or in the future larger logical) boundaries.
> >
> >We can record the smallest alignment while scanning the devices and just
> >use that.  In general 4K is good enough, although some other database
> >company might be more worried about smaller numbers.
> 
> So you are saying at mount time we will set a new field in a
> btrfs data structure that we can use here instead of the hard-code.
> With today's disks this will usually be 512.

Yes please.

[ use i_mutex for reads? ]

> >
> >But, we're making the performance of the common case (parallel reads)
> >dramatically slower to close a small race.  Reads and writes are allowed
> >to race, and the results are allowed to be undefined.  The only
> >requirement is that we not return garbage (zeros, or good data only).
> 
> I agree with both why we don't want the mutex and that we don't care
> about the read returning old or new data.
> 
> But in spite of what you said above, I found lock_extent() was
> not enough and I needed the i_mutex.
> 
> The problem I had was that if btrfs_wait_ordered_range() does not
> run on the new data, that data can still be on disk but the btree
> checksum is wrong. I need the checksum forced into the btree so
> I don't fail the comparison.
> 
> Even an overwrite when nodatacow is set will cause the same problem.
> 
> I can try to code something like btrfs_page_mkwrite() does and see
> if that works, though I really question if that will be a dramatic
> improvement in parallel read because now we have to take the lock
> and search the ordered tree.

But, we already need the code that btrfs_page_mkwrite uses.  It should
be enough to wait for the ordered extents and have the extent range
locked.

> To me this will be a net loss of
> performance on files that are not being modified as the i_mutex
> does not slow down parallel readers except when the area they
> are reading needs flushed. 

The cost of i_mutex on parallel readers + high speed devices can be
surprising.

> And if the file is being modified in
> the spot we are reading, without i_mutex we may need to flush
> it multiple times, endlessly.  And unlike btrfs_page_mkwrite()
> this code is not doing just 1 page.

If the file is being modified by buffered writes, they get what they
deserve.  That's not what O_DIRECT is for.

> 
> Or am I getting something wrong?
> 
> >I think the temp buffer setup is the best place to start for
> >O_DIRECT + compressed.
> 
> Not sure what you mean by that.

I meant I think your way is right ;)

-chris


  reply	other threads:[~2010-02-16 16:01 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-02-10 18:53 [PATCH V2] Btrfs: Full direct I/O and AIO read implementation jim owens
2010-02-12 19:28 ` Josef Bacik
2010-02-14  1:30   ` jim owens
2010-02-15 16:42     ` Chris Mason
2010-02-15 19:18       ` jim owens
2010-02-16 16:01         ` Chris Mason [this message]
2010-02-16 17:09           ` jim owens
2010-02-15 21:58       ` Christoph Hellwig
2010-02-15 22:26         ` jim owens
2010-02-15 22:32           ` Christoph Hellwig
2010-02-15 22:40             ` jim owens
2010-02-16 15:49               ` Chris Mason
2010-02-15 22:01     ` rk
2010-02-15 22:31       ` jim owens
2010-02-16 19:28   ` jim owens
2010-02-16 19:39     ` Josef Bacik
2010-03-03 18:54       ` jim owens

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100216160107.GF3497@think \
    --to=chris.mason@oracle.com \
    --cc=josef@redhat.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=owens6336@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.