From: Andrew Morton <akpm@zip.com.au>
To: Linus Torvalds <torvalds@transmeta.com>
Cc: lkml <linux-kernel@vger.kernel.org>
Subject: Re: [patch 11/13] don't hold i_sem during O_DIRECT writes to blockdevs
Date: Sun, 28 Jul 2002 17:39:06 -0700
Message-ID: <3D448EAA.CE3382D8@zip.com.au>
In-Reply-To: Pine.LNX.4.44.0207281656110.9427-100000@home.transmeta.com

Linus Torvalds wrote:
>
> On Sun, 28 Jul 2002, Andrew Morton wrote:
> >
> > We're moving in the direction of deprecating the raw driver and
> > recommending that applications use O_DIRECT reads and writes against
> > blockdevs.
>
> This should probably be done unconditionally or not at all.
>
> We've worked very hard on making block devices more "normal" in 2.5.x, and
> I don't want to start diverging again.
>
> If this is really a scalability issue, I would suggest that people who
> care look into just getting rid of "i_sem", and replacing it with a
> read-write semaphore that explicitly protects only "i_size". Then you make
> reads and non-extending writes take that semaphore for reading, and
> extending writes and truncates take it for writing.

I don't know if it is a scalability issue, frankly. It will be for
buffered writes, but for writes which wait on IO, the mechanics of
the media probably make the benefits small. Conceivably there are
some additional merging opportunities, but it's thin.

We can do the rwsem thing, and that would be good. But there may
be filesystems which are relying on i_sem to provide protection
against concurrent invocations of get_block(create=1), inside i_size.

> [ The "nonextending writes" case is somewhat interesting, a write probably
> needs to actually take the semaphore for writing, and then downgrading
> it to reading after it has checked that it doesn't end up extending the
> file.
>
> What makes this even more interesting is that depending on the semaphore
> implementation you can actually split up the "take write lock" into
> "prepare to take write lock" and "turn it into a read lock" or "confirm
> write lock", where the "prepare to take write lock" allows existing
> readers but not new write-lockers, so that if you downgrade to a read
> lock you never had to synchronize with anybody else who was already
> reading. ]
>
> I'd much rather do this _right_ than have some ugly blockdev-only hack,
> since the problem certainly would happen with files too. A lot of people
> want to do databases on a filesystem, just because it is so much easier to
> administer.
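
The "take it for writing, then downgrade" idea from the bracketed note can be
sketched in user space as follows. This is an illustration under stated
assumptions, not a kernel interface: struct dg_lock and the dg_* functions are
hypothetical, built from a pthread mutex and condition variable. It shows only
the simple form (full write lock, then downgrade), not the two-phase "prepare
to take write lock" variant, and it makes no attempt at fairness, so a steady
stream of readers can starve a writer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical downgradable reader/writer lock; not a kernel API. */
struct dg_lock {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    int readers; /* number of active readers */
    int writer;  /* 1 while a writer holds the lock */
};

static void dg_init(struct dg_lock *l)
{
    pthread_mutex_init(&l->mu, NULL);
    pthread_cond_init(&l->cv, NULL);
    l->readers = 0;
    l->writer = 0;
}

static void dg_read_lock(struct dg_lock *l)
{
    pthread_mutex_lock(&l->mu);
    while (l->writer)
        pthread_cond_wait(&l->cv, &l->mu);
    l->readers++;
    pthread_mutex_unlock(&l->mu);
}

static void dg_read_unlock(struct dg_lock *l)
{
    pthread_mutex_lock(&l->mu);
    assert(l->readers > 0);
    if (--l->readers == 0)
        pthread_cond_broadcast(&l->cv); /* a waiting writer may proceed */
    pthread_mutex_unlock(&l->mu);
}

static void dg_write_lock(struct dg_lock *l)
{
    pthread_mutex_lock(&l->mu);
    while (l->writer || l->readers > 0)
        pthread_cond_wait(&l->cv, &l->mu);
    l->writer = 1;
    pthread_mutex_unlock(&l->mu);
}

static void dg_write_unlock(struct dg_lock *l)
{
    pthread_mutex_lock(&l->mu);
    assert(l->writer);
    l->writer = 0;
    pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->mu);
}

/* Turn a held write lock into a read lock without ever dropping it. */
static void dg_downgrade(struct dg_lock *l)
{
    pthread_mutex_lock(&l->mu);
    assert(l->writer);
    l->writer = 0;
    l->readers = 1;
    pthread_cond_broadcast(&l->cv); /* waiting readers may now enter */
    pthread_mutex_unlock(&l->mu);
}

/* Write path: returns 1 if it stayed exclusive (extending), else 0. */
static int dg_write(struct dg_lock *l, size_t *size, size_t pos, size_t len)
{
    dg_write_lock(l);
    if (pos + len > *size) {
        *size = pos + len; /* extending: keep the write lock */
        dg_write_unlock(l);
        return 1;
    }
    dg_downgrade(l); /* not extending: share with readers */
    /* The data I/O would happen here, concurrently with other readers. */
    dg_read_unlock(l);
    return 0;
}
```

Because the size check happens under the write lock and the downgrade never
releases the lock, no truncate can slip in between the check and the I/O,
which is the window a "drop and retake" scheme has to close by rechecking.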

OK. It'd be nice to get some benchmarks first (say, between
O_DIRECT-to-blockdev and the raw driver) to see if it's worth
bothering with.