From: Andres Freund <andres@anarazel.de>
To: Matthew Wilcox <matthew@wil.cx>
Cc: Andi Kleen <andi@firstfloor.org>,
viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, robertmhaas@gmail.com,
pgsql-hackers@postgresql.org
Subject: Re: Improve lseek scalability v3
Date: Fri, 16 Sep 2011 19:27:33 +0200
Message-ID: <201109161927.34472.andres@anarazel.de>
In-Reply-To: <20110916153620.GA9913@parisc-linux.org>
Hi,
On Friday 16 Sep 2011 17:36:20 Matthew Wilcox wrote:
> On Fri, Sep 16, 2011 at 04:16:49PM +0200, Andres Freund wrote:
> > I sent an email containing benchmarks from Robert Haas regarding the
> > Subject. Looking at lkml.org I can't see it right now, Will recheck when
> > I am at home.
> >
> > He replaced lseek(SEEK_END) with fstat() and got speedups up to 8.7 times
> > the lseek performance.
> > The workload was 64 clients hammering postgres with a simple readonly
> > workload (pgbench -S).
> Yay! Data!
> > For reference see the thread in the postgres archives which also links to
> > performance data:
> > http://archives.postgresql.org/message-id/CA+TgmoawRfpan35wzvgHkSJ0+i-W=VkJpKnRxK2kTDR+HsanWA@mail.gmail.com
> So both fstat and lseek do more work than postgres wants. lseek modifies
> the file pointer while fstat copies all kinds of unnecessary information
> into userspace. I imagine this is the source of the slowdown seen in
> the 1-client case.
Yes, that was my theory as well.
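For concreteness, here is a minimal sketch of the two ways of asking for a file's size that are being compared. The function names are made up for illustration and are not taken from postgres or the kernel; only lseek(), fstat() and open() are real calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Size via lseek(SEEK_END). Side effect: the file offset is moved to the
 * end of the file. Returns (off_t)-1 on error. */
off_t size_via_lseek(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/* Size via fstat(). Leaves the file offset alone, but the kernel fills in
 * a whole struct stat although only st_size is wanted.
 * Returns (off_t)-1 on error. */
off_t size_via_fstat(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return (off_t)-1;
    return st.st_size;
}

/* Hypothetical helper: 1 if both methods report the same size for the
 * file at path, 0 if they differ or lseek fails, -1 if open() fails. */
int sizes_agree(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t a = size_via_lseek(fd);
    off_t b = size_via_fstat(fd);
    close(fd);

    return (a != (off_t)-1 && a == b) ? 1 : 0;
}
```

Neither call is a pure "give me the size" operation, which is the point made in the quoted text above.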
> I'd like to dig into the requirement for knowing the file size a little
> better. According to the blog entry it's used for "the query planner".
It's used for multiple things - one of which is the query planner.
The query planner needs to know how many tuples a table has to produce a
sensible plan. For that it has stats which tell it 1. how big the table is and
2. how many tuples the table has. Those statistics are only updated every now
and then though.
So it uses those old stats to check how many tuples are normally stored on a
page and then uses that to extrapolate the number of tuples from the current
number of pages (which is computed by lseek(SEEK_END) over the 1GB segments of
a table).
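A rough sketch of that extrapolation, to make the arithmetic concrete. This is not the actual postgres code: the function names are hypothetical, the 8192-byte page size is postgres' default and assumed here, and returning 0 when the old stats report no pages is just a placeholder choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_BYTES 8192 /* assumed: postgres' default block size */

/* Current number of pages, summed over the already-open segment files of
 * one table, using lseek(SEEK_END) on each. Returns -1 on error. */
int64_t pages_in_segments(const int *seg_fds, size_t nsegs)
{
    int64_t pages = 0;

    for (size_t i = 0; i < nsegs; i++) {
        off_t end = lseek(seg_fds[i], 0, SEEK_END);
        if (end == (off_t)-1)
            return -1;
        pages += end / PAGE_BYTES;
    }
    return pages;
}

/* Extrapolate the current tuple count: take the tuples-per-page density
 * from the (possibly stale) stats and scale it by the current page count. */
double estimate_tuples(double stat_tuples, int64_t stat_pages,
                       int64_t cur_pages)
{
    assert(stat_tuples >= 0.0 && stat_pages >= 0 && cur_pages >= 0);

    if (stat_pages == 0)
        return 0.0; /* no density information; placeholder fallback */

    double density = stat_tuples / (double)stat_pages;
    return density * (double)cur_pages;
}
```

For example, stats recording 1000 tuples in 10 pages combined with a current size of 25 pages give an estimate of 2500 tuples.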
I am not sure how interested you are in the relevant postgres internals?
> Does the query planner need to know the exact number of bytes in the file,
> or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?
It depends on where the information is used. For some of the uses it needs to
be exact (the assumed size is rechecked after acquiring a lock preventing
extension); in other places I guess it would be ok if the accuracy got lower
with bigger files (those files won't ever get bigger than 1GB).
But I have a hard time seeing an implementation where the approximate size
would be faster to get than just the filesize?
Andres