From: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
To: "Ted Ts'o" <tytso@mit.edu>
Cc: linux-nfs@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: infinite getdents64 loop
Date: Tue, 31 May 2011 19:07:40 +0200
Message-ID: <4DE5205C.5020209@itwm.fraunhofer.de>
In-Reply-To: <20110531123518.GB4215@thunk.org>
On 05/31/2011 02:35 PM, Ted Ts'o wrote:
> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>
>> Out of interest, did anyone ever benchmark if dirindex provides any
>> advantages to readdir? And did those benchmarks include the
>> disadvantages of the present implementation (non-linear inode
>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
>> 'rm -fr $dir')?
>
> The problem is that seekdir/telldir is terminally broken (and so is
> NFSv2 for using such a tiny cookie) in that it fundamentally assumes
> a linear data structure. If you're going to use any kind of
> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
> doesn't cut it. We actually play games where we memoize the low
> 32-bits of the hash and keep track of which cookies we hand out via
> seekdir/telldir so that things mostly work --- except for NFSv2, where
> with the 32-bit cookie, you're just hosed.
Well, let's just ignore NFSv2; for NFS there are the better-working v3 and v4
alternatives. My real concern is ext3 and ext4, which have
    #define pos2min_hash(pos) (0)
>
> The reason why we have to iterate over the directory in hash tree
> order is because if we have a leaf node split, half the directory
> entries get copied to another directory entry, given the promises made
> by seekdir() and telldir() about directory entries appearing exactly
> once during a readdir() stream, even if you hold the fd open for weeks
> or days, mean that you really have to iterate over things in hash
> order.
Ah, I never looked into the dirindex implementation; I always thought
that only the dirindex blocks get updated, not the real directory entries
as well.
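
To illustrate the leaf-split argument quoted above, here is a minimal toy model in C. It is not the ext3/ext4 on-disk format: the structures, the function names (split_leaf, next_by_hash, walk_by_hash, walk_by_block) and the sample hash values are all hypothetical. It only shows why a cursor that remembers "last hash returned" survives a split, while a cursor that remembers a block position can return entries twice.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEAVES 4
#define LEAF_CAP   8

/* Toy model: a directory is a set of leaf blocks holding entry hashes. */
struct leaf { uint32_t hash[LEAF_CAP]; int n; };
struct dir  { struct leaf leaf[MAX_LEAVES]; int nleaf; };

/* Move the upper half of leaf i into a new leaf appended at the end. */
static void split_leaf(struct dir *d, int i)
{
    assert(d->nleaf < MAX_LEAVES);
    struct leaf *src = &d->leaf[i];
    struct leaf *dst = &d->leaf[d->nleaf++];
    int keep = src->n / 2;

    dst->n = 0;
    for (int k = keep; k < src->n; k++)
        dst->hash[dst->n++] = src->hash[k];
    src->n = keep;
}

/* Hash-order cursor: store the smallest hash > after in *out, return 0 at the end. */
static int next_by_hash(const struct dir *d, uint32_t after, uint32_t *out)
{
    int found = 0;

    for (int l = 0; l < d->nleaf; l++)
        for (int k = 0; k < d->leaf[l].n; k++) {
            uint32_t h = d->leaf[l].hash[k];
            if (h > after && (!found || h < *out)) {
                *out = h;
                found = 1;
            }
        }
    return found;
}

/* Walk in hash order (all sample hashes are > 0); split leaf 0 after
 * split_after entries were returned. Returns the number of entries seen. */
static int walk_by_hash(struct dir d, int split_after)
{
    uint32_t cookie = 0, h;
    int seen = 0;

    while (next_by_hash(&d, cookie, &h)) {
        cookie = h;
        if (++seen == split_after)
            split_leaf(&d, 0);
    }
    return seen;
}

/* Walk in block order with a (leaf, index) cursor, same split point. */
static int walk_by_block(struct dir d, int split_after)
{
    int seen = 0;

    for (int l = 0; l < d.nleaf; l++)
        for (int k = 0; k < d.leaf[l].n; k++)
            if (++seen == split_after)
                split_leaf(&d, 0);
    return seen;
}

/* Sample directory with six entries in two leaves. */
static struct dir sample_dir(void)
{
    struct dir d = { .nleaf = 2 };

    d.leaf[0] = (struct leaf){ .hash = {10, 20, 30, 40}, .n = 4 };
    d.leaf[1] = (struct leaf){ .hash = {50, 60}, .n = 2 };
    return d;
}
```

With the six-entry sample directory and a split after the fourth entry, walk_by_hash() sees 6 entries, while walk_by_block() sees 8, because the two entries moved to the new leaf are returned a second time.
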
>
> I'd have to look, since it's been too many years, but as I recall the
> problem was that there is a common path for NFSv2 and NFSv3/v4, so we
> don't know whether we can hand back a 32-bit cookie or a 64-bit
> cookie, so we're always handing the NFS server a 32-bit "offset", even
> though we could do better. Actually, if we had an interface where we
> could give you a 128-bit "offset" into the directory, we could
> probably eliminate the duplicate cookie problem entirely. We just
> send 64-bits worth of hash, plus the first two bytes of the file
> name.
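
As a rough sketch of the 128-bit cookie idea quoted above (64 bits of hash plus the first two bytes of the file name), something along the following lines could work. The struct layout and the names dir_cookie128, make_cookie and cookie_cmp are hypothetical; no such interface exists in the thread. Note that names sharing both hashes and their first two bytes would still get the same cookie.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit directory cookie: 64 bits of hash, then name bytes. */
struct dir_cookie128 {
    uint64_t hash;        /* major hash in the high 32 bits, minor in the low */
    uint64_t name_prefix; /* first two bytes of the file name, rest zero */
};

static struct dir_cookie128 make_cookie(uint32_t major, uint32_t minor,
                                        const char *name)
{
    struct dir_cookie128 c;
    size_t len = strlen(name);

    c.hash = ((uint64_t)major << 32) | minor;
    c.name_prefix = 0;
    if (len > 0)
        c.name_prefix |= (uint64_t)(unsigned char)name[0] << 8;
    if (len > 1)
        c.name_prefix |= (uint64_t)(unsigned char)name[1];
    return c;
}

/* Total order on cookies: by hash first, then by name prefix. */
static int cookie_cmp(struct dir_cookie128 a, struct dir_cookie128 b)
{
    if (a.hash != b.hash)
        return a.hash < b.hash ? -1 : 1;
    if (a.name_prefix != b.name_prefix)
        return a.name_prefix < b.name_prefix ? -1 : 1;
    return 0;
}

/* Usage: two names with identical hashes still get distinct cookies. */
static void demo_cookie(void)
{
    assert(cookie_cmp(make_cookie(1, 2, "ab"), make_cookie(1, 2, "ac")) != 0);
}
```
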
Well, personally I'm more interested in user space, but I don't see any
difference between NFS, other kernel paths and user space. I think this
is used for everything:
    /* Some one has messed with f_pos; reset the world */
    if (info->last_pos != filp->f_pos) {
        free_rb_tree_fname(&info->root);
        info->curr_node = NULL;
        info->extra_fname = NULL;
        info->curr_hash = pos2maj_hash(filp->f_pos);
        info->curr_minor_hash = pos2min_hash(filp->f_pos);
    }
So with the above #define pos2min_hash(), info->curr_minor_hash is
always zero, without exception. Or am I missing something?
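
To make the question concrete, here is a minimal user-space sketch of the position/hash round trip. Only pos2min_hash() is quoted in this mail; the definitions of hash2pos() and pos2maj_hash() below are written from memory of the ext3/ext4 dir.c of that time and should be treated as assumptions, as should the helper names (tell_for, resume_from) and the sample hash values.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shape of the macros; only pos2min_hash() appears in this mail. */
#define hash2pos(major, minor) ((major) >> 1)
#define pos2maj_hash(pos)      (((pos) << 1) & 0xffffffff)
#define pos2min_hash(pos)      (0)

struct dx_pos { uint32_t major; uint32_t minor; };

/* Hypothetical helper: the position handed out for a (major, minor) pair. */
static uint64_t tell_for(uint32_t major, uint32_t minor)
{
    (void)minor; /* the minor hash does not enter the position at all */
    return hash2pos((uint64_t)major, minor);
}

/* Hypothetical helper: what can be recovered once f_pos has been changed. */
static struct dx_pos resume_from(uint64_t f_pos)
{
    struct dx_pos p;

    p.major = (uint32_t)pos2maj_hash(f_pos);
    p.minor = (uint32_t)pos2min_hash(f_pos);
    return p;
}

/* Usage: entries sharing a major hash get the same position, and the
 * minor hash is always zero after a reset. */
static void demo_roundtrip(void)
{
    assert(tell_for(0x12345678u, 1) == tell_for(0x12345678u, 2));
    assert(resume_from(tell_for(0x12345678u, 7)).minor == 0);
}
```

Under these assumptions, two entries that differ only in their minor hash cannot be told apart after a seek, which is the situation the question above is about.
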
>
>> 3) Disable dirindexing for readdirs
>
> That won't work, since it will break POSIX compliance. Once again,
> we're tied by the decisions made decades ago...
I really wonder whether we couldn't set a flag somewhere to ignore POSIX for
applications that can handle it on their own. It's a pity that opendir()
doesn't allow setting flags. An ioctl would be another option.
Thanks,
Bernd