From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261868AbUBWN61 (ORCPT ); Mon, 23 Feb 2004 08:58:27 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261864AbUBWN5v (ORCPT ); Mon, 23 Feb 2004 08:57:51 -0500 Received: from mail.shareable.org ([81.29.64.88]:21890 "EHLO mail.shareable.org") by vger.kernel.org with ESMTP id S261870AbUBWNzd (ORCPT ); Mon, 23 Feb 2004 08:55:33 -0500 Date: Mon, 23 Feb 2004 13:55:20 +0000 From: Jamie Lokier To: Pavel Machek Cc: Ingo Molnar , Linus Torvalds , Tridge , Al Viro , "H. Peter Anvin" , Kernel Mailing List Subject: Re: explicit dcache <-> user-space cache coherency, sys_mark_dir_clean(), O_CLEAN Message-ID: <20040223135520.GA30321@mail.shareable.org> References: <20040219182948.GA3414@mail.shareable.org> <20040220120417.GA4010@elte.hu> <20040220170438.GA19722@elte.hu> <20040220184822.GA23460@elte.hu> <20040221075853.GA828@elte.hu> <20040223105953.GA2992@openzaurus.ucw.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20040223105953.GA2992@openzaurus.ucw.cz> User-Agent: Mutt/1.4.1i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Pavel Machek wrote: > > monotonity is important: two successive directory operations to not be > > possible within the same nanosecond. This is not possible with current > > hardware - but how about future hardware? Can we make an assumption like > > this? > > Does not ndelay(1) if samba notices mtime is too young in the samba code > fix that? No, because Samba cannot tell. To Samba, it looks like the directory hasn't changed, when it has. Another issue is that most machines don't have nanosecond resolution clocks (e.g. m68k is limited to timer interrupt resolution, and some x86 machines cannot use the cycle counter), and most filesystems don't store them either. The right place to put the delay is in the kernel, when mtime is set or when it is read, or both. Important: you *don't* have to delay unless someone has _read_ the mtime since the last time it was set, _and_ if the mtime wouldn't be changed by the current operation, _and_ if you cannot simply increment the low-order bits of mtime due to known limits on the system clock resolution. So maintain a flag per inode that says "this inode's mtime has been read". It is set whenever the mtime is read, or whenever the inode is written to disk if you are implementing persistence (because you never know if someone reads it from the disk at another time). The flag does not have to be stored - this works with all filesystems. Also, you don't have to put the delay where the mtime is updated. You can also put it where the mtime is _read_, and then only when the flag is not set. Or you can balance it equally between both operations, so that neither operation can be a DOS for the other. The delaying strategy works very nicely for filesystems that store low resolution clocks (i.e. nearly all of them), because even though the delay is longer (e.g. sleep(1) for ext2, sleep(2) for FAT), that flag is rarely alternated so you hardly ever need the delay - and mtimes are still observed to be strictly monotonic. (You can further eliminate the need for delays by assigning some bits as a sub-generation counter instead of a timestamp. This is equivalent to pretending the system clock has lower resolution than the filesystem can store. In fact, taking bits _away_ from mtime and using them a sub-generation counter instead provides better performance with the guarantee of monotonicity.) It's a generally useful feature, but I'm not sure why we're looking at this for Samba, which needs the O_CLEAN mechanism more than it needs change-detection - for this, Samba can already use the existing dnotify even though the interface is a bit cumbersome, whereas O_CLEAN or its equivalent isn't available yet. -- Jamie