From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andreas Dilger Subject: Re: [PATCH 1/5] vfs: vfs-level fiemap interface Date: Wed, 28 May 2008 13:42:15 -0600 Message-ID: <20080528194215.GI7263@webber.adilger.int> References: <20080525000157.GK8325@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7BIT Cc: linux-fsdevel@vger.kernel.org, Josef Bacik To: Mark Fasheh Return-path: Received: from sca-es-mail-2.Sun.COM ([192.18.43.133]:37047 "EHLO sca-es-mail-2.sun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752506AbYE1TmU (ORCPT ); Wed, 28 May 2008 15:42:20 -0400 Received: from fe-sfbay-10.sun.com ([192.18.43.129]) by sca-es-mail-2.sun.com (8.13.7+Sun/8.12.9) with ESMTP id m4SJgIuo003035 for ; Wed, 28 May 2008 12:42:19 -0700 (PDT) Received: from conversion-daemon.fe-sfbay-10.sun.com by fe-sfbay-10.sun.com (Sun Java System Messaging Server 6.2-8.04 (built Feb 28 2007)) id <0K1L00A01FSIJF00@fe-sfbay-10.sun.com> (original mail from adilger@sun.com) for linux-fsdevel@vger.kernel.org; Wed, 28 May 2008 12:42:18 -0700 (PDT) In-reply-to: <20080525000157.GK8325@wotan.suse.de> Content-disposition: inline Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On May 24, 2008 17:01 -0700, Mark Fasheh wrote: > Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap > inode operation. Mark, I was looking at a way to remove the special-casing of NUM_EXTENTS from ioctl_fiemap() in an effort to remove Christoph's objection to keeping these in the same ioctl. I think it is possible and reasonable to move the special-case handling into fiemap_fill_next_extent(). > +static int ioctl_fiemap(struct file *filp, unsigned long arg) > +{ > + > + if (!(fiemap.fm_flags & FIEMAP_FLAG_NUM_EXTENTS) && > + (fiemap.fm_extent_count == 0 || > + fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS)) > + return -EINVAL; This can be changed to only check: if (fm_extent_count > FIEMAP_MAX_EXTENTS) return -EINVAL; > + fieinfo.fi_flags = fiemap.fm_flags; > + if (!(fiemap.fm_flags & FIEMAP_FLAG_NUM_EXTENTS)) { > + fieinfo.fi_extents_max = fiemap.fm_extent_count; > + fieinfo.fi_extents_start = (char *)arg + sizeof(fiemap); > + > + if (!access_ok(VERIFY_WRITE, fieinfo.fi_extents_start, > + fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) > + return -EFAULT; > + } It is harmless to set fi_extents_max and fi_extents_start, as this is ignored by NUM_EXTENTS. The fiemap_fill_next_extent() will already check in copy_to_user() whether the fi_extents_start pointer is valid, and fiemap_fill_next_extent() doesn't even get far enough to look at fi_extents_max or fi_extents_start. We just do: fieinfo.fi_extents = fiemap.fm_extent_count; fieinfo.fi_extents_start = (struct fiemap_extent *)((char *)arg + sizeof(fiemap)); This leaves us with no checks for FIEMAP_FLAG_NUM_EXTENTS in ioctl_fiemap() at all, and no changes needed in fiemap_fill_next_extent(). > > What about the idea to have fiemap_fill_next_extent() do "extent" merging > > for filesystems that use the generic helper but do not return multiple > > blocks via get_blocks()? I don't think that is too hard to implement, > > and makes the output more useful, otherwise we get an extent per block. > > The above is what I _think_ will work, haven't actually tried it out. > > I don't think we want to automatically merge extents within this helper > function. Otherwise we would diverge from the actual disk layout for extent > based file systems where an extent might be broken up between two records > for some other reason, such as maximum extent length being exceeded. Do we really want to expose the filesystem-specific extent-length limits to userspace? In some sense, a block-based filesystem has a maximum extent length of the blocksize, but it seems totally reasonable to merge contiguous blocks into a single "extent" for return to userspace. I don't see this significantly different for ext4, even though it can have extents up to 128MB, or unwritten extents up to 64MB. > Btw, how many block-based file systems that don't return multiple blocks via > get_blocks() are there that we actually care about enough to write this > code? That I have no clue about. Joseph? Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.