From mboxrd@z Thu Jan  1 00:00:00 1970
From: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Subject: Re: [PATCH 17/18] xfs: implement pnfs export operations
Date: Wed, 7 Jan 2015 11:40:10 +0100
Message-ID: <20150107104010.GD28783@lst.de>
References: <1420561721-9150-1-git-send-email-hch@lst.de> <1420561721-9150-18-git-send-email-hch@lst.de> <20150107002434.GG31508@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>,
	"J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>,
	Jeff Layton <jlayton-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org>,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org
To: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Return-path: <linux-nfs-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Content-Disposition: inline
In-Reply-To: <20150107002434.GG31508@dastard>
Sender: linux-nfs-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, Jan 07, 2015 at 11:24:34AM +1100, Dave Chinner wrote:
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index fdc6422..2b86be8 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -601,6 +601,8 @@ xfs_growfs_data(
> >  	if (!mutex_trylock(&mp->m_growlock))
> >  		return -EWOULDBLOCK;
> >  	error =3D xfs_growfs_data_private(mp, in);
> > +	if (!error)
> > +		mp->m_generation++;
> >  	mutex_unlock(&mp->m_growlock);
> >  	return error;
> >  }
>=20
> I couldn't find an explanation of what this generation number is
> for. What are it's semantics w.r.t. server crashes?

The generation is incremented when we grow the filesystem, so that
a new layout (block mapping) returned to the cl=D1=96ent referers to th=
e
new NFS device ID, which will make the client aware of the new size.

The device IDs aren't persistent, so after a server crash / reboot
we'll start at zero again.

I'll add comments explaining this to the code.

> Why does this function get passed an offset it is not actually used?

Historic reasons..

> > +static int
> > +xfs_fs_update_flags(
> > +	struct xfs_inode	*ip)
> > +{
> > +	struct xfs_mount	*mp =3D ip->i_mount;
> > +	struct xfs_trans	*tp;
> > +	int			error;
> > +
> > +	/*
> > +	 * Update the mode, and prealloc flag bits.
> > +	 */
> > +	tp =3D xfs_trans_alloc(mp, XFS_TRANS_WRITEID);
> > +	error =3D xfs_trans_reserve(tp, &M_RES(mp)->tr_writeid, 0, 0);
> > +	if (error) {
> > +		xfs_trans_cancel(tp, 0);
> > +		return error;
> > +	}
> > +
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > +	ip->i_d.di_mode &=3D ~S_ISUID;
> > +	if (ip->i_d.di_mode & S_IXGRP)
> > +		ip->i_d.di_mode &=3D ~S_ISGID;
> > +
> > +	ip->i_d.di_flags |=3D XFS_DIFLAG_PREALLOC;
> > +
> > +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > +	return xfs_trans_commit(tp, 0);
> > +}
>=20
> That needs timestamp changes as well. i.e.:
>=20
> 	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);

The time stamps are only updated when we actually commit the data.
Updating them here might be harmless, but I'll have to dig into the
protocol specification and tests a bit more to check if doing the
additional timestamp update would be harmless.

> > +
> > +/*
> > + * Get a layout for the pNFS client.
> > + *
> > + * Note that in the allocation case we do force out the transactio=
n here.
> > + * There is no metadata update that is required to be stable for N=
=46S
> > + * semantics, and layouts are not valid over a server crash.  Inst=
ead
> > + * we'll have to be careful in the commit routine as it might pass=
 us
> > + * blocks for an allocation that never made it to disk in the reco=
very
> > + * case.
>=20
> I think you are saying that because block allocation is an async
> transaction, then we have to deal with the possibility that we crash
> before the transaction hits the disk.
>=20
> How often do we have to allocate
> new blocks like this? Do we need to use async transactions for this
> case, or should we simply do the brute force thing (by making the
> allocation transaction synchronous) initially and then, if
> performance problems arise, optimise from there?

Every block allocation from a pNFS client goes through this path, so
yes it is performance critical.

> > +	xfs_map_iomap(ip, iomap, &imap, offset);
> > +	*device_generation =3D mp->m_generation;
>=20
> So whenever the server first starts up the generation number in a
> map is going to be zero - what purpose does this actually serve?

So that we can communicate if a device was grown to the client, which
in this case needs to re-read the device information.

> > +		if (!length)
> > +			continue;
> > +
> > +		error =3D xfs_iomap_write_unwritten(ip, start, length);
> > +		if (error)
> > +			goto out_drop_iolock;
> > +	}
> > +
> > +	/*
> > +	 * Make sure reads through the pagecache see the new data.
> > +	 */
> > +	invalidate_inode_pages2(inode->i_mapping);
>=20
> Probably should do that first. Also, what happens if there is local
> dirty data on the file at this point? Doesn't this just toss them
> away?

If there was local data it will be tossed.  For regular writes this can=
't
happen because we really outstanding layouts in the write path.  For
mmap we for now ignore this problem, as a pNFS server should generally
not be used locally. =20
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html