From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mingming Cao <cmm@us.ibm.com>
Subject: Re: [PATCH] dio: track and serialise unaligned direct IO
Date: Tue, 03 Aug 2010 17:11:18 -0700
Message-ID: <1280880678.2334.27.camel@mingming-laptop>
References: <1280443516-14448-1-git-send-email-david@fromorbit.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com, sandeen@sandeen.net
To: Dave Chinner <david@fromorbit.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from e2.ny.us.ibm.com ([32.97.182.142]:55956 "EHLO e2.ny.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1757957Ab0HDALV (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 3 Aug 2010 20:11:21 -0400
Received: from d01relay01.pok.ibm.com (d01relay01.pok.ibm.com [9.56.227.233])
	by e2.ny.us.ibm.com (8.14.4/8.13.1) with ESMTP id o73NvVfr026096
	for <linux-fsdevel@vger.kernel.org>; Tue, 3 Aug 2010 19:57:31 -0400
Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64])
	by d01relay01.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o740BKlr281396
	for <linux-fsdevel@vger.kernel.org>; Tue, 3 Aug 2010 20:11:20 -0400
Received: from d01av04.pok.ibm.com (loopback [127.0.0.1])
	by d01av04.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id o740BJeZ016076
	for <linux-fsdevel@vger.kernel.org>; Tue, 3 Aug 2010 20:11:20 -0400
In-Reply-To: <1280443516-14448-1-git-send-email-david@fromorbit.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Fri, 2010-07-30 at 08:45 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>=20
> If we get two unaligned direct IO's to the same filesystem block
> that is marked as a new allocation (i.e. buffer_new), then both IOs w=
ill
> zero the portion of the block they are not writing data to. As a
> result, when the IOs complete there will be a portion of the block
> that contains zeros from the last IO to complete rather than the
> data that should be there.
>=20
> This is easily manifested by qemu using aio+dio with an unaligned
> guest filesystem - every IO is unaligned and fileystem corruption is
> encountered in the guest filesystem. xfstest 240 (from Eric Sandeen)
> is also a simple reproducer.
>=20
> To avoid this problem, track unaligned IO that triggers sub-block zer=
oing and
> check new incoming unaligned IO that require sub-block zeroing agains=
t that
> list. If we get an overlap where the start and end of unaligned IOs h=
it the
> same filesystem block, then we need to block the incoming IOs until t=
he IO that
> is zeroing the block completes. The blocked IO can then continue with=
out
> needing to do any zeroing and hence won't overwrite valid data with z=
eros.
>=20

This seems to address both two IOs are unaligned direct IO. If the firs=
t
IO is aligned direct IO, then it is not tracked?

I am also concerned about the aligned direct IO case...

1) first thread aio+dio+aligned write to a hole, there is no zero-out
submitted from kernel. But the hole remains initialized before all IO
complete and convert it from uninitialized extent to initialized.
2) second thread aio+dio+unalign write to the same hole, this time it i=
s
unaligned. since buffer is still new (not converted yet), the new
incoming thread zero out port of data that first thread has written to

It seems would need to track any direct IO written to a hole/unwrtten
extent/ buffer new, not only the unaligned dio ...

regards,
Mingming
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/direct-io.c |  152 ++++++++++++++++++++++++++++++++++++++++++++++=
+++++++--
>  1 files changed, 146 insertions(+), 6 deletions(-)
>=20
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index a10cb91..611524e 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -71,6 +71,9 @@ struct dio {
>  	unsigned start_zero_done;	/* flag: sub-blocksize zeroing has
>  					   been performed at the start of a
>  					   write */
> +#define LAST_SECTOR ((sector_t)-1LL)
> +	sector_t zero_block_front;	/* fs block we are zeroing at front */
> +	sector_t zero_block_rear;	/* fs block we are zeroing at rear */
>  	int pages_in_io;		/* approximate total IO pages */
>  	size_t	size;			/* total request size (doesn't change)*/
>  	sector_t block_in_file;		/* Current offset into the underlying
> @@ -135,6 +138,101 @@ struct dio {
>  	struct page *pages[DIO_PAGES];	/* page buffer */
>  };
>=20
> +
> +/*
> + * record fs blocks we are doing zeroing on in a zero block list.
> + * unaligned IO is not very performant and so is relatively uncommon=
,
> + * so a global list should be sufficent to track them.
> + */
> +struct dio_zero_block {
> +	struct list_head dio_list;	/* list of io in progress */
> +	sector_t	zero_block;	/* block being zeroed */
> +	struct dio	*dio;		/* owner dio */
> +	wait_queue_head_t wq;		/* New IO block here */
> +	atomic_t	ref;		/* reference count */
> +};
> +
> +DEFINE_SPINLOCK(dio_zero_block_lock);
> +LIST_HEAD(dio_zero_block_list);
> +
> +/*
> + * Add a filesystem block to the list of blocks we are tracking.
> + */
> +static void
> +dio_start_zero_block(struct dio *dio, sector_t zero_block)
> +{
> +	struct dio_zero_block *zb;
> +
> +	zb =3D kmalloc(sizeof(*zb), GFP_NOIO);
> +	if (!zb)
> +		return;
> +	INIT_LIST_HEAD(&zb->dio_list);
> +	init_waitqueue_head(&zb->wq);
> +	zb->zero_block =3D zero_block;
> +	zb->dio =3D dio;
> +	atomic_set(&zb->ref, 1);
> +
> +	spin_lock(&dio_zero_block_lock);
> +	list_add(&zb->dio_list, &dio_zero_block_list);
> +	spin_unlock(&dio_zero_block_lock);
> +}
> +
> +static void
> +dio_drop_zero_block(struct dio_zero_block *zb)
> +{
> +	if (atomic_dec_and_test(&zb->ref))
> +		kfree(zb);
> +}
> +
> +/*
> + * Check whether a filesystem block is currently being zeroed, and i=
f it is
> + * wait for it to complete before returning. If we waited for a bloc=
k being
> + * zeroed, return 1 to indicate that the block is already initialise=
d,
> + * otherwise return 0 to indicate that it needs zeroing.
> + */
> +static int
> +dio_wait_zero_block(struct dio *dio, sector_t zero_block)
> +{
> +	struct dio_zero_block *zb;
> +
> +	spin_lock(&dio_zero_block_lock);
> +	list_for_each_entry(zb, &dio_zero_block_list, dio_list) {
> +		if (zb->dio->inode !=3D dio->inode)
> +			continue;
> +		if (zb->zero_block !=3D zero_block)
> +			continue;
> +		atomic_inc(&zb->ref);
> +		spin_unlock(&dio_zero_block_lock);
> +		wait_event(zb->wq, (list_empty(&zb->dio_list)));
> +		dio_drop_zero_block(zb);
> +		return 1;
> +	}
> +	spin_unlock(&dio_zero_block_lock);
> +	return 0;
> +}
> +
> +/*
> + * Complete a block zeroing and wake up anyone waiting for it.
> + */
> +static void dio_end_zero_block(struct dio *dio, sector_t zero_block)
> +{
> +	struct dio_zero_block *zb;
> +
> +	spin_lock(&dio_zero_block_lock);
> +	list_for_each_entry(zb, &dio_zero_block_list, dio_list) {
> +		if (zb->dio->inode !=3D dio->inode)
> +			continue;
> +		if (zb->zero_block !=3D zero_block)
> +			continue;
> +		list_del_init(&zb->dio_list);
> +		spin_unlock(&dio_zero_block_lock);
> +		wake_up(&zb->wq);
> +		dio_drop_zero_block(zb);
> +		return;
> +	}
> +	spin_unlock(&dio_zero_block_lock);
> +}
> +
>  /*
>   * How many pages are in the queue?
>   */
> @@ -253,6 +351,11 @@ static int dio_complete(struct dio *dio, loff_t =
offset, int ret, bool is_async)
>  		aio_complete(dio->iocb, ret, 0);
>  	}
>=20
> +	if (dio->zero_block_front !=3D LAST_SECTOR)
> +		dio_end_zero_block(dio, dio->zero_block_front);
> +	if (dio->zero_block_rear !=3D LAST_SECTOR)
> +		dio_end_zero_block(dio, dio->zero_block_rear);
> +
>  	if (dio->flags & DIO_LOCKING)
>  		/* lockdep: non-owner release */
>  		up_read_non_owner(&dio->inode->i_alloc_sem);
> @@ -777,6 +880,12 @@ static void clean_blockdev_aliases(struct dio *d=
io)
>   * block with zeros. This happens only if user-buffer, fileoffset or
>   * io length is not filesystem block-size multiple.
>   *
> + * We need to track the blocks we are zeroing. If we have concurrent=
 IOs that hit
> + * the same start or end block, we do not want all the IOs to zero t=
he portion
> + * they are not writing data to as that will overwrite data from the=
 other IOs.
> + * Hence we need to block until the first unaligned IO completes bef=
ore we can
> + * continue (without executing any zeroing).
> + *
>   * `end' is zero if we're doing the start of the IO, 1 at the end of=
 the
>   * IO.
>   */
> @@ -784,8 +893,8 @@ static void dio_zero_block(struct dio *dio, int e=
nd)
>  {
>  	unsigned dio_blocks_per_fs_block;
>  	unsigned this_chunk_blocks;	/* In dio_blocks */
> -	unsigned this_chunk_bytes;
>  	struct page *page;
> +	sector_t fsblock;
>=20
>  	dio->start_zero_done =3D 1;
>  	if (!dio->blkfactor || !buffer_new(&dio->map_bh))
> @@ -797,17 +906,41 @@ static void dio_zero_block(struct dio *dio, int=
 end)
>  	if (!this_chunk_blocks)
>  		return;
>=20
> +	if (end)
> +		this_chunk_blocks =3D dio_blocks_per_fs_block - this_chunk_blocks;
> +
>  	/*
>  	 * We need to zero out part of an fs block.  It is either at the
> -	 * beginning or the end of the fs block.
> +	 * beginning or the end of the fs block, but first we need to check=
 if
> +	 * there is already a zeroing being run on this block.
> +	 *
> +	 * If we are doing a sub-block IO (i.e. zeroing both front and rear=
 of
> +	 * the same block) we don't need to wait or set a gaurd for the rea=
r of
> +	 * the block as we already have one set.
>  	 */
> -	if (end)=20
> -		this_chunk_blocks =3D dio_blocks_per_fs_block - this_chunk_blocks;
> +	fsblock =3D dio->block_in_file >> dio->blkfactor;
> +	if (!end || dio->zero_block_front !=3D fsblock) {
>=20
> -	this_chunk_bytes =3D this_chunk_blocks << dio->blkbits;
> +		/* wait for any zeroing already in progress */
> +		if (dio_wait_zero_block(dio, fsblock)) {
> +			/* skip the range we would have zeroed. */
> +			dio->next_block_for_io +=3D this_chunk_blocks;
> +			return;
> +		}
> +
> +		/*
> +		 * we are going to zero stuff now, so set a guard to catch
> +		 * others that might want to zero the same block.
> +		 */
> +		dio_start_zero_block(dio, fsblock);
> +		if (end)
> +			dio->zero_block_rear =3D fsblock;
> +		else
> +			dio->zero_block_front =3D fsblock;
> +	}
>=20
>  	page =3D ZERO_PAGE(0);
> -	if (submit_page_section(dio, page, 0, this_chunk_bytes,=20
> +	if (submit_page_section(dio, page, 0, this_chunk_blocks << dio->blk=
bits,
>  				dio->next_block_for_io))
>  		return;
>=20
> @@ -1191,6 +1324,13 @@ __blockdev_direct_IO_newtrunc(int rw, struct k=
iocb *iocb, struct inode *inode,
>  	 */
>  	memset(dio, 0, offsetof(struct dio, pages));
>=20
> +	/*
> +	 * zero_blocks need to initialised to large=D1=95t value to avoid
> +	 * matching the zero block accidentally.
> +	 */
> +	dio->zero_block_front =3D LAST_SECTOR;
> +	dio->zero_block_rear =3D LAST_SECTOR;
> +
>  	dio->flags =3D flags;
>  	if (dio->flags & DIO_LOCKING) {
>  		/* watch out for a 0 len io from a tricksy fs */
> --=20
> 1.7.1
>=20
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdev=
el" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html