All of lore.kernel.org
 help / color / mirror / Atom feed
From: Steven Whitehouse <swhiteho@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH v3] GFS2: Add a next-resource-group pointer to resource groups
Date: Tue, 14 Feb 2017 10:32:54 +0000	[thread overview]
Message-ID: <9def9caa-4eae-0f8d-ae07-eec7672f6013@redhat.com> (raw)
In-Reply-To: <20170213175937.12705-1-anprice@redhat.com>

Hi,


On 13/02/17 17:59, Andrew Price wrote:
> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
> rg_skip field has the following meaning:
>
> - If rg_skip is zero, it is considered unset and not useful.
> - If rg_skip is non-zero, its value will be the number of blocks between
>    this rgrp's address and the next rgrp's address. This can be used as a
>    hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>
> When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
> to be 0 it will attempt to set it to the difference between its rd_addr
> and the rd_addr of the next resource group.
>
> The only special case is the final rgrp, which always has a rg_skip of
> 0. It is not set to a special value (like -1) because, when the
> filesystem is grown, the rgrp will no longer be the final one and it
> will then need to have its rg_skip field set. The overhead of this
> special case is a gfs2_rgrpd_get_next() call each time
> gfs2_rgrp_bh_get() is called for the final resource group.
>
> For the other resource groups, if the rg_skip field is 0, it is set
> appropriately and then the only overhead becomes the rgd->rg_skip == 0
> comparison in gfs2_rgrp_bh_get().
>
> Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
> the rg_skip field can get set back to 0 in cases where nodes with and
> without this patch are mixed in a cluster. In some cases, the field may
> bounce between being set by one node and then zeroed by another which
> may harm performance slightly, e.g. when two nodes create many small
> files. In testing this situation is rare but it becomes more likely as
> the filesystem fills up and there are fewer resource groups to choose
> from. The problem goes away when all nodes are running with this patch.
> Dipping into the space currently occupied by the rg_reserved field would
> have resulted in the same problem as it is also explicitly zeroed, so
> unfortunately there is no other way around it.
>
> Signed-off-by: Andrew Price <anprice@redhat.com>
We will also need patches for gfs2_convert, mkfs.gfs2, fsck.gfs2, 
gfs2_edit and gfs2_grow too most likely. Ideally we want to be in a 
situation where we can default to using only the info in the rgrp 
headers, and fall back to looking at the rindex only in case that 
information is not there/broken in the rgrp headers.


> ---
>   fs/gfs2/incore.h                 |  1 +
>   fs/gfs2/rgrp.c                   | 27 ++++++++++++++++++++++++++-
>   include/uapi/linux/gfs2_ondisk.h |  5 ++++-
>   3 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index a6a3389..2c03287 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -88,6 +88,7 @@ struct gfs2_rgrpd {
>   	u32 rd_reserved;                /* number of blocks reserved */
>   	u32 rd_free_clone;
>   	u32 rd_dinodes;
> +	u32 rd_skip;                    /* Distance to the next rgrp in fs blocks */
>   	u64 rd_igeneration;
>   	struct gfs2_bitmap *rd_bits;
>   	struct gfs2_sbd *rd_sbd;
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 86ccc015..aaf435d 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1049,6 +1049,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd, const void *buf)
>   	rgd->rd_flags |= rg_flags;
>   	rgd->rd_free = be32_to_cpu(str->rg_free);
>   	rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes);
> +	rgd->rd_skip = be32_to_cpu(str->rg_skip);
>   	rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration);
>   }
>   
> @@ -1059,7 +1060,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd, void *buf)
>   	str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK);
>   	str->rg_free = cpu_to_be32(rgd->rd_free);
>   	str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes);
> -	str->__pad = cpu_to_be32(0);
> +	str->rg_skip = cpu_to_be32(rgd->rd_skip);
>   	str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration);
>   	memset(&str->rg_reserved, 0, sizeof(str->rg_reserved));
>   }
> @@ -1119,6 +1120,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)
>   	return count;
>   }
>   
> +/**
> + * Set the rg_next field if this isn't the final rgrp.
> + */
> +static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd)
> +{
> +	struct gfs2_sbd *sdp = rgd->rd_sbd;
> +	struct buffer_head *bh = rgd->rd_bits[0].bi_bh;
> +	struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd);
> +
> +	if (next == NULL || next->rd_addr <= rgd->rd_addr)
> +		return;
> +
> +	if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0)
> +		return;
> +
> +	rgd->rd_skip = next->rd_addr - rgd->rd_addr;
> +	gfs2_trans_add_meta(rgd->rd_gl, bh);
> +	gfs2_rgrp_out(rgd, bh->b_data);
> +	gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data);
> +	gfs2_trans_end(sdp);
> +	return;
> +}
>   
>   /**
>    * gfs2_rgrp_bh_get - Read in a RG's header and bitmaps
> @@ -1184,6 +1207,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
>   		if (rgd->rd_rgl->rl_unlinked == 0)
>   			rgd->rd_flags &= ~GFS2_RDF_CHECK;
>   	}
> +	if (rgd->rd_skip == 0 && !(sdp->sd_vfs->s_flags & MS_RDONLY))
> +		gfs2_rgrp_set_skip(rgd);
>   	return 0;
This looks like it can be called under a read lock. It might be better 
to move the setting of the field so that we only set it when we are 
going to write the rgrp anyway. That will remove the additional 
overhead, and also avoid the complication of trying to figure out when 
it is ok to set it. You could also avoid storing it directly by 
calculating it on write, and using the existing rindex fields in struct 
gfs2_rgrpd - no need to duplicate the information there.


>   
>   fail:
> diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
> index 7c4be77..0064381f 100644
> --- a/include/uapi/linux/gfs2_ondisk.h
> +++ b/include/uapi/linux/gfs2_ondisk.h
> @@ -186,7 +186,10 @@ struct gfs2_rgrp {
>   	__be32 rg_flags;
>   	__be32 rg_free;
>   	__be32 rg_dinodes;
> -	__be32 __pad;
> +	union {
> +		__be32 __pad;
> +		__be32 rg_skip; /* Distance to the next rgrp in fs blocks */
> +	};
>   	__be64 rg_igeneration;
>   
>   	__u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */
That looks like a good place to fit rg_skip in, but gfs2_rindex contains 
ri_data0, ri_data and ri_bitbytes and to allow the eventual removal of 
the rindex, those need also to be duplicated into the rgrp. There is 
plenty of space there to fit those fields in, so we should do that at 
the same time. We could also take the opportunity to add a checksum too, 
since rg_skip being non-zero would tell us whether we should be checking 
the checksum - I don't think that should be too tricky to do, but we 
could always add it later too, if required.

Other thoughts... if we also added a field to give us the size of the 
following rgrp in disk blocks, then we could do proper readahead on 
them. So as well as rg_skip we would need rg_next_len too, which would 
be the same as ri_length of the following rgrp header,

Steve.



  parent reply	other threads:[~2017-02-14 10:32 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-12 11:39 [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups Andrew Price
2016-01-12 11:47 ` Steven Whitehouse
2016-01-12 12:03   ` [Cluster-devel] [PATCH v2] " Andrew Price
2016-01-12 15:50     ` Bob Peterson
2017-02-02 14:57       ` Andrew Price
2017-02-13 17:59       ` [Cluster-devel] [PATCH v3] " Andrew Price
2017-02-13 18:06         ` Andrew Price
2017-02-14 10:32         ` Steven Whitehouse [this message]
2017-02-15 18:15           ` Andreas Gruenbacher
2016-01-12 12:39 ` [Cluster-devel] [PATCH] " Andreas Gruenbacher
2016-01-12 13:12   ` Andrew Price
2016-01-12 13:30     ` Andreas Gruenbacher
2016-01-12 15:23       ` Andrew Price
2017-02-15 17:40         ` Andreas Gruenbacher

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9def9caa-4eae-0f8d-ae07-eec7672f6013@redhat.com \
    --to=swhiteho@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.