* [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups
@ 2016-01-12 11:39 Andrew Price
2016-01-12 11:47 ` Steven Whitehouse
2016-01-12 12:39 ` [Cluster-devel] [PATCH] " Andreas Gruenbacher
0 siblings, 2 replies; 14+ messages in thread
From: Andrew Price @ 2016-01-12 11:39 UTC (permalink / raw)
To: cluster-devel.redhat.com
Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
rg_skip field has the following meaning:
- If rg_skip is zero, it is considered unset and not useful.
- If rg_skip is non-zero, its value will be the number of blocks between
this rgrp's address and the next rgrp's address. This can be used as a
hint by fsck.gfs2 when rebuilding a bad rindex, for example.
When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
to be 0 it will attempt to set it to the difference between its rd_addr
and the rd_addr of the next resource group.
The only special case is the final rgrp, which always has a rg_skip of
0. It is not set to a special value (like -1) because, when the
filesystem is grown, the rgrp will no longer be the final one and it
will then need to have its rg_skip field set. The overhead of this
special case is a gfs2_rgrpd_get_next() call each time
gfs2_rgrp_bh_get() is called for the final resource group.
For the other resource groups, if the rg_skip field is 0, it is set
appropriately and then the only overhead becomes the rgd->rg_skip == 0
comparison in gfs2_rgrp_bh_get().
Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
the rg_skip field can get set back to 0 in cases where nodes with and
without this patch are mixed in a cluster. In some cases, the field may
bounce between being set by one node and then zeroed by another which
may harm performance slightly, e.g. when two nodes create many small
files. In testing this situation is rare but it becomes more likely as
the filesystem fills up and there are fewer resource groups to choose
from. The problem goes away when all nodes are running with this patch.
Dipping into the space currently occupied by the rg_reserved field would
have resulted in the same problem as it is also explicitly zeroed, so
unfortunately there is no other way around it.
Signed-off-by: Andrew Price <anprice@redhat.com>
---
fs/gfs2/incore.h | 1 +
fs/gfs2/rgrp.c | 27 ++++++++++++++++++++++++++-
include/uapi/linux/gfs2_ondisk.h | 2 +-
3 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 845fb09..84cc1fd 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -88,6 +88,7 @@ struct gfs2_rgrpd {
u32 rd_reserved; /* number of blocks reserved */
u32 rd_free_clone;
u32 rd_dinodes;
+ u32 rd_skip; /* Distance to the next rgrp in fs blocks */
u64 rd_igeneration;
struct gfs2_bitmap *rd_bits;
struct gfs2_sbd *rd_sbd;
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 07c0265..9779258 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1048,6 +1048,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd, const void *buf)
rgd->rd_flags |= rg_flags;
rgd->rd_free = be32_to_cpu(str->rg_free);
rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes);
+ rgd->rd_skip = be32_to_cpu(str->rg_skip);
rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration);
}
@@ -1058,7 +1059,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd, void *buf)
str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK);
str->rg_free = cpu_to_be32(rgd->rd_free);
str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes);
- str->__pad = cpu_to_be32(0);
+ str->rg_skip = cpu_to_be32(rgd->rd_skip);
str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration);
memset(&str->rg_reserved, 0, sizeof(str->rg_reserved));
}
@@ -1118,6 +1119,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)
return count;
}
+/**
+ * Set the rg_next field if this isn't the final rgrp.
+ */
+static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd)
+{
+ struct gfs2_sbd *sdp = rgd->rd_sbd;
+ struct buffer_head *bh = rgd->rd_bits[0].bi_bh;
+ struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd);
+
+ if (next == NULL || next->rd_addr <= rgd->rd_addr)
+ return;
+
+ if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0)
+ return;
+
+ rgd->rd_skip = next->rd_addr - rgd->rd_addr;
+ gfs2_trans_add_meta(rgd->rd_gl, bh);
+ gfs2_rgrp_out(rgd, bh->b_data);
+ gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data);
+ gfs2_trans_end(sdp);
+ return;
+}
/**
* gfs2_rgrp_bh_get - Read in a RG's header and bitmaps
@@ -1183,6 +1206,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
if (rgd->rd_rgl->rl_unlinked == 0)
rgd->rd_flags &= ~GFS2_RDF_CHECK;
}
+ if (rgd->rd_skip == 0)
+ gfs2_rgrp_set_skip(rgd);
return 0;
fail:
diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
index 7c4be77..a35c26c 100644
--- a/include/uapi/linux/gfs2_ondisk.h
+++ b/include/uapi/linux/gfs2_ondisk.h
@@ -186,7 +186,7 @@ struct gfs2_rgrp {
__be32 rg_flags;
__be32 rg_free;
__be32 rg_dinodes;
- __be32 __pad;
+ __be32 rg_skip; /* Distance to the next rgrp in fs blocks */
__be64 rg_igeneration;
__u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */
--
2.4.3
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 11:39 [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups Andrew Price
@ 2016-01-12 11:47 ` Steven Whitehouse
2016-01-12 12:03 ` [Cluster-devel] [PATCH v2] " Andrew Price
2016-01-12 12:39 ` [Cluster-devel] [PATCH] " Andreas Gruenbacher
1 sibling, 1 reply; 14+ messages in thread
From: Steven Whitehouse @ 2016-01-12 11:47 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 12/01/16 11:39, Andrew Price wrote:
> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
> rg_skip field has the following meaning:
>
> - If rg_skip is zero, it is considered unset and not useful.
> - If rg_skip is non-zero, its value will be the number of blocks between
> this rgrp's address and the next rgrp's address. This can be used as a
> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>
> When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
> to be 0 it will attempt to set it to the difference between its rd_addr
> and the rd_addr of the next resource group.
>
> The only special case is the final rgrp, which always has a rg_skip of
> 0. It is not set to a special value (like -1) because, when the
> filesystem is grown, the rgrp will no longer be the final one and it
> will then need to have its rg_skip field set. The overhead of this
> special case is a gfs2_rgrpd_get_next() call each time
> gfs2_rgrp_bh_get() is called for the final resource group.
>
> For the other resource groups, if the rg_skip field is 0, it is set
> appropriately and then the only overhead becomes the rgd->rg_skip == 0
> comparison in gfs2_rgrp_bh_get().
>
> Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
> the rg_skip field can get set back to 0 in cases where nodes with and
> without this patch are mixed in a cluster. In some cases, the field may
> bounce between being set by one node and then zeroed by another which
> may harm performance slightly, e.g. when two nodes create many small
> files. In testing this situation is rare but it becomes more likely as
> the filesystem fills up and there are fewer resource groups to choose
> from. The problem goes away when all nodes are running with this patch.
> Dipping into the space currently occupied by the rg_reserved field would
> have resulted in the same problem as it is also explicitly zeroed, so
> unfortunately there is no other way around it.
>
> Signed-off-by: Andrew Price <anprice@redhat.com>
> ---
> fs/gfs2/incore.h | 1 +
> fs/gfs2/rgrp.c | 27 ++++++++++++++++++++++++++-
> include/uapi/linux/gfs2_ondisk.h | 2 +-
> 3 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 845fb09..84cc1fd 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -88,6 +88,7 @@ struct gfs2_rgrpd {
> u32 rd_reserved; /* number of blocks reserved */
> u32 rd_free_clone;
> u32 rd_dinodes;
> + u32 rd_skip; /* Distance to the next rgrp in fs blocks */
> u64 rd_igeneration;
> struct gfs2_bitmap *rd_bits;
> struct gfs2_sbd *rd_sbd;
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 07c0265..9779258 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1048,6 +1048,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd, const void *buf)
> rgd->rd_flags |= rg_flags;
> rgd->rd_free = be32_to_cpu(str->rg_free);
> rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes);
> + rgd->rd_skip = be32_to_cpu(str->rg_skip);
> rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration);
> }
>
> @@ -1058,7 +1059,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd, void *buf)
> str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK);
> str->rg_free = cpu_to_be32(rgd->rd_free);
> str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes);
> - str->__pad = cpu_to_be32(0);
> + str->rg_skip = cpu_to_be32(rgd->rd_skip);
> str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration);
> memset(&str->rg_reserved, 0, sizeof(str->rg_reserved));
> }
> @@ -1118,6 +1119,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)
> return count;
> }
>
> +/**
> + * Set the rg_next field if this isn't the final rgrp.
> + */
> +static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd)
> +{
> + struct gfs2_sbd *sdp = rgd->rd_sbd;
> + struct buffer_head *bh = rgd->rd_bits[0].bi_bh;
> + struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd);
> +
> + if (next == NULL || next->rd_addr <= rgd->rd_addr)
> + return;
> +
> + if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0)
> + return;
> +
> + rgd->rd_skip = next->rd_addr - rgd->rd_addr;
> + gfs2_trans_add_meta(rgd->rd_gl, bh);
> + gfs2_rgrp_out(rgd, bh->b_data);
> + gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data);
> + gfs2_trans_end(sdp);
> + return;
> +}
>
> /**
> * gfs2_rgrp_bh_get - Read in a RG's header and bitmaps
> @@ -1183,6 +1206,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
> if (rgd->rd_rgl->rl_unlinked == 0)
> rgd->rd_flags &= ~GFS2_RDF_CHECK;
> }
> + if (rgd->rd_skip == 0)
> + gfs2_rgrp_set_skip(rgd);
> return 0;
>
> fail:
> diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
> index 7c4be77..a35c26c 100644
> --- a/include/uapi/linux/gfs2_ondisk.h
> +++ b/include/uapi/linux/gfs2_ondisk.h
> @@ -186,7 +186,7 @@ struct gfs2_rgrp {
> __be32 rg_flags;
> __be32 rg_free;
> __be32 rg_dinodes;
> - __be32 __pad;
> + __be32 rg_skip; /* Distance to the next rgrp in fs blocks */
This should be:
union {
__be32 __pad;
__be32 rg_skip;
};
To avoid breaking any userland programs which might be using this header
file too. Otherwise that looks good, and I think this is a good step
forward. We will need to update gfs2_convert, gfs2_grow and fsck.gfs2 as
to do use the new field as well,
Steve.
> __be64 rg_igeneration;
>
> __u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH v2] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 11:47 ` Steven Whitehouse
@ 2016-01-12 12:03 ` Andrew Price
2016-01-12 15:50 ` Bob Peterson
0 siblings, 1 reply; 14+ messages in thread
From: Andrew Price @ 2016-01-12 12:03 UTC (permalink / raw)
To: cluster-devel.redhat.com
Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
rg_skip field has the following meaning:
- If rg_skip is zero, it is considered unset and not useful.
- If rg_skip is non-zero, its value will be the number of blocks between
this rgrp's address and the next rgrp's address. This can be used as a
hint by fsck.gfs2 when rebuilding a bad rindex, for example.
When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
to be 0 it will attempt to set it to the difference between its rd_addr
and the rd_addr of the next resource group.
The only special case is the final rgrp, which always has a rg_skip of
0. It is not set to a special value (like -1) because, when the
filesystem is grown, the rgrp will no longer be the final one and it
will then need to have its rg_skip field set. The overhead of this
special case is a gfs2_rgrpd_get_next() call each time
gfs2_rgrp_bh_get() is called for the final resource group.
For the other resource groups, if the rg_skip field is 0, it is set
appropriately and then the only overhead becomes the rgd->rg_skip == 0
comparison in gfs2_rgrp_bh_get().
Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
the rg_skip field can get set back to 0 in cases where nodes with and
without this patch are mixed in a cluster. In some cases, the field may
bounce between being set by one node and then zeroed by another which
may harm performance slightly, e.g. when two nodes create many small
files. In testing this situation is rare but it becomes more likely as
the filesystem fills up and there are fewer resource groups to choose
from. The problem goes away when all nodes are running with this patch.
Dipping into the space currently occupied by the rg_reserved field would
have resulted in the same problem as it is also explicitly zeroed, so
unfortunately there is no other way around it.
Signed-off-by: Andrew Price <anprice@redhat.com>
---
fs/gfs2/incore.h | 1 +
fs/gfs2/rgrp.c | 27 ++++++++++++++++++++++++++-
include/uapi/linux/gfs2_ondisk.h | 5 ++++-
3 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 845fb09..84cc1fd 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -88,6 +88,7 @@ struct gfs2_rgrpd {
u32 rd_reserved; /* number of blocks reserved */
u32 rd_free_clone;
u32 rd_dinodes;
+ u32 rd_skip; /* Distance to the next rgrp in fs blocks */
u64 rd_igeneration;
struct gfs2_bitmap *rd_bits;
struct gfs2_sbd *rd_sbd;
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 07c0265..9779258 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1048,6 +1048,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd, const void *buf)
rgd->rd_flags |= rg_flags;
rgd->rd_free = be32_to_cpu(str->rg_free);
rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes);
+ rgd->rd_skip = be32_to_cpu(str->rg_skip);
rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration);
}
@@ -1058,7 +1059,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd, void *buf)
str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK);
str->rg_free = cpu_to_be32(rgd->rd_free);
str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes);
- str->__pad = cpu_to_be32(0);
+ str->rg_skip = cpu_to_be32(rgd->rd_skip);
str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration);
memset(&str->rg_reserved, 0, sizeof(str->rg_reserved));
}
@@ -1118,6 +1119,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)
return count;
}
+/**
+ * Set the rg_next field if this isn't the final rgrp.
+ */
+static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd)
+{
+ struct gfs2_sbd *sdp = rgd->rd_sbd;
+ struct buffer_head *bh = rgd->rd_bits[0].bi_bh;
+ struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd);
+
+ if (next == NULL || next->rd_addr <= rgd->rd_addr)
+ return;
+
+ if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0)
+ return;
+
+ rgd->rd_skip = next->rd_addr - rgd->rd_addr;
+ gfs2_trans_add_meta(rgd->rd_gl, bh);
+ gfs2_rgrp_out(rgd, bh->b_data);
+ gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data);
+ gfs2_trans_end(sdp);
+ return;
+}
/**
* gfs2_rgrp_bh_get - Read in a RG's header and bitmaps
@@ -1183,6 +1206,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
if (rgd->rd_rgl->rl_unlinked == 0)
rgd->rd_flags &= ~GFS2_RDF_CHECK;
}
+ if (rgd->rd_skip == 0)
+ gfs2_rgrp_set_skip(rgd);
return 0;
fail:
diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
index 7c4be77..0064381f 100644
--- a/include/uapi/linux/gfs2_ondisk.h
+++ b/include/uapi/linux/gfs2_ondisk.h
@@ -186,7 +186,10 @@ struct gfs2_rgrp {
__be32 rg_flags;
__be32 rg_free;
__be32 rg_dinodes;
- __be32 __pad;
+ union {
+ __be32 __pad;
+ __be32 rg_skip; /* Distance to the next rgrp in fs blocks */
+ };
__be64 rg_igeneration;
__u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */
--
2.4.3
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 11:39 [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups Andrew Price
2016-01-12 11:47 ` Steven Whitehouse
@ 2016-01-12 12:39 ` Andreas Gruenbacher
2016-01-12 13:12 ` Andrew Price
1 sibling, 1 reply; 14+ messages in thread
From: Andreas Gruenbacher @ 2016-01-12 12:39 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Tue, Jan 12, 2016 at 12:39 PM, Andrew Price <anprice@redhat.com> wrote:
> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
> rg_skip field has the following meaning:
>
> - If rg_skip is zero, it is considered unset and not useful.
> - If rg_skip is non-zero, its value will be the number of blocks between
> this rgrp's address and the next rgrp's address. This can be used as a
> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
How is that better than a resource group size field that wouldn't need
to special case the last resource group?
Thanks,
Andreas
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 12:39 ` [Cluster-devel] [PATCH] " Andreas Gruenbacher
@ 2016-01-12 13:12 ` Andrew Price
2016-01-12 13:30 ` Andreas Gruenbacher
0 siblings, 1 reply; 14+ messages in thread
From: Andrew Price @ 2016-01-12 13:12 UTC (permalink / raw)
To: cluster-devel.redhat.com
On 12/01/16 12:39, Andreas Gruenbacher wrote:
> On Tue, Jan 12, 2016 at 12:39 PM, Andrew Price <anprice@redhat.com> wrote:
>> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
>> rg_skip field has the following meaning:
>>
>> - If rg_skip is zero, it is considered unset and not useful.
>> - If rg_skip is non-zero, its value will be the number of blocks between
>> this rgrp's address and the next rgrp's address. This can be used as a
>> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>
> How is that better than a resource group size field that wouldn't need
> to special case the last resource group?
The last block of the mapped space in a resource group (ri_data0 +
ri_data) is not necessarily at the block address immediately preceding
the next resource group, due to device alignment and bitmap rounding.
Also, if we assume we know the position of the first new resource group
that would be created by gfs2_grow then we might get it wrong, so we
can't set the final resource group's rg_skip to a useful value.
Andy
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 13:12 ` Andrew Price
@ 2016-01-12 13:30 ` Andreas Gruenbacher
2016-01-12 15:23 ` Andrew Price
0 siblings, 1 reply; 14+ messages in thread
From: Andreas Gruenbacher @ 2016-01-12 13:30 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Tue, Jan 12, 2016 at 2:12 PM, Andrew Price <anprice@redhat.com> wrote:
> On 12/01/16 12:39, Andreas Gruenbacher wrote:
>>
>> On Tue, Jan 12, 2016 at 12:39 PM, Andrew Price <anprice@redhat.com> wrote:
>>>
>>> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
>>> rg_skip field has the following meaning:
>>>
>>> - If rg_skip is zero, it is considered unset and not useful.
>>> - If rg_skip is non-zero, its value will be the number of blocks between
>>> this rgrp's address and the next rgrp's address. This can be used as a
>>> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>>
>>
>> How is that better than a resource group size field that wouldn't need
>> to special case the last resource group?
>
>
> The last block of the mapped space in a resource group (ri_data0 + ri_data)
> is not necessarily at the block address immediately preceding the next
> resource group, due to device alignment and bitmap rounding. Also, if we
> assume we know the position of the first new resource group that would be
> created by gfs2_grow then we might get it wrong, so we can't set the final
> resource group's rg_skip to a useful value.
Okay, I can understand that it makes sense to leave alignment details
to mkfs / growfs.
Is it guaranteed that the kernel will never set the new rg_skip field
if it hasn't verified that the next resource group is where it thinks
it is? Otherwise we could end up with incorrect fsck "hints" for
corrupted filesystems, which would make things worse.
Thanks,
Andreas
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 13:30 ` Andreas Gruenbacher
@ 2016-01-12 15:23 ` Andrew Price
2017-02-15 17:40 ` Andreas Gruenbacher
0 siblings, 1 reply; 14+ messages in thread
From: Andrew Price @ 2016-01-12 15:23 UTC (permalink / raw)
To: cluster-devel.redhat.com
On 12/01/16 13:30, Andreas Gruenbacher wrote:
> Is it guaranteed that the kernel will never set the new rg_skip field
> if it hasn't verified that the next resource group is where it thinks
> it is?
No, but rg_skip is set from the ri_addr of the next resource group in
the rindex and gfs2 generally trusts the rindex so if it's corrupt then
there'll be other problems down the line.
> Otherwise we could end up with incorrect fsck "hints" for
> corrupted filesystems, which would make things worse.
I don't think it would make things worse. fsck.gfs2 will be able to
check whether the rg_skip field is correct by checking it against the
rindex and whether there's an rgrp where it's pointing (and checking
that it's within a sensible range). Before, we only had the rindex and
the rgrp header to check against each other but now we have a third factor.
That said, the fsck.gfs2 case is only one of the ways this would be
useful. The original plan was to generally reduce dependence on the
rindex, effectively allowing us to iterate over all rgrps without
reading it in. Though I don't recall the specific issue which motivated
it, I can think of cases where it would improve things.
Andy
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH v2] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 12:03 ` [Cluster-devel] [PATCH v2] " Andrew Price
@ 2016-01-12 15:50 ` Bob Peterson
2017-02-02 14:57 ` Andrew Price
2017-02-13 17:59 ` [Cluster-devel] [PATCH v3] " Andrew Price
0 siblings, 2 replies; 14+ messages in thread
From: Bob Peterson @ 2016-01-12 15:50 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
> rg_skip field has the following meaning:
>
> - If rg_skip is zero, it is considered unset and not useful.
> - If rg_skip is non-zero, its value will be the number of blocks between
> this rgrp's address and the next rgrp's address. This can be used as a
> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>
> When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
> to be 0 it will attempt to set it to the difference between its rd_addr
> and the rd_addr of the next resource group.
>
> The only special case is the final rgrp, which always has a rg_skip of
> 0. It is not set to a special value (like -1) because, when the
> filesystem is grown, the rgrp will no longer be the final one and it
> will then need to have its rg_skip field set. The overhead of this
> special case is a gfs2_rgrpd_get_next() call each time
> gfs2_rgrp_bh_get() is called for the final resource group.
>
> For the other resource groups, if the rg_skip field is 0, it is set
> appropriately and then the only overhead becomes the rgd->rg_skip == 0
> comparison in gfs2_rgrp_bh_get().
>
> Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
> the rg_skip field can get set back to 0 in cases where nodes with and
> without this patch are mixed in a cluster. In some cases, the field may
> bounce between being set by one node and then zeroed by another which
> may harm performance slightly, e.g. when two nodes create many small
> files. In testing this situation is rare but it becomes more likely as
> the filesystem fills up and there are fewer resource groups to choose
> from. The problem goes away when all nodes are running with this patch.
> Dipping into the space currently occupied by the rg_reserved field would
> have resulted in the same problem as it is also explicitly zeroed, so
> unfortunately there is no other way around it.
>
> Signed-off-by: Andrew Price <anprice@redhat.com>
Hi Andy,
I've been talking about doing something like this for years, so it's
good to see someone finally acting on it.
Although this is a good first stab at the solution, my main concern about
this implementation is that, AFAICT, it doesn't take read-only mounts into
account. In fact, a "spectator" mount might even cause it to BUG_ON from
gfs2_trans_begin, since there's no journal. But it's close.
Regards,
Bob Peterson
Red Hat File Systems
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH v2] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 15:50 ` Bob Peterson
@ 2017-02-02 14:57 ` Andrew Price
2017-02-13 17:59 ` [Cluster-devel] [PATCH v3] " Andrew Price
1 sibling, 0 replies; 14+ messages in thread
From: Andrew Price @ 2017-02-02 14:57 UTC (permalink / raw)
To: cluster-devel.redhat.com
So it's about time I revived this patch and got it finalised...
On 12/01/16 15:50, Bob Peterson wrote:
> ----- Original Message -----
>> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
>> rg_skip field has the following meaning:
>>
>> - If rg_skip is zero, it is considered unset and not useful.
>> - If rg_skip is non-zero, its value will be the number of blocks between
>> this rgrp's address and the next rgrp's address. This can be used as a
>> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>>
>> When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
>> to be 0 it will attempt to set it to the difference between its rd_addr
>> and the rd_addr of the next resource group.
>>
>> The only special case is the final rgrp, which always has a rg_skip of
>> 0. It is not set to a special value (like -1) because, when the
>> filesystem is grown, the rgrp will no longer be the final one and it
>> will then need to have its rg_skip field set. The overhead of this
>> special case is a gfs2_rgrpd_get_next() call each time
>> gfs2_rgrp_bh_get() is called for the final resource group.
>>
>> For the other resource groups, if the rg_skip field is 0, it is set
>> appropriately and then the only overhead becomes the rgd->rg_skip == 0
>> comparison in gfs2_rgrp_bh_get().
>>
>> Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
>> the rg_skip field can get set back to 0 in cases where nodes with and
>> without this patch are mixed in a cluster. In some cases, the field may
>> bounce between being set by one node and then zeroed by another which
>> may harm performance slightly, e.g. when two nodes create many small
>> files. In testing this situation is rare but it becomes more likely as
>> the filesystem fills up and there are fewer resource groups to choose
>> from. The problem goes away when all nodes are running with this patch.
>> Dipping into the space currently occupied by the rg_reserved field would
>> have resulted in the same problem as it is also explicitly zeroed, so
>> unfortunately there is no other way around it.
>>
>> Signed-off-by: Andrew Price <anprice@redhat.com>
>
> Hi Andy,
>
> I've been talking about doing something like this for years, so it's
> good to see someone finally acting on it.
>
> Although this is a good first stab at the solution, my main concern about
> this implementation is that, AFAICT, it doesn't take read-only mounts into
> account. In fact, a "spectator" mount might even cause it to BUG_ON from
> gfs2_trans_begin, since there's no journal. But it's close.
I've been testing but I haven't found a way to trigger that BUG_ON yet.
Are there any syscalls I can exercise that make gfs2 read in the rgrp
headers in spectator mode? I'm guessing the reason I haven't been able
to trigger it is because they're just doing path lookups and don't need
to look up allocation states.
Andy
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH v3] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 15:50 ` Bob Peterson
2017-02-02 14:57 ` Andrew Price
@ 2017-02-13 17:59 ` Andrew Price
2017-02-13 18:06 ` Andrew Price
2017-02-14 10:32 ` Steven Whitehouse
1 sibling, 2 replies; 14+ messages in thread
From: Andrew Price @ 2017-02-13 17:59 UTC (permalink / raw)
To: cluster-devel.redhat.com
Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
rg_skip field has the following meaning:
- If rg_skip is zero, it is considered unset and not useful.
- If rg_skip is non-zero, its value will be the number of blocks between
this rgrp's address and the next rgrp's address. This can be used as a
hint by fsck.gfs2 when rebuilding a bad rindex, for example.
When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
to be 0 it will attempt to set it to the difference between its rd_addr
and the rd_addr of the next resource group.
The only special case is the final rgrp, which always has a rg_skip of
0. It is not set to a special value (like -1) because, when the
filesystem is grown, the rgrp will no longer be the final one and it
will then need to have its rg_skip field set. The overhead of this
special case is a gfs2_rgrpd_get_next() call each time
gfs2_rgrp_bh_get() is called for the final resource group.
For the other resource groups, if the rg_skip field is 0, it is set
appropriately and then the only overhead becomes the rgd->rg_skip == 0
comparison in gfs2_rgrp_bh_get().
Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
the rg_skip field can get set back to 0 in cases where nodes with and
without this patch are mixed in a cluster. In some cases, the field may
bounce between being set by one node and then zeroed by another which
may harm performance slightly, e.g. when two nodes create many small
files. In testing this situation is rare but it becomes more likely as
the filesystem fills up and there are fewer resource groups to choose
from. The problem goes away when all nodes are running with this patch.
Dipping into the space currently occupied by the rg_reserved field would
have resulted in the same problem as it is also explicitly zeroed, so
unfortunately there is no other way around it.
Signed-off-by: Andrew Price <anprice@redhat.com>
---
fs/gfs2/incore.h | 1 +
fs/gfs2/rgrp.c | 27 ++++++++++++++++++++++++++-
include/uapi/linux/gfs2_ondisk.h | 5 ++++-
3 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index a6a3389..2c03287 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -88,6 +88,7 @@ struct gfs2_rgrpd {
u32 rd_reserved; /* number of blocks reserved */
u32 rd_free_clone;
u32 rd_dinodes;
+ u32 rd_skip; /* Distance to the next rgrp in fs blocks */
u64 rd_igeneration;
struct gfs2_bitmap *rd_bits;
struct gfs2_sbd *rd_sbd;
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 86ccc015..aaf435d 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1049,6 +1049,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd, const void *buf)
rgd->rd_flags |= rg_flags;
rgd->rd_free = be32_to_cpu(str->rg_free);
rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes);
+ rgd->rd_skip = be32_to_cpu(str->rg_skip);
rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration);
}
@@ -1059,7 +1060,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd, void *buf)
str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK);
str->rg_free = cpu_to_be32(rgd->rd_free);
str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes);
- str->__pad = cpu_to_be32(0);
+ str->rg_skip = cpu_to_be32(rgd->rd_skip);
str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration);
memset(&str->rg_reserved, 0, sizeof(str->rg_reserved));
}
@@ -1119,6 +1120,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)
return count;
}
+/**
+ * Set the rg_next field if this isn't the final rgrp.
+ */
+static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd)
+{
+ struct gfs2_sbd *sdp = rgd->rd_sbd;
+ struct buffer_head *bh = rgd->rd_bits[0].bi_bh;
+ struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd);
+
+ if (next == NULL || next->rd_addr <= rgd->rd_addr)
+ return;
+
+ if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0)
+ return;
+
+ rgd->rd_skip = next->rd_addr - rgd->rd_addr;
+ gfs2_trans_add_meta(rgd->rd_gl, bh);
+ gfs2_rgrp_out(rgd, bh->b_data);
+ gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data);
+ gfs2_trans_end(sdp);
+ return;
+}
/**
* gfs2_rgrp_bh_get - Read in a RG's header and bitmaps
@@ -1184,6 +1207,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
if (rgd->rd_rgl->rl_unlinked == 0)
rgd->rd_flags &= ~GFS2_RDF_CHECK;
}
+ if (rgd->rd_skip == 0 && !(sdp->sd_vfs->s_flags & MS_RDONLY))
+ gfs2_rgrp_set_skip(rgd);
return 0;
fail:
diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
index 7c4be77..0064381f 100644
--- a/include/uapi/linux/gfs2_ondisk.h
+++ b/include/uapi/linux/gfs2_ondisk.h
@@ -186,7 +186,10 @@ struct gfs2_rgrp {
__be32 rg_flags;
__be32 rg_free;
__be32 rg_dinodes;
- __be32 __pad;
+ union {
+ __be32 __pad;
+ __be32 rg_skip; /* Distance to the next rgrp in fs blocks */
+ };
__be64 rg_igeneration;
__u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */
--
2.9.3
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH v3] GFS2: Add a next-resource-group pointer to resource groups
2017-02-13 17:59 ` [Cluster-devel] [PATCH v3] " Andrew Price
@ 2017-02-13 18:06 ` Andrew Price
2017-02-14 10:32 ` Steven Whitehouse
1 sibling, 0 replies; 14+ messages in thread
From: Andrew Price @ 2017-02-13 18:06 UTC (permalink / raw)
To: cluster-devel.redhat.com
I should have mentioned: v3 just adds a check for MS_RDONLY and only
updates the rg_skip value if the fs is rw.
Andy
On 13/02/17 17:59, Andrew Price wrote:
> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
> rg_skip field has the following meaning:
>
> - If rg_skip is zero, it is considered unset and not useful.
> - If rg_skip is non-zero, its value will be the number of blocks between
> this rgrp's address and the next rgrp's address. This can be used as a
> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>
> When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
> to be 0 it will attempt to set it to the difference between its rd_addr
> and the rd_addr of the next resource group.
>
> The only special case is the final rgrp, which always has a rg_skip of
> 0. It is not set to a special value (like -1) because, when the
> filesystem is grown, the rgrp will no longer be the final one and it
> will then need to have its rg_skip field set. The overhead of this
> special case is a gfs2_rgrpd_get_next() call each time
> gfs2_rgrp_bh_get() is called for the final resource group.
>
> For the other resource groups, if the rg_skip field is 0, it is set
> appropriately and then the only overhead becomes the rgd->rg_skip == 0
> comparison in gfs2_rgrp_bh_get().
>
> Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
> the rg_skip field can get set back to 0 in cases where nodes with and
> without this patch are mixed in a cluster. In some cases, the field may
> bounce between being set by one node and then zeroed by another which
> may harm performance slightly, e.g. when two nodes create many small
> files. In testing this situation is rare but it becomes more likely as
> the filesystem fills up and there are fewer resource groups to choose
> from. The problem goes away when all nodes are running with this patch.
> Dipping into the space currently occupied by the rg_reserved field would
> have resulted in the same problem as it is also explicitly zeroed, so
> unfortunately there is no other way around it.
>
> Signed-off-by: Andrew Price <anprice@redhat.com>
> ---
> fs/gfs2/incore.h | 1 +
> fs/gfs2/rgrp.c | 27 ++++++++++++++++++++++++++-
> include/uapi/linux/gfs2_ondisk.h | 5 ++++-
> 3 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index a6a3389..2c03287 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -88,6 +88,7 @@ struct gfs2_rgrpd {
> u32 rd_reserved; /* number of blocks reserved */
> u32 rd_free_clone;
> u32 rd_dinodes;
> + u32 rd_skip; /* Distance to the next rgrp in fs blocks */
> u64 rd_igeneration;
> struct gfs2_bitmap *rd_bits;
> struct gfs2_sbd *rd_sbd;
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 86ccc015..aaf435d 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1049,6 +1049,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd, const void *buf)
> rgd->rd_flags |= rg_flags;
> rgd->rd_free = be32_to_cpu(str->rg_free);
> rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes);
> + rgd->rd_skip = be32_to_cpu(str->rg_skip);
> rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration);
> }
>
> @@ -1059,7 +1060,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd, void *buf)
> str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK);
> str->rg_free = cpu_to_be32(rgd->rd_free);
> str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes);
> - str->__pad = cpu_to_be32(0);
> + str->rg_skip = cpu_to_be32(rgd->rd_skip);
> str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration);
> memset(&str->rg_reserved, 0, sizeof(str->rg_reserved));
> }
> @@ -1119,6 +1120,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)
> return count;
> }
>
> +/**
> + * Set the rg_next field if this isn't the final rgrp.
> + */
> +static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd)
> +{
> + struct gfs2_sbd *sdp = rgd->rd_sbd;
> + struct buffer_head *bh = rgd->rd_bits[0].bi_bh;
> + struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd);
> +
> + if (next == NULL || next->rd_addr <= rgd->rd_addr)
> + return;
> +
> + if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0)
> + return;
> +
> + rgd->rd_skip = next->rd_addr - rgd->rd_addr;
> + gfs2_trans_add_meta(rgd->rd_gl, bh);
> + gfs2_rgrp_out(rgd, bh->b_data);
> + gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data);
> + gfs2_trans_end(sdp);
> + return;
> +}
>
> /**
> * gfs2_rgrp_bh_get - Read in a RG's header and bitmaps
> @@ -1184,6 +1207,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
> if (rgd->rd_rgl->rl_unlinked == 0)
> rgd->rd_flags &= ~GFS2_RDF_CHECK;
> }
> + if (rgd->rd_skip == 0 && !(sdp->sd_vfs->s_flags & MS_RDONLY))
> + gfs2_rgrp_set_skip(rgd);
> return 0;
>
> fail:
> diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
> index 7c4be77..0064381f 100644
> --- a/include/uapi/linux/gfs2_ondisk.h
> +++ b/include/uapi/linux/gfs2_ondisk.h
> @@ -186,7 +186,10 @@ struct gfs2_rgrp {
> __be32 rg_flags;
> __be32 rg_free;
> __be32 rg_dinodes;
> - __be32 __pad;
> + union {
> + __be32 __pad;
> + __be32 rg_skip; /* Distance to the next rgrp in fs blocks */
> + };
> __be64 rg_igeneration;
>
> __u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH v3] GFS2: Add a next-resource-group pointer to resource groups
2017-02-13 17:59 ` [Cluster-devel] [PATCH v3] " Andrew Price
2017-02-13 18:06 ` Andrew Price
@ 2017-02-14 10:32 ` Steven Whitehouse
2017-02-15 18:15 ` Andreas Gruenbacher
1 sibling, 1 reply; 14+ messages in thread
From: Steven Whitehouse @ 2017-02-14 10:32 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 13/02/17 17:59, Andrew Price wrote:
> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
> rg_skip field has the following meaning:
>
> - If rg_skip is zero, it is considered unset and not useful.
> - If rg_skip is non-zero, its value will be the number of blocks between
> this rgrp's address and the next rgrp's address. This can be used as a
> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>
> When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
> to be 0 it will attempt to set it to the difference between its rd_addr
> and the rd_addr of the next resource group.
>
> The only special case is the final rgrp, which always has a rg_skip of
> 0. It is not set to a special value (like -1) because, when the
> filesystem is grown, the rgrp will no longer be the final one and it
> will then need to have its rg_skip field set. The overhead of this
> special case is a gfs2_rgrpd_get_next() call each time
> gfs2_rgrp_bh_get() is called for the final resource group.
>
> For the other resource groups, if the rg_skip field is 0, it is set
> appropriately and then the only overhead becomes the rgd->rg_skip == 0
> comparison in gfs2_rgrp_bh_get().
>
> Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
> the rg_skip field can get set back to 0 in cases where nodes with and
> without this patch are mixed in a cluster. In some cases, the field may
> bounce between being set by one node and then zeroed by another which
> may harm performance slightly, e.g. when two nodes create many small
> files. In testing this situation is rare but it becomes more likely as
> the filesystem fills up and there are fewer resource groups to choose
> from. The problem goes away when all nodes are running with this patch.
> Dipping into the space currently occupied by the rg_reserved field would
> have resulted in the same problem as it is also explicitly zeroed, so
> unfortunately there is no other way around it.
>
> Signed-off-by: Andrew Price <anprice@redhat.com>
We will also need patches for gfs2_convert, mkfs.gfs2, fsck.gfs2,
gfs2_edit and gfs2_grow too most likely. Ideally we want to be in a
situation where we can default to using only the info in the rgrp
headers, and fall back to looking at the rindex only in case that
information is not there/broken in the rgrp headers.
> ---
> fs/gfs2/incore.h | 1 +
> fs/gfs2/rgrp.c | 27 ++++++++++++++++++++++++++-
> include/uapi/linux/gfs2_ondisk.h | 5 ++++-
> 3 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index a6a3389..2c03287 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -88,6 +88,7 @@ struct gfs2_rgrpd {
> u32 rd_reserved; /* number of blocks reserved */
> u32 rd_free_clone;
> u32 rd_dinodes;
> + u32 rd_skip; /* Distance to the next rgrp in fs blocks */
> u64 rd_igeneration;
> struct gfs2_bitmap *rd_bits;
> struct gfs2_sbd *rd_sbd;
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 86ccc015..aaf435d 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1049,6 +1049,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd, const void *buf)
> rgd->rd_flags |= rg_flags;
> rgd->rd_free = be32_to_cpu(str->rg_free);
> rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes);
> + rgd->rd_skip = be32_to_cpu(str->rg_skip);
> rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration);
> }
>
> @@ -1059,7 +1060,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd, void *buf)
> str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK);
> str->rg_free = cpu_to_be32(rgd->rd_free);
> str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes);
> - str->__pad = cpu_to_be32(0);
> + str->rg_skip = cpu_to_be32(rgd->rd_skip);
> str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration);
> memset(&str->rg_reserved, 0, sizeof(str->rg_reserved));
> }
> @@ -1119,6 +1120,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)
> return count;
> }
>
> +/**
> + * Set the rg_next field if this isn't the final rgrp.
> + */
> +static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd)
> +{
> + struct gfs2_sbd *sdp = rgd->rd_sbd;
> + struct buffer_head *bh = rgd->rd_bits[0].bi_bh;
> + struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd);
> +
> + if (next == NULL || next->rd_addr <= rgd->rd_addr)
> + return;
> +
> + if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0)
> + return;
> +
> + rgd->rd_skip = next->rd_addr - rgd->rd_addr;
> + gfs2_trans_add_meta(rgd->rd_gl, bh);
> + gfs2_rgrp_out(rgd, bh->b_data);
> + gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data);
> + gfs2_trans_end(sdp);
> + return;
> +}
>
> /**
> * gfs2_rgrp_bh_get - Read in a RG's header and bitmaps
> @@ -1184,6 +1207,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
> if (rgd->rd_rgl->rl_unlinked == 0)
> rgd->rd_flags &= ~GFS2_RDF_CHECK;
> }
> + if (rgd->rd_skip == 0 && !(sdp->sd_vfs->s_flags & MS_RDONLY))
> + gfs2_rgrp_set_skip(rgd);
> return 0;
This looks like it can be called under a read lock. It might be better
to move the setting of the field so that we only set it when we are
going to write the rgrp anyway. That will remove the additional
overhead, and also avoid the complication of trying to figure out when
it is ok to set it. You could also avoid storing it directly by
calculating it on write, and using the existing rindex fields in struct
gfs2_rgrpd - no need to duplicate the information there.
>
> fail:
> diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
> index 7c4be77..0064381f 100644
> --- a/include/uapi/linux/gfs2_ondisk.h
> +++ b/include/uapi/linux/gfs2_ondisk.h
> @@ -186,7 +186,10 @@ struct gfs2_rgrp {
> __be32 rg_flags;
> __be32 rg_free;
> __be32 rg_dinodes;
> - __be32 __pad;
> + union {
> + __be32 __pad;
> + __be32 rg_skip; /* Distance to the next rgrp in fs blocks */
> + };
> __be64 rg_igeneration;
>
> __u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */
That looks like a good place to fit rg_skip in, but gfs2_rindex contains
ri_data0, ri_data and ri_bitbytes and to allow the eventual removal of
the rindex, those need also to be duplicated into the rgrp. There is
plenty of space there to fit those fields in, so we should do that at
the same time. We could also take the opportunity to add a checksum too,
since rg_skip being non-zero would tell us whether we should be checking
the checksum - I don't think that should be too tricky to do, but we
could always add it later too, if required.
Other thoughts... if we also added a field to give us the size of the
following rgrp in disk blocks, then we could do proper readahead on
them. So as well as rg_skip we would need rg_next_len too, which would
be the same as ri_length of the following rgrp header,
Steve.
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups
2016-01-12 15:23 ` Andrew Price
@ 2017-02-15 17:40 ` Andreas Gruenbacher
0 siblings, 0 replies; 14+ messages in thread
From: Andreas Gruenbacher @ 2017-02-15 17:40 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Tue, Jan 12, 2016 at 4:23 PM, Andrew Price <anprice@redhat.com> wrote:
> On 12/01/16 13:30, Andreas Gruenbacher wrote:
>>
>> Is it guaranteed that the kernel will never set the new rg_skip field
>> if it hasn't verified that the next resource group is where it thinks
>> it is?
>
>
> No, but rg_skip is set from the ri_addr of the next resource group in the
> rindex and gfs2 generally trusts the rindex so if it's corrupt then there'll
> be other problems down the line.
>
>> Otherwise we could end up with incorrect fsck "hints" for
>> corrupted filesystems, which would make things worse.
>
>
> I don't think it would make things worse. fsck.gfs2 will be able to check
> whether the rg_skip field is correct by checking it against the rindex and
> whether there's an rgrp where it's pointing (and checking that it's within a
> sensible range). Before, we only had the rindex and the rgrp header to check
> against each other but now we have a third factor.
>
> That said, the fsck.gfs2 case is only one of the ways this would be useful.
> The original plan was to generally reduce dependence on the rindex,
> effectively allowing us to iterate over all rgrps without reading it in.
> Though I don't recall the specific issue which motivated it, I can think of
> cases where it would improve things.
It would be helpful to explain the reasons for adding rg_skip in the
patch description.
Thanks,
Andreas
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Cluster-devel] [PATCH v3] GFS2: Add a next-resource-group pointer to resource groups
2017-02-14 10:32 ` Steven Whitehouse
@ 2017-02-15 18:15 ` Andreas Gruenbacher
0 siblings, 0 replies; 14+ messages in thread
From: Andreas Gruenbacher @ 2017-02-15 18:15 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Tue, Feb 14, 2017 at 11:32 AM, Steven Whitehouse <swhiteho@redhat.com> wrote:
> Hi,
>
>
> On 13/02/17 17:59, Andrew Price wrote:
>>
>> Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
>> rg_skip field has the following meaning:
>>
>> - If rg_skip is zero, it is considered unset and not useful.
>> - If rg_skip is non-zero, its value will be the number of blocks between
>> this rgrp's address and the next rgrp's address. This can be used as a
>> hint by fsck.gfs2 when rebuilding a bad rindex, for example.
>>
>> When gfs2_rgrp_bh_get() reads a resource group header and finds rg_skip
>> to be 0 it will attempt to set it to the difference between its rd_addr
>> and the rd_addr of the next resource group.
>>
>> The only special case is the final rgrp, which always has a rg_skip of
>> 0. It is not set to a special value (like -1) because, when the
>> filesystem is grown, the rgrp will no longer be the final one and it
>> will then need to have its rg_skip field set. The overhead of this
>> special case is a gfs2_rgrpd_get_next() call each time
>> gfs2_rgrp_bh_get() is called for the final resource group.
>>
>> For the other resource groups, if the rg_skip field is 0, it is set
>> appropriately and then the only overhead becomes the rgd->rg_skip == 0
>> comparison in gfs2_rgrp_bh_get().
>>
>> Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
>> the rg_skip field can get set back to 0 in cases where nodes with and
>> without this patch are mixed in a cluster. In some cases, the field may
>> bounce between being set by one node and then zeroed by another which
>> may harm performance slightly, e.g. when two nodes create many small
>> files. In testing this situation is rare but it becomes more likely as
>> the filesystem fills up and there are fewer resource groups to choose
>> from. The problem goes away when all nodes are running with this patch.
>> Dipping into the space currently occupied by the rg_reserved field would
>> have resulted in the same problem as it is also explicitly zeroed, so
>> unfortunately there is no other way around it.
>>
>> Signed-off-by: Andrew Price <anprice@redhat.com>
>
> We will also need patches for gfs2_convert, mkfs.gfs2, fsck.gfs2, gfs2_edit
> and gfs2_grow too most likely. Ideally we want to be in a situation where we
> can default to using only the info in the rgrp headers, and fall back to
> looking at the rindex only in case that information is not there/broken in
> the rgrp headers.
>
>
>
>> ---
>> fs/gfs2/incore.h | 1 +
>> fs/gfs2/rgrp.c | 27 ++++++++++++++++++++++++++-
>> include/uapi/linux/gfs2_ondisk.h | 5 ++++-
>> 3 files changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
>> index a6a3389..2c03287 100644
>> --- a/fs/gfs2/incore.h
>> +++ b/fs/gfs2/incore.h
>> @@ -88,6 +88,7 @@ struct gfs2_rgrpd {
>> u32 rd_reserved; /* number of blocks reserved */
>> u32 rd_free_clone;
>> u32 rd_dinodes;
>> + u32 rd_skip; /* Distance to the next rgrp in fs
>> blocks */
>> u64 rd_igeneration;
>> struct gfs2_bitmap *rd_bits;
>> struct gfs2_sbd *rd_sbd;
>> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
>> index 86ccc015..aaf435d 100644
>> --- a/fs/gfs2/rgrp.c
>> +++ b/fs/gfs2/rgrp.c
>> @@ -1049,6 +1049,7 @@ static void gfs2_rgrp_in(struct gfs2_rgrpd *rgd,
>> const void *buf)
>> rgd->rd_flags |= rg_flags;
>> rgd->rd_free = be32_to_cpu(str->rg_free);
>> rgd->rd_dinodes = be32_to_cpu(str->rg_dinodes);
>> + rgd->rd_skip = be32_to_cpu(str->rg_skip);
>> rgd->rd_igeneration = be64_to_cpu(str->rg_igeneration);
>> }
>> @@ -1059,7 +1060,7 @@ static void gfs2_rgrp_out(struct gfs2_rgrpd *rgd,
>> void *buf)
>> str->rg_flags = cpu_to_be32(rgd->rd_flags & ~GFS2_RDF_MASK);
>> str->rg_free = cpu_to_be32(rgd->rd_free);
>> str->rg_dinodes = cpu_to_be32(rgd->rd_dinodes);
>> - str->__pad = cpu_to_be32(0);
>> + str->rg_skip = cpu_to_be32(rgd->rd_skip);
>> str->rg_igeneration = cpu_to_be64(rgd->rd_igeneration);
>> memset(&str->rg_reserved, 0, sizeof(str->rg_reserved));
>> }
>> @@ -1119,6 +1120,28 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)
>> return count;
>> }
>> +/**
>> + * Set the rg_next field if this isn't the final rgrp.
>> + */
>> +static void gfs2_rgrp_set_skip(struct gfs2_rgrpd *rgd)
>> +{
>> + struct gfs2_sbd *sdp = rgd->rd_sbd;
>> + struct buffer_head *bh = rgd->rd_bits[0].bi_bh;
>> + struct gfs2_rgrpd *next = gfs2_rgrpd_get_next(rgd);
>> +
>> + if (next == NULL || next->rd_addr <= rgd->rd_addr)
>> + return;
>> +
>> + if (gfs2_trans_begin(sdp, RES_RG_HDR, 0) != 0)
>> + return;
>> +
>> + rgd->rd_skip = next->rd_addr - rgd->rd_addr;
>> + gfs2_trans_add_meta(rgd->rd_gl, bh);
>> + gfs2_rgrp_out(rgd, bh->b_data);
>> + gfs2_rgrp_ondisk2lvb(rgd->rd_rgl, bh->b_data);
>> + gfs2_trans_end(sdp);
>> + return;
>> +}
>> /**
>> * gfs2_rgrp_bh_get - Read in a RG's header and bitmaps
>> @@ -1184,6 +1207,8 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
>> if (rgd->rd_rgl->rl_unlinked == 0)
>> rgd->rd_flags &= ~GFS2_RDF_CHECK;
>> }
>> + if (rgd->rd_skip == 0 && !(sdp->sd_vfs->s_flags & MS_RDONLY))
>> + gfs2_rgrp_set_skip(rgd);
>> return 0;
>
> This looks like it can be called under a read lock. It might be better to
> move the setting of the field so that we only set it when we are going to
> write the rgrp anyway. That will remove the additional overhead, and also
> avoid the complication of trying to figure out when it is ok to set it.
I second that. The skip fields will be initialized eventually either
way; I don't see a need to speed that up.
> You
> could also avoid storing it directly by calculating it on write, and using
> the existing rindex fields in struct gfs2_rgrpd - no need to duplicate the
> information there.
>
>
>> fail:
>> diff --git a/include/uapi/linux/gfs2_ondisk.h
>> b/include/uapi/linux/gfs2_ondisk.h
>> index 7c4be77..0064381f 100644
>> --- a/include/uapi/linux/gfs2_ondisk.h
>> +++ b/include/uapi/linux/gfs2_ondisk.h
>> @@ -186,7 +186,10 @@ struct gfs2_rgrp {
>> __be32 rg_flags;
>> __be32 rg_free;
>> __be32 rg_dinodes;
>> - __be32 __pad;
>> + union {
>> + __be32 __pad;
>> + __be32 rg_skip; /* Distance to the next rgrp in fs blocks
>> */
>> + };
>> __be64 rg_igeneration;
>> __u8 rg_reserved[80]; /* Several fields from gfs1 now reserved */
>
> That looks like a good place to fit rg_skip in, but gfs2_rindex contains
> ri_data0, ri_data and ri_bitbytes and to allow the eventual removal of the
> rindex, those need also to be duplicated into the rgrp. There is plenty of
> space there to fit those fields in, so we should do that at the same time.
> We could also take the opportunity to add a checksum too,
Are there plans for adding metadata checksums throughout all
structures? rgrp checksums don't seem very valuable if we don't at
least also checksum the super-block.
> since rg_skip
> being non-zero would tell us whether we should be checking the checksum - I
> don't think that should be too tricky to do, but we could always add it
> later too, if required.
>
> Other thoughts... if we also added a field to give us the size of the
> following rgrp in disk blocks, then we could do proper readahead on them. So
> as well as rg_skip we would need rg_next_len too, which would be the same as
> ri_length of the following rgrp header,
>
> Steve.
Thanks,
Andreas
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2017-02-15 18:15 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-12 11:39 [Cluster-devel] [PATCH] GFS2: Add a next-resource-group pointer to resource groups Andrew Price
2016-01-12 11:47 ` Steven Whitehouse
2016-01-12 12:03 ` [Cluster-devel] [PATCH v2] " Andrew Price
2016-01-12 15:50 ` Bob Peterson
2017-02-02 14:57 ` Andrew Price
2017-02-13 17:59 ` [Cluster-devel] [PATCH v3] " Andrew Price
2017-02-13 18:06 ` Andrew Price
2017-02-14 10:32 ` Steven Whitehouse
2017-02-15 18:15 ` Andreas Gruenbacher
2016-01-12 12:39 ` [Cluster-devel] [PATCH] " Andreas Gruenbacher
2016-01-12 13:12 ` Andrew Price
2016-01-12 13:30 ` Andreas Gruenbacher
2016-01-12 15:23 ` Andrew Price
2017-02-15 17:40 ` Andreas Gruenbacher
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.