[Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme

* [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme
       [not found] <1612755211.6184739.1412773976742.JavaMail.zimbra@redhat.com>
@ 2014-10-08 13:29 ` Bob Peterson
  2014-10-08 15:03   ` Bob Peterson
  0 siblings, 1 reply; 5+ messages in thread
From: Bob Peterson @ 2014-10-08 13:29 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

Steve Whitehouse wrote:
| I'd much prefer to see an algorithm that is adaptive, rather than simply
| bumping up the default here. We do need to be able to cope with cases
| where the files are much smaller, and without adaptive sizing, we might
| land up creating small holes in the allocations, which would then cause
| problems for later allocations,

I took this to heart and came up with a new design. The idea is not unlike
the block reservation doubling schemes in other file systems. The results
are fantastic: much better than those gained by hard-coding 64 blocks.
This is the lowest level of fragmentation I've ever achieved with this app:

EXTENT COUNT FOR OUTPUT FILES =  310103
EXTENT COUNT FOR OUTPUT FILES =  343990
EXTENT COUNT FOR OUTPUT FILES =  332818
EXTENT COUNT FOR OUTPUT FILES =  336852
EXTENT COUNT FOR OUTPUT FILES =  334820

Compare these results to counts without the patch:

EXTENT COUNT FOR OUTPUT FILES =  951813
EXTENT COUNT FOR OUTPUT FILES =  966978
EXTENT COUNT FOR OUTPUT FILES =  1065481
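
For a rough sense of scale, the runs above can be averaged. The sketch
below is only arithmetic on the figures quoted in this message; the names
mean_extents() and extent_reduction() are made up for illustration and do
not exist in gfs2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: arithmetic mean of a set of extent counts. */
static double mean_extents(const unsigned long *counts, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += (double)counts[i];
	return sum / (double)n;
}

/* Extent counts copied from the runs quoted above. */
static const unsigned long with_patch[5] = {
	310103, 343990, 332818, 336852, 334820
};
static const unsigned long without_patch[3] = {
	951813, 966978, 1065481
};

/* Fraction of extents eliminated by the patch, comparing the two means. */
static double extent_reduction(void)
{
	double patched = mean_extents(with_patch, 5);
	double baseline = mean_extents(without_patch, 3);

	return 1.0 - patched / baseline;
}
```

With these numbers the means come to about 331,717 extents with the patch
and about 994,757 without it, i.e. roughly two thirds fewer extents.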

The only downside I see is that it makes the gfs2 inode structure bigger.
I also thought about changing the minimum reservation size to 16 blocks
rather than 32, since it's now adaptive, but before I did that, I'd have
to run some performance tests. What do you think?

Regards,

Bob Peterson
Red Hat File Systems
--------------------------------------------------------------------
Patch text:

This patch introduces a new block reservation doubling scheme. If we
get to the end of a multi-block reservation, we probably did not
reserve enough blocks. So we double the size of the reservation for
next time. If we can't find any rgrps that match, we go back to the
default 32 blocks.

Signed-off-by: Bob Peterson <rpeterso@redhat.com> 
---
 fs/gfs2/incore.h | 1 +
 fs/gfs2/main.c   | 2 ++
 fs/gfs2/rgrp.c   | 7 ++++++-
 fs/gfs2/rgrp.h   | 9 ++-------
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 39e7e99..f98fa37 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -398,6 +398,7 @@ struct gfs2_inode {
 	u32 i_diskflags;
 	u8 i_height;
 	u8 i_depth;
+	u32 i_rsrv_minblks; /* minimum blocks per reservation */
 };
 
 /*
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 82b6ac8..2be2f98 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -29,6 +29,7 @@
 #include "glock.h"
 #include "quota.h"
 #include "recovery.h"
+#include "rgrp.h"
 #include "dir.h"
 
 struct workqueue_struct *gfs2_control_wq;
@@ -42,6 +43,7 @@ static void gfs2_init_inode_once(void *foo)
 	INIT_LIST_HEAD(&ip->i_trunc_list);
 	ip->i_res = NULL;
 	ip->i_hash_cache = NULL;
+	ip->i_rsrv_minblks = RGRP_RSRV_MINBLKS;
 }
 
 static void gfs2_init_glock_once(void *foo)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 7474c41..986c33f 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1465,7 +1465,7 @@ static void rg_mblk_search(struct gfs2_rgrpd *rgd, struct gfs2_inode *ip,
 		extlen = 1;
 	else {
 		extlen = max_t(u32, atomic_read(&rs->rs_sizehint), ap->target);
-		extlen = clamp(extlen, RGRP_RSRV_MINBLKS, free_blocks);
+		extlen = clamp(extlen, ip->i_rsrv_minblks, free_blocks);
 	}
 	if ((rgd->rd_free_clone < rgd->rd_reserved) || (free_blocks < extlen))
 		return;
@@ -2000,6 +2000,7 @@ next_rgrp:
 		 * then this checks for some less likely conditions before
 		 * trying again.
 		 */
+		ip->i_rsrv_minblks = RGRP_RSRV_MINBLKS;
 		loops++;
 		/* Check that fs hasn't grown if writing to rindex */
 		if (ip == GFS2_I(sdp->sd_rindex) && !sdp->sd_rindex_uptodate) {
@@ -2195,6 +2196,10 @@ static void gfs2_adjust_reservation(struct gfs2_inode *ip,
 			trace_gfs2_rs(rs, TRACE_RS_CLAIM);
 			if (rs->rs_free && !ret)
 				goto out;
+			/* We used up our block reservation, so double the
+			   minimum reservation size for the next write. */
+			if (ip->i_rsrv_minblks < RGRP_RSRV_MAXBLKS)
+				ip->i_rsrv_minblks <<= 1;
 		}
 		__rs_deltree(rs);
 	}
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 5d8f085..d081cac 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -13,13 +13,8 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 
-/* Since each block in the file system is represented by two bits in the
- * bitmap, one 64-bit word in the bitmap will represent 32 blocks.
- * By reserving 32 blocks at a time, we can optimize / shortcut how we search
- * through the bitmaps by looking a word at a time.
- */
-#define RGRP_RSRV_MINBYTES 8
-#define RGRP_RSRV_MINBLKS ((u32)(RGRP_RSRV_MINBYTES * GFS2_NBBY))
+#define RGRP_RSRV_MINBLKS 32
+#define RGRP_RSRV_MAXBLKS 512
 
 struct gfs2_rgrpd;
 struct gfs2_sbd;
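
The behaviour of the patch above can be modelled in user space. This is a
minimal sketch, not kernel code: struct inode_model and the model_*()
functions are hypothetical stand-ins, and the two constants simply mirror
RGRP_RSRV_MINBLKS and RGRP_RSRV_MAXBLKS from the rgrp.h hunk.

```c
#include <assert.h>
#include <stdint.h>

#define RSRV_MINBLKS 32u   /* mirrors RGRP_RSRV_MINBLKS in the patch */
#define RSRV_MAXBLKS 512u  /* mirrors RGRP_RSRV_MAXBLKS in the patch */

/* Stand-in for the single field the patch adds to struct gfs2_inode. */
struct inode_model {
	uint32_t rsrv_minblks;
};

/* Reservation ran out: double the minimum until the cap is reached. */
static void model_reservation_used_up(struct inode_model *ip)
{
	if (ip->rsrv_minblks < RSRV_MAXBLKS)
		ip->rsrv_minblks <<= 1;
}

/* No rgrp could satisfy the request: fall back to the default. */
static void model_no_rgrp_found(struct inode_model *ip)
{
	ip->rsrv_minblks = RSRV_MINBLKS;
}

/*
 * Extent length as computed in the rg_mblk_search() hunk: the larger of
 * the size hint and the target, clamped to [rsrv_minblks, free_blocks].
 */
static uint32_t model_extlen(const struct inode_model *ip, uint32_t sizehint,
			     uint32_t target, uint32_t free_blocks)
{
	uint32_t extlen = sizehint > target ? sizehint : target;

	assert(ip->rsrv_minblks >= RSRV_MINBLKS);
	if (extlen < ip->rsrv_minblks)
		extlen = ip->rsrv_minblks;
	if (extlen > free_blocks)
		extlen = free_blocks;
	return extlen;
}
```

Starting from 32, four calls to model_reservation_used_up() bring the
minimum to 512, after which further calls leave it unchanged until
model_no_rgrp_found() resets it.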




* [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme
  2014-10-08 13:29 ` [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme Bob Peterson
@ 2014-10-08 15:03   ` Bob Peterson
  2014-10-10  3:39     ` Bob Peterson
  0 siblings, 1 reply; 5+ messages in thread
From: Bob Peterson @ 2014-10-08 15:03 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> This patch introduces a new block reservation doubling scheme. If we

Maybe I sent this patch out prematurely. Instead of doubling the
reservation, maybe I should experiment with making it grow additively.
IOW, instead of 32-64-128-256-512, I should use: 32-64-96-128-160-192-224-etc...
I know other file systems use doubling schemes, but I'm concerned
about it being too aggressive.
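
To make "too aggressive" concrete, one can count how many exhausted
reservations it takes each scheme to reach the cap. The sketch below
assumes the additive scheme keeps the 512-block cap from the posted patch
(the sequence above does not say where it stops); the function names are
illustrative only.

```c
#include <assert.h>

#define MINBLKS 32u   /* starting reservation size */
#define MAXBLKS 512u  /* assumed cap, as in the posted patch */

/* Growth steps needed to reach the cap when the size doubles each time. */
static unsigned int steps_doubling(void)
{
	unsigned int size = MINBLKS, steps = 0;

	while (size < MAXBLKS) {
		size <<= 1;
		steps++;
	}
	return steps;
}

/* Growth steps needed to reach the cap when a fixed increment is added. */
static unsigned int steps_additive(unsigned int increment)
{
	unsigned int size = MINBLKS, steps = 0;

	assert(increment > 0);
	while (size < MAXBLKS) {
		size += increment;
		steps++;
	}
	return steps;
}
```

Doubling reaches 512 blocks after 4 steps, while adding 32 blocks per step
takes 15, so the additive scheme ramps up far more gradually.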

Regards,

Bob Peterson
Red Hat File Systems


* [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme
  2014-10-08 15:03   ` Bob Peterson
@ 2014-10-10  3:39     ` Bob Peterson
  2014-10-10  9:07       ` Steven Whitehouse
  0 siblings, 1 reply; 5+ messages in thread
From: Bob Peterson @ 2014-10-10  3:39 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> ----- Original Message -----
> > This patch introduces a new block reservation doubling scheme. If we
> 
> Maybe I sent this patch out prematurely. Instead of doubling the
> reservation, maybe I should experiment with making it grow additively.
> IOW, Instead of 32-64-128-256-512, I should use:
> 32-64-96-128-160-192-224-etc...
> I know other file systems using doubling schemes, but I'm concerned
> about it being too aggressive.

I tried an additive reservations algorithm. I basically changed the
previous patch from doubling the reservation to adding 32 blocks.
In other words, I replaced:

+				ip->i_rsrv_minblks <<= 1;
with this:
+				ip->i_rsrv_minblks += RGRP_RSRV_MINBLKS;

The results were not as good, but still very impressive, and maybe
acceptable:

Reservation doubling scheme:
EXTENT COUNT FOR OUTPUT FILES =  310103
EXTENT COUNT FOR OUTPUT FILES =  343990
EXTENT COUNT FOR OUTPUT FILES =  332818
EXTENT COUNT FOR OUTPUT FILES =  336852
EXTENT COUNT FOR OUTPUT FILES =  334820

Reservation additive scheme (32 blocks):
EXTENT COUNT FOR OUTPUT FILES =  322406
EXTENT COUNT FOR OUTPUT FILES =  341665
EXTENT COUNT FOR OUTPUT FILES =  341769
EXTENT COUNT FOR OUTPUT FILES =  348676
EXTENT COUNT FOR OUTPUT FILES =  348079
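
The gap between the two schemes can be quantified from the figures above.
This is plain arithmetic on the quoted counts; total_extents() and
additive_overhead_pct() are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the extent counts across a set of runs. */
static unsigned long total_extents(const unsigned long *counts, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += counts[i];
	return sum;
}

/* Extent counts copied from the two result sets above. */
static const unsigned long doubling[5] = {
	310103, 343990, 332818, 336852, 334820
};
static const unsigned long additive[5] = {
	322406, 341665, 341769, 348676, 348079
};

/* Extra extents produced by the additive scheme, in percent of doubling. */
static double additive_overhead_pct(void)
{
	double d = (double)total_extents(doubling, 5);
	double a = (double)total_extents(additive, 5);

	return (a / d - 1.0) * 100.0;
}
```

Over five runs each, the additive scheme produced roughly 2.7% more extents
than doubling, which is small next to the run-to-run spread within either
scheme.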

So I'm looking for opinions:
(a) Stick with the original reservation doubling patch, or
(b) Go with the additive version.
(c) Any other ideas?

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme
  2014-10-10  3:39     ` Bob Peterson
@ 2014-10-10  9:07       ` Steven Whitehouse
  2014-10-14 13:44         ` Bob Peterson
  0 siblings, 1 reply; 5+ messages in thread
From: Steven Whitehouse @ 2014-10-10  9:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 10/10/14 04:39, Bob Peterson wrote:
> ----- Original Message -----
>> ----- Original Message -----
>>> This patch introduces a new block reservation doubling scheme. If we
>> Maybe I sent this patch out prematurely. Instead of doubling the
>> reservation, maybe I should experiment with making it grow additively.
>> IOW, Instead of 32-64-128-256-512, I should use:
>> 32-64-96-128-160-192-224-etc...
>> I know other file systems using doubling schemes, but I'm concerned
>> about it being too aggressive.
> I tried an additive reservations algorithm. I basically changed the
> previous patch from doubling the reservation to adding 32 blocks.
> In other words, I replaced:
>
> +				ip->i_rsrv_minblks <<= 1;
> with this:
> +				ip->i_rsrv_minblks += RGRP_RSRV_MINBLKS;
>
> The results were not as good, but still very impressive, and maybe
> acceptable:
>
> Reservation doubling scheme:
> EXTENT COUNT FOR OUTPUT FILES =  310103
> EXTENT COUNT FOR OUTPUT FILES =  343990
> EXTENT COUNT FOR OUTPUT FILES =  332818
> EXTENT COUNT FOR OUTPUT FILES =  336852
> EXTENT COUNT FOR OUTPUT FILES =  334820
>
> Reservation additive scheme (32 blocks):
> EXTENT COUNT FOR OUTPUT FILES =  322406
> EXTENT COUNT FOR OUTPUT FILES =  341665
> EXTENT COUNT FOR OUTPUT FILES =  341769
> EXTENT COUNT FOR OUTPUT FILES =  348676
> EXTENT COUNT FOR OUTPUT FILES =  348079
>
> So I'm looking for opinions:
> (a) Stick with the original reservation doubling patch, or
> (b) Go with the additive version.
> (c) Any other ideas?
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

I think you are very much along the right lines. The issue is to ensure 
that all the evidence that is available is taken into account in 
figuring out how large a reservation to make. There are various clues, 
such as the time between writes, the size of the writes, whether the 
file gets closed between writes, whether the writes are contiguous and 
so forth.

Some of those things are taken into account already; however, we can
probably do better. We may also be able to take some hints from things
like calls to fsync (should we drop small reservations when fsync is
called, since it likely marks a significant point in the file?) or even
detect well-known non-linear write patterns, e.g. backwards stride
patterns or large matrix access patterns (by row or column).
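
As an illustration of the simplest such detection, the sketch below
classifies each write against the previous one as forward-contiguous,
backward-contiguous, or neither. It is a hypothetical user-space model and
not anything gfs2 does today; all names are invented here, and a real
detector would need more history than a single previous write.

```c
#include <assert.h>
#include <stdint.h>

enum write_pattern {
	PATTERN_UNKNOWN,   /* first write, nothing to compare against */
	PATTERN_FORWARD,   /* starts where the previous write ended */
	PATTERN_BACKWARD,  /* ends where the previous write started */
	PATTERN_RANDOM     /* neither of the above */
};

/* Per-writer context: the byte range of the previous write. */
struct write_history {
	uint64_t last_start;
	uint64_t last_len;
	int seen;
};

/* Classify a write relative to the previous one, then remember it. */
static enum write_pattern classify_write(struct write_history *h,
					 uint64_t start, uint64_t len)
{
	enum write_pattern p = PATTERN_UNKNOWN;

	assert(len > 0);
	if (h->seen) {
		if (start == h->last_start + h->last_len)
			p = PATTERN_FORWARD;
		else if (start + len == h->last_start)
			p = PATTERN_BACKWARD;
		else
			p = PATTERN_RANDOM;
	}
	h->last_start = start;
	h->last_len = len;
	h->seen = 1;
	return p;
}
```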

The struct file is really the best place to store this context 
information, since if there are multiple writers to the same inode, then 
there is a fair chance that they'll have separate struct files. Does 
this happen in your test workload?

The readahead code can already detect some common read patterns, and it 
also turns itself off if the reads are random. The readahead problem is 
actually very much the same problem in that it tries to estimate which 
reads are coming next based on the context that has been seen already, 
so there may well be some lessons to be learned from that too.

I think it's important to look at the statistics of lots of different
workloads, and to check them off against your candidate algorithm(s), to
ensure that the widest range of potential access patterns is taken into
account,

Steve.


* [Cluster-devel] [GFS2 PATCH] GFS2: Block reservation doubling scheme
  2014-10-10  9:07       ` Steven Whitehouse
@ 2014-10-14 13:44         ` Bob Peterson
  0 siblings, 0 replies; 5+ messages in thread
From: Bob Peterson @ 2014-10-14 13:44 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> >>> This patch introduces a new block reservation doubling scheme. If we
> >> Maybe I sent this patch out prematurely. Instead of doubling the
> >> reservation, maybe I should experiment with making it grow additively.
> >> IOW, Instead of 32-64-128-256-512, I should use:
> >> 32-64-96-128-160-192-224-etc...
> >> I know other file systems using doubling schemes, but I'm concerned
> >> about it being too aggressive.
> > I tried an additive reservations algorithm. I basically changed the
> > previous patch from doubling the reservation to adding 32 blocks.
> > In other words, I replaced:
> >
> > +				ip->i_rsrv_minblks <<= 1;
> > with this:
> > +				ip->i_rsrv_minblks += RGRP_RSRV_MINBLKS;
> >
> > The results were not as good, but still very impressive, and maybe
> > acceptable:
(snip)
> I think you are very much along the right lines. The issue is to ensure
> that all the evidence that is available is taken into account in
> figuring out how large a reservation to make. There are various clues,
> such as the time between writes, the size of the writes, whether the
> file gets closed between writes, whether the writes are contiguous and
> so forth.
> 
> Some of those things are taken into account already, however we can
> probably do better. We may be able to also take some hints from things
> like calls to fsync (should we drop reservations that are small at this
> point, since it likely signifies a significant point in the file, if
> fsync is called?) or even detect well known non-linear write patterns,
> e.g. backwards stride patterns or large matrix access patterns (by row
> or column).
> 
> The struct file is really the best place to store this context
> information, since if there are multiple writers to the same inode, then
> there is a fair chance that they'll have separate struct files. Does
> this happen in your test workload?
> 
> The readahead code can already detect some common read patterns, and it
> also turns itself off if the reads are random. The readahead problem is
> actually very much the same problem in that it tries to estimate which
> reads are coming next based on the context that has been seen already,
> so there may well be some lessons to be learned from that too.
> 
> I think its important to look at the statistics of lots of different
> workloads, and to check them off against your candidate algorithm(s), to
> ensure that the widest range of potential access patterns are taken into
> account,
> 
> Steve.

Hi Steve,

Sorry it's taken me a bit to respond. I've been giving this a lot of thought
and doing a lot of experiments and tests.

I see multiple issues/problems, and my patches have been trying to address
them or solve them separately. You make some very good points here, so
I want to address them individually in the light of my latest findings.
I basically see three main performance problems:

1. Inter-node contention for resource groups. In the past, it was solved
   with "try" locks, which resulted in chaotic block assignments.
   In RHEL7 and up, we eliminated them, but the contention came back and
   performance suffers. I posted a patch for this issue that allows each
   node in the cluster to "prefer" a unique set of resource groups. It
   greatly reduced inter-node contention and improved performance.
   It was called "GFS2: Set of distributed preferences for rgrps",
   posted on October 8.
2. We need to more accurately predict the size of multi-block reservations.
   This is the issue you talk about here, and it's one that I haven't
   addressed yet.
3. We need a way to adjust those predictions if they're found to be
   inadequate. That's the problem I was addressing with the reservation
   doubling scheme or additive reservation scheme.

Issues 2 and 3 might possibly be treated as one issue: we could have a
self-adjusting reservation size system, based on a number of factors,
and I'm in the process of reworking how we do it. I've been doing lots of
experiments and running lots of tests against different workloads. You're
right that #2 is necessary, and I've verified that without it, some
workloads get faster while others get slower (although there's an overall
improvement).

Here are some thoughts:

1. Today, reservations are based on write size, which, as you say, is
   not a very good predictor. We can do better.
2. My reservation doubling scheme helps, and reduces fragmentation, but
   we need a more sophisticated scheme.
3. I don't think the time between writes should affect the reservation
   because different applications have different dynamics.
4. The size of the writes is already taken into account. However, the way
   we do it now is kind of bogus. With every write, we adjust the size
   hint. But if the application is doing rewrites, it shouldn't matter.
   If it's writing backwards or at random locations, it might matter.
   Last night I experimented with a new scheme that basically only
   adjusts the size hint if block allocations are necessary. That way,
   applications that do a long sequence of: "large-appends" followed by
   "small rewrites" don't get their "append" size hint whacked by the
   small rewrite. This didn't help the customer application I'm testing,
   but it did help some of the other benchmarks I ran yesterday.
5. I don't like the idea of adjusting the reservation at fsync time.
   Similarly, I don't like the idea of adjusting the reservation at
   file close time. I think it makes the most sense to keep the info
   associated with the inode as we do today. My next iteration will
   hopefully not add any fields to the inode.
6. I like the idea of adjusting the reservation for non-linear writes,
   such as backwards writes, but I may have to do more testing. For
   example, if I do multiple writes to a file at: 2MB, 1MB, 500KB, etc.,
   is it better to reserve X blocks which will be assigned in reverse
   order? Or is it better to just reserve them as needed and have them
   more scattered but possibly more linear? Maybe testing will show.
7. With regard to storing the context information in the struct file:
   It depends on what information. Today, there is only one reservation
   structure, and one reservation size, per inode, whereas there can be
   many struct files for many writers to the inode. The question of whether
   a reservation is adequate is not so much about "will this reservation
   be adequate for this writer?". Rather, it's about "will this
   reservation be adequate for our most demanding writer?" All the
   rewriters in the world shouldn't affect the outcome of a single
   aggressive appender, for example.
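
The experiment in item 4 can be sketched as follows. This is a simplified
user-space model, not the gfs2 code: it assumes the size hint is simply
overwritten by each write, which may not match the real implementation,
and struct hint_model and both functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the per-inode size hint, in blocks. */
struct hint_model {
	uint32_t sizehint;
};

/* Behaviour described as "today": every write adjusts the hint. */
static void hint_update_always(struct hint_model *h, uint32_t write_blocks)
{
	h->sizehint = write_blocks;
}

/*
 * Experimental behaviour: only writes that need new blocks adjust the
 * hint, so a rewrite of already-allocated blocks leaves it alone.
 */
static void hint_update_on_alloc(struct hint_model *h, uint32_t write_blocks,
				 uint32_t blocks_to_allocate)
{
	assert(blocks_to_allocate <= write_blocks);
	if (blocks_to_allocate == 0)
		return;	/* pure rewrite: keep the previous hint */
	h->sizehint = write_blocks;
}
```

For a 256-block append followed by a 1-block rewrite, the first function
leaves the hint at 1 while the second keeps it at 256, which is the
"large-appends followed by small rewrites" case from item 4.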

To answer your question: I'd wager that yes, there are multiple writers
to at least some of the files, but I'm not sure how extensive it is.
The workload seems to have a good variety of linear and non-linear writes
as well. At least now I'm starting to use multiple benchmarks for my
tests.

Regards,

Bob Peterson
Red Hat File Systems

