public inbox for linux-xfs@vger.kernel.org
* [PATCH] fs/xfs: Add support for passing write life-time hint with log
       [not found] <CGME20181203131558epcas2p14b6b38cb67d4915b1ba782e11ce7ffe6@epcas2p1.samsung.com>
@ 2018-12-03 13:12 ` Kanchan Joshi
  2018-12-03 15:48   ` Holger Hoffstätte
  0 siblings, 1 reply; 7+ messages in thread
From: Kanchan Joshi @ 2018-12-03 13:12 UTC
  To: darrick.wong; +Cc: linux-xfs, Kanchan Joshi

The XFS log is written in a circular fashion, which gives log data a
different life-time from other metadata and user data.
Passing a write life-time hint with log writes lets a multi-stream SSD
garbage-collect more efficiently, which improves endurance and
performance.
This is described in greater detail (along with results) in this
"FAST 2018" paper:
https://www.usenix.org/conference/fast18/presentation/rho

This patch introduces a new mount option, "logwritehint", to pass a
write hint with XFS log writes.
Among other Linux filesystems, F2FS supports passing down such write
hints. I am preparing a similar proposal for the Ext4 journal.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 fs/xfs/xfs_buf.c         |  2 ++
 fs/xfs/xfs_buf.h         |  1 +
 fs/xfs/xfs_log.c         |  3 +++
 fs/xfs/xfs_log_recover.c |  1 +
 fs/xfs/xfs_mount.h       |  2 ++
 fs/xfs/xfs_super.c       | 15 +++++++++++++--
 6 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b21ea2b..00d17f6 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1370,6 +1370,8 @@ xfs_buf_ioapply_map(
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
 	bio_set_op_attrs(bio, op, op_flags);
+	/* set write hint in bio */
+	bio->bi_write_hint = bp->b_write_hint;
 
 	for (; size && nr_pages; nr_pages--, page_index++) {
 		int	rbytes, nbytes = PAGE_SIZE - offset;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b9f5511..ba9c78c 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -196,6 +196,7 @@ typedef struct xfs_buf {
 	int			b_retries;
 	unsigned long		b_first_retry_time; /* in jiffies */
 	int			b_last_error;
+	enum rw_hint		b_write_hint;	/* write hint for I/O */
 
 	const struct xfs_buf_ops	*b_ops;
 } xfs_buf_t;
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c3b610b..45e220d 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1881,6 +1881,8 @@ xlog_sync(
 	XFS_BUF_SET_ADDR(bp, BLOCK_LSN(be64_to_cpu(iclog->ic_header.h_lsn)));
 
 	XFS_STATS_ADD(log->l_mp, xs_log_blocks, BTOBB(count));
+	/* set write hint in buffer */
+	bp->b_write_hint = log->l_mp->m_logwritehint;
 
 	/* Do we need to split this write into 2 parts? */
 	if (XFS_BUF_ADDR(bp) + BTOBB(count) > log->l_logBBsize) {
@@ -1971,6 +1973,7 @@ xlog_sync(
 		bp->b_log_item = iclog;
 		bp->b_flags &= ~XBF_FLUSH;
 		bp->b_flags |= (XBF_ASYNC | XBF_SYNCIO | XBF_WRITE | XBF_FUA);
+		bp->b_write_hint = log->l_mp->m_logwritehint;
 
 		ASSERT(XFS_BUF_ADDR(bp) <= log->l_logBBsize-1);
 		ASSERT(XFS_BUF_ADDR(bp) + BTOBB(count) <= log->l_logBBsize);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1fc9e90..8bf89fa 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -282,6 +282,7 @@ xlog_bwrite(
 	xfs_buf_lock(bp);
 	bp->b_io_length = nbblks;
 	bp->b_error = 0;
+	bp->b_write_hint = log->l_mp->m_logwritehint;
 
 	error = xfs_bwrite(bp);
 	if (error)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7964513..7f6b2b8 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -171,6 +171,8 @@ typedef struct xfs_mount {
 	struct workqueue_struct	*m_log_workqueue;
 	struct workqueue_struct *m_eofblocks_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
+	/* To store write hint (for log writes) passed during mount */
+	int			m_logwritehint;
 
 	/*
 	 * Generation of the filesysyem layout.  This is incremented by each
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d3e6cd0..6449d213 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -71,7 +71,7 @@ enum {
 	Opt_filestreams, Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota,
 	Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
 	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
-	Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
+	Opt_discard, Opt_nodiscard, Opt_dax, Opt_logwritehint, Opt_err,
 };
 
 static const match_table_t tokens = {
@@ -119,6 +119,7 @@ static const match_table_t tokens = {
 	{Opt_discard,	"discard"},	/* Discard unused blocks */
 	{Opt_nodiscard,	"nodiscard"},	/* Do not discard unused blocks */
 	{Opt_dax,	"dax"},		/* Enable direct access to bdev pages */
+	{Opt_logwritehint, "logwritehint=%u"},/* Write-hint for log */
 	{Opt_err,	NULL},
 };
 
@@ -225,6 +226,10 @@ xfs_parseargs(
 			if (match_int(args, &mp->m_logbufs))
 				return -EINVAL;
 			break;
+		case Opt_logwritehint:
+			if (match_int(args, &mp->m_logwritehint))
+				return -EINVAL;
+			break;
 		case Opt_logbsize:
 			if (suffix_kstrtoint(args, 10, &mp->m_logbsize))
 				return -EINVAL;
@@ -405,7 +410,6 @@ xfs_parseargs(
 		mp->m_dalign = dsunit;
 		mp->m_swidth = dswidth;
 	}
-
 	if (mp->m_logbufs != -1 &&
 	    mp->m_logbufs != 0 &&
 	    (mp->m_logbufs < XLOG_MIN_ICLOGS ||
@@ -438,6 +442,13 @@ xfs_parseargs(
 		mp->m_readio_log = iosizelog;
 		mp->m_writeio_log = iosizelog;
 	}
+	if (mp->m_logwritehint < WRITE_LIFE_NOT_SET ||
+	    mp->m_logwritehint > WRITE_LIFE_EXTREME) {
+		xfs_warn(mp, "invalid logwritehint value: %d [not %d-%d]",
+			mp->m_logwritehint, WRITE_LIFE_NOT_SET, WRITE_LIFE_EXTREME);
+		return -EINVAL;
+
+	}
 
 	return 0;
 }
-- 
2.7.4


* Re: [PATCH] fs/xfs: Add support for passing write life-time hint with log
  2018-12-03 13:12 ` [PATCH] fs/xfs: Add support for passing write life-time hint with log Kanchan Joshi
@ 2018-12-03 15:48   ` Holger Hoffstätte
  2018-12-03 16:34     ` Darrick J. Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Holger Hoffstätte @ 2018-12-03 15:48 UTC
  To: Kanchan Joshi, darrick.wong; +Cc: linux-xfs

On 12/3/18 2:12 PM, Kanchan Joshi wrote:
> Log gets updated in a circular fashion, and that makes life-time
> of log-data different from other types of meta/user-data.
> By passing a write life-time hint with log, GC efficiency of multi-stream SSD
> gets improved, leading to endurance/performance benefits.
> It is described in greater detail (along with results) in this "FAST 2018"
> paper -
> https://www.usenix.org/conference/fast18/presentation/rho
>                                                                                  
> This patch introduces new mount option "logwritehint" to pass write hint
> with XFS log.

Is there any downside to passing the hints unconditionally?
Introducing a new mount option which depends on the internals of
an SSD seems .. unlikely to gain many friends.
Otherwise a great idea. :)

-h


* Re: [PATCH] fs/xfs: Add support for passing write life-time hint with log
  2018-12-03 15:48   ` Holger Hoffstätte
@ 2018-12-03 16:34     ` Darrick J. Wong
  2018-12-03 20:09       ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2018-12-03 16:34 UTC
  To: Holger Hoffstätte; +Cc: Kanchan Joshi, linux-xfs

On Mon, Dec 03, 2018 at 04:48:12PM +0100, Holger Hoffstätte wrote:
> On 12/3/18 2:12 PM, Kanchan Joshi wrote:
> > Log gets updated in a circular fashion, and that makes life-time
> > of log-data different from other types of meta/user-data.
> > By passing a write life-time hint with log, GC efficiency of multi-stream SSD
> > gets improved, leading to endurance/performance benefits.
> > It is described in greater detail (along with results) in this "FAST 2018"
> > paper -
> > https://www.usenix.org/conference/fast18/presentation/rho
> > This patch introduces new mount option "logwritehint" to pass write hint
> > with XFS log.
> 
> Is there any downside to passing the hints unconditionally?

Why wouldn't we always pass LIFE_EXTREME?  Do people have setups where,
say, hint <= LIFE_MEDIUM gets a disk but anything longer than that gets
a big slow stone tablet, which is not where we'd want the metadata log?

For that matter, should we be passing write hints for other fs metadata?
Fixed AG headers never move, should they be LIFE_whateverthelogis ?  How
about space and file metadata, which aren't fixed to certain locations?

> Introducing a new mount option which depends on the internals of
> an SSD seems .. unlikely to gain many friends.
> Otherwise a great idea. :)

Likewise, I'm not wild about adding mount options or passing raw
integers via mount(8) command line:

mount /dev/fd0 /mnt -o logwritehint=3 # ???

--D

> -h


* Re: [PATCH] fs/xfs: Add support for passing write life-time hint with log
  2018-12-03 16:34     ` Darrick J. Wong
@ 2018-12-03 20:09       ` Dave Chinner
  2018-12-04 12:11         ` Kanchan Joshi
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2018-12-03 20:09 UTC
  To: Darrick J. Wong; +Cc: Holger Hoffstätte, Kanchan Joshi, linux-xfs

On Mon, Dec 03, 2018 at 08:34:57AM -0800, Darrick J. Wong wrote:
> On Mon, Dec 03, 2018 at 04:48:12PM +0100, Holger Hoffstätte wrote:
> > On 12/3/18 2:12 PM, Kanchan Joshi wrote:
> > > Log gets updated in a circular fashion, and that makes life-time
> > > of log-data different from other types of meta/user-data.
> > > By passing a write life-time hint with log, GC efficiency of multi-stream SSD
> > > gets improved, leading to endurance/performance benefits.
> > > It is described in greater detail (along with results) in this "FAST 2018"
> > > paper -
> > > https://www.usenix.org/conference/fast18/presentation/rho
> > > This patch introduces new mount option "logwritehint" to pass write hint
> > > with XFS log.
> > 
> > Is there any downside to passing the hints unconditionally?
> 
> Why wouldn't we always pass LIFE_EXTREME?  Do people have setups where,
> say, hint <= LIFE_MEDIUM gets a disk but anything longer than that gets
> a big slow stone tablet, which is not where we'd want the metadata log?
> 
> For that matter, should we be passing write hints for other fs metadata?
> Fixed AG headers never move, should they be LIFE_whateverthelogis ?  How
> about space and file metadata, which aren't fixed to certain locations?

I started looking at this recently because of problems with the XFS
allocator interleaving short-term and long-term data for certain
applications. Part of this was getting the userspace hints plumbed
through to the inode, which can then be used by the allocator to make
high-level placement decisions (e.g. at the AG level), and then the
hint gets plumbed through to the user data bios as well.

Metadata is largely static, even the dynamic metadata, because we
overwrite in place and it doesn't move about all that much in common
workloads. So I was just looking at treating all the metadata the
same, given that there are only 4 or 5 hint levels available.

> > Introducing a new mount option which depends on the internals of
> > an SSD seems .. unlikely to gain many friends.
> > Otherwise a great idea. :)
> 
> Likewise, I'm not wild about adding mount options or passing raw
> integers via mount(8) command line:
> 
> mount /dev/fd0 /mnt -o logwritehint=3 # ???

No mount option, please. Fix the log and metadata as "always
overwritten in place" write type hints, let user data be specified
by the dynamic per-inode hinting interface we already have.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fs/xfs: Add support for passing write life-time hint with log
  2018-12-03 20:09       ` Dave Chinner
@ 2018-12-04 12:11         ` Kanchan Joshi
  2018-12-04 22:09           ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Kanchan Joshi @ 2018-12-04 12:11 UTC
  To: Dave Chinner, Darrick J. Wong
  Cc: Holger Hoffstätte, linux-xfs, jooyoung.hwang, chur.lee,
	prakash.v

I expect the log to have a "SHORT" lifetime in general. The log is
bound to be overwritten as XFS keeps performing transactions, so it is
not a good idea to place it (inside the SSD) with other meta/data that
is more stable (or less stable, for that matter).
Assigning a distinct write hint (SHORT, or anything other than NONE)
to the log solves this mixing problem.

Keeping a mount option seemed to offer more flexibility to admins and
system designers. Assume a single large SSD hosting two XFS volumes,
one catering to fsync-heavy workloads and the other seeing less
frequent log writes. In that situation, one would not want to mix the
writes of the two logs, and would instead prefer to configure one log
as "SHORT" and the other as "MEDIUM or EXTREME".

Also, a mount option seemed more in sync with how the rest of the
kernel currently deals with streams/write hints. To be useful, write
hints need to be converted to specific stream numbers. For NVMe SSDs
this is done by the nvme-core module, but only if it is loaded with
the "streams=1" option. F2FS has a mount option for passing write
hints. The default behavior is to pass no write hint.

To summarize, I have listed three schemes below. Please let me know
which one sounds most acceptable for the patch:
1. [Current proposal] Keep write hint (NONE) as the default, and make
it overridable through a mount option.
2. Keep an immutable write hint (say SHORT). Provide no mount option.
3. Keep write hint (SHORT) as the default, and make it overridable
through a mount option.

Thanks,


* Re: [PATCH] fs/xfs: Add support for passing write life-time hint with log
  2018-12-04 12:11         ` Kanchan Joshi
@ 2018-12-04 22:09           ` Dave Chinner
  2018-12-10 15:15             ` Kanchan Joshi
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2018-12-04 22:09 UTC
  To: Kanchan Joshi
  Cc: Darrick J. Wong, Holger Hoffstätte, linux-xfs,
	jooyoung.hwang, chur.lee, prakash.v

On Tue, Dec 04, 2018 at 05:41:26PM +0530, Kanchan Joshi wrote:
> I expect log to have lifetime as "SHORT" in general. Log is bound to
> be overwritten, as XFS continues performing transaction. So it is
> not good idea to place it (inside SSD) with some other meta/data
> that is more stable (or less stable, for that matter).
> By assigning a distinct write-hint (SHORT, or anything else than
> NONE) to log, this problem of mixing is solved.

So, we have different definitions of what is "short lived"
and what is "long lived". The log is a -static allocation- it never
moves and so it always gets overwritten in place. It exists for the
life of the filesystem, so it's a long-lived structure. Some
metadata moves around - it's allocated and freed on demand, but is
still overwritten in place while it's in use.

The in-use life time of metadata can be very short, but it can also
be very long. It may never get overwritten, or it could be
overwritten multiple times a second. We have no real idea what is
going to happen with each individual piece of metadata because it is
completely dependent on user workloads.

So from a metadata perspective, life-time refers to how long the
metadata is in use in the filesystem, not how often it is accessed
or written. There's no "one-size-fits-all" bucket here.

> Keeping a mount option seemed to offer more flexibility to
> admin/system-designers.

OTOH, it gives everyone who is not an expert in storage and
filesystem implementations an opportunity to screw up in new and
exciting ways that are difficult to detect and impossible for XFS
developers to reproduce or debug.

>
> Assuming a single large SSD, hosting two XFS
> volumes - one catering to fsync-heavy workloads, while another one
> with reduced frequency of log writes. In that situation, one would
> not want to mix the writes of two logs and instead prefer to
> configure one log as "SHORT" and another one as "MEDIUM or EXTREME".

Here's the problem: you're making an assumption that "frequency of
log writes" equates to "the log is overwritten more often", and
that's not true. Frequent fsyncs typically mean lots of small log
writes that block each other, while applications that don't use
fsync will be doing lots of large async log writes and potentially
writing a lot more metadata to the log because nothing is blocking
waiting on journal IO completion...
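
A rough way to see this is to compute how long one trip around the
circular log takes. This is a back-of-the-envelope sketch only: the
function name and every number mentioned below are made up for
illustration, and it ignores everything about real log behavior
except bytes written per second.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds for the write head to travel once around a circular log,
 * i.e. how long a log block survives before it is overwritten.
 * Returns -1.0 for an idle log, which never wraps.
 * Illustrative model only; not derived from XFS code.
 */
double log_wrap_seconds(uint64_t log_bytes, double writes_per_sec,
			uint64_t bytes_per_write)
{
	double bytes_per_sec = writes_per_sec * (double)bytes_per_write;

	assert(log_bytes > 0);
	if (bytes_per_sec <= 0.0)
		return -1.0;
	return (double)log_bytes / bytes_per_sec;
}
```

With a hypothetical 128 MiB log, 2000 small 4 KiB writes per second
take longer to wrap the log than 100 large 256 KiB writes per second,
even though the first workload issues twenty times as many log writes.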

Filesystems rarely behave in the ways non-filesystem developers
expect them to.

> Also, this way (through mount option) seemed more in sync with how
> rest of the kernel currently deals with streams/write-hints. In
> order to be useful, write-hints need to be converted to specific
> stream numbers. For NVMe SSDs, this is done by nvme-core module, but
> only if it is loaded with "streams=1" option. F2FS has mount option
> for passing write-hints. Default behavior is passing no write-hint.

There is no need for mount options, because we already have a
fcntl() interface that applications can use for setting write hints
on files. It was introduced in 4.13, and XFS already plumbs it
through for buffered write IO.

FYI:

$ man fcntl
....
   File read/write hints

       Write lifetime hints can be used to inform the kernel about
       the relative expected lifetime of writes on a given inode or
       via a particular open file description. (See open(2) for an
       explanation of open file descriptions.) In this context, the
       term "write lifetime" means the expected time the data will
       live on media, before being overwritten or erased.
.....

And the interfaces are:

       F_GET_RW_HINT (uint64_t *; since Linux 4.13)
       F_SET_RW_HINT (uint64_t *; since Linux 4.13)
       F_GET_FILE_RW_HINT (uint64_t *; since Linux 4.13)
       F_SET_FILE_RW_HINT (uint64_t *; since Linux 4.13)

And the types are:

       RWH_WRITE_LIFE_NOT_SET
       RWH_WRITE_LIFE_NONE
       RWH_WRITE_LIFE_SHORT
       RWH_WRITE_LIFE_MEDIUM
       RWH_WRITE_LIFE_LONG
       RWH_WRITE_LIFE_EXTREME

We should probably also make sure direct IO uses this hint, too, and
ideally we want to set the write hint for the metadata in that file to
the same value as the user data being written, as the file metadata
is likely to have a similar lifetime to the user data it refers to.

IOWs, we want different metadata to have appropriately different
write hints. Some of it will be controllable through the per-file
write hints set by the user; the rest will be controlled by the
filesystem itself, as userspace has no visibility into or control
over how that internal metadata is managed.
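
As a sketch of that split (plain userspace C, not XFS code): every
type and function name here is hypothetical, and the two fixed hint
values are placeholders, since nothing in this thread settles which
values the filesystem should pick.

```c
#include <assert.h>

/* Userspace stand-ins mirroring the RWH_WRITE_LIFE_* ordering. */
enum hint {
	HINT_NOT_SET = 0, HINT_NONE, HINT_SHORT,
	HINT_MEDIUM, HINT_LONG, HINT_EXTREME,
};

/* Hypothetical classification of what a write carries. */
enum write_kind {
	WK_USER_DATA,		/* file contents */
	WK_FILE_METADATA,	/* metadata belonging to one file */
	WK_INTERNAL_METADATA,	/* fs-internal, invisible to userspace */
	WK_LOG,			/* journal writes */
};

/* Placeholder choices made by the filesystem, not by the user. */
#define FS_INTERNAL_METADATA_HINT	HINT_LONG
#define FS_LOG_HINT			HINT_MEDIUM

enum hint pick_write_hint(enum write_kind kind, enum hint inode_hint)
{
	switch (kind) {
	case WK_USER_DATA:
	case WK_FILE_METADATA:
		/* Follows the per-file hint set via fcntl(). */
		return inode_hint;
	case WK_INTERNAL_METADATA:
		return FS_INTERNAL_METADATA_HINT;
	case WK_LOG:
		return FS_LOG_HINT;
	}
	assert(!"unknown write kind");
	return HINT_NOT_SET;
}
```

The only property this is meant to show is that the per-file hint
reaches both a file's data and its metadata, while internal structures
never consult it.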

> To summarize, I have listed three schemes below. Please let me know
> which one sounds more acceptable for patch -
> 1. [Current proposal] Keep write-hint (NONE) as default, and make it
> overridable through mount option.
> 2. Keep immutable write-hint (say SHORT). Provide no mount option.
> 3. Keep write-hint (SHORT) as default, and make it overridable
> through mount option.

Option 4: let the filesystem decide what is best dynamically,
because the lifetime of metadata and how often it is written is
a dynamic property of the specific metadata type.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fs/xfs: Add support for passing write life-time hint with log
  2018-12-04 22:09           ` Dave Chinner
@ 2018-12-10 15:15             ` Kanchan Joshi
  0 siblings, 0 replies; 7+ messages in thread
From: Kanchan Joshi @ 2018-12-10 15:15 UTC
  To: Dave Chinner
  Cc: Darrick J. Wong, Holger Hoffstätte, linux-xfs,
	jooyoung.hwang, chur.lee, prakash.v

A write life-time hint is not a feature in itself; it is an
abstraction built over an SSD feature, "streams". And this abstraction
is more rigid than the feature in how it defines life-time buckets.
Feature-wise, it is sufficient to assign two stream numbers X and Y to
isolate one kind of data from another (and to reap the benefits),
while the abstraction compels us to debate the relative hotness of
these two types of data. Deciding relative hotness gets trickier as
the number of data types increases and, worse, it may not bring any
benefit. If the aim of the change is to get a benefit from the SSD, we
should consider lifetime from the SSD's point of view. And that is
based on "overwrites".

Please refer to figure 1 in this paper:
https://www.usenix.org/system/files/conference/fast18/fast18-rho.pdf
If a block is not overwritten by the host, it stays valid inside the
SSD; if it gets overwritten, it becomes invalid and creates a hole. No
holes is good. All holes is also good. Intermixing a _few_ holes with
a _few_ valid blocks is bad.
Because of the way the log is written, it stays valid (i.e. no
overwrites) until roll-over. After roll-over, it starts getting
overwritten. If the volume is meta-light, the log will stay valid for
long. If the volume is meta-heavy, log writes will start creating
holes (invalid data). But neither situation is problematic in itself.
The problematic situation is when, along with log updates, we start
getting other data/meta updates. This meta/data may or may not be as
stable or as transient as the log. But the point is: why bother about
whether the log is as hot/cold as something else? The problem can be
solved by isolating log data in its own chamber, in its own stream.
It will either remain all-valid or turn all-invalid, unaffected by
everything else that goes on around it.
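
The "holes" argument can be put as a toy cost model. This is only a
sketch: the function names are made up, and a real FTL is far more
involved than counting pages in one erase block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Pages the SSD must copy elsewhere before it can erase this block.
 * page_valid[i] is false when page i has been overwritten (a hole).
 */
size_t gc_copy_cost(const bool *page_valid, size_t pages_per_block)
{
	size_t valid = 0;

	assert(page_valid != NULL);
	for (size_t i = 0; i < pages_per_block; i++)
		if (page_valid[i])
			valid++;
	return valid;
}

/* Pages that erasing this block would actually free up. */
size_t gc_pages_reclaimed(const bool *page_valid, size_t pages_per_block)
{
	return pages_per_block - gc_copy_cost(page_valid, pages_per_block);
}
```

An all-invalid block costs nothing to copy and frees every page; an
all-valid block frees nothing, so there is no reason to collect it;
only a mixed block forces copying in order to reclaim space.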

> Option 4: let the filesystem decide what is best dynamically,
> because the lifetime of metadata and how often it is written is
> a dynamic property of the specific metadata type.

I think the log should be treated independently of any other
meta/data. Matching the dynamic nature of metadata with life-time
hints seems harder (than for the log) to get right. Abstraction-wise,
the FS can try to be very accurate about changing life-time hints
(change something from warm to cold to hot etc.). But one should note
that streams come with an allocation granularity. See "SGS" in the
NVMe spec, page 275:
https://nvmexpress.org/wp-content/uploads/NVM_Express_Revision_1.3.pdf
Or, as seen in figure 1 above, internally each write hint/stream is
assigned a fixed-size large region. Therefore the possibility of
internal fragmentation needs to be considered when hopping from one
hint to another.
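
The fragmentation concern comes down to simple arithmetic. In this
sketch the function name is made up and the granularity is just a
parameter; the real value is reported by the device, and what a device
does with a partly filled unit is device-specific.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes left unused in the last allocation unit of a stream if the
 * host stops writing to it after bytes_written bytes.
 */
uint64_t stream_unit_slack(uint64_t bytes_written, uint64_t granularity)
{
	uint64_t rem;

	assert(granularity > 0);
	rem = bytes_written % granularity;
	return rem ? granularity - rem : 0;
}
```

For example, with a hypothetical 64 MiB granularity, writing 10 MiB
under one hint and then switching to another may leave 54 MiB of that
unit unfilled.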



