* [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
@ 2014-10-29 18:23 Jason B. Akers
2014-10-29 18:23 ` [RFC PATCH 1/5] block, ioprio: include caching advice via ionice Jason B. Akers
` (6 more replies)
0 siblings, 7 replies; 27+ messages in thread
From: Jason B. Akers @ 2014-10-29 18:23 UTC (permalink / raw)
To: linux-ide; +Cc: axboe, dan.j.williams, kapil.karkra, linux-kernel
The following series enables the use of Solid State Hybrid Drives (SSHDs).
The SATA 3.2 standard defines the hybrid information feature, which provides a means for the host driver to pass hints to SSHDs to guide what to place on the SSD/NAND portion and what to place on the magnetic media.
This implementation allows user-space applications to provide cache hints to the kernel using the existing ionice syscall.
An application can pass a priority number that sets bits 11, 12, and 15 of the io priority value; together these bits form a 3-bit field that encodes the following priorities:
IOPRIO_ADV_NONE,
IOPRIO_ADV_EVICT, /* actively discard cached data */
IOPRIO_ADV_DONTNEED, /* caching this data has little value */
IOPRIO_ADV_NORMAL, /* best-effort cache priority (default) */
IOPRIO_ADV_RESERVED1, /* reserved for future use */
IOPRIO_ADV_RESERVED2,
IOPRIO_ADV_RESERVED3,
IOPRIO_ADV_WILLNEED, /* high temporal locality */
For example, assuming the SSHD is /dev/sdc, the following commands cause the IOs generated by dd to carry the IOPRIO_ADV_DONTNEED hint:
ionice -c2 -n4096 dd if=/dev/zero of=/dev/sdc bs=1M count=1024
ionice -c2 -n4096 dd if=/dev/sdc of=/dev/null bs=1M count=1024
Note that the class field (-c2) selects the best-effort class, while the -n field carries the io priority. Also note that these io priorities translate to SSHD hints passed down the SATA link as follows:
IOPRIO_ADV_NONE translates to 0x0
IOPRIO_ADV_EVICT translates to 0x20
IOPRIO_ADV_DONTNEED translates to 0x21
IOPRIO_ADV_NORMAL translates to MAX_PRI - 1
IOPRIO_ADV_WILLNEED translates to MAX_PRI
where MAX_PRI is the maximum priority level supported by the SSHD.
The translation from ionice cache hint to ATA hybrid hint is controlled by a hybrid translation table. A default table is built into libata.
This default table can be overridden with a new table through the firmware update mechanism. The new table can be applied on a per-ata_device basis. The reasoning is that the translation can be optimized for VendorA SSHDs differently from VendorB SSHDs. A hybrid_information_table.bin placed in the ./firmware directory overrides the default table.
The .config can be modified to build in the new table:
CONFIG_EXTRA_FIRMWARE="hybrid_information_table.bin"
CONFIG_EXTRA_FIRMWARE_DIR="firmware"
To toggle the feature on or off:
echo 1 > /sys/class/ata_device/devX.0/hybrid
We ran performance tests with SSHDs from two different vendors and saw the following results with the default all-insert policy:
50% improvement in boot times
45% improvement in application launch times
3x faster browsing
4x faster SQLite
The performance results also showed that host hinting with the above default table performed significantly better than the drives' self-hinting in our benchmarks.
We are looking for feedback on the cache hinting approach, as it seems others are interested in this capability. We're also looking for comments on the libata implementation for supporting host-hinted SSHDs. A patch to sg3_utils for retrieving the Hybrid Information Log follows. Per the spec, host-hinted SSHDs revert to internal firmware hinting if the log is not read after a certain number of power cycles.
---
Dan Williams (3):
block, ioprio: include caching advice via ionice
block: ioprio hint to low-level device drivers
block: untangle ioprio from BLK_CGROUP and BLK_DEV_THROTTLING
Jason B. Akers (1):
libata: Enabling Solid State Hybrid Drives (SSHDs) based on SATA 3.2 standard
Kapil Karkra (1):
block, mm: Added the necessary plumbing to take ioprio hints down to block layer
block/bio.c | 76 +++++++-----
block/blk-throttle.c | 5 +
block/blk.h | 2
block/ioprio.c | 22 ++-
drivers/ata/Makefile | 2
drivers/ata/libata-core.c | 13 ++
drivers/ata/libata-eh.c | 11 ++
drivers/ata/libata-hybrid.c | 261 ++++++++++++++++++++++++++++++++++++++++
drivers/ata/libata-hybrid.h | 14 ++
drivers/ata/libata-scsi.c | 4 -
drivers/ata/libata-transport.c | 45 +++++++
drivers/ata/libata.h | 2
include/linux/ata.h | 1
include/linux/bio.h | 69 ++++++++++-
include/linux/ioprio.h | 32 ++++-
include/linux/libata.h | 4 +
include/linux/page-flags.h | 24 ++++
mm/debug.c | 5 +
mm/filemap.c | 18 +++
19 files changed, 561 insertions(+), 49 deletions(-)
create mode 100644 drivers/ata/libata-hybrid.c
create mode 100644 drivers/ata/libata-hybrid.h
--
Thanks,
jba
^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC PATCH 1/5] block, ioprio: include caching advice via ionice
2014-10-29 18:23 [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Jason B. Akers
@ 2014-10-29 18:23 ` Jason B. Akers
2014-10-29 19:02 ` Jeff Moyer
2014-10-29 18:23 ` [RFC PATCH 2/5] block: ioprio hint to low-level device drivers Jason B. Akers
` (5 subsequent siblings)
6 siblings, 1 reply; 27+ messages in thread
From: Jason B. Akers @ 2014-10-29 18:23 UTC (permalink / raw)
To: linux-ide; +Cc: axboe, kapil.karkra, dan.j.williams, linux-kernel
From: Dan Williams <dan.j.williams@intel.com>
Steal one unused bit from the priority class and two bits from the
priority data, to implement a 3 bit cache-advice field. Similar to the
page cache advice from fadvise() these hints are meant to be consumed
by hybrid drives. Solid State Hybrid Drives, as defined by the SATA-IO
Specification, implement up to a 4-bit cache priority that can be
specified along with an FPDMA command.
IOPRIO_ADV_NONE: default if ionice hint is not provided
IOPRIO_ADV_EVICT: indicate that if the LBAs associated with
this command are in the cache, write them back
and invalidate.
IOPRIO_ADV_DONTNEED: caching this data has little value, but no
need to actively evict
IOPRIO_ADV_NORMAL: perform best-effort / device-default caching
IOPRIO_ADV_RESERVED1: reserved for future use, potentially
IOPRIO_ADV_RESERVED2: permit the kernel to use these for
IOPRIO_ADV_RESERVED3: internal cache priorities, but userspace
owns highest priority override
IOPRIO_ADV_WILLNEED: cache this data at the highest possible priority
The expectation is that a table in the driver is responsible for
translating this advice into a transport/device-specific priority value.
Signed-off-by: Kapil Karkra <kapil.karkra@intel.com>
Signed-off-by: Jason B. Akers <jason.b.akers@intel.com>
---
include/linux/ioprio.h | 32 ++++++++++++++++++++++++++++----
1 file changed, 28 insertions(+), 4 deletions(-)
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index beb9ce1..752813d 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -5,17 +5,27 @@
#include <linux/iocontext.h>
/*
- * Gives us 8 prio classes with 13-bits of data for each class
+ * Gives us 4 prio classes with 11-bits of data for each class
+ * ...additionally a prio can indicate one of 7 cacheability hints
*/
#define IOPRIO_BITS (16)
+#define IOPRIO_CACHE_SHIFT (15) /* msb of the cache-advice mask */
#define IOPRIO_CLASS_SHIFT (13)
-#define IOPRIO_PRIO_MASK ((1UL << IOPRIO_CLASS_SHIFT) - 1)
+#define IOPRIO_ADV_SHIFT (11)
+#define IOPRIO_PRIO_MASK ((1UL << IOPRIO_ADV_SHIFT) - 1)
-#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
+#define IOPRIO_PRIO_CLASS(mask) (((mask) >> IOPRIO_CLASS_SHIFT) & 3)
#define IOPRIO_PRIO_DATA(mask) ((mask) & IOPRIO_PRIO_MASK)
+#define IOPRIO_ADVICE(mask) ((((mask) >> IOPRIO_ADV_SHIFT) & 3) | \
+ (((mask) >> IOPRIO_CACHE_SHIFT & 1) << 2))
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | data)
+#define IOPRIO_ADVISE(class, data, advice) \
+ ((IOPRIO_PRIO_VALUE(class, data) | ((advice) & 3) << IOPRIO_ADV_SHIFT)\
+ | (((advice) & 4) << (IOPRIO_CACHE_SHIFT - 2)))
-#define ioprio_valid(mask) (IOPRIO_PRIO_CLASS((mask)) != IOPRIO_CLASS_NONE)
+#define ioprio_valid(mask) (IOPRIO_PRIO_CLASS((mask)) != \
+ IOPRIO_CLASS_NONE)
+#define ioprio_advice_valid(mask) (IOPRIO_ADVICE(mask) != IOPRIO_ADV_NONE)
/*
* These are the io priority groups as implemented by CFQ. RT is the realtime
@@ -31,6 +41,20 @@ enum {
};
/*
+ * Four cacheability hints that map to their fadvise(2) equivalents
+ */
+enum {
+ IOPRIO_ADV_NONE,
+ IOPRIO_ADV_EVICT, /* actively discard cached data */
+ IOPRIO_ADV_DONTNEED, /* caching this data has little value */
+ IOPRIO_ADV_NORMAL, /* best-effort / device-default cache priority */
+ IOPRIO_ADV_RESERVED1, /* reserved for future use */
+ IOPRIO_ADV_RESERVED2,
+ IOPRIO_ADV_RESERVED3,
+ IOPRIO_ADV_WILLNEED, /* high temporal locality or cache valuable */
+};
+
+/*
* 8 best effort priority levels are supported
*/
#define IOPRIO_BE_NR (8)
* [RFC PATCH 2/5] block: ioprio hint to low-level device drivers
2014-10-29 18:23 [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Jason B. Akers
2014-10-29 18:23 ` [RFC PATCH 1/5] block, ioprio: include caching advice via ionice Jason B. Akers
@ 2014-10-29 18:23 ` Jason B. Akers
2014-10-29 18:23 ` [RFC PATCH 3/5] block: untangle ioprio from BLK_CGROUP and BLK_DEV_THROTTLING Jason B. Akers
` (4 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: Jason B. Akers @ 2014-10-29 18:23 UTC (permalink / raw)
To: linux-ide; +Cc: axboe, dan.j.williams, kapil.karkra, linux-kernel
From: Dan Williams <dan.j.williams@intel.com>
The priority in the io_context is consumed by the io scheduler. For
caching advice we need the request->ioprio field to be up-to-date. Set
the bio ioprio at submit time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason B. Akers <jason.b.akers@intel.com>
---
block/bio.c | 1 +
block/ioprio.c | 22 ++++++++++++++++------
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 3e6e198..e133b5c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2009,6 +2009,7 @@ int bio_associate_current(struct bio *bio)
/* acquire active ref on @ioc and associate */
get_io_context_active(ioc);
bio->bi_ioc = ioc;
+ bio_set_prio(bio, ioprio_best(ioc->ioprio, bio_prio(bio)));
/* associate blkcg if exists */
rcu_read_lock();
diff --git a/block/ioprio.c b/block/ioprio.c
index e50170c..fec1202 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -159,18 +159,28 @@ int ioprio_best(unsigned short aprio, unsigned short bprio)
{
unsigned short aclass = IOPRIO_PRIO_CLASS(aprio);
unsigned short bclass = IOPRIO_PRIO_CLASS(bprio);
+ unsigned short class, data, advice;
if (aclass == IOPRIO_CLASS_NONE)
aclass = IOPRIO_CLASS_BE;
if (bclass == IOPRIO_CLASS_NONE)
bclass = IOPRIO_CLASS_BE;
- if (aclass == bclass)
- return min(aprio, bprio);
- if (aclass > bclass)
- return bprio;
- else
- return aprio;
+ /* best priority */
+ if (aclass == bclass) {
+ class = aclass;
+ data = min(IOPRIO_PRIO_DATA(aprio), IOPRIO_PRIO_DATA(bprio));
+ } else if (aclass > bclass) {
+ class = bclass;
+ data = IOPRIO_PRIO_DATA(bprio);
+ } else {
+ class = aclass;
+ data = IOPRIO_PRIO_DATA(aprio);
+ }
+
+ /* best cache advice, assumes invalid advice is zero */
+ advice = max(IOPRIO_ADVICE(aprio), IOPRIO_ADVICE(bprio));
+ return IOPRIO_ADVISE(class, data, advice);
}
SYSCALL_DEFINE2(ioprio_get, int, which, int, who)
* [RFC PATCH 3/5] block: untangle ioprio from BLK_CGROUP and BLK_DEV_THROTTLING
2014-10-29 18:23 [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Jason B. Akers
2014-10-29 18:23 ` [RFC PATCH 1/5] block, ioprio: include caching advice via ionice Jason B. Akers
2014-10-29 18:23 ` [RFC PATCH 2/5] block: ioprio hint to low-level device drivers Jason B. Akers
@ 2014-10-29 18:23 ` Jason B. Akers
2014-10-29 18:24 ` [RFC PATCH 4/5] block, mm: Added the necessary plumbing to take ioprio hints down to block layer Jason B. Akers
` (3 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: Jason B. Akers @ 2014-10-29 18:23 UTC (permalink / raw)
To: linux-ide; +Cc: axboe, dan.j.williams, kapil.karkra, linux-kernel
From: Dan Williams <dan.j.williams@intel.com>
If BLK_CGROUP is disabled, still enable ionice to set advice on bios.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason B. Akers <jason.b.akers@intel.com>
---
block/bio.c | 43 +++++++++++---------------------
block/blk.h | 2 ++
include/linux/bio.h | 68 +++++++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 79 insertions(+), 34 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index e133b5c..b93ae04 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1980,7 +1980,6 @@ struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_
}
EXPORT_SYMBOL(bioset_create_nobvec);
-#ifdef CONFIG_BLK_CGROUP
/**
* bio_associate_current - associate a bio with %current
* @bio: target bio
@@ -1989,34 +1988,28 @@ EXPORT_SYMBOL(bioset_create_nobvec);
* layer will treat @bio as if it were issued by %current no matter which
* task actually issues it.
*
- * This function takes an extra reference of @task's io_context and blkcg
- * which will be put when @bio is released. The caller must own @bio,
- * ensure %current->io_context exists, and is responsible for synchronizing
- * calls to this function.
+ * When BLK_CGROUP=y this function takes an extra reference of @task's
+ * io_context and blkcg which will be put when @bio is released. The caller
+ * must own @bio, ensure %current->io_context exists, and is responsible for
+ * synchronizing calls to this function.
+ *
+ * When BLK_CGROUP=n this function simply sets the bio priority and cache advice
*/
int bio_associate_current(struct bio *bio)
{
struct io_context *ioc;
- struct cgroup_subsys_state *css;
-
- if (bio->bi_ioc)
- return -EBUSY;
+ int rc;
ioc = current->io_context;
if (!ioc)
return -ENOENT;
- /* acquire active ref on @ioc and associate */
- get_io_context_active(ioc);
- bio->bi_ioc = ioc;
- bio_set_prio(bio, ioprio_best(ioc->ioprio, bio_prio(bio)));
+ rc = bio_associate_ioc(bio, ioc);
+ if (rc)
+ return rc;
- /* associate blkcg if exists */
- rcu_read_lock();
- css = task_css(current, blkio_cgrp_id);
- if (css && css_tryget_online(css))
- bio->bi_css = css;
- rcu_read_unlock();
+ bio_associate_blkcg(bio, current);
+ bio_set_prio(bio, ioprio_best(ioc->ioprio, bio_prio(bio)));
return 0;
}
@@ -2027,18 +2020,10 @@ int bio_associate_current(struct bio *bio)
*/
void bio_disassociate_task(struct bio *bio)
{
- if (bio->bi_ioc) {
- put_io_context(bio->bi_ioc);
- bio->bi_ioc = NULL;
- }
- if (bio->bi_css) {
- css_put(bio->bi_css);
- bio->bi_css = NULL;
- }
+ bio_disassociate_ioc(bio);
+ bio_disassociate_blkcg(bio);
}
-#endif /* CONFIG_BLK_CGROUP */
-
static void __init biovec_init_slabs(void)
{
int i;
diff --git a/block/blk.h b/block/blk.h
index 43b0361..6d7c4df 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -274,6 +274,8 @@ extern void blk_throtl_exit(struct request_queue *q);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
{
+ /* set prio, but don't throttle */
+ bio_associate_current(bio);
return false;
}
static inline void blk_throtl_drain(struct request_queue *q) { }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7347f48..8419319 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -22,6 +22,7 @@
#include <linux/highmem.h>
#include <linux/mempool.h>
+#include <linux/cgroup.h>
#include <linux/ioprio.h>
#include <linux/bug.h>
@@ -469,13 +470,70 @@ extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);
-#ifdef CONFIG_BLK_CGROUP
int bio_associate_current(struct bio *bio);
void bio_disassociate_task(struct bio *bio);
-#else /* CONFIG_BLK_CGROUP */
-static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
-static inline void bio_disassociate_task(struct bio *bio) { }
-#endif /* CONFIG_BLK_CGROUP */
+
+#ifdef CONFIG_BLK_CGROUP
+static inline int bio_associate_ioc(struct bio *bio, struct io_context *ioc)
+{
+ if (bio->bi_ioc)
+ return -EBUSY;
+
+ /* acquire active ref on @ioc and associate */
+ get_io_context_active(ioc);
+ bio->bi_ioc = ioc;
+
+ return 0;
+}
+
+static inline void bio_associate_blkcg(struct bio *bio,
+ struct task_struct *task)
+{
+ struct cgroup_subsys_state *css;
+
+ /* associate blkcg if exists */
+ rcu_read_lock();
+ css = task_css(task, blkio_cgrp_id);
+ if (css && css_tryget(css))
+ bio->bi_css = css;
+ rcu_read_unlock();
+}
+
+static inline void bio_disassociate_ioc(struct bio *bio)
+{
+ if (bio->bi_ioc) {
+ put_io_context(bio->bi_ioc);
+ bio->bi_ioc = NULL;
+ }
+}
+
+static inline void bio_disassociate_blkcg(struct bio *bio)
+{
+ if (bio->bi_css) {
+ css_put(bio->bi_css);
+ bio->bi_css = NULL;
+ }
+}
+#else
+static inline int bio_associate_ioc(struct bio *bio, struct io_context *ioc)
+{
+ return 0;
+}
+
+static inline void bio_associate_blkcg(struct bio *bio,
+ struct task_struct *task)
+{
+}
+
+static inline void bio_disassociate_ioc(struct bio *bio)
+{
+}
+
+static inline void bio_disassociate_blkcg(struct bio *bio)
+{
+}
+
+#endif
#ifdef CONFIG_HIGHMEM
/*
* [RFC PATCH 4/5] block, mm: Added the necessary plumbing to take ioprio hints down to block layer
2014-10-29 18:23 [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Jason B. Akers
` (2 preceding siblings ...)
2014-10-29 18:23 ` [RFC PATCH 3/5] block: untangle ioprio from BLK_CGROUP and BLK_DEV_THROTTLING Jason B. Akers
@ 2014-10-29 18:24 ` Jason B. Akers
2014-10-29 18:24 ` [RFC PATCH 5/5] libata: Enabling Solid State Hybrid Drives (SSHDs) based on SATA 3.2 standard Jason B. Akers
` (2 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: Jason B. Akers @ 2014-10-29 18:24 UTC (permalink / raw)
To: linux-ide; +Cc: axboe, kapil.karkra, dan.j.williams, linux-kernel
From: Kapil Karkra <kapil.karkra@intel.com>
Add the necessary plumbing to take the ioprio hints down to the block
layer, from where they flow further down into libata. For reads and
direct IO, bio_associate_ioprio() (invoked from blk_throtl_bio()) copies
the ioprio from the current io context into the bio in the submit_bio
context. For lazy writes, 3 bits of the page flags are used to record
the ioprio advice in every page associated with a particular IO. Since
page flags are scarce, this is enabled only on 64-bit platforms. The
advice is taken from the current io context and stored into each page in
grab_cache_page_write_begin(). bio_associate_ioprio() then walks the
pages of a bio and determines the overall best priority to associate
with it; the bio carries the io priority further down the IO stack.
Signed-off-by: Kapil Karkra <kapil.karkra@intel.com>
Signed-off-by: Jason B. Akers <jason.b.akers@intel.com>
---
block/bio.c | 34 ++++++++++++++++++++++++++++++++++
block/blk-throttle.c | 5 +++++
include/linux/bio.h | 1 +
include/linux/page-flags.h | 24 ++++++++++++++++++++++++
mm/debug.c | 5 +++++
mm/filemap.c | 18 ++++++++++++++++++
6 files changed, 87 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index b93ae04..cc5cc64 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1965,6 +1965,40 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
}
EXPORT_SYMBOL(bioset_create);
+int bio_associate_ioprio(struct bio *bio)
+{
+ struct io_context *ioc;
+ struct bio_vec bv;
+ struct bvec_iter iter;
+ int max_ioprio = 0; /* init max_ioprio to 0 (invalid) */
+ int advice, ioprio;
+
+ ioc = current->io_context;
+ if (!ioc)
+ return -ENOENT;
+
+ /* scan the bio_vecs for this bio and get the highest
+ * ioprio to use for current
+ */
+ bio_for_each_segment(bv, bio, iter) {
+ advice = PageGetAdvice(bv.bv_page);
+ ioprio = IOPRIO_ADVISE(0, 0, advice);
+ if (ioprio_advice_valid(ioprio))
+ max_ioprio = ioprio_best(ioprio, max_ioprio);
+ }
+
+ /* set max priority found in all bio_vecs */
+ bio_set_prio(bio, max_ioprio);
+
+ /* acquire active ref on @ioc and associate
+ * also handles the read case
+ */
+ bio_associate_ioc(bio,ioc);
+ bio_set_prio(bio, ioprio_best(ioc->ioprio, max_ioprio));
+
+ return 0;
+}
+
/**
* bioset_create_nobvec - Create a bio_set without bio_vec mempool
* @pool_size: Number of bio to cache in the mempool
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9273d09..abc33a5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1484,6 +1484,11 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
struct blkcg *blkcg;
bool throttled = false;
+ /* associate the best ioprio to the bio */
+ spin_lock_irq(q->queue_lock);
+ bio_associate_ioprio(bio);
+ spin_unlock_irq(q->queue_lock);
+
/* see throtl_charge_bio() */
if (bio->bi_rw & REQ_THROTTLED)
goto out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8419319..4747c78 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -470,6 +470,7 @@ extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);
+int bio_associate_ioprio(struct bio *bio);
int bio_associate_current(struct bio *bio);
void bio_disassociate_task(struct bio *bio);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e1f5fcd..8811234 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,11 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+ PG_ioprio_advice_0, /* 3 flag bits store ioprio advice */
+ PG_ioprio_advice_1,
+ PG_ioprio_advice_2,
+#endif
__NR_PAGEFLAGS,
/* Filesystems */
@@ -370,6 +375,25 @@ static inline void ClearPageCompound(struct page *page)
#define PG_head_mask ((1L << PG_head))
+/*
+ * ioprio advise is recorded here
+ */
+static inline void PageSetAdvice(struct page *page, unsigned int advice)
+{
+ page->flags = (page->flags |
+ ((((advice >> 0) & 1) << PG_ioprio_advice_0) |
+ (((advice >> 1) & 1) << PG_ioprio_advice_1) |
+ (((advice >> 2) & 1) << PG_ioprio_advice_2)));
+}
+
+static inline int PageGetAdvice(struct page *page)
+{
+ unsigned int advice = (((page->flags >> PG_ioprio_advice_0) & 1) |
+ (((page->flags >> PG_ioprio_advice_1) & 1) << 1) |
+ (((page->flags >> PG_ioprio_advice_2) & 1) << 2));
+ return advice;
+}
+
#else
/*
* Reduce page flag use as much as possible by overlapping
diff --git a/mm/debug.c b/mm/debug.c
index 5ce45c9..c785b06 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,11 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+ {1UL << PG_ioprio_advice_0, "ioprio_adv0" },
+ {1UL << PG_ioprio_advice_1, "ioprio_adv1" },
+ {1UL << PG_ioprio_advice_2, "ioprio_adv2" },
+#endif
};
static void dump_flags(unsigned long flags,
diff --git a/mm/filemap.c b/mm/filemap.c
index 14b4642..f82529d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2438,6 +2438,9 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
{
struct page *page;
int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
+ struct io_context *ioc;
+ int advice;
+ int ioprio;
if (flags & AOP_FLAG_NOFS)
fgp_flags |= FGP_NOFS;
@@ -2448,6 +2451,21 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
if (page)
wait_for_stable_page(page);
+ /* store the ioprio into the page flags */
+ if (current && current->io_context) {
+ ioc = current->io_context;
+ advice = PageGetAdvice(page);
+ ioprio = IOPRIO_ADVISE(0, 0, advice);
+ if (ioprio_advice_valid(ioc->ioprio)) {
+ if (ioprio_advice_valid(ioprio))
+ ioprio = ioprio_best(ioprio, ioc->ioprio);
+ else
+ ioprio = ioc->ioprio;
+
+ PageSetAdvice(page, IOPRIO_ADVICE(ioprio));
+ }
+ }
+
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
* [RFC PATCH 5/5] libata: Enabling Solid State Hybrid Drives (SSHDs) based on SATA 3.2 standard
2014-10-29 18:23 [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Jason B. Akers
` (3 preceding siblings ...)
2014-10-29 18:24 ` [RFC PATCH 4/5] block, mm: Added the necessary plumbing to take ioprio hints down to block layer Jason B. Akers
@ 2014-10-29 18:24 ` Jason B. Akers
2014-10-29 20:14 ` [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Dave Chinner
2014-10-30 2:05 ` Martin K. Petersen
6 siblings, 0 replies; 27+ messages in thread
From: Jason B. Akers @ 2014-10-29 18:24 UTC (permalink / raw)
To: linux-ide; +Cc: axboe, kapil.karkra, dan.j.williams, linux-kernel
From: Jason B. Akers <jason.b.akers@intel.com>
Augment libata to add support for SSHDs--hard disks with a
small embedded NAND memory in them. The hybrid information feature is
part of SATA standard 3.2 and specifies a way for host drivers to
pass hints to the drives over the SATA interface to guide the placement
of data on either the NAND or the spindle. A new module, libata-hybrid,
adds the methods to initialize the feature and translate the ioprio to
a SATA hybrid hint according to a default table. This default table can
be overridden with a new table through the firmware update mechanism.
The new table can be applied on a per-ata_device basis. The reasoning
is that the translation can be optimized for Seagate SSHDs differently
from Western Digital SSHDs.
The feature remains disabled by default. The reason is that there are
two types of SSHDs: self-hinted and host-hinted. Both Seagate and
Western Digital make low-end self-hinted SSHDs that place data on the
embedded NAND or spindle based on the LBA hit rate determined by
their drive firmware. The higher-end (>16GB NAND) SSHDs provide hosts
with a mechanism to provide hints to the SSHDs. Also note that 25
consecutive power cycles without any hints in the SATA FISes will cause
the SSHD to turn off host hinting and switch back to self-hinting mode.
This patch issues a GET_HYBRID_LOG at start to prevent that from
happening. When the feature is enabled, every IO is tagged by default
with the MAX-1 priority and the SSHD simply implements an LRU policy.
Running a specific application under ionice emphasizes or de-emphasizes
its hybrid hints.
The current implementation toggles enable/disable when anything is
echoed to /sys/class/ata_device/devX.0/hybrid.
Signed-off-by: Kapil Karkra <kapil.karkra@intel.com>
Signed-off-by: Jason B. Akers <jason.b.akers@intel.com>
---
drivers/ata/Makefile | 2
drivers/ata/libata-core.c | 13 ++
drivers/ata/libata-eh.c | 11 ++
drivers/ata/libata-hybrid.c | 261 ++++++++++++++++++++++++++++++++++++++++
drivers/ata/libata-hybrid.h | 14 ++
drivers/ata/libata-scsi.c | 4 -
drivers/ata/libata-transport.c | 45 +++++++
drivers/ata/libata.h | 2
include/linux/ata.h | 1
include/linux/libata.h | 4 +
10 files changed, 351 insertions(+), 6 deletions(-)
create mode 100644 drivers/ata/libata-hybrid.c
create mode 100644 drivers/ata/libata-hybrid.h
diff --git a/drivers/ata/Makefile b/drivers/ata/Makefile
index ae41107..0caf0a2 100644
--- a/drivers/ata/Makefile
+++ b/drivers/ata/Makefile
@@ -111,7 +111,7 @@ obj-$(CONFIG_ATA_GENERIC) += ata_generic.o
# Should be last libata driver
obj-$(CONFIG_PATA_LEGACY) += pata_legacy.o
-libata-y := libata-core.o libata-scsi.o libata-eh.o libata-transport.o
+libata-y := libata-core.o libata-scsi.o libata-eh.o libata-transport.o libata-hybrid.o
libata-$(CONFIG_ATA_SFF) += libata-sff.o
libata-$(CONFIG_SATA_PMP) += libata-pmp.o
libata-$(CONFIG_ATA_ACPI) += libata-acpi.o
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c5ba15a..582acfb 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -72,6 +72,7 @@
#include "libata.h"
#include "libata-transport.h"
+#include "libata-hybrid.h"
/* debounce timing parameters in msecs { interval, duration, timeout } */
const unsigned long sata_deb_timing_normal[] = { 5, 100, 2000 };
@@ -747,7 +748,7 @@ u64 ata_tf_read_block(struct ata_taskfile *tf, struct ata_device *dev)
*/
int ata_build_rw_tf(struct ata_taskfile *tf, struct ata_device *dev,
u64 block, u32 n_block, unsigned int tf_flags,
- unsigned int tag)
+ unsigned int tag, unsigned int ata_hybrid_hint)
{
tf->flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
tf->flags |= tf_flags;
@@ -775,6 +776,7 @@ int ata_build_rw_tf(struct ata_taskfile *tf, struct ata_device *dev,
tf->lbah = (block >> 16) & 0xff;
tf->lbam = (block >> 8) & 0xff;
tf->lbal = block & 0xff;
+ tf->auxiliary |= (ata_hybrid_hint << 16);
tf->device = ATA_LBA;
if (tf->flags & ATA_TFLAG_FUA)
@@ -2389,6 +2391,12 @@ int ata_dev_configure(struct ata_device *dev)
}
}
+ if (ata_id_has_hybrid_cap(dev->id)) {
+ dev->hybrid_cap = true;
+ initialize_hybrid_drive(dev);
+ } else
+ dev->hybrid_cap = false;
+
dev->cdb_len = 16;
}
@@ -4494,7 +4502,8 @@ unsigned int ata_dev_set_feature(struct ata_device *dev, u8 enable, u8 feature)
unsigned int err_mask;
/* set up set-features taskfile */
- DPRINTK("set features - SATA features\n");
+ DPRINTK("set features port:%d enable:0x%X feature:0x%X\n",
+ dev->link->ap->print_id, enable, feature);
ata_tf_init(dev, &tf);
tf.command = ATA_CMD_SET_FEATURES;
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index dad83df..ecf3503 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -47,6 +47,7 @@
#include <linux/libata.h>
#include "libata.h"
+#include "libata-hybrid.h"
enum {
/* speed down verdicts */
@@ -3096,6 +3097,16 @@ static int ata_eh_revalidate_and_attach(struct ata_link *link,
if (ehc->i.flags & ATA_EHI_DID_RESET)
readid_flags |= ATA_READID_POSTRESET;
+ /* enable host hints or self-pinning depending on ehi flag */
+ if (ehc->i.flags & ATA_EHI_SET_HYBRID &&
+ dev->hybrid_cap && !dev->hybrid_en)
+ set_hybrid_enabled(dev, 1, 0);
+ else if (ehc->i.flags & ATA_EHI_SET_HYBRID &&
+ dev->hybrid_cap && dev->hybrid_en)
+ set_hybrid_enabled(dev, 0, 0);
+ pr_info("ehc iflags = 0x%X, hybrid_cap = %d, hybrid_en = %d\n",
+ ehc->i.flags, dev->hybrid_cap, dev->hybrid_en);
+
if ((action & ATA_EH_REVALIDATE) && ata_dev_enabled(dev)) {
WARN_ON(dev->class == ATA_DEV_PMP);
diff --git a/drivers/ata/libata-hybrid.c b/drivers/ata/libata-hybrid.c
new file mode 100644
index 0000000..86ee66c
--- /dev/null
+++ b/drivers/ata/libata-hybrid.c
@@ -0,0 +1,261 @@
+/*
+ * Copyright(c) 2013-2014 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+/*
+ * Libata Hybrid Information Feature from SATA standard 3.2.
+ *
+ * This file contains the methods and associated data that augment libata to
+ * support hybrid information feature as defined by the SATA standard revision
+ * 3.2. This feature enables the libata to hint the Solid State Hybrid Drives
+ * (SSHDs) that have a small embedded NAND and a large magnetic media. The hint
+ * conveys to the SSHD what data to place on the NAND portion and what to place
+ * on the spindle. Accurate hinting will realize the potential of SSHDs by
+ * giving SSD-like performance and a hard disk-like capacity at a $5 adder to
+ * the hard drive cost.
+ *
+ * At initialization, initialize_hybrid_drive() is invoked which, based on
+ * the availability of the hybrid feature, loads a translation table for the
+ * solid state hybrid drive (SSHD)--either the default or one supplied via the
+ * update_firmware mechanism. This is done once per ata_device because this
+ * translation could be different based on the performance/power
+ * characteristics of the SSHD.
+ *
+ * Hybrid information feature can be enabled or disabled via sysfs. The methods
+ * in this module eventually perform the enable/disable at the ata_device level
+ *
+ * The only method that invoked in the IO path is get_ata_hybrid_hint. This
+ * method references a look-up table to plug in the hybrid hint
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/timer.h>
+#include <linux/interrupt.h>
+#include <linux/completion.h>
+#include <linux/suspend.h>
+#include <linux/workqueue.h>
+#include <linux/scatterlist.h>
+#include <linux/io.h>
+#include <linux/ioprio.h>
+#include <linux/async.h>
+#include <linux/log2.h>
+#include <linux/slab.h>
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_device.h>
+#include <linux/libata.h>
+#include <asm/byteorder.h>
+#include <linux/cdrom.h>
+#include <linux/ratelimit.h>
+#include <linux/pm_runtime.h>
+#include <linux/platform_device.h>
+#include <linux/firmware.h>
+
+#include "libata.h"
+#include "libata-transport.h"
+#include "libata-hybrid.h"
+
+static unsigned int max_priority;
+
+struct hybrid_info_translation {
+ unsigned char signature[5];
+ unsigned char dont_disturb;
+ unsigned char normal;
+ unsigned char will_need;
+ unsigned char dont_need;
+ unsigned char evict;
+};
+
+static struct hybrid_info_translation translation = {
+ {"sshd"}, 0x00, 0x21, 0x21, 0x20, 0x20};
+
+unsigned int get_ata_hybrid_hint(struct ata_queued_cmd *qc)
+{
+ unsigned int ioprio = qc->scsicmd->request->ioprio;
+ unsigned int ata_hybrid_hint = translation.normal;
+
+ if (ioprio_advice_valid(ioprio)) {
+ switch (IOPRIO_ADVICE(ioprio)) {
+ default:
+ break;
+ case IOPRIO_ADV_EVICT:
+ ata_hybrid_hint = translation.evict;
+ break;
+ case IOPRIO_ADV_DONTNEED:
+ ata_hybrid_hint = translation.dont_need;
+ break;
+ case IOPRIO_ADV_NORMAL:
+ ata_hybrid_hint = translation.normal;
+ break;
+ case IOPRIO_ADV_WILLNEED:
+ ata_hybrid_hint = translation.will_need;
+ break;
+ }
+ }
+
+ return ata_hybrid_hint;
+}
+
+unsigned int update_hybrid_info_translation_table(struct ata_device *dev)
+{
+ int i = 0;
+ const struct firmware *fw = NULL;
+ struct hybrid_info_translation *new_translation;
+ unsigned char *data;
+
+ i = request_firmware(&fw, HYBRID_INFO_TABLE_NAME, &dev->tdev);
+ if (i < 0) {
+		pr_debug("%s: request_firmware failed %d\n", __func__, i);
+ release_firmware(fw);
+ return 1;
+ }
+
+	new_translation = (struct hybrid_info_translation *)fw->data;
+	if (fw->size < sizeof(*new_translation) ||
+	    strncmp((const char *)new_translation->signature, "sshd", 4)) {
+		pr_warn("%s: bad or truncated firmware image\n", __func__);
+ release_firmware(fw);
+ return 1;
+ }
+
+ pr_info("old table={%x, %x, %x, %x, %x}\n",
+ translation.dont_disturb,
+ translation.normal,
+ translation.will_need,
+ translation.dont_need,
+ translation.evict);
+
+	/* mask a copy; fw->data is const and must not be written */
+	translation = *new_translation;
+	data = (unsigned char *)&translation.dont_disturb;
+
+	for (i = 0; i < 5; i++)
+		data[i] &= (0x20 | max_priority);
+
+	pr_info("new table={%x, %x, %x, %x, %x}\n",
+		translation.dont_disturb,
+		translation.normal,
+		translation.will_need,
+		translation.dont_need,
+		translation.evict);
+	release_firmware(fw);
+
+ return 0;
+}
+
+unsigned int initialize_hybrid_drive(struct ata_device *dev)
+{
+ struct ata_port *ap = dev->link->ap;
+ u8 *inBuff = ap->sector_buf;
+ int k;
+ unsigned int err_mask;
+ int nvmSize;
+
+ err_mask = ata_read_log_page(dev,
+ 0x14,
+ 0,
+ inBuff,
+ 1);
+ if (err_mask)
+ ata_dev_dbg(dev,
+ "failed to get Hybrid Info Log, Emask 0x%x\n",
+ err_mask);
+ else {
+ pr_info("SATA hybrid information log:\n");
+ k = 0;
+ pr_info("Number of Hybrid Descriptors=%d\n",
+ inBuff[k]&0xF);
+ k = 2;
+
+ if (inBuff[k])
+ dev->hybrid_en = true;
+ else
+ dev->hybrid_en = false;
+
+ pr_info("Enabled=%x\n", inBuff[k++]);
+ pr_info("Hybrid Health =%x\n", inBuff[k++]);
+ pr_info("Dirty Low Threshold =%x\n", inBuff[k++]);
+ pr_info("Dirty High Threshold =%x\n", inBuff[k++]);
+ pr_info("Optimal Write Granularity=%x\n", inBuff[k++]);
+
+ max_priority = inBuff[k]&0xF;
+
+ pr_info("Maximum Priority Level=%d\n", inBuff[k++]&0xF);
+ k = 16;
+ nvmSize = (inBuff[k] | (inBuff[k+1]<<8) |
+ (inBuff[k+2]<<16) | (inBuff[k+3]<<24));
+		pr_info("NVM Size=%x (%d GiB)\n",
+			nvmSize, nvmSize/(2*1024*1024));
+
+ for (k = 64; k < 512; k += 16) {
+ if (inBuff[k]) {
+ pr_info("Hybrid Priority=%d\n",
+ inBuff[k]);
+ pr_info("Consumed NVM Size Fraction=%x\n",
+ inBuff[k+1]);
+ pr_info("Consumed Map Res Fraction=%x\n",
+ inBuff[k+2]);
+ pr_info("Consumed Siz Drty Fraction=%x\n",
+ inBuff[k+3]);
+ pr_info("Consumed Map Res Drty Frtn=%x\n",
+ inBuff[k+4]);
+ }
+ }
+ }
+
+
+ /* Keep the feature disabled but update the translation table from the
+ * user space
+ */
+ return update_hybrid_info_translation_table(dev);
+}
+
+unsigned int hybrid_is_enabled(struct ata_device *dev)
+{
+ return dev->hybrid_en;
+}
+
+unsigned int set_hybrid_enabled(struct ata_device *dev, unsigned int value,
+ unsigned int count)
+{
+ if (value == 1) {
+ /*enable the hybrid information feature*/
+ if (!ata_dev_set_feature(dev, 0x10, 0xA))
+ dev->hybrid_en = true;
+ else
+ pr_warn("%s: Failed to set hybrid feature\n",
+ __func__);
+ dev->link->eh_info.flags &= ~ATA_EHI_SET_HYBRID;
+ return 0;
+ } else if (value == 0) {
+ /*disable the hybrid information feature*/
+ if (!ata_dev_set_feature(dev, 0x90, 0xA))
+ dev->hybrid_en = false;
+ else
+ pr_warn("%s: Failed to set hybrid feature\n",
+ __func__);
+
+ dev->link->eh_info.flags &= ~ATA_EHI_SET_HYBRID;
+ return 0;
+ }
+ return -EINVAL;
+}
diff --git a/drivers/ata/libata-hybrid.h b/drivers/ata/libata-hybrid.h
new file mode 100644
index 0000000..54cf2ae
--- /dev/null
+++ b/drivers/ata/libata-hybrid.h
@@ -0,0 +1,14 @@
+#ifndef _LIBATA_HYBRID_H
+#define _LIBATA_HYBRID_H
+
+#define HYBRID_INFO_TABLE_NAME "hybrid_information_table.bin"
+
+unsigned int get_ata_hybrid_hint(struct ata_queued_cmd *qc);
+unsigned int update_hybrid_info_translation_table(struct ata_device *dev);
+unsigned int initialize_hybrid_drive(struct ata_device *dev);
+unsigned int hybrid_is_enabled(struct ata_device *dev);
+unsigned int set_hybrid_enabled(struct ata_device *dev,
+ unsigned int value, unsigned int count);
+#define ata_id_has_hybrid_cap(id) ((id)[ATA_ID_FEATURE_SUPP] & (1 << 9))
+#define ata_id_has_hybrid_en(id) ((id)[ATA_ID_FEATURE_EN] & (1 << 9))
+#endif
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 0586f66..07f68ee 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -53,6 +53,7 @@
#include "libata.h"
#include "libata-transport.h"
+#include "libata-hybrid.h"
#define ATA_SCSI_RBUF_SIZE 4096
@@ -1170,6 +1171,7 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
blk_queue_flush_queueable(q, false);
dev->sdev = sdev;
+
return 0;
}
@@ -1721,7 +1723,7 @@ static unsigned int ata_scsi_rw_xlat(struct ata_queued_cmd *qc)
qc->nbytes = n_block * scmd->device->sector_size;
rc = ata_build_rw_tf(&qc->tf, qc->dev, block, n_block, tf_flags,
- qc->tag);
+ qc->tag, get_ata_hybrid_hint(qc));
if (likely(rc == 0))
return 0;
diff --git a/drivers/ata/libata-transport.c b/drivers/ata/libata-transport.c
index e37413228..7d4c221 100644
--- a/drivers/ata/libata-transport.c
+++ b/drivers/ata/libata-transport.c
@@ -36,10 +36,11 @@
#include "libata.h"
#include "libata-transport.h"
+#include "libata-hybrid.h"
#define ATA_PORT_ATTRS 3
#define ATA_LINK_ATTRS 3
-#define ATA_DEV_ATTRS 9
+#define ATA_DEV_ATTRS 10
struct scsi_transport_template;
struct scsi_transport_template *ata_scsi_transport_template;
@@ -559,6 +560,47 @@ show_ata_dev_gscr(struct device *dev,
static DEVICE_ATTR(gscr, S_IRUGO, show_ata_dev_gscr, NULL);
+static ssize_t
+show_ata_dev_hybrid(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct ata_device *ata_dev = transport_class_to_dev(dev);
+
+	return scnprintf(buf, PAGE_SIZE, "%d\n",
+			 hybrid_is_enabled(ata_dev));
+}
+
+static ssize_t
+store_ata_dev_hybrid(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct ata_device *ata_dev = transport_class_to_dev(dev);
+ struct ata_port *ata_port = ata_dev->link->ap;
+ unsigned long flags;
+ unsigned int value;
+
+ if (kstrtouint(buf, 0, &value) < 0)
+ return -EINVAL;
+
+	if (value != 0 && value != 1) {
+		pr_warn("invalid hybrid value: use 1 or 0\n");
+		return -EINVAL;
+	}
+
+	spin_lock_irqsave(ata_port->lock, flags);
+	ata_dev->link->eh_info.flags |= ATA_EHI_SET_HYBRID;
+
+	pr_info("set link eh_info flags: 0x%X\n",
+		ata_dev->link->eh_info.flags);
+ ata_port_schedule_eh(ata_port);
+
+ spin_unlock_irqrestore(ata_port->lock, flags);
+ return count;
+}
+
+static DEVICE_ATTR(hybrid, S_IRUGO | S_IWUSR, show_ata_dev_hybrid,
+ store_ata_dev_hybrid);
+
static DECLARE_TRANSPORT_CLASS(ata_dev_class,
"ata_device", NULL, NULL, NULL);
@@ -732,6 +774,7 @@ struct scsi_transport_template *ata_attach_transport(void)
SETUP_DEV_ATTRIBUTE(ering);
SETUP_DEV_ATTRIBUTE(id);
SETUP_DEV_ATTRIBUTE(gscr);
+	SETUP_TEMPLATE(dev_attrs, hybrid, S_IRUGO | S_IWUSR, 1);
BUG_ON(count > ATA_DEV_ATTRS);
i->dev_attrs[count] = NULL;
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 5f4e0cc..3713bb5 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -66,7 +66,7 @@ extern u64 ata_tf_to_lba48(const struct ata_taskfile *tf);
extern struct ata_queued_cmd *ata_qc_new_init(struct ata_device *dev);
extern int ata_build_rw_tf(struct ata_taskfile *tf, struct ata_device *dev,
u64 block, u32 n_block, unsigned int tf_flags,
- unsigned int tag);
+ unsigned int tag, unsigned int ata_hybrid_hint);
extern u64 ata_tf_read_block(struct ata_taskfile *tf, struct ata_device *dev);
extern unsigned ata_exec_internal(struct ata_device *dev,
struct ata_taskfile *tf, const u8 *cdb,
diff --git a/include/linux/ata.h b/include/linux/ata.h
index f2f4d8d..0f9d8b3 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -80,6 +80,7 @@ enum {
ATA_ID_SATA_CAPABILITY = 76,
ATA_ID_SATA_CAPABILITY_2 = 77,
ATA_ID_FEATURE_SUPP = 78,
+ ATA_ID_FEATURE_EN = 79,
ATA_ID_MAJOR_VER = 80,
ATA_ID_COMMAND_SET_1 = 82,
ATA_ID_COMMAND_SET_2 = 83,
diff --git a/include/linux/libata.h b/include/linux/libata.h
index bd5fefe..1c0d1cf 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -371,6 +371,7 @@ enum {
ATA_EHI_PRINTINFO = (1 << 18), /* print configuration info */
ATA_EHI_SETMODE = (1 << 19), /* configure transfer mode */
ATA_EHI_POST_SETMODE = (1 << 20), /* revalidating after setmode */
+ ATA_EHI_SET_HYBRID = (1 << 21), /* changing hybrid mode */
ATA_EHI_DID_RESET = ATA_EHI_DID_SOFTRESET | ATA_EHI_DID_HARDRESET,
@@ -716,6 +717,9 @@ struct ata_device {
int spdn_cnt;
/* ering is CLEAR_END, read comment above CLEAR_END */
struct ata_ering ering;
+
+ bool hybrid_cap; /* device capable of hybrid hints */
+ bool hybrid_en; /* host hints enabled on device */
};
/* Fields between ATA_DEVICE_CLEAR_BEGIN and ATA_DEVICE_CLEAR_END are
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/5] block, ioprio: include caching advice via ionice
2014-10-29 18:23 ` [RFC PATCH 1/5] block, ioprio: include caching advice via ionice Jason B. Akers
@ 2014-10-29 19:02 ` Jeff Moyer
2014-10-29 21:07 ` Dan Williams
0 siblings, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2014-10-29 19:02 UTC (permalink / raw)
To: Jason B. Akers
Cc: linux-ide, axboe, kapil.karkra, dan.j.williams, linux-kernel
"Jason B. Akers" <jason.b.akers@intel.com> writes:
> From: Dan Williams <dan.j.williams@intel.com>
>
> Steal one unused bit from the priority class and two bits from the
> priority data, to implement a 3 bit cache-advice field. Similar to the
> page cache advice from fadvise() these hints are meant to be consumed
> by hybrid drives. Solid State Hybrid Drives, as defined by the SATA-IO
> Specification, implement up to a 4-bit cache priority that can be
> specified along with a FPDMA command.
ionice is about setting an I/O scheduling class for a *process*. So,
unless I've missed something, this does not seem like the right
interface for passing I/O hints.
You mention fadvise hints, which sounds like a good fit (and madvise
would be equally interesting), but I don't see where you've wired them
up in this patch set. Did I miss it?
Cheers,
Jeff
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 18:23 [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Jason B. Akers
` (4 preceding siblings ...)
2014-10-29 18:24 ` [RFC PATCH 5/5] libata: Enabling Solid State Hybrid Drives (SSHDs) based on SATA 3.2 standard Jason B. Akers
@ 2014-10-29 20:14 ` Dave Chinner
2014-10-29 21:10 ` Jens Axboe
` (2 more replies)
2014-10-30 2:05 ` Martin K. Petersen
6 siblings, 3 replies; 27+ messages in thread
From: Dave Chinner @ 2014-10-29 20:14 UTC (permalink / raw)
To: Jason B. Akers
Cc: linux-ide, axboe, dan.j.williams, kapil.karkra, linux-kernel
On Wed, Oct 29, 2014 at 11:23:38AM -0700, Jason B. Akers wrote:
> The following series enables the use of Solid State hybrid drives
> ATA standard 3.2 defines the hybrid information feature, which provides a means for the host driver to provide hints to the SSHDs to guide what to place on the SSD/NAND portion and what to place on the magnetic media.
>
> This implementation allows user space applications to provide the cache hints to the kernel using the existing ionice syscall.
>
> An application can pass a priority number coding up bits 11, 12, and 15 of the ionice command to form a 3 bit field that encodes the following priorities:
> OPRIO_ADV_NONE,
> IOPRIO_ADV_EVICT, /* actively discard cached data */
> IOPRIO_ADV_DONTNEED, /* caching this data has little value */
> IOPRIO_ADV_NORMAL, /* best-effort cache priority (default) */
> IOPRIO_ADV_RESERVED1, /* reserved for future use */
> IOPRIO_ADV_RESERVED2,
> IOPRIO_ADV_RESERVED3,
> IOPRIO_ADV_WILLNEED, /* high temporal locality */
>
> For example the following commands from the user space will make dd IOs to be generated with a hint of IOPRIO_ADV_DONTNEED assuming the SSHD is /dev/sdc.
>
> ionice -c2 -n4096 dd if=/dev/zero of=/dev/sdc bs=1M count=1024
> ionice -c2 -n4096 dd if=/dev/sdc of=/dev/null bs=1M count=1024
This looks to be the wrong way to implement per-IO priority
information.
How does a filesystem make use of this to make sure its
metadata ends up with IOPRIO_ADV_WILLNEED to store frequently
accessed metadata in flash? Conversely, journal writes need to
be issued with IOPRIO_ADV_DONTNEED so they don't unnecessarily
consume flash space as they are never-read IOs...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 1/5] block, ioprio: include caching advice via ionice
2014-10-29 19:02 ` Jeff Moyer
@ 2014-10-29 21:07 ` Dan Williams
0 siblings, 0 replies; 27+ messages in thread
From: Dan Williams @ 2014-10-29 21:07 UTC (permalink / raw)
To: Jeff Moyer
Cc: Jason B. Akers, IDE/ATA development list, axboe, Karkra, Kapil,
linux-kernel@vger.kernel.org
On Wed, Oct 29, 2014 at 12:02 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> "Jason B. Akers" <jason.b.akers@intel.com> writes:
>
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Steal one unused bit from the priority class and two bits from the
>> priority data, to implement a 3 bit cache-advice field. Similar to the
>> page cache advice from fadvise() these hints are meant to be consumed
>> by hybrid drives. Solid State Hybrid Drives, as defined by the SATA-IO
>> Specification, implement up to a 4-bit cache priority that can be
>> specified along with a FPDMA command.
>
> ionice is about setting an I/O scheduling class for a *process*. So,
> unless I've missed something, this does not seem like the right
> interface for passing I/O hints.
>
> You mention fadvise hints, which sounds like a good fit (and madvise
> would be equally interesting), but I don't see where you've wired them
> up in this patch set. Did I miss it?
It turns out we didn't need it. It's straightforward to add, but I
think "80%" of the benefit can be had by just having a per-thread
cache priority. It's more powerful to say "any page cache page this
thread touches, or any direct i/o this thread does, goes down the
stack at the given priority".
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 20:14 ` [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Dave Chinner
@ 2014-10-29 21:10 ` Jens Axboe
2014-10-29 22:09 ` Dave Chinner
2014-10-29 21:11 ` Dan Williams
2014-12-03 15:25 ` Pavel Machek
2 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2014-10-29 21:10 UTC (permalink / raw)
To: Dave Chinner, Jason B. Akers
Cc: linux-ide, dan.j.williams, kapil.karkra, linux-kernel
On 10/29/2014 02:14 PM, Dave Chinner wrote:
> On Wed, Oct 29, 2014 at 11:23:38AM -0700, Jason B. Akers wrote:
>> The following series enables the use of Solid State hybrid drives
>> ATA standard 3.2 defines the hybrid information feature, which provides a means for the host driver to provide hints to the SSHDs to guide what to place on the SSD/NAND portion and what to place on the magnetic media.
>>
>> This implementation allows user space applications to provide the cache hints to the kernel using the existing ionice syscall.
>>
>> An application can pass a priority number coding up bits 11, 12, and 15 of the ionice command to form a 3 bit field that encodes the following priorities:
>> OPRIO_ADV_NONE,
>> IOPRIO_ADV_EVICT, /* actively discard cached data */
>> IOPRIO_ADV_DONTNEED, /* caching this data has little value */
>> IOPRIO_ADV_NORMAL, /* best-effort cache priority (default) */
>> IOPRIO_ADV_RESERVED1, /* reserved for future use */
>> IOPRIO_ADV_RESERVED2,
>> IOPRIO_ADV_RESERVED3,
>> IOPRIO_ADV_WILLNEED, /* high temporal locality */
>>
>> For example the following commands from the user space will make dd IOs to be generated with a hint of IOPRIO_ADV_DONTNEED assuming the SSHD is /dev/sdc.
>>
>> ionice -c2 -n4096 dd if=/dev/zero of=/dev/sdc bs=1M count=1024
>> ionice -c2 -n4096 dd if=/dev/sdc of=/dev/null bs=1M count=1024
>
> This looks to be the wrong way to implement per-IO priority
> information.
>
> How does a filesystem make use of this to make sure its
> metadata ends up with IOPRIO_ADV_WILLNEED to store frequently
> accessed metadata in flash? Conversely, journal writes need to
> be issued with IOPRIO_ADV_DONTNEED so they don't unnecessarily
> consume flash space as they are never-read IOs...
Not disagreeing that loading more into the io priority fields is a
bit... icky. I see why it's done, though, it requires the least amount
of plumbing.
As for the fs accessing this, the io nice fields are readily exposed
through the ->bi_rw setting. So while the above example uses ionice to
set a task io priority (that a bio will then inherit), nothing prevents
you from passing it in directly from the kernel.
--
Jens Axboe
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 20:14 ` [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Dave Chinner
2014-10-29 21:10 ` Jens Axboe
@ 2014-10-29 21:11 ` Dan Williams
2014-12-03 15:25 ` Pavel Machek
2 siblings, 0 replies; 27+ messages in thread
From: Dan Williams @ 2014-10-29 21:11 UTC (permalink / raw)
To: Dave Chinner
Cc: Jason B. Akers, IDE/ATA development list, axboe, Karkra, Kapil,
linux-kernel@vger.kernel.org
On Wed, Oct 29, 2014 at 1:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 29, 2014 at 11:23:38AM -0700, Jason B. Akers wrote:
>> The following series enables the use of Solid State hybrid drives
>> ATA standard 3.2 defines the hybrid information feature, which provides a means for the host driver to provide hints to the SSHDs to guide what to place on the SSD/NAND portion and what to place on the magnetic media.
>>
>> This implementation allows user space applications to provide the cache hints to the kernel using the existing ionice syscall.
>>
>> An application can pass a priority number coding up bits 11, 12, and 15 of the ionice command to form a 3 bit field that encodes the following priorities:
>> OPRIO_ADV_NONE,
>> IOPRIO_ADV_EVICT, /* actively discard cached data */
>> IOPRIO_ADV_DONTNEED, /* caching this data has little value */
>> IOPRIO_ADV_NORMAL, /* best-effort cache priority (default) */
>> IOPRIO_ADV_RESERVED1, /* reserved for future use */
>> IOPRIO_ADV_RESERVED2,
>> IOPRIO_ADV_RESERVED3,
>> IOPRIO_ADV_WILLNEED, /* high temporal locality */
>>
>> For example the following commands from the user space will make dd IOs to be generated with a hint of IOPRIO_ADV_DONTNEED assuming the SSHD is /dev/sdc.
>>
>> ionice -c2 -n4096 dd if=/dev/zero of=/dev/sdc bs=1M count=1024
>> ionice -c2 -n4096 dd if=/dev/sdc of=/dev/null bs=1M count=1024
>
> This looks to be the wrong way to implement per-IO priority
> information.
>
> How does a filesystem make use of this to make sure it's
> metadata ends up with IOPRIO_ADV_WILLNEED to store frequently
> accessed metadata in flash. Conversely, journal writes need to
> be issued with IOPRIO_ADV_DONTNEED so they don't unneceessarily
> consume flash space as they are never-read IOs...
Internally this is still using the prio bits in bio->rw. So a
filesystem should be able to use bio_set_prio() if it wants to
override the default priority from userspace.
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 21:10 ` Jens Axboe
@ 2014-10-29 22:09 ` Dave Chinner
2014-10-29 22:24 ` Dan Williams
2014-10-29 22:49 ` Jens Axboe
0 siblings, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2014-10-29 22:09 UTC (permalink / raw)
To: Jens Axboe
Cc: Jason B. Akers, linux-ide, dan.j.williams, kapil.karkra,
linux-kernel
On Wed, Oct 29, 2014 at 03:10:51PM -0600, Jens Axboe wrote:
> On 10/29/2014 02:14 PM, Dave Chinner wrote:
> > On Wed, Oct 29, 2014 at 11:23:38AM -0700, Jason B. Akers wrote:
> >> The following series enables the use of Solid State hybrid drives
> >> ATA standard 3.2 defines the hybrid information feature, which provides a means for the host driver to provide hints to the SSHDs to guide what to place on the SSD/NAND portion and what to place on the magnetic media.
> >>
> >> This implementation allows user space applications to provide the cache hints to the kernel using the existing ionice syscall.
> >>
> >> An application can pass a priority number coding up bits 11, 12, and 15 of the ionice command to form a 3 bit field that encodes the following priorities:
> >> OPRIO_ADV_NONE,
> >> IOPRIO_ADV_EVICT, /* actively discard cached data */
> >> IOPRIO_ADV_DONTNEED, /* caching this data has little value */
> >> IOPRIO_ADV_NORMAL, /* best-effort cache priority (default) */
> >> IOPRIO_ADV_RESERVED1, /* reserved for future use */
> >> IOPRIO_ADV_RESERVED2,
> >> IOPRIO_ADV_RESERVED3,
> >> IOPRIO_ADV_WILLNEED, /* high temporal locality */
> >>
> >> For example the following commands from the user space will make dd IOs to be generated with a hint of IOPRIO_ADV_DONTNEED assuming the SSHD is /dev/sdc.
> >>
> >> ionice -c2 -n4096 dd if=/dev/zero of=/dev/sdc bs=1M count=1024
> >> ionice -c2 -n4096 dd if=/dev/sdc of=/dev/null bs=1M count=1024
> >
> > This looks to be the wrong way to implement per-IO priority
> > information.
> >
> > How does a filesystem make use of this to make sure its
> > metadata ends up with IOPRIO_ADV_WILLNEED to store frequently
> > accessed metadata in flash? Conversely, journal writes need to
> > be issued with IOPRIO_ADV_DONTNEED so they don't unnecessarily
> > consume flash space as they are never-read IOs...
>
> Not disagreeing that loading more into the io priority fields is a
> bit... icky. I see why it's done, though, it requires the least amount
> of plumbing.
Yeah, but we don't do things the easy way just because it's easy. We
do things the right way. ;)
> As for the fs accessing this, the io nice fields are readily exposed
> through the ->bi_rw setting. So while the above example uses ionice to
> set a task io priority (that a bio will then inherit), nothing prevents
> you from passing it in directly from the kernel.
Right, but now the filesystem needs to provide that on a per-inode
basis, not from the task structure as the task that is submitting
the bio is not necessarily the task doing the read/write syscall.
e.g. the write case above doesn't actually inherit the task priority
at the bio level at all because the IO is being dispatched by a
background flusher thread, not the ioniced task calling write(2).
IMO using ionice is a nice hack, but ultimately it looks mostly useless
from a user and application perspective as cache residency is a
property of the data being read/written, not the task doing the IO.
e.g. a database will want its indexes in flash and bulk
data in non-cached storage.
IOWs, to make effective use of this the task will need different
cache hints for each different type of data it needs to do IO on, and
so overloading IO priorities just seems the wrong direction to be
starting from.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 22:09 ` Dave Chinner
@ 2014-10-29 22:24 ` Dan Williams
2014-10-30 7:21 ` Dave Chinner
2014-10-29 22:49 ` Jens Axboe
1 sibling, 1 reply; 27+ messages in thread
From: Dan Williams @ 2014-10-29 22:24 UTC (permalink / raw)
To: Dave Chinner
Cc: Jens Axboe, Jason B. Akers, IDE/ATA development list,
Karkra, Kapil, linux-kernel@vger.kernel.org
On Wed, Oct 29, 2014 at 3:09 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 29, 2014 at 03:10:51PM -0600, Jens Axboe wrote:
>> On 10/29/2014 02:14 PM, Dave Chinner wrote:
>> > On Wed, Oct 29, 2014 at 11:23:38AM -0700, Jason B. Akers wrote:
>> >> The following series enables the use of Solid State hybrid drives
>> >> ATA standard 3.2 defines the hybrid information feature, which provides a means for the host driver to provide hints to the SSHDs to guide what to place on the SSD/NAND portion and what to place on the magnetic media.
>> >>
>> >> This implementation allows user space applications to provide the cache hints to the kernel using the existing ionice syscall.
>> >>
>> >> An application can pass a priority number coding up bits 11, 12, and 15 of the ionice command to form a 3 bit field that encodes the following priorities:
>> >> OPRIO_ADV_NONE,
>> >> IOPRIO_ADV_EVICT, /* actively discard cached data */
>> >> IOPRIO_ADV_DONTNEED, /* caching this data has little value */
>> >> IOPRIO_ADV_NORMAL, /* best-effort cache priority (default) */
>> >> IOPRIO_ADV_RESERVED1, /* reserved for future use */
>> >> IOPRIO_ADV_RESERVED2,
>> >> IOPRIO_ADV_RESERVED3,
>> >> IOPRIO_ADV_WILLNEED, /* high temporal locality */
>> >>
>> >> For example the following commands from the user space will make dd IOs to be generated with a hint of IOPRIO_ADV_DONTNEED assuming the SSHD is /dev/sdc.
>> >>
>> >> ionice -c2 -n4096 dd if=/dev/zero of=/dev/sdc bs=1M count=1024
>> >> ionice -c2 -n4096 dd if=/dev/sdc of=/dev/null bs=1M count=1024
>> >
>> > This looks to be the wrong way to implement per-IO priority
>> > information.
>> >
>> > How does a filesystem make use of this to make sure its
>> > metadata ends up with IOPRIO_ADV_WILLNEED to store frequently
>> > accessed metadata in flash? Conversely, journal writes need to
>> > be issued with IOPRIO_ADV_DONTNEED so they don't unnecessarily
>> > consume flash space as they are never-read IOs...
>>
>> Not disagreeing that loading more into the io priority fields is a
>> bit... icky. I see why it's done, though, it requires the least amount
>> of plumbing.
>
> Yeah, but we don't do things the easy way just because it's easy. We
> do things the right way. ;)
...heh, I also don't think we add complication when the simple way
gets us most of the benefit*.
* says the low-level device driver guy ;-).
>> As for the fs accessing this, the io nice fields are readily exposed
>> through the ->bi_rw setting. So while the above example uses ionice to
>> set a task io priority (that a bio will then inherit), nothing prevents
>> you from passing it in directly from the kernel.
>
> Right, but now the filesystem needs to provide that on a per-inode
> basis, not from the task structure as the task that is submitting
> the bio is not necessarily the task doing the read/write syscall.
>
> e.g. the write case above doesn't actually inherit the task priority
> at the bio level at all because the IO is being dispatched by a
> background flusher thread, not the ioniced task calling write(2).
When the ioniced task calling write(2) inserts the page into the page
cache then the current priority is recorded in the struct page. The
background flusher likely runs at a lower / neutral caching priority
and the priority carried in the page will be the effective caching
priority applied to the bio.
> IMO using ionice is a nice hack, but ultimately it looks mostly useless
> from a user and application perspective as cache residency is a
> property of the data being read/written, not the task doing the IO.
> e.g. a database will want its indexes in flash and bulk
> data in non-cached storage.
Right, if those are doing direct-i/o then have a separate thread-id
for those write(2) calls. Otherwise if they are dirtying page cache
the struct page carries the hint.
> IOWs, to make effective use of this the task will need different
> cache hints for each different type of data it needs to do IO on, and
> so overloading IO priorities just seems the wrong direction to be
> starting from.
There's also the fadvise() enabling that could be bolted on top of
this capability. But, before that step, is a thread-id per-caching
context too much to ask?
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 22:09 ` Dave Chinner
2014-10-29 22:24 ` Dan Williams
@ 2014-10-29 22:49 ` Jens Axboe
1 sibling, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2014-10-29 22:49 UTC (permalink / raw)
To: Dave Chinner
Cc: Jason B. Akers, linux-ide, dan.j.williams, kapil.karkra,
linux-kernel
On 10/29/2014 04:09 PM, Dave Chinner wrote:
> On Wed, Oct 29, 2014 at 03:10:51PM -0600, Jens Axboe wrote:
>> On 10/29/2014 02:14 PM, Dave Chinner wrote:
>>> On Wed, Oct 29, 2014 at 11:23:38AM -0700, Jason B. Akers wrote:
>>>> The following series enables the use of Solid State hybrid drives
>>>> ATA standard 3.2 defines the hybrid information feature, which provides a means for the host driver to provide hints to the SSHDs to guide what to place on the SSD/NAND portion and what to place on the magnetic media.
>>>>
>>>> This implementation allows user space applications to provide the cache hints to the kernel using the existing ionice syscall.
>>>>
>>>> An application can pass a priority number coding up bits 11, 12, and 15 of the ionice command to form a 3 bit field that encodes the following priorities:
>>>> OPRIO_ADV_NONE,
>>>> IOPRIO_ADV_EVICT, /* actively discard cached data */
>>>> IOPRIO_ADV_DONTNEED, /* caching this data has little value */
>>>> IOPRIO_ADV_NORMAL, /* best-effort cache priority (default) */
>>>> IOPRIO_ADV_RESERVED1, /* reserved for future use */
>>>> IOPRIO_ADV_RESERVED2,
>>>> IOPRIO_ADV_RESERVED3,
>>>> IOPRIO_ADV_WILLNEED, /* high temporal locality */
>>>>
>>>> For example the following commands from the user space will make dd IOs to be generated with a hint of IOPRIO_ADV_DONTNEED assuming the SSHD is /dev/sdc.
>>>>
>>>> ionice -c2 -n4096 dd if=/dev/zero of=/dev/sdc bs=1M count=1024
>>>> ionice -c2 -n4096 dd if=/dev/sdc of=/dev/null bs=1M count=1024
>>>
>>> This looks to be the wrong way to implement per-IO priority
>>> information.
>>>
>>> How does a filesystem make use of this to make sure its
>>> metadata ends up with IOPRIO_ADV_WILLNEED to store frequently
>>> accessed metadata in flash? Conversely, journal writes need to
>>> be issued with IOPRIO_ADV_DONTNEED so they don't unnecessarily
>>> consume flash space as they are never-read IOs...
>>
>> Not disagreeing that loading more into the io priority fields is a
>> bit... icky. I see why it's done, though, it requires the least amount
>> of plumbing.
>
> Yeah, but we don't do things the easy way just because it's easy. We
> do things the right way. ;)
Still not disagreeing with you, merely stating that I can see why they
chose to do it this way. Still doesn't change the fact that it feels
like a hack, not a designed solution.
>> As for the fs accessing this, the io nice fields are readily exposed
>> through the ->bi_rw setting. So while the above example uses ionice to
>> set a task io priority (that a bio will then inherit), nothing prevents
>> you from passing it in directly from the kernel.
>
> Right, but now the filesystem needs to provide that on a per-inode
> basis, not from the task structure as the task that is submitting
> the bio is not necessarily the task doing the read/write syscall.
Whoever submits the bio would need to provide it, yes. And with the
disconnect for async writes, that becomes... interesting.
> e.g. the write case above doesn't actually inherit the task priority
> at the bio level at all because the IO is being dispatched by a
> background flusher thread, not the ioniced task calling write(2).
Oh yes, I realize that.
> IMO using ionice is a nice hack, but ultimately it looks mostly useless
> from a user and application perspective as cache residency is a
> property of the data being read/written, not the task doing the IO.
> e.g. a database will want its indexes in flash and bulk
> data in non-cached storage.
>
> IOWs, to make effective use of this the task will need different
> cache hints for each different type of data it needs to do IO on, and
> so overloading IO priorities just seems the wrong direction to be
> starting from.
Agree.
--
Jens Axboe
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 18:23 [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Jason B. Akers
` (5 preceding siblings ...)
2014-10-29 20:14 ` [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Dave Chinner
@ 2014-10-30 2:05 ` Martin K. Petersen
2014-10-30 2:35 ` Jens Axboe
6 siblings, 1 reply; 27+ messages in thread
From: Martin K. Petersen @ 2014-10-30 2:05 UTC (permalink / raw)
To: Jason B. Akers
Cc: linux-ide, axboe, dan.j.williams, kapil.karkra, linux-kernel
>>>>> "Jason" == Jason B Akers <jason.b.akers@intel.com> writes:
Jason> The following series enables the use of Solid State hybrid drives.
Jason> ATA standard 3.2 defines the hybrid information feature, which
Jason> provides a means for the host driver to provide hints to the
Jason> SSHDs to guide what to place on the SSD/NAND portion and what to
Jason> place on the magnetic media.
I have been ripping my hair out in this department for a while.
A colleague and I presented our findings at SNIA SDC a few weeks
ago. I'm trying to find out if there's an embargo on the slides or if I
can post them.
First of all I completely agree with Dave's comments about hooking into
fadvise()/madvise().
For my testing I also overloaded the existing priority fields but ended
up deciding that it would be better to have a separate field (and
cleaning up the priority high byte in bi_rw but that's part of a
different patch set).
My challenge with hints has been trying to bridge all the various
existing approaches with the new stuff that's coming down the pipe in
T10/T13 (LBMD hints) and NFS v4.2 ditto. That turned into a huge mapping
table as well as a few amendments to what's currently being worked on in
the standards bodies.
I didn't actively pursue the hybrid drive hints because I didn't think
there was much interest. But since there is we should combine our
efforts. From an application and kernel perspective we need to have one
type of hints that then get translated into whatever is suitable for
NFS, T10, T13 or SATA-IO SSHDs. It looks like the SSHD hints are
reasonably close to fadvise() which is great.
I'll see if I can get a link to the slides out tomorrow. Otherwise I'll
just redo them.
--
Martin K. Petersen Oracle Linux Engineering
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 2:05 ` Martin K. Petersen
@ 2014-10-30 2:35 ` Jens Axboe
2014-10-30 3:28 ` Martin K. Petersen
0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2014-10-30 2:35 UTC (permalink / raw)
To: Martin K. Petersen, Jason B. Akers
Cc: linux-ide, dan.j.williams, kapil.karkra, linux-kernel
On 2014-10-29 20:05, Martin K. Petersen wrote:
>>>>>> "Jason" == Jason B Akers <jason.b.akers@intel.com> writes:
>
> Jason> The following series enables the use of Solid State hybrid drives.
> Jason> ATA standard 3.2 defines the hybrid information feature, which
> Jason> provides a means for the host driver to provide hints to the
> Jason> SSHDs to guide what to place on the SSD/NAND portion and what to
> Jason> place on the magnetic media.
>
> I have been ripping my hair out in this department for a while.
>
> A colleague and I presented our findings at SNIA SDC a few weeks
> ago. I'm trying to find out if there's an embargo on the slides or if I
> can post them.
>
> First of all I completely agree with Dave's comments about hooking into
> fadvise()/madvise().
The problem with xadvise() is that it handles only one part of this - it
handles the case of tying some sort of IO related priority information
to an inode. It does not handle the case of different parts of the file,
at least not without adding specific extra tracking for this on the
kernel side.
I think we've needed a proper API for passing in appropriate hints on a
per-io basis for a LONG time.
> My challenge with hints has been trying to bridge all the various
> existing approaches with the new stuff that's coming down the pipe in
> T10/T13 (LBMD hints) and NFS v4.2 ditto. That turned into a huge mapping
> table as well as a few amendments to what's currently being worked on in
> the standards bodies.
That is the big challenge. We've tried (and failed) in the past to
define a set of hints that make sense. It'd be a shame to add something
that's specific to a given transport/technology. That said, this set of
hints do seem pretty basic and would not necessarily be a bad place to
start. But they are still very specific to this use case. And who knows
what will happen on the device side. I might assume that WILLNEED is the
same as HOT, and that DONTNEED is the same as cold. And then
applications get upset when vendor X and Y treat them somewhat
differently, because that's how it fit into their architecture.
This is the primary reason that hints never happened previously.
--
Jens Axboe
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 2:35 ` Jens Axboe
@ 2014-10-30 3:28 ` Martin K. Petersen
2014-10-30 4:19 ` Dan Williams
2014-10-30 14:53 ` Jens Axboe
0 siblings, 2 replies; 27+ messages in thread
From: Martin K. Petersen @ 2014-10-30 3:28 UTC (permalink / raw)
To: Jens Axboe
Cc: Martin K. Petersen, Jason B. Akers, linux-ide, dan.j.williams,
kapil.karkra, linux-kernel
>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
Jens> The problem with xadvise() is that it handles only one part of
Jens> this - it handles the case of tying some sort of IO related
Jens> priority information to an inode. It does not handle the case of
Jens> different parts of the file, at least not without adding specific
Jens> extra tracking for this on the kernel side.
Are there actually people asking for sub-file granularity? I didn't get
any requests for that in the survey I did this summer.
I talked to several application people about what they really needed and
wanted. That turned into a huge twisted mess of a table with ponies of
various sizes.
I condensed all those needs and desires into something like this:
+-----------------+------------+----------+------------+
| I/O Class | Command | Desired | Predicted |
| | Completion | Future | Future |
| | Urgency | Access | Access |
| | | Latency | Frequency |
+-----------------+------------+----------+------------+
| Transaction | High | Low | High |
+-----------------+------------+----------+------------+
| Metadata | High | Low | Normal |
+-----------------+------------+----------+------------+
| Paging | High | Normal | Normal |
+-----------------+------------+----------+------------+
| Streaming | High | Normal | Low |
+-----------------+------------+----------+------------+
| Data | Normal | Normal | Normal |
+-----------------+------------+----------+------------+
| Background | Low | Normal* | Low |
+-----------------+------------+----------+------------+
Command completion urgency is really just the existing I/O priority.
Desired future access latency affects data placement in a tiered
device. Predicted future access frequency is essentially a caching hint.
The names and I/O classes themselves are not really important. It's just
a reduced version of all the things people asked for. Essentially:
Relative priority, data placement and caching.
I had also asked why people wanted to specify any hints. And that boiled
down to the I/O classes in the left column above. People wanted stuff on
a low latency storage tier because it was a transactional or metadata
type of I/O. Or to isolate production I/O from any side effects of a
background scrub or backup run.
Incidentally, the classes data, transaction and background covered
almost all the use cases that people had asked for. The metadata class
mostly came about from good results with REQ_META tagging in a previous
prototype. A few vendors wanted to be able to identify swap to prevent
platter spin-ups. Streaming was requested by a couple of video folks.
The notion of telling the storage *why* you're doing I/O instead of
telling it how to manage its cache and where to put stuff is closely
aligned with our internal experiences with I/O hints over the last
decade. But it's a bit of a departure from where things are going in the
standards bodies. In any case I thought it was interesting that pretty
much every use case that people came up with could be adequately
described by a handful of I/O classes.
The next step was trying to map these hints into what was available in
xadvise(), NFS 4.2 and the recent T10/T13 efforts. That wasn't trivial
and there really isn't a 1:1 mapping that works. So I went to T10 and
tried to nudge things in the same direction as NFS 4.2. Mainly because
that's closer to what we already have in xadvise().
Jens> I think we've needed a proper API for passing in appropriate hints
Jens> on a per-io basis for a LONG time.
Yup.
Jens> That is the big challenge. We've tried (and failed) in the past to
Jens> define a set of hints that make sense. It'd be a shame to add
Jens> something that's specific to a given transport/technology.
Absolutely!
--
Martin K. Petersen Oracle Linux Engineering
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 3:28 ` Martin K. Petersen
@ 2014-10-30 4:19 ` Dan Williams
2014-10-30 14:17 ` Jens Axboe
2014-10-30 14:53 ` Jens Axboe
1 sibling, 1 reply; 27+ messages in thread
From: Dan Williams @ 2014-10-30 4:19 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Jens Axboe, Jason B. Akers, IDE/ATA development list,
Karkra, Kapil, linux-kernel@vger.kernel.org, linux-nvme
On Wed, Oct 29, 2014 at 8:28 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> The next step was trying to map these hints into what was available in
> xadvise(), NFS 4.2 and the recent T10/T13 efforts. That wasn't trivial
> and there really isn't a 1:1 mapping that works. So I went to T10 and
> tried to nudge things in the same direction as NFS 4.2. Mainly because
> that's closer to what we already have in xadvise().
In case you still have hair left to pull wrangling these multiple
specifications, Matthew reminds me that NVMe also has cache advice at
the transport layer.
> Jens> I think we've needed a proper API for passing in appropriate hints
> Jens> on a per-io basis for a LONG time.
>
> Yup.
I understand the desire to have per-io / per-inode xadvise()-style
hints, but I don't see why we shouldn't also include a per-pid capability?
Per-pid was not "icky" for flashcache [1]. It lets you flag
processes that should not pollute the cache, as well as "cache warming"
processes pre-loading sub-ranges of files, which is awkward to do with a
per-inode hint. Per-pid also allows hinting on behalf of other
otherwise cache-unaware processes.
> Jens> That is the big challenge. We've tried (and failed) in the past to
> Jens> define a set of hints that make sense. It'd be a shame to add
> Jens> something that's specific to a given transport/technology.
>
> Absolutely!
In this RFC we end up punting the ultimate kernel-to-transport hint
translation to userspace. The kernel has a default interpretation,
but it seems it will almost always be inadequate trying to account for
per-device quirks and platform performance policies.
[1]: https://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt#L139
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 22:24 ` Dan Williams
@ 2014-10-30 7:21 ` Dave Chinner
2014-10-30 14:15 ` Jens Axboe
2014-10-30 17:07 ` Dan Williams
0 siblings, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2014-10-30 7:21 UTC (permalink / raw)
To: Dan Williams
Cc: Jens Axboe, Jason B. Akers, IDE/ATA development list,
Karkra, Kapil, linux-kernel@vger.kernel.org
On Wed, Oct 29, 2014 at 03:24:11PM -0700, Dan Williams wrote:
> On Wed, Oct 29, 2014 at 3:09 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Oct 29, 2014 at 03:10:51PM -0600, Jens Axboe wrote:
> >> As for the fs accessing this, the io nice fields are readily exposed
> >> through the ->bi_rw setting. So while the above example uses ionice to
> >> set a task io priority (that a bio will then inherit), nothing prevents
> >> you from passing it in directly from the kernel.
> >
> > Right, but now the filesystem needs to provide that on a per-inode
> > basis, not from the task structure as the task that is submitting
> > the bio is not necessarily the task doing the read/write syscall.
> >
> > e.g. the write case above doesn't actually inherit the task priority
> > at the bio level at all because the IO is being dispatched by a
> > background flusher thread, not the ioniced task calling write(2).
>
> When the ioniced task calling write(2) inserts the page into the page
> cache then the current priority is recorded in the struct page. The
It does? Can you point me to where the page cache code does this,
because I've clearly missed something important go by in the past
few months...
> background flusher likely runs at a lower / neutral caching priority
> and the priority carried in the page will be the effective caching
> priority applied to the bio.
How? The writepage code that adds the pages to the bio doesn't look
at priorities at all. If we're supposed to be doing this, then it
isn't being done in XFS when we are building bios, and nobody has
told me we need to do it...
Hmmm - ok, so what happens if an IO is made up of pages from
different tasks with different priorities? what then? ;)
> > from a user and application perspective as cache residency is a
> > property of the data being read/written, not the task doing the IO.
> > e.g. a database will want its indexes in flash and bulk
> > data in non-cached storage.
>
> Right, if those are doing direct-i/o then have a separate thread-id
> for those write(2) calls.
Which, again, is not how applications are designed or implemented.
If the current transaction needs to read/write index blocks, it does
it directly, rather than waiting for some other dispatch thread to do
it for it....
> Otherwise if they are dirtying page cache
> the struct page carries the hint.
>
> > IOWs, to make effective use of this the task will need different
> > cache hints for each different type of data it needs to do IO on, and
> > so overloading IO priorities just seems the wrong direction to be
> > starting from.
>
> There's also the fadvise() enabling that could be bolted on top of
> this capability. But, before that step, is a thread-id per-caching
> context too much to ask?
If we do it that way, we are stuck with it forever. So let's get our
ducks in line first before pulling the trigger...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 7:21 ` Dave Chinner
@ 2014-10-30 14:15 ` Jens Axboe
2014-10-30 17:07 ` Dan Williams
1 sibling, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2014-10-30 14:15 UTC (permalink / raw)
To: Dave Chinner, Dan Williams
Cc: Jason B. Akers, IDE/ATA development list, Karkra, Kapil,
linux-kernel@vger.kernel.org
On 2014-10-30 01:21, Dave Chinner wrote:
> On Wed, Oct 29, 2014 at 03:24:11PM -0700, Dan Williams wrote:
>> On Wed, Oct 29, 2014 at 3:09 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Wed, Oct 29, 2014 at 03:10:51PM -0600, Jens Axboe wrote:
>>>> As for the fs accessing this, the io nice fields are readily exposed
>>>> through the ->bi_rw setting. So while the above example uses ionice to
>>>> set a task io priority (that a bio will then inherit), nothing prevents
>>>> you from passing it in directly from the kernel.
>>>
>>> Right, but now the filesystem needs to provide that on a per-inode
>>> basis, not from the task structure as the task that is submitting
>>> the bio is not necessarily the task doing the read/write syscall.
>>>
>>> e.g. the write case above doesn't actually inherit the task priority
>>> at the bio level at all because the IO is being dispatched by a
>>> background flusher thread, not the ioniced task calling write(2).
>>
>> When the ioniced task calling write(2) inserts the page into the page
>> cache then the current priority is recorded in the struct page. The
>
> It does? Can you point me to where the page cache code does this,
> because I've clearly missed something important go by in the past
> few months...
I was puzzled too, but then I realized that Dan is referring to patch
4/5 in the series...
--
Jens Axboe
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 4:19 ` Dan Williams
@ 2014-10-30 14:17 ` Jens Axboe
0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2014-10-30 14:17 UTC (permalink / raw)
To: Dan Williams, Martin K. Petersen
Cc: Jason B. Akers, IDE/ATA development list, Karkra, Kapil,
linux-kernel@vger.kernel.org, linux-nvme
On 2014-10-29 22:19, Dan Williams wrote:
> I understand the desire to have per-io / per-inode xadvise()-style
> hints, but I don't see why we shouldn't also include a per-pid capability?
>
> Per-pid was not "icky" for flashcache [1]. It lets you flag
> processes that should not pollute the cache, as well as "cache warming"
> processes pre-loading sub-ranges of files, which is awkward to do with a
> per-inode hint. Per-pid also allows hinting on behalf of other
> otherwise cache-unaware processes.
per-pid is imho fine as well, as long as it's not the primary interface.
I quite like how the io priority works in this regard. If the task has a
priority set, we use that. If you pass in something else, that overrides
the task set one.
per-pid allows you to modify how we treat applications without modifying
the application itself. This is handy for e.g. streamed backup and
similar, which is most likely why flashcache has it.
--
Jens Axboe
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 3:28 ` Martin K. Petersen
2014-10-30 4:19 ` Dan Williams
@ 2014-10-30 14:53 ` Jens Axboe
2014-10-30 16:27 ` Dan Williams
1 sibling, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2014-10-30 14:53 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Jason B. Akers, linux-ide, dan.j.williams, kapil.karkra,
linux-kernel
On 2014-10-29 21:28, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>
> Jens> The problem with xadvise() is that it handles only one part of
> Jens> this - it handles the case of tying some sort of IO related
> Jens> priority information to an inode. It does not handle the case of
> Jens> different parts of the file, at least not without adding specific
> Jens> extra tracking for this on the kernel side.
>
> Are there actually people asking for sub-file granularity? I didn't get
> any requests for that in the survey I did this summer.
Yeah, consider the case of using a raw block device for storing a
database. That one is quite common. Or perhaps a setup with a single
log, with data being appended to it. Some of that data would be marked
as hot/willneed, some of it will be marked with cold/wontneed. This
means that we cannot rely on per-inode hinting.
> I talked to several application people about what they really needed and
> wanted. That turned into a huge twisted mess of a table with ponies of
> various sizes.
Who could have envisioned that :-)
> I condensed all those needs and desires into something like this:
>
> +-----------------+------------+----------+------------+
> | I/O Class | Command | Desired | Predicted |
> | | Completion | Future | Future |
> | | Urgency | Access | Access |
> | | | Latency | Frequency |
> +-----------------+------------+----------+------------+
> | Transaction | High | Low | High |
> +-----------------+------------+----------+------------+
> | Metadata | High | Low | Normal |
> +-----------------+------------+----------+------------+
> | Paging | High | Normal | Normal |
> +-----------------+------------+----------+------------+
> | Streaming | High | Normal | Low |
> +-----------------+------------+----------+------------+
> | Data | Normal | Normal | Normal |
> +-----------------+------------+----------+------------+
> | Background | Low | Normal* | Low |
> +-----------------+------------+----------+------------+
>
> Command completion urgency is really just the existing I/O priority.
> Desired future access latency affects data placement in a tiered
> device. Predicted future access frequency is essentially a caching hint.
>
> The names and I/O classes themselves are not really important. It's just
> a reduced version of all the things people asked for. Essentially:
> Relative priority, data placement and caching.
>
> I had also asked why people wanted to specify any hints. And that boiled
> down to the I/O classes in the left column above. People wanted stuff on
> a low latency storage tier because it was a transactional or metadata
> type of I/O. Or to isolate production I/O from any side effects of a
> background scrub or backup run.
>
> Incidentally, the classes data, transaction and background covered
> almost all the use cases that people had asked for. The metadata class
> mostly came about from good results with REQ_META tagging in a previous
> prototype. A few vendors wanted to be able to identify swap to prevent
> platter spin-ups. Streaming was requested by a couple of video folks.
>
> The notion of telling the storage *why* you're doing I/O instead of
> telling it how to manage its cache and where to put stuff is closely
> aligned with our internal experiences with I/O hints over the last
> decade. But it's a bit of a departure from where things are going in the
> standards bodies. In any case I thought it was interesting that pretty
> much every use case that people came up with could be adequately
> described by a handful of I/O classes.
Definitely agree on this, it's about notifying storage on what type of
IO this is, or why we are doing it. I'm just still worried that this
will then end up being unusable by applications, since they can't rely
on anything. Say one vendor treats WONTNEED in a much colder fashion
than others, the user/application will then complain about the access
latencies for the next IO to that location. "Yes it's cold, but I didn't
expect it to be THAT cold" and then come to the conclusion that they
can't feasibly use these hints as they don't do exactly what they want.
It'd be nice if we could augment this with a query interface of some
sort, that could give the application some idea of what happens for each
of the passed-in hints. That would improve the situation from a "let's
set this hint and hope it does what we think it does" to a more
predictable and robust environment.
--
Jens Axboe
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 14:53 ` Jens Axboe
@ 2014-10-30 16:27 ` Dan Williams
0 siblings, 0 replies; 27+ messages in thread
From: Dan Williams @ 2014-10-30 16:27 UTC (permalink / raw)
To: Jens Axboe
Cc: Martin K. Petersen, Jason B. Akers, IDE/ATA development list,
Karkra, Kapil, linux-kernel@vger.kernel.org
On Thu, Oct 30, 2014 at 7:53 AM, Jens Axboe <axboe@fb.com> wrote:
> On 2014-10-29 21:28, Martin K. Petersen wrote:
>> The notion of telling the storage *why* you're doing I/O instead of
>> telling it how to manage its cache and where to put stuff is closely
>> aligned with our internal experiences with I/O hints over the last
>> decade. But it's a bit of a departure from where things are going in the
>> standards bodies. In any case I thought it was interesting that pretty
>> much every use case that people came up with could be adequately
>> described by a handful of I/O classes.
>
>
> Definitely agree on this, it's about notifying storage on what type of IO
> this is, or why we are doing it. I'm just still worried that this will then
> end up being unusable by applications, since they can't rely on anything.
> Say one vendor treats WONTNEED in a much colder fashion than others, the
> user/application will then complain about the access latencies for the next
> IO to that location. "Yes it's cold, but I didn't expect it to be THAT cold"
> and then come to the conclusion that they can't feasibly use these hints as
> they don't do exactly what they want.
>
> It'd be nice if we could augment this with a query interface of some sort,
> that could give the application some idea of what happens for each of the
> passed-in hints. That would improve the situation from a "let's set this hint
> and hope it does what we think it does" to a more predictable and robust
> environment.
>
I'm skeptical we (Linux kernel) can ever get this right. If an
application wants strict determinism in the meaning of hints it seems
it will need to qualify them against component-vendor /
platform-vendor provided transport-translation. For this RFC we had
consumer platforms in mind where "mostly better than baseline" is the
acceptance criteria vs "hard QOS".
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 7:21 ` Dave Chinner
2014-10-30 14:15 ` Jens Axboe
@ 2014-10-30 17:07 ` Dan Williams
2014-11-10 4:22 ` Dave Chinner
1 sibling, 1 reply; 27+ messages in thread
From: Dan Williams @ 2014-10-30 17:07 UTC (permalink / raw)
To: Dave Chinner
Cc: Jens Axboe, Jason B. Akers, IDE/ATA development list,
Karkra, Kapil, linux-kernel@vger.kernel.org
On Thu, Oct 30, 2014 at 12:21 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Oct 29, 2014 at 03:24:11PM -0700, Dan Williams wrote:
>> On Wed, Oct 29, 2014 at 3:09 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Oct 29, 2014 at 03:10:51PM -0600, Jens Axboe wrote:
>> >> As for the fs accessing this, the io nice fields are readily exposed
>> >> through the ->bi_rw setting. So while the above example uses ionice to
>> >> set a task io priority (that a bio will then inherit), nothing prevents
>> >> you from passing it in directly from the kernel.
>> >
>> > Right, but now the filesystem needs to provide that on a per-inode
>> > basis, not from the task structure as the task that is submitting
>> > the bio is not necessarily the task doing the read/write syscall.
>> >
>> > e.g. the write case above doesn't actually inherit the task priority
>> > at the bio level at all because the IO is being dispatched by a
>> > background flusher thread, not the ioniced task calling write(2).
>>
>> When the ioniced task calling write(2) inserts the page into the page
>> cache then the current priority is recorded in the struct page. The
>
> It does? Can you point me to where the page cache code does this,
> because I've clearly missed something important go by in the past
> few months...
Sorry, should have been more clear that this patch set added that
capability in patch-4. The idea is to claim some unused extended page
flags to stash priority bits. Yes, the PageSetAdvice() helper needs
to be fixed up to do the flags update atomically, and yes this
precludes hinting on 32-bit platforms. I also think that
bio_add_page() is the better place to read the per-page priority into
the bio. We felt ok deferring these items until after the initial
RFC.
>> background flusher likely runs at a lower / neutral caching priority
>> and the priority carried in the page will be the effective caching
>> priority applied to the bio.
>
> How? The writepage code that adds the pages to the bio doesn't look
> at priorities at all. If we're supposed to be doing this, then it
> isn't being done in XFS when we are building bios, and nobody has
> told me we need to do it...
>
> Hmmm - ok, so what happens if an IO is made up of pages from
> different tasks with different priorities? what then? ;)
ioprio_best(). If a low priority task happens to cross pages with a
high priority task, the effective priority is still "high".
>> > from a user and application perspective as cache residency is a
>> > property of the data being read/written, not the task doing the IO.
>> > e.g. a database will want its indexes in flash and bulk
>> > data in non-cached storage.
>>
>> Right, if those are doing direct-i/o then have a separate thread-id
>> for those write(2) calls.
>
> Which, again, is not how applications are designed or implemented.
> If the current transaction needs to read/write index blocks, it does
> it directly, rather than waiting for some other dispatch thread to do
> it for it....
Sure, xadvise() based hints have a role to play in addition to per-pid
based hints.
>> Otherwise if they are dirtying page cache
>> the struct page carries the hint.
>>
>> > IOWs, to make effective use of this the task will need different
>> > cache hints for each different type of data it needs to do IO on, and
>> > so overloading IO priorities just seems the wrong direction to be
>> > starting from.
>>
>> There's also the fadvise() enabling that could be bolted on top of
>> this capability. But, before that step, is a thread-id per-caching
>> context too much to ask?
>
> If we do it that way, we are stuck with it forever. So let's get our
> ducks in line first before pulling the trigger...
Are you objecting to ionice as the interface or per-pid based hinting
in general?
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-30 17:07 ` Dan Williams
@ 2014-11-10 4:22 ` Dave Chinner
2014-11-12 16:47 ` Dan Williams
0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2014-11-10 4:22 UTC (permalink / raw)
To: Dan Williams
Cc: Jens Axboe, Jason B. Akers, IDE/ATA development list,
Karkra, Kapil, linux-kernel@vger.kernel.org
[Been distracted with other issues, so just getting back to this.]
On Thu, Oct 30, 2014 at 10:07:47AM -0700, Dan Williams wrote:
> On Thu, Oct 30, 2014 at 12:21 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Oct 29, 2014 at 03:24:11PM -0700, Dan Williams wrote:
> >> On Wed, Oct 29, 2014 at 3:09 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Wed, Oct 29, 2014 at 03:10:51PM -0600, Jens Axboe wrote:
> >> >> As for the fs accessing this, the io nice fields are readily exposed
> >> >> through the ->bi_rw setting. So while the above example uses ionice to
> >> >> set a task io priority (that a bio will then inherit), nothing prevents
> >> >> you from passing it in directly from the kernel.
> >> >
> >> > Right, but now the filesystem needs to provide that on a per-inode
> >> > basis, not from the task structure as the task that is submitting
> >> > the bio is not necessarily the task doing the read/write syscall.
> >> >
> >> > e.g. the write case above doesn't actually inherit the task priority
> >> > at the bio level at all because the IO is being dispatched by a
> >> > background flusher thread, not the ioniced task calling write(2).
> >>
> >> When the ioniced task calling write(2) inserts the page into the page
> >> cache then the current priority is recorded in the struct page. The
> >
> > It does? Can you point me to where the page cache code does this,
> > because I've clearly missed something important go by in the past
> > few months...
>
> Sorry, should have been more clear that this patch set added that
> capability in patch-4. The idea is to claim some unused extended page
> flags to stash priority bits. Yes, the PageSetAdvice() helper needs
> to be fixed up to do the flags update atomically, and yes this
> precludes hinting on 32-bit platforms. I also think that
> bio_add_page() is the better place to read the per-page priority into
> the bio. We felt ok deferring these items until after the initial
> RFC.
I think that using page flags for this is a 'orrible idea. Yeah,
it's a neat hack that you can use for proof-of-concept
demonstrations, but my biggest concern is that it isn't a scalable
channel for carrying IO priority information through the page cache.
e.g. it can't carry existing ionice priority scheduling information,
it can't carry blkcg IO control information, etc.
So, really, I think that this buffered write IO priority issue is
bigger than this patch series, and we need to solve it properly
rather than hack ugly special cases into core infrastructure
that are an evolutionary dead-end....
> >> > IOWs, to make effective use of this the task will need different
> >> > cache hints for each different type of data it needs to do IO on, and
> >> > so overloading IO priorities just seems the wrong direction to be
> >> > starting from.
> >>
> >> There's also the fadvise() enabling that could be bolted on top of
> >> this capability. But, before that step, is a thread-id per-caching
> >> context too much to ask?
> >
> > If we do it that way, we are stuck with it forever. So let's get our
> > ducks in line first before pulling the trigger...
>
> Are you objecting to ionice as the interface or per-pid based hinting
> in general?
Neither. It's the implementation I don't like.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-11-10 4:22 ` Dave Chinner
@ 2014-11-12 16:47 ` Dan Williams
0 siblings, 0 replies; 27+ messages in thread
From: Dan Williams @ 2014-11-12 16:47 UTC (permalink / raw)
To: Dave Chinner
Cc: Jens Axboe, Jason B. Akers, IDE/ATA development list,
Karkra, Kapil, linux-kernel@vger.kernel.org
On Sun, Nov 9, 2014 at 8:22 PM, Dave Chinner <david@fromorbit.com> wrote:
> [Been distracted with other issues, so just getting back to this.]
>
> On Thu, Oct 30, 2014 at 10:07:47AM -0700, Dan Williams wrote:
>> On Thu, Oct 30, 2014 at 12:21 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Oct 29, 2014 at 03:24:11PM -0700, Dan Williams wrote:
>> >> On Wed, Oct 29, 2014 at 3:09 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Wed, Oct 29, 2014 at 03:10:51PM -0600, Jens Axboe wrote:
>> >> >> As for the fs accessing this, the io nice fields are readily exposed
>> >> >> through the ->bi_rw setting. So while the above example uses ionice to
>> >> >> set a task io priority (that a bio will then inherit), nothing prevents
>> >> >> you from passing it in directly from the kernel.
>> >> >
>> >> > Right, but now the filesystem needs to provide that on a per-inode
>> >> > basis, not from the task structure as the task that is submitting
>> >> > the bio is not necessarily the task doing the read/write syscall.
>> >> >
>> >> > e.g. the write case above doesn't actually inherit the task priority
>> >> > at the bio level at all because the IO is being dispatched by a
>> >> > background flusher thread, not the ioniced task calling write(2).
>> >>
>> >> When the ioniced task calling write(2) inserts the page into the page
>> >> cache then the current priority is recorded in the struct page. The
>> >
>> > It does? Can you point me to where the page cache code does this,
>> > because I've clearly missed something important go by in the past
>> > few months...
>>
>> Sorry, should have been more clear that this patch set added that
>> capability in patch-4. The idea is to claim some unused extended page
>> flags to stash priority bits. Yes, the PageSetAdvice() helper needs
>> to be fixed up to do the flags update atomically, and yes this
>> precludes hinting on 32-bit platforms. I also think that
>> bio_add_page() is the better place to read the per-page priority into
>> the bio. We felt ok deferring these items until after the initial
>> RFC.
>
> I think that using page flags for this is a 'orrible idea. Yeah,
> it's a neat hack that you can use for proof-of-concept
> demonstrations, but my biggest concern is that it isn't a scalable
> channel for carrying IO priority information through the page cache.
> e.g. it can't carry existing ionice priority scheduling information,
> it can't carry blkcg IO control information, etc.
>
> So, really, I think that this buffered write IO priority issue is
> bigger than this patch series, and we need to solve it properly
> rather than hack ugly special cases into core infrastructure
> that are an evolutionary dead-end....
>
>> >> > IOWs, to make effective use of this the task will need different
>> >> > cache hints for each different type of data it needs to do IO on, and
>> >> > so overloading IO priorities just seems the wrong direction to be
>> >> > starting from.
>> >>
>> >> There's also the fadvise() enabling that could be bolted on top of
>> >> this capability. But, before that step, is a thread-id per-caching
>> >> context too much to ask?
>> >
>> > If we do it that way, we are stuck with it forever. So let's get our
>> > ducks in line first before pulling the trigger...
>>
>> Are you objecting to ionice as the interface or per-pid based hinting
>> in general?
>
> Neither. It's the implementation I don't like.
>
Fair enough. The page flags approach was indeed a quick hack to get an
RFC out the door instead of implementing a proper look-aside data
structure for remembering page cache io-priority. We'll iterate from here...
* Re: [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives
2014-10-29 20:14 ` [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Dave Chinner
2014-10-29 21:10 ` Jens Axboe
2014-10-29 21:11 ` Dan Williams
@ 2014-12-03 15:25 ` Pavel Machek
2 siblings, 0 replies; 27+ messages in thread
From: Pavel Machek @ 2014-12-03 15:25 UTC (permalink / raw)
To: Dave Chinner
Cc: Jason B. Akers, linux-ide, axboe, dan.j.williams, kapil.karkra,
linux-kernel
On Thu 2014-10-30 07:14:17, Dave Chinner wrote:
> On Wed, Oct 29, 2014 at 11:23:38AM -0700, Jason B. Akers wrote:
> > The following series enables the use of Solid State hybrid drives.
> > ATA standard 3.2 defines the hybrid information feature, which provides a means for the host driver to provide hints to the SSHDs to guide what to place on the SSD/NAND portion and what to place on the magnetic media.
> >
> > This implementation allows user space applications to provide the cache hints to the kernel using the existing ionice syscall.
> >
> > An application can pass a priority number in which bits 11, 12, and 15 of the ionice value form a 3-bit field that encodes the following priorities:
> > IOPRIO_ADV_NONE,
> > IOPRIO_ADV_EVICT, /* actively discard cached data */
> > IOPRIO_ADV_DONTNEED, /* caching this data has little value */
> > IOPRIO_ADV_NORMAL, /* best-effort cache priority (default) */
> > IOPRIO_ADV_RESERVED1, /* reserved for future use */
> > IOPRIO_ADV_RESERVED2,
> > IOPRIO_ADV_RESERVED3,
> > IOPRIO_ADV_WILLNEED, /* high temporal locality */
> >
> > For example, the following commands from user space cause dd's IOs to be generated with a hint of IOPRIO_ADV_DONTNEED, assuming the SSHD is /dev/sdc.
> >
> > ionice -c2 -n4096 dd if=/dev/zero of=/dev/sdc bs=1M count=1024
> > ionice -c2 -n4096 dd if=/dev/sdc of=/dev/null bs=1M count=1024
>
> This looks to be the wrong way to implement per-IO priority
> information.
>
> How does a filesystem make use of this to make sure its
> metadata ends up with IOPRIO_ADV_WILLNEED to store frequently
> accessed metadata in flash? Conversely, journal writes need to
> be issued with IOPRIO_ADV_DONTNEED so they don't unnecessarily
> consume flash space as they are never-read IOs...
Well, that makes sense, but we still want some kind of per-application
priority.
I'd like the ~/.chromium directory cached in the SSD part, but I don't
necessarily want the /data/backup directory cached in the SSD...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
end of thread, other threads:[~2014-12-03 15:25 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
2014-10-29 18:23 [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Jason B. Akers
2014-10-29 18:23 ` [RFC PATCH 1/5] block, ioprio: include caching advice via ionice Jason B. Akers
2014-10-29 19:02 ` Jeff Moyer
2014-10-29 21:07 ` Dan Williams
2014-10-29 18:23 ` [RFC PATCH 2/5] block: ioprio hint to low-level device drivers Jason B. Akers
2014-10-29 18:23 ` [RFC PATCH 3/5] block: untangle ioprio from BLK_CGROUP and BLK_DEV_THROTTLING Jason B. Akers
2014-10-29 18:24 ` [RFC PATCH 4/5] block, mm: Added the necessary plumbing to take ioprio hints down to block layer Jason B. Akers
2014-10-29 18:24 ` [RFC PATCH 5/5] libata: Enabling Solid State Hybrid Drives (SSHDs) based on SATA 3.2 standard Jason B. Akers
2014-10-29 20:14 ` [RFC PATCH 0/5] Enable use of Solid State Hybrid Drives Dave Chinner
2014-10-29 21:10 ` Jens Axboe
2014-10-29 22:09 ` Dave Chinner
2014-10-29 22:24 ` Dan Williams
2014-10-30 7:21 ` Dave Chinner
2014-10-30 14:15 ` Jens Axboe
2014-10-30 17:07 ` Dan Williams
2014-11-10 4:22 ` Dave Chinner
2014-11-12 16:47 ` Dan Williams
2014-10-29 22:49 ` Jens Axboe
2014-10-29 21:11 ` Dan Williams
2014-12-03 15:25 ` Pavel Machek
2014-10-30 2:05 ` Martin K. Petersen
2014-10-30 2:35 ` Jens Axboe
2014-10-30 3:28 ` Martin K. Petersen
2014-10-30 4:19 ` Dan Williams
2014-10-30 14:17 ` Jens Axboe
2014-10-30 14:53 ` Jens Axboe
2014-10-30 16:27 ` Dan Williams