* [PATCH V4 13/16] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
This patch is basically the counterpart, for NCQ-capable rotational
devices, of the previous patch. Exactly as the previous patch does on
flash-based devices and for any workload, this patch disables device
idling on rotational devices, but only for random I/O. In fact, only
with these queues disabling idling boosts the throughput on
NCQ-capable rotational devices. To not break service guarantees,
idling is disabled for NCQ-enabled rotational devices only when the
same symmetry conditions considered in the previous patches hold.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
block/bfq-iosched.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2081784..549f030 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6439,20 +6439,15 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
* The next variable takes into account the cases where idling
* boosts the throughput.
*
- * The value of the variable is computed considering that
- * idling is usually beneficial for the throughput if:
+ * The value of the variable is computed considering, first, that
+ * idling is virtually always beneficial for the throughput if:
* (a) the device is not NCQ-capable, or
* (b) regardless of the presence of NCQ, the device is rotational
- * and the request pattern for bfqq is I/O-bound (possible
- * throughput losses caused by granting idling to seeky queues
- * are mitigated by the fact that, in all scenarios where
- * boosting throughput is the best thing to do, i.e., in all
- * symmetric scenarios, only a minimal idle time is allowed to
- * seeky queues).
+ * and the request pattern for bfqq is I/O-bound and sequential.
*
* Secondly, and in contrast to the above item (b), idling an
* NCQ-capable flash-based device would not boost the
- * throughput even with intense I/O; rather it would lower
+ * throughput even with sequential I/O; rather it would lower
* the throughput in proportion to how fast the device
* is. Accordingly, the next variable is true if any of the
* above conditions (a) and (b) is true, and, in particular,
@@ -6460,7 +6455,8 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
* device.
*/
idling_boosts_thr = !bfqd->hw_tag ||
- (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq));
+ (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) &&
+ bfq_bfqq_idle_window(bfqq));
/*
* The value of the next variable,
--
2.10.0
^ permalink raw reply related
* [PATCH V4 12/16] block, bfq: boost the throughput on NCQ-capable flash-based devices
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
This patch boosts the throughput on NCQ-capable flash-based devices,
while still preserving latency guarantees for interactive and soft
real-time applications. The throughput is boosted by just not idling
the device when the in-service queue remains empty, even if the queue
is sync and has a non-null idle window. This helps to keep the drive's
internal queue full, which is necessary to achieve maximum
performance. This solution to boost the throughput is a port of
commits a68bbdd and f7d7b7a for CFQ.
As already highlighted in a previous patch, allowing the device to
prefetch and internally reorder requests trivially causes loss of
control on the request service order, and hence on service guarantees.
Fortunately, as discussed in detail in the comments on the function
bfq_bfqq_may_idle(), if every process has to receive the same
fraction of the throughput, then the service order enforced by the
internal scheduler of a flash-based device is relatively close to that
enforced by BFQ. In particular, it is close enough to let service
guarantees be substantially preserved.
Things change in an asymmetric scenario, i.e., if not every process
has to receive the same fraction of the throughput. In this case, to
guarantee the desired throughput distribution, the device must be
prevented from prefetching requests. This is exactly what this patch
does in asymmetric scenarios.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
block/bfq-iosched.c | 154 ++++++++++++++++++++++++++++++++++++----------------
1 file changed, 106 insertions(+), 48 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b97801f..2081784 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6442,15 +6442,25 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
* The value of the variable is computed considering that
* idling is usually beneficial for the throughput if:
* (a) the device is not NCQ-capable, or
- * (b) regardless of the presence of NCQ, the request pattern
- * for bfqq is I/O-bound (possible throughput losses
- * caused by granting idling to seeky queues are mitigated
- * by the fact that, in all scenarios where boosting
- * throughput is the best thing to do, i.e., in all
- * symmetric scenarios, only a minimal idle time is
- * allowed to seeky queues).
+ * (b) regardless of the presence of NCQ, the device is rotational
+ * and the request pattern for bfqq is I/O-bound (possible
+ * throughput losses caused by granting idling to seeky queues
+ * are mitigated by the fact that, in all scenarios where
+ * boosting throughput is the best thing to do, i.e., in all
+ * symmetric scenarios, only a minimal idle time is allowed to
+ * seeky queues).
+ *
+ * Secondly, and in contrast to the above item (b), idling an
+ * NCQ-capable flash-based device would not boost the
+ * throughput even with intense I/O; rather it would lower
+ * the throughput in proportion to how fast the device
+ * is. Accordingly, the next variable is true if any of the
+ * above conditions (a) and (b) is true, and, in particular,
+ * happens to be false if bfqd is an NCQ-capable flash-based
+ * device.
*/
- idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
+ idling_boosts_thr = !bfqd->hw_tag ||
+ (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq));
/*
* The value of the next variable,
@@ -6491,14 +6501,16 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
bfqd->wr_busy_queues == 0;
/*
- * There is then a case where idling must be performed not for
- * throughput concerns, but to preserve service guarantees. To
- * introduce it, we can note that allowing the drive to
- * enqueue more than one request at a time, and hence
+ * There is then a case where idling must be performed not
+ * for throughput concerns, but to preserve service
+ * guarantees.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
* delegating de facto final scheduling decisions to the
- * drive's internal scheduler, causes loss of control on the
+ * drive's internal scheduler, entails loss of control on the
* actual request service order. In particular, the critical
- * situation is when requests from different processes happens
+ * situation is when requests from different processes happen
* to be present, at the same time, in the internal queue(s)
* of the drive. In such a situation, the drive, by deciding
* the service order of the internally-queued requests, does
@@ -6509,51 +6521,97 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
* the service distribution enforced by the drive's internal
* scheduler is likely to coincide with the desired
* device-throughput distribution only in a completely
- * symmetric scenario where: (i) each of these processes must
- * get the same throughput as the others; (ii) all these
- * processes have the same I/O pattern (either sequential or
- * random). In fact, in such a scenario, the drive will tend
- * to treat the requests of each of these processes in about
- * the same way as the requests of the others, and thus to
- * provide each of these processes with about the same
- * throughput (which is exactly the desired throughput
- * distribution). In contrast, in any asymmetric scenario,
- * device idling is certainly needed to guarantee that bfqq
- * receives its assigned fraction of the device throughput
- * (see [1] for details).
+ * symmetric scenario where:
+ * (i) each of these processes must get the same throughput as
+ * the others;
+ * (ii) all these processes have the same I/O pattern
+ (either sequential or random).
+ * In fact, in such a scenario, the drive will tend to treat
+ * the requests of each of these processes in about the same
+ * way as the requests of the others, and thus to provide
+ * each of these processes with about the same throughput
+ * (which is exactly the desired throughput distribution). In
+ * contrast, in any asymmetric scenario, device idling is
+ * certainly needed to guarantee that bfqq receives its
+ * assigned fraction of the device throughput (see [1] for
+ * details).
+ *
+ * We address this issue by controlling, actually, only the
+ * symmetry sub-condition (i), i.e., provided that
+ * sub-condition (i) holds, idling is not performed,
+ * regardless of whether sub-condition (ii) holds. In other
+ * words, only if sub-condition (i) holds, then idling is
+ * allowed, and the device tends to be prevented from queueing
+ * many requests, possibly of several processes. The reason
+ * for not controlling also sub-condition (ii) is that we
+ * exploit preemption to preserve guarantees in case of
+ * symmetric scenarios, even if (ii) does not hold, as
+ * explained in the next two paragraphs.
+ *
+ * Even if a queue, say Q, is expired when it remains idle, Q
+ * can still preempt the new in-service queue if the next
+ * request of Q arrives soon (see the comments on
+ * bfq_bfqq_update_budg_for_activation). If all queues and
+ * groups have the same weight, this form of preemption,
+ * combined with the hole-recovery heuristic described in the
+ * comments on function bfq_bfqq_update_budg_for_activation,
+ * are enough to preserve a correct bandwidth distribution in
+ * the mid term, even without idling. In fact, even if not
+ * idling allows the internal queues of the device to contain
+ * many requests, and thus to reorder requests, we can rather
+ * safely assume that the internal scheduler still preserves a
+ * minimum of mid-term fairness. The motivation for using
+ * preemption instead of idling is that, by not idling,
+ * service guarantees are preserved without minimally
+ * sacrificing throughput. In other words, both a high
+ * throughput and its desired distribution are obtained.
+ *
+ * More precisely, this preemption-based, idleless approach
+ * provides fairness in terms of IOPS, and not sectors per
+ * second. This can be seen with a simple example. Suppose
+ * that there are two queues with the same weight, but that
+ * the first queue receives requests of 8 sectors, while the
+ * second queue receives requests of 1024 sectors. In
+ * addition, suppose that each of the two queues contains at
+ * most one request at a time, which implies that each queue
+ * always remains idle after it is served. Finally, after
+ * remaining idle, each queue receives very quickly a new
+ * request. It follows that the two queues are served
+ * alternatively, preempting each other if needed. This
+ * implies that, although both queues have the same weight,
+ * the queue with large requests receives a service that is
+ * 1024/8 times as high as the service received by the other
+ * queue.
*
- * As for sub-condition (i), actually we check only whether
- * bfqq is being weight-raised. In fact, if bfqq is not being
- * weight-raised, we have that:
- * - if the process associated with bfqq is not I/O-bound, then
- * it is not either latency- or throughput-critical; therefore
- * idling is not needed for bfqq;
- * - if the process asociated with bfqq is I/O-bound, then
- * idling is already granted with bfqq (see the comments on
- * idling_boosts_thr).
+ * On the other hand, device idling is performed, and thus
+ * pure sector-domain guarantees are provided, for the
+ * following queues, which are likely to need stronger
+ * throughput guarantees: weight-raised queues, and queues
+ * with a higher weight than other queues. When such queues
+ * are active, sub-condition (i) is false, which triggers
+ * device idling.
*
- * We do not check sub-condition (ii) at all, i.e., the next
- * variable is true if and only if bfqq is being
- * weight-raised. We do not need to control sub-condition (ii)
- * for the following reason:
- * - if bfqq is being weight-raised, then idling is already
- * guaranteed to bfqq by sub-condition (i);
- * - if bfqq is not being weight-raised, then idling is
- * already guaranteed to bfqq (only) if it matters, i.e., if
- * bfqq is associated to a currently I/O-bound process (see
- * the above comment on sub-condition (i)).
+ * According to the above considerations, the next variable is
+ * true (only) if sub-condition (i) holds. To compute the
+ * value of this variable, we not only use the return value of
+ * the function bfq_symmetric_scenario(), but also check
+ * whether bfqq is being weight-raised, because
+ * bfq_symmetric_scenario() does not take into account also
+ * weight-raised queues (see comments on
+ * bfq_weights_tree_add()).
*
* As a side note, it is worth considering that the above
* device-idling countermeasures may however fail in the
* following unlucky scenario: if idling is (correctly)
- * disabled in a time period during which the symmetry
- * sub-condition holds, and hence the device is allowed to
+ * disabled in a time period during which all symmetry
+ * sub-conditions hold, and hence the device is allowed to
* enqueue many requests, but at some later point in time some
* sub-condition stops to hold, then it may become impossible
* to let requests be served in the desired order until all
* the requests already queued in the device have been served.
*/
- asymmetric_scenario = bfqq->wr_coeff > 1;
+ asymmetric_scenario = bfqq->wr_coeff > 1 ||
+ !bfq_symmetric_scenario(bfqd);
/*
* We have now all the components we need to compute the return
--
2.10.0
^ permalink raw reply related
* [PATCH V4 11/16] block, bfq: reduce idling only in symmetric scenarios
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Riccardo Pizzetti,
Samuele Zecchini, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
From: Arianna Avanzini <avanzini.arianna@gmail.com>
A seeky queue (i..e, a queue containing random requests) is assigned a
very small device-idling slice, for throughput issues. Unfortunately,
given the process associated with a seeky queue, this behavior causes
the following problem: if the process, say P, performs sync I/O and
has a higher weight than some other processes doing I/O and associated
with non-seeky queues, then BFQ may fail to guarantee to P its
reserved share of the throughput. The reason is that idling is key
for providing service guarantees to processes doing sync I/O [1].
This commit addresses this issue by allowing the device-idling slice
to be reduced for a seeky queue only if the scenario happens to be
symmetric, i.e., if all the queues are to receive the same share of
the throughput.
[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Riccardo Pizzetti <riccardo.pizzetti@gmail.com>
Signed-off-by: Samuele Zecchini <samuele.zecchini92@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
---
block/bfq-iosched.c | 287 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 280 insertions(+), 7 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6e7388a..b97801f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -183,6 +183,20 @@ struct bfq_sched_data {
};
/**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ * with a given weight.
+ */
+struct bfq_weight_counter {
+ unsigned int weight; /* weight of the entities this counter refers to */
+ unsigned int num_active; /* nr of active entities with this weight */
+ /*
+ * Weights tree member (see bfq_data's @queue_weights_tree and
+ * @group_weights_tree)
+ */
+ struct rb_node weights_node;
+};
+
+/**
* struct bfq_entity - schedulable entity.
*
* A bfq_entity is used to represent either a bfq_queue (leaf node in the
@@ -212,6 +226,8 @@ struct bfq_sched_data {
struct bfq_entity {
/* service_tree member */
struct rb_node rb_node;
+ /* pointer to the weight counter associated with this entity */
+ struct bfq_weight_counter *weight_counter;
/*
* Flag, true if the entity is on a tree (either the active or
@@ -456,6 +472,25 @@ struct bfq_data {
struct bfq_group *root_group;
/*
+ * rbtree of weight counters of @bfq_queues, sorted by
+ * weight. Used to keep track of whether all @bfq_queues have
+ * the same weight. The tree contains one counter for each
+ * distinct weight associated to some active and not
+ * weight-raised @bfq_queue (see the comments to the functions
+ * bfq_weights_tree_[add|remove] for further details).
+ */
+ struct rb_root queue_weights_tree;
+ /*
+ * rbtree of non-queue @bfq_entity weight counters, sorted by
+ * weight. Used to keep track of whether all @bfq_groups have
+ * the same weight. The tree contains one counter for each
+ * distinct weight associated to some active @bfq_group (see
+ * the comments to the functions bfq_weights_tree_[add|remove]
+ * for further details).
+ */
+ struct rb_root group_weights_tree;
+
+ /*
* Number of bfq_queues containing requests (including the
* queue in service, even if it is idling).
*/
@@ -791,6 +826,11 @@ struct bfq_group_data {
* to avoid too many special cases during group creation/
* migration.
* @stats: stats for this bfqg.
+ * @active_entities: number of active entities belonging to the group;
+ * unused for the root group. Used to know whether there
+ * are groups with more than one active @bfq_entity
+ * (see the comments to the function
+ * bfq_bfqq_may_idle()).
* @rq_pos_tree: rbtree sorted by next_request position, used when
* determining if two or more queues have interleaving
* requests (see bfq_find_close_cooperator()).
@@ -818,6 +858,8 @@ struct bfq_group {
struct bfq_entity *my_entity;
+ int active_entities;
+
struct rb_root rq_pos_tree;
struct bfqg_stats stats;
@@ -1254,12 +1296,27 @@ static bool bfq_update_parent_budget(struct bfq_entity *next_in_service)
* a candidate for next service (i.e, a candidate entity to serve
* after the in-service entity is expired). The function then returns
* true.
+ *
+ * In contrast, the entity could stil be a candidate for next service
+ * if it is not a queue, and has more than one child. In fact, even if
+ * one of its children is about to be set in service, other children
+ * may still be the next to serve. As a consequence, a non-queue
+ * entity is not a candidate for next-service only if it has only one
+ * child. And only if this condition holds, then the function returns
+ * true for a non-queue entity.
*/
static bool bfq_no_longer_next_in_service(struct bfq_entity *entity)
{
+ struct bfq_group *bfqg;
+
if (bfq_entity_to_bfqq(entity))
return true;
+ bfqg = container_of(entity, struct bfq_group, entity);
+
+ if (bfqg->active_entities == 1)
+ return true;
+
return false;
}
@@ -1498,6 +1555,15 @@ static void bfq_update_active_tree(struct rb_node *node)
goto up;
}
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+ struct bfq_entity *entity,
+ struct rb_root *root);
+
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+ struct bfq_entity *entity,
+ struct rb_root *root);
+
+
/**
* bfq_active_insert - insert an entity in the active tree of its
* group/device.
@@ -1536,6 +1602,13 @@ static void bfq_active_insert(struct bfq_service_tree *st,
#endif
if (bfqq)
list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ else /* bfq_group */
+ bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
+
+ if (bfqg != bfqd->root_group)
+ bfqg->active_entities++;
+#endif
}
/**
@@ -1631,6 +1704,14 @@ static void bfq_active_extract(struct bfq_service_tree *st,
#endif
if (bfqq)
list_del(&bfqq->bfqq_list);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ else /* bfq_group */
+ bfq_weights_tree_remove(bfqd, entity,
+ &bfqd->group_weights_tree);
+
+ if (bfqg != bfqd->root_group)
+ bfqg->active_entities--;
+#endif
}
/**
@@ -1731,6 +1812,7 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
unsigned int prev_weight, new_weight;
struct bfq_data *bfqd = NULL;
+ struct rb_root *root;
#ifdef CONFIG_BFQ_GROUP_IOSCHED
struct bfq_sched_data *sd;
struct bfq_group *bfqg;
@@ -1780,7 +1862,26 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
prev_weight = entity->weight;
new_weight = entity->orig_weight *
(bfqq ? bfqq->wr_coeff : 1);
+ /*
+ * If the weight of the entity changes, remove the entity
+ * from its old weight counter (if there is a counter
+ * associated with the entity), and add it to the counter
+ * associated with its new weight.
+ */
+ if (prev_weight != new_weight) {
+ root = bfqq ? &bfqd->queue_weights_tree :
+ &bfqd->group_weights_tree;
+ bfq_weights_tree_remove(bfqd, entity, root);
+ }
entity->weight = new_weight;
+ /*
+ * Add the entity to its weights tree only if it is
+ * not associated with a weight-raised queue.
+ */
+ if (prev_weight != new_weight &&
+ (bfqq ? bfqq->wr_coeff == 1 : 1))
+ /* If we get here, root has been initialized. */
+ bfq_weights_tree_add(bfqd, entity, root);
new_st->wsum += entity->weight;
@@ -2606,6 +2707,10 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqd->busy_queues--;
+ if (!bfqq->dispatched)
+ bfq_weights_tree_remove(bfqd, &bfqq->entity,
+ &bfqd->queue_weights_tree);
+
if (bfqq->wr_coeff > 1)
bfqd->wr_busy_queues--;
@@ -2626,6 +2731,11 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
bfq_mark_bfqq_busy(bfqq);
bfqd->busy_queues++;
+ if (!bfqq->dispatched)
+ if (bfqq->wr_coeff == 1)
+ bfq_weights_tree_add(bfqd, &bfqq->entity,
+ &bfqd->queue_weights_tree);
+
if (bfqq->wr_coeff > 1)
bfqd->wr_busy_queues++;
}
@@ -3028,6 +3138,7 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
* in bfq_init_queue()
*/
bfqg->bfqd = bfqd;
+ bfqg->active_entities = 0;
bfqg->rq_pos_tree = RB_ROOT;
}
@@ -3916,6 +4027,158 @@ static void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
}
/*
+ * Tell whether there are active queues or groups with differentiated weights.
+ */
+static bool bfq_differentiated_weights(struct bfq_data *bfqd)
+{
+ /*
+ * For weights to differ, at least one of the trees must contain
+ * at least two nodes.
+ */
+ return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
+ (bfqd->queue_weights_tree.rb_node->rb_left ||
+ bfqd->queue_weights_tree.rb_node->rb_right)
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ ) ||
+ (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
+ (bfqd->group_weights_tree.rb_node->rb_left ||
+ bfqd->group_weights_tree.rb_node->rb_right)
+#endif
+ );
+}
+
+/*
+ * The following function returns true if every queue must receive the
+ * same share of the throughput (this condition is used when deciding
+ * whether idling may be disabled, see the comments in the function
+ * bfq_bfqq_may_idle()).
+ *
+ * Such a scenario occurs when:
+ * 1) all active queues have the same weight,
+ * 2) all active groups at the same level in the groups tree have the same
+ * weight,
+ * 3) all active groups at the same level in the groups tree have the same
+ * number of children.
+ *
+ * Unfortunately, keeping the necessary state for evaluating exactly the
+ * above symmetry conditions would be quite complex and time-consuming.
+ * Therefore this function evaluates, instead, the following stronger
+ * sub-conditions, for which it is much easier to maintain the needed
+ * state:
+ * 1) all active queues have the same weight,
+ * 2) all active groups have the same weight,
+ * 3) all active groups have at most one active child each.
+ * In particular, the last two conditions are always true if hierarchical
+ * support and the cgroups interface are not enabled, thus no state needs
+ * to be maintained in this case.
+ */
+static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
+{
+ return !bfq_differentiated_weights(bfqd);
+}
+
+/*
+ * If the weight-counter tree passed as input contains no counter for
+ * the weight of the input entity, then add that counter; otherwise just
+ * increment the existing counter.
+ *
+ * Note that weight-counter trees contain few nodes in mostly symmetric
+ * scenarios. For example, if all queues have the same weight, then the
+ * weight-counter tree for the queues may contain at most one node.
+ * This holds even if low_latency is on, because weight-raised queues
+ * are not inserted in the tree.
+ * In most scenarios, the rate at which nodes are created/destroyed
+ * should be low too.
+ */
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+ struct bfq_entity *entity,
+ struct rb_root *root)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /*
+ * Do not insert if the entity is already associated with a
+ * counter, which happens if:
+ * 1) the entity is associated with a queue,
+ * 2) a request arrival has caused the queue to become both
+ * non-weight-raised, and hence change its weight, and
+ * backlogged; in this respect, each of the two events
+ * causes an invocation of this function,
+ * 3) this is the invocation of this function caused by the
+ * second event. This second invocation is actually useless,
+ * and we handle this fact by exiting immediately. More
+ * efficient or clearer solutions might possibly be adopted.
+ */
+ if (entity->weight_counter)
+ return;
+
+ while (*new) {
+ struct bfq_weight_counter *__counter = container_of(*new,
+ struct bfq_weight_counter,
+ weights_node);
+ parent = *new;
+
+ if (entity->weight == __counter->weight) {
+ entity->weight_counter = __counter;
+ goto inc_counter;
+ }
+ if (entity->weight < __counter->weight)
+ new = &((*new)->rb_left);
+ else
+ new = &((*new)->rb_right);
+ }
+
+ entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+ GFP_ATOMIC);
+
+ /*
+ * In the unlucky event of an allocation failure, we just
+ * exit. This will cause the weight of entity to not be
+ * considered in bfq_differentiated_weights, which, in its
+ * turn, causes the scenario to be deemed wrongly symmetric in
+ * case entity's weight would have been the only weight making
+ * the scenario asymmetric. On the bright side, no unbalance
+ * will however occur when entity becomes inactive again (the
+ * invocation of this function is triggered by an activation
+ * of entity). In fact, bfq_weights_tree_remove does nothing
+ * if !entity->weight_counter.
+ */
+ if (unlikely(!entity->weight_counter))
+ return;
+
+ entity->weight_counter->weight = entity->weight;
+ rb_link_node(&entity->weight_counter->weights_node, parent, new);
+ rb_insert_color(&entity->weight_counter->weights_node, root);
+
+inc_counter:
+ entity->weight_counter->num_active++;
+}
+
+/*
+ * Decrement the weight counter associated with the entity, and, if the
+ * counter reaches 0, remove the counter from the tree.
+ * See the comments to the function bfq_weights_tree_add() for considerations
+ * about overhead.
+ */
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+ struct bfq_entity *entity,
+ struct rb_root *root)
+{
+ if (!entity->weight_counter)
+ return;
+
+ entity->weight_counter->num_active--;
+ if (entity->weight_counter->num_active > 0)
+ goto reset_entity_pointer;
+
+ rb_erase(&entity->weight_counter->weights_node, root);
+ kfree(entity->weight_counter);
+
+reset_entity_pointer:
+ entity->weight_counter = NULL;
+}
+
+/*
* Return expired entry, or NULL to just start from scratch in rbtree.
*/
static struct request *bfq_check_fifo(struct bfq_queue *bfqq,
@@ -5293,13 +5556,17 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
*/
sl = bfqd->bfq_slice_idle;
/*
- * Unless the queue is being weight-raised, grant only minimum
- * idle time if the queue is seeky. A long idling is preserved
- * for a weight-raised queue, because it is needed for
- * guaranteeing to the queue its reserved share of the
- * throughput.
- */
- if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1)
+ * Unless the queue is being weight-raised or the scenario is
+ * asymmetric, grant only minimum idle time if the queue
+ * is seeky. A long idling is preserved for a weight-raised
+ * queue, or, more in general, in an asymmetric scenario,
+ * because a long idling is needed for guaranteeing to a queue
+ * its reserved share of the throughput (in particular, it is
+ * needed if the queue has a higher weight than some other
+ * queue).
+ */
+ if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
+ bfq_symmetric_scenario(bfqd))
sl = min_t(u64, sl, BFQ_MIN_TT);
bfqd->last_idling_start = ktime_get();
@@ -7197,6 +7464,9 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
* mechanism).
*/
bfqq->budget_timeout = jiffies;
+
+ bfq_weights_tree_remove(bfqd, &bfqq->entity,
+ &bfqd->queue_weights_tree);
}
now_ns = ktime_get_ns();
@@ -7627,6 +7897,9 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
HRTIMER_MODE_REL);
bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
+ bfqd->queue_weights_tree = RB_ROOT;
+ bfqd->group_weights_tree = RB_ROOT;
+
INIT_LIST_HEAD(&bfqd->active_list);
INIT_LIST_HEAD(&bfqd->idle_list);
--
2.10.0
^ permalink raw reply related
* [PATCH V4 10/16] block, bfq: add Early Queue Merge (EQM)
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Mauro Andreolini,
Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
From: Arianna Avanzini <avanzini.arianna@gmail.com>
A set of processes may happen to perform interleaved reads, i.e.,
read requests whose union would give rise to a sequential read pattern.
There are two typical cases: first, processes reading fixed-size chunks
of data at a fixed distance from each other; second, processes reading
variable-size chunks at variable distances. The latter case occurs for
example with QEMU, which splits the I/O generated by a guest into
multiple chunks, and lets these chunks be served by a pool of I/O
threads, iteratively assigning the next chunk of I/O to the first
available thread. CFQ denotes as 'cooperating' a set of processes that
are doing interleaved I/O, and when it detects cooperating processes,
it merges their queues to obtain a sequential I/O pattern from the union
of their I/O requests, and hence boost the throughput.
Unfortunately, in the following frequent case, the mechanism
implemented in CFQ for detecting cooperating processes and merging
their queues is not responsive enough to handle also the fluctuating
I/O pattern of the second type of processes. Suppose that one process
of the second type issues a request close to the next request to serve
of another process of the same type. At that time the two processes
would be considered as cooperating. But, if the request issued by the
first process is to be merged with some other already-queued request,
then, from the moment at which this request arrives, to the moment
when CFQ controls whether the two processes are cooperating, the two
processes are likely to be already doing I/O in distant zones of the
disk surface or device memory.
CFQ uses however preemption to get a sequential read pattern out of
the read requests performed by the second type of processes too. As a
consequence, CFQ uses two different mechanisms to achieve the same
goal: boosting the throughput with interleaved I/O.
This patch introduces Early Queue Merge (EQM), a unified mechanism to
get a sequential read pattern with both types of processes. The main
idea is to immediately check whether a newly-arrived request lets some
pair of processes become cooperating, both in the case of actual
request insertion and, to be responsive with the second type of
processes, in the case of request merge. Both types of processes are
then handled by just merging their queues.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
---
block/bfq-iosched.c | 881 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 840 insertions(+), 41 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index deb1f21c..6e7388a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -281,11 +281,12 @@ struct bfq_ttime {
* struct bfq_queue - leaf schedulable entity.
*
* A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async. @cgroup holds a reference to
- * the cgroup, to be sure that it does not disappear while a bfqq
- * still references it (mostly to avoid races between request issuing
- * and task migration followed by cgroup destruction). All the fields
- * are protected by the queue lock of the containing bfqd.
+ * io_context or more, if it is async or shared between cooperating
+ * processes. @cgroup holds a reference to the cgroup, to be sure that it
+ * does not disappear while a bfqq still references it (mostly to avoid
+ * races between request issuing and task migration followed by cgroup
+ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
*/
struct bfq_queue {
/* reference counter */
@@ -298,6 +299,16 @@ struct bfq_queue {
/* next ioprio and ioprio class if a change is in progress */
unsigned short new_ioprio, new_ioprio_class;
+ /*
+ * Shared bfq_queue if queue is cooperating with one or more
+ * other queues.
+ */
+ struct bfq_queue *new_bfqq;
+ /* request-position tree member (see bfq_group's @rq_pos_tree) */
+ struct rb_node pos_node;
+ /* request-position tree root (see bfq_group's @rq_pos_tree) */
+ struct rb_root *pos_root;
+
/* sorted list of pending requests */
struct rb_root sort_list;
/* if fifo isn't expired, next request to serve */
@@ -347,6 +358,12 @@ struct bfq_queue {
/* pid of the process owning the queue, used for logging purposes */
pid_t pid;
+ /*
+ * Pointer to the bfq_io_cq owning the bfq_queue, set to %NULL
+ * if the queue is shared.
+ */
+ struct bfq_io_cq *bic;
+
/* current maximum weight-raising time for this queue */
unsigned long wr_cur_max_time;
/*
@@ -375,10 +392,13 @@ struct bfq_queue {
* last transition from idle to backlogged.
*/
unsigned long service_from_backlogged;
+
/*
* Value of wr start time when switching to soft rt
*/
unsigned long wr_start_at_switch_to_srt;
+
+ unsigned long split_time; /* time of last split */
};
/**
@@ -394,6 +414,26 @@ struct bfq_io_cq {
#ifdef CONFIG_BFQ_GROUP_IOSCHED
uint64_t blkcg_serial_nr; /* the current blkcg serial */
#endif
+ /*
+ * Snapshot of the idle window before merging; taken to
+ * remember this value while the queue is merged, so as to be
+ * able to restore it in case of split.
+ */
+ bool saved_idle_window;
+ /*
+ * Same purpose as the previous two fields for the I/O bound
+ * classification of a queue.
+ */
+ bool saved_IO_bound;
+
+ /*
+ * Similar to previous fields: save wr information.
+ */
+ unsigned long saved_wr_coeff;
+ unsigned long saved_last_wr_start_finish;
+ unsigned long saved_wr_start_at_switch_to_srt;
+ unsigned int saved_wr_cur_max_time;
+ struct bfq_ttime saved_ttime;
};
enum bfq_device_speed {
@@ -584,6 +624,15 @@ struct bfq_data {
struct bfq_io_cq *bio_bic;
/* bfqq associated with the task issuing current bio for merging */
struct bfq_queue *bio_bfqq;
+
+ /*
+ * io context to put right after bfqd->lock is released. This
+ * filed is used to perform put_io_context, when needed, to
+ * after the scheduler lock has been released, and thus
+ * prevent an ioc->lock from being possibly taken while the
+ * scheduler lock is being held.
+ */
+ struct io_context *ioc_to_put;
};
enum bfqq_state_flags {
@@ -605,6 +654,8 @@ enum bfqq_state_flags {
* may need softrt-next-start
* update
*/
+ BFQQF_coop, /* bfqq is shared */
+ BFQQF_split_coop /* shared bfqq will be split */
};
#define BFQ_BFQQ_FNS(name) \
@@ -628,6 +679,8 @@ BFQ_BFQQ_FNS(fifo_expire);
BFQ_BFQQ_FNS(idle_window);
BFQ_BFQQ_FNS(sync);
BFQ_BFQQ_FNS(IO_bound);
+BFQ_BFQQ_FNS(coop);
+BFQ_BFQQ_FNS(split_coop);
BFQ_BFQQ_FNS(softrt_update);
#undef BFQ_BFQQ_FNS
@@ -738,6 +791,9 @@ struct bfq_group_data {
* to avoid too many special cases during group creation/
* migration.
* @stats: stats for this bfqg.
+ * @rq_pos_tree: rbtree sorted by next_request position, used when
+ * determining if two or more queues have interleaving
+ * requests (see bfq_find_close_cooperator()).
*
* Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
* there is a set of bfq_groups, each one collecting the lower-level
@@ -762,6 +818,8 @@ struct bfq_group {
struct bfq_entity *my_entity;
+ struct rb_root rq_pos_tree;
+
struct bfqg_stats stats;
};
@@ -811,6 +869,27 @@ static struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
return bic->icq.q->elevator->elevator_data;
}
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+
+static struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq)
+{
+ struct bfq_entity *group_entity = bfqq->entity.parent;
+
+ if (!group_entity)
+ group_entity = &bfqq->bfqd->root_group->entity;
+
+ return container_of(group_entity, struct bfq_group, entity);
+}
+
+#else
+
+static struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq)
+{
+ return bfqq->bfqd->root_group;
+}
+
+#endif
+
static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio);
static void bfq_put_queue(struct bfq_queue *bfqq);
static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
@@ -975,6 +1054,34 @@ static void bfq_schedule_dispatch(struct bfq_data *bfqd)
}
}
+/*
+ * Next two functions release bfqd->lock and put the io context
+ * pointed by bfqd->ioc_to_put. This delayed put is used to not risk
+ * to take an ioc->lock while the scheduler lock is being held.
+ */
+static void bfq_unlock_put_ioc(struct bfq_data *bfqd)
+{
+ struct io_context *ioc_to_put = bfqd->ioc_to_put;
+
+ bfqd->ioc_to_put = NULL;
+ spin_unlock_irq(&bfqd->lock);
+
+ if (ioc_to_put)
+ put_io_context(ioc_to_put);
+}
+
+static void bfq_unlock_put_ioc_restore(struct bfq_data *bfqd,
+ unsigned long flags)
+{
+ struct io_context *ioc_to_put = bfqd->ioc_to_put;
+
+ bfqd->ioc_to_put = NULL;
+ spin_unlock_irqrestore(&bfqd->lock, flags);
+
+ if (ioc_to_put)
+ put_io_context(ioc_to_put);
+}
+
/**
* bfq_gt - compare two timestamps.
* @a: first ts.
@@ -2425,7 +2532,14 @@ static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
struct bfq_entity *entity = in_serv_entity;
if (bfqd->in_service_bic) {
- put_io_context(bfqd->in_service_bic->icq.ioc);
+ /*
+ * Schedule the release of a reference to
+ * bfqd->in_service_bic->icq.ioc to right after the
+ * scheduler lock is released. This ioc is not
+ * released immediately, to not risk to possibly take
+ * an ioc->lock while holding the scheduler lock.
+ */
+ bfqd->ioc_to_put = bfqd->in_service_bic->icq.ioc;
bfqd->in_service_bic = NULL;
}
@@ -2914,6 +3028,7 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
* in bfq_init_queue()
*/
bfqg->bfqd = bfqd;
+ bfqg->rq_pos_tree = RB_ROOT;
}
static void bfq_pd_free(struct blkg_policy_data *pd)
@@ -2982,6 +3097,8 @@ static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
return bfqg;
}
+static void bfq_pos_tree_add_move(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq);
static void bfq_bfqq_expire(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
bool compensate,
@@ -3030,8 +3147,10 @@ static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
entity->sched_data = &bfqg->sched_data;
bfqg_get(bfqg);
- if (bfq_bfqq_busy(bfqq))
+ if (bfq_bfqq_busy(bfqq)) {
+ bfq_pos_tree_add_move(bfqd, bfqq);
bfq_activate_bfqq(bfqd, bfqq);
+ }
if (!bfqd->in_service_queue && !bfqd->rq_in_driver)
bfq_schedule_dispatch(bfqd);
@@ -3071,8 +3190,7 @@ static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
bic_set_bfqq(bic, NULL, 0);
bfq_log_bfqq(bfqd, async_bfqq,
"bic_change_group: %p %d",
- async_bfqq,
- async_bfqq->ref);
+ async_bfqq, async_bfqq->ref);
bfq_put_queue(async_bfqq);
}
}
@@ -3214,7 +3332,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
__bfq_deactivate_entity(entity, false);
bfq_put_async_queues(bfqd, bfqg);
- spin_unlock_irqrestore(&bfqd->lock, flags);
+ bfq_unlock_put_ioc_restore(bfqd, flags);
/*
* @blkg is going offline and will be ignored by
* blkg_[rw]stat_recursive_sum(). Transfer stats to the parent so
@@ -3731,6 +3849,72 @@ static struct request *bfq_choose_req(struct bfq_data *bfqd,
}
}
+static struct bfq_queue *
+bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
+ sector_t sector, struct rb_node **ret_parent,
+ struct rb_node ***rb_link)
+{
+ struct rb_node **p, *parent;
+ struct bfq_queue *bfqq = NULL;
+
+ parent = NULL;
+ p = &root->rb_node;
+ while (*p) {
+ struct rb_node **n;
+
+ parent = *p;
+ bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+
+ /*
+ * Sort strictly based on sector. Smallest to the left,
+ * largest to the right.
+ */
+ if (sector > blk_rq_pos(bfqq->next_rq))
+ n = &(*p)->rb_right;
+ else if (sector < blk_rq_pos(bfqq->next_rq))
+ n = &(*p)->rb_left;
+ else
+ break;
+ p = n;
+ bfqq = NULL;
+ }
+
+ *ret_parent = parent;
+ if (rb_link)
+ *rb_link = p;
+
+ bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
+ (unsigned long long)sector,
+ bfqq ? bfqq->pid : 0);
+
+ return bfqq;
+}
+
+static void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ struct rb_node **p, *parent;
+ struct bfq_queue *__bfqq;
+
+ if (bfqq->pos_root) {
+ rb_erase(&bfqq->pos_node, bfqq->pos_root);
+ bfqq->pos_root = NULL;
+ }
+
+ if (bfq_class_idle(bfqq))
+ return;
+ if (!bfqq->next_rq)
+ return;
+
+ bfqq->pos_root = &bfq_bfqq_to_bfqg(bfqq)->rq_pos_tree;
+ __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
+ blk_rq_pos(bfqq->next_rq), &parent, &p);
+ if (!__bfqq) {
+ rb_link_node(&bfqq->pos_node, parent, p);
+ rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
+ } else
+ bfqq->pos_root = NULL;
+}
+
/*
* Return expired entry, or NULL to just start from scratch in rbtree.
*/
@@ -3837,6 +4021,43 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
}
}
+static void
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+ if (bic->saved_idle_window)
+ bfq_mark_bfqq_idle_window(bfqq);
+ else
+ bfq_clear_bfqq_idle_window(bfqq);
+
+ if (bic->saved_IO_bound)
+ bfq_mark_bfqq_IO_bound(bfqq);
+ else
+ bfq_clear_bfqq_IO_bound(bfqq);
+
+ bfqq->ttime = bic->saved_ttime;
+ bfqq->wr_coeff = bic->saved_wr_coeff;
+ bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
+ bfqq->last_wr_start_finish = bic->saved_last_wr_start_finish;
+ bfqq->wr_cur_max_time = bic->saved_wr_cur_max_time;
+
+ if (bfqq->wr_coeff > 1 &&
+ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ bfqq->wr_cur_max_time)) {
+ bfq_log_bfqq(bfqq->bfqd, bfqq,
+ "resume state: switching off wr");
+
+ bfqq->wr_coeff = 1;
+ }
+
+ /* make sure weight will be updated, however we got here */
+ bfqq->entity.prio_changed = 1;
+}
+
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+ return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
+}
+
static int bfq_bfqq_budget_left(struct bfq_queue *bfqq)
{
struct bfq_entity *entity = &bfqq->entity;
@@ -4157,14 +4378,16 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
/*
* bfqq deserves to be weight-raised if:
* - it is sync,
- * - it has been idle for enough time or is soft real-time.
+ * - it has been idle for enough time or is soft real-time,
+ * - is linked to a bfq_io_cq (it is not shared in any sense).
*/
soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
time_is_before_jiffies(bfqq->soft_rt_next_start);
*interactive = idle_for_long_time;
wr_or_deserves_wr = bfqd->low_latency &&
(bfqq->wr_coeff > 1 ||
- (bfq_bfqq_sync(bfqq) && (*interactive || soft_rt)));
+ (bfq_bfqq_sync(bfqq) &&
+ bfqq->bic && (*interactive || soft_rt)));
/*
* Using the last flag, update budget and check whether bfqq
@@ -4186,14 +4409,22 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
}
if (bfqd->low_latency) {
- bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
- old_wr_coeff,
- wr_or_deserves_wr,
- *interactive,
- soft_rt);
-
- if (old_wr_coeff != bfqq->wr_coeff)
- bfqq->entity.prio_changed = 1;
+ if (unlikely(time_is_after_jiffies(bfqq->split_time)))
+ /* wraparound */
+ bfqq->split_time =
+ jiffies - bfqd->bfq_wr_min_idle_time - 1;
+
+ if (time_is_before_jiffies(bfqq->split_time +
+ bfqd->bfq_wr_min_idle_time)) {
+ bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
+ old_wr_coeff,
+ wr_or_deserves_wr,
+ *interactive,
+ soft_rt);
+
+ if (old_wr_coeff != bfqq->wr_coeff)
+ bfqq->entity.prio_changed = 1;
+ }
}
bfqq->last_idle_bklogged = jiffies;
@@ -4240,6 +4471,12 @@ static void bfq_add_request(struct request *rq)
next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
bfqq->next_rq = next_rq;
+ /*
+ * Adjust priority tree position, if next_rq changes.
+ */
+ if (prev != bfqq->next_rq)
+ bfq_pos_tree_add_move(bfqd, bfqq);
+
if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, old_wr_coeff,
rq, &interactive);
@@ -4368,6 +4605,14 @@ static void bfq_remove_request(struct request_queue *q,
*/
bfqq->entity.budget = bfqq->entity.service = 0;
}
+
+ /*
+ * Remove queue from request-position tree as it is empty.
+ */
+ if (bfqq->pos_root) {
+ rb_erase(&bfqq->pos_node, bfqq->pos_root);
+ bfqq->pos_root = NULL;
+ }
}
if (rq->cmd_flags & REQ_META)
@@ -4445,11 +4690,14 @@ static void bfq_request_merged(struct request_queue *q, struct request *req,
bfqd->last_position);
bfqq->next_rq = next_rq;
/*
- * If next_rq changes, update the queue's budget to fit
- * the new request.
+ * If next_rq changes, update both the queue's budget to
+ * fit the new request and the queue's position in its
+ * rq_pos_tree.
*/
- if (prev != bfqq->next_rq)
+ if (prev != bfqq->next_rq) {
bfq_updated_next_req(bfqd, bfqq);
+ bfq_pos_tree_add_move(bfqd, bfqq);
+ }
}
}
@@ -4532,12 +4780,364 @@ static void bfq_end_wr(struct bfq_data *bfqd)
spin_unlock_irq(&bfqd->lock);
}
+static sector_t bfq_io_struct_pos(void *io_struct, bool request)
+{
+ if (request)
+ return blk_rq_pos(io_struct);
+ else
+ return ((struct bio *)io_struct)->bi_iter.bi_sector;
+}
+
+static int bfq_rq_close_to_sector(void *io_struct, bool request,
+ sector_t sector)
+{
+ return abs(bfq_io_struct_pos(io_struct, request) - sector) <=
+ BFQQ_CLOSE_THR;
+}
+
+static struct bfq_queue *bfqq_find_close(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ sector_t sector)
+{
+ struct rb_root *root = &bfq_bfqq_to_bfqg(bfqq)->rq_pos_tree;
+ struct rb_node *parent, *node;
+ struct bfq_queue *__bfqq;
+
+ if (RB_EMPTY_ROOT(root))
+ return NULL;
+
+ /*
+ * First, if we find a request starting at the end of the last
+ * request, choose it.
+ */
+ __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
+ if (__bfqq)
+ return __bfqq;
+
+ /*
+ * If the exact sector wasn't found, the parent of the NULL leaf
+ * will contain the closest sector (rq_pos_tree sorted by
+ * next_request position).
+ */
+ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ if (blk_rq_pos(__bfqq->next_rq) < sector)
+ node = rb_next(&__bfqq->pos_node);
+ else
+ node = rb_prev(&__bfqq->pos_node);
+ if (!node)
+ return NULL;
+
+ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
+ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ return NULL;
+}
+
+static struct bfq_queue *bfq_find_close_cooperator(struct bfq_data *bfqd,
+ struct bfq_queue *cur_bfqq,
+ sector_t sector)
+{
+ struct bfq_queue *bfqq;
+
+ /*
+ * We shall notice if some of the queues are cooperating,
+ * e.g., working closely on the same area of the device. In
+ * that case, we can group them together and: 1) don't waste
+ * time idling, and 2) serve the union of their requests in
+ * the best possible order for throughput.
+ */
+ bfqq = bfqq_find_close(bfqd, cur_bfqq, sector);
+ if (!bfqq || bfqq == cur_bfqq)
+ return NULL;
+
+ return bfqq;
+}
+
+static struct bfq_queue *
+bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+ int process_refs, new_process_refs;
+ struct bfq_queue *__bfqq;
+
+ /*
+ * If there are no process references on the new_bfqq, then it is
+ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+ * may have dropped their last reference (not just their last process
+ * reference).
+ */
+ if (!bfqq_process_refs(new_bfqq))
+ return NULL;
+
+ /* Avoid a circular list and skip interim queue merges. */
+ while ((__bfqq = new_bfqq->new_bfqq)) {
+ if (__bfqq == bfqq)
+ return NULL;
+ new_bfqq = __bfqq;
+ }
+
+ process_refs = bfqq_process_refs(bfqq);
+ new_process_refs = bfqq_process_refs(new_bfqq);
+ /*
+ * If the process for the bfqq has gone away, there is no
+ * sense in merging the queues.
+ */
+ if (process_refs == 0 || new_process_refs == 0)
+ return NULL;
+
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+ new_bfqq->pid);
+
+ /*
+ * Merging is just a redirection: the requests of the process
+ * owning one of the two queues are redirected to the other queue.
+ * The latter queue, in its turn, is set as shared if this is the
+ * first time that the requests of some process are redirected to
+ * it.
+ *
+ * We redirect bfqq to new_bfqq and not the opposite, because we
+ * are in the context of the process owning bfqq, hence we have
+ * the io_cq of this process. So we can immediately configure this
+ * io_cq to redirect the requests of the process to new_bfqq.
+ *
+ * NOTE, even if new_bfqq coincides with the in-service queue, the
+ * io_cq of new_bfqq is not available, because, if the in-service
+ * queue is shared, bfqd->in_service_bic may not point to the
+ * io_cq of the in-service queue.
+ * Redirecting the requests of the process owning bfqq to the
+ * currently in-service queue is in any case the best option, as
+ * we feed the in-service queue with new requests close to the
+ * last request served and, by doing so, hopefully increase the
+ * throughput.
+ */
+ bfqq->new_bfqq = new_bfqq;
+ new_bfqq->ref += process_refs;
+ return new_bfqq;
+}
+
+static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
+ struct bfq_queue *new_bfqq)
+{
+ if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) ||
+ (bfqq->ioprio_class != new_bfqq->ioprio_class))
+ return false;
+
+ /*
+ * If either of the queues has already been detected as seeky,
+ * then merging it with the other queue is unlikely to lead to
+ * sequential I/O.
+ */
+ if (BFQQ_SEEKY(bfqq) || BFQQ_SEEKY(new_bfqq))
+ return false;
+
+ /*
+ * Interleaved I/O is known to be done by (some) applications
+ * only for reads, so it does not make sense to merge async
+ * queues.
+ */
+ if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq))
+ return false;
+
+ return true;
+}
+
+/*
+ * If this function returns true, then bfqq cannot be merged. The idea
+ * is that true cooperation happens very early after processes start
+ * to do I/O. Usually, late cooperations are just accidental false
+ * positives. In case bfqq is weight-raised, such false positives
+ * would evidently degrade latency guarantees for bfqq.
+ */
+static bool wr_from_too_long(struct bfq_queue *bfqq)
+{
+ return bfqq->wr_coeff > 1 &&
+ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ msecs_to_jiffies(100));
+}
+
+/*
+ * Attempt to schedule a merge of bfqq with the currently in-service
+ * queue or with a close queue among the scheduled queues. Return
+ * NULL if no merge was scheduled, a pointer to the shared bfq_queue
+ * structure otherwise.
+ *
+ * The OOM queue is not allowed to participate to cooperation: in fact, since
+ * the requests temporarily redirected to the OOM queue could be redirected
+ * again to dedicated queues at any time, the state needed to correctly
+ * handle merging with the OOM queue would be quite complex and expensive
+ * to maintain. Besides, in such a critical condition as an out of memory,
+ * the benefits of queue merging may be little relevant, or even negligible.
+ *
+ * Weight-raised queues can be merged only if their weight-raising
+ * period has just started. In fact cooperating processes are usually
+ * started together. Thus, with this filter we avoid false positives
+ * that would jeopardize low-latency guarantees.
+ *
+ * WARNING: queue merging may impair fairness among non-weight raised
+ * queues, for at least two reasons: 1) the original weight of a
+ * merged queue may change during the merged state, 2) even being the
+ * weight the same, a merged queue may be bloated with many more
+ * requests than the ones produced by its originally-associated
+ * process.
+ */
+static struct bfq_queue *
+bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ void *io_struct, bool request)
+{
+ struct bfq_queue *in_service_bfqq, *new_bfqq;
+
+ if (bfqq->new_bfqq)
+ return bfqq->new_bfqq;
+
+ if (!io_struct ||
+ wr_from_too_long(bfqq) ||
+ unlikely(bfqq == &bfqd->oom_bfqq))
+ return NULL;
+
+ /* If there is only one backlogged queue, don't search. */
+ if (bfqd->busy_queues == 1)
+ return NULL;
+
+ in_service_bfqq = bfqd->in_service_queue;
+
+ if (!in_service_bfqq || in_service_bfqq == bfqq ||
+ !bfqd->in_service_bic || wr_from_too_long(in_service_bfqq) ||
+ unlikely(in_service_bfqq == &bfqd->oom_bfqq))
+ goto check_scheduled;
+
+ if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
+ bfqq->entity.parent == in_service_bfqq->entity.parent &&
+ bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) {
+ new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
+ if (new_bfqq)
+ return new_bfqq;
+ }
+ /*
+ * Check whether there is a cooperator among currently scheduled
+ * queues. The only thing we need is that the bio/request is not
+ * NULL, as we need it to establish whether a cooperator exists.
+ */
+check_scheduled:
+ new_bfqq = bfq_find_close_cooperator(bfqd, bfqq,
+ bfq_io_struct_pos(io_struct, request));
+
+ if (new_bfqq && !wr_from_too_long(new_bfqq) &&
+ likely(new_bfqq != &bfqd->oom_bfqq) &&
+ bfq_may_be_close_cooperator(bfqq, new_bfqq))
+ return bfq_setup_merge(bfqq, new_bfqq);
+
+ return NULL;
+}
+
+static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
+{
+ struct bfq_io_cq *bic = bfqq->bic;
+
+ /*
+ * If !bfqq->bic, the queue is already shared or its requests
+ * have already been redirected to a shared queue; both idle window
+ * and weight raising state have already been saved. Do nothing.
+ */
+ if (!bic)
+ return;
+
+ bic->saved_ttime = bfqq->ttime;
+ bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+ bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
+ bic->saved_wr_coeff = bfqq->wr_coeff;
+ bic->saved_wr_start_at_switch_to_srt = bfqq->wr_start_at_switch_to_srt;
+ bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish;
+ bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
+}
+
+static void bfq_get_bic_reference(struct bfq_queue *bfqq)
+{
+ /*
+ * If bfqq->bic has a non-NULL value, the bic to which it belongs
+ * is about to begin using a shared bfq_queue.
+ */
+ if (bfqq->bic)
+ atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
+}
+
+static void
+bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+ struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+ (unsigned long)new_bfqq->pid);
+ /* Save weight raising and idle window of the merged queues */
+ bfq_bfqq_save_state(bfqq);
+ bfq_bfqq_save_state(new_bfqq);
+ if (bfq_bfqq_IO_bound(bfqq))
+ bfq_mark_bfqq_IO_bound(new_bfqq);
+ bfq_clear_bfqq_IO_bound(bfqq);
+
+ /*
+ * If bfqq is weight-raised, then let new_bfqq inherit
+ * weight-raising. To reduce false positives, neglect the case
+ * where bfqq has just been created, but has not yet made it
+ * to be weight-raised (which may happen because EQM may merge
+ * bfqq even before bfq_add_request is executed for the first
+ * time for bfqq).
+ */
+ if (new_bfqq->wr_coeff == 1 && bfqq->wr_coeff > 1) {
+ new_bfqq->wr_coeff = bfqq->wr_coeff;
+ new_bfqq->wr_cur_max_time = bfqq->wr_cur_max_time;
+ new_bfqq->last_wr_start_finish = bfqq->last_wr_start_finish;
+ new_bfqq->wr_start_at_switch_to_srt =
+ bfqq->wr_start_at_switch_to_srt;
+ if (bfq_bfqq_busy(new_bfqq))
+ bfqd->wr_busy_queues++;
+ new_bfqq->entity.prio_changed = 1;
+ }
+
+ if (bfqq->wr_coeff > 1) { /* bfqq has given its wr to new_bfqq */
+ bfqq->wr_coeff = 1;
+ bfqq->entity.prio_changed = 1;
+ if (bfq_bfqq_busy(bfqq))
+ bfqd->wr_busy_queues--;
+ }
+
+ bfq_log_bfqq(bfqd, new_bfqq, "merge_bfqqs: wr_busy %d",
+ bfqd->wr_busy_queues);
+
+ /*
+ * Grab a reference to the bic, to prevent it from being destroyed
+ * before being possibly touched by a bfq_split_bfqq().
+ */
+ bfq_get_bic_reference(bfqq);
+ bfq_get_bic_reference(new_bfqq);
+ /*
+ * Merge queues (that is, let bic redirect its requests to new_bfqq)
+ */
+ bic_set_bfqq(bic, new_bfqq, 1);
+ bfq_mark_bfqq_coop(new_bfqq);
+ /*
+ * new_bfqq now belongs to at least two bics (it is a shared queue):
+ * set new_bfqq->bic to NULL. bfqq either:
+ * - does not belong to any bic any more, and hence bfqq->bic must
+ * be set to NULL, or
+ * - is a queue whose owning bics have already been redirected to a
+ * different queue, hence the queue is destined to not belong to
+ * any bic soon and bfqq->bic is already NULL (therefore the next
+ * assignment causes no harm).
+ */
+ new_bfqq->bic = NULL;
+ bfqq->bic = NULL;
+ /* release process reference to bfqq */
+ bfq_put_queue(bfqq);
+}
+
static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
struct bio *bio)
{
struct bfq_data *bfqd = q->elevator->elevator_data;
bool is_sync = op_is_sync(bio->bi_opf);
- struct bfq_queue *bfqq = bfqd->bio_bfqq;
+ struct bfq_queue *bfqq = bfqd->bio_bfqq, *new_bfqq;
/*
* Disallow merge of a sync bio into an async request.
@@ -4552,6 +5152,37 @@ static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
if (!bfqq)
return false;
+ /*
+ * We take advantage of this function to perform an early merge
+ * of the queues of possible cooperating processes.
+ */
+ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
+ if (new_bfqq) {
+ /*
+ * bic still points to bfqq, then it has not yet been
+ * redirected to some other bfq_queue, and a queue
+ * merge beween bfqq and new_bfqq can be safely
+ * fulfillled, i.e., bic can be redirected to new_bfqq
+ * and bfqq can be put.
+ */
+ bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq,
+ new_bfqq);
+ /*
+ * If we get here, bio will be queued into new_queue,
+ * so use new_bfqq to decide whether bio and rq can be
+ * merged.
+ */
+ bfqq = new_bfqq;
+
+ /*
+ * Change also bqfd->bio_bfqq, as
+ * bfqd->bio_bic now points to new_bfqq, and
+ * this function may be invoked again (and then may
+ * use again bqfd->bio_bfqq).
+ */
+ bfqd->bio_bfqq = bfqq;
+ }
+
return bfqq == RQ_BFQQ(rq);
}
@@ -4959,6 +5590,15 @@ static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)
static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
+ /*
+ * If this bfqq is shared between multiple processes, check
+ * to make sure that those processes are still issuing I/Os
+ * within the mean seek distance. If not, it may be time to
+ * break the queues apart again.
+ */
+ if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
+ bfq_mark_bfqq_split_coop(bfqq);
+
if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
if (bfqq->dispatched == 0)
/*
@@ -4970,8 +5610,13 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
bfqq->budget_timeout = jiffies;
bfq_del_bfqq_busy(bfqd, bfqq, true);
- } else
+ } else {
bfq_requeue_bfqq(bfqd, bfqq);
+ /*
+ * Resort priority tree of potential close cooperators.
+ */
+ bfq_pos_tree_add_move(bfqd, bfqq);
+ }
/*
* All in-service entities must have been properly deactivated
@@ -5792,8 +6437,7 @@ static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
/*
* If too much time has elapsed from the beginning of
- * this weight-raising period, then end weight
- * raising.
+ * this weight-raising period, then end weight raising.
*/
if (time_is_before_jiffies(bfqq->last_wr_start_finish +
bfqq->wr_cur_max_time)) {
@@ -5969,8 +6613,9 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
struct request *rq;
spin_lock_irq(&bfqd->lock);
+
rq = __bfq_dispatch_request(hctx);
- spin_unlock_irq(&bfqd->lock);
+ bfq_unlock_put_ioc(bfqd);
return rq;
}
@@ -6004,6 +6649,25 @@ static void bfq_put_queue(struct bfq_queue *bfqq)
#endif
}
+static void bfq_put_cooperator(struct bfq_queue *bfqq)
+{
+ struct bfq_queue *__bfqq, *next;
+
+ /*
+ * If this queue was scheduled to merge with another queue, be
+ * sure to drop the reference taken on that queue (and others in
+ * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
+ */
+ __bfqq = bfqq->new_bfqq;
+ while (__bfqq) {
+ if (__bfqq == bfqq)
+ break;
+ next = __bfqq->new_bfqq;
+ bfq_put_queue(__bfqq);
+ __bfqq = next;
+ }
+}
+
static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
if (bfqq == bfqd->in_service_queue) {
@@ -6013,6 +6677,8 @@ static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, bfqq->ref);
+ bfq_put_cooperator(bfqq);
+
bfq_put_queue(bfqq); /* release process reference */
}
@@ -6028,9 +6694,20 @@ static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync)
unsigned long flags;
spin_lock_irqsave(&bfqd->lock, flags);
+ /*
+ * If the bic is using a shared queue, put the
+ * reference taken on the io_context when the bic
+ * started using a shared bfq_queue. This put cannot
+ * make ioc->ref_count reach 0, then no ioc->lock
+ * risks to be taken (leading to possible deadlock
+ * scenarios).
+ */
+ if (is_sync && bfq_bfqq_coop(bfqq))
+ put_io_context(bic->icq.ioc);
+
bfq_exit_bfqq(bfqd, bfqq);
bic_set_bfqq(bic, NULL, is_sync);
- spin_unlock_irq(&bfqd->lock);
+ bfq_unlock_put_ioc_restore(bfqd, flags);
}
}
@@ -6152,8 +6829,9 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqq->budget_timeout = bfq_smallest_from_now();
bfqq->wr_coeff = 1;
- bfqq->last_wr_start_finish = bfq_smallest_from_now();
+ bfqq->last_wr_start_finish = jiffies;
bfqq->wr_start_at_switch_to_srt = bfq_smallest_from_now();
+ bfqq->split_time = bfq_smallest_from_now();
/*
* Set to the value for which bfqq will not be deemed as
@@ -6288,6 +6966,11 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
return;
+ /* Idle window just restored, statistics are meaningless. */
+ if (time_is_after_eq_jiffies(bfqq->split_time +
+ bfqd->bfq_wr_min_idle_time))
+ return;
+
enable_idle = bfq_bfqq_idle_window(bfqq);
if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
@@ -6383,7 +7066,38 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
{
- struct bfq_queue *bfqq = RQ_BFQQ(rq);
+ struct bfq_queue *bfqq = RQ_BFQQ(rq),
+ *new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
+
+ if (new_bfqq) {
+ if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
+ new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
+ /*
+ * Release the request's reference to the old bfqq
+ * and make sure one is taken to the shared queue.
+ */
+ new_bfqq->allocated++;
+ bfqq->allocated--;
+ new_bfqq->ref++;
+ /*
+ * If the bic associated with the process
+ * issuing this request still points to bfqq
+ * (and thus has not been already redirected
+ * to new_bfqq or even some other bfq_queue),
+ * then complete the merge and redirect it to
+ * new_bfqq.
+ */
+ if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
+ bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
+ bfqq, new_bfqq);
+ /*
+ * rq is about to be enqueued into new_bfqq,
+ * release rq reference on bfqq
+ */
+ bfq_put_queue(bfqq);
+ rq->elv.priv[1] = new_bfqq;
+ bfqq = new_bfqq;
+ }
bfq_add_request(rq);
@@ -6425,7 +7139,7 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
}
}
- spin_unlock_irq(&bfqd->lock);
+ bfq_unlock_put_ioc(bfqd);
}
static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
@@ -6576,7 +7290,7 @@ static void bfq_put_rq_private(struct request_queue *q, struct request *rq)
bfq_completed_request(bfqq, bfqd);
bfq_put_rq_priv_body(bfqq);
- spin_unlock_irqrestore(&bfqd->lock, flags);
+ bfq_unlock_put_ioc_restore(bfqd, flags);
} else {
/*
* Request rq may be still/already in the scheduler,
@@ -6600,6 +7314,55 @@ static void bfq_put_rq_private(struct request_queue *q, struct request *rq)
}
/*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to that bfqq.
+ */
+static struct bfq_queue *
+bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+{
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
+
+ if (bfqq_process_refs(bfqq) == 1) {
+ bfqq->pid = current->pid;
+ bfq_clear_bfqq_coop(bfqq);
+ bfq_clear_bfqq_split_coop(bfqq);
+ return bfqq;
+ }
+
+ bic_set_bfqq(bic, NULL, 1);
+
+ bfq_put_cooperator(bfqq);
+
+ bfq_put_queue(bfqq);
+ return NULL;
+}
+
+static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
+ struct bfq_io_cq *bic,
+ struct bio *bio,
+ bool split, bool is_sync,
+ bool *new_queue)
+{
+ struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync);
+
+ if (likely(bfqq && bfqq != &bfqd->oom_bfqq))
+ return bfqq;
+
+ if (new_queue)
+ *new_queue = true;
+
+ if (bfqq)
+ bfq_put_queue(bfqq);
+ bfqq = bfq_get_queue(bfqd, bio, is_sync, bic);
+
+ bic_set_bfqq(bic, bfqq, is_sync);
+ if (split && is_sync)
+ bfqq->split_time = jiffies;
+
+ return bfqq;
+}
+
+/*
* Allocate bfq data structures associated with this request.
*/
static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
@@ -6609,6 +7372,7 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
const int is_sync = rq_is_sync(rq);
struct bfq_queue *bfqq;
+ bool new_queue = false;
spin_lock_irq(&bfqd->lock);
@@ -6619,12 +7383,28 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
bfq_bic_update_cgroup(bic, bio);
- bfqq = bic_to_bfqq(bic, is_sync);
- if (!bfqq || bfqq == &bfqd->oom_bfqq) {
- if (bfqq)
- bfq_put_queue(bfqq);
- bfqq = bfq_get_queue(bfqd, bio, is_sync, bic);
- bic_set_bfqq(bic, bfqq, is_sync);
+ bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio, false, is_sync,
+ &new_queue);
+
+ if (likely(!new_queue)) {
+ /* If the queue was seeky for too long, break it apart. */
+ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+ bfqq = bfq_split_bfqq(bic, bfqq);
+ /*
+ * A reference to bic->icq.ioc needs to be
+ * released after a queue split. Do not do it
+ * immediately, to not risk to possibly take
+ * an ioc->lock while holding the scheduler
+ * lock.
+ */
+ bfqd->ioc_to_put = bic->icq.ioc;
+
+ if (!bfqq)
+ bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
+ true, is_sync,
+ NULL);
+ }
}
bfqq->allocated++;
@@ -6635,7 +7415,25 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
rq->elv.priv[0] = bic;
rq->elv.priv[1] = bfqq;
- spin_unlock_irq(&bfqd->lock);
+ /*
+ * If a bfq_queue has only one process reference, it is owned
+ * by only this bic: we can then set bfqq->bic = bic. in
+ * addition, if the queue has also just been split, we have to
+ * resume its state.
+ */
+ if (likely(bfqq != &bfqd->oom_bfqq) && bfqq_process_refs(bfqq) == 1) {
+ bfqq->bic = bic;
+ if (bfqd->ioc_to_put) { /* if true, there has been a split */
+ /*
+ * The queue has just been split from a shared
+ * queue: restore the idle window and the
+ * possible weight raising period.
+ */
+ bfq_bfqq_resume_state(bfqq, bic);
+ }
+ }
+
+ bfq_unlock_put_ioc(bfqd);
return 0;
@@ -6680,7 +7478,7 @@ static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq)
bfq_bfqq_expire(bfqd, bfqq, true, reason);
schedule_dispatch:
- spin_unlock_irqrestore(&bfqd->lock, flags);
+ bfq_unlock_put_ioc_restore(bfqd, flags);
bfq_schedule_dispatch(bfqd);
}
@@ -6777,6 +7575,7 @@ static void bfq_init_root_group(struct bfq_group *root_group,
root_group->my_entity = NULL;
root_group->bfqd = bfqd;
#endif
+ root_group->rq_pos_tree = RB_ROOT;
for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
root_group->sched_data.bfq_class_idle_last_service = jiffies;
--
2.10.0
^ permalink raw reply related
* [PATCH V4 09/16] block, bfq: reduce latency during request-pool saturation
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
This patch introduces an heuristic that reduces latency when the
I/O-request pool is saturated. This goal is achieved by disabling
device idling, for non-weight-raised queues, when there are weight-
raised queues with pending or in-flight requests. In fact, as
explained in more detail in the comment on the function
bfq_bfqq_may_idle(), this reduces the rate at which processes
associated with non-weight-raised queues grab requests from the pool,
thereby increasing the probability that processes associated with
weight-raised queues get a request immediately (or at least soon) when
they need one. Along the same line, if there are weight-raised queues,
then this patch halves the service rate of async (write) requests for
non-weight-raised queues.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
block/bfq-iosched.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 63 insertions(+), 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 574a5f6..deb1f21c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -420,6 +420,8 @@ struct bfq_data {
* queue in service, even if it is idling).
*/
int busy_queues;
+ /* number of weight-raised busy @bfq_queues */
+ int wr_busy_queues;
/* number of queued requests */
int queued;
/* number of requests dispatched and waiting for completion */
@@ -2490,6 +2492,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqd->busy_queues--;
+ if (bfqq->wr_coeff > 1)
+ bfqd->wr_busy_queues--;
+
bfqg_stats_update_dequeue(bfqq_group(bfqq));
bfq_deactivate_bfqq(bfqd, bfqq, true, expiration);
@@ -2506,6 +2511,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
bfq_mark_bfqq_busy(bfqq);
bfqd->busy_queues++;
+
+ if (bfqq->wr_coeff > 1)
+ bfqd->wr_busy_queues++;
}
#ifdef CONFIG_BFQ_GROUP_IOSCHED
@@ -3779,7 +3787,16 @@ static unsigned long bfq_serv_to_charge(struct request *rq,
if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
return blk_rq_sectors(rq);
- return blk_rq_sectors(rq) * bfq_async_charge_factor;
+ /*
+ * If there are no weight-raised queues, then amplify service
+ * by just the async charge factor; otherwise amplify service
+ * by twice the async charge factor, to further reduce latency
+ * for weight-raised queues.
+ */
+ if (bfqq->bfqd->wr_busy_queues == 0)
+ return blk_rq_sectors(rq) * bfq_async_charge_factor;
+
+ return blk_rq_sectors(rq) * 2 * bfq_async_charge_factor;
}
/**
@@ -4234,6 +4251,7 @@ static void bfq_add_request(struct request *rq)
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ bfqd->wr_busy_queues++;
bfqq->entity.prio_changed = 1;
}
if (prev != bfqq->next_rq)
@@ -4474,6 +4492,8 @@ static void bfq_requests_merged(struct request_queue *q, struct request *rq,
/* Must be called with bfqq != NULL */
static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
{
+ if (bfq_bfqq_busy(bfqq))
+ bfqq->bfqd->wr_busy_queues--;
bfqq->wr_coeff = 1;
bfqq->wr_cur_max_time = 0;
bfqq->last_wr_start_finish = jiffies;
@@ -5497,7 +5517,8 @@ static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
{
struct bfq_data *bfqd = bfqq->bfqd;
- bool idling_boosts_thr, asymmetric_scenario;
+ bool idling_boosts_thr, idling_boosts_thr_without_issues,
+ asymmetric_scenario;
if (bfqd->strict_guarantees)
return true;
@@ -5520,6 +5541,44 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
/*
+ * The value of the next variable,
+ * idling_boosts_thr_without_issues, is equal to that of
+ * idling_boosts_thr, unless a special case holds. In this
+ * special case, described below, idling may cause problems to
+ * weight-raised queues.
+ *
+ * When the request pool is saturated (e.g., in the presence
+ * of write hogs), if the processes associated with
+ * non-weight-raised queues ask for requests at a lower rate,
+ * then processes associated with weight-raised queues have a
+ * higher probability to get a request from the pool
+ * immediately (or at least soon) when they need one. Thus
+ * they have a higher probability to actually get a fraction
+ * of the device throughput proportional to their high
+ * weight. This is especially true with NCQ-capable drives,
+ * which enqueue several requests in advance, and further
+ * reorder internally-queued requests.
+ *
+ * For this reason, we force to false the value of
+ * idling_boosts_thr_without_issues if there are weight-raised
+ * busy queues. In this case, and if bfqq is not weight-raised,
+ * this guarantees that the device is not idled for bfqq (if,
+ * instead, bfqq is weight-raised, then idling will be
+ * guaranteed by another variable, see below). Combined with
+ * the timestamping rules of BFQ (see [1] for details), this
+ * behavior causes bfqq, and hence any sync non-weight-raised
+ * queue, to get a lower number of requests served, and thus
+ * to ask for a lower number of requests from the request
+ * pool, before the busy weight-raised queues get served
+ * again. This often mitigates starvation problems in the
+ * presence of heavy write workloads and NCQ, thereby
+ * guaranteeing a higher application and system responsiveness
+ * in these hostile scenarios.
+ */
+ idling_boosts_thr_without_issues = idling_boosts_thr &&
+ bfqd->wr_busy_queues == 0;
+
+ /*
* There is then a case where idling must be performed not for
* throughput concerns, but to preserve service guarantees. To
* introduce it, we can note that allowing the drive to
@@ -5593,7 +5652,7 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
* is necessary to preserve service guarantees.
*/
return bfq_bfqq_sync(bfqq) &&
- (idling_boosts_thr || asymmetric_scenario);
+ (idling_boosts_thr_without_issues || asymmetric_scenario);
}
/*
@@ -6801,6 +6860,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
* high-definition compressed
* video.
*/
+ bfqd->wr_busy_queues = 0;
/*
* Begin by assuming, optimistically, that the device is a
--
2.10.0
^ permalink raw reply related
* [PATCH V4 08/16] block, bfq: preserve a low latency also with NCQ-capable drives
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
I/O schedulers typically allow NCQ-capable drives to prefetch I/O
requests, as NCQ boosts the throughput exactly by prefetching and
internally reordering requests.
Unfortunately, as discussed in detail and shown experimentally in [1],
this may cause fairness and latency guarantees to be violated. The
main problem is that the internal scheduler of an NCQ-capable drive
may postpone the service of some unlucky (prefetched) requests as long
as it deems serving other requests more appropriate to boost the
throughput.
This patch addresses this issue by not disabling device idling for
weight-raised queues, even if the device supports NCQ. This allows BFQ
to start serving a new queue, and therefore allows the drive to
prefetch new requests, only after the idling timeout expires. At that
time, all the outstanding requests of the expired queue have been most
certainly served.
[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
block/bfq-iosched.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7f94ad3..574a5f6 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6233,7 +6233,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
bfqd->bfq_slice_idle == 0 ||
- (bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+ (bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
+ bfqq->wr_coeff == 1))
enable_idle = 0;
else if (bfq_sample_valid(bfqq->ttime.ttime_samples)) {
if (bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle &&
--
2.10.0
^ permalink raw reply related
* [PATCH V4 07/16] block, bfq: reduce I/O latency for soft real-time applications
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
To guarantee a low latency also to the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in the previous patch)
also the queues associated to applications deemed as soft real-time.
To be deemed as soft real-time, an application must meet two
requirements. First, the application must not require an average
bandwidth higher than the approximate bandwidth required to playback
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.
As for the second requirement, it is critical to require also that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This prevents also greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following. First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement. Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.
To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* their requests as quickly as they can,
whereas soft real-time applications spend some time processing data
after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time, thereby giving to the application the opportunity to be
deemed as such, only when both the following two conditions happen to
hold: 1) the queue associated with the application has expired and is
empty, 2) there is no outstanding request of the application.
Suppose that both conditions hold at time, say, t_c and that the
application issues its next request at time, say, t_i. At time t_c the
heuristic computes the next time instant, called soft_rt_next_start in
the code, such that, only if t_i >= soft_rt_next_start, then both the
next conditions will hold when the application issues its next
request: 1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy application).
The current value of Delta is a little bit higher than the value that
we have found, experimentally, to be adequate on a real,
general-purpose machine. In particular we had to increase Delta to
make the filter quite precise also in slower, embedded systems, and in
KVM/QEMU virtual machines (details in the comments on the code).
If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
block/bfq-iosched.c | 342 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 323 insertions(+), 19 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1a32c83..7f94ad3 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -119,6 +119,13 @@
#define BFQ_DEFAULT_GRP_IOPRIO 0
#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
+/*
+ * Soft real-time applications are extremely more latency sensitive
+ * than interactive ones. Over-raise the weight of the former to
+ * privilege them against the latter.
+ */
+#define BFQ_SOFTRT_WEIGHT_FACTOR 100
+
struct bfq_entity;
/**
@@ -343,6 +350,14 @@ struct bfq_queue {
/* current maximum weight-raising time for this queue */
unsigned long wr_cur_max_time;
/*
+ * Minimum time instant such that, only if a new request is
+ * enqueued after this time instant in an idle @bfq_queue with
+ * no outstanding requests, then the task associated with the
+ * queue it is deemed as soft real-time (see the comments on
+ * the function bfq_bfqq_softrt_next_start())
+ */
+ unsigned long soft_rt_next_start;
+ /*
* Start time of the current weight-raising period if
* the @bfq-queue is being weight-raised, otherwise
* finish time of the last weight-raising period.
@@ -350,6 +365,20 @@ struct bfq_queue {
unsigned long last_wr_start_finish;
/* factor by which the weight of this queue is multiplied */
unsigned int wr_coeff;
+ /*
+ * Time of the last transition of the @bfq_queue from idle to
+ * backlogged.
+ */
+ unsigned long last_idle_bklogged;
+ /*
+ * Cumulative service received from the @bfq_queue since the
+ * last transition from idle to backlogged.
+ */
+ unsigned long service_from_backlogged;
+ /*
+ * Value of wr start time when switching to soft rt
+ */
+ unsigned long wr_start_at_switch_to_srt;
};
/**
@@ -512,6 +541,9 @@ struct bfq_data {
unsigned int bfq_wr_coeff;
/* maximum duration of a weight-raising period (jiffies) */
unsigned int bfq_wr_max_time;
+
+ /* Maximum weight-raising duration for soft real-time processes */
+ unsigned int bfq_wr_rt_max_time;
/*
* Minimum idle period after which weight-raising may be
* reactivated for a queue (in jiffies).
@@ -523,6 +555,9 @@ struct bfq_data {
* queue (in jiffies).
*/
unsigned long bfq_wr_min_inter_arr_async;
+
+ /* Max service-rate for a soft real-time queue, in sectors/sec */
+ unsigned int bfq_wr_max_softrt_rate;
/*
* Cached value of the product R*T, used for computing the
* maximum duration of weight raising automatically.
@@ -564,6 +599,10 @@ enum bfqq_state_flags {
* having consumed at most 2/10 of
* its budget
*/
+ BFQQF_softrt_update, /*
+ * may need softrt-next-start
+ * update
+ */
};
#define BFQ_BFQQ_FNS(name) \
@@ -587,6 +626,7 @@ BFQ_BFQQ_FNS(fifo_expire);
BFQ_BFQQ_FNS(idle_window);
BFQ_BFQQ_FNS(sync);
BFQ_BFQQ_FNS(IO_bound);
+BFQ_BFQQ_FNS(softrt_update);
#undef BFQ_BFQQ_FNS
/* Logging facilities. */
@@ -3992,13 +4032,21 @@ static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
unsigned int old_wr_coeff,
bool wr_or_deserves_wr,
- bool interactive)
+ bool interactive,
+ bool soft_rt)
{
if (old_wr_coeff == 1 && wr_or_deserves_wr) {
/* start a weight-raising period */
- bfqq->wr_coeff = bfqd->bfq_wr_coeff;
- /* update wr duration */
- bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ if (interactive) {
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ } else {
+ bfqq->wr_start_at_switch_to_srt = jiffies;
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff *
+ BFQ_SOFTRT_WEIGHT_FACTOR;
+ bfqq->wr_cur_max_time =
+ bfqd->bfq_wr_rt_max_time;
+ }
/*
* If needed, further reduce budget to make sure it is
@@ -4013,8 +4061,51 @@ static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
bfqq->entity.budget,
2 * bfq_min_budget(bfqd));
} else if (old_wr_coeff > 1) {
- /* update wr duration */
- bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ if (interactive) { /* update wr coeff and duration */
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ } else if (soft_rt) {
+ /*
+ * The application is now or still meeting the
+ * requirements for being deemed soft rt. We
+ * can then correctly and safely (re)charge
+ * the weight-raising duration for the
+ * application with the weight-raising
+ * duration for soft rt applications.
+ *
+ * In particular, doing this recharge now, i.e.,
+ * before the weight-raising period for the
+ * application finishes, reduces the probability
+ * of the following negative scenario:
+ * 1) the weight of a soft rt application is
+ * raised at startup (as for any newly
+ * created application),
+ * 2) since the application is not interactive,
+ * at a certain time weight-raising is
+ * stopped for the application,
+ * 3) at that time the application happens to
+ * still have pending requests, and hence
+ * is destined to not have a chance to be
+ * deemed soft rt before these requests are
+ * completed (see the comments to the
+ * function bfq_bfqq_softrt_next_start()
+ * for details on soft rt detection),
+ * 4) these pending requests experience a high
+ * latency because the application is not
+ * weight-raised while they are pending.
+ */
+ if (bfqq->wr_cur_max_time !=
+ bfqd->bfq_wr_rt_max_time) {
+ bfqq->wr_start_at_switch_to_srt =
+ bfqq->last_wr_start_finish;
+
+ bfqq->wr_cur_max_time =
+ bfqd->bfq_wr_rt_max_time;
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff *
+ BFQ_SOFTRT_WEIGHT_FACTOR;
+ }
+ bfqq->last_wr_start_finish = jiffies;
+ }
}
}
@@ -4033,7 +4124,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
struct request *rq,
bool *interactive)
{
- bool wr_or_deserves_wr, bfqq_wants_to_preempt,
+ bool soft_rt, wr_or_deserves_wr, bfqq_wants_to_preempt,
idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq),
/*
* See the comments on
@@ -4049,12 +4140,14 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
/*
* bfqq deserves to be weight-raised if:
* - it is sync,
- * - it has been idle for enough time.
+ * - it has been idle for enough time or is soft real-time.
*/
+ soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+ time_is_before_jiffies(bfqq->soft_rt_next_start);
*interactive = idle_for_long_time;
wr_or_deserves_wr = bfqd->low_latency &&
(bfqq->wr_coeff > 1 ||
- (bfq_bfqq_sync(bfqq) && *interactive));
+ (bfq_bfqq_sync(bfqq) && (*interactive || soft_rt)));
/*
* Using the last flag, update budget and check whether bfqq
@@ -4079,12 +4172,17 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
old_wr_coeff,
wr_or_deserves_wr,
- *interactive);
+ *interactive,
+ soft_rt);
if (old_wr_coeff != bfqq->wr_coeff)
bfqq->entity.prio_changed = 1;
}
+ bfqq->last_idle_bklogged = jiffies;
+ bfqq->service_from_backlogged = 0;
+ bfq_clear_bfqq_softrt_update(bfqq);
+
bfq_add_bfqq_busy(bfqd, bfqq);
/*
@@ -4098,7 +4196,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
* function bfq_bfqq_update_budg_for_activation).
*/
if (bfqd->in_service_queue && bfqq_wants_to_preempt &&
- bfqd->in_service_queue->wr_coeff == 1 &&
+ bfqd->in_service_queue->wr_coeff < bfqq->wr_coeff &&
next_queue_may_preempt(bfqd))
bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
false, BFQQE_PREEMPTED);
@@ -4161,6 +4259,12 @@ static void bfq_add_request(struct request *rq)
* period must start or restart (this case is considered
* separately because it is not detected by the above
* conditions, if bfqq is already weight-raised)
+ *
+ * last_wr_start_finish has to be updated also if bfqq is soft
+ * real-time, because the weight-raising period is constantly
+ * restarted on idle-to-busy transitions for these queues, but
+ * this is already done in bfq_bfqq_handle_idle_busy_switch if
+ * needed.
*/
if (bfqd->low_latency &&
(old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))
@@ -4372,6 +4476,7 @@ static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
{
bfqq->wr_coeff = 1;
bfqq->wr_cur_max_time = 0;
+ bfqq->last_wr_start_finish = jiffies;
/*
* Trigger a weight change on the next invocation of
* __bfq_entity_update_weight_prio.
@@ -4439,11 +4544,17 @@ static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
static void bfq_set_budget_timeout(struct bfq_data *bfqd,
struct bfq_queue *bfqq)
{
+ unsigned int timeout_coeff;
+
+ if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
+ timeout_coeff = 1;
+ else
+ timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
+
bfqd->last_budget_start = ktime_get();
bfqq->budget_timeout = jiffies +
- bfqd->bfq_timeout *
- (bfqq->entity.weight / bfqq->entity.orig_weight);
+ bfqd->bfq_timeout * timeout_coeff;
}
static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
@@ -4455,6 +4566,42 @@ static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8;
+ if (time_is_before_jiffies(bfqq->last_wr_start_finish) &&
+ bfqq->wr_coeff > 1 &&
+ bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+ time_is_before_jiffies(bfqq->budget_timeout)) {
+ /*
+ * For soft real-time queues, move the start
+ * of the weight-raising period forward by the
+ * time the queue has not received any
+ * service. Otherwise, a relatively long
+ * service delay is likely to cause the
+ * weight-raising period of the queue to end,
+ * because of the short duration of the
+ * weight-raising period of a soft real-time
+ * queue. It is worth noting that this move
+ * is not so dangerous for the other queues,
+ * because soft real-time queues are not
+ * greedy.
+ *
+ * To not add a further variable, we use the
+ * overloaded field budget_timeout to
+ * determine for how long the queue has not
+ * received service, i.e., how much time has
+ * elapsed since the queue expired. However,
+ * this is a little imprecise, because
+ * budget_timeout is set to jiffies if bfqq
+ * not only expires, but also remains with no
+ * request.
+ */
+ if (time_after(bfqq->budget_timeout,
+ bfqq->last_wr_start_finish))
+ bfqq->last_wr_start_finish +=
+ jiffies - bfqq->budget_timeout;
+ else
+ bfqq->last_wr_start_finish = jiffies;
+ }
+
bfq_set_budget_timeout(bfqd, bfqq);
bfq_log_bfqq(bfqd, bfqq,
"set_in_service_queue, cur-budget = %d",
@@ -5073,6 +5220,76 @@ static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
}
/*
+ * To be deemed as soft real-time, an application must meet two
+ * requirements. First, the application must not require an average
+ * bandwidth higher than the approximate bandwidth required to playback or
+ * record a compressed high-definition video.
+ * The next function is invoked on the completion of the last request of a
+ * batch, to compute the next-start time instant, soft_rt_next_start, such
+ * that, if the next request of the application does not arrive before
+ * soft_rt_next_start, then the above requirement on the bandwidth is met.
+ *
+ * The second requirement is that the request pattern of the application is
+ * isochronous, i.e., that, after issuing a request or a batch of requests,
+ * the application stops issuing new requests until all its pending requests
+ * have been completed. After that, the application may issue a new batch,
+ * and so on.
+ * For this reason the next function is invoked to compute
+ * soft_rt_next_start only for applications that meet this requirement,
+ * whereas soft_rt_next_start is set to infinity for applications that do
+ * not.
+ *
+ * Unfortunately, even a greedy application may happen to behave in an
+ * isochronous way if the CPU load is high. In fact, the application may
+ * stop issuing requests while the CPUs are busy serving other processes,
+ * then restart, then stop again for a while, and so on. In addition, if
+ * the disk achieves a low enough throughput with the request pattern
+ * issued by the application (e.g., because the request pattern is random
+ * and/or the device is slow), then the application may meet the above
+ * bandwidth requirement too. To prevent such a greedy application to be
+ * deemed as soft real-time, a further rule is used in the computation of
+ * soft_rt_next_start: soft_rt_next_start must be higher than the current
+ * time plus the maximum time for which the arrival of a request is waited
+ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
+ * This filters out greedy applications, as the latter issue instead their
+ * next request as soon as possible after the last one has been completed
+ * (in contrast, when a batch of requests is completed, a soft real-time
+ * application spends some time processing data).
+ *
+ * Unfortunately, the last filter may easily generate false positives if
+ * only bfqd->bfq_slice_idle is used as a reference time interval and one
+ * or both the following cases occur:
+ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
+ * than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
+ * HZ=100.
+ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
+ * for a while, then suddenly 'jump' by several units to recover the lost
+ * increments. This seems to happen, e.g., inside virtual machines.
+ * To address this issue, we do not use as a reference time interval just
+ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
+ * particular we add the minimum number of jiffies for which the filter
+ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
+ * machines.
+ */
+static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ return max(bfqq->last_idle_bklogged +
+ HZ * bfqq->service_from_backlogged /
+ bfqd->bfq_wr_max_softrt_rate,
+ jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
+}
+
+/*
+ * Return the farthest future time instant according to jiffies
+ * macros.
+ */
+static unsigned long bfq_greatest_from_now(void)
+{
+ return jiffies + MAX_JIFFY_OFFSET;
+}
+
+/*
* Return the farthest past time instant according to jiffies
* macros.
*/
@@ -5123,6 +5340,17 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, reason, &delta);
/*
+ * Increase service_from_backlogged before next statement,
+ * because the possible next invocation of
+ * bfq_bfqq_charge_time would likely inflate
+ * entity->service. In contrast, service_from_backlogged must
+ * contain real service, to enable the soft real-time
+ * heuristic to correctly compute the bandwidth consumed by
+ * bfqq.
+ */
+ bfqq->service_from_backlogged += entity->service;
+
+ /*
* As above explained, charge slow (typically seeky) and
* timed-out queues with the time and not the service
* received, to favor sequential workloads.
@@ -5150,6 +5378,48 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
if (bfqd->low_latency && bfqq->wr_coeff == 1)
bfqq->last_wr_start_finish = jiffies;
+ if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
+ RB_EMPTY_ROOT(&bfqq->sort_list)) {
+ /*
+ * If we get here, and there are no outstanding
+ * requests, then the request pattern is isochronous
+ * (see the comments on the function
+ * bfq_bfqq_softrt_next_start()). Thus we can compute
+ * soft_rt_next_start. If, instead, the queue still
+ * has outstanding requests, then we have to wait for
+ * the completion of all the outstanding requests to
+ * discover whether the request pattern is actually
+ * isochronous.
+ */
+ if (bfqq->dispatched == 0)
+ bfqq->soft_rt_next_start =
+ bfq_bfqq_softrt_next_start(bfqd, bfqq);
+ else {
+ /*
+ * The application is still waiting for the
+ * completion of one or more requests:
+ * prevent it from possibly being incorrectly
+ * deemed as soft real-time by setting its
+ * soft_rt_next_start to infinity. In fact,
+ * without this assignment, the application
+ * would be incorrectly deemed as soft
+ * real-time if:
+ * 1) it issued a new request before the
+ * completion of all its in-flight
+ * requests, and
+ * 2) at that time, its soft_rt_next_start
+ * happened to be in the past.
+ */
+ bfqq->soft_rt_next_start =
+ bfq_greatest_from_now();
+ /*
+ * Schedule an update of soft_rt_next_start to when
+ * the task may be discovered to be isochronous.
+ */
+ bfq_mark_bfqq_softrt_update(bfqq);
+ }
+ }
+
bfq_log_bfqq(bfqd, bfqq,
"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -5468,12 +5738,18 @@ static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
*/
if (time_is_before_jiffies(bfqq->last_wr_start_finish +
bfqq->wr_cur_max_time)) {
- bfqq->last_wr_start_finish = jiffies;
- bfq_log_bfqq(bfqd, bfqq,
- "wrais ending at %lu, rais_max_time %u",
- bfqq->last_wr_start_finish,
- jiffies_to_msecs(bfqq->wr_cur_max_time));
- bfq_bfqq_end_wr(bfqq);
+ if (bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time ||
+ time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt +
+ bfq_wr_duration(bfqd)))
+ bfq_bfqq_end_wr(bfqq);
+ else {
+ /* switch back to interactive wr */
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ bfqq->last_wr_start_finish =
+ bfqq->wr_start_at_switch_to_srt;
+ bfqq->entity.prio_changed = 1;
+ }
}
}
/* Update weight both if it must be raised and if it must be lowered */
@@ -5818,6 +6094,13 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqq->wr_coeff = 1;
bfqq->last_wr_start_finish = bfq_smallest_from_now();
+ bfqq->wr_start_at_switch_to_srt = bfq_smallest_from_now();
+
+ /*
+ * Set to the value for which bfqq will not be deemed as
+ * soft rt when it becomes backlogged.
+ */
+ bfqq->soft_rt_next_start = bfq_greatest_from_now();
/* first request is almost certainly seeky */
bfqq->seek_history = 1;
@@ -6175,6 +6458,20 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
bfqd->last_completion = now_ns;
/*
+ * If we are waiting to discover whether the request pattern
+ * of the task associated with the queue is actually
+ * isochronous, and both requisites for this condition to hold
+ * are now satisfied, then compute soft_rt_next_start (see the
+ * comments on the function bfq_bfqq_softrt_next_start()). We
+ * schedule this delayed check when bfqq expires, if it still
+ * has in-flight requests.
+ */
+ if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
+ RB_EMPTY_ROOT(&bfqq->sort_list))
+ bfqq->soft_rt_next_start =
+ bfq_bfqq_softrt_next_start(bfqd, bfqq);
+
+ /*
* If this is the in-service queue, check if it needs to be expired,
* or if we want to idle in case it has no pending requests.
*/
@@ -6493,9 +6790,16 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
* Trade-off between responsiveness and fairness.
*/
bfqd->bfq_wr_coeff = 30;
+ bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
bfqd->bfq_wr_max_time = 0;
bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+ bfqd->bfq_wr_max_softrt_rate = 7000; /*
+ * Approximate rate required
+ * to playback or record a
+ * high-definition compressed
+ * video.
+ */
/*
* Begin by assuming, optimistically, that the device is a
--
2.10.0
^ permalink raw reply related
* [PATCH V4 06/16] block, bfq: improve responsiveness
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this purpose, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment how
BFQ decides whether the associated application is interactive),
receive the following two special treatments:
1) The weight of the queue is raised.
2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).
For brevity, we call just weight-raising the combination of these
two preferential treatments. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large-size application on that device, with cold caches and with no
additional workload.
Finally, as for guaranteeing a fast execution to interactive,
I/O-related tasks (such as opening a file), consider that any
interactive application blocks and waits for user input both after
starting up and after executing some task. After a while, the user may
trigger new operations, after which the application stops again, and
so on. Accordingly, the low-latency heuristic weight-raises again a
queue in case it becomes backlogged after being idle for a
sufficiently long (configurable) time. The weight-raising then lasts
for the same time as for a just-created queue.
According to our experiments, the combination of this low-latency
heuristic and of the improvements described in the previous patch
allows BFQ to guarantee a high application responsiveness.
[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
Documentation/block/bfq-iosched.txt | 9 +
block/bfq-iosched.c | 740 ++++++++++++++++++++++++++++++++----
2 files changed, 675 insertions(+), 74 deletions(-)
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 461b27f..1b87df6 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -375,6 +375,11 @@ default, low latency mode is enabled. If enabled, interactive and soft
real-time applications are privileged and experience a lower latency,
as explained in more detail in the description of how BFQ works.
+DO NOT enable this mode if you need full control on bandwidth
+distribution. In fact, if it is enabled, then BFQ automatically
+increases the bandwidth share of privileged applications, as the main
+means to guarantee a lower latency to them.
+
timeout_sync
------------
@@ -507,6 +512,10 @@ linear mapping between ioprio and weights, described at the beginning
of the tunable section, is still valid, but all weights higher than
IOPRIO_BE_NR*10 are mapped to ioprio 0.
+Recall that, if low-latency is set, then BFQ automatically raises the
+weight of the queues associated with interactive and soft real-time
+applications. Unset this tunable if you need/want to control weights.
+
[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index dce273b..1a32c83 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -339,6 +339,17 @@ struct bfq_queue {
/* pid of the process owning the queue, used for logging purposes */
pid_t pid;
+
+ /* current maximum weight-raising time for this queue */
+ unsigned long wr_cur_max_time;
+ /*
+ * Start time of the current weight-raising period if
+ * the @bfq-queue is being weight-raised, otherwise
+ * finish time of the last weight-raising period.
+ */
+ unsigned long last_wr_start_finish;
+ /* factor by which the weight of this queue is multiplied */
+ unsigned int wr_coeff;
};
/**
@@ -356,6 +367,11 @@ struct bfq_io_cq {
#endif
};
+enum bfq_device_speed {
+ BFQ_BFQD_FAST,
+ BFQ_BFQD_SLOW,
+};
+
/**
* struct bfq_data - per-device data structure.
*
@@ -487,6 +503,34 @@ struct bfq_data {
*/
bool strict_guarantees;
+ /* if set to true, low-latency heuristics are enabled */
+ bool low_latency;
+ /*
+ * Maximum factor by which the weight of a weight-raised queue
+ * is multiplied.
+ */
+ unsigned int bfq_wr_coeff;
+ /* maximum duration of a weight-raising period (jiffies) */
+ unsigned int bfq_wr_max_time;
+ /*
+ * Minimum idle period after which weight-raising may be
+ * reactivated for a queue (in jiffies).
+ */
+ unsigned int bfq_wr_min_idle_time;
+ /*
+ * Minimum period between request arrivals after which
+ * weight-raising may be reactivated for an already busy async
+ * queue (in jiffies).
+ */
+ unsigned long bfq_wr_min_inter_arr_async;
+ /*
+ * Cached value of the product R*T, used for computing the
+ * maximum duration of weight raising automatically.
+ */
+ u64 RT_prod;
+ /* device-speed class for the low-latency heuristic */
+ enum bfq_device_speed device_speed;
+
/* fallback dummy bfqq for extreme OOM conditions */
struct bfq_queue oom_bfqq;
@@ -515,7 +559,6 @@ enum bfqq_state_flags {
BFQQF_fifo_expire, /* FIFO checked in this slice */
BFQQF_idle_window, /* slice idling enabled */
BFQQF_sync, /* synchronous queue */
- BFQQF_budget_new, /* no completion with this budget */
BFQQF_IO_bound, /*
* bfqq has timed-out at least once
* having consumed at most 2/10 of
@@ -543,7 +586,6 @@ BFQ_BFQQ_FNS(non_blocking_wait_rq);
BFQ_BFQQ_FNS(fifo_expire);
BFQ_BFQQ_FNS(idle_window);
BFQ_BFQQ_FNS(sync);
-BFQ_BFQQ_FNS(budget_new);
BFQ_BFQQ_FNS(IO_bound);
#undef BFQ_BFQQ_FNS
@@ -637,7 +679,7 @@ struct bfq_group_data {
/* must be the first member */
struct blkcg_policy_data pd;
- unsigned short weight;
+ unsigned int weight;
};
/**
@@ -732,6 +774,8 @@ static void bfq_put_queue(struct bfq_queue *bfqq);
static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
struct bio *bio, bool is_sync,
struct bfq_io_cq *bic);
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+ struct bfq_group *bfqg);
static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
@@ -787,6 +831,56 @@ static struct kmem_cache *bfq_pool;
/* Shift used for peak rate fixed precision calculations. */
#define BFQ_RATE_SHIFT 16
+/*
+ * By default, BFQ computes the duration of the weight raising for
+ * interactive applications automatically, using the following formula:
+ * duration = (R / r) * T, where r is the peak rate of the device, and
+ * R and T are two reference parameters.
+ * In particular, R is the peak rate of the reference device (see below),
+ * and T is a reference time: given the systems that are likely to be
+ * installed on the reference device according to its speed class, T is
+ * about the maximum time needed, under BFQ and while reading two files in
+ * parallel, to load typical large applications on these systems.
+ * In practice, the slower/faster the device at hand is, the more/less it
+ * takes to load applications with respect to the reference device.
+ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
+ * applications.
+ *
+ * BFQ uses four different reference pairs (R, T), depending on:
+ * . whether the device is rotational or non-rotational;
+ * . whether the device is slow, such as old or portable HDDs, as well as
+ * SD cards, or fast, such as newer HDDs and SSDs.
+ *
+ * The device's speed class is dynamically (re)detected in
+ * bfq_update_peak_rate() every time the estimated peak rate is updated.
+ *
+ * In the following definitions, R_slow[0]/R_fast[0] and
+ * T_slow[0]/T_fast[0] are the reference values for a slow/fast
+ * rotational device, whereas R_slow[1]/R_fast[1] and
+ * T_slow[1]/T_fast[1] are the reference values for a slow/fast
+ * non-rotational device. Finally, device_speed_thresh are the
+ * thresholds used to switch between speed classes. The reference
+ * rates are not the actual peak rates of the devices used as a
+ * reference, but slightly lower values. The reason for using these
+ * slightly lower values is that the peak-rate estimator tends to
+ * yield slightly lower values than the actual peak rate (it can yield
+ * the actual peak rate only if there is only one process doing I/O,
+ * and the process does sequential I/O).
+ *
+ * Both the reference peak rates and the thresholds are measured in
+ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
+ */
+static int R_slow[2] = {1000, 10700};
+static int R_fast[2] = {14000, 33000};
+/*
+ * To improve readability, a conversion function is used to initialize the
+ * following arrays, which entails that they can be initialized only in a
+ * function.
+ */
+static int T_slow[2];
+static int T_fast[2];
+static int device_speed_thresh[2];
+
#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \
{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
@@ -1486,7 +1580,7 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
if (entity->prio_changed) {
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
- unsigned short prev_weight, new_weight;
+ unsigned int prev_weight, new_weight;
struct bfq_data *bfqd = NULL;
#ifdef CONFIG_BFQ_GROUP_IOSCHED
struct bfq_sched_data *sd;
@@ -1535,7 +1629,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
new_st = bfq_entity_service_tree(entity);
prev_weight = entity->weight;
- new_weight = entity->orig_weight;
+ new_weight = entity->orig_weight *
+ (bfqq ? bfqq->wr_coeff : 1);
entity->weight = new_weight;
new_st->wsum += entity->weight;
@@ -1630,6 +1725,8 @@ static void bfq_update_fin_time_enqueue(struct bfq_entity *entity,
struct bfq_service_tree *st,
bool backshifted)
{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
st = __bfq_entity_update_weight_prio(st, entity);
bfq_calc_finish(entity, entity->budget);
@@ -1659,10 +1756,19 @@ static void bfq_update_fin_time_enqueue(struct bfq_entity *entity,
* time. This may introduce a little unfairness among queues
* with backshifted timestamps, but it does not break
* worst-case fairness guarantees.
+ *
+ * As a special case, if bfqq is weight-raised, push up
+ * timestamps much less, to keep very low the probability that
+ * this push up causes the backshifted finish timestamps of
+ * weight-raised queues to become higher than the backshifted
+ * finish timestamps of non weight-raised queues.
*/
if (backshifted && bfq_gt(st->vtime, entity->finish)) {
unsigned long delta = st->vtime - entity->finish;
+ if (bfqq)
+ delta /= bfqq->wr_coeff;
+
entity->start += delta;
entity->finish += delta;
}
@@ -3070,6 +3176,18 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
bfqg_stats_xfer_dead(bfqg);
}
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+ struct blkcg_gq *blkg;
+
+ list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) {
+ struct bfq_group *bfqg = blkg_to_bfqg(blkg);
+
+ bfq_end_wr_async_queues(bfqd, bfqg);
+ }
+ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
static int bfq_io_show_weight(struct seq_file *sf, void *v)
{
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
@@ -3433,6 +3551,11 @@ static void bfq_init_entity(struct bfq_entity *entity,
static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) {}
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
struct blkcg *blkcg)
{
@@ -3613,7 +3736,7 @@ static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
static unsigned long bfq_serv_to_charge(struct request *rq,
struct bfq_queue *bfqq)
{
- if (bfq_bfqq_sync(bfqq))
+ if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
return blk_rq_sectors(rq);
return blk_rq_sectors(rq) * bfq_async_charge_factor;
@@ -3700,12 +3823,12 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
* whether the in-service queue should be expired, by returning
* true. The purpose of expiring the in-service queue is to give bfqq
* the chance to possibly preempt the in-service queue, and the reason
- * for preempting the in-service queue is to achieve the following
- * goal: guarantee to bfqq its reserved bandwidth even if bfqq has
- * expired because it has remained idle.
+ * for preempting the in-service queue is to achieve one of the two
+ * goals below.
*
- * In particular, bfqq may have expired for one of the following two
- * reasons:
+ * 1. Guarantee to bfqq its reserved bandwidth even if bfqq has
+ * expired because it has remained idle. In particular, bfqq may have
+ * expired for one of the following two reasons:
*
* - BFQQE_NO_MORE_REQUESTS bfqq did not enjoy any device idling
* and did not make it to issue a new request before its last
@@ -3769,10 +3892,36 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
* above-described special way, and signals that the in-service queue
* should be expired. Timestamp back-shifting is done later in
* __bfq_activate_entity.
+ *
+ * 2. Reduce latency. Even if timestamps are not backshifted to let
+ * the process associated with bfqq recover a service hole, bfqq may
+ * however happen to have, after being (re)activated, a lower finish
+ * timestamp than the in-service queue. That is, the next budget of
+ * bfqq may have to be completed before the one of the in-service
+ * queue. If this is the case, then preempting the in-service queue
+ * allows this goal to be achieved, apart from the unpreemptible,
+ * outstanding requests mentioned above.
+ *
+ * Unfortunately, regardless of which of the above two goals one wants
+ * to achieve, service trees need first to be updated to know whether
+ * the in-service queue must be preempted. To have service trees
+ * correctly updated, the in-service queue must be expired and
+ * rescheduled, and bfqq must be scheduled too. This is one of the
+ * most costly operations (in future versions, the scheduling
+ * mechanism may be re-designed in such a way to make it possible to
+ * know whether preemption is needed without needing to update service
+ * trees). In addition, queue preemptions almost always cause random
+ * I/O, and thus loss of throughput. Because of these facts, the next
+ * function adopts the following simple scheme to avoid both costly
+ * operations and too frequent preemptions: it requests the expiration
+ * of the in-service queue (unconditionally) only for queues that need
+ * to recover a hole, or that either are weight-raised or deserve to
+ * be weight-raised.
*/
static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
- bool arrived_in_time)
+ bool arrived_in_time,
+ bool wr_or_deserves_wr)
{
struct bfq_entity *entity = &bfqq->entity;
@@ -3807,14 +3956,85 @@ static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
entity->budget = max_t(unsigned long, bfqq->max_budget,
bfq_serv_to_charge(bfqq->next_rq, bfqq));
bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
- return false;
+ return wr_or_deserves_wr;
+}
+
+static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+{
+ u64 dur;
+
+ if (bfqd->bfq_wr_max_time > 0)
+ return bfqd->bfq_wr_max_time;
+
+ dur = bfqd->RT_prod;
+ do_div(dur, bfqd->peak_rate);
+
+ /*
+ * Limit duration between 3 and 13 seconds. Tests show that
+ * higher values than 13 seconds often yield the opposite of
+ * the desired result, i.e., worsen responsiveness by letting
+ * non-interactive and non-soft-real-time applications
+ * preserve weight raising for a too long time interval.
+ *
+ * On the other end, lower values than 3 seconds make it
+ * difficult for most interactive tasks to complete their jobs
+ * before weight-raising finishes.
+ */
+ if (dur > msecs_to_jiffies(13000))
+ dur = msecs_to_jiffies(13000);
+ else if (dur < msecs_to_jiffies(3000))
+ dur = msecs_to_jiffies(3000);
+
+ return dur;
+}
+
+static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ unsigned int old_wr_coeff,
+ bool wr_or_deserves_wr,
+ bool interactive)
+{
+ if (old_wr_coeff == 1 && wr_or_deserves_wr) {
+ /* start a weight-raising period */
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ /* update wr duration */
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+
+ /*
+ * If needed, further reduce budget to make sure it is
+ * close to bfqq's backlog, so as to reduce the
+ * scheduling-error component due to a too large
+ * budget. Do not care about throughput consequences,
+ * but only about latency. Finally, do not assign a
+ * too small budget either, to avoid increasing
+ * latency by causing too frequent expirations.
+ */
+ bfqq->entity.budget = min_t(unsigned long,
+ bfqq->entity.budget,
+ 2 * bfq_min_budget(bfqd));
+ } else if (old_wr_coeff > 1) {
+ /* update wr duration */
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ }
+}
+
+static bool bfq_bfqq_idle_for_long_time(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ return bfqq->dispatched == 0 &&
+ time_is_before_jiffies(
+ bfqq->budget_timeout +
+ bfqd->bfq_wr_min_idle_time);
}
static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
- struct request *rq)
+ int old_wr_coeff,
+ struct request *rq,
+ bool *interactive)
{
- bool bfqq_wants_to_preempt,
+ bool wr_or_deserves_wr, bfqq_wants_to_preempt,
+ idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq),
/*
* See the comments on
* bfq_bfqq_update_budg_for_activation for
@@ -3827,12 +4047,23 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
bfqg_stats_update_io_add(bfqq_group(RQ_BFQQ(rq)), bfqq, rq->cmd_flags);
/*
- * Update budget and check whether bfqq may want to preempt
- * the in-service queue.
+ * bfqq deserves to be weight-raised if:
+ * - it is sync,
+ * - it has been idle for enough time.
+ */
+ *interactive = idle_for_long_time;
+ wr_or_deserves_wr = bfqd->low_latency &&
+ (bfqq->wr_coeff > 1 ||
+ (bfq_bfqq_sync(bfqq) && *interactive));
+
+ /*
+ * Using the last flag, update budget and check whether bfqq
+ * may want to preempt the in-service queue.
*/
bfqq_wants_to_preempt =
bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
- arrived_in_time);
+ arrived_in_time,
+ wr_or_deserves_wr);
if (!bfq_bfqq_IO_bound(bfqq)) {
if (arrived_in_time) {
@@ -3844,6 +4075,16 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
bfqq->requests_within_timer = 0;
}
+ if (bfqd->low_latency) {
+ bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
+ old_wr_coeff,
+ wr_or_deserves_wr,
+ *interactive);
+
+ if (old_wr_coeff != bfqq->wr_coeff)
+ bfqq->entity.prio_changed = 1;
+ }
+
bfq_add_bfqq_busy(bfqd, bfqq);
/*
@@ -3857,6 +4098,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
* function bfq_bfqq_update_budg_for_activation).
*/
if (bfqd->in_service_queue && bfqq_wants_to_preempt &&
+ bfqd->in_service_queue->wr_coeff == 1 &&
next_queue_may_preempt(bfqd))
bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
false, BFQQE_PREEMPTED);
@@ -3867,6 +4109,8 @@ static void bfq_add_request(struct request *rq)
struct bfq_queue *bfqq = RQ_BFQQ(rq);
struct bfq_data *bfqd = bfqq->bfqd;
struct request *next_rq, *prev;
+ unsigned int old_wr_coeff = bfqq->wr_coeff;
+ bool interactive = false;
bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
bfqq->queued[rq_is_sync(rq)]++;
@@ -3882,9 +4126,45 @@ static void bfq_add_request(struct request *rq)
bfqq->next_rq = next_rq;
if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
- bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, rq);
- else if (prev != bfqq->next_rq)
- bfq_updated_next_req(bfqd, bfqq);
+ bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, old_wr_coeff,
+ rq, &interactive);
+ else {
+ if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
+ time_is_before_jiffies(
+ bfqq->last_wr_start_finish +
+ bfqd->bfq_wr_min_inter_arr_async)) {
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+
+ bfqq->entity.prio_changed = 1;
+ }
+ if (prev != bfqq->next_rq)
+ bfq_updated_next_req(bfqd, bfqq);
+ }
+
+ /*
+ * Assign jiffies to last_wr_start_finish in the following
+ * cases:
+ *
+ * . if bfqq is not going to be weight-raised, because, for
+ * non weight-raised queues, last_wr_start_finish stores the
+ * arrival time of the last request; as of now, this piece
+ * of information is used only for deciding whether to
+ * weight-raise async queues
+ *
+ * . if bfqq is not weight-raised, because, if bfqq is now
+ * switching to weight-raised, then last_wr_start_finish
+ * stores the time when weight-raising starts
+ *
+ * . if bfqq is interactive, because, regardless of whether
+ * bfqq is currently weight-raised, the weight-raising
+ * period must start or restart (this case is considered
+ * separately because it is not detected by the above
+ * conditions, if bfqq is already weight-raised)
+ */
+ if (bfqd->low_latency &&
+ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))
+ bfqq->last_wr_start_finish = jiffies;
}
static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
@@ -4087,6 +4367,46 @@ static void bfq_requests_merged(struct request_queue *q, struct request *rq,
bfqg_stats_update_io_merged(bfqq_group(bfqq), next->cmd_flags);
}
+/* Must be called with bfqq != NULL */
+static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
+{
+ bfqq->wr_coeff = 1;
+ bfqq->wr_cur_max_time = 0;
+ /*
+ * Trigger a weight change on the next invocation of
+ * __bfq_entity_update_weight_prio.
+ */
+ bfqq->entity.prio_changed = 1;
+}
+
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+ struct bfq_group *bfqg)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ if (bfqg->async_bfqq[i][j])
+ bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
+ if (bfqg->async_idle_bfqq)
+ bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
+}
+
+static void bfq_end_wr(struct bfq_data *bfqd)
+{
+ struct bfq_queue *bfqq;
+
+ spin_lock_irq(&bfqd->lock);
+
+ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
+ bfq_bfqq_end_wr(bfqq);
+ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
+ bfq_bfqq_end_wr(bfqq);
+ bfq_end_wr_async(bfqd);
+
+ spin_unlock_irq(&bfqd->lock);
+}
+
static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
struct bio *bio)
{
@@ -4110,16 +4430,32 @@ static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
return bfqq == RQ_BFQQ(rq);
}
+/*
+ * Set the maximum time for the in-service queue to consume its
+ * budget. This prevents seeky processes from lowering the throughput.
+ * In practice, a time-slice service scheme is used with seeky
+ * processes.
+ */
+static void bfq_set_budget_timeout(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ bfqd->last_budget_start = ktime_get();
+
+ bfqq->budget_timeout = jiffies +
+ bfqd->bfq_timeout *
+ (bfqq->entity.weight / bfqq->entity.orig_weight);
+}
+
static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
struct bfq_queue *bfqq)
{
if (bfqq) {
bfqg_stats_update_avg_queue_size(bfqq_group(bfqq));
- bfq_mark_bfqq_budget_new(bfqq);
bfq_clear_bfqq_fifo_expire(bfqq);
bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8;
+ bfq_set_budget_timeout(bfqd, bfqq);
bfq_log_bfqq(bfqd, bfqq,
"set_in_service_queue, cur-budget = %d",
bfqq->entity.budget);
@@ -4159,9 +4495,13 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
*/
sl = bfqd->bfq_slice_idle;
/*
- * Grant only minimum idle time if the queue is seeky.
+ * Unless the queue is being weight-raised, grant only minimum
+ * idle time if the queue is seeky. A long idling is preserved
+ * for a weight-raised queue, because it is needed for
+ * guaranteeing to the queue its reserved share of the
+ * throughput.
*/
- if (BFQQ_SEEKY(bfqq))
+ if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1)
sl = min_t(u64, sl, BFQ_MIN_TT);
bfqd->last_idling_start = ktime_get();
@@ -4171,27 +4511,6 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
}
/*
- * Set the maximum time for the in-service queue to consume its
- * budget. This prevents seeky processes from lowering the disk
- * throughput (always guaranteed with a time slice scheme as in CFQ).
- */
-static void bfq_set_budget_timeout(struct bfq_data *bfqd)
-{
- struct bfq_queue *bfqq = bfqd->in_service_queue;
- unsigned int timeout_coeff = bfqq->entity.weight /
- bfqq->entity.orig_weight;
-
- bfqd->last_budget_start = ktime_get();
-
- bfq_clear_bfqq_budget_new(bfqq);
- bfqq->budget_timeout = jiffies +
- bfqd->bfq_timeout * timeout_coeff;
-
- bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
- jiffies_to_msecs(bfqd->bfq_timeout * timeout_coeff));
-}
-
-/*
* In autotuning mode, max_budget is dynamically recomputed as the
* amount of sectors transferred in timeout at the estimated peak
* rate. This enables BFQ to utilize a full timeslice with a full
@@ -4204,6 +4523,42 @@ static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd)
jiffies_to_msecs(bfqd->bfq_timeout)>>BFQ_RATE_SHIFT;
}
+/*
+ * Update parameters related to throughput and responsiveness, as a
+ * function of the estimated peak rate. See comments on
+ * bfq_calc_max_budget(), and on T_slow and T_fast arrays.
+ */
+static void update_thr_responsiveness_params(struct bfq_data *bfqd)
+{
+ int dev_type = blk_queue_nonrot(bfqd->queue);
+
+ if (bfqd->bfq_user_max_budget == 0)
+ bfqd->bfq_max_budget =
+ bfq_calc_max_budget(bfqd);
+
+ if (bfqd->device_speed == BFQ_BFQD_FAST &&
+ bfqd->peak_rate < device_speed_thresh[dev_type]) {
+ bfqd->device_speed = BFQ_BFQD_SLOW;
+ bfqd->RT_prod = R_slow[dev_type] *
+ T_slow[dev_type];
+ } else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
+ bfqd->peak_rate > device_speed_thresh[dev_type]) {
+ bfqd->device_speed = BFQ_BFQD_FAST;
+ bfqd->RT_prod = R_fast[dev_type] *
+ T_fast[dev_type];
+ }
+
+ bfq_log(bfqd,
+"dev_type %s dev_speed_class = %s (%llu sects/sec), thresh %llu setcs/sec",
+ dev_type == 0 ? "ROT" : "NONROT",
+ bfqd->device_speed == BFQ_BFQD_FAST ? "FAST" : "SLOW",
+ bfqd->device_speed == BFQ_BFQD_FAST ?
+ (USEC_PER_SEC*(u64)R_fast[dev_type])>>BFQ_RATE_SHIFT :
+ (USEC_PER_SEC*(u64)R_slow[dev_type])>>BFQ_RATE_SHIFT,
+ (USEC_PER_SEC*(u64)device_speed_thresh[dev_type])>>
+ BFQ_RATE_SHIFT);
+}
+
static void bfq_reset_rate_computation(struct bfq_data *bfqd,
struct request *rq)
{
@@ -4315,9 +4670,7 @@ static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
rate /= divisor; /* smoothing constant alpha = 1/divisor */
bfqd->peak_rate += rate;
- if (bfqd->bfq_user_max_budget == 0)
- bfqd->bfq_max_budget =
- bfq_calc_max_budget(bfqd);
+ update_thr_responsiveness_params(bfqd);
reset_computation:
bfq_reset_rate_computation(bfqd, rq);
@@ -4439,9 +4792,18 @@ static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)
static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
- if (RB_EMPTY_ROOT(&bfqq->sort_list))
+ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+ if (bfqq->dispatched == 0)
+ /*
+ * Overloading budget_timeout field to store
+ * the time at which the queue remains with no
+ * backlog and no outstanding request; used by
+ * the weight-raising mechanism.
+ */
+ bfqq->budget_timeout = jiffies;
+
bfq_del_bfqq_busy(bfqd, bfqq, true);
- else
+ } else
bfq_requeue_bfqq(bfqd, bfqq);
/*
@@ -4468,9 +4830,18 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
struct request *next_rq;
int budget, min_budget;
- budget = bfqq->max_budget;
min_budget = bfq_min_budget(bfqd);
+ if (bfqq->wr_coeff == 1)
+ budget = bfqq->max_budget;
+ else /*
+ * Use a constant, low budget for weight-raised queues,
+ * to help achieve a low latency. Keep it slightly higher
+ * than the minimum possible budget, to cause a little
+ * bit fewer expirations.
+ */
+ budget = 2 * min_budget;
+
bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d",
bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d",
@@ -4478,7 +4849,7 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
- if (bfq_bfqq_sync(bfqq)) {
+ if (bfq_bfqq_sync(bfqq) && bfqq->wr_coeff == 1) {
switch (reason) {
/*
* Caveat: in all the following cases we trade latency
@@ -4577,7 +4948,7 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
default:
return;
}
- } else {
+ } else if (!bfq_bfqq_sync(bfqq)) {
/*
* Async queues get always the maximum possible
* budget, as for them we do not care about latency
@@ -4766,15 +5137,19 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
* bandwidth, and not time, distribution with little unlucky
* or quasi-sequential processes.
*/
- if (slow ||
- (reason == BFQQE_BUDGET_TIMEOUT &&
- bfq_bfqq_budget_left(bfqq) >= entity->budget / 3))
+ if (bfqq->wr_coeff == 1 &&
+ (slow ||
+ (reason == BFQQE_BUDGET_TIMEOUT &&
+ bfq_bfqq_budget_left(bfqq) >= entity->budget / 3)))
bfq_bfqq_charge_time(bfqd, bfqq, delta);
if (reason == BFQQE_TOO_IDLE &&
entity->service <= 2 * entity->budget / 10)
bfq_clear_bfqq_IO_bound(bfqq);
+ if (bfqd->low_latency && bfqq->wr_coeff == 1)
+ bfqq->last_wr_start_finish = jiffies;
+
bfq_log_bfqq(bfqd, bfqq,
"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -4801,10 +5176,7 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
*/
static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
{
- if (bfq_bfqq_budget_new(bfqq) ||
- time_is_after_jiffies(bfqq->budget_timeout))
- return false;
- return true;
+ return time_is_before_eq_jiffies(bfqq->budget_timeout);
}
/*
@@ -4831,19 +5203,40 @@ static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
/*
* For a queue that becomes empty, device idling is allowed only if
- * this function returns true for the queue. And this function returns
- * true only if idling is beneficial for throughput.
+ * this function returns true for the queue. As a consequence, since
+ * device idling plays a critical role in both throughput boosting and
+ * service guarantees, the return value of this function plays a
+ * critical role in both these aspects as well.
+ *
+ * In a nutshell, this function returns true only if idling is
+ * beneficial for throughput or, even if detrimental for throughput,
+ * idling is however necessary to preserve service guarantees (low
+ * latency, desired throughput distribution, ...). In particular, on
+ * NCQ-capable devices, this function tries to return false, so as to
+ * help keep the drives' internal queues full, whenever this helps the
+ * device boost the throughput without causing any service-guarantee
+ * issue.
+ *
+ * In more detail, the return value of this function is obtained by,
+ * first, computing a number of boolean variables that take into
+ * account throughput and service-guarantee issues, and, then,
+ * combining these variables in a logical expression. Most of the
+ * issues taken into account are not trivial. We discuss these issues
+ * individually while introducing the variables.
*/
static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
{
struct bfq_data *bfqd = bfqq->bfqd;
- bool idling_boosts_thr;
+ bool idling_boosts_thr, asymmetric_scenario;
if (bfqd->strict_guarantees)
return true;
/*
- * The value of the next variable is computed considering that
+ * The next variable takes into account the cases where idling
+ * boosts the throughput.
+ *
+ * The value of the variable is computed considering that
* idling is usually beneficial for the throughput if:
* (a) the device is not NCQ-capable, or
* (b) regardless of the presence of NCQ, the request pattern
@@ -4857,13 +5250,80 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
/*
- * We have now the components we need to compute the return
- * value of the function, which is true only if both the
- * following conditions hold:
+ * There is then a case where idling must be performed not for
+ * throughput concerns, but to preserve service guarantees. To
+ * introduce it, we can note that allowing the drive to
+ * enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, causes loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happens
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired
+ * device-throughput distribution only in a completely
+ * symmetric scenario where: (i) each of these processes must
+ * get the same throughput as the others; (ii) all these
+ * processes have the same I/O pattern (either sequential or
+ * random). In fact, in such a scenario, the drive will tend
+ * to treat the requests of each of these processes in about
+ * the same way as the requests of the others, and thus to
+ * provide each of these processes with about the same
+ * throughput (which is exactly the desired throughput
+ * distribution). In contrast, in any asymmetric scenario,
+ * device idling is certainly needed to guarantee that bfqq
+ * receives its assigned fraction of the device throughput
+ * (see [1] for details).
+ *
+ * As for sub-condition (i), actually we check only whether
+ * bfqq is being weight-raised. In fact, if bfqq is not being
+ * weight-raised, we have that:
+ * - if the process associated with bfqq is not I/O-bound, then
+ * it is not either latency- or throughput-critical; therefore
+ * idling is not needed for bfqq;
+ * - if the process asociated with bfqq is I/O-bound, then
+ * idling is already granted with bfqq (see the comments on
+ * idling_boosts_thr).
+ *
+ * We do not check sub-condition (ii) at all, i.e., the next
+ * variable is true if and only if bfqq is being
+ * weight-raised. We do not need to control sub-condition (ii)
+ * for the following reason:
+ * - if bfqq is being weight-raised, then idling is already
+ * guaranteed to bfqq by sub-condition (i);
+ * - if bfqq is not being weight-raised, then idling is
+ * already guaranteed to bfqq (only) if it matters, i.e., if
+ * bfqq is associated to a currently I/O-bound process (see
+ * the above comment on sub-condition (i)).
+ *
+ * As a side note, it is worth considering that the above
+ * device-idling countermeasures may however fail in the
+ * following unlucky scenario: if idling is (correctly)
+ * disabled in a time period during which the symmetry
+ * sub-condition holds, and hence the device is allowed to
+ * enqueue many requests, but at some later point in time some
+ * sub-condition stops to hold, then it may become impossible
+ * to let requests be served in the desired order until all
+ * the requests already queued in the device have been served.
+ */
+ asymmetric_scenario = bfqq->wr_coeff > 1;
+
+ /*
+ * We have now all the components we need to compute the return
+ * value of the function, which is true only if both the following
+ * conditions hold:
* 1) bfqq is sync, because idling make sense only for sync queues;
- * 2) idling boosts the throughput.
+ * 2) idling either boosts the throughput (without issues), or
+ * is necessary to preserve service guarantees.
*/
- return bfq_bfqq_sync(bfqq) && idling_boosts_thr;
+ return bfq_bfqq_sync(bfqq) &&
+ (idling_boosts_thr || asymmetric_scenario);
}
/*
@@ -4986,6 +5446,43 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
return bfqq;
}
+static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+ bfq_log_bfqq(bfqd, bfqq,
+ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+ jiffies_to_msecs(bfqq->wr_cur_max_time),
+ bfqq->wr_coeff,
+ bfqq->entity.weight, bfqq->entity.orig_weight);
+
+ if (entity->prio_changed)
+ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
+
+ /*
+ * If too much time has elapsed from the beginning of
+ * this weight-raising period, then end weight
+ * raising.
+ */
+ if (time_is_before_jiffies(bfqq->last_wr_start_finish +
+ bfqq->wr_cur_max_time)) {
+ bfqq->last_wr_start_finish = jiffies;
+ bfq_log_bfqq(bfqd, bfqq,
+ "wrais ending at %lu, rais_max_time %u",
+ bfqq->last_wr_start_finish,
+ jiffies_to_msecs(bfqq->wr_cur_max_time));
+ bfq_bfqq_end_wr(bfqq);
+ }
+ }
+ /* Update weight both if it must be raised and if it must be lowered */
+ if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
+ __bfq_entity_update_weight_prio(
+ bfq_entity_service_tree(entity),
+ entity);
+}
+
/*
* Dispatch next request from bfqq.
*/
@@ -5001,6 +5498,19 @@ static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
bfq_dispatch_remove(bfqd->queue, rq);
+ /*
+ * If weight raising has to terminate for bfqq, then next
+ * function causes an immediate update of bfqq's weight,
+ * without waiting for next activation. As a consequence, on
+ * expiration, bfqq will be timestamped as if has never been
+ * weight-raised during this service slot, even if it has
+ * received part or even most of the service as a
+ * weight-raised queue. This inflates bfqq's timestamps, which
+ * is beneficial, as bfqq is then more willing to leave the
+ * device immediately to possible other weight-raised queues.
+ */
+ bfq_update_wr_data(bfqd, bfqq);
+
if (!bfqd->in_service_bic) {
atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
bfqd->in_service_bic = RQ_BIC(rq);
@@ -5306,6 +5816,9 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
bfqq->budget_timeout = bfq_smallest_from_now();
+ bfqq->wr_coeff = 1;
+ bfqq->last_wr_start_finish = bfq_smallest_from_now();
+
/* first request is almost certainly seeky */
bfqq->seek_history = 1;
}
@@ -5440,7 +5953,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
enable_idle = 0;
else if (bfq_sample_valid(bfqq->ttime.ttime_samples)) {
- if (bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle)
+ if (bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle &&
+ bfqq->wr_coeff == 1)
enable_idle = 0;
else
enable_idle = 1;
@@ -5618,6 +6132,16 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
bfqd->rq_in_driver--;
bfqq->dispatched--;
+ if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
+ /*
+ * Set budget_timeout (which we overload to store the
+ * time at which the queue remains with no backlog and
+ * no outstanding request; used by the weight-raising
+ * mechanism).
+ */
+ bfqq->budget_timeout = jiffies;
+ }
+
now_ns = ktime_get_ns();
bfqq->ttime.last_end_request = now_ns;
@@ -5655,10 +6179,7 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
* or if we want to idle in case it has no pending requests.
*/
if (bfqd->in_service_queue == bfqq) {
- if (bfq_bfqq_budget_new(bfqq))
- bfq_set_budget_timeout(bfqd);
-
- if (bfq_bfqq_must_idle(bfqq)) {
+ if (bfqq->dispatched == 0 && bfq_bfqq_must_idle(bfqq)) {
bfq_arm_slice_timer(bfqd);
return;
} else if (bfq_may_expire_for_budg_timeout(bfqq))
@@ -5966,6 +6487,26 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
bfqd->bfq_requests_within_timer = 120;
+ bfqd->low_latency = true;
+
+ /*
+ * Trade-off between responsiveness and fairness.
+ */
+ bfqd->bfq_wr_coeff = 30;
+ bfqd->bfq_wr_max_time = 0;
+ bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
+ bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+
+ /*
+ * Begin by assuming, optimistically, that the device is a
+ * high-speed one, and that its peak rate is equal to 2/3 of
+ * the highest reference rate.
+ */
+ bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
+ T_fast[blk_queue_nonrot(bfqd->queue)];
+ bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)] * 2 / 3;
+ bfqd->device_speed = BFQ_BFQD_FAST;
+
spin_lock_init(&bfqd->lock);
/*
@@ -6047,6 +6588,7 @@ SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 2);
SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1);
SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0);
+SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
#undef SHOW_FUNCTION
#define USEC_SHOW_FUNCTION(__FUNC, __VAR) \
@@ -6167,6 +6709,22 @@ static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,
return ret;
}
+static ssize_t bfq_low_latency_store(struct elevator_queue *e,
+ const char *page, size_t count)
+{
+ struct bfq_data *bfqd = e->elevator_data;
+ unsigned long uninitialized_var(__data);
+ int ret = bfq_var_store(&__data, (page), count);
+
+ if (__data > 1)
+ __data = 1;
+ if (__data == 0 && bfqd->low_latency != 0)
+ bfq_end_wr(bfqd);
+ bfqd->low_latency = __data;
+
+ return ret;
+}
+
#define BFQ_ATTR(name) \
__ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
@@ -6180,6 +6738,7 @@ static struct elv_fs_entry bfq_attrs[] = {
BFQ_ATTR(max_budget),
BFQ_ATTR(timeout_sync),
BFQ_ATTR(strict_guarantees),
+ BFQ_ATTR(low_latency),
__ATTR_NULL
};
@@ -6242,6 +6801,39 @@ static int __init bfq_init(void)
if (bfq_slab_setup())
goto err_pol_unreg;
+ /*
+ * Times to load large popular applications for the typical
+ * systems installed on the reference devices (see the
+ * comments before the definitions of the next two
+ * arrays). Actually, we use slightly slower values, as the
+ * estimated peak rate tends to be smaller than the actual
+ * peak rate. The reason for this last fact is that estimates
+ * are computed over much shorter time intervals than the long
+ * intervals typically used for benchmarking. Why? First, to
+ * adapt more quickly to variations. Second, because an I/O
+ * scheduler cannot rely on a peak-rate-evaluation workload to
+ * be run for a long time.
+ */
+ T_slow[0] = msecs_to_jiffies(3500); /* actually 4 sec */
+ T_slow[1] = msecs_to_jiffies(6000); /* actually 6.5 sec */
+ T_fast[0] = msecs_to_jiffies(7000); /* actually 8 sec */
+ T_fast[1] = msecs_to_jiffies(2500); /* actually 3 sec */
+
+ /*
+ * Thresholds that determine the switch between speed classes
+ * (see the comments before the definition of the array
+ * device_speed_thresh). These thresholds are biased towards
+ * transitions to the fast class. This is safer than the
+ * opposite bias. In fact, a wrong transition to the slow
+ * class results in short weight-raising periods, because the
+ * speed of the device then tends to be higher that the
+ * reference peak rate. On the opposite end, a wrong
+ * transition to the fast class tends to increase
+ * weight-raising periods, because of the opposite reason.
+ */
+ device_speed_thresh[0] = (4 * R_slow[0]) / 3;
+ device_speed_thresh[1] = (4 * R_slow[1]) / 3;
+
ret = elv_register(&iosched_bfq_mq);
if (ret)
goto err_pol_unreg;
--
2.10.0
^ permalink raw reply related
* [PATCH V4 05/16] block, bfq: add more fairness with writes and slow processes
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
This patch deals with two sources of unfairness, which can also cause
high latencies and throughput loss. The first source is related to
write requests. Write requests tend to starve read requests, basically
because, on one side, writes are slower than reads, whereas, on the
other side, storage devices confuse schedulers by deceptively
signaling the completion of write requests immediately after receiving
them. This patch addresses this issue by just throttling writes. In
particular, after a write request is dispatched for a queue, the
budget of the queue is decremented by the number of sectors to write,
multiplied by an (over)charge coefficient. The value of the
coefficient is the result of our tuning with different devices.
The second source of unfairness has to do with slowness detection:
when the in-service queue is expired, BFQ also controls whether the
queue has been "too slow", i.e., has consumed its last-assigned budget
at such a low rate that it would have been impossible to consume all
of this budget within the maximum time slice T_max (Subsec. 3.5 in
[1]). In this case, the queue is always (over)charged the whole
budget, to reduce its utilization of the device. Both this overcharge
and the slowness-detection criterion may cause unfairness.
First, always charging a full budget to a slow queue is too coarse. It
is much more accurate, and this patch lets BFQ do so, to charge an
amount of service 'equivalent' to the amount of time during which the
queue has been in service. As explained in more detail in the comments
on the code, this enables BFQ to provide time fairness among slow
queues.
Secondly, because of ZBR, a queue may be deemed as slow when its
associated process is performing I/O on the slowest zones of a
disk. However, unless the process is truly too slow, not reducing the
disk utilization of the queue is more profitable in terms of disk
throughput than the opposite. A similar problem is caused by logical
block mapping on non-rotational devices. For this reason, this patch
lets a queue be charged time, and not budget, only if the queue has
consumed less than 2/3 of its assigned budget. As an additional,
important benefit, this tolerance allows BFQ to preserve enough
elasticity to still perform bandwidth, and not time, distribution with
little unlucky or quasi-sequential processes.
Finally, for the same reasons as above, this patch makes slowness
detection itself much less harsh: a queue is deemed slow only if it
has consumed its budget at less than half of the peak rate.
[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
block/bfq-iosched.c | 120 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 85 insertions(+), 35 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 61d880b..dce273b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -753,6 +753,13 @@ static const int bfq_stats_min_budgets = 194;
/* Default maximum budget values, in sectors and number of requests. */
static const int bfq_default_max_budget = 16 * 1024;
+/*
+ * Async to sync throughput distribution is controlled as follows:
+ * when an async request is served, the entity is charged the number
+ * of sectors of the request, multiplied by the factor below
+ */
+static const int bfq_async_charge_factor = 10;
+
/* Default timeout values, in jiffies, approximating CFQ defaults. */
static const int bfq_timeout = HZ / 8;
@@ -1571,22 +1578,52 @@ static void bfq_bfqq_served(struct bfq_queue *bfqq, int served)
}
/**
- * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * bfq_bfqq_charge_time - charge an amount of service equivalent to the length
+ * of the time interval during which bfqq has been in
+ * service.
+ * @bfqd: the device
* @bfqq: the queue that needs a service update.
+ * @time_ms: the amount of time during which the queue has received service
*
- * When it's not possible to be fair in the service domain, because
- * a queue is not consuming its budget fast enough (the meaning of
- * fast depends on the timeout parameter), we charge it a full
- * budget. In this way we should obtain a sort of time-domain
- * fairness among all the seeky/slow queues.
+ * If a queue does not consume its budget fast enough, then providing
+ * the queue with service fairness may impair throughput, more or less
+ * severely. For this reason, queues that consume their budget slowly
+ * are provided with time fairness instead of service fairness. This
+ * goal is achieved through the BFQ scheduling engine, even if such an
+ * engine works in the service, and not in the time domain. The trick
+ * is charging these queues with an inflated amount of service, equal
+ * to the amount of service that they would have received during their
+ * service slot if they had been fast, i.e., if their requests had
+ * been dispatched at a rate equal to the estimated peak rate.
+ *
+ * It is worth noting that time fairness can cause important
+ * distortions in terms of bandwidth distribution, on devices with
+ * internal queueing. The reason is that I/O requests dispatched
+ * during the service slot of a queue may be served after that service
+ * slot is finished, and may have a total processing time loosely
+ * correlated with the duration of the service slot. This is
+ * especially true for short service slots.
*/
-static void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
+static void bfq_bfqq_charge_time(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ unsigned long time_ms)
{
struct bfq_entity *entity = &bfqq->entity;
+ int tot_serv_to_charge = entity->service;
+ unsigned int timeout_ms = jiffies_to_msecs(bfq_timeout);
+
+ if (time_ms > 0 && time_ms < timeout_ms)
+ tot_serv_to_charge =
+ (bfqd->bfq_max_budget * time_ms) / timeout_ms;
- bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
+ if (tot_serv_to_charge < entity->service)
+ tot_serv_to_charge = entity->service;
- bfq_bfqq_served(bfqq, entity->budget - entity->service);
+ /* Increase budget to avoid inconsistencies */
+ if (tot_serv_to_charge > entity->budget)
+ entity->budget = tot_serv_to_charge;
+
+ bfq_bfqq_served(bfqq,
+ max_t(int, 0, tot_serv_to_charge - entity->service));
}
static void bfq_update_fin_time_enqueue(struct bfq_entity *entity,
@@ -3572,10 +3609,14 @@ static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
}
+/* see the definition of bfq_async_charge_factor for details */
static unsigned long bfq_serv_to_charge(struct request *rq,
struct bfq_queue *bfqq)
{
- return blk_rq_sectors(rq);
+ if (bfq_bfqq_sync(bfqq))
+ return blk_rq_sectors(rq);
+
+ return blk_rq_sectors(rq) * bfq_async_charge_factor;
}
/**
@@ -4676,28 +4717,24 @@ static unsigned long bfq_smallest_from_now(void)
* @compensate: if true, compensate for the time spent idling.
* @reason: the reason causing the expiration.
*
+ * If the process associated with bfqq does slow I/O (e.g., because it
+ * issues random requests), we charge bfqq with the time it has been
+ * in service instead of the service it has received (see
+ * bfq_bfqq_charge_time for details on how this goal is achieved). As
+ * a consequence, bfqq will typically get higher timestamps upon
+ * reactivation, and hence it will be rescheduled as if it had
+ * received more service than what it has actually received. In the
+ * end, bfqq receives less service in proportion to how slowly its
+ * associated process consumes its budgets (and hence how seriously it
+ * tends to lower the throughput). In addition, this time-charging
+ * strategy guarantees time fairness among slow processes. In
+ * contrast, if the process associated with bfqq is not slow, we
+ * charge bfqq exactly with the service it has received.
*
- * If the process associated with the queue is slow (i.e., seeky), or
- * in case of budget timeout, or, finally, if it is async, we
- * artificially charge it an entire budget (independently of the
- * actual service it received). As a consequence, the queue will get
- * higher timestamps than the correct ones upon reactivation, and
- * hence it will be rescheduled as if it had received more service
- * than what it actually received. In the end, this class of processes
- * will receive less service in proportion to how slowly they consume
- * their budgets (and hence how seriously they tend to lower the
- * throughput).
- *
- * In contrast, when a queue expires because it has been idling for
- * too much or because it exhausted its budget, we do not touch the
- * amount of service it has received. Hence when the queue will be
- * reactivated and its timestamps updated, the latter will be in sync
- * with the actual service received by the queue until expiration.
- *
- * Charging a full budget to the first type of queues and the exact
- * service to the others has the effect of using the WF2Q+ policy to
- * schedule the former on a timeslice basis, without violating the
- * service domain guarantees of the latter.
+ * Charging time to the first type of queues and the exact service to
+ * the other has the effect of using the WF2Q+ policy to schedule the
+ * former on a timeslice basis, without violating service domain
+ * guarantees among the latter.
*/
static void bfq_bfqq_expire(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
@@ -4715,11 +4752,24 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, reason, &delta);
/*
- * As above explained, 'punish' slow (i.e., seeky), timed-out
- * and async queues, to favor sequential sync workloads.
+ * As above explained, charge slow (typically seeky) and
+ * timed-out queues with the time and not the service
+ * received, to favor sequential workloads.
+ *
+ * Processes doing I/O in the slower disk zones will tend to
+ * be slow(er) even if not seeky. Therefore, since the
+ * estimated peak rate is actually an average over the disk
+ * surface, these processes may timeout just for bad luck. To
+ * avoid punishing them, do not charge time to processes that
+ * succeeded in consuming at least 2/3 of their budget. This
+ * allows BFQ to preserve enough elasticity to still perform
+ * bandwidth, and not time, distribution with little unlucky
+ * or quasi-sequential processes.
*/
- if (slow || reason == BFQQE_BUDGET_TIMEOUT)
- bfq_bfqq_charge_full_budget(bfqq);
+ if (slow ||
+ (reason == BFQQE_BUDGET_TIMEOUT &&
+ bfq_bfqq_budget_left(bfqq) >= entity->budget / 3))
+ bfq_bfqq_charge_time(bfqd, bfqq, delta);
if (reason == BFQQE_TOO_IDLE &&
entity->service <= 2 * entity->budget / 10)
--
2.10.0
^ permalink raw reply related
* [PATCH V4 04/16] block, bfq: modify the peak-rate estimator
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
Unless the maximum budget B_max that BFQ can assign to a queue is set
explicitly by the user, BFQ automatically updates B_max. In
particular, BFQ dynamically sets B_max to the number of sectors that
can be read, at the current estimated peak rate, during the maximum
time, T_max, allowed before a budget timeout occurs. In formulas, if
we denote as R_est the estimated peak rate, then B_max = T_max ∗
R_est. Hence, the higher R_est is with respect to the actual device
peak rate, the higher the probability that processes incur budget
timeouts unjustly is. Besides, a too high value of B_max unnecessarily
increases the deviation from an ideal, smooth service.
Unfortunately, it is not trivial to estimate the peak rate correctly:
because of the presence of sw and hw queues between the scheduler and
the device components that finally serve I/O requests, it is hard to
say exactly when a given dispatched request is served inside the
device, and for how long. As a consequence, it is hard to know
precisely at what rate a given set of requests is actually served by
the device.
On the opposite end, the dispatch time of any request is trivially
available, and, from this piece of information, the "dispatch rate"
of requests can be immediately computed. So, the idea in the next
function is to use what is known, namely request dispatch times
(plus, when useful, request completion times), to estimate what is
unknown, namely in-device request service rate.
The main issue is that, because of the above facts, the rate at
which a certain set of requests is dispatched over a certain time
interval can vary greatly with respect to the rate at which the
same requests are then served. But, since the size of any
intermediate queue is limited, and the service scheme is lossless
(no request is silently dropped), the following obvious convergence
property holds: the number of requests dispatched MUST become
closer and closer to the number of requests completed as the
observation interval grows. This is the key property used in
this new version of the peak-rate estimator.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
block/bfq-iosched.c | 497 +++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 372 insertions(+), 125 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1edac72..61d880b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -407,19 +407,37 @@ struct bfq_data {
/* on-disk position of the last served request */
sector_t last_position;
+ /* time of last request completion (ns) */
+ u64 last_completion;
+
+ /* time of first rq dispatch in current observation interval (ns) */
+ u64 first_dispatch;
+ /* time of last rq dispatch in current observation interval (ns) */
+ u64 last_dispatch;
+
/* beginning of the last budget */
ktime_t last_budget_start;
/* beginning of the last idle slice */
ktime_t last_idling_start;
- /* number of samples used to calculate @peak_rate */
+
+ /* number of samples in current observation interval */
int peak_rate_samples;
+ /* num of samples of seq dispatches in current observation interval */
+ u32 sequential_samples;
+ /* total num of sectors transferred in current observation interval */
+ u64 tot_sectors_dispatched;
+ /* max rq size seen during current observation interval (sectors) */
+ u32 last_rq_max_size;
+ /* time elapsed from first dispatch in current observ. interval (us) */
+ u64 delta_from_first;
/*
- * Peak read/write rate, observed during the service of a
- * budget [BFQ_RATE_SHIFT * sectors/usec]. The value is
- * left-shifted by BFQ_RATE_SHIFT to increase precision in
+ * Current estimate of the device peak rate, measured in
+ * [BFQ_RATE_SHIFT * sectors/usec]. The left-shift by
+ * BFQ_RATE_SHIFT is performed to increase precision in
* fixed-point calculations.
*/
- u64 peak_rate;
+ u32 peak_rate;
+
/* maximum budget allotted to a bfq_queue before rescheduling */
int bfq_max_budget;
@@ -740,7 +758,7 @@ static const int bfq_timeout = HZ / 8;
static struct kmem_cache *bfq_pool;
-/* Below this threshold (in ms), we consider thinktime immediate. */
+/* Below this threshold (in ns), we consider thinktime immediate. */
#define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
/* hw_tag detection: parallel requests threshold and min samples needed. */
@@ -752,8 +770,12 @@ static struct kmem_cache *bfq_pool;
#define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
#define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 32/8)
-/* Min samples used for peak rate estimation (for autotuning). */
-#define BFQ_PEAK_RATE_SAMPLES 32
+/* Min number of samples required to perform peak-rate update */
+#define BFQ_RATE_MIN_SAMPLES 32
+/* Min observation time interval required to perform a peak-rate update (ns) */
+#define BFQ_RATE_MIN_INTERVAL (300*NSEC_PER_MSEC)
+/* Target observation time interval for a peak-rate update (ns) */
+#define BFQ_RATE_REF_INTERVAL NSEC_PER_SEC
/* Shift used for peak rate fixed precision calculations. */
#define BFQ_RATE_SHIFT 16
@@ -3837,15 +3859,20 @@ static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
return NULL;
}
+static sector_t get_sdist(sector_t last_pos, struct request *rq)
+{
+ if (last_pos)
+ return abs(blk_rq_pos(rq) - last_pos);
+
+ return 0;
+}
+
#if 0 /* Still not clear if we can do without next two functions */
static void bfq_activate_request(struct request_queue *q, struct request *rq)
{
struct bfq_data *bfqd = q->elevator->elevator_data;
bfqd->rq_in_driver++;
- bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
- bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
- (unsigned long long)bfqd->last_position);
}
static void bfq_deactivate_request(struct request_queue *q, struct request *rq)
@@ -4124,6 +4151,227 @@ static void bfq_set_budget_timeout(struct bfq_data *bfqd)
}
/*
+ * In autotuning mode, max_budget is dynamically recomputed as the
+ * amount of sectors transferred in timeout at the estimated peak
+ * rate. This enables BFQ to utilize a full timeslice with a full
+ * budget, even if the in-service queue is served at peak rate. And
+ * this maximises throughput with sequential workloads.
+ */
+static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd)
+{
+ return (u64)bfqd->peak_rate * USEC_PER_MSEC *
+ jiffies_to_msecs(bfqd->bfq_timeout)>>BFQ_RATE_SHIFT;
+}
+
+static void bfq_reset_rate_computation(struct bfq_data *bfqd,
+ struct request *rq)
+{
+ if (rq != NULL) { /* new rq dispatch now, reset accordingly */
+ bfqd->last_dispatch = bfqd->first_dispatch = ktime_get_ns();
+ bfqd->peak_rate_samples = 1;
+ bfqd->sequential_samples = 0;
+ bfqd->tot_sectors_dispatched = bfqd->last_rq_max_size =
+ blk_rq_sectors(rq);
+ } else /* no new rq dispatched, just reset the number of samples */
+ bfqd->peak_rate_samples = 0; /* full re-init on next disp. */
+
+ bfq_log(bfqd,
+ "reset_rate_computation at end, sample %u/%u tot_sects %llu",
+ bfqd->peak_rate_samples, bfqd->sequential_samples,
+ bfqd->tot_sectors_dispatched);
+}
+
+static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
+{
+ u32 rate, weight, divisor;
+
+ /*
+ * For the convergence property to hold (see comments on
+ * bfq_update_peak_rate()) and for the assessment to be
+ * reliable, a minimum number of samples must be present, and
+ * a minimum amount of time must have elapsed. If not so, do
+ * not compute new rate. Just reset parameters, to get ready
+ * for a new evaluation attempt.
+ */
+ if (bfqd->peak_rate_samples < BFQ_RATE_MIN_SAMPLES ||
+ bfqd->delta_from_first < BFQ_RATE_MIN_INTERVAL)
+ goto reset_computation;
+
+ /*
+ * If a new request completion has occurred after last
+ * dispatch, then, to approximate the rate at which requests
+ * have been served by the device, it is more precise to
+ * extend the observation interval to the last completion.
+ */
+ bfqd->delta_from_first =
+ max_t(u64, bfqd->delta_from_first,
+ bfqd->last_completion - bfqd->first_dispatch);
+
+ /*
+ * Rate computed in sects/usec, and not sects/nsec, for
+ * precision issues.
+ */
+ rate = div64_ul(bfqd->tot_sectors_dispatched<<BFQ_RATE_SHIFT,
+ div_u64(bfqd->delta_from_first, NSEC_PER_USEC));
+
+ /*
+ * Peak rate not updated if:
+ * - the percentage of sequential dispatches is below 3/4 of the
+ * total, and rate is below the current estimated peak rate
+ * - rate is unreasonably high (> 20M sectors/sec)
+ */
+ if ((bfqd->sequential_samples < (3 * bfqd->peak_rate_samples)>>2 &&
+ rate <= bfqd->peak_rate) ||
+ rate > 20<<BFQ_RATE_SHIFT)
+ goto reset_computation;
+
+ /*
+ * We have to update the peak rate, at last! To this purpose,
+ * we use a low-pass filter. We compute the smoothing constant
+ * of the filter as a function of the 'weight' of the new
+ * measured rate.
+ *
+ * As can be seen in next formulas, we define this weight as a
+ * quantity proportional to how sequential the workload is,
+ * and to how long the observation time interval is.
+ *
+ * The weight runs from 0 to 8. The maximum value of the
+ * weight, 8, yields the minimum value for the smoothing
+ * constant. At this minimum value for the smoothing constant,
+ * the measured rate contributes for half of the next value of
+ * the estimated peak rate.
+ *
+ * So, the first step is to compute the weight as a function
+ * of how sequential the workload is. Note that the weight
+ * cannot reach 9, because bfqd->sequential_samples cannot
+ * become equal to bfqd->peak_rate_samples, which, in its
+ * turn, holds true because bfqd->sequential_samples is not
+ * incremented for the first sample.
+ */
+ weight = (9 * bfqd->sequential_samples) / bfqd->peak_rate_samples;
+
+ /*
+ * Second step: further refine the weight as a function of the
+ * duration of the observation interval.
+ */
+ weight = min_t(u32, 8,
+ div_u64(weight * bfqd->delta_from_first,
+ BFQ_RATE_REF_INTERVAL));
+
+ /*
+ * Divisor ranging from 10, for minimum weight, to 2, for
+ * maximum weight.
+ */
+ divisor = 10 - weight;
+
+ /*
+ * Finally, update peak rate:
+ *
+ * peak_rate = peak_rate * (divisor-1) / divisor + rate / divisor
+ */
+ bfqd->peak_rate *= divisor-1;
+ bfqd->peak_rate /= divisor;
+ rate /= divisor; /* smoothing constant alpha = 1/divisor */
+
+ bfqd->peak_rate += rate;
+ if (bfqd->bfq_user_max_budget == 0)
+ bfqd->bfq_max_budget =
+ bfq_calc_max_budget(bfqd);
+
+reset_computation:
+ bfq_reset_rate_computation(bfqd, rq);
+}
+
+/*
+ * Update the read/write peak rate (the main quantity used for
+ * auto-tuning, see update_thr_responsiveness_params()).
+ *
+ * It is not trivial to estimate the peak rate (correctly): because of
+ * the presence of sw and hw queues between the scheduler and the
+ * device components that finally serve I/O requests, it is hard to
+ * say exactly when a given dispatched request is served inside the
+ * device, and for how long. As a consequence, it is hard to know
+ * precisely at what rate a given set of requests is actually served
+ * by the device.
+ *
+ * On the opposite end, the dispatch time of any request is trivially
+ * available, and, from this piece of information, the "dispatch rate"
+ * of requests can be immediately computed. So, the idea in the next
+ * function is to use what is known, namely request dispatch times
+ * (plus, when useful, request completion times), to estimate what is
+ * unknown, namely in-device request service rate.
+ *
+ * The main issue is that, because of the above facts, the rate at
+ * which a certain set of requests is dispatched over a certain time
+ * interval can vary greatly with respect to the rate at which the
+ * same requests are then served. But, since the size of any
+ * intermediate queue is limited, and the service scheme is lossless
+ * (no request is silently dropped), the following obvious convergence
+ * property holds: the number of requests dispatched MUST become
+ * closer and closer to the number of requests completed as the
+ * observation interval grows. This is the key property used in
+ * the next function to estimate the peak service rate as a function
+ * of the observed dispatch rate. The function assumes to be invoked
+ * on every request dispatch.
+ */
+static void bfq_update_peak_rate(struct bfq_data *bfqd, struct request *rq)
+{
+ u64 now_ns = ktime_get_ns();
+
+ if (bfqd->peak_rate_samples == 0) { /* first dispatch */
+ bfq_log(bfqd, "update_peak_rate: goto reset, samples %d",
+ bfqd->peak_rate_samples);
+ bfq_reset_rate_computation(bfqd, rq);
+ goto update_last_values; /* will add one sample */
+ }
+
+ /*
+ * Device idle for very long: the observation interval lasting
+ * up to this dispatch cannot be a valid observation interval
+ * for computing a new peak rate (similarly to the late-
+ * completion event in bfq_completed_request()). Go to
+ * update_rate_and_reset to have the following three steps
+ * taken:
+ * - close the observation interval at the last (previous)
+ * request dispatch or completion
+ * - compute rate, if possible, for that observation interval
+ * - start a new observation interval with this dispatch
+ */
+ if (now_ns - bfqd->last_dispatch > 100*NSEC_PER_MSEC &&
+ bfqd->rq_in_driver == 0)
+ goto update_rate_and_reset;
+
+ /* Update sampling information */
+ bfqd->peak_rate_samples++;
+
+ if ((bfqd->rq_in_driver > 0 ||
+ now_ns - bfqd->last_completion < BFQ_MIN_TT)
+ && get_sdist(bfqd->last_position, rq) < BFQQ_SEEK_THR)
+ bfqd->sequential_samples++;
+
+ bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);
+
+ /* Reset max observed rq size every 32 dispatches */
+ if (likely(bfqd->peak_rate_samples % 32))
+ bfqd->last_rq_max_size =
+ max_t(u32, blk_rq_sectors(rq), bfqd->last_rq_max_size);
+ else
+ bfqd->last_rq_max_size = blk_rq_sectors(rq);
+
+ bfqd->delta_from_first = now_ns - bfqd->first_dispatch;
+
+ /* Target observation interval not yet reached, go on sampling */
+ if (bfqd->delta_from_first < BFQ_RATE_REF_INTERVAL)
+ goto update_last_values;
+
+update_rate_and_reset:
+ bfq_update_rate_reset(bfqd, rq);
+update_last_values:
+ bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+ bfqd->last_dispatch = now_ns;
+}
+
+/*
* Remove request from internal lists.
*/
static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)
@@ -4143,6 +4391,7 @@ static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)
* happens to be taken into account.
*/
bfqq->dispatched++;
+ bfq_update_peak_rate(q->elevator->elevator_data, rq);
bfq_remove_request(q, rq);
}
@@ -4323,110 +4572,92 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
bfqq->entity.budget);
}
-static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
-{
- unsigned long max_budget;
-
- /*
- * The max_budget calculated when autotuning is equal to the
- * amount of sectors transferred in timeout at the estimated
- * peak rate. To get this value, peak_rate is, first,
- * multiplied by 1000, because timeout is measured in ms,
- * while peak_rate is measured in sectors/usecs. Then the
- * result of this multiplication is right-shifted by
- * BFQ_RATE_SHIFT, because peak_rate is equal to the value of
- * the peak rate left-shifted by BFQ_RATE_SHIFT.
- */
- max_budget = (unsigned long)(peak_rate * 1000 *
- timeout >> BFQ_RATE_SHIFT);
-
- return max_budget;
-}
-
/*
- * In addition to updating the peak rate, checks whether the process
- * is "slow", and returns 1 if so. This slow flag is used, in addition
- * to the budget timeout, to reduce the amount of service provided to
- * seeky processes, and hence reduce their chances to lower the
- * throughput. See the code for more details.
+ * Return true if the process associated with bfqq is "slow". The slow
+ * flag is used, in addition to the budget timeout, to reduce the
+ * amount of service provided to seeky processes, and thus reduce
+ * their chances to lower the throughput. More details in the comments
+ * on the function bfq_bfqq_expire().
+ *
+ * An important observation is in order: as discussed in the comments
+ * on the function bfq_update_peak_rate(), with devices with internal
+ * queues, it is hard if ever possible to know when and for how long
+ * an I/O request is processed by the device (apart from the trivial
+ * I/O pattern where a new request is dispatched only after the
+ * previous one has been completed). This makes it hard to evaluate
+ * the real rate at which the I/O requests of each bfq_queue are
+ * served. In fact, for an I/O scheduler like BFQ, serving a
+ * bfq_queue means just dispatching its requests during its service
+ * slot (i.e., until the budget of the queue is exhausted, or the
+ * queue remains idle, or, finally, a timeout fires). But, during the
+ * service slot of a bfq_queue, around 100 ms at most, the device may
+ * be even still processing requests of bfq_queues served in previous
+ * service slots. On the opposite end, the requests of the in-service
+ * bfq_queue may be completed after the service slot of the queue
+ * finishes.
+ *
+ * Anyway, unless more sophisticated solutions are used
+ * (where possible), the sum of the sizes of the requests dispatched
+ * during the service slot of a bfq_queue is probably the only
+ * approximation available for the service received by the bfq_queue
+ * during its service slot. And this sum is the quantity used in this
+ * function to evaluate the I/O speed of a process.
*/
-static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
- bool compensate)
+static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ bool compensate, enum bfqq_expiration reason,
+ unsigned long *delta_ms)
{
- u64 bw, usecs, expected, timeout;
- ktime_t delta;
- int update = 0;
+ ktime_t delta_ktime;
+ u32 delta_usecs;
+ bool slow = BFQQ_SEEKY(bfqq); /* if delta too short, use seekyness */
- if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
+ if (!bfq_bfqq_sync(bfqq))
return false;
if (compensate)
- delta = bfqd->last_idling_start;
+ delta_ktime = bfqd->last_idling_start;
else
- delta = ktime_get();
- delta = ktime_sub(delta, bfqd->last_budget_start);
- usecs = ktime_to_us(delta);
+ delta_ktime = ktime_get();
+ delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);
+ delta_usecs = ktime_to_us(delta_ktime);
/* don't use too short time intervals */
- if (usecs < 1000)
- return false;
-
- /*
- * Calculate the bandwidth for the last slice. We use a 64 bit
- * value to store the peak rate, in sectors per usec in fixed
- * point math. We do so to have enough precision in the estimate
- * and to avoid overflows.
- */
- bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
- do_div(bw, (unsigned long)usecs);
+ if (delta_usecs < 1000) {
+ if (blk_queue_nonrot(bfqd->queue))
+ /*
+ * give same worst-case guarantees as idling
+ * for seeky
+ */
+ *delta_ms = BFQ_MIN_TT / NSEC_PER_MSEC;
+ else /* charge at least one seek */
+ *delta_ms = bfq_slice_idle / NSEC_PER_MSEC;
+
+ return slow;
+ }
- timeout = jiffies_to_msecs(bfqd->bfq_timeout);
+ *delta_ms = delta_usecs / USEC_PER_MSEC;
/*
- * Use only long (> 20ms) intervals to filter out spikes for
- * the peak rate estimation.
+ * Use only long (> 20ms) intervals to filter out excessive
+ * spikes in service rate estimation.
*/
- if (usecs > 20000) {
- if (bw > bfqd->peak_rate) {
- bfqd->peak_rate = bw;
- update = 1;
- bfq_log(bfqd, "new peak_rate=%llu", bw);
- }
-
- update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
-
- if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
- bfqd->peak_rate_samples++;
-
- if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
- update && bfqd->bfq_user_max_budget == 0) {
- bfqd->bfq_max_budget =
- bfq_calc_max_budget(bfqd->peak_rate,
- timeout);
- bfq_log(bfqd, "new max_budget=%d",
- bfqd->bfq_max_budget);
- }
+ if (delta_usecs > 20000) {
+ /*
+ * Caveat for rotational devices: processes doing I/O
+ * in the slower disk zones tend to be slow(er) even
+ * if not seeky. In this respect, the estimated peak
+ * rate is likely to be an average over the disk
+ * surface. Accordingly, to not be too harsh with
+ * unlucky processes, a process is deemed slow only if
+ * its rate has been lower than half of the estimated
+ * peak rate.
+ */
+ slow = bfqq->entity.service < bfqd->bfq_max_budget / 2;
}
- /*
- * A process is considered ``slow'' (i.e., seeky, so that we
- * cannot treat it fairly in the service domain, as it would
- * slow down too much the other processes) if, when a slice
- * ends for whatever reason, it has received service at a
- * rate that would not be high enough to complete the budget
- * before the budget timeout expiration.
- */
- expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
+ bfq_log_bfqq(bfqd, bfqq, "bfq_bfqq_is_slow: slow %d", slow);
- /*
- * Caveat: processes doing IO in the slower disk zones will
- * tend to be slow(er) even if not seeky. And the estimated
- * peak rate will actually be an average over the disk
- * surface. Hence, to not be too harsh with unlucky processes,
- * we keep a budget/3 margin of safety before declaring a
- * process slow.
- */
- return expected > (4 * bfqq->entity.budget) / 3;
+ return slow;
}
/*
@@ -4474,13 +4705,14 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
enum bfqq_expiration reason)
{
bool slow;
+ unsigned long delta = 0;
+ struct bfq_entity *entity = &bfqq->entity;
int ref;
/*
- * Update device peak rate for autotuning and check whether the
- * process is slow (see bfq_update_peak_rate).
+ * Check whether the process is slow (see bfq_bfqq_is_slow).
*/
- slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+ slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, reason, &delta);
/*
* As above explained, 'punish' slow (i.e., seeky), timed-out
@@ -4490,7 +4722,7 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
bfq_bfqq_charge_full_budget(bfqq);
if (reason == BFQQE_TOO_IDLE &&
- bfqq->entity.service <= 2 * bfqq->entity.budget / 10)
+ entity->service <= 2 * entity->budget / 10)
bfq_clear_bfqq_IO_bound(bfqq);
bfq_log_bfqq(bfqd, bfqq,
@@ -5130,17 +5362,9 @@ static void
bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,
struct request *rq)
{
- sector_t sdist = 0;
-
- if (bfqq->last_request_pos) {
- if (bfqq->last_request_pos < blk_rq_pos(rq))
- sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
- else
- sdist = bfqq->last_request_pos - blk_rq_pos(rq);
- }
-
bfqq->seek_history <<= 1;
- bfqq->seek_history |= sdist > BFQQ_SEEK_THR &&
+ bfqq->seek_history |=
+ get_sdist(bfqq->last_request_pos, rq) > BFQQ_SEEK_THR &&
(!blk_queue_nonrot(bfqd->queue) ||
blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT);
}
@@ -5336,12 +5560,45 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)
static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
{
+ u64 now_ns;
+ u32 delta_us;
+
bfq_update_hw_tag(bfqd);
bfqd->rq_in_driver--;
bfqq->dispatched--;
- bfqq->ttime.last_end_request = ktime_get_ns();
+ now_ns = ktime_get_ns();
+
+ bfqq->ttime.last_end_request = now_ns;
+
+ /*
+ * Using us instead of ns, to get a reasonable precision in
+ * computing rate in next check.
+ */
+ delta_us = div_u64(now_ns - bfqd->last_completion, NSEC_PER_USEC);
+
+ /*
+ * If the request took rather long to complete, and, according
+ * to the maximum request size recorded, this completion latency
+ * implies that the request was certainly served at a very low
+ * rate (less than 1M sectors/sec), then the whole observation
+ * interval that lasts up to this time instant cannot be a
+ * valid time interval for computing a new peak rate. Invoke
+ * bfq_update_rate_reset to have the following three steps
+ * taken:
+ * - close the observation interval at the last (previous)
+ * request dispatch or completion
+ * - compute rate, if possible, for that observation interval
+ * - reset to zero samples, which will trigger a proper
+ * re-initialization of the observation interval on next
+ * dispatch
+ */
+ if (delta_us > BFQ_MIN_TT/NSEC_PER_USEC &&
+ (bfqd->last_rq_max_size<<BFQ_RATE_SHIFT)/delta_us <
+ 1UL<<(BFQ_RATE_SHIFT - 10))
+ bfq_update_rate_reset(bfqd, NULL);
+ bfqd->last_completion = now_ns;
/*
* If this is the in-service queue, check if it needs to be expired,
@@ -5799,16 +6056,6 @@ USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
UINT_MAX);
#undef USEC_STORE_FUNCTION
-static unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
-{
- u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout);
-
- if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
- return bfq_calc_max_budget(bfqd->peak_rate, timeout);
- else
- return bfq_default_max_budget;
-}
-
static ssize_t bfq_max_budget_store(struct elevator_queue *e,
const char *page, size_t count)
{
@@ -5817,7 +6064,7 @@ static ssize_t bfq_max_budget_store(struct elevator_queue *e,
int ret = bfq_var_store(&__data, (page), count);
if (__data == 0)
- bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+ bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
else {
if (__data > INT_MAX)
__data = INT_MAX;
@@ -5847,7 +6094,7 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
bfqd->bfq_timeout = msecs_to_jiffies(__data);
if (bfqd->bfq_user_max_budget == 0)
- bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+ bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
return ret;
}
--
2.10.0
^ permalink raw reply related
* [PATCH V4 03/16] block, bfq: improve throughput boosting
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
The feedback-loop algorithm used by BFQ to compute queue (process)
budgets is basically a set of three update rules, one for each of the
main reasons why a queue may be expired. If many processes suddenly
switch from sporadic I/O to greedy and sequential I/O, then these
rules are quite slow to assign large budgets to these processes, and
hence to achieve a high throughput. On the opposite side, BFQ assigns
the maximum possible budget B_max to a just-created queue. This allows
a high throughput to be achieved immediately if the associated process
is I/O-bound and performs sequential I/O from the beginning. But it
also increases the worst-case latency experienced by the first
requests issued by the process, because the larger the budget of a
queue waiting for service is, the later the queue will be served by
B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or
soft real-time application.
To tackle these throughput and latency problems, on one hand this
patch changes the initial budget value to B_max/2. On the other hand,
it re-tunes the three rules, adopting a more aggressive,
multiplicative increase/linear decrease scheme. This scheme trades
latency for throughput more than before, and tends to assign large
budgets quickly to processes that are or become I/O-bound. For two of
the expiration reasons, the new version of the rules also contains
some more little improvements, briefly described below.
*No more backlog.* In this case, the budget was larger than the number
of sectors actually read/written by the process before it stopped
doing I/O. Hence, to reduce latency for the possible future I/O
requests of the process, the old rule simply set the next budget to
the number of sectors actually consumed by the process. However, if
there are still outstanding requests, then the process may have not
yet issued its next request just because it is still waiting for the
completion of some of the still outstanding ones. If this sub-case
holds true, then the new rule, instead of decreasing the budget,
doubles it, proactively, in the hope that: 1) a larger budget will fit
the actual needs of the process, and 2) the process is sequential and
hence a higher throughput will be achieved by serving the process
longer after granting it access to the device.
*Budget timeout*. The original rule set the new budget to the maximum
value B_max, to maximize throughput and let all processes experiencing
budget timeouts receive the same share of the device time. In our
experiments we verified that this sudden jump to B_max did not provide
sensible benefits; rather it increased the latency of processes
performing sporadic and short I/O. The new rule only doubles the
budget.
[1] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
block/bfq-iosched.c | 87 +++++++++++++++++++++++++----------------------------
1 file changed, 41 insertions(+), 46 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index af1740a..1edac72 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -752,9 +752,6 @@ static struct kmem_cache *bfq_pool;
#define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
#define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 32/8)
-/* Budget feedback step. */
-#define BFQ_BUDGET_STEP 128
-
/* Min samples used for peak rate estimation (for autotuning). */
#define BFQ_PEAK_RATE_SAMPLES 32
@@ -4074,40 +4071,6 @@ static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
return bfqq;
}
-/*
- * bfq_default_budget - return the default budget for @bfqq on @bfqd.
- * @bfqd: the device descriptor.
- * @bfqq: the queue to consider.
- *
- * We use 3/4 of the @bfqd maximum budget as the default value
- * for the max_budget field of the queues. This lets the feedback
- * mechanism to start from some middle ground, then the behavior
- * of the process will drive the heuristics towards high values, if
- * it behaves as a greedy sequential reader, or towards small values
- * if it shows a more intermittent behavior.
- */
-static unsigned long bfq_default_budget(struct bfq_data *bfqd,
- struct bfq_queue *bfqq)
-{
- unsigned long budget;
-
- /*
- * When we need an estimate of the peak rate we need to avoid
- * to give budgets that are too short due to previous
- * measurements. So, in the first 10 assignments use a
- * ``safe'' budget value. For such first assignment the value
- * of bfqd->budgets_assigned happens to be lower than 194.
- * See __bfq_set_in_service_queue for the formula by which
- * this field is computed.
- */
- if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
- budget = bfq_default_max_budget;
- else
- budget = bfqd->bfq_max_budget;
-
- return budget - budget / 4;
-}
-
static void bfq_arm_slice_timer(struct bfq_data *bfqd)
{
struct bfq_queue *bfqq = bfqd->in_service_queue;
@@ -4232,13 +4195,47 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
* for throughput.
*/
case BFQQE_TOO_IDLE:
- if (budget > min_budget + BFQ_BUDGET_STEP)
- budget -= BFQ_BUDGET_STEP;
- else
- budget = min_budget;
+ /*
+ * This is the only case where we may reduce
+ * the budget: if there is no request of the
+ * process still waiting for completion, then
+ * we assume (tentatively) that the timer has
+ * expired because the batch of requests of
+ * the process could have been served with a
+ * smaller budget. Hence, betting that
+ * process will behave in the same way when it
+ * becomes backlogged again, we reduce its
+ * next budget. As long as we guess right,
+ * this budget cut reduces the latency
+ * experienced by the process.
+ *
+ * However, if there are still outstanding
+ * requests, then the process may have not yet
+ * issued its next request just because it is
+ * still waiting for the completion of some of
+ * the still outstanding ones. So in this
+ * subcase we do not reduce its budget, on the
+ * contrary we increase it to possibly boost
+ * the throughput, as discussed in the
+ * comments to the BUDGET_TIMEOUT case.
+ */
+ if (bfqq->dispatched > 0) /* still outstanding reqs */
+ budget = min(budget * 2, bfqd->bfq_max_budget);
+ else {
+ if (budget > 5 * min_budget)
+ budget -= 4 * min_budget;
+ else
+ budget = min_budget;
+ }
break;
case BFQQE_BUDGET_TIMEOUT:
- budget = bfq_default_budget(bfqd, bfqq);
+ /*
+ * We double the budget here because it gives
+ * the chance to boost the throughput if this
+ * is not a seeky process (and has bumped into
+ * this timeout because of, e.g., ZBR).
+ */
+ budget = min(budget * 2, bfqd->bfq_max_budget);
break;
case BFQQE_BUDGET_EXHAUSTED:
/*
@@ -4250,8 +4247,7 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
* definitely increase the budget of this good
* candidate to boost the disk throughput.
*/
- budget = min(budget + 8 * BFQ_BUDGET_STEP,
- bfqd->bfq_max_budget);
+ budget = min(budget * 4, bfqd->bfq_max_budget);
break;
case BFQQE_NO_MORE_REQUESTS:
/*
@@ -5025,9 +5021,8 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqq->pid = pid;
/* Tentative initial value to trade off between thr and lat */
- bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+ bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
bfqq->budget_timeout = bfq_smallest_from_now();
- bfqq->pid = pid;
/* first request is almost certainly seeky */
bfqq->seek_history = 1;
--
2.10.0
^ permalink raw reply related
* [PATCH V4 02/16] block, bfq: add full hierarchical scheduling and cgroups support
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
From: Arianna Avanzini <avanzini.arianna@gmail.com>
Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.
Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.
Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
Documentation/block/bfq-iosched.txt | 17 +-
block/Kconfig.iosched | 10 +
block/bfq-iosched.c | 2568 ++++++++++++++++++++++++++++++-----
include/linux/blkdev.h | 2 +-
4 files changed, 2213 insertions(+), 384 deletions(-)
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index cbf85f6f..461b27f 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -253,9 +253,14 @@ of slice_idle are copied from CFQ too.
per-process ioprio and weight
-----------------------------
-Unless the cgroups interface is used, weights can be assigned to
-processes only indirectly, through I/O priorities, and according to
-the relation: weight = (IOPRIO_BE_NR - ioprio) * 10.
+Unless the cgroups interface is used (see "4. BFQ group scheduling"),
+weights can be assigned to processes only indirectly, through I/O
+priorities, and according to the relation:
+weight = (IOPRIO_BE_NR - ioprio) * 10.
+
+Beware that, if low-latency is set, then BFQ automatically raises the
+weight of the queues associated with interactive and soft real-time
+applications. Unset this tunable if you need/want to control weights.
slice_idle
----------
@@ -450,9 +455,9 @@ may be reactivated for an already busy async queue (in ms).
4. Group scheduling with BFQ
============================
-BFQ supports both cgroup-v1 and cgroup-v2 io controllers, namely blkio
-and io. In particular, BFQ supports weight-based proportional
-share.
+BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
+blkio and io. In particular, BFQ supports weight-based proportional
+share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
4-1 Service guarantees provided
-------------------------------
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 562e30e..a37cd03 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -40,6 +40,7 @@ config CFQ_GROUP_IOSCHED
Enable group IO scheduling in CFQ.
choice
+
prompt "Default I/O scheduler"
default DEFAULT_CFQ
help
@@ -80,6 +81,15 @@ config IOSCHED_BFQ
real-time applications. Details in
Documentation/block/bfq-iosched.txt
+config BFQ_GROUP_IOSCHED
+ bool "BFQ hierarchical scheduling support"
+ depends on IOSCHED_BFQ && BLK_CGROUP
+ default n
+ ---help---
+
+ Enable hierarchical scheduling in BFQ, using the blkio
+ (cgroups-v1) or io (cgroups-v2) controller.
+
endmenu
endif
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c4e7d8d..af1740a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -90,6 +90,7 @@
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/blkdev.h>
+#include <linux/cgroup.h>
#include <linux/elevator.h>
#include <linux/ktime.h>
#include <linux/rbtree.h>
@@ -114,7 +115,7 @@
#define BFQ_DEFAULT_QUEUE_IOPRIO 4
-#define BFQ_DEFAULT_GRP_WEIGHT 10
+#define BFQ_WEIGHT_LEGACY_DFL 100
#define BFQ_DEFAULT_GRP_IOPRIO 0
#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
@@ -149,10 +150,11 @@ struct bfq_service_tree {
* struct bfq_sched_data - multi-class scheduler.
*
* bfq_sched_data is the basic scheduler queue. It supports three
- * ioprio_classes, and can be used either as a toplevel queue or as
- * an intermediate queue on a hierarchical setup.
- * @next_in_service points to the active entity of the sched_data
- * service trees that will be scheduled next.
+ * ioprio_classes, and can be used either as a toplevel queue or as an
+ * intermediate queue on a hierarchical setup. @next_in_service
+ * points to the active entity of the sched_data service trees that
+ * will be scheduled next. It is used to reduce the number of steps
+ * needed for each hierarchical-schedule update.
*
* The supported ioprio_classes are the same as in CFQ, in descending
* priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
@@ -164,19 +166,23 @@ struct bfq_service_tree {
struct bfq_sched_data {
/* entity in service */
struct bfq_entity *in_service_entity;
- /* head-of-the-line entity in the scheduler */
+ /* head-of-line entity (see comments above) */
struct bfq_entity *next_in_service;
/* array of service trees, one per ioprio_class */
struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
+ /* last time CLASS_IDLE was served */
+ unsigned long bfq_class_idle_last_service;
+
};
/**
* struct bfq_entity - schedulable entity.
*
- * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
- * level scheduler). Each entity belongs to the sched_data of the parent
- * group hierarchy. Non-leaf entities have also their own sched_data,
- * stored in @my_sched_data.
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy. Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
*
* Each entity stores independently its priority values; this would
* allow different weights on different devices, but this
@@ -187,23 +193,24 @@ struct bfq_sched_data {
* update to take place the effective and the requested priority
* values are synchronized.
*
- * The weight value is calculated from the ioprio to export the same
- * interface as CFQ. When dealing with ``well-behaved'' queues (i.e.,
- * queues that do not spend too much time to consume their budget
- * and have true sequential behavior, and when there are no external
- * factors breaking anticipation) the relative weights at each level
- * of the hierarchy should be guaranteed. All the fields are
- * protected by the queue lock of the containing bfqd.
+ * Unless cgroups are used, the weight value is calculated from the
+ * ioprio to export the same interface as CFQ. When dealing with
+ * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * time to consume their budget and have true sequential behavior, and
+ * when there are no external factors breaking anticipation) the
+ * relative weights at each level of the cgroups hierarchy should be
+ * guaranteed. All the fields are protected by the queue lock of the
+ * containing bfqd.
*/
struct bfq_entity {
/* service_tree member */
struct rb_node rb_node;
/*
- * flag, true if the entity is on a tree (either the active or
- * the idle one of its service_tree).
+ * Flag, true if the entity is on a tree (either the active or
+ * the idle one of its service_tree) or is in service.
*/
- int on_st;
+ bool on_st;
/* B-WF2Q+ start and finish timestamps [sectors/weight] */
u64 start, finish;
@@ -246,6 +253,8 @@ struct bfq_entity {
int prio_changed;
};
+struct bfq_group;
+
/**
* struct bfq_ttime - per process thinktime stats.
*/
@@ -265,7 +274,11 @@ struct bfq_ttime {
* struct bfq_queue - leaf schedulable entity.
*
* A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async.
+ * io_context or more, if it is async. @cgroup holds a reference to
+ * the cgroup, to be sure that it does not disappear while a bfqq
+ * still references it (mostly to avoid races between request issuing
+ * and task migration followed by cgroup destruction). All the fields
+ * are protected by the queue lock of the containing bfqd.
*/
struct bfq_queue {
/* reference counter */
@@ -338,6 +351,9 @@ struct bfq_io_cq {
struct bfq_queue *bfqq[2];
/* per (request_queue, blkcg) ioprio */
int ioprio;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ uint64_t blkcg_serial_nr; /* the current blkcg serial */
+#endif
};
/**
@@ -351,8 +367,8 @@ struct bfq_data {
/* dispatch queue */
struct list_head dispatch;
- /* root @bfq_sched_data for the device */
- struct bfq_sched_data sched_data;
+ /* root bfq_group for the device */
+ struct bfq_group *root_group;
/*
* Number of bfq_queues containing requests (including the
@@ -423,8 +439,6 @@ struct bfq_data {
unsigned int bfq_back_max;
/* maximum idling time */
u32 bfq_slice_idle;
- /* last time CLASS_IDLE was served */
- u64 bfq_class_idle_last_service;
/* user-configured max budget value (0 for auto-tuning) */
int bfq_user_max_budget;
@@ -516,8 +530,35 @@ BFQ_BFQQ_FNS(IO_bound);
#undef BFQ_BFQQ_FNS
/* Logging facilities. */
-#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
- blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+static struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
+static struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
+
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
+ char __pbuf[128]; \
+ \
+ blkg_path(bfqg_to_blkg(bfqq_group(bfqq)), __pbuf, sizeof(__pbuf)); \
+ blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid, \
+ bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \
+ __pbuf, ##args); \
+} while (0)
+
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do { \
+ char __pbuf[128]; \
+ \
+ blkg_path(bfqg_to_blkg(bfqg), __pbuf, sizeof(__pbuf)); \
+ blk_add_trace_msg((bfqd)->queue, "%s " fmt, __pbuf, ##args); \
+} while (0)
+
+#else /* CONFIG_BFQ_GROUP_IOSCHED */
+
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
+ blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid, \
+ bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \
+ ##args)
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {} while (0)
+
+#endif /* CONFIG_BFQ_GROUP_IOSCHED */
#define bfq_log(bfqd, fmt, args...) \
blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
@@ -534,15 +575,120 @@ enum bfqq_expiration {
BFQQE_PREEMPTED /* preemption in progress */
};
+struct bfqg_stats {
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ /* number of ios merged */
+ struct blkg_rwstat merged;
+ /* total time spent on device in ns, may not be accurate w/ queueing */
+ struct blkg_rwstat service_time;
+ /* total time spent waiting in scheduler queue in ns */
+ struct blkg_rwstat wait_time;
+ /* number of IOs queued up */
+ struct blkg_rwstat queued;
+ /* total disk time and nr sectors dispatched by this group */
+ struct blkg_stat time;
+ /* sum of number of ios queued across all samples */
+ struct blkg_stat avg_queue_size_sum;
+ /* count of samples taken for average */
+ struct blkg_stat avg_queue_size_samples;
+ /* how many times this group has been removed from service tree */
+ struct blkg_stat dequeue;
+ /* total time spent waiting for it to be assigned a timeslice. */
+ struct blkg_stat group_wait_time;
+ /* time spent idling for this blkcg_gq */
+ struct blkg_stat idle_time;
+ /* total time with empty current active q with other requests queued */
+ struct blkg_stat empty_time;
+ /* fields after this shouldn't be cleared on stat reset */
+ uint64_t start_group_wait_time;
+ uint64_t start_idle_time;
+ uint64_t start_empty_time;
+ uint16_t flags;
+#endif /* CONFIG_BFQ_GROUP_IOSCHED */
+};
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+
+/*
+ * struct bfq_group_data - per-blkcg storage for the blkio subsystem.
+ *
+ * @ps: @blkcg_policy_storage that this structure inherits
+ * @weight: weight of the bfq_group
+ */
+struct bfq_group_data {
+ /* must be the first member */
+ struct blkcg_policy_data pd;
+
+ unsigned short weight;
+};
+
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ * both bfq_queues and bfq_groups).
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ * the group, one queue per ioprio value per ioprio_class,
+ * except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ * to avoid too many special cases during group creation/
+ * migration.
+ * @stats: stats for this bfqg.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ * o @bfqd is protected by the queue lock, RCU is used to access it
+ * from the readers.
+ * o All the other fields are protected by the @bfqd queue lock.
+ */
+struct bfq_group {
+ /* must be the first member */
+ struct blkg_policy_data pd;
+
+ struct bfq_entity entity;
+ struct bfq_sched_data sched_data;
+
+ void *bfqd;
+
+ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+ struct bfq_queue *async_idle_bfqq;
+
+ struct bfq_entity *my_entity;
+
+ struct bfqg_stats stats;
+};
+
+#else
+struct bfq_group {
+ struct bfq_sched_data sched_data;
+
+ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+ struct bfq_queue *async_idle_bfqq;
+
+ struct rb_root rq_pos_tree;
+};
+#endif
+
static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
+static unsigned int bfq_class_idx(struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+ return bfqq ? bfqq->ioprio_class - 1 :
+ BFQ_DEFAULT_GRP_CLASS - 1;
+}
+
static struct bfq_service_tree *
bfq_entity_service_tree(struct bfq_entity *entity)
{
struct bfq_sched_data *sched_data = entity->sched_data;
- struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
- unsigned int idx = bfqq ? bfqq->ioprio_class - 1 :
- BFQ_DEFAULT_GRP_CLASS - 1;
+ unsigned int idx = bfq_class_idx(entity);
return sched_data->service_tree + idx;
}
@@ -568,16 +714,9 @@ static void bfq_put_queue(struct bfq_queue *bfqq);
static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
struct bio *bio, bool is_sync,
struct bfq_io_cq *bic);
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
-/*
- * Array of async queues for all the processes, one queue
- * per ioprio value per ioprio_class.
- */
-struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
-/* Async queue for the idle class (ioprio is ignored) */
-struct bfq_queue *async_idle_bfqq;
-
/* Expiration time of sync (0) and async (1) requests, in ns. */
static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };
@@ -663,30 +802,222 @@ static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
}
/*
- * Next two macros are just fake loops for the moment. They will
- * become true loops in the cgroups-enabled variant of the code. Such
- * a variant, in its turn, will be introduced by next commit.
+ * Scheduler run of queue, if there are requests pending and no one in the
+ * driver that will restart queueing.
+ */
+static void bfq_schedule_dispatch(struct bfq_data *bfqd)
+{
+ if (bfqd->queued != 0) {
+ bfq_log(bfqd, "schedule dispatch");
+ blk_mq_run_hw_queues(bfqd->queue, true);
+ }
+}
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static int bfq_gt(u64 a, u64 b)
+{
+ return (s64)(a - b) > 0;
+}
+
+static struct bfq_entity *bfq_root_active_entity(struct rb_root *tree)
+{
+ struct rb_node *node = tree->rb_node;
+
+ return rb_entry(node, struct bfq_entity, rb_node);
+}
+
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd);
+
+static bool bfq_update_parent_budget(struct bfq_entity *next_in_service);
+
+/**
+ * bfq_update_next_in_service - update sd->next_in_service
+ * @sd: sched_data for which to perform the update.
+ * @new_entity: if not NULL, pointer to the entity whose activation,
+ * requeueing or repositionig triggered the invocation of
+ * this function.
+ *
+ * This function is called to update sd->next_in_service, which, in
+ * its turn, may change as a consequence of the insertion or
+ * extraction of an entity into/from one of the active trees of
+ * sd. These insertions/extractions occur as a consequence of
+ * activations/deactivations of entities, with some activations being
+ * 'true' activations, and other activations being requeueings (i.e.,
+ * implementing the second, requeueing phase of the mechanism used to
+ * reposition an entity in its active tree; see comments on
+ * __bfq_activate_entity and __bfq_requeue_entity for details). In
+ * both the last two activation sub-cases, new_entity points to the
+ * just activated or requeued entity.
+ *
+ * Returns true if sd->next_in_service changes in such a way that
+ * entity->parent may become the next_in_service for its parent
+ * entity.
*/
+static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
+ struct bfq_entity *new_entity)
+{
+ struct bfq_entity *next_in_service = sd->next_in_service;
+ bool parent_sched_may_change = false;
+
+ /*
+ * If this update is triggered by the activation, requeueing
+ * or repositiong of an entity that does not coincide with
+ * sd->next_in_service, then a full lookup in the active tree
+ * can be avoided. In fact, it is enough to check whether the
+ * just-modified entity has a higher priority than
+ * sd->next_in_service, or, even if it has the same priority
+ * as sd->next_in_service, is eligible and has a lower virtual
+ * finish time than sd->next_in_service. If this compound
+ * condition holds, then the new entity becomes the new
+ * next_in_service. Otherwise no change is needed.
+ */
+ if (new_entity && new_entity != sd->next_in_service) {
+ /*
+ * Flag used to decide whether to replace
+ * sd->next_in_service with new_entity. Tentatively
+ * set to true, and left as true if
+ * sd->next_in_service is NULL.
+ */
+ bool replace_next = true;
+
+ /*
+ * If there is already a next_in_service candidate
+ * entity, then compare class priorities or timestamps
+ * to decide whether to replace sd->service_tree with
+ * new_entity.
+ */
+ if (next_in_service) {
+ unsigned int new_entity_class_idx =
+ bfq_class_idx(new_entity);
+ struct bfq_service_tree *st =
+ sd->service_tree + new_entity_class_idx;
+
+ /*
+ * For efficiency, evaluate the most likely
+ * sub-condition first.
+ */
+ replace_next =
+ (new_entity_class_idx ==
+ bfq_class_idx(next_in_service)
+ &&
+ !bfq_gt(new_entity->start, st->vtime)
+ &&
+ bfq_gt(next_in_service->finish,
+ new_entity->finish))
+ ||
+ new_entity_class_idx <
+ bfq_class_idx(next_in_service);
+ }
+
+ if (replace_next)
+ next_in_service = new_entity;
+ } else /* invoked because of a deactivation: lookup needed */
+ next_in_service = bfq_lookup_next_entity(sd);
+
+ if (next_in_service) {
+ parent_sched_may_change = !sd->next_in_service ||
+ bfq_update_parent_budget(next_in_service);
+ }
+
+ sd->next_in_service = next_in_service;
+
+ if (!next_in_service)
+ return parent_sched_may_change;
+
+ return parent_sched_may_change;
+}
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+/* both next loops stop at one of the child entities of the root group */
#define for_each_entity(entity) \
- for (; entity ; entity = NULL)
+ for (; entity ; entity = entity->parent)
+/*
+ * For each iteration, compute parent in advance, so as to be safe if
+ * entity is deallocated during the iteration. Such a deallocation may
+ * happen as a consequence of a bfq_put_queue that frees the bfq_queue
+ * containing entity.
+ */
#define for_each_entity_safe(entity, parent) \
- for (parent = NULL; entity ; entity = parent)
+ for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
-static int bfq_update_next_in_service(struct bfq_sched_data *sd)
+/*
+ * Returns true if this budget changes may let next_in_service->parent
+ * become the next_in_service entity for its parent entity.
+ */
+static bool bfq_update_parent_budget(struct bfq_entity *next_in_service)
{
- return 0;
+ struct bfq_entity *bfqg_entity;
+ struct bfq_group *bfqg;
+ struct bfq_sched_data *group_sd;
+ bool ret = false;
+
+ group_sd = next_in_service->sched_data;
+
+ bfqg = container_of(group_sd, struct bfq_group, sched_data);
+ /*
+ * bfq_group's my_entity field is not NULL only if the group
+ * is not the root group. We must not touch the root entity
+ * as it must never become an in-service entity.
+ */
+ bfqg_entity = bfqg->my_entity;
+ if (bfqg_entity) {
+ if (bfqg_entity->budget > next_in_service->budget)
+ ret = true;
+ bfqg_entity->budget = next_in_service->budget;
+ }
+
+ return ret;
+}
+
+/*
+ * This function tells whether entity stops being a candidate for next
+ * service, according to the following logic.
+ *
+ * This function is invoked for an entity that is about to be set in
+ * service. If such an entity is a queue, then the entity is no longer
+ * a candidate for next service (i.e, a candidate entity to serve
+ * after the in-service entity is expired). The function then returns
+ * true.
+ */
+static bool bfq_no_longer_next_in_service(struct bfq_entity *entity)
+{
+ if (bfq_entity_to_bfqq(entity))
+ return true;
+
+ return false;
}
-static void bfq_check_next_in_service(struct bfq_sched_data *sd,
- struct bfq_entity *entity)
+#else /* CONFIG_BFQ_GROUP_IOSCHED */
+/*
+ * Next two macros are fake loops when cgroups support is not
+ * enabled. I fact, in such a case, there is only one level to go up
+ * (to reach the root group).
+ */
+#define for_each_entity(entity) \
+ for (; entity ; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+ for (parent = NULL; entity ; entity = parent)
+
+static bool bfq_update_parent_budget(struct bfq_entity *next_in_service)
{
+ return false;
}
-static void bfq_update_budget(struct bfq_entity *next_in_service)
+static bool bfq_no_longer_next_in_service(struct bfq_entity *entity)
{
+ return true;
}
+#endif /* CONFIG_BFQ_GROUP_IOSCHED */
+
/*
* Shift for timestamp calculations. This actually limits the maximum
* service allowed in one timestamp delta (small shift values increase it),
@@ -696,18 +1027,6 @@ static void bfq_update_budget(struct bfq_entity *next_in_service)
*/
#define WFQ_SERVICE_SHIFT 22
-/**
- * bfq_gt - compare two timestamps.
- * @a: first ts.
- * @b: second ts.
- *
- * Return @a > @b, dealing with wrapping correctly.
- */
-static int bfq_gt(u64 a, u64 b)
-{
- return (s64)(a - b) > 0;
-}
-
static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
{
struct bfq_queue *bfqq = NULL;
@@ -926,6 +1245,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
{
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
struct rb_node *node = &entity->rb_node;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ struct bfq_sched_data *sd = NULL;
+ struct bfq_group *bfqg = NULL;
+ struct bfq_data *bfqd = NULL;
+#endif
bfq_insert(&st->active, entity);
@@ -936,6 +1260,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
bfq_update_active_tree(node);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ sd = entity->sched_data;
+ bfqg = container_of(sd, struct bfq_group, sched_data);
+ bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
if (bfqq)
list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
}
@@ -1014,6 +1343,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
{
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
struct rb_node *node;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ struct bfq_sched_data *sd = NULL;
+ struct bfq_group *bfqg = NULL;
+ struct bfq_data *bfqd = NULL;
+#endif
node = bfq_find_deepest(&entity->rb_node);
bfq_extract(&st->active, entity);
@@ -1021,6 +1355,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
if (node)
bfq_update_active_tree(node);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ sd = entity->sched_data;
+ bfqg = container_of(sd, struct bfq_group, sched_data);
+ bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
if (bfqq)
list_del(&bfqq->bfqq_list);
}
@@ -1069,7 +1408,7 @@ static void bfq_forget_entity(struct bfq_service_tree *st,
{
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
- entity->on_st = 0;
+ entity->on_st = false;
st->wsum -= entity->weight;
if (bfqq && !is_in_service)
bfq_put_queue(bfqq);
@@ -1115,7 +1454,7 @@ static void bfq_forget_idle(struct bfq_service_tree *st)
static struct bfq_service_tree *
__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
- struct bfq_entity *entity)
+ struct bfq_entity *entity)
{
struct bfq_service_tree *new_st = old_st;
@@ -1123,9 +1462,20 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
unsigned short prev_weight, new_weight;
struct bfq_data *bfqd = NULL;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ struct bfq_sched_data *sd;
+ struct bfq_group *bfqg;
+#endif
if (bfqq)
bfqd = bfqq->bfqd;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ else {
+ sd = entity->my_sched_data;
+ bfqg = container_of(sd, struct bfq_group, sched_data);
+ bfqd = (struct bfq_data *)bfqg->bfqd;
+ }
+#endif
old_st->wsum -= entity->weight;
@@ -1171,6 +1521,9 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
return new_st;
}
+static void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg);
+static struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
+
/**
* bfq_bfqq_served - update the scheduler status after selection for
* service.
@@ -1194,6 +1547,7 @@ static void bfq_bfqq_served(struct bfq_queue *bfqq, int served)
st->vtime += bfq_delta(served, st->wsum);
bfq_forget_idle(st);
}
+ bfqg_stats_set_start_empty_time(bfqq_group(bfqq));
bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %d secs", served);
}
@@ -1216,78 +1570,10 @@ static void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
bfq_bfqq_served(bfqq, entity->budget - entity->service);
}
-/**
- * __bfq_activate_entity - activate an entity.
- * @entity: the entity being activated.
- * @non_blocking_wait_rq: true if this entity was waiting for a request
- *
- * Called whenever an entity is activated, i.e., it is not active and one
- * of its children receives a new request, or has to be reactivated due to
- * budget exhaustion. It uses the current budget of the entity (and the
- * service received if @entity is active) of the queue to calculate its
- * timestamps.
- */
-static void __bfq_activate_entity(struct bfq_entity *entity,
- bool non_blocking_wait_rq)
+static void bfq_update_fin_time_enqueue(struct bfq_entity *entity,
+ struct bfq_service_tree *st,
+ bool backshifted)
{
- struct bfq_sched_data *sd = entity->sched_data;
- struct bfq_service_tree *st = bfq_entity_service_tree(entity);
- bool backshifted = false;
-
- if (entity == sd->in_service_entity) {
- /*
- * If we are requeueing the current entity we have
- * to take care of not charging to it service it has
- * not received.
- */
- bfq_calc_finish(entity, entity->service);
- entity->start = entity->finish;
- sd->in_service_entity = NULL;
- } else if (entity->tree == &st->active) {
- /*
- * Requeueing an entity due to a change of some
- * next_in_service entity below it. We reuse the
- * old start time.
- */
- bfq_active_extract(st, entity);
- } else {
- unsigned long long min_vstart;
-
- /* See comments on bfq_fqq_update_budg_for_activation */
- if (non_blocking_wait_rq && bfq_gt(st->vtime, entity->finish)) {
- backshifted = true;
- min_vstart = entity->finish;
- } else
- min_vstart = st->vtime;
-
- if (entity->tree == &st->idle) {
- /*
- * Must be on the idle tree, bfq_idle_extract() will
- * check for that.
- */
- bfq_idle_extract(st, entity);
- entity->start = bfq_gt(min_vstart, entity->finish) ?
- min_vstart : entity->finish;
- } else {
- /*
- * The finish time of the entity may be invalid, and
- * it is in the past for sure, otherwise the queue
- * would have been on the idle tree.
- */
- entity->start = min_vstart;
- st->wsum += entity->weight;
- /*
- * entity is about to be inserted into a service tree,
- * and then set in service: get a reference to make
- * sure entity does not disappear until it is no
- * longer in service or scheduled for service.
- */
- bfq_get_entity(entity);
-
- entity->on_st = 1;
- }
- }
-
st = __bfq_entity_update_weight_prio(st, entity);
bfq_calc_finish(entity, entity->budget);
@@ -1329,27 +1615,185 @@ static void __bfq_activate_entity(struct bfq_entity *entity,
}
/**
- * bfq_activate_entity - activate an entity and its ancestors if necessary.
+ * __bfq_activate_entity - handle activation of entity.
+ * @entity: the entity being activated.
+ * @non_blocking_wait_rq: true if entity was waiting for a request
+ *
+ * Called for a 'true' activation, i.e., if entity is not active and
+ * one of its children receives a new request.
+ *
+ * Basically, this function updates the timestamps of entity and
+ * inserts entity into its active tree, ater possible extracting it
+ * from its idle tree.
+ */
+static void __bfq_activate_entity(struct bfq_entity *entity,
+ bool non_blocking_wait_rq)
+{
+ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+ bool backshifted = false;
+ unsigned long long min_vstart;
+
+ /* See comments on bfq_fqq_update_budg_for_activation */
+ if (non_blocking_wait_rq && bfq_gt(st->vtime, entity->finish)) {
+ backshifted = true;
+ min_vstart = entity->finish;
+ } else
+ min_vstart = st->vtime;
+
+ if (entity->tree == &st->idle) {
+ /*
+ * Must be on the idle tree, bfq_idle_extract() will
+ * check for that.
+ */
+ bfq_idle_extract(st, entity);
+ entity->start = bfq_gt(min_vstart, entity->finish) ?
+ min_vstart : entity->finish;
+ } else {
+ /*
+ * The finish time of the entity may be invalid, and
+ * it is in the past for sure, otherwise the queue
+ * would have been on the idle tree.
+ */
+ entity->start = min_vstart;
+ st->wsum += entity->weight;
+ /*
+ * entity is about to be inserted into a service tree,
+ * and then set in service: get a reference to make
+ * sure entity does not disappear until it is no
+ * longer in service or scheduled for service.
+ */
+ bfq_get_entity(entity);
+
+ entity->on_st = true;
+ }
+
+ bfq_update_fin_time_enqueue(entity, st, backshifted);
+}
+
+/**
+ * __bfq_requeue_entity - handle requeueing or repositioning of an entity.
+ * @entity: the entity being requeued or repositioned.
+ *
+ * Requeueing is needed if this entity stops being served, which
+ * happens if a leaf descendant entity has expired. On the other hand,
+ * repositioning is needed if the next_inservice_entity for the child
+ * entity has changed. See the comments inside the function for
+ * details.
+ *
+ * Basically, this function: 1) removes entity from its active tree if
+ * present there, 2) updates the timestamps of entity and 3) inserts
+ * entity back into its active tree (in the new, right position for
+ * the new values of the timestamps).
+ */
+static void __bfq_requeue_entity(struct bfq_entity *entity)
+{
+ struct bfq_sched_data *sd = entity->sched_data;
+ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+
+ if (entity == sd->in_service_entity) {
+ /*
+ * We are requeueing the current in-service entity,
+ * which may have to be done for one of the following
+ * reasons:
+ * - entity represents the in-service queue, and the
+ * in-service queue is being requeued after an
+ * expiration;
+ * - entity represents a group, and its budget has
+ * changed because one of its child entities has
+ * just been either activated or requeued for some
+ * reason; the timestamps of the entity need then to
+ * be updated, and the entity needs to be enqueued
+ * or repositioned accordingly.
+ *
+ * In particular, before requeueing, the start time of
+ * the entity must be moved forward to account for the
+ * service that the entity has received while in
+ * service. This is done by the next instructions. The
+ * finish time will then be updated according to this
+ * new value of the start time, and to the budget of
+ * the entity.
+ */
+ bfq_calc_finish(entity, entity->service);
+ entity->start = entity->finish;
+ /*
+ * In addition, if the entity had more than one child
+ * when set in service, then was not extracted from
+ * the active tree. This implies that the position of
+ * the entity in the active tree may need to be
+ * changed now, because we have just updated the start
+ * time of the entity, and we will update its finish
+ * time in a moment (the requeueing is then, more
+ * precisely, a repositioning in this case). To
+ * implement this repositioning, we: 1) dequeue the
+ * entity here, 2) update the finish time and
+ * requeue the entity according to the new
+ * timestamps below.
+ */
+ if (entity->tree)
+ bfq_active_extract(st, entity);
+ } else { /* The entity is already active, and not in service */
+ /*
+ * In this case, this function gets called only if the
+ * next_in_service entity below this entity has
+ * changed, and this change has caused the budget of
+ * this entity to change, which, finally implies that
+ * the finish time of this entity must be
+ * updated. Such an update may cause the scheduling,
+ * i.e., the position in the active tree, of this
+ * entity to change. We handle this change by: 1)
+ * dequeueing the entity here, 2) updating the finish
+ * time and requeueing the entity according to the new
+ * timestamps below. This is the same approach as the
+ * non-extracted-entity sub-case above.
+ */
+ bfq_active_extract(st, entity);
+ }
+
+ bfq_update_fin_time_enqueue(entity, st, false);
+}
+
+static void __bfq_activate_requeue_entity(struct bfq_entity *entity,
+ struct bfq_sched_data *sd,
+ bool non_blocking_wait_rq)
+{
+ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+
+ if (sd->in_service_entity == entity || entity->tree == &st->active)
+ /*
+ * in service or already queued on the active tree,
+ * requeue or reposition
+ */
+ __bfq_requeue_entity(entity);
+ else
+ /*
+ * Not in service and not queued on its active tree:
+ * the activity is idle and this is a true activation.
+ */
+ __bfq_activate_entity(entity, non_blocking_wait_rq);
+}
+
+
+/**
+ * bfq_activate_entity - activate or requeue an entity representing a bfq_queue,
+ * and activate, requeue or reposition all ancestors
+ * for which such an update becomes necessary.
* @entity: the entity to activate.
* @non_blocking_wait_rq: true if this entity was waiting for a request
- *
- * Activate @entity and all the entities on the path from it to the root.
+ * @requeue: true if this is a requeue, which implies that bfqq is
+ * being expired; thus ALL its ancestors stop being served and must
+ * therefore be requeued
*/
-static void bfq_activate_entity(struct bfq_entity *entity,
- bool non_blocking_wait_rq)
+static void bfq_activate_requeue_entity(struct bfq_entity *entity,
+ bool non_blocking_wait_rq,
+ bool requeue)
{
struct bfq_sched_data *sd;
for_each_entity(entity) {
- __bfq_activate_entity(entity, non_blocking_wait_rq);
-
sd = entity->sched_data;
- if (!bfq_update_next_in_service(sd))
- /*
- * No need to propagate the activation to the
- * upper entities, as they will be updated when
- * the in-service entity is rescheduled.
- */
+ __bfq_activate_requeue_entity(entity, sd, non_blocking_wait_rq);
+
+ if (!bfq_update_next_in_service(sd, entity) && !requeue)
break;
}
}
@@ -1357,52 +1801,48 @@ static void bfq_activate_entity(struct bfq_entity *entity,
/**
* __bfq_deactivate_entity - deactivate an entity from its service tree.
* @entity: the entity to deactivate.
- * @requeue: if false, the entity will not be put into the idle tree.
+ * @ins_into_idle_tree: if false, the entity will not be put into the
+ * idle tree.
*
- * Deactivate an entity, independently from its previous state. If the
- * entity was not on a service tree just return, otherwise if it is on
- * any scheduler tree, extract it from that tree, and if necessary
- * and if the caller did not specify @requeue, put it on the idle tree.
- *
- * Return %1 if the caller should update the entity hierarchy, i.e.,
- * if the entity was in service or if it was the next_in_service for
- * its sched_data; return %0 otherwise.
+ * Deactivates an entity, independently from its previous state. Must
+ * be invoked only if entity is on a service tree. Extracts the entity
+ * from that tree, and if necessary and allowed, puts it on the idle
+ * tree.
*/
-static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+static bool __bfq_deactivate_entity(struct bfq_entity *entity,
+ bool ins_into_idle_tree)
{
struct bfq_sched_data *sd = entity->sched_data;
struct bfq_service_tree *st = bfq_entity_service_tree(entity);
int is_in_service = entity == sd->in_service_entity;
- int ret = 0;
- if (!entity->on_st)
- return 0;
+ if (!entity->on_st) /* entity never activated, or already inactive */
+ return false;
- if (is_in_service) {
+ if (is_in_service)
bfq_calc_finish(entity, entity->service);
- sd->in_service_entity = NULL;
- } else if (entity->tree == &st->active)
+
+ if (entity->tree == &st->active)
bfq_active_extract(st, entity);
- else if (entity->tree == &st->idle)
+ else if (!is_in_service && entity->tree == &st->idle)
bfq_idle_extract(st, entity);
- if (is_in_service || sd->next_in_service == entity)
- ret = bfq_update_next_in_service(sd);
-
- if (!requeue || !bfq_gt(entity->finish, st->vtime))
+ if (!ins_into_idle_tree || !bfq_gt(entity->finish, st->vtime))
bfq_forget_entity(st, entity, is_in_service);
else
bfq_idle_insert(st, entity);
- return ret;
+ return true;
}
/**
- * bfq_deactivate_entity - deactivate an entity.
+ * bfq_deactivate_entity - deactivate an entity representing a bfq_queue.
* @entity: the entity to deactivate.
- * @requeue: true if the entity can be put on the idle tree
+ * @ins_into_idle_tree: true if the entity can be put on the idle tree
*/
-static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+static void bfq_deactivate_entity(struct bfq_entity *entity,
+ bool ins_into_idle_tree,
+ bool expiration)
{
struct bfq_sched_data *sd;
struct bfq_entity *parent = NULL;
@@ -1410,63 +1850,102 @@ static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
for_each_entity_safe(entity, parent) {
sd = entity->sched_data;
- if (!__bfq_deactivate_entity(entity, requeue))
+ if (!__bfq_deactivate_entity(entity, ins_into_idle_tree)) {
/*
- * The parent entity is still backlogged, and
- * we don't need to update it as it is still
- * in service.
+ * entity is not in any tree any more, so
+ * this deactivation is a no-op, and there is
+ * nothing to change for upper-level entities
+ * (in case of expiration, this can never
+ * happen).
*/
- break;
+ return;
+ }
+
+ if (sd->next_in_service == entity)
+ /*
+ * entity was the next_in_service entity,
+ * then, since entity has just been
+ * deactivated, a new one must be found.
+ */
+ bfq_update_next_in_service(sd, NULL);
if (sd->next_in_service)
/*
- * The parent entity is still backlogged and
- * the budgets on the path towards the root
- * need to be updated.
+ * The parent entity is still backlogged,
+ * because next_in_service is not NULL. So, no
+ * further upwards deactivation must be
+ * performed. Yet, next_in_service has
+ * changed. Then the schedule does need to be
+ * updated upwards.
*/
- goto update;
+ break;
/*
- * If we get here, then the parent is no more backlogged and
- * we want to propagate the deactivation upwards.
+ * If we get here, then the parent is no more
+ * backlogged and we need to propagate the
+ * deactivation upwards. Thus let the loop go on.
*/
- requeue = 1;
- }
- return;
+ /*
+ * Also let parent be queued into the idle tree on
+ * deactivation, to preserve service guarantees, and
+ * assuming that who invoked this function does not
+ * need parent entities too to be removed completely.
+ */
+ ins_into_idle_tree = true;
+ }
-update:
+ /*
+ * If the deactivation loop is fully executed, then there are
+ * no more entities to touch and next loop is not executed at
+ * all. Otherwise, requeue remaining entities if they are
+ * about to stop receiving service, or reposition them if this
+ * is not the case.
+ */
entity = parent;
for_each_entity(entity) {
- __bfq_activate_entity(entity, false);
+ /*
+ * Invoke __bfq_requeue_entity on entity, even if
+ * already active, to requeue/reposition it in the
+ * active tree (because sd->next_in_service has
+ * changed)
+ */
+ __bfq_requeue_entity(entity);
sd = entity->sched_data;
- if (!bfq_update_next_in_service(sd))
+ if (!bfq_update_next_in_service(sd, entity) &&
+ !expiration)
+ /*
+ * next_in_service unchanged or not causing
+ * any change in entity->parent->sd, and no
+ * requeueing needed for expiration: stop
+ * here.
+ */
break;
}
}
/**
- * bfq_update_vtime - update vtime if necessary.
+ * bfq_calc_vtime_jump - compute the value to which the vtime should jump,
+ * if needed, to have at least one entity eligible.
* @st: the service tree to act upon.
*
- * If necessary update the service tree vtime to have at least one
- * eligible entity, skipping to its start time. Assumes that the
- * active tree of the device is not empty.
- *
- * NOTE: this hierarchical implementation updates vtimes quite often,
- * we may end up with reactivated processes getting timestamps after a
- * vtime skip done because we needed a ->first_active entity on some
- * intermediate node.
+ * Assumes that st is not empty.
*/
-static void bfq_update_vtime(struct bfq_service_tree *st)
+static u64 bfq_calc_vtime_jump(struct bfq_service_tree *st)
{
- struct bfq_entity *entry;
- struct rb_node *node = st->active.rb_node;
+ struct bfq_entity *root_entity = bfq_root_active_entity(&st->active);
+
+ if (bfq_gt(root_entity->min_start, st->vtime))
+ return root_entity->min_start;
+
+ return st->vtime;
+}
- entry = rb_entry(node, struct bfq_entity, rb_node);
- if (bfq_gt(entry->min_start, st->vtime)) {
- st->vtime = entry->min_start;
+static void bfq_update_vtime(struct bfq_service_tree *st, u64 new_value)
+{
+ if (new_value > st->vtime) {
+ st->vtime = new_value;
bfq_forget_idle(st);
}
}
@@ -1475,6 +1954,7 @@ static void bfq_update_vtime(struct bfq_service_tree *st)
* bfq_first_active_entity - find the eligible entity with
* the smallest finish time
* @st: the service tree to select from.
+ * @vtime: the system virtual to use as a reference for eligibility
*
* This function searches the first schedulable entity, starting from the
* root of the tree and going on the left every time on this side there is
@@ -1482,7 +1962,8 @@ static void bfq_update_vtime(struct bfq_service_tree *st)
* the right is followed only if a) the left subtree contains no eligible
* entities and b) no eligible entity has been found yet.
*/
-static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
+static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st,
+ u64 vtime)
{
struct bfq_entity *entry, *first = NULL;
struct rb_node *node = st->active.rb_node;
@@ -1490,13 +1971,13 @@ static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
while (node) {
entry = rb_entry(node, struct bfq_entity, rb_node);
left:
- if (!bfq_gt(entry->start, st->vtime))
+ if (!bfq_gt(entry->start, vtime))
first = entry;
if (node->rb_left) {
entry = rb_entry(node->rb_left,
struct bfq_entity, rb_node);
- if (!bfq_gt(entry->min_start, st->vtime)) {
+ if (!bfq_gt(entry->min_start, vtime)) {
node = node->rb_left;
goto left;
}
@@ -1506,204 +1987,1423 @@ static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
node = node->rb_right;
}
- return first;
+ return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * If there is no in-service entity for the sched_data st belongs to,
+ * then return the entity that will be set in service if:
+ * 1) the parent entity this st belongs to is set in service;
+ * 2) no entity belonging to such parent entity undergoes a state change
+ * that would influence the timestamps of the entity (e.g., becomes idle,
+ * becomes backlogged, changes its budget, ...).
+ *
+ * In this first case, update the virtual time in @st too (see the
+ * comments on this update inside the function).
+ *
+ * In constrast, if there is an in-service entity, then return the
+ * entity that would be set in service if not only the above
+ * conditions, but also the next one held true: the currently
+ * in-service entity, on expiration,
+ * 1) gets a finish time equal to the current one, or
+ * 2) is not eligible any more, or
+ * 3) is idle.
+ */
+static struct bfq_entity *
+__bfq_lookup_next_entity(struct bfq_service_tree *st, bool in_service)
+{
+ struct bfq_entity *entity;
+ u64 new_vtime;
+
+ if (RB_EMPTY_ROOT(&st->active))
+ return NULL;
+
+ /*
+ * Get the value of the system virtual time for which at
+ * least one entity is eligible.
+ */
+ new_vtime = bfq_calc_vtime_jump(st);
+
+ /*
+ * If there is no in-service entity for the sched_data this
+ * active tree belongs to, then push the system virtual time
+ * up to the value that guarantees that at least one entity is
+ * eligible. If, instead, there is an in-service entity, then
+ * do not make any such update, because there is already an
+ * eligible entity, namely the in-service one (even if the
+ * entity is not on st, because it was extracted when set in
+ * service).
+ */
+ if (!in_service)
+ bfq_update_vtime(st, new_vtime);
+
+ entity = bfq_first_active_entity(st, new_vtime);
+
+ return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ *
+ * This function is invoked when there has been a change in the trees
+ * for sd, and we need know what is the new next entity after this
+ * change.
+ */
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd)
+{
+ struct bfq_service_tree *st = sd->service_tree;
+ struct bfq_service_tree *idle_class_st = st + (BFQ_IOPRIO_CLASSES - 1);
+ struct bfq_entity *entity = NULL;
+ int class_idx = 0;
+
+ /*
+ * Choose from idle class, if needed to guarantee a minimum
+ * bandwidth to this class (and if there is some active entity
+ * in idle class). This should also mitigate
+ * priority-inversion problems in case a low priority task is
+ * holding file system resources.
+ */
+ if (time_is_before_jiffies(sd->bfq_class_idle_last_service +
+ BFQ_CL_IDLE_TIMEOUT)) {
+ if (!RB_EMPTY_ROOT(&idle_class_st->active))
+ class_idx = BFQ_IOPRIO_CLASSES - 1;
+ /* About to be served if backlogged, or not yet backlogged */
+ sd->bfq_class_idle_last_service = jiffies;
+ }
+
+ /*
+ * Find the next entity to serve for the highest-priority
+ * class, unless the idle class needs to be served.
+ */
+ for (; class_idx < BFQ_IOPRIO_CLASSES; class_idx++) {
+ entity = __bfq_lookup_next_entity(st + class_idx,
+ sd->in_service_entity);
+
+ if (entity)
+ break;
+ }
+
+ if (!entity)
+ return NULL;
+
+ return entity;
+}
+
+static bool next_queue_may_preempt(struct bfq_data *bfqd)
+{
+ struct bfq_sched_data *sd = &bfqd->root_group->sched_data;
+
+ return sd->next_in_service != sd->in_service_entity;
+}
+
+/*
+ * Get next queue for service.
+ */
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+{
+ struct bfq_entity *entity = NULL;
+ struct bfq_sched_data *sd;
+ struct bfq_queue *bfqq;
+
+ if (bfqd->busy_queues == 0)
+ return NULL;
+
+ /*
+ * Traverse the path from the root to the leaf entity to
+ * serve. Set in service all the entities visited along the
+ * way.
+ */
+ sd = &bfqd->root_group->sched_data;
+ for (; sd ; sd = entity->my_sched_data) {
+ /*
+ * WARNING. We are about to set the in-service entity
+ * to sd->next_in_service, i.e., to the (cached) value
+ * returned by bfq_lookup_next_entity(sd) the last
+ * time it was invoked, i.e., the last time when the
+ * service order in sd changed as a consequence of the
+ * activation or deactivation of an entity. In this
+ * respect, if we execute bfq_lookup_next_entity(sd)
+ * in this very moment, it may, although with low
+ * probability, yield a different entity than that
+ * pointed to by sd->next_in_service. This rare event
+ * happens in case there was no CLASS_IDLE entity to
+ * serve for sd when bfq_lookup_next_entity(sd) was
+ * invoked for the last time, while there is now one
+ * such entity.
+ *
+ * If the above event happens, then the scheduling of
+ * such entity in CLASS_IDLE is postponed until the
+ * service of the sd->next_in_service entity
+ * finishes. In fact, when the latter is expired,
+ * bfq_lookup_next_entity(sd) gets called again,
+ * exactly to update sd->next_in_service.
+ */
+
+ /* Make next_in_service entity become in_service_entity */
+ entity = sd->next_in_service;
+ sd->in_service_entity = entity;
+
+ /*
+ * Reset the accumulator of the amount of service that
+ * the entity is about to receive.
+ */
+ entity->service = 0;
+
+ /*
+ * If entity is no longer a candidate for next
+ * service, then we extract it from its active tree,
+ * for the following reason. To further boost the
+ * throughput in some special case, BFQ needs to know
+ * which is the next candidate entity to serve, while
+ * there is already an entity in service. In this
+ * respect, to make it easy to compute/update the next
+ * candidate entity to serve after the current
+ * candidate has been set in service, there is a case
+ * where it is necessary to extract the current
+ * candidate from its service tree. Such a case is
+ * when the entity just set in service cannot be also
+ * a candidate for next service. Details about when
+ * this conditions holds are reported in the comments
+ * on the function bfq_no_longer_next_in_service()
+ * invoked below.
+ */
+ if (bfq_no_longer_next_in_service(entity))
+ bfq_active_extract(bfq_entity_service_tree(entity),
+ entity);
+
+ /*
+ * For the same reason why we may have just extracted
+ * entity from its active tree, we may need to update
+ * next_in_service for the sched_data of entity too,
+ * regardless of whether entity has been extracted.
+ * In fact, even if entity has not been extracted, a
+ * descendant entity may get extracted. Such an event
+ * would cause a change in next_in_service for the
+ * level of the descendant entity, and thus possibly
+ * back to upper levels.
+ *
+ * We cannot perform the resulting needed update
+ * before the end of this loop, because, to know which
+ * is the correct next-to-serve candidate entity for
+ * each level, we need first to find the leaf entity
+ * to set in service. In fact, only after we know
+ * which is the next-to-serve leaf entity, we can
+ * discover whether the parent entity of the leaf
+ * entity becomes the next-to-serve, and so on.
+ */
+
+ }
+
+ bfqq = bfq_entity_to_bfqq(entity);
+
+ /*
+ * We can finally update all next-to-serve entities along the
+ * path from the leaf entity just set in service to the root.
+ */
+ for_each_entity(entity) {
+ struct bfq_sched_data *sd = entity->sched_data;
+
+ if (!bfq_update_next_in_service(sd, NULL))
+ break;
+ }
+
+ return bfqq;
+}
+
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+{
+ struct bfq_queue *in_serv_bfqq = bfqd->in_service_queue;
+ struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
+ struct bfq_entity *entity = in_serv_entity;
+
+ if (bfqd->in_service_bic) {
+ put_io_context(bfqd->in_service_bic->icq.ioc);
+ bfqd->in_service_bic = NULL;
+ }
+
+ bfq_clear_bfqq_wait_request(in_serv_bfqq);
+ hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+ bfqd->in_service_queue = NULL;
+
+ /*
+ * When this function is called, all in-service entities have
+ * been properly deactivated or requeued, so we can safely
+ * execute the final step: reset in_service_entity along the
+ * path from entity to the root.
+ */
+ for_each_entity(entity)
+ entity->sched_data->in_service_entity = NULL;
+
+ /*
+ * in_serv_entity is no longer in service, so, if it is in no
+ * service tree either, then release the service reference to
+ * the queue it represents (taken with bfq_get_entity).
+ */
+ if (!in_serv_entity->on_st)
+ bfq_put_queue(in_serv_bfqq);
+}
+
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ bool ins_into_idle_tree, bool expiration)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ bfq_deactivate_entity(entity, ins_into_idle_tree, expiration);
+}
+
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ bfq_activate_requeue_entity(entity, bfq_bfqq_non_blocking_wait_rq(bfqq),
+ false);
+ bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
+}
+
+static void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ bfq_activate_requeue_entity(entity, false,
+ bfqq == bfqd->in_service_queue);
+}
+
+static void bfqg_stats_update_dequeue(struct bfq_group *bfqg);
+
+/*
+ * Called when the bfqq no longer has requests pending, remove it from
+ * the service tree. As a special case, it can be invoked during an
+ * expiration.
+ */
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ bool expiration)
+{
+ bfq_log_bfqq(bfqd, bfqq, "del from busy");
+
+ bfq_clear_bfqq_busy(bfqq);
+
+ bfqd->busy_queues--;
+
+ bfqg_stats_update_dequeue(bfqq_group(bfqq));
+
+ bfq_deactivate_bfqq(bfqd, bfqq, true, expiration);
+}
+
+/*
+ * Called when an inactive queue receives a new request.
+ */
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ bfq_log_bfqq(bfqd, bfqq, "add to busy");
+
+ bfq_activate_bfqq(bfqd, bfqq);
+
+ bfq_mark_bfqq_busy(bfqq);
+ bfqd->busy_queues++;
+}
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+
+/* bfqg stats flags */
+enum bfqg_stats_flags {
+ BFQG_stats_waiting = 0,
+ BFQG_stats_idling,
+ BFQG_stats_empty,
+};
+
+#define BFQG_FLAG_FNS(name) \
+static void bfqg_stats_mark_##name(struct bfqg_stats *stats) \
+{ \
+ stats->flags |= (1 << BFQG_stats_##name); \
+} \
+static void bfqg_stats_clear_##name(struct bfqg_stats *stats) \
+{ \
+ stats->flags &= ~(1 << BFQG_stats_##name); \
+} \
+static int bfqg_stats_##name(struct bfqg_stats *stats) \
+{ \
+ return (stats->flags & (1 << BFQG_stats_##name)) != 0; \
+} \
+
+BFQG_FLAG_FNS(waiting)
+BFQG_FLAG_FNS(idling)
+BFQG_FLAG_FNS(empty)
+#undef BFQG_FLAG_FNS
+
+/* This should be called with the queue_lock held. */
+static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
+{
+ unsigned long long now;
+
+ if (!bfqg_stats_waiting(stats))
+ return;
+
+ now = sched_clock();
+ if (time_after64(now, stats->start_group_wait_time))
+ blkg_stat_add(&stats->group_wait_time,
+ now - stats->start_group_wait_time);
+ bfqg_stats_clear_waiting(stats);
+}
+
+/* This should be called with the queue_lock held. */
+static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
+ struct bfq_group *curr_bfqg)
+{
+ struct bfqg_stats *stats = &bfqg->stats;
+
+ if (bfqg_stats_waiting(stats))
+ return;
+ if (bfqg == curr_bfqg)
+ return;
+ stats->start_group_wait_time = sched_clock();
+ bfqg_stats_mark_waiting(stats);
+}
+
+/* This should be called with the queue_lock held. */
+static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
+{
+ unsigned long long now;
+
+ if (!bfqg_stats_empty(stats))
+ return;
+
+ now = sched_clock();
+ if (time_after64(now, stats->start_empty_time))
+ blkg_stat_add(&stats->empty_time,
+ now - stats->start_empty_time);
+ bfqg_stats_clear_empty(stats);
+}
+
+static void bfqg_stats_update_dequeue(struct bfq_group *bfqg)
+{
+ blkg_stat_add(&bfqg->stats.dequeue, 1);
+}
+
+static void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg)
+{
+ struct bfqg_stats *stats = &bfqg->stats;
+
+ if (blkg_rwstat_total(&stats->queued))
+ return;
+
+ /*
+ * group is already marked empty. This can happen if bfqq got new
+ * request in parent group and moved to this group while being added
+ * to service tree. Just ignore the event and move on.
+ */
+ if (bfqg_stats_empty(stats))
+ return;
+
+ stats->start_empty_time = sched_clock();
+ bfqg_stats_mark_empty(stats);
+}
+
+static void bfqg_stats_update_idle_time(struct bfq_group *bfqg)
+{
+ struct bfqg_stats *stats = &bfqg->stats;
+
+ if (bfqg_stats_idling(stats)) {
+ unsigned long long now = sched_clock();
+
+ if (time_after64(now, stats->start_idle_time))
+ blkg_stat_add(&stats->idle_time,
+ now - stats->start_idle_time);
+ bfqg_stats_clear_idling(stats);
+ }
+}
+
+static void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg)
+{
+ struct bfqg_stats *stats = &bfqg->stats;
+
+ stats->start_idle_time = sched_clock();
+ bfqg_stats_mark_idling(stats);
+}
+
+static void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg)
+{
+ struct bfqg_stats *stats = &bfqg->stats;
+
+ blkg_stat_add(&stats->avg_queue_size_sum,
+ blkg_rwstat_total(&stats->queued));
+ blkg_stat_add(&stats->avg_queue_size_samples, 1);
+ bfqg_stats_update_group_wait_time(stats);
+}
+
+/*
+ * blk-cgroup policy-related handlers
+ * The following functions help in converting between blk-cgroup
+ * internal structures and BFQ-specific structures.
+ */
+
+static struct bfq_group *pd_to_bfqg(struct blkg_policy_data *pd)
+{
+ return pd ? container_of(pd, struct bfq_group, pd) : NULL;
+}
+
+static struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg)
+{
+ return pd_to_blkg(&bfqg->pd);
+}
+
+static struct blkcg_policy blkcg_policy_bfq;
+
+static struct bfq_group *blkg_to_bfqg(struct blkcg_gq *blkg)
+{
+ return pd_to_bfqg(blkg_to_pd(blkg, &blkcg_policy_bfq));
+}
+
+/*
+ * bfq_group handlers
+ * The following functions help in navigating the bfq_group hierarchy
+ * by allowing to find the parent of a bfq_group or the bfq_group
+ * associated to a bfq_queue.
+ */
+
+static struct bfq_group *bfqg_parent(struct bfq_group *bfqg)
+{
+ struct blkcg_gq *pblkg = bfqg_to_blkg(bfqg)->parent;
+
+ return pblkg ? blkg_to_bfqg(pblkg) : NULL;
+}
+
+static struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
+{
+ struct bfq_entity *group_entity = bfqq->entity.parent;
+
+ return group_entity ? container_of(group_entity, struct bfq_group,
+ entity) :
+ bfqq->bfqd->root_group;
+}
+
+/*
+ * The following two functions handle get and put of a bfq_group by
+ * wrapping the related blk-cgroup hooks.
+ */
+
+static void bfqg_get(struct bfq_group *bfqg)
+{
+ return blkg_get(bfqg_to_blkg(bfqg));
+}
+
+static void bfqg_put(struct bfq_group *bfqg)
+{
+ return blkg_put(bfqg_to_blkg(bfqg));
+}
+
+static void bfqg_stats_update_io_add(struct bfq_group *bfqg,
+ struct bfq_queue *bfqq,
+ unsigned int op)
+{
+ blkg_rwstat_add(&bfqg->stats.queued, op, 1);
+ bfqg_stats_end_empty_time(&bfqg->stats);
+ if (!(bfqq == ((struct bfq_data *)bfqg->bfqd)->in_service_queue))
+ bfqg_stats_set_start_group_wait_time(bfqg, bfqq_group(bfqq));
+}
+
+static void bfqg_stats_update_io_remove(struct bfq_group *bfqg, unsigned int op)
+{
+ blkg_rwstat_add(&bfqg->stats.queued, op, -1);
+}
+
+static void bfqg_stats_update_io_merged(struct bfq_group *bfqg, unsigned int op)
+{
+ blkg_rwstat_add(&bfqg->stats.merged, op, 1);
+}
+
+static void bfqg_stats_update_completion(struct bfq_group *bfqg,
+ uint64_t start_time, uint64_t io_start_time,
+ unsigned int op)
+{
+ struct bfqg_stats *stats = &bfqg->stats;
+ unsigned long long now = sched_clock();
+
+ if (time_after64(now, io_start_time))
+ blkg_rwstat_add(&stats->service_time, op,
+ now - io_start_time);
+ if (time_after64(io_start_time, start_time))
+ blkg_rwstat_add(&stats->wait_time, op,
+ io_start_time - start_time);
+}
+
+/* @stats = 0 */
+static void bfqg_stats_reset(struct bfqg_stats *stats)
+{
+ /* queued stats shouldn't be cleared */
+ blkg_rwstat_reset(&stats->merged);
+ blkg_rwstat_reset(&stats->service_time);
+ blkg_rwstat_reset(&stats->wait_time);
+ blkg_stat_reset(&stats->time);
+ blkg_stat_reset(&stats->avg_queue_size_sum);
+ blkg_stat_reset(&stats->avg_queue_size_samples);
+ blkg_stat_reset(&stats->dequeue);
+ blkg_stat_reset(&stats->group_wait_time);
+ blkg_stat_reset(&stats->idle_time);
+ blkg_stat_reset(&stats->empty_time);
+}
+
+/* @to += @from */
+static void bfqg_stats_add_aux(struct bfqg_stats *to, struct bfqg_stats *from)
+{
+ if (!to || !from)
+ return;
+
+ /* queued stats shouldn't be cleared */
+ blkg_rwstat_add_aux(&to->merged, &from->merged);
+ blkg_rwstat_add_aux(&to->service_time, &from->service_time);
+ blkg_rwstat_add_aux(&to->wait_time, &from->wait_time);
+ blkg_stat_add_aux(&from->time, &from->time);
+ blkg_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
+ blkg_stat_add_aux(&to->avg_queue_size_samples,
+ &from->avg_queue_size_samples);
+ blkg_stat_add_aux(&to->dequeue, &from->dequeue);
+ blkg_stat_add_aux(&to->group_wait_time, &from->group_wait_time);
+ blkg_stat_add_aux(&to->idle_time, &from->idle_time);
+ blkg_stat_add_aux(&to->empty_time, &from->empty_time);
+}
+
+/*
+ * Transfer @bfqg's stats to its parent's aux counts so that the ancestors'
+ * recursive stats can still account for the amount used by this bfqg after
+ * it's gone.
+ */
+static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
+{
+ struct bfq_group *parent;
+
+ if (!bfqg) /* root_group */
+ return;
+
+ parent = bfqg_parent(bfqg);
+
+ lockdep_assert_held(bfqg_to_blkg(bfqg)->q->queue_lock);
+
+ if (unlikely(!parent))
+ return;
+
+ bfqg_stats_add_aux(&parent->stats, &bfqg->stats);
+ bfqg_stats_reset(&bfqg->stats);
+}
+
+static void bfq_init_entity(struct bfq_entity *entity,
+ struct bfq_group *bfqg)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+ entity->weight = entity->new_weight;
+ entity->orig_weight = entity->new_weight;
+ if (bfqq) {
+ bfqq->ioprio = bfqq->new_ioprio;
+ bfqq->ioprio_class = bfqq->new_ioprio_class;
+ bfqg_get(bfqg);
+ }
+ entity->parent = bfqg->my_entity; /* NULL for root group */
+ entity->sched_data = &bfqg->sched_data;
+}
+
+static void bfqg_stats_exit(struct bfqg_stats *stats)
+{
+ blkg_rwstat_exit(&stats->merged);
+ blkg_rwstat_exit(&stats->service_time);
+ blkg_rwstat_exit(&stats->wait_time);
+ blkg_rwstat_exit(&stats->queued);
+ blkg_stat_exit(&stats->time);
+ blkg_stat_exit(&stats->avg_queue_size_sum);
+ blkg_stat_exit(&stats->avg_queue_size_samples);
+ blkg_stat_exit(&stats->dequeue);
+ blkg_stat_exit(&stats->group_wait_time);
+ blkg_stat_exit(&stats->idle_time);
+ blkg_stat_exit(&stats->empty_time);
+}
+
+static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp)
+{
+ if (blkg_rwstat_init(&stats->merged, gfp) ||
+ blkg_rwstat_init(&stats->service_time, gfp) ||
+ blkg_rwstat_init(&stats->wait_time, gfp) ||
+ blkg_rwstat_init(&stats->queued, gfp) ||
+ blkg_stat_init(&stats->time, gfp) ||
+ blkg_stat_init(&stats->avg_queue_size_sum, gfp) ||
+ blkg_stat_init(&stats->avg_queue_size_samples, gfp) ||
+ blkg_stat_init(&stats->dequeue, gfp) ||
+ blkg_stat_init(&stats->group_wait_time, gfp) ||
+ blkg_stat_init(&stats->idle_time, gfp) ||
+ blkg_stat_init(&stats->empty_time, gfp)) {
+ bfqg_stats_exit(stats);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static struct bfq_group_data *cpd_to_bfqgd(struct blkcg_policy_data *cpd)
+{
+ return cpd ? container_of(cpd, struct bfq_group_data, pd) : NULL;
+}
+
+static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
+{
+ return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
+}
+
+static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
+{
+ struct bfq_group_data *bgd;
+
+ bgd = kzalloc(sizeof(*bgd), gfp);
+ if (!bgd)
+ return NULL;
+ return &bgd->pd;
+}
+
+static void bfq_cpd_init(struct blkcg_policy_data *cpd)
+{
+ struct bfq_group_data *d = cpd_to_bfqgd(cpd);
+
+ d->weight = cgroup_subsys_on_dfl(io_cgrp_subsys) ?
+ CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
+}
+
+static void bfq_cpd_free(struct blkcg_policy_data *cpd)
+{
+ kfree(cpd_to_bfqgd(cpd));
+}
+
+static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
+{
+ struct bfq_group *bfqg;
+
+ bfqg = kzalloc_node(sizeof(*bfqg), gfp, node);
+ if (!bfqg)
+ return NULL;
+
+ if (bfqg_stats_init(&bfqg->stats, gfp)) {
+ kfree(bfqg);
+ return NULL;
+ }
+
+ return &bfqg->pd;
+}
+
+static void bfq_pd_init(struct blkg_policy_data *pd)
+{
+ struct blkcg_gq *blkg = pd_to_blkg(pd);
+ struct bfq_group *bfqg = blkg_to_bfqg(blkg);
+ struct bfq_data *bfqd = blkg->q->elevator->elevator_data;
+ struct bfq_entity *entity = &bfqg->entity;
+ struct bfq_group_data *d = blkcg_to_bfqgd(blkg->blkcg);
+
+ entity->orig_weight = entity->weight = entity->new_weight = d->weight;
+ entity->my_sched_data = &bfqg->sched_data;
+ bfqg->my_entity = entity; /*
+ * the root_group's will be set to NULL
+ * in bfq_init_queue()
+ */
+ bfqg->bfqd = bfqd;
+}
+
+static void bfq_pd_free(struct blkg_policy_data *pd)
+{
+ struct bfq_group *bfqg = pd_to_bfqg(pd);
+
+ bfqg_stats_exit(&bfqg->stats);
+ return kfree(bfqg);
+}
+
+static void bfq_pd_reset_stats(struct blkg_policy_data *pd)
+{
+ struct bfq_group *bfqg = pd_to_bfqg(pd);
+
+ bfqg_stats_reset(&bfqg->stats);
+}
+
+static void bfq_group_set_parent(struct bfq_group *bfqg,
+ struct bfq_group *parent)
+{
+ struct bfq_entity *entity;
+
+ entity = &bfqg->entity;
+ entity->parent = parent->my_entity;
+ entity->sched_data = &parent->sched_data;
+}
+
+static struct bfq_group *bfq_lookup_bfqg(struct bfq_data *bfqd,
+ struct blkcg *blkcg)
+{
+ struct blkcg_gq *blkg;
+
+ blkg = blkg_lookup(blkcg, bfqd->queue);
+ if (likely(blkg))
+ return blkg_to_bfqg(blkg);
+ return NULL;
+}
+
+static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
+ struct blkcg *blkcg)
+{
+ struct bfq_group *bfqg, *parent;
+ struct bfq_entity *entity;
+
+ bfqg = bfq_lookup_bfqg(bfqd, blkcg);
+
+ if (unlikely(!bfqg))
+ return NULL;
+
+ /*
+ * Update chain of bfq_groups as we might be handling a leaf group
+ * which, along with some of its relatives, has not been hooked yet
+ * to the private hierarchy of BFQ.
+ */
+ entity = &bfqg->entity;
+ for_each_entity(entity) {
+ bfqg = container_of(entity, struct bfq_group, entity);
+ if (bfqg != bfqd->root_group) {
+ parent = bfqg_parent(bfqg);
+ if (!parent)
+ parent = bfqd->root_group;
+ bfq_group_set_parent(bfqg, parent);
+ }
+ }
+
+ return bfqg;
+}
+
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ bool compensate,
+ enum bfqq_expiration reason);
+
+/**
+ * bfq_bfqq_move - migrate @bfqq to @bfqg.
+ * @bfqd: queue descriptor.
+ * @bfqq: the queue to move.
+ * @bfqg: the group to move to.
+ *
+ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
+ * it on the new one. Avoid putting the entity on the old group idle tree.
+ *
+ * Must be called under the queue lock; the cgroup owning @bfqg must
+ * not disappear (by now this just means that we are called under
+ * rcu_read_lock()).
+ */
+static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct bfq_group *bfqg)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ /* If bfqq is empty, then bfq_bfqq_expire also invokes
+ * bfq_del_bfqq_busy, thereby removing bfqq and its entity
+ * from data structures related to current group. Otherwise we
+ * need to remove bfqq explicitly with bfq_deactivate_bfqq, as
+ * we do below.
+ */
+ if (bfqq == bfqd->in_service_queue)
+ bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
+ false, BFQQE_PREEMPTED);
+
+ if (bfq_bfqq_busy(bfqq))
+ bfq_deactivate_bfqq(bfqd, bfqq, false, false);
+ else if (entity->on_st)
+ bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
+ bfqg_put(bfqq_group(bfqq));
+
+ /*
+ * Here we use a reference to bfqg. We don't need a refcounter
+ * as the cgroup reference will not be dropped, so that its
+ * destroy() callback will not be invoked.
+ */
+ entity->parent = bfqg->my_entity;
+ entity->sched_data = &bfqg->sched_data;
+ bfqg_get(bfqg);
+
+ if (bfq_bfqq_busy(bfqq))
+ bfq_activate_bfqq(bfqd, bfqq);
+
+ if (!bfqd->in_service_queue && !bfqd->rq_in_driver)
+ bfq_schedule_dispatch(bfqd);
+}
+
+/**
+ * __bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bfqd: the queue descriptor.
+ * @bic: the bic to move.
+ * @blkcg: the blk-cgroup to move to.
+ *
+ * Move bic to blkcg, assuming that bfqd->queue is locked; the caller
+ * has to make sure that the reference to cgroup is valid across the call.
+ *
+ * NOTE: an alternative approach might have been to store the current
+ * cgroup in bfqq and getting a reference to it, reducing the lookup
+ * time here, at the price of slightly more complex code.
+ */
+static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
+ struct bfq_io_cq *bic,
+ struct blkcg *blkcg)
+{
+ struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
+ struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
+ struct bfq_group *bfqg;
+ struct bfq_entity *entity;
+
+ bfqg = bfq_find_set_group(bfqd, blkcg);
+
+ if (unlikely(!bfqg))
+ bfqg = bfqd->root_group;
+
+ if (async_bfqq) {
+ entity = &async_bfqq->entity;
+
+ if (entity->sched_data != &bfqg->sched_data) {
+ bic_set_bfqq(bic, NULL, 0);
+ bfq_log_bfqq(bfqd, async_bfqq,
+ "bic_change_group: %p %d",
+ async_bfqq,
+ async_bfqq->ref);
+ bfq_put_queue(async_bfqq);
+ }
+ }
+
+ if (sync_bfqq) {
+ entity = &sync_bfqq->entity;
+ if (entity->sched_data != &bfqg->sched_data)
+ bfq_bfqq_move(bfqd, sync_bfqq, bfqg);
+ }
+
+ return bfqg;
+}
+
+static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
+{
+ struct bfq_data *bfqd = bic_to_bfqd(bic);
+ struct bfq_group *bfqg = NULL;
+ uint64_t serial_nr;
+
+ rcu_read_lock();
+ serial_nr = bio_blkcg(bio)->css.serial_nr;
+
+ /*
+ * Check whether blkcg has changed. The condition may trigger
+ * spuriously on a newly created cic but there's no harm.
+ */
+ if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr))
+ goto out;
+
+ bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
+ bic->blkcg_serial_nr = serial_nr;
+out:
+ rcu_read_unlock();
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static void bfq_flush_idle_tree(struct bfq_service_tree *st)
+{
+ struct bfq_entity *entity = st->first_idle;
+
+ for (; entity ; entity = st->first_idle)
+ __bfq_deactivate_entity(entity, false);
+}
+
+/**
+ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
+ * @bfqd: the device data structure with the root group.
+ * @entity: the entity to move.
+ */
+static void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
+ struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+ bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
}
/**
- * __bfq_lookup_next_entity - return the first eligible entity in @st.
- * @st: the service tree.
+ * bfq_reparent_active_entities - move to the root group all active
+ * entities.
+ * @bfqd: the device data structure with the root group.
+ * @bfqg: the group to move from.
+ * @st: the service tree with the entities.
*
- * Update the virtual time in @st and return the first eligible entity
- * it contains.
+ * Needs queue_lock to be taken and reference to be valid over the call.
*/
-static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
- bool force)
+static void bfq_reparent_active_entities(struct bfq_data *bfqd,
+ struct bfq_group *bfqg,
+ struct bfq_service_tree *st)
{
- struct bfq_entity *entity, *new_next_in_service = NULL;
-
- if (RB_EMPTY_ROOT(&st->active))
- return NULL;
+ struct rb_root *active = &st->active;
+ struct bfq_entity *entity = NULL;
- bfq_update_vtime(st);
- entity = bfq_first_active_entity(st);
+ if (!RB_EMPTY_ROOT(&st->active))
+ entity = bfq_entity_of(rb_first(active));
- /*
- * If the chosen entity does not match with the sched_data's
- * next_in_service and we are forcedly serving the IDLE priority
- * class tree, bubble up budget update.
- */
- if (unlikely(force && entity != entity->sched_data->next_in_service)) {
- new_next_in_service = entity;
- for_each_entity(new_next_in_service)
- bfq_update_budget(new_next_in_service);
- }
+ for (; entity ; entity = bfq_entity_of(rb_first(active)))
+ bfq_reparent_leaf_entity(bfqd, entity);
- return entity;
+ if (bfqg->sched_data.in_service_entity)
+ bfq_reparent_leaf_entity(bfqd,
+ bfqg->sched_data.in_service_entity);
}
/**
- * bfq_lookup_next_entity - return the first eligible entity in @sd.
- * @sd: the sched_data.
- * @extract: if true the returned entity will be also extracted from @sd.
+ * bfq_pd_offline - deactivate the entity associated with @pd,
+ * and reparent its children entities.
+ * @pd: descriptor of the policy going offline.
*
- * NOTE: since we cache the next_in_service entity at each level of the
- * hierarchy, the complexity of the lookup can be decreased with
- * absolutely no effort just returning the cached next_in_service value;
- * we prefer to do full lookups to test the consistency of the data
- * structures.
+ * blkio already grabs the queue_lock for us, so no need to use
+ * RCU-based magic
*/
-static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
- int extract,
- struct bfq_data *bfqd)
+static void bfq_pd_offline(struct blkg_policy_data *pd)
{
- struct bfq_service_tree *st = sd->service_tree;
- struct bfq_entity *entity;
- int i = 0;
+ struct bfq_service_tree *st;
+ struct bfq_group *bfqg = pd_to_bfqg(pd);
+ struct bfq_data *bfqd = bfqg->bfqd;
+ struct bfq_entity *entity = bfqg->my_entity;
+ unsigned long flags;
+ int i;
+ if (!entity) /* root group */
+ return;
+
+ spin_lock_irqsave(&bfqd->lock, flags);
/*
- * Choose from idle class, if needed to guarantee a minimum
- * bandwidth to this class. This should also mitigate
- * priority-inversion problems in case a low priority task is
- * holding file system resources.
+ * Empty all service_trees belonging to this group before
+ * deactivating the group itself.
*/
- if (bfqd &&
- jiffies - bfqd->bfq_class_idle_last_service >
- BFQ_CL_IDLE_TIMEOUT) {
- entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
- true);
- if (entity) {
- i = BFQ_IOPRIO_CLASSES - 1;
- bfqd->bfq_class_idle_last_service = jiffies;
- sd->next_in_service = entity;
- }
- }
- for (; i < BFQ_IOPRIO_CLASSES; i++) {
- entity = __bfq_lookup_next_entity(st + i, false);
- if (entity) {
- if (extract) {
- bfq_check_next_in_service(sd, entity);
- bfq_active_extract(st + i, entity);
- sd->in_service_entity = entity;
- sd->next_in_service = NULL;
- }
- break;
- }
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+ st = bfqg->sched_data.service_tree + i;
+
+ /*
+ * The idle tree may still contain bfq_queues belonging
+ * to exited task because they never migrated to a different
+ * cgroup from the one being destroyed now. No one else
+ * can access them so it's safe to act without any lock.
+ */
+ bfq_flush_idle_tree(st);
+
+ /*
+ * It may happen that some queues are still active
+ * (busy) upon group destruction (if the corresponding
+ * processes have been forced to terminate). We move
+ * all the leaf entities corresponding to these queues
+ * to the root_group.
+ * Also, it may happen that the group has an entity
+ * in service, which is disconnected from the active
+ * tree: it must be moved, too.
+ * There is no need to put the sync queues, as the
+ * scheduler has taken no reference.
+ */
+ bfq_reparent_active_entities(bfqd, bfqg, st);
}
- return entity;
+ __bfq_deactivate_entity(entity, false);
+ bfq_put_async_queues(bfqd, bfqg);
+
+ spin_unlock_irqrestore(&bfqd->lock, flags);
+ /*
+ * @blkg is going offline and will be ignored by
+ * blkg_[rw]stat_recursive_sum(). Transfer stats to the parent so
+ * that they don't get lost. If IOs complete after this point, the
+ * stats for them will be lost. Oh well...
+ */
+ bfqg_stats_xfer_dead(bfqg);
}
-static bool next_queue_may_preempt(struct bfq_data *bfqd)
+static int bfq_io_show_weight(struct seq_file *sf, void *v)
{
- struct bfq_sched_data *sd = &bfqd->sched_data;
+ struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+ struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
+ unsigned int val = 0;
- return sd->next_in_service != sd->in_service_entity;
-}
+ if (bfqgd)
+ val = bfqgd->weight;
+ seq_printf(sf, "%u\n", val);
-/*
- * Get next queue for service.
- */
-static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+ return 0;
+}
+
+static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css,
+ struct cftype *cftype,
+ u64 val)
{
- struct bfq_entity *entity = NULL;
- struct bfq_sched_data *sd;
- struct bfq_queue *bfqq;
+ struct blkcg *blkcg = css_to_blkcg(css);
+ struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
+ struct blkcg_gq *blkg;
+ int ret = -ERANGE;
- if (bfqd->busy_queues == 0)
- return NULL;
+ if (val < BFQ_MIN_WEIGHT || val > BFQ_MAX_WEIGHT)
+ return ret;
- sd = &bfqd->sched_data;
- for (; sd ; sd = entity->my_sched_data) {
- entity = bfq_lookup_next_entity(sd, 1, bfqd);
- entity->service = 0;
+ ret = 0;
+ spin_lock_irq(&blkcg->lock);
+ bfqgd->weight = (unsigned short)val;
+ hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
+ struct bfq_group *bfqg = blkg_to_bfqg(blkg);
+
+ if (!bfqg)
+ continue;
+ /*
+ * Setting the prio_changed flag of the entity
+ * to 1 with new_weight == weight would re-set
+ * the value of the weight to its ioprio mapping.
+ * Set the flag only if necessary.
+ */
+ if ((unsigned short)val != bfqg->entity.new_weight) {
+ bfqg->entity.new_weight = (unsigned short)val;
+ /*
+ * Make sure that the above new value has been
+ * stored in bfqg->entity.new_weight before
+ * setting the prio_changed flag. In fact,
+ * this flag may be read asynchronously (in
+ * critical sections protected by a different
+ * lock than that held here), and finding this
+ * flag set may cause the execution of the code
+ * for updating parameters whose value may
+ * depend also on bfqg->entity.new_weight (in
+ * __bfq_entity_update_weight_prio).
+ * This barrier makes sure that the new value
+ * of bfqg->entity.new_weight is correctly
+ * seen in that code.
+ */
+ smp_wmb();
+ bfqg->entity.prio_changed = 1;
+ }
}
+ spin_unlock_irq(&blkcg->lock);
- bfqq = bfq_entity_to_bfqq(entity);
+ return ret;
+}
- return bfqq;
+static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ u64 weight;
+ /* First unsigned long found in the file is used */
+ int ret = kstrtoull(strim(buf), 0, &weight);
+
+ if (ret)
+ return ret;
+
+ return bfq_io_set_weight_legacy(of_css(of), NULL, weight);
}
-static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+static int bfqg_print_stat(struct seq_file *sf, void *v)
{
- struct bfq_queue *in_serv_bfqq = bfqd->in_service_queue;
- struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_stat,
+ &blkcg_policy_bfq, seq_cft(sf)->private, false);
+ return 0;
+}
- if (bfqd->in_service_bic) {
- put_io_context(bfqd->in_service_bic->icq.ioc);
- bfqd->in_service_bic = NULL;
- }
+static int bfqg_print_rwstat(struct seq_file *sf, void *v)
+{
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_rwstat,
+ &blkcg_policy_bfq, seq_cft(sf)->private, true);
+ return 0;
+}
- bfq_clear_bfqq_wait_request(in_serv_bfqq);
- hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
- bfqd->in_service_queue = NULL;
+static u64 bfqg_prfill_stat_recursive(struct seq_file *sf,
+ struct blkg_policy_data *pd, int off)
+{
+ u64 sum = blkg_stat_recursive_sum(pd_to_blkg(pd),
+ &blkcg_policy_bfq, off);
+ return __blkg_prfill_u64(sf, pd, sum);
+}
- /*
- * in_serv_entity is no longer in service, so, if it is in no
- * service tree either, then release the service reference to
- * the queue it represents (taken with bfq_get_entity).
- */
- if (!in_serv_entity->on_st)
- bfq_put_queue(in_serv_bfqq);
+static u64 bfqg_prfill_rwstat_recursive(struct seq_file *sf,
+ struct blkg_policy_data *pd, int off)
+{
+ struct blkg_rwstat sum = blkg_rwstat_recursive_sum(pd_to_blkg(pd),
+ &blkcg_policy_bfq,
+ off);
+ return __blkg_prfill_rwstat(sf, pd, &sum);
}
-static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
- int requeue)
+static int bfqg_print_stat_recursive(struct seq_file *sf, void *v)
{
- struct bfq_entity *entity = &bfqq->entity;
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+ bfqg_prfill_stat_recursive, &blkcg_policy_bfq,
+ seq_cft(sf)->private, false);
+ return 0;
+}
- bfq_deactivate_entity(entity, requeue);
+static int bfqg_print_rwstat_recursive(struct seq_file *sf, void *v)
+{
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+ bfqg_prfill_rwstat_recursive, &blkcg_policy_bfq,
+ seq_cft(sf)->private, true);
+ return 0;
}
-static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+static u64 bfqg_prfill_sectors(struct seq_file *sf, struct blkg_policy_data *pd,
+ int off)
{
- struct bfq_entity *entity = &bfqq->entity;
+ u64 sum = blkg_rwstat_total(&pd->blkg->stat_bytes);
- bfq_activate_entity(entity, bfq_bfqq_non_blocking_wait_rq(bfqq));
- bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
+ return __blkg_prfill_u64(sf, pd, sum >> 9);
}
-/*
- * Called when the bfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
- int requeue)
+static int bfqg_print_stat_sectors(struct seq_file *sf, void *v)
{
- bfq_log_bfqq(bfqd, bfqq, "del from busy");
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+ bfqg_prfill_sectors, &blkcg_policy_bfq, 0, false);
+ return 0;
+}
- bfq_clear_bfqq_busy(bfqq);
+static u64 bfqg_prfill_sectors_recursive(struct seq_file *sf,
+ struct blkg_policy_data *pd, int off)
+{
+ struct blkg_rwstat tmp = blkg_rwstat_recursive_sum(pd->blkg, NULL,
+ offsetof(struct blkcg_gq, stat_bytes));
+ u64 sum = atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) +
+ atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]);
- bfqd->busy_queues--;
+ return __blkg_prfill_u64(sf, pd, sum >> 9);
+}
- bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+static int bfqg_print_stat_sectors_recursive(struct seq_file *sf, void *v)
+{
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+ bfqg_prfill_sectors_recursive, &blkcg_policy_bfq, 0,
+ false);
+ return 0;
}
-/*
- * Called when an inactive queue receives a new request.
- */
-static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+static u64 bfqg_prfill_avg_queue_size(struct seq_file *sf,
+ struct blkg_policy_data *pd, int off)
{
- bfq_log_bfqq(bfqd, bfqq, "add to busy");
+ struct bfq_group *bfqg = pd_to_bfqg(pd);
+ u64 samples = blkg_stat_read(&bfqg->stats.avg_queue_size_samples);
+ u64 v = 0;
- bfq_activate_bfqq(bfqd, bfqq);
+ if (samples) {
+ v = blkg_stat_read(&bfqg->stats.avg_queue_size_sum);
+ v = div64_u64(v, samples);
+ }
+ __blkg_prfill_u64(sf, pd, v);
+ return 0;
+}
- bfq_mark_bfqq_busy(bfqq);
- bfqd->busy_queues++;
+/* print avg_queue_size */
+static int bfqg_print_avg_queue_size(struct seq_file *sf, void *v)
+{
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+ bfqg_prfill_avg_queue_size, &blkcg_policy_bfq,
+ 0, false);
+ return 0;
+}
+
+static struct bfq_group *
+bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
+{
+ int ret;
+
+ ret = blkcg_activate_policy(bfqd->queue, &blkcg_policy_bfq);
+ if (ret)
+ return NULL;
+
+ return blkg_to_bfqg(bfqd->queue->root_blkg);
}
-static void bfq_init_entity(struct bfq_entity *entity)
+static struct cftype bfq_blkcg_legacy_files[] = {
+ {
+ .name = "bfq.weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = bfq_io_show_weight,
+ .write_u64 = bfq_io_set_weight_legacy,
+ },
+
+ /* statistics, covers only the tasks in the bfqg */
+ {
+ .name = "bfq.time",
+ .private = offsetof(struct bfq_group, stats.time),
+ .seq_show = bfqg_print_stat,
+ },
+ {
+ .name = "bfq.sectors",
+ .seq_show = bfqg_print_stat_sectors,
+ },
+ {
+ .name = "bfq.io_service_bytes",
+ .private = (unsigned long)&blkcg_policy_bfq,
+ .seq_show = blkg_print_stat_bytes,
+ },
+ {
+ .name = "bfq.io_serviced",
+ .private = (unsigned long)&blkcg_policy_bfq,
+ .seq_show = blkg_print_stat_ios,
+ },
+ {
+ .name = "bfq.io_service_time",
+ .private = offsetof(struct bfq_group, stats.service_time),
+ .seq_show = bfqg_print_rwstat,
+ },
+ {
+ .name = "bfq.io_wait_time",
+ .private = offsetof(struct bfq_group, stats.wait_time),
+ .seq_show = bfqg_print_rwstat,
+ },
+ {
+ .name = "bfq.io_merged",
+ .private = offsetof(struct bfq_group, stats.merged),
+ .seq_show = bfqg_print_rwstat,
+ },
+ {
+ .name = "bfq.io_queued",
+ .private = offsetof(struct bfq_group, stats.queued),
+ .seq_show = bfqg_print_rwstat,
+ },
+
+ /* the same statictics which cover the bfqg and its descendants */
+ {
+ .name = "bfq.time_recursive",
+ .private = offsetof(struct bfq_group, stats.time),
+ .seq_show = bfqg_print_stat_recursive,
+ },
+ {
+ .name = "bfq.sectors_recursive",
+ .seq_show = bfqg_print_stat_sectors_recursive,
+ },
+ {
+ .name = "bfq.io_service_bytes_recursive",
+ .private = (unsigned long)&blkcg_policy_bfq,
+ .seq_show = blkg_print_stat_bytes_recursive,
+ },
+ {
+ .name = "bfq.io_serviced_recursive",
+ .private = (unsigned long)&blkcg_policy_bfq,
+ .seq_show = blkg_print_stat_ios_recursive,
+ },
+ {
+ .name = "bfq.io_service_time_recursive",
+ .private = offsetof(struct bfq_group, stats.service_time),
+ .seq_show = bfqg_print_rwstat_recursive,
+ },
+ {
+ .name = "bfq.io_wait_time_recursive",
+ .private = offsetof(struct bfq_group, stats.wait_time),
+ .seq_show = bfqg_print_rwstat_recursive,
+ },
+ {
+ .name = "bfq.io_merged_recursive",
+ .private = offsetof(struct bfq_group, stats.merged),
+ .seq_show = bfqg_print_rwstat_recursive,
+ },
+ {
+ .name = "bfq.io_queued_recursive",
+ .private = offsetof(struct bfq_group, stats.queued),
+ .seq_show = bfqg_print_rwstat_recursive,
+ },
+ {
+ .name = "bfq.avg_queue_size",
+ .seq_show = bfqg_print_avg_queue_size,
+ },
+ {
+ .name = "bfq.group_wait_time",
+ .private = offsetof(struct bfq_group, stats.group_wait_time),
+ .seq_show = bfqg_print_stat,
+ },
+ {
+ .name = "bfq.idle_time",
+ .private = offsetof(struct bfq_group, stats.idle_time),
+ .seq_show = bfqg_print_stat,
+ },
+ {
+ .name = "bfq.empty_time",
+ .private = offsetof(struct bfq_group, stats.empty_time),
+ .seq_show = bfqg_print_stat,
+ },
+ {
+ .name = "bfq.dequeue",
+ .private = offsetof(struct bfq_group, stats.dequeue),
+ .seq_show = bfqg_print_stat,
+ },
+ { } /* terminate */
+};
+
+static struct cftype bfq_blkg_files[] = {
+ {
+ .name = "bfq.weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = bfq_io_show_weight,
+ .write = bfq_io_set_weight,
+ },
+ {} /* terminate */
+};
+
+#else /* CONFIG_BFQ_GROUP_IOSCHED */
+
+static inline void bfqg_stats_update_io_add(struct bfq_group *bfqg,
+ struct bfq_queue *bfqq, unsigned int op) { }
+static inline void
+bfqg_stats_update_io_remove(struct bfq_group *bfqg, unsigned int op) { }
+static inline void
+bfqg_stats_update_io_merged(struct bfq_group *bfqg, unsigned int op) { }
+static inline void bfqg_stats_update_completion(struct bfq_group *bfqg,
+ uint64_t start_time, uint64_t io_start_time,
+ unsigned int op) { }
+static inline void
+bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
+ struct bfq_group *curr_bfqg) { }
+static inline void bfqg_stats_end_empty_time(struct bfqg_stats *stats) { }
+static inline void bfqg_stats_update_dequeue(struct bfq_group *bfqg) { }
+static inline void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg) { }
+static inline void bfqg_stats_update_idle_time(struct bfq_group *bfqg) { }
+static inline void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) { }
+static inline void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) { }
+
+static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct bfq_group *bfqg) {}
+
+static void bfq_init_entity(struct bfq_entity *entity,
+ struct bfq_group *bfqg)
{
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
entity->weight = entity->new_weight;
entity->orig_weight = entity->new_weight;
+ if (bfqq) {
+ bfqq->ioprio = bfqq->new_ioprio;
+ bfqq->ioprio_class = bfqq->new_ioprio_class;
+ }
+ entity->sched_data = &bfqg->sched_data;
+}
+
+static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) {}
+
+static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
+ struct blkcg *blkcg)
+{
+ return bfqd->root_group;
+}
+
+static struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
+{
+ return bfqq->bfqd->root_group;
+}
- bfqq->ioprio = bfqq->new_ioprio;
- bfqq->ioprio_class = bfqq->new_ioprio_class;
+static struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd,
+ int node)
+{
+ struct bfq_group *bfqg;
+ int i;
+
+ bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
+ if (!bfqg)
+ return NULL;
- entity->sched_data = &bfqq->bfqd->sched_data;
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+ return bfqg;
}
+#endif /* CONFIG_BFQ_GROUP_IOSCHED */
#define bfq_class_idle(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
#define bfq_class_rt(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_RT)
@@ -1711,18 +3411,6 @@ static void bfq_init_entity(struct bfq_entity *entity)
#define bfq_sample_valid(samples) ((samples) > 80)
/*
- * Scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing.
- */
-static void bfq_schedule_dispatch(struct bfq_data *bfqd)
-{
- if (bfqd->queued != 0) {
- bfq_log(bfqd, "schedule dispatch");
- blk_mq_run_hw_queues(bfqd->queue, true);
- }
-}
-
-/*
* Lifted from AS - choose which of rq1 and rq2 that is best served now.
* We choose the request that is closesr to the head right now. Distance
* behind the head is penalized and only allowed to a certain extent.
@@ -1905,7 +3593,7 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
entity->budget = new_budget;
bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
new_budget);
- bfq_activate_bfqq(bfqd, bfqq);
+ bfq_requeue_bfqq(bfqd, bfqq);
}
}
@@ -2076,6 +3764,8 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
bfqq->ttime.last_end_request +
bfqd->bfq_slice_idle * 3;
+ bfqg_stats_update_io_add(bfqq_group(RQ_BFQQ(rq)), bfqq, rq->cmd_flags);
+
/*
* Update budget and check whether bfqq may want to preempt
* the in-service queue.
@@ -2195,7 +3885,7 @@ static void bfq_remove_request(struct request_queue *q,
bfqq->next_rq = NULL;
if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) {
- bfq_del_bfqq_busy(bfqd, bfqq, 1);
+ bfq_del_bfqq_busy(bfqd, bfqq, false);
/*
* bfqq emptied. In normal operation, when
* bfqq is empty, bfqq->entity.service and
@@ -2215,6 +3905,8 @@ static void bfq_remove_request(struct request_queue *q,
if (rq->cmd_flags & REQ_META)
bfqq->meta_pending--;
+
+ bfqg_stats_update_io_remove(bfqq_group(bfqq), rq->cmd_flags);
}
static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
@@ -2300,7 +3992,7 @@ static void bfq_requests_merged(struct request_queue *q, struct request *rq,
struct bfq_queue *bfqq = RQ_BFQQ(rq), *next_bfqq = RQ_BFQQ(next);
if (!RB_EMPTY_NODE(&rq->rb_node))
- return;
+ goto end;
spin_lock_irq(&bfqq->bfqd->lock);
/*
@@ -2326,6 +4018,8 @@ static void bfq_requests_merged(struct request_queue *q, struct request *rq,
bfq_remove_request(q, next);
spin_unlock_irq(&bfqq->bfqd->lock);
+end:
+ bfqg_stats_update_io_merged(bfqq_group(bfqq), next->cmd_flags);
}
static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
@@ -2355,6 +4049,7 @@ static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
struct bfq_queue *bfqq)
{
if (bfqq) {
+ bfqg_stats_update_avg_queue_size(bfqq_group(bfqq));
bfq_mark_bfqq_budget_new(bfqq);
bfq_clear_bfqq_fifo_expire(bfqq);
@@ -2441,6 +4136,7 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
bfqd->last_idling_start = ktime_get();
hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
HRTIMER_MODE_REL);
+ bfqg_stats_set_start_idle_time(bfqq_group(bfqq));
}
/*
@@ -2490,12 +4186,17 @@ static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)
static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
- __bfq_bfqd_reset_in_service(bfqd);
-
if (RB_EMPTY_ROOT(&bfqq->sort_list))
- bfq_del_bfqq_busy(bfqd, bfqq, 1);
+ bfq_del_bfqq_busy(bfqd, bfqq, true);
else
- bfq_activate_bfqq(bfqd, bfqq);
+ bfq_requeue_bfqq(bfqd, bfqq);
+
+ /*
+ * All in-service entities must have been properly deactivated
+ * or requeued before executing the next function, which
+ * resets all in-service entites as no more in service.
+ */
+ __bfq_bfqd_reset_in_service(bfqd);
}
/**
@@ -2972,6 +4673,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
*/
bfq_clear_bfqq_wait_request(bfqq);
hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+ bfqg_stats_update_idle_time(bfqq_group(bfqq));
}
goto keep_queue;
}
@@ -3159,6 +4861,10 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
*/
static void bfq_put_queue(struct bfq_queue *bfqq)
{
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ struct bfq_group *bfqg = bfqq_group(bfqq);
+#endif
+
if (bfqq->bfqd)
bfq_log_bfqq(bfqq->bfqd, bfqq, "put_queue: %p %d",
bfqq, bfqq->ref);
@@ -3167,7 +4873,12 @@ static void bfq_put_queue(struct bfq_queue *bfqq)
if (bfqq->ref)
return;
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "put_queue: %p freed", bfqq);
+
kmem_cache_free(bfq_pool, bfqq);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ bfqg_put(bfqg);
+#endif
}
static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
@@ -3323,18 +5034,19 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
}
static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+ struct bfq_group *bfqg,
int ioprio_class, int ioprio)
{
switch (ioprio_class) {
case IOPRIO_CLASS_RT:
- return &async_bfqq[0][ioprio];
+ return &bfqg->async_bfqq[0][ioprio];
case IOPRIO_CLASS_NONE:
ioprio = IOPRIO_NORM;
/* fall through */
case IOPRIO_CLASS_BE:
- return &async_bfqq[1][ioprio];
+ return &bfqg->async_bfqq[1][ioprio];
case IOPRIO_CLASS_IDLE:
- return &async_idle_bfqq;
+ return &bfqg->async_idle_bfqq;
default:
return NULL;
}
@@ -3348,11 +5060,18 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
struct bfq_queue **async_bfqq = NULL;
struct bfq_queue *bfqq;
+ struct bfq_group *bfqg;
rcu_read_lock();
+ bfqg = bfq_find_set_group(bfqd, bio_blkcg(bio));
+ if (!bfqg) {
+ bfqq = &bfqd->oom_bfqq;
+ goto out;
+ }
+
if (!is_sync) {
- async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class,
+ async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
ioprio);
bfqq = *async_bfqq;
if (bfqq)
@@ -3366,7 +5085,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
if (bfqq) {
bfq_init_bfqq(bfqd, bfqq, bic, current->pid,
is_sync);
- bfq_init_entity(&bfqq->entity);
+ bfq_init_entity(&bfqq->entity, bfqg);
bfq_log_bfqq(bfqd, bfqq, "allocated");
} else {
bfqq = &bfqd->oom_bfqq;
@@ -3379,9 +5098,14 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
* prune it.
*/
if (async_bfqq) {
- bfqq->ref++;
- bfq_log_bfqq(bfqd, bfqq,
- "get_queue, bfqq not in async: %p, %d",
+ bfqq->ref++; /*
+ * Extra group reference, w.r.t. sync
+ * queue. This extra reference is removed
+ * only if bfqq->bfqg disappears, to
+ * guarantee that this queue is not freed
+ * until its group goes away.
+ */
+ bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
bfqq, bfqq->ref);
*async_bfqq = bfqq;
}
@@ -3516,6 +5240,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
*/
bfq_clear_bfqq_wait_request(bfqq);
hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+ bfqg_stats_update_idle_time(bfqq_group(bfqq));
/*
* The queue is not empty, because a new request just
@@ -3657,6 +5382,11 @@ static void bfq_put_rq_private(struct request_queue *q, struct request *rq)
struct bfq_queue *bfqq = RQ_BFQQ(rq);
struct bfq_data *bfqd = bfqq->bfqd;
+ if (rq->rq_flags & RQF_STARTED)
+ bfqg_stats_update_completion(bfqq_group(bfqq),
+ rq_start_time_ns(rq),
+ rq_io_start_time_ns(rq),
+ rq->cmd_flags);
if (likely(rq->rq_flags & RQF_STARTED)) {
unsigned long flags;
@@ -3707,6 +5437,8 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
if (!bic)
goto queue_fail;
+ bfq_bic_update_cgroup(bic, bio);
+
bfqq = bic_to_bfqq(bic, is_sync);
if (!bfqq || bfqq == &bfqd->oom_bfqq) {
if (bfqq)
@@ -3803,6 +5535,8 @@ static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
if (bfqq) {
+ bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
+
bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
bfqq, bfqq->ref);
bfq_put_queue(bfqq);
@@ -3811,18 +5545,20 @@ static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
}
/*
- * Release the extra reference of the async queues as the device
- * goes away.
+ * Release all the bfqg references to its async queues. If we are
+ * deallocating the group these queues may still contain requests, so
+ * we reparent them to the root cgroup (i.e., the only one that will
+ * exist for sure until all the requests on a device are gone).
*/
-static void bfq_put_async_queues(struct bfq_data *bfqd)
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
{
int i, j;
for (i = 0; i < 2; i++)
for (j = 0; j < IOPRIO_BE_NR; j++)
- __bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+ __bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
- __bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+ __bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
}
static void bfq_exit_queue(struct elevator_queue *e)
@@ -3834,20 +5570,42 @@ static void bfq_exit_queue(struct elevator_queue *e)
spin_lock_irq(&bfqd->lock);
list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
- bfq_deactivate_bfqq(bfqd, bfqq, false);
- bfq_put_async_queues(bfqd);
+ bfq_deactivate_bfqq(bfqd, bfqq, false, false);
spin_unlock_irq(&bfqd->lock);
hrtimer_cancel(&bfqd->idle_slice_timer);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq);
+#else
+ spin_lock_irq(&bfqd->lock);
+ bfq_put_async_queues(bfqd, bfqd->root_group);
+ kfree(bfqd->root_group);
+ spin_unlock_irq(&bfqd->lock);
+#endif
+
kfree(bfqd);
}
+static void bfq_init_root_group(struct bfq_group *root_group,
+ struct bfq_data *bfqd)
+{
+ int i;
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ root_group->entity.parent = NULL;
+ root_group->my_entity = NULL;
+ root_group->bfqd = bfqd;
+#endif
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+ root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+ root_group->sched_data.bfq_class_idle_last_service = jiffies;
+}
+
static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
{
struct bfq_data *bfqd;
struct elevator_queue *eq;
- int i;
eq = elevator_alloc(q, e);
if (!eq)
@@ -3860,6 +5618,10 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
}
eq->elevator_data = bfqd;
+ spin_lock_irq(q->queue_lock);
+ q->elevator = eq;
+ spin_unlock_irq(q->queue_lock);
+
/*
* Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
* Grab a permanent reference to it, so that the normal code flow
@@ -3880,8 +5642,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
bfqd->queue = q;
- for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
- bfqd->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+ INIT_LIST_HEAD(&bfqd->dispatch);
hrtimer_init(&bfqd->idle_slice_timer, CLOCK_MONOTONIC,
HRTIMER_MODE_REL);
@@ -3899,17 +5660,40 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
bfqd->bfq_back_max = bfq_back_max;
bfqd->bfq_back_penalty = bfq_back_penalty;
bfqd->bfq_slice_idle = bfq_slice_idle;
- bfqd->bfq_class_idle_last_service = 0;
bfqd->bfq_timeout = bfq_timeout;
bfqd->bfq_requests_within_timer = 120;
spin_lock_init(&bfqd->lock);
- INIT_LIST_HEAD(&bfqd->dispatch);
- q->elevator = eq;
+ /*
+ * The invocation of the next bfq_create_group_hierarchy
+ * function is the head of a chain of function calls
+ * (bfq_create_group_hierarchy->blkcg_activate_policy->
+ * blk_mq_freeze_queue) that may lead to the invocation of the
+ * has_work hook function. For this reason,
+ * bfq_create_group_hierarchy is invoked only after all
+ * scheduler data has been initialized, apart from the fields
+ * that can be initialized only after invoking
+ * bfq_create_group_hierarchy. This, in particular, enables
+ * has_work to correctly return false. Of course, to avoid
+ * other inconsistencies, the blk-mq stack must then refrain
+ * from invoking further scheduler hooks before this init
+ * function is finished.
+ */
+ bfqd->root_group = bfq_create_group_hierarchy(bfqd, q->node);
+ if (!bfqd->root_group)
+ goto out_free;
+ bfq_init_root_group(bfqd->root_group, bfqd);
+ bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group);
+
return 0;
+
+out_free:
+ kfree(bfqd);
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
}
static void bfq_slab_kill(void)
@@ -4134,10 +5918,34 @@ static struct elevator_type iosched_bfq_mq = {
.elevator_owner = THIS_MODULE,
};
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+static struct blkcg_policy blkcg_policy_bfq = {
+ .dfl_cftypes = bfq_blkg_files,
+ .legacy_cftypes = bfq_blkcg_legacy_files,
+
+ .cpd_alloc_fn = bfq_cpd_alloc,
+ .cpd_init_fn = bfq_cpd_init,
+ .cpd_bind_fn = bfq_cpd_init,
+ .cpd_free_fn = bfq_cpd_free,
+
+ .pd_alloc_fn = bfq_pd_alloc,
+ .pd_init_fn = bfq_pd_init,
+ .pd_offline_fn = bfq_pd_offline,
+ .pd_free_fn = bfq_pd_free,
+ .pd_reset_stats_fn = bfq_pd_reset_stats,
+};
+#endif
+
static int __init bfq_init(void)
{
int ret;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ ret = blkcg_policy_register(&blkcg_policy_bfq);
+ if (ret)
+ return ret;
+#endif
+
ret = -ENOMEM;
if (bfq_slab_setup())
goto err_pol_unreg;
@@ -4149,12 +5957,18 @@ static int __init bfq_init(void)
return 0;
err_pol_unreg:
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ blkcg_policy_unregister(&blkcg_policy_bfq);
+#endif
return ret;
}
static void __exit bfq_exit(void)
{
elv_unregister(&iosched_bfq_mq);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ blkcg_policy_unregister(&blkcg_policy_bfq);
+#endif
bfq_slab_kill();
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f33..a569f3a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -48,7 +48,7 @@ struct rq_wb;
* Maximum number of blkcg policies allowed to be registered concurrently.
* Defined here to simplify include dependency.
*/
-#define BLKCG_MAX_POLS 2
+#define BLKCG_MAX_POLS 3
typedef void (rq_end_io_fn)(struct request *, int);
--
2.10.0
^ permalink raw reply related
* [PATCH V4 01/16] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170412162322.11139-1-paolo.valente@linaro.org>
We tag as v0 the version of BFQ containing only BFQ's engine plus
hierarchical support. BFQ's engine is introduced by this commit, while
hierarchical support is added by next commit. We use the v0 tag to
distinguish this minimal version of BFQ from the versions containing
also the features and the improvements added by next commits. BFQ-v0
coincides with the version of BFQ submitted a few years ago [1], apart
from the introduction of preemption, described below.
BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.
- Each process doing I/O on a device is associated with a weight and a
(bfq_)queue.
- BFQ grants exclusive access to the device, for a while, to one queue
(process) at a time, and implements this service model by
associating every queue with a budget, measured in number of
sectors.
- After a queue is granted access to the device, the budget of the
queue is decremented, on each request dispatch, by the size of the
request.
- The in-service queue is expired, i.e., its service is suspended,
only if one of the following events occurs: 1) the queue finishes
its budget, 2) the queue empties, 3) a "budget timeout" fires.
- The budget timeout prevents processes doing random I/O from
holding the device for too long and dramatically reducing
throughput.
- Actually, as in CFQ, a queue associated with a process issuing
sync requests may not be expired immediately when it empties. In
contrast, BFQ may idle the device for a short time interval,
giving the process the chance to go on being served if it issues
a new request in time. Device idling typically boosts the
throughput on rotational devices, if processes do synchronous
and sequential I/O. In addition, under BFQ, device idling is
also instrumental in guaranteeing the desired throughput
fraction to processes issuing sync requests (see [2] for
details).
- With respect to idling for service guarantees, if several
processes are competing for the device at the same time, but
all processes (and groups, after the following commit) have
the same weight, then BFQ guarantees the expected throughput
distribution without ever idling the device. Throughput is
thus as high as possible in this common scenario.
- Queues are scheduled according to a variant of WF2Q+, named
B-WF2Q+, and implemented using an augmented rb-tree to preserve an
O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
also ready for hierarchical scheduling. However, for a cleaner
logical breakdown, the code that enables and completes
hierarchical support is provided in the next commit, which focuses
exactly on this feature.
- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
perfectly fair, and smooth service. In particular, B-WF2Q+
guarantees that each queue receives a fraction of the device
throughput proportional to its weight, even if the throughput
fluctuates, and regardless of: the device parameters, the current
workload and the budgets assigned to the queue.
- The last, budget-independence, property (although probably
counterintuitive in the first place) is definitely beneficial, for
the following reasons:
- First, with any proportional-share scheduler, the maximum
deviation with respect to an ideal service is proportional to
the maximum budget (slice) assigned to queues. As a consequence,
BFQ can keep this deviation tight not only because of the
accurate service of B-WF2Q+, but also because BFQ *does not*
need to assign a larger budget to a queue to let the queue
receive a higher fraction of the device throughput.
- Second, BFQ is free to choose, for every process (queue), the
budget that best fits the needs of the process, or best
leverages the I/O pattern of the process. In particular, BFQ
updates queue budgets with a simple feedback-loop algorithm that
allows a high throughput to be achieved, while still providing
tight latency guarantees to time-sensitive applications. When
the in-service queue expires, this algorithm computes the next
budget of the queue so as to:
- Let large budgets be eventually assigned to the queues
associated with I/O-bound applications performing sequential
I/O: in fact, the longer these applications are served once
got access to the device, the higher the throughput is.
- Let small budgets be eventually assigned to the queues
associated with time-sensitive applications (which typically
perform sporadic and short I/O), because, the smaller the
budget assigned to a queue waiting for service is, the sooner
B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
- Weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = 10 * (IOPRIO_BE_NR - ioprio).
The next patch provides, instead, a cgroups interface through which
weights can be assigned explicitly.
- If several processes are competing for the device at the same time,
but all processes and groups have the same weight, then BFQ
guarantees the expected throughput distribution without ever idling
the device. It uses preemption instead. Throughput is then much
higher in this common scenario.
- ioprio classes are served in strict priority order, i.e.,
lower-priority queues are not served as long as there are
higher-priority queues. Among queues in the same class, the
bandwidth is distributed in proportion to the weight of each
queue. A very thin extra bandwidth is however guaranteed to the Idle
class, to prevent it from starving.
- If the strict_guarantees parameter is set (default: unset), then BFQ
- always performs idling when the in-service queue becomes empty;
- forces the device to serve one I/O request at a time, by
dispatching a new request only if there is no outstanding
request.
In the presence of differentiated weights or I/O-request sizes,
both the above conditions are needed to guarantee that every
queue receives its allotted share of the bandwidth (see
Documentation/block/bfq-iosched.txt for more details). Setting
strict_guarantees may evidently affect throughput.
[1] https://lkml.org/lkml/2008/4/1/234
https://lkml.org/lkml/2008/11/11/148
[2] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
Documentation/block/00-INDEX | 2 +
Documentation/block/bfq-iosched.txt | 517 +++++
block/Kconfig.iosched | 11 +
block/Makefile | 1 +
block/bfq-iosched.c | 4166 +++++++++++++++++++++++++++++++++++
5 files changed, 4697 insertions(+)
create mode 100644 Documentation/block/bfq-iosched.txt
create mode 100644 block/bfq-iosched.c
diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index e55103a..8d55b4b 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -1,5 +1,7 @@
00-INDEX
- This file
+bfq-iosched.txt
+ - BFQ IO scheduler and its tunables
biodoc.txt
- Notes on the Generic Block Layer Rewrite in Linux 2.5
biovecs.txt
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
new file mode 100644
index 0000000..cbf85f6f
--- /dev/null
+++ b/Documentation/block/bfq-iosched.txt
@@ -0,0 +1,517 @@
+BFQ (Budget Fair Queueing)
+==========================
+
+BFQ is a proportional-share I/O scheduler, with some extra
+low-latency capabilities. In addition to cgroups support (blkio or io
+controllers), BFQ's main features are:
+- BFQ guarantees a high system and application responsiveness, and a
+ low latency for time-sensitive applications, such as audio or video
+ players;
+- BFQ distributes bandwidth, and not just time, among processes or
+ groups (switching back to time distribution when needed to keep
+ throughput high).
+
+On average CPUs, the current version of BFQ can handle devices
+performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
+reference, 30-50 KIOPS correspond to very high bandwidths with
+sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
+to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
+multi-queue devices.
+
+The table of contents follow. Impatients can just jump to Section 3.
+
+CONTENTS
+
+1. When may BFQ be useful?
+ 1-1 Personal systems
+ 1-2 Server systems
+2. How does BFQ work?
+3. What are BFQ's tunable?
+4. BFQ group scheduling
+ 4-1 Service guarantees provided
+ 4-2 Interface
+
+1. When may BFQ be useful?
+==========================
+
+BFQ provides the following benefits on personal and server systems.
+
+1-1 Personal systems
+--------------------
+
+Low latency for interactive applications
+
+Regardless of the actual background workload, BFQ guarantees that, for
+interactive tasks, the storage device is virtually as responsive as if
+it was idle. For example, even if one or more of the following
+background workloads are being executed:
+- one or more large files are being read, written or copied,
+- a tree of source files is being compiled,
+- one or more virtual machines are performing I/O,
+- a software update is in progress,
+- indexing daemons are scanning filesystems and updating their
+ databases,
+starting an application or loading a file from within an application
+takes about the same time as if the storage device was idle. As a
+comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
+applications experience high latencies, or even become unresponsive
+until the background workload terminates (also on SSDs).
+
+Low latency for soft real-time applications
+
+Also soft real-time applications, such as audio and video
+players/streamers, enjoy a low latency and a low drop rate, regardless
+of the background I/O workload. As a consequence, these applications
+do not suffer from almost any glitch due to the background workload.
+
+Higher speed for code-development tasks
+
+If some additional workload happens to be executed in parallel, then
+BFQ executes the I/O-related components of typical code-development
+tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
+NOOP or DEADLINE.
+
+High throughput
+
+On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
+up to 150% higher throughput than DEADLINE and NOOP, with all the
+sequential workloads considered in our tests. With random workloads,
+and with all the workloads on flash-based devices, BFQ achieves,
+instead, about the same throughput as the other schedulers.
+
+Strong fairness, bandwidth and delay guarantees
+
+BFQ distributes the device throughput, and not just the device time,
+among I/O-bound applications in proportion their weights, with any
+workload and regardless of the device parameters. From these bandwidth
+guarantees, it is possible to compute tight per-I/O-request delay
+guarantees by a simple formula. If not configured for strict service
+guarantees, BFQ switches to time-based resource sharing (only) for
+applications that would otherwise cause a throughput loss.
+
+1-2 Server systems
+------------------
+
+Most benefits for server systems follow from the same service
+properties as above. In particular, regardless of whether additional,
+possibly heavy workloads are being served, BFQ guarantees:
+
+. audio and video-streaming with zero or very low jitter and drop
+ rate;
+
+. fast retrieval of WEB pages and embedded objects;
+
+. real-time recording of data in live-dumping applications (e.g.,
+ packet logging);
+
+. responsiveness in local and remote access to a server.
+
+
+2. How does BFQ work?
+=====================
+
+BFQ is a proportional-share I/O scheduler, whose general structure,
+plus a lot of code, are borrowed from CFQ.
+
+- Each process doing I/O on a device is associated with a weight and a
+ (bfq_)queue.
+
+- BFQ grants exclusive access to the device, for a while, to one queue
+ (process) at a time, and implements this service model by
+ associating every queue with a budget, measured in number of
+ sectors.
+
+ - After a queue is granted access to the device, the budget of the
+ queue is decremented, on each request dispatch, by the size of the
+ request.
+
+ - The in-service queue is expired, i.e., its service is suspended,
+ only if one of the following events occurs: 1) the queue finishes
+ its budget, 2) the queue empties, 3) a "budget timeout" fires.
+
+ - The budget timeout prevents processes doing random I/O from
+ holding the device for too long and dramatically reducing
+ throughput.
+
+ - Actually, as in CFQ, a queue associated with a process issuing
+ sync requests may not be expired immediately when it empties. In
+ contrast, BFQ may idle the device for a short time interval,
+ giving the process the chance to go on being served if it issues
+ a new request in time. Device idling typically boosts the
+ throughput on rotational devices, if processes do synchronous
+ and sequential I/O. In addition, under BFQ, device idling is
+ also instrumental in guaranteeing the desired throughput
+ fraction to processes issuing sync requests (see the description
+ of the slice_idle tunable in this document, or [1, 2], for more
+ details).
+
+ - With respect to idling for service guarantees, if several
+ processes are competing for the device at the same time, but
+ all processes (and groups, after the following commit) have
+ the same weight, then BFQ guarantees the expected throughput
+ distribution without ever idling the device. Throughput is
+ thus as high as possible in this common scenario.
+
+ - If low-latency mode is enabled (default configuration), BFQ
+ executes some special heuristics to detect interactive and soft
+ real-time applications (e.g., video or audio players/streamers),
+ and to reduce their latency. The most important action taken to
+ achieve this goal is to give to the queues associated with these
+ applications more than their fair share of the device
+ throughput. For brevity, we call just "weight-raising" the whole
+ sets of actions taken by BFQ to privilege these queues. In
+ particular, BFQ provides a milder form of weight-raising for
+ interactive applications, and a stronger form for soft real-time
+ applications.
+
+ - BFQ automatically deactivates idling for queues born in a burst of
+ queue creations. In fact, these queues are usually associated with
+ the processes of applications and services that benefit mostly
+ from a high throughput. Examples are systemd during boot, or git
+ grep.
+
+ - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
+ performing random I/O that becomes mostly sequential if
+ merged. Differently from CFQ, BFQ achieves this goal with a more
+ reactive mechanism, called Early Queue Merge (EQM). EQM is so
+ responsive in detecting interleaved I/O (cooperating processes),
+ that it enables BFQ to achieve a high throughput, by queue
+ merging, even for queues for which CFQ needs a different
+ mechanism, preemption, to get a high throughput. As such EQM is a
+ unified mechanism to achieve a high throughput with interleaved
+ I/O.
+
+ - Queues are scheduled according to a variant of WF2Q+, named
+ B-WF2Q+, and implemented using an augmented rb-tree to preserve an
+ O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
+ also ready for hierarchical scheduling. However, for a cleaner
+ logical breakdown, the code that enables and completes
+ hierarchical support is provided in the next commit, which focuses
+ exactly on this feature.
+
+ - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
+ perfectly fair, and smooth service. In particular, B-WF2Q+
+ guarantees that each queue receives a fraction of the device
+ throughput proportional to its weight, even if the throughput
+ fluctuates, and regardless of: the device parameters, the current
+ workload and the budgets assigned to the queue.
+
+ - The last, budget-independence, property (although probably
+ counterintuitive in the first place) is definitely beneficial, for
+ the following reasons:
+
+ - First, with any proportional-share scheduler, the maximum
+ deviation with respect to an ideal service is proportional to
+ the maximum budget (slice) assigned to queues. As a consequence,
+ BFQ can keep this deviation tight not only because of the
+ accurate service of B-WF2Q+, but also because BFQ *does not*
+ need to assign a larger budget to a queue to let the queue
+ receive a higher fraction of the device throughput.
+
+ - Second, BFQ is free to choose, for every process (queue), the
+ budget that best fits the needs of the process, or best
+ leverages the I/O pattern of the process. In particular, BFQ
+ updates queue budgets with a simple feedback-loop algorithm that
+ allows a high throughput to be achieved, while still providing
+ tight latency guarantees to time-sensitive applications. When
+ the in-service queue expires, this algorithm computes the next
+ budget of the queue so as to:
+
+ - Let large budgets be eventually assigned to the queues
+ associated with I/O-bound applications performing sequential
+ I/O: in fact, the longer these applications are served once
+ got access to the device, the higher the throughput is.
+
+ - Let small budgets be eventually assigned to the queues
+ associated with time-sensitive applications (which typically
+ perform sporadic and short I/O), because, the smaller the
+ budget assigned to a queue waiting for service is, the sooner
+ B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
+
+- If several processes are competing for the device at the same time,
+ but all processes and groups have the same weight, then BFQ
+ guarantees the expected throughput distribution without ever idling
+ the device. It uses preemption instead. Throughput is then much
+ higher in this common scenario.
+
+- ioprio classes are served in strict priority order, i.e.,
+ lower-priority queues are not served as long as there are
+ higher-priority queues. Among queues in the same class, the
+ bandwidth is distributed in proportion to the weight of each
+ queue. A very thin extra bandwidth is however guaranteed to
+ the Idle class, to prevent it from starving.
+
+
+3. What are BFQ's tunable?
+==========================
+
+The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
+fifo_expire_sync below are the same as in CFQ. Their description is
+just copied from that for CFQ. Some considerations in the description
+of slice_idle are copied from CFQ too.
+
+per-process ioprio and weight
+-----------------------------
+
+Unless the cgroups interface is used, weights can be assigned to
+processes only indirectly, through I/O priorities, and according to
+the relation: weight = (IOPRIO_BE_NR - ioprio) * 10.
+
+slice_idle
+----------
+
+This parameter specifies how long BFQ should idle for next I/O
+request, when certain sync BFQ queues become empty. By default
+slice_idle is a non-zero value. Idling has a double purpose: boosting
+throughput and making sure that the desired throughput distribution is
+respected (see the description of how BFQ works, and, if needed, the
+papers referred there).
+
+As for throughput, idling can be very helpful on highly seeky media
+like single spindle SATA/SAS disks where we can cut down on overall
+number of seeks and see improved throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues and one
+should see an overall improved throughput on faster storage devices
+like multiple SATA/SAS disks in hardware RAID configuration.
+
+So depending on storage and workload, it might be useful to set
+slice_idle=0. In general for SATA/SAS disks and software RAID of
+SATA/SAS disks keeping slice_idle enabled should be useful. For any
+configurations where there are multiple spindles behind single LUN
+(Host based hardware RAID controller or for storage arrays), setting
+slice_idle=0 might end up in better throughput and acceptable
+latencies.
+
+Idling is however necessary to have service guarantees enforced in
+case of differentiated weights or differentiated I/O-request lengths.
+To see why, suppose that a given BFQ queue A must get several I/O
+requests served for each request served for another queue B. Idling
+ensures that, if A makes a new I/O request slightly after becoming
+empty, then no request of B is dispatched in the middle, and thus A
+does not lose the possibility to get more than one request dispatched
+before the next request of B is dispatched. Note that idling
+guarantees the desired differentiated treatment of queues only in
+terms of I/O-request dispatches. To guarantee that the actual service
+order then corresponds to the dispatch order, the strict_guarantees
+tunable must be set too.
+
+There is an important flipside for idling: apart from the above cases
+where it is beneficial also for throughput, idling can severely impact
+throughput. One important case is random workload. Because of this
+issue, BFQ tends to avoid idling as much as possible, when it is not
+beneficial also for throughput. As a consequence of this behavior, and
+of further issues described for the strict_guarantees tunable,
+short-term service guarantees may be occasionally violated. And, in
+some cases, these guarantees may be more important than guaranteeing
+maximum throughput. For example, in video playing/streaming, a very
+low drop rate may be more important than maximum throughput. In these
+cases, consider setting the strict_guarantees parameter.
+
+strict_guarantees
+-----------------
+
+If this parameter is set (default: unset), then BFQ
+
+- always performs idling when the in-service queue becomes empty;
+
+- forces the device to serve one I/O request at a time, by dispatching a
+ new request only if there is no outstanding request.
+
+In the presence of differentiated weights or I/O-request sizes, both
+the above conditions are needed to guarantee that every BFQ queue
+receives its allotted share of the bandwidth. The first condition is
+needed for the reasons explained in the description of the slice_idle
+tunable. The second condition is needed because all modern storage
+devices reorder internally-queued requests, which may trivially break
+the service guarantees enforced by the I/O scheduler.
+
+Setting strict_guarantees may evidently affect throughput.
+
+back_seek_max
+-------------
+
+This specifies, given in Kbytes, the maximum "distance" for backward seeking.
+The distance is the amount of space from the current head location to the
+sectors that are backward in terms of distance.
+
+This parameter allows the scheduler to anticipate requests in the "backward"
+direction and consider them as being the "next" if they are within this
+distance from the current head location.
+
+back_seek_penalty
+-----------------
+
+This parameter is used to compute the cost of backward seeking. If the
+backward distance of request is just 1/back_seek_penalty from a "front"
+request, then the seeking cost of two requests is considered equivalent.
+
+So scheduler will not bias toward one or the other request (otherwise scheduler
+will bias toward front request). Default value of back_seek_penalty is 2.
+
+fifo_expire_async
+-----------------
+
+This parameter is used to set the timeout of asynchronous requests. Default
+value of this is 248ms.
+
+fifo_expire_sync
+----------------
+
+This parameter is used to set the timeout of synchronous requests. Default
+value of this is 124ms. In case to favor synchronous requests over asynchronous
+one, this value should be decreased relative to fifo_expire_async.
+
+low_latency
+-----------
+
+This parameter is used to enable/disable BFQ's low latency mode. By
+default, low latency mode is enabled. If enabled, interactive and soft
+real-time applications are privileged and experience a lower latency,
+as explained in more detail in the description of how BFQ works.
+
+timeout_sync
+------------
+
+Maximum amount of device time that can be given to a task (queue) once
+it has been selected for service. On devices with costly seeks,
+increasing this time usually increases maximum throughput. On the
+opposite end, increasing this time coarsens the granularity of the
+short-term bandwidth and latency guarantees, especially if the
+following parameter is set to zero.
+
+max_budget
+----------
+
+Maximum amount of service, measured in sectors, that can be provided
+to a BFQ queue once it is set in service (of course within the limits
+of the above timeout). According to what said in the description of
+the algorithm, larger values increase the throughput in proportion to
+the percentage of sequential I/O requests issued. The price of larger
+values is that they coarsen the granularity of short-term bandwidth
+and latency guarantees.
+
+The default value is 0, which enables auto-tuning: BFQ sets max_budget
+to the maximum number of sectors that can be served during
+timeout_sync, according to the estimated peak rate.
+
+weights
+-------
+
+Read-only parameter, used to show the weights of the currently active
+BFQ queues.
+
+
+wr_ tunables
+------------
+
+BFQ exports a few parameters to control/tune the behavior of
+low-latency heuristics.
+
+wr_coeff
+
+Factor by which the weight of a weight-raised queue is multiplied. If
+the queue is deemed soft real-time, then the weight is further
+multiplied by an additional, constant factor.
+
+wr_max_time
+
+Maximum duration of a weight-raising period for an interactive task
+(ms). If set to zero (default value), then this value is computed
+automatically, as a function of the peak rate of the device. In any
+case, when the value of this parameter is read, it always reports the
+current duration, regardless of whether it has been set manually or
+computed automatically.
+
+wr_max_softrt_rate
+
+Maximum service rate below which a queue is deemed to be associated
+with a soft real-time application, and is then weight-raised
+accordingly (sectors/sec).
+
+wr_min_idle_time
+
+Minimum idle period after which interactive weight-raising may be
+reactivated for a queue (in ms).
+
+wr_rt_max_time
+
+Maximum weight-raising duration for soft real-time queues (in ms). The
+start time from which this duration is considered is automatically
+moved forward if the queue is detected to be still soft real-time
+before the current soft real-time weight-raising period finishes.
+
+wr_min_inter_arr_async
+
+Minimum period between I/O request arrivals after which weight-raising
+may be reactivated for an already busy async queue (in ms).
+
+
+4. Group scheduling with BFQ
+============================
+
+BFQ supports both cgroup-v1 and cgroup-v2 io controllers, namely blkio
+and io. In particular, BFQ supports weight-based proportional
+share.
+
+4-1 Service guarantees provided
+-------------------------------
+
+With BFQ, proportional share means true proportional share of the
+device bandwidth, according to group weights. For example, a group
+with weight 200 gets twice the bandwidth, and not just twice the time,
+of a group with weight 100.
+
+BFQ supports hierarchies (group trees) of any depth. Bandwidth is
+distributed among groups and processes in the expected way: for each
+group, the children of the group share the whole bandwidth of the
+group in proportion to their weights. In particular, this implies
+that, for each leaf group, every process of the group receives the
+same share of the whole group bandwidth, unless the ioprio of the
+process is modified.
+
+The resource-sharing guarantee for a group may partially or totally
+switch from bandwidth to time, if providing bandwidth guarantees to
+the group lowers the throughput too much. This switch occurs on a
+per-process basis: if a process of a leaf group causes throughput loss
+if served in such a way to receive its share of the bandwidth, then
+BFQ switches back to just time-based proportional share for that
+process.
+
+4-2 Interface
+-------------
+
+To get proportional sharing of bandwidth with BFQ for a given device,
+BFQ must of course be the active scheduler for that device.
+
+Within each group directory, the names of the files associated with
+BFQ-specific cgroup parameters and stats begin with the "bfq."
+prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
+BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
+parameter to set the weight of a group with BFQ is blkio.bfq.weight
+or io.bfq.weight.
+
+Parameters to set
+-----------------
+
+For each group, there is only the following parameter to set.
+
+weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
+group inside its parent. Available values: 1..10000 (default 100). The
+linear mapping between ioprio and weights, described at the beginning
+of the tunable section, is still valid, but all weights higher than
+IOPRIO_BE_NR*10 are mapped to ioprio 0.
+
+
+[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
+ Scheduler", Proceedings of the First Workshop on Mobile System
+ Technologies (MST-2015), May 2015.
+ http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
+
+[2] P. Valente and M. Andreolini, "Improving Application
+ Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
+ the 5th Annual International Systems and Storage Conference
+ (SYSTOR '12), June 2012.
+ Slightly extended version:
+ http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
+ results.pdf
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 58fc868..562e30e 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -69,6 +69,17 @@ config MQ_IOSCHED_DEADLINE
---help---
MQ version of the deadline IO scheduler.
+config IOSCHED_BFQ
+ tristate "BFQ I/O scheduler"
+ default n
+ ---help---
+ BFQ I/O scheduler for BLK-MQ. BFQ distributes the bandwidth of
+ of the device among all processes according to their weights,
+ regardless of the device parameters and with any workload. It
+ also guarantees a low latency to interactive and soft
+ real-time applications. Details in
+ Documentation/block/bfq-iosched.txt
+
endmenu
endif
diff --git a/block/Makefile b/block/Makefile
index 081bb68..91869f2 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
+obj-$(CONFIG_IOSCHED_BFQ) += bfq-iosched.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
new file mode 100644
index 0000000..c4e7d8d
--- /dev/null
+++ b/block/bfq-iosched.c
@@ -0,0 +1,4166 @@
+/*
+ * Budget Fair Queueing (BFQ) I/O scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ * Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
+ * Arianna Avanzini <avanzini@google.com>
+ *
+ * Copyright (C) 2017 Paolo Valente <paolo.valente@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * BFQ is a proportional-share I/O scheduler, with some extra
+ * low-latency capabilities. BFQ also supports full hierarchical
+ * scheduling through cgroups. Next paragraphs provide an introduction
+ * on BFQ inner workings. Details on BFQ benefits, usage and
+ * limitations can be found in Documentation/block/bfq-iosched.txt.
+ *
+ * BFQ is a proportional-share storage-I/O scheduling algorithm based
+ * on the slice-by-slice service scheme of CFQ. But BFQ assigns
+ * budgets, measured in number of sectors, to processes instead of
+ * time slices. The device is not granted to the in-service process
+ * for a given time slice, but until it has exhausted its assigned
+ * budget. This change from the time to the service domain enables BFQ
+ * to distribute the device throughput among processes as desired,
+ * without any distortion due to throughput fluctuations, or to device
+ * internal queueing. BFQ uses an ad hoc internal scheduler, called
+ * B-WF2Q+, to schedule processes according to their budgets. More
+ * precisely, BFQ schedules queues associated with processes. Each
+ * process/queue is assigned a user-configurable weight, and B-WF2Q+
+ * guarantees that each queue receives a fraction of the throughput
+ * proportional to its weight. Thanks to the accurate policy of
+ * B-WF2Q+, BFQ can afford to assign high budgets to I/O-bound
+ * processes issuing sequential requests (to boost the throughput),
+ * and yet guarantee a low latency to interactive and soft real-time
+ * applications.
+ *
+ * In particular, to provide these low-latency guarantees, BFQ
+ * explicitly privileges the I/O of two classes of time-sensitive
+ * applications: interactive and soft real-time. This feature enables
+ * BFQ to provide applications in these classes with a very low
+ * latency. Finally, BFQ also features additional heuristics for
+ * preserving both a low latency and a high throughput on NCQ-capable,
+ * rotational or flash-based devices, and to get the job done quickly
+ * for applications consisting in many I/O-bound processes.
+ *
+ * BFQ is described in [1], where also a reference to the initial, more
+ * theoretical paper on BFQ can be found. The interested reader can find
+ * in the latter paper full details on the main algorithm, as well as
+ * formulas of the guarantees and formal proofs of all the properties.
+ * With respect to the version of BFQ presented in these papers, this
+ * implementation adds a few more heuristics, such as the one that
+ * guarantees a low latency to soft real-time applications, and a
+ * hierarchical extension based on H-WF2Q+.
+ *
+ * B-WF2Q+ is based on WF2Q+, which is described in [2], together with
+ * H-WF2Q+, while the augmented tree used here to implement B-WF2Q+
+ * with O(log N) complexity derives from the one introduced with EEVDF
+ * in [3].
+ *
+ * [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
+ * Scheduler", Proceedings of the First Workshop on Mobile System
+ * Technologies (MST-2015), May 2015.
+ * http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
+ *
+ * [2] Jon C.R. Bennett and H. Zhang, "Hierarchical Packet Fair Queueing
+ * Algorithms", IEEE/ACM Transactions on Networking, 5(5):675-689,
+ * Oct 1997.
+ *
+ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
+ *
+ * [3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline
+ * First: A Flexible and Accurate Mechanism for Proportional Share
+ * Resource Allocation", technical report.
+ *
+ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
+#include <linux/ktime.h>
+#include <linux/rbtree.h>
+#include <linux/ioprio.h>
+#include <linux/sbitmap.h>
+#include <linux/delay.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
+#include <linux/blktrace_api.h>
+#include <linux/hrtimer.h>
+#include <linux/blk-cgroup.h>
+
+#define BFQ_IOPRIO_CLASSES 3
+#define BFQ_CL_IDLE_TIMEOUT (HZ/5)
+
+#define BFQ_MIN_WEIGHT 1
+#define BFQ_MAX_WEIGHT 1000
+#define BFQ_WEIGHT_CONVERSION_COEFF 10
+
+#define BFQ_DEFAULT_QUEUE_IOPRIO 4
+
+#define BFQ_DEFAULT_GRP_WEIGHT 10
+#define BFQ_DEFAULT_GRP_IOPRIO 0
+#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
+
+struct bfq_entity;
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree. All the fields are protected by the queue lock
+ * of the containing bfqd.
+ */
+struct bfq_service_tree {
+ /* tree for active entities (i.e., those backlogged) */
+ struct rb_root active;
+ /* tree for idle entities (i.e., not backlogged, with V <= F_i)*/
+ struct rb_root idle;
+
+ /* idle entity with minimum F_i */
+ struct bfq_entity *first_idle;
+ /* idle entity with maximum F_i */
+ struct bfq_entity *last_idle;
+
+ /* scheduler virtual time */
+ u64 vtime;
+ /* scheduler weight sum; active and idle entities contribute to it */
+ unsigned long wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ *
+ * bfq_sched_data is the basic scheduler queue. It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_in_service points to the active entity of the sched_data
+ * service trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct bfq_sched_data {
+ /* entity in service */
+ struct bfq_entity *in_service_entity;
+ /* head-of-the-line entity in the scheduler */
+ struct bfq_entity *next_in_service;
+ /* array of service trees, one per ioprio_class */
+ struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ *
+ * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
+ * level scheduler). Each entity belongs to the sched_data of the parent
+ * group hierarchy. Non-leaf entities have also their own sched_data,
+ * stored in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would
+ * allow different weights on different devices, but this
+ * functionality is not exported to userspace by now. Priorities and
+ * weights are updated lazily, first storing the new values into the
+ * new_* fields, then setting the @prio_changed flag. As soon as
+ * there is a transition in the entity state that allows the priority
+ * update to take place the effective and the requested priority
+ * values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ. When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget
+ * and have true sequential behavior, and when there are no external
+ * factors breaking anticipation) the relative weights at each level
+ * of the hierarchy should be guaranteed. All the fields are
+ * protected by the queue lock of the containing bfqd.
+ */
+struct bfq_entity {
+ /* service_tree member */
+ struct rb_node rb_node;
+
+ /*
+ * flag, true if the entity is on a tree (either the active or
+ * the idle one of its service_tree).
+ */
+ int on_st;
+
+ /* B-WF2Q+ start and finish timestamps [sectors/weight] */
+ u64 start, finish;
+
+ /* tree the entity is enqueued into; %NULL if not on a tree */
+ struct rb_root *tree;
+
+ /*
+ * minimum start time of the (active) subtree rooted at this
+ * entity; used for O(log N) lookups into active trees
+ */
+ u64 min_start;
+
+ /* amount of service received during the last service slot */
+ int service;
+
+ /* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */
+ int budget;
+
+ /* weight of the queue */
+ int weight;
+ /* next weight if a change is in progress */
+ int new_weight;
+
+ /* original weight, used to implement weight boosting */
+ int orig_weight;
+
+ /* parent entity, for hierarchical scheduling */
+ struct bfq_entity *parent;
+
+ /*
+ * For non-leaf nodes in the hierarchy, the associated
+ * scheduler queue, %NULL on leaf nodes.
+ */
+ struct bfq_sched_data *my_sched_data;
+ /* the scheduler queue this entity belongs to */
+ struct bfq_sched_data *sched_data;
+
+ /* flag, set to request a weight, ioprio or ioprio_class change */
+ int prio_changed;
+};
+
+/**
+ * struct bfq_ttime - per process thinktime stats.
+ */
+struct bfq_ttime {
+ /* completion time of the last request */
+ u64 last_end_request;
+
+ /* total process thinktime */
+ u64 ttime_total;
+ /* number of thinktime samples */
+ unsigned long ttime_samples;
+ /* average process thinktime */
+ u64 ttime_mean;
+};
+
+/**
+ * struct bfq_queue - leaf schedulable entity.
+ *
+ * A bfq_queue is a leaf request queue; it can be associated with an
+ * io_context or more, if it is async.
+ */
+struct bfq_queue {
+ /* reference counter */
+ int ref;
+ /* parent bfq_data */
+ struct bfq_data *bfqd;
+
+ /* current ioprio and ioprio class */
+ unsigned short ioprio, ioprio_class;
+ /* next ioprio and ioprio class if a change is in progress */
+ unsigned short new_ioprio, new_ioprio_class;
+
+ /* sorted list of pending requests */
+ struct rb_root sort_list;
+ /* if fifo isn't expired, next request to serve */
+ struct request *next_rq;
+ /* number of sync and async requests queued */
+ int queued[2];
+ /* number of requests currently allocated */
+ int allocated;
+ /* number of pending metadata requests */
+ int meta_pending;
+ /* fifo list of requests in sort_list */
+ struct list_head fifo;
+
+ /* entity representing this queue in the scheduler */
+ struct bfq_entity entity;
+
+ /* maximum budget allowed from the feedback mechanism */
+ int max_budget;
+ /* budget expiration (in jiffies) */
+ unsigned long budget_timeout;
+
+ /* number of requests on the dispatch list or inside driver */
+ int dispatched;
+
+ /* status flags */
+ unsigned long flags;
+
+ /* node for active/idle bfqq list inside parent bfqd */
+ struct list_head bfqq_list;
+
+ /* associated @bfq_ttime struct */
+ struct bfq_ttime ttime;
+
+ /* bit vector: a 1 for each seeky requests in history */
+ u32 seek_history;
+ /* position of the last request enqueued */
+ sector_t last_request_pos;
+
+ /* Number of consecutive pairs of request completion and
+ * arrival, such that the queue becomes idle after the
+ * completion, but the next request arrives within an idle
+ * time slice; used only if the queue's IO_bound flag has been
+ * cleared.
+ */
+ unsigned int requests_within_timer;
+
+ /* pid of the process owning the queue, used for logging purposes */
+ pid_t pid;
+};
+
+/**
+ * struct bfq_io_cq - per (request_queue, io_context) structure.
+ */
+struct bfq_io_cq {
+ /* associated io_cq structure */
+ struct io_cq icq; /* must be the first member */
+ /* array of two process queues, the sync and the async */
+ struct bfq_queue *bfqq[2];
+ /* per (request_queue, blkcg) ioprio */
+ int ioprio;
+};
+
+/**
+ * struct bfq_data - per-device data structure.
+ *
+ * All the fields are protected by @lock.
+ */
+struct bfq_data {
+ /* device request queue */
+ struct request_queue *queue;
+ /* dispatch queue */
+ struct list_head dispatch;
+
+ /* root @bfq_sched_data for the device */
+ struct bfq_sched_data sched_data;
+
+ /*
+ * Number of bfq_queues containing requests (including the
+ * queue in service, even if it is idling).
+ */
+ int busy_queues;
+ /* number of queued requests */
+ int queued;
+ /* number of requests dispatched and waiting for completion */
+ int rq_in_driver;
+
+ /*
+ * Maximum number of requests in driver in the last
+ * @hw_tag_samples completed requests.
+ */
+ int max_rq_in_driver;
+ /* number of samples used to calculate hw_tag */
+ int hw_tag_samples;
+ /* flag set to one if the driver is showing a queueing behavior */
+ int hw_tag;
+
+ /* number of budgets assigned */
+ int budgets_assigned;
+
+ /*
+ * Timer set when idling (waiting) for the next request from
+ * the queue in service.
+ */
+ struct hrtimer idle_slice_timer;
+
+ /* bfq_queue in service */
+ struct bfq_queue *in_service_queue;
+ /* bfq_io_cq (bic) associated with the @in_service_queue */
+ struct bfq_io_cq *in_service_bic;
+
+ /* on-disk position of the last served request */
+ sector_t last_position;
+
+ /* beginning of the last budget */
+ ktime_t last_budget_start;
+ /* beginning of the last idle slice */
+ ktime_t last_idling_start;
+ /* number of samples used to calculate @peak_rate */
+ int peak_rate_samples;
+ /*
+ * Peak read/write rate, observed during the service of a
+ * budget [BFQ_RATE_SHIFT * sectors/usec]. The value is
+ * left-shifted by BFQ_RATE_SHIFT to increase precision in
+ * fixed-point calculations.
+ */
+ u64 peak_rate;
+ /* maximum budget allotted to a bfq_queue before rescheduling */
+ int bfq_max_budget;
+
+ /* list of all the bfq_queues active on the device */
+ struct list_head active_list;
+ /* list of all the bfq_queues idle on the device */
+ struct list_head idle_list;
+
+ /*
+ * Timeout for async/sync requests; when it fires, requests
+ * are served in fifo order.
+ */
+ u64 bfq_fifo_expire[2];
+ /* weight of backward seeks wrt forward ones */
+ unsigned int bfq_back_penalty;
+ /* maximum allowed backward seek */
+ unsigned int bfq_back_max;
+ /* maximum idling time */
+ u32 bfq_slice_idle;
+ /* last time CLASS_IDLE was served */
+ u64 bfq_class_idle_last_service;
+
+ /* user-configured max budget value (0 for auto-tuning) */
+ int bfq_user_max_budget;
+ /*
+ * Timeout for bfq_queues to consume their budget; used to
+ * prevent seeky queues from imposing long latencies to
+ * sequential or quasi-sequential ones (this also implies that
+ * seeky queues cannot receive guarantees in the service
+ * domain; after a timeout they are charged for the time they
+ * have been in service, to preserve fairness among them, but
+ * without service-domain guarantees).
+ */
+ unsigned int bfq_timeout;
+
+ /*
+ * Number of consecutive requests that must be issued within
+ * the idle time slice to set again idling to a queue which
+ * was marked as non-I/O-bound (see the definition of the
+ * IO_bound flag for further details).
+ */
+ unsigned int bfq_requests_within_timer;
+
+ /*
+ * Force device idling whenever needed to provide accurate
+ * service guarantees, without caring about throughput
+ * issues. CAVEAT: this may even increase latencies, in case
+ * of useless idling for processes that did stop doing I/O.
+ */
+ bool strict_guarantees;
+
+ /* fallback dummy bfqq for extreme OOM conditions */
+ struct bfq_queue oom_bfqq;
+
+ spinlock_t lock;
+
+ /*
+ * bic associated with the task issuing current bio for
+ * merging. This and the next field are used as a support to
+ * be able to perform the bic lookup, needed by bio-merge
+ * functions, before the scheduler lock is taken, and thus
+ * avoid taking the request-queue lock while the scheduler
+ * lock is being held.
+ */
+ struct bfq_io_cq *bio_bic;
+ /* bfqq associated with the task issuing current bio for merging */
+ struct bfq_queue *bio_bfqq;
+};
+
+enum bfqq_state_flags {
+ BFQQF_busy = 0, /* has requests or is in service */
+ BFQQF_wait_request, /* waiting for a request */
+ BFQQF_non_blocking_wait_rq, /*
+ * waiting for a request
+ * without idling the device
+ */
+ BFQQF_fifo_expire, /* FIFO checked in this slice */
+ BFQQF_idle_window, /* slice idling enabled */
+ BFQQF_sync, /* synchronous queue */
+ BFQQF_budget_new, /* no completion with this budget */
+ BFQQF_IO_bound, /*
+ * bfqq has timed-out at least once
+ * having consumed at most 2/10 of
+ * its budget
+ */
+};
+
+#define BFQ_BFQQ_FNS(name) \
+static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \
+{ \
+ __set_bit(BFQQF_##name, &(bfqq)->flags); \
+} \
+static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \
+{ \
+ __clear_bit(BFQQF_##name, &(bfqq)->flags); \
+} \
+static int bfq_bfqq_##name(const struct bfq_queue *bfqq) \
+{ \
+ return test_bit(BFQQF_##name, &(bfqq)->flags); \
+}
+
+BFQ_BFQQ_FNS(busy);
+BFQ_BFQQ_FNS(wait_request);
+BFQ_BFQQ_FNS(non_blocking_wait_rq);
+BFQ_BFQQ_FNS(fifo_expire);
+BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(sync);
+BFQ_BFQQ_FNS(budget_new);
+BFQ_BFQQ_FNS(IO_bound);
+#undef BFQ_BFQQ_FNS
+
+/* Logging facilities. */
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
+ blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
+
+#define bfq_log(bfqd, fmt, args...) \
+ blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
+
+/* Expiration reasons. */
+enum bfqq_expiration {
+ BFQQE_TOO_IDLE = 0, /*
+ * queue has been idling for
+ * too long
+ */
+ BFQQE_BUDGET_TIMEOUT, /* budget took too long to be used */
+ BFQQE_BUDGET_EXHAUSTED, /* budget consumed */
+ BFQQE_NO_MORE_REQUESTS, /* the queue has no more requests */
+ BFQQE_PREEMPTED /* preemption in progress */
+};
+
+static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
+
+static struct bfq_service_tree *
+bfq_entity_service_tree(struct bfq_entity *entity)
+{
+ struct bfq_sched_data *sched_data = entity->sched_data;
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+ unsigned int idx = bfqq ? bfqq->ioprio_class - 1 :
+ BFQ_DEFAULT_GRP_CLASS - 1;
+
+ return sched_data->service_tree + idx;
+}
+
+static struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync)
+{
+ return bic->bfqq[is_sync];
+}
+
+static void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq,
+ bool is_sync)
+{
+ bic->bfqq[is_sync] = bfqq;
+}
+
+static struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
+{
+ return bic->icq.q->elevator->elevator_data;
+}
+
+static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio);
+static void bfq_put_queue(struct bfq_queue *bfqq);
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+ struct bio *bio, bool is_sync,
+ struct bfq_io_cq *bic);
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+
+/*
+ * Array of async queues for all the processes, one queue
+ * per ioprio value per ioprio_class.
+ */
+struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+/* Async queue for the idle class (ioprio is ignored) */
+struct bfq_queue *async_idle_bfqq;
+
+/* Expiration time of sync (0) and async (1) requests, in ns. */
+static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };
+
+/* Maximum backwards seek (magic number lifted from CFQ), in KiB. */
+static const int bfq_back_max = 16 * 1024;
+
+/* Penalty of a backwards seek, in number of sectors. */
+static const int bfq_back_penalty = 2;
+
+/* Idling period duration, in ns. */
+static u64 bfq_slice_idle = NSEC_PER_SEC / 125;
+
+/* Minimum number of assigned budgets for which stats are safe to compute. */
+static const int bfq_stats_min_budgets = 194;
+
+/* Default maximum budget values, in sectors and number of requests. */
+static const int bfq_default_max_budget = 16 * 1024;
+
+/* Default timeout values, in jiffies, approximating CFQ defaults. */
+static const int bfq_timeout = HZ / 8;
+
+static struct kmem_cache *bfq_pool;
+
+/* Below this threshold (in ms), we consider thinktime immediate. */
+#define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
+
+/* hw_tag detection: parallel requests threshold and min samples needed. */
+#define BFQ_HW_QUEUE_THRESHOLD 4
+#define BFQ_HW_QUEUE_SAMPLES 32
+
+#define BFQQ_SEEK_THR (sector_t)(8 * 100)
+#define BFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
+#define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
+#define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 32/8)
+
+/* Budget feedback step. */
+#define BFQ_BUDGET_STEP 128
+
+/* Min samples used for peak rate estimation (for autotuning). */
+#define BFQ_PEAK_RATE_SAMPLES 32
+
+/* Shift used for peak rate fixed precision calculations. */
+#define BFQ_RATE_SHIFT 16
+
+#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \
+ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0])
+#define RQ_BFQQ(rq) ((rq)->elv.priv[1])
+
+/**
+ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
+ * @icq: the iocontext queue.
+ */
+static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
+{
+ /* bic->icq is the first member, %NULL will convert to %NULL */
+ return container_of(icq, struct bfq_io_cq, icq);
+}
+
+/**
+ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
+ * @bfqd: the lookup key.
+ * @ioc: the io_context of the process doing I/O.
+ * @q: the request queue.
+ */
+static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
+ struct io_context *ioc,
+ struct request_queue *q)
+{
+ if (ioc) {
+ unsigned long flags;
+ struct bfq_io_cq *icq;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ icq = icq_to_bic(ioc_lookup_icq(ioc, q));
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return icq;
+ }
+
+ return NULL;
+}
+
+/*
+ * Next two macros are just fake loops for the moment. They will
+ * become true loops in the cgroups-enabled variant of the code. Such
+ * a variant, in its turn, will be introduced by next commit.
+ */
+#define for_each_entity(entity) \
+ for (; entity ; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+ for (parent = NULL; entity ; entity = parent)
+
+static int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+ return 0;
+}
+
+static void bfq_check_next_in_service(struct bfq_sched_data *sd,
+ struct bfq_entity *entity)
+{
+}
+
+static void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+}
+
+/*
+ * Shift for timestamp calculations. This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time
+ * wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT 22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static int bfq_gt(u64 a, u64 b)
+{
+ return (s64)(a - b) > 0;
+}
+
+static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = NULL;
+
+ if (!entity->my_sched_data)
+ bfqq = container_of(entity, struct bfq_queue, entity);
+
+ return bfqq;
+}
+
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor (weight of an entity or weight sum).
+ */
+static u64 bfq_delta(unsigned long service, unsigned long weight)
+{
+ u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+ do_div(d, weight);
+ return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static void bfq_calc_finish(struct bfq_entity *entity, unsigned long service)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+ entity->finish = entity->start +
+ bfq_delta(service, entity->weight);
+
+ if (bfqq) {
+ bfq_log_bfqq(bfqq->bfqd, bfqq,
+ "calc_finish: serv %lu, w %d",
+ service, entity->weight);
+ bfq_log_bfqq(bfqq->bfqd, bfqq,
+ "calc_finish: start %llu, finish %llu, delta %llu",
+ entity->start, entity->finish,
+ bfq_delta(service, entity->weight));
+ }
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity. This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static struct bfq_entity *bfq_entity_of(struct rb_node *node)
+{
+ struct bfq_entity *entity = NULL;
+
+ if (node)
+ entity = rb_entry(node, struct bfq_entity, rb_node);
+
+ return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static void bfq_extract(struct rb_root *root, struct bfq_entity *entity)
+{
+ entity->tree = NULL;
+ rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct bfq_service_tree *st,
+ struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+ struct rb_node *next;
+
+ if (entity == st->first_idle) {
+ next = rb_next(&entity->rb_node);
+ st->first_idle = bfq_entity_of(next);
+ }
+
+ if (entity == st->last_idle) {
+ next = rb_prev(&entity->rb_node);
+ st->last_idle = bfq_entity_of(next);
+ }
+
+ bfq_extract(&st->idle, entity);
+
+ if (bfqq)
+ list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
+{
+ struct bfq_entity *entry;
+ struct rb_node **node = &root->rb_node;
+ struct rb_node *parent = NULL;
+
+ while (*node) {
+ parent = *node;
+ entry = rb_entry(parent, struct bfq_entity, rb_node);
+
+ if (bfq_gt(entry->finish, entity->finish))
+ node = &parent->rb_left;
+ else
+ node = &parent->rb_right;
+ }
+
+ rb_link_node(&entity->rb_node, parent, node);
+ rb_insert_color(&entity->rb_node, root);
+
+ entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of a entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree. The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static void bfq_update_min(struct bfq_entity *entity, struct rb_node *node)
+{
+ struct bfq_entity *child;
+
+ if (node) {
+ child = rb_entry(node, struct bfq_entity, rb_node);
+ if (bfq_gt(entity->min_start, child->min_start))
+ entity->min_start = child->min_start;
+ }
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value. The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static void bfq_update_active_node(struct rb_node *node)
+{
+ struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
+
+ entity->min_start = entity->start;
+ bfq_update_min(entity, node->rb_right);
+ bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update. This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root. The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+ struct rb_node *parent;
+
+up:
+ bfq_update_active_node(node);
+
+ parent = rb_parent(node);
+ if (!parent)
+ return;
+
+ if (node == parent->rb_left && parent->rb_right)
+ bfq_update_active_node(parent->rb_right);
+ else if (parent->rb_left)
+ bfq_update_active_node(parent->rb_left);
+
+ node = parent;
+ goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its
+ * group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct bfq_service_tree *st,
+ struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+ struct rb_node *node = &entity->rb_node;
+
+ bfq_insert(&st->active, entity);
+
+ if (node->rb_left)
+ node = node->rb_left;
+ else if (node->rb_right)
+ node = node->rb_right;
+
+ bfq_update_active_tree(node);
+
+ if (bfqq)
+ list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static unsigned short bfq_ioprio_to_weight(int ioprio)
+{
+ return (IOPRIO_BE_NR - ioprio) * BFQ_WEIGHT_CONVERSION_COEFF;
+}
+
+/**
+ * bfq_weight_to_ioprio - calc an ioprio from a weight.
+ * @weight: the weight value to convert.
+ *
+ * To preserve as much as possible the old only-ioprio user interface,
+ * 0 is used as an escape ioprio value for weights (numerically) equal or
+ * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.
+ */
+static unsigned short bfq_weight_to_ioprio(int weight)
+{
+ return max_t(int, 0,
+ IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight);
+}
+
+static void bfq_get_entity(struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+ if (bfqq) {
+ bfqq->ref++;
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
+ bfqq, bfqq->ref);
+ }
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch. If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+ struct rb_node *deepest;
+
+ if (!node->rb_right && !node->rb_left)
+ deepest = rb_parent(node);
+ else if (!node->rb_right)
+ deepest = node->rb_left;
+ else if (!node->rb_left)
+ deepest = node->rb_right;
+ else {
+ deepest = rb_next(node);
+ if (deepest->rb_right)
+ deepest = deepest->rb_right;
+ else if (rb_parent(deepest) != node)
+ deepest = rb_parent(deepest);
+ }
+
+ return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct bfq_service_tree *st,
+ struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+ struct rb_node *node;
+
+ node = bfq_find_deepest(&entity->rb_node);
+ bfq_extract(&st->active, entity);
+
+ if (node)
+ bfq_update_active_tree(node);
+
+ if (bfqq)
+ list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct bfq_service_tree *st,
+ struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+ struct bfq_entity *first_idle = st->first_idle;
+ struct bfq_entity *last_idle = st->last_idle;
+
+ if (!first_idle || bfq_gt(first_idle->finish, entity->finish))
+ st->first_idle = entity;
+ if (!last_idle || bfq_gt(entity->finish, last_idle->finish))
+ st->last_idle = entity;
+
+ bfq_insert(&st->idle, entity);
+
+ if (bfqq)
+ list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - do not consider entity any longer for scheduling
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ * @is_in_service: true if entity is currently the in-service entity.
+ *
+ * Forget everything about @entity. In addition, if entity represents
+ * a queue, and the latter is not in service, then release the service
+ * reference to the queue (the one taken through bfq_get_entity). In
+ * fact, in this case, there is really no more service reference to
+ * the queue, as the latter is also outside any service tree. If,
+ * instead, the queue is in service, then __bfq_bfqd_reset_in_service
+ * will take care of putting the reference when the queue finally
+ * stops being served.
+ */
+static void bfq_forget_entity(struct bfq_service_tree *st,
+ struct bfq_entity *entity,
+ bool is_in_service)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+ entity->on_st = 0;
+ st->wsum -= entity->weight;
+ if (bfqq && !is_in_service)
+ bfq_put_queue(bfqq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct bfq_service_tree *st,
+ struct bfq_entity *entity)
+{
+ bfq_idle_extract(st, entity);
+ bfq_forget_entity(st, entity,
+ entity == entity->sched_data->in_service_entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct bfq_service_tree *st)
+{
+ struct bfq_entity *first_idle = st->first_idle;
+ struct bfq_entity *last_idle = st->last_idle;
+
+ if (RB_EMPTY_ROOT(&st->active) && last_idle &&
+ !bfq_gt(last_idle->finish, st->vtime)) {
+ /*
+ * Forget the whole idle tree, increasing the vtime past
+ * the last finish time of idle entities.
+ */
+ st->vtime = last_idle->finish;
+ }
+
+ if (first_idle && !bfq_gt(first_idle->finish, st->vtime))
+ bfq_put_idle_entity(st, first_idle);
+}
+
+static struct bfq_service_tree *
+__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
+ struct bfq_entity *entity)
+{
+ struct bfq_service_tree *new_st = old_st;
+
+ if (entity->prio_changed) {
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+ unsigned short prev_weight, new_weight;
+ struct bfq_data *bfqd = NULL;
+
+ if (bfqq)
+ bfqd = bfqq->bfqd;
+
+ old_st->wsum -= entity->weight;
+
+ if (entity->new_weight != entity->orig_weight) {
+ if (entity->new_weight < BFQ_MIN_WEIGHT ||
+ entity->new_weight > BFQ_MAX_WEIGHT) {
+ pr_crit("update_weight_prio: new_weight %d\n",
+ entity->new_weight);
+ if (entity->new_weight < BFQ_MIN_WEIGHT)
+ entity->new_weight = BFQ_MIN_WEIGHT;
+ else
+ entity->new_weight = BFQ_MAX_WEIGHT;
+ }
+ entity->orig_weight = entity->new_weight;
+ if (bfqq)
+ bfqq->ioprio =
+ bfq_weight_to_ioprio(entity->orig_weight);
+ }
+
+ if (bfqq)
+ bfqq->ioprio_class = bfqq->new_ioprio_class;
+ entity->prio_changed = 0;
+
+ /*
+ * NOTE: here we may be changing the weight too early,
+ * this will cause unfairness. The correct approach
+ * would have required additional complexity to defer
+ * weight changes to the proper time instants (i.e.,
+ * when entity->finish <= old_st->vtime).
+ */
+ new_st = bfq_entity_service_tree(entity);
+
+ prev_weight = entity->weight;
+ new_weight = entity->orig_weight;
+ entity->weight = new_weight;
+
+ new_st->wsum += entity->weight;
+
+ if (new_st != old_st)
+ entity->start = new_st->vtime;
+ }
+
+ return new_st;
+}
+
+/**
+ * bfq_bfqq_served - update the scheduler status after selection for
+ * service.
+ * @bfqq: the queue being served.
+ * @served: bytes to transfer.
+ *
+ * NOTE: this can be optimized, as the timestamps of upper level entities
+ * are synchronized every time a new bfqq is selected for service. By now,
+ * we keep it to better check consistency.
+ */
+static void bfq_bfqq_served(struct bfq_queue *bfqq, int served)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+ struct bfq_service_tree *st;
+
+ for_each_entity(entity) {
+ st = bfq_entity_service_tree(entity);
+
+ entity->service += served;
+
+ st->vtime += bfq_delta(served, st->wsum);
+ bfq_forget_idle(st);
+ }
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %d secs", served);
+}
+
+/**
+ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * @bfqq: the queue that needs a service update.
+ *
+ * When it's not possible to be fair in the service domain, because
+ * a queue is not consuming its budget fast enough (the meaning of
+ * fast depends on the timeout parameter), we charge it a full
+ * budget. In this way we should obtain a sort of time-domain
+ * fairness among all the seeky/slow queues.
+ */
+static void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
+
+ bfq_bfqq_served(bfqq, entity->budget - entity->service);
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ * @non_blocking_wait_rq: true if this entity was waiting for a request
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion. It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct bfq_entity *entity,
+ bool non_blocking_wait_rq)
+{
+ struct bfq_sched_data *sd = entity->sched_data;
+ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+ bool backshifted = false;
+
+ if (entity == sd->in_service_entity) {
+ /*
+ * If we are requeueing the current entity we have
+ * to take care of not charging to it service it has
+ * not received.
+ */
+ bfq_calc_finish(entity, entity->service);
+ entity->start = entity->finish;
+ sd->in_service_entity = NULL;
+ } else if (entity->tree == &st->active) {
+ /*
+ * Requeueing an entity due to a change of some
+ * next_in_service entity below it. We reuse the
+ * old start time.
+ */
+ bfq_active_extract(st, entity);
+ } else {
+ unsigned long long min_vstart;
+
+ /* See comments on bfq_fqq_update_budg_for_activation */
+ if (non_blocking_wait_rq && bfq_gt(st->vtime, entity->finish)) {
+ backshifted = true;
+ min_vstart = entity->finish;
+ } else
+ min_vstart = st->vtime;
+
+ if (entity->tree == &st->idle) {
+ /*
+ * Must be on the idle tree, bfq_idle_extract() will
+ * check for that.
+ */
+ bfq_idle_extract(st, entity);
+ entity->start = bfq_gt(min_vstart, entity->finish) ?
+ min_vstart : entity->finish;
+ } else {
+ /*
+ * The finish time of the entity may be invalid, and
+ * it is in the past for sure, otherwise the queue
+ * would have been on the idle tree.
+ */
+ entity->start = min_vstart;
+ st->wsum += entity->weight;
+ /*
+ * entity is about to be inserted into a service tree,
+ * and then set in service: get a reference to make
+ * sure entity does not disappear until it is no
+ * longer in service or scheduled for service.
+ */
+ bfq_get_entity(entity);
+
+ entity->on_st = 1;
+ }
+ }
+
+ st = __bfq_entity_update_weight_prio(st, entity);
+ bfq_calc_finish(entity, entity->budget);
+
+ /*
+ * If some queues enjoy backshifting for a while, then their
+ * (virtual) finish timestamps may happen to become lower and
+ * lower than the system virtual time. In particular, if
+ * these queues often happen to be idle for short time
+ * periods, and during such time periods other queues with
+ * higher timestamps happen to be busy, then the backshifted
+ * timestamps of the former queues can become much lower than
+ * the system virtual time. In fact, to serve the queues with
+ * higher timestamps while the ones with lower timestamps are
+ * idle, the system virtual time may be pushed-up to much
+ * higher values than the finish timestamps of the idle
+ * queues. As a consequence, the finish timestamps of all new
+ * or newly activated queues may end up being much larger than
+ * those of lucky queues with backshifted timestamps. The
+ * latter queues may then monopolize the device for a lot of
+ * time. This would simply break service guarantees.
+ *
+ * To reduce this problem, push up a little bit the
+ * backshifted timestamps of the queue associated with this
+ * entity (only a queue can happen to have the backshifted
+ * flag set): just enough to let the finish timestamp of the
+ * queue be equal to the current value of the system virtual
+ * time. This may introduce a little unfairness among queues
+ * with backshifted timestamps, but it does not break
+ * worst-case fairness guarantees.
+ */
+ if (backshifted && bfq_gt(st->vtime, entity->finish)) {
+ unsigned long delta = st->vtime - entity->finish;
+
+ entity->start += delta;
+ entity->finish += delta;
+ }
+
+ bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
+ * @entity: the entity to activate.
+ * @non_blocking_wait_rq: true if this entity was waiting for a request
+ *
+ * Activate @entity and all the entities on the path from it to the root.
+ */
+static void bfq_activate_entity(struct bfq_entity *entity,
+ bool non_blocking_wait_rq)
+{
+ struct bfq_sched_data *sd;
+
+ for_each_entity(entity) {
+ __bfq_activate_entity(entity, non_blocking_wait_rq);
+
+ sd = entity->sched_data;
+ if (!bfq_update_next_in_service(sd))
+ /*
+ * No need to propagate the activation to the
+ * upper entities, as they will be updated when
+ * the in-service entity is rescheduled.
+ */
+ break;
+ }
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state. If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller did not specify @requeue, put it on the idle tree.
+ *
+ * Return %1 if the caller should update the entity hierarchy, i.e.,
+ * if the entity was in service or if it was the next_in_service for
+ * its sched_data; return %0 otherwise.
+ */
+static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+ struct bfq_sched_data *sd = entity->sched_data;
+ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+ int is_in_service = entity == sd->in_service_entity;
+ int ret = 0;
+
+ if (!entity->on_st)
+ return 0;
+
+ if (is_in_service) {
+ bfq_calc_finish(entity, entity->service);
+ sd->in_service_entity = NULL;
+ } else if (entity->tree == &st->active)
+ bfq_active_extract(st, entity);
+ else if (entity->tree == &st->idle)
+ bfq_idle_extract(st, entity);
+
+ if (is_in_service || sd->next_in_service == entity)
+ ret = bfq_update_next_in_service(sd);
+
+ if (!requeue || !bfq_gt(entity->finish, st->vtime))
+ bfq_forget_entity(st, entity, is_in_service);
+ else
+ bfq_idle_insert(st, entity);
+
+ return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+ struct bfq_sched_data *sd;
+ struct bfq_entity *parent = NULL;
+
+ for_each_entity_safe(entity, parent) {
+ sd = entity->sched_data;
+
+ if (!__bfq_deactivate_entity(entity, requeue))
+ /*
+ * The parent entity is still backlogged, and
+ * we don't need to update it as it is still
+ * in service.
+ */
+ break;
+
+ if (sd->next_in_service)
+ /*
+ * The parent entity is still backlogged and
+ * the budgets on the path towards the root
+ * need to be updated.
+ */
+ goto update;
+
+ /*
+ * If we get here, then the parent is no more backlogged and
+ * we want to propagate the deactivation upwards.
+ */
+ requeue = 1;
+ }
+
+ return;
+
+update:
+ entity = parent;
+ for_each_entity(entity) {
+ __bfq_activate_entity(entity, false);
+
+ sd = entity->sched_data;
+ if (!bfq_update_next_in_service(sd))
+ break;
+ }
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time. Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated processes getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct bfq_service_tree *st)
+{
+ struct bfq_entity *entry;
+ struct rb_node *node = st->active.rb_node;
+
+ entry = rb_entry(node, struct bfq_entity, rb_node);
+ if (bfq_gt(entry->min_start, st->vtime)) {
+ st->vtime = entry->min_start;
+ bfq_forget_idle(st);
+ }
+}
+
+/**
+ * bfq_first_active_entity - find the eligible entity with
+ * the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start >= vtime) entity. The path on
+ * the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
+{
+ struct bfq_entity *entry, *first = NULL;
+ struct rb_node *node = st->active.rb_node;
+
+ while (node) {
+ entry = rb_entry(node, struct bfq_entity, rb_node);
+left:
+ if (!bfq_gt(entry->start, st->vtime))
+ first = entry;
+
+ if (node->rb_left) {
+ entry = rb_entry(node->rb_left,
+ struct bfq_entity, rb_node);
+ if (!bfq_gt(entry->min_start, st->vtime)) {
+ node = node->rb_left;
+ goto left;
+ }
+ }
+ if (first)
+ break;
+ node = node->rb_right;
+ }
+
+ return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
+ bool force)
+{
+ struct bfq_entity *entity, *new_next_in_service = NULL;
+
+ if (RB_EMPTY_ROOT(&st->active))
+ return NULL;
+
+ bfq_update_vtime(st);
+ entity = bfq_first_active_entity(st);
+
+ /*
+ * If the chosen entity does not match with the sched_data's
+ * next_in_service and we are forcedly serving the IDLE priority
+ * class tree, bubble up budget update.
+ */
+ if (unlikely(force && entity != entity->sched_data->next_in_service)) {
+ new_next_in_service = entity;
+ for_each_entity(new_next_in_service)
+ bfq_update_budget(new_next_in_service);
+ }
+
+ return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_in_service entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_in_service value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+ int extract,
+ struct bfq_data *bfqd)
+{
+ struct bfq_service_tree *st = sd->service_tree;
+ struct bfq_entity *entity;
+ int i = 0;
+
+ /*
+ * Choose from idle class, if needed to guarantee a minimum
+ * bandwidth to this class. This should also mitigate
+ * priority-inversion problems in case a low priority task is
+ * holding file system resources.
+ */
+ if (bfqd &&
+ jiffies - bfqd->bfq_class_idle_last_service >
+ BFQ_CL_IDLE_TIMEOUT) {
+ entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
+ true);
+ if (entity) {
+ i = BFQ_IOPRIO_CLASSES - 1;
+ bfqd->bfq_class_idle_last_service = jiffies;
+ sd->next_in_service = entity;
+ }
+ }
+ for (; i < BFQ_IOPRIO_CLASSES; i++) {
+ entity = __bfq_lookup_next_entity(st + i, false);
+ if (entity) {
+ if (extract) {
+ bfq_check_next_in_service(sd, entity);
+ bfq_active_extract(st + i, entity);
+ sd->in_service_entity = entity;
+ sd->next_in_service = NULL;
+ }
+ break;
+ }
+ }
+
+ return entity;
+}
+
+static bool next_queue_may_preempt(struct bfq_data *bfqd)
+{
+ struct bfq_sched_data *sd = &bfqd->sched_data;
+
+ return sd->next_in_service != sd->in_service_entity;
+}
+
+
+/*
+ * Get next queue for service.
+ */
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+{
+ struct bfq_entity *entity = NULL;
+ struct bfq_sched_data *sd;
+ struct bfq_queue *bfqq;
+
+ if (bfqd->busy_queues == 0)
+ return NULL;
+
+ sd = &bfqd->sched_data;
+ for (; sd ; sd = entity->my_sched_data) {
+ entity = bfq_lookup_next_entity(sd, 1, bfqd);
+ entity->service = 0;
+ }
+
+ bfqq = bfq_entity_to_bfqq(entity);
+
+ return bfqq;
+}
+
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+{
+ struct bfq_queue *in_serv_bfqq = bfqd->in_service_queue;
+ struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
+
+ if (bfqd->in_service_bic) {
+ put_io_context(bfqd->in_service_bic->icq.ioc);
+ bfqd->in_service_bic = NULL;
+ }
+
+ bfq_clear_bfqq_wait_request(in_serv_bfqq);
+ hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+ bfqd->in_service_queue = NULL;
+
+ /*
+ * in_serv_entity is no longer in service, so, if it is in no
+ * service tree either, then release the service reference to
+ * the queue it represents (taken with bfq_get_entity).
+ */
+ if (!in_serv_entity->on_st)
+ bfq_put_queue(in_serv_bfqq);
+}
+
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ int requeue)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ bfq_deactivate_entity(entity, requeue);
+}
+
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ bfq_activate_entity(entity, bfq_bfqq_non_blocking_wait_rq(bfqq));
+ bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
+}
+
+/*
+ * Called when the bfqq no longer has requests pending, remove it from
+ * the service tree.
+ */
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ int requeue)
+{
+ bfq_log_bfqq(bfqd, bfqq, "del from busy");
+
+ bfq_clear_bfqq_busy(bfqq);
+
+ bfqd->busy_queues--;
+
+ bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+}
+
+/*
+ * Called when an inactive queue receives a new request.
+ */
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ bfq_log_bfqq(bfqd, bfqq, "add to busy");
+
+ bfq_activate_bfqq(bfqd, bfqq);
+
+ bfq_mark_bfqq_busy(bfqq);
+ bfqd->busy_queues++;
+}
+
+static void bfq_init_entity(struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+ entity->weight = entity->new_weight;
+ entity->orig_weight = entity->new_weight;
+
+ bfqq->ioprio = bfqq->new_ioprio;
+ bfqq->ioprio_class = bfqq->new_ioprio_class;
+
+ entity->sched_data = &bfqq->bfqd->sched_data;
+}
+
+#define bfq_class_idle(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
+#define bfq_class_rt(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_RT)
+
+#define bfq_sample_valid(samples) ((samples) > 80)
+
+/*
+ * Scheduler run of queue, if there are requests pending and no one in the
+ * driver that will restart queueing.
+ */
+static void bfq_schedule_dispatch(struct bfq_data *bfqd)
+{
+ if (bfqd->queued != 0) {
+ bfq_log(bfqd, "schedule dispatch");
+ blk_mq_run_hw_queues(bfqd->queue, true);
+ }
+}
+
+/*
+ * Lifted from AS - choose which of rq1 and rq2 that is best served now.
+ * We choose the request that is closesr to the head right now. Distance
+ * behind the head is penalized and only allowed to a certain extent.
+ */
+static struct request *bfq_choose_req(struct bfq_data *bfqd,
+ struct request *rq1,
+ struct request *rq2,
+ sector_t last)
+{
+ sector_t s1, s2, d1 = 0, d2 = 0;
+ unsigned long back_max;
+#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */
+#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */
+ unsigned int wrap = 0; /* bit mask: requests behind the disk head? */
+
+ if (!rq1 || rq1 == rq2)
+ return rq2;
+ if (!rq2)
+ return rq1;
+
+ if (rq_is_sync(rq1) && !rq_is_sync(rq2))
+ return rq1;
+ else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
+ return rq2;
+ if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
+ return rq1;
+ else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
+ return rq2;
+
+ s1 = blk_rq_pos(rq1);
+ s2 = blk_rq_pos(rq2);
+
+ /*
+ * By definition, 1KiB is 2 sectors.
+ */
+ back_max = bfqd->bfq_back_max * 2;
+
+ /*
+ * Strict one way elevator _except_ in the case where we allow
+ * short backward seeks which are biased as twice the cost of a
+ * similar forward seek.
+ */
+ if (s1 >= last)
+ d1 = s1 - last;
+ else if (s1 + back_max >= last)
+ d1 = (last - s1) * bfqd->bfq_back_penalty;
+ else
+ wrap |= BFQ_RQ1_WRAP;
+
+ if (s2 >= last)
+ d2 = s2 - last;
+ else if (s2 + back_max >= last)
+ d2 = (last - s2) * bfqd->bfq_back_penalty;
+ else
+ wrap |= BFQ_RQ2_WRAP;
+
+ /* Found required data */
+
+ /*
+ * By doing switch() on the bit mask "wrap" we avoid having to
+ * check two variables for all permutations: --> faster!
+ */
+ switch (wrap) {
+ case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
+ if (d1 < d2)
+ return rq1;
+ else if (d2 < d1)
+ return rq2;
+
+ if (s1 >= s2)
+ return rq1;
+ else
+ return rq2;
+
+ case BFQ_RQ2_WRAP:
+ return rq1;
+ case BFQ_RQ1_WRAP:
+ return rq2;
+ case BFQ_RQ1_WRAP|BFQ_RQ2_WRAP: /* both rqs wrapped */
+ default:
+ /*
+ * Since both rqs are wrapped,
+ * start with the one that's further behind head
+ * (--> only *one* back seek required),
+ * since back seek takes more time than forward.
+ */
+ if (s1 <= s2)
+ return rq1;
+ else
+ return rq2;
+ }
+}
+
+/*
+ * Return expired entry, or NULL to just start from scratch in rbtree.
+ */
+static struct request *bfq_check_fifo(struct bfq_queue *bfqq,
+ struct request *last)
+{
+ struct request *rq;
+
+ if (bfq_bfqq_fifo_expire(bfqq))
+ return NULL;
+
+ bfq_mark_bfqq_fifo_expire(bfqq);
+
+ rq = rq_entry_fifo(bfqq->fifo.next);
+
+ if (rq == last || ktime_get_ns() < rq->fifo_time)
+ return NULL;
+
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "check_fifo: returned %p", rq);
+ return rq;
+}
+
+static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ struct request *last)
+{
+ struct rb_node *rbnext = rb_next(&last->rb_node);
+ struct rb_node *rbprev = rb_prev(&last->rb_node);
+ struct request *next, *prev = NULL;
+
+ /* Follow expired path, else get first next available. */
+ next = bfq_check_fifo(bfqq, last);
+ if (next)
+ return next;
+
+ if (rbprev)
+ prev = rb_entry_rq(rbprev);
+
+ if (rbnext)
+ next = rb_entry_rq(rbnext);
+ else {
+ rbnext = rb_first(&bfqq->sort_list);
+ if (rbnext && rbnext != &last->rb_node)
+ next = rb_entry_rq(rbnext);
+ }
+
+ return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
+}
+
+static unsigned long bfq_serv_to_charge(struct request *rq,
+ struct bfq_queue *bfqq)
+{
+ return blk_rq_sectors(rq);
+}
+
+/**
+ * bfq_updated_next_req - update the queue after a new next_rq selection.
+ * @bfqd: the device data the queue belongs to.
+ * @bfqq: the queue to update.
+ *
+ * If the first request of a queue changes we make sure that the queue
+ * has enough budget to serve at least its first request (if the
+ * request has grown). We do this because if the queue has not enough
+ * budget for its first request, it has to go through two dispatch
+ * rounds to actually get it dispatched.
+ */
+static void bfq_updated_next_req(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+ struct request *next_rq = bfqq->next_rq;
+ unsigned long new_budget;
+
+ if (!next_rq)
+ return;
+
+ if (bfqq == bfqd->in_service_queue)
+ /*
+ * In order not to break guarantees, budgets cannot be
+ * changed after an entity has been selected.
+ */
+ return;
+
+ new_budget = max_t(unsigned long, bfqq->max_budget,
+ bfq_serv_to_charge(next_rq, bfqq));
+ if (entity->budget != new_budget) {
+ entity->budget = new_budget;
+ bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
+ new_budget);
+ bfq_activate_bfqq(bfqd, bfqq);
+ }
+}
+
+static int bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ return entity->budget - entity->service;
+}
+
+/*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+ * estimated disk peak rate; otherwise return the default max budget
+ */
+static int bfq_max_budget(struct bfq_data *bfqd)
+{
+ if (bfqd->budgets_assigned < bfq_stats_min_budgets)
+ return bfq_default_max_budget;
+ else
+ return bfqd->bfq_max_budget;
+}
+
+/*
+ * Return min budget, which is a fraction of the current or default
+ * max budget (trying with 1/32)
+ */
+static int bfq_min_budget(struct bfq_data *bfqd)
+{
+ if (bfqd->budgets_assigned < bfq_stats_min_budgets)
+ return bfq_default_max_budget / 32;
+ else
+ return bfqd->bfq_max_budget / 32;
+}
+
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ bool compensate,
+ enum bfqq_expiration reason);
+
+/*
+ * The next function, invoked after the input queue bfqq switches from
+ * idle to busy, updates the budget of bfqq. The function also tells
+ * whether the in-service queue should be expired, by returning
+ * true. The purpose of expiring the in-service queue is to give bfqq
+ * the chance to possibly preempt the in-service queue, and the reason
+ * for preempting the in-service queue is to achieve the following
+ * goal: guarantee to bfqq its reserved bandwidth even if bfqq has
+ * expired because it has remained idle.
+ *
+ * In particular, bfqq may have expired for one of the following two
+ * reasons:
+ *
+ * - BFQQE_NO_MORE_REQUESTS bfqq did not enjoy any device idling
+ * and did not make it to issue a new request before its last
+ * request was served;
+ *
+ * - BFQQE_TOO_IDLE bfqq did enjoy device idling, but did not issue
+ * a new request before the expiration of the idling-time.
+ *
+ * Even if bfqq has expired for one of the above reasons, the process
+ * associated with the queue may be however issuing requests greedily,
+ * and thus be sensitive to the bandwidth it receives (bfqq may have
+ * remained idle for other reasons: CPU high load, bfqq not enjoying
+ * idling, I/O throttling somewhere in the path from the process to
+ * the I/O scheduler, ...). But if, after every expiration for one of
+ * the above two reasons, bfqq has to wait for the service of at least
+ * one full budget of another queue before being served again, then
+ * bfqq is likely to get a much lower bandwidth or resource time than
+ * its reserved ones. To address this issue, two countermeasures need
+ * to be taken.
+ *
+ * First, the budget and the timestamps of bfqq need to be updated in
+ * a special way on bfqq reactivation: they need to be updated as if
+ * bfqq did not remain idle and did not expire. In fact, if they are
+ * computed as if bfqq expired and remained idle until reactivation,
+ * then the process associated with bfqq is treated as if, instead of
+ * being greedy, it stopped issuing requests when bfqq remained idle,
+ * and restarts issuing requests only on this reactivation. In other
+ * words, the scheduler does not help the process recover the "service
+ * hole" between bfqq expiration and reactivation. As a consequence,
+ * the process receives a lower bandwidth than its reserved one. In
+ * contrast, to recover this hole, the budget must be updated as if
+ * bfqq was not expired at all before this reactivation, i.e., it must
+ * be set to the value of the remaining budget when bfqq was
+ * expired. Along the same line, timestamps need to be assigned the
+ * value they had the last time bfqq was selected for service, i.e.,
+ * before last expiration. Thus timestamps need to be back-shifted
+ * with respect to their normal computation (see [1] for more details
+ * on this tricky aspect).
+ *
+ * Secondly, to allow the process to recover the hole, the in-service
+ * queue must be expired too, to give bfqq the chance to preempt it
+ * immediately. In fact, if bfqq has to wait for a full budget of the
+ * in-service queue to be completed, then it may become impossible to
+ * let the process recover the hole, even if the back-shifted
+ * timestamps of bfqq are lower than those of the in-service queue. If
+ * this happens for most or all of the holes, then the process may not
+ * receive its reserved bandwidth. In this respect, it is worth noting
+ * that, being the service of outstanding requests unpreemptible, a
+ * little fraction of the holes may however be unrecoverable, thereby
+ * causing a little loss of bandwidth.
+ *
+ * The last important point is detecting whether bfqq does need this
+ * bandwidth recovery. In this respect, the next function deems the
+ * process associated with bfqq greedy, and thus allows it to recover
+ * the hole, if: 1) the process is waiting for the arrival of a new
+ * request (which implies that bfqq expired for one of the above two
+ * reasons), and 2) such a request has arrived soon. The first
+ * condition is controlled through the flag non_blocking_wait_rq,
+ * while the second through the flag arrived_in_time. If both
+ * conditions hold, then the function computes the budget in the
+ * above-described special way, and signals that the in-service queue
+ * should be expired. Timestamp back-shifting is done later in
+ * __bfq_activate_entity.
+ */
+static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ bool arrived_in_time)
+{
+ struct bfq_entity *entity = &bfqq->entity;
+
+ if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time) {
+ /*
+ * We do not clear the flag non_blocking_wait_rq here, as
+ * the latter is used in bfq_activate_bfqq to signal
+ * that timestamps need to be back-shifted (and is
+ * cleared right after).
+ */
+
+ /*
+ * In next assignment we rely on that either
+ * entity->service or entity->budget are not updated
+ * on expiration if bfqq is empty (see
+ * __bfq_bfqq_recalc_budget). Thus both quantities
+ * remain unchanged after such an expiration, and the
+ * following statement therefore assigns to
+ * entity->budget the remaining budget on such an
+ * expiration. For clarity, entity->service is not
+ * updated on expiration in any case, and, in normal
+ * operation, is reset only when bfqq is selected for
+ * service (see bfq_get_next_queue).
+ */
+ entity->budget = min_t(unsigned long,
+ bfq_bfqq_budget_left(bfqq),
+ bfqq->max_budget);
+
+ return true;
+ }
+
+ entity->budget = max_t(unsigned long, bfqq->max_budget,
+ bfq_serv_to_charge(bfqq->next_rq, bfqq));
+ bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
+ return false;
+}
+
+static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ struct request *rq)
+{
+ bool bfqq_wants_to_preempt,
+ /*
+ * See the comments on
+ * bfq_bfqq_update_budg_for_activation for
+ * details on the usage of the next variable.
+ */
+ arrived_in_time = ktime_get_ns() <=
+ bfqq->ttime.last_end_request +
+ bfqd->bfq_slice_idle * 3;
+
+ /*
+ * Update budget and check whether bfqq may want to preempt
+ * the in-service queue.
+ */
+ bfqq_wants_to_preempt =
+ bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
+ arrived_in_time);
+
+ if (!bfq_bfqq_IO_bound(bfqq)) {
+ if (arrived_in_time) {
+ bfqq->requests_within_timer++;
+ if (bfqq->requests_within_timer >=
+ bfqd->bfq_requests_within_timer)
+ bfq_mark_bfqq_IO_bound(bfqq);
+ } else
+ bfqq->requests_within_timer = 0;
+ }
+
+ bfq_add_bfqq_busy(bfqd, bfqq);
+
+ /*
+ * Expire in-service queue only if preemption may be needed
+ * for guarantees. In this respect, the function
+ * next_queue_may_preempt just checks a simple, necessary
+ * condition, and not a sufficient condition based on
+ * timestamps. In fact, for the latter condition to be
+ * evaluated, timestamps would need first to be updated, and
+ * this operation is quite costly (see the comments on the
+ * function bfq_bfqq_update_budg_for_activation).
+ */
+ if (bfqd->in_service_queue && bfqq_wants_to_preempt &&
+ next_queue_may_preempt(bfqd))
+ bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
+ false, BFQQE_PREEMPTED);
+}
+
+static void bfq_add_request(struct request *rq)
+{
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+ struct bfq_data *bfqd = bfqq->bfqd;
+ struct request *next_rq, *prev;
+
+ bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
+ bfqq->queued[rq_is_sync(rq)]++;
+ bfqd->queued++;
+
+ elv_rb_add(&bfqq->sort_list, rq);
+
+ /*
+ * Check if this request is a better next-serve candidate.
+ */
+ prev = bfqq->next_rq;
+ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
+ bfqq->next_rq = next_rq;
+
+ if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
+ bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, rq);
+ else if (prev != bfqq->next_rq)
+ bfq_updated_next_req(bfqd, bfqq);
+}
+
+static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
+ struct bio *bio,
+ struct request_queue *q)
+{
+ struct bfq_queue *bfqq = bfqd->bio_bfqq;
+
+
+ if (bfqq)
+ return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
+
+ return NULL;
+}
+
+#if 0 /* Still not clear if we can do without next two functions */
+static void bfq_activate_request(struct request_queue *q, struct request *rq)
+{
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+
+ bfqd->rq_in_driver++;
+ bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+ bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
+ (unsigned long long)bfqd->last_position);
+}
+
+static void bfq_deactivate_request(struct request_queue *q, struct request *rq)
+{
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+
+ bfqd->rq_in_driver--;
+}
+#endif
+
+static void bfq_remove_request(struct request_queue *q,
+ struct request *rq)
+{
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+ struct bfq_data *bfqd = bfqq->bfqd;
+ const int sync = rq_is_sync(rq);
+
+ if (bfqq->next_rq == rq) {
+ bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
+ bfq_updated_next_req(bfqd, bfqq);
+ }
+
+ if (rq->queuelist.prev != &rq->queuelist)
+ list_del_init(&rq->queuelist);
+ bfqq->queued[sync]--;
+ bfqd->queued--;
+ elv_rb_del(&bfqq->sort_list, rq);
+
+ elv_rqhash_del(q, rq);
+ if (q->last_merge == rq)
+ q->last_merge = NULL;
+
+ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+ bfqq->next_rq = NULL;
+
+ if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) {
+ bfq_del_bfqq_busy(bfqd, bfqq, 1);
+ /*
+ * bfqq emptied. In normal operation, when
+ * bfqq is empty, bfqq->entity.service and
+ * bfqq->entity.budget must contain,
+ * respectively, the service received and the
+ * budget used last time bfqq emptied. These
+ * facts do not hold in this case, as at least
+ * this last removal occurred while bfqq is
+ * not in service. To avoid inconsistencies,
+ * reset both bfqq->entity.service and
+ * bfqq->entity.budget, if bfqq has still a
+ * process that may issue I/O requests to it.
+ */
+ bfqq->entity.budget = bfqq->entity.service = 0;
+ }
+ }
+
+ if (rq->cmd_flags & REQ_META)
+ bfqq->meta_pending--;
+}
+
+static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+{
+ struct request_queue *q = hctx->queue;
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+ struct request *free = NULL;
+ /*
+ * bfq_bic_lookup grabs the queue_lock: invoke it now and
+ * store its return value for later use, to avoid nesting
+ * queue_lock inside the bfqd->lock. We assume that the bic
+ * returned by bfq_bic_lookup does not go away before
+ * bfqd->lock is taken.
+ */
+ struct bfq_io_cq *bic = bfq_bic_lookup(bfqd, current->io_context, q);
+ bool ret;
+
+ spin_lock_irq(&bfqd->lock);
+
+ if (bic)
+ bfqd->bio_bfqq = bic_to_bfqq(bic, op_is_sync(bio->bi_opf));
+ else
+ bfqd->bio_bfqq = NULL;
+ bfqd->bio_bic = bic;
+
+ ret = blk_mq_sched_try_merge(q, bio, &free);
+
+ if (free)
+ blk_mq_free_request(free);
+ spin_unlock_irq(&bfqd->lock);
+
+ return ret;
+}
+
+static int bfq_request_merge(struct request_queue *q, struct request **req,
+ struct bio *bio)
+{
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+ struct request *__rq;
+
+ __rq = bfq_find_rq_fmerge(bfqd, bio, q);
+ if (__rq && elv_bio_merge_ok(__rq, bio)) {
+ *req = __rq;
+ return ELEVATOR_FRONT_MERGE;
+ }
+
+ return ELEVATOR_NO_MERGE;
+}
+
+static void bfq_request_merged(struct request_queue *q, struct request *req,
+ enum elv_merge type)
+{
+ if (type == ELEVATOR_FRONT_MERGE &&
+ rb_prev(&req->rb_node) &&
+ blk_rq_pos(req) <
+ blk_rq_pos(container_of(rb_prev(&req->rb_node),
+ struct request, rb_node))) {
+ struct bfq_queue *bfqq = RQ_BFQQ(req);
+ struct bfq_data *bfqd = bfqq->bfqd;
+ struct request *prev, *next_rq;
+
+ /* Reposition request in its sort_list */
+ elv_rb_del(&bfqq->sort_list, req);
+ elv_rb_add(&bfqq->sort_list, req);
+
+ /* Choose next request to be served for bfqq */
+ prev = bfqq->next_rq;
+ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
+ bfqd->last_position);
+ bfqq->next_rq = next_rq;
+ /*
+ * If next_rq changes, update the queue's budget to fit
+ * the new request.
+ */
+ if (prev != bfqq->next_rq)
+ bfq_updated_next_req(bfqd, bfqq);
+ }
+}
+
+static void bfq_requests_merged(struct request_queue *q, struct request *rq,
+ struct request *next)
+{
+ struct bfq_queue *bfqq = RQ_BFQQ(rq), *next_bfqq = RQ_BFQQ(next);
+
+ if (!RB_EMPTY_NODE(&rq->rb_node))
+ return;
+ spin_lock_irq(&bfqq->bfqd->lock);
+
+ /*
+ * If next and rq belong to the same bfq_queue and next is older
+ * than rq, then reposition rq in the fifo (by substituting next
+ * with rq). Otherwise, if next and rq belong to different
+ * bfq_queues, never reposition rq: in fact, we would have to
+ * reposition it with respect to next's position in its own fifo,
+ * which would most certainly be too expensive with respect to
+ * the benefits.
+ */
+ if (bfqq == next_bfqq &&
+ !list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
+ next->fifo_time < rq->fifo_time) {
+ list_del_init(&rq->queuelist);
+ list_replace_init(&next->queuelist, &rq->queuelist);
+ rq->fifo_time = next->fifo_time;
+ }
+
+ if (bfqq->next_rq == next)
+ bfqq->next_rq = rq;
+
+ bfq_remove_request(q, next);
+
+ spin_unlock_irq(&bfqq->bfqd->lock);
+}
+
+static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
+ struct bio *bio)
+{
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+ bool is_sync = op_is_sync(bio->bi_opf);
+ struct bfq_queue *bfqq = bfqd->bio_bfqq;
+
+ /*
+ * Disallow merge of a sync bio into an async request.
+ */
+ if (is_sync && !rq_is_sync(rq))
+ return false;
+
+ /*
+ * Lookup the bfqq that this bio will be queued with. Allow
+ * merge only if rq is queued there.
+ */
+ if (!bfqq)
+ return false;
+
+ return bfqq == RQ_BFQQ(rq);
+}
+
+static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ if (bfqq) {
+ bfq_mark_bfqq_budget_new(bfqq);
+ bfq_clear_bfqq_fifo_expire(bfqq);
+
+ bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8;
+
+ bfq_log_bfqq(bfqd, bfqq,
+ "set_in_service_queue, cur-budget = %d",
+ bfqq->entity.budget);
+ }
+
+ bfqd->in_service_queue = bfqq;
+}
+
+/*
+ * Get and set a new queue for service.
+ */
+static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
+{
+ struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
+
+ __bfq_set_in_service_queue(bfqd, bfqq);
+ return bfqq;
+}
+
+/*
+ * bfq_default_budget - return the default budget for @bfqq on @bfqd.
+ * @bfqd: the device descriptor.
+ * @bfqq: the queue to consider.
+ *
+ * We use 3/4 of the @bfqd maximum budget as the default value
+ * for the max_budget field of the queues. This lets the feedback
+ * mechanism to start from some middle ground, then the behavior
+ * of the process will drive the heuristics towards high values, if
+ * it behaves as a greedy sequential reader, or towards small values
+ * if it shows a more intermittent behavior.
+ */
+static unsigned long bfq_default_budget(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ unsigned long budget;
+
+ /*
+ * When we need an estimate of the peak rate we need to avoid
+ * to give budgets that are too short due to previous
+ * measurements. So, in the first 10 assignments use a
+ * ``safe'' budget value. For such first assignment the value
+ * of bfqd->budgets_assigned happens to be lower than 194.
+ * See __bfq_set_in_service_queue for the formula by which
+ * this field is computed.
+ */
+ if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
+ budget = bfq_default_max_budget;
+ else
+ budget = bfqd->bfq_max_budget;
+
+ return budget - budget / 4;
+}
+
+static void bfq_arm_slice_timer(struct bfq_data *bfqd)
+{
+ struct bfq_queue *bfqq = bfqd->in_service_queue;
+ struct bfq_io_cq *bic;
+ u32 sl;
+
+ /* Processes have exited, don't wait. */
+ bic = bfqd->in_service_bic;
+ if (!bic || atomic_read(&bic->icq.ioc->active_ref) == 0)
+ return;
+
+ bfq_mark_bfqq_wait_request(bfqq);
+
+ /*
+ * We don't want to idle for seeks, but we do want to allow
+ * fair distribution of slice time for a process doing back-to-back
+ * seeks. So allow a little bit of time for him to submit a new rq.
+ */
+ sl = bfqd->bfq_slice_idle;
+ /*
+ * Grant only minimum idle time if the queue is seeky.
+ */
+ if (BFQQ_SEEKY(bfqq))
+ sl = min_t(u64, sl, BFQ_MIN_TT);
+
+ bfqd->last_idling_start = ktime_get();
+ hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
+ HRTIMER_MODE_REL);
+}
+
+/*
+ * Set the maximum time for the in-service queue to consume its
+ * budget. This prevents seeky processes from lowering the disk
+ * throughput (always guaranteed with a time slice scheme as in CFQ).
+ */
+static void bfq_set_budget_timeout(struct bfq_data *bfqd)
+{
+ struct bfq_queue *bfqq = bfqd->in_service_queue;
+ unsigned int timeout_coeff = bfqq->entity.weight /
+ bfqq->entity.orig_weight;
+
+ bfqd->last_budget_start = ktime_get();
+
+ bfq_clear_bfqq_budget_new(bfqq);
+ bfqq->budget_timeout = jiffies +
+ bfqd->bfq_timeout * timeout_coeff;
+
+ bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
+ jiffies_to_msecs(bfqd->bfq_timeout * timeout_coeff));
+}
+
+/*
+ * Remove request from internal lists.
+ */
+static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)
+{
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+ /*
+ * For consistency, the next instruction should have been
+ * executed after removing the request from the queue and
+ * dispatching it. We execute instead this instruction before
+ * bfq_remove_request() (and hence introduce a temporary
+ * inconsistency), for efficiency. In fact, should this
+ * dispatch occur for a non in-service bfqq, this anticipated
+ * increment prevents two counters related to bfqq->dispatched
+ * from risking to be, first, uselessly decremented, and then
+ * incremented again when the (new) value of bfqq->dispatched
+ * happens to be taken into account.
+ */
+ bfqq->dispatched++;
+
+ bfq_remove_request(q, rq);
+}
+
+static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ __bfq_bfqd_reset_in_service(bfqd);
+
+ if (RB_EMPTY_ROOT(&bfqq->sort_list))
+ bfq_del_bfqq_busy(bfqd, bfqq, 1);
+ else
+ bfq_activate_bfqq(bfqd, bfqq);
+}
+
+/**
+ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
+ * @bfqd: device data.
+ * @bfqq: queue to update.
+ * @reason: reason for expiration.
+ *
+ * Handle the feedback on @bfqq budget at queue expiration.
+ * See the body for detailed comments.
+ */
+static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ enum bfqq_expiration reason)
+{
+ struct request *next_rq;
+ int budget, min_budget;
+
+ budget = bfqq->max_budget;
+ min_budget = bfq_min_budget(bfqd);
+
+ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d",
+ bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
+ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d",
+ budget, bfq_min_budget(bfqd));
+ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
+ bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
+
+ if (bfq_bfqq_sync(bfqq)) {
+ switch (reason) {
+ /*
+ * Caveat: in all the following cases we trade latency
+ * for throughput.
+ */
+ case BFQQE_TOO_IDLE:
+ if (budget > min_budget + BFQ_BUDGET_STEP)
+ budget -= BFQ_BUDGET_STEP;
+ else
+ budget = min_budget;
+ break;
+ case BFQQE_BUDGET_TIMEOUT:
+ budget = bfq_default_budget(bfqd, bfqq);
+ break;
+ case BFQQE_BUDGET_EXHAUSTED:
+ /*
+ * The process still has backlog, and did not
+ * let either the budget timeout or the disk
+ * idling timeout expire. Hence it is not
+ * seeky, has a short thinktime and may be
+ * happy with a higher budget too. So
+ * definitely increase the budget of this good
+ * candidate to boost the disk throughput.
+ */
+ budget = min(budget + 8 * BFQ_BUDGET_STEP,
+ bfqd->bfq_max_budget);
+ break;
+ case BFQQE_NO_MORE_REQUESTS:
+ /*
+ * For queues that expire for this reason, it
+ * is particularly important to keep the
+ * budget close to the actual service they
+ * need. Doing so reduces the timestamp
+ * misalignment problem described in the
+ * comments in the body of
+ * __bfq_activate_entity. In fact, suppose
+ * that a queue systematically expires for
+ * BFQQE_NO_MORE_REQUESTS and presents a
+ * new request in time to enjoy timestamp
+ * back-shifting. The larger the budget of the
+ * queue is with respect to the service the
+ * queue actually requests in each service
+ * slot, the more times the queue can be
+ * reactivated with the same virtual finish
+ * time. It follows that, even if this finish
+ * time is pushed to the system virtual time
+ * to reduce the consequent timestamp
+ * misalignment, the queue unjustly enjoys for
+ * many re-activations a lower finish time
+ * than all newly activated queues.
+ *
+ * The service needed by bfqq is measured
+ * quite precisely by bfqq->entity.service.
+ * Since bfqq does not enjoy device idling,
+ * bfqq->entity.service is equal to the number
+ * of sectors that the process associated with
+ * bfqq requested to read/write before waiting
+ * for request completions, or blocking for
+ * other reasons.
+ */
+ budget = max_t(int, bfqq->entity.service, min_budget);
+ break;
+ default:
+ return;
+ }
+ } else {
+ /*
+ * Async queues get always the maximum possible
+ * budget, as for them we do not care about latency
+ * (in addition, their ability to dispatch is limited
+ * by the charging factor).
+ */
+ budget = bfqd->bfq_max_budget;
+ }
+
+ bfqq->max_budget = budget;
+
+ if (bfqd->budgets_assigned >= bfq_stats_min_budgets &&
+ !bfqd->bfq_user_max_budget)
+ bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget);
+
+ /*
+ * If there is still backlog, then assign a new budget, making
+ * sure that it is large enough for the next request. Since
+ * the finish time of bfqq must be kept in sync with the
+ * budget, be sure to call __bfq_bfqq_expire() *after* this
+ * update.
+ *
+ * If there is no backlog, then no need to update the budget;
+ * it will be updated on the arrival of a new request.
+ */
+ next_rq = bfqq->next_rq;
+ if (next_rq)
+ bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
+ bfq_serv_to_charge(next_rq, bfqq));
+
+ bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %d",
+ next_rq ? blk_rq_sectors(next_rq) : 0,
+ bfqq->entity.budget);
+}
+
+static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
+{
+ unsigned long max_budget;
+
+ /*
+ * The max_budget calculated when autotuning is equal to the
+ * amount of sectors transferred in timeout at the estimated
+ * peak rate. To get this value, peak_rate is, first,
+ * multiplied by 1000, because timeout is measured in ms,
+ * while peak_rate is measured in sectors/usecs. Then the
+ * result of this multiplication is right-shifted by
+ * BFQ_RATE_SHIFT, because peak_rate is equal to the value of
+ * the peak rate left-shifted by BFQ_RATE_SHIFT.
+ */
+ max_budget = (unsigned long)(peak_rate * 1000 *
+ timeout >> BFQ_RATE_SHIFT);
+
+ return max_budget;
+}
+
+/*
+ * In addition to updating the peak rate, checks whether the process
+ * is "slow", and returns 1 if so. This slow flag is used, in addition
+ * to the budget timeout, to reduce the amount of service provided to
+ * seeky processes, and hence reduce their chances to lower the
+ * throughput. See the code for more details.
+ */
+static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ bool compensate)
+{
+ u64 bw, usecs, expected, timeout;
+ ktime_t delta;
+ int update = 0;
+
+ if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
+ return false;
+
+ if (compensate)
+ delta = bfqd->last_idling_start;
+ else
+ delta = ktime_get();
+ delta = ktime_sub(delta, bfqd->last_budget_start);
+ usecs = ktime_to_us(delta);
+
+ /* don't use too short time intervals */
+ if (usecs < 1000)
+ return false;
+
+ /*
+ * Calculate the bandwidth for the last slice. We use a 64 bit
+ * value to store the peak rate, in sectors per usec in fixed
+ * point math. We do so to have enough precision in the estimate
+ * and to avoid overflows.
+ */
+ bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
+ do_div(bw, (unsigned long)usecs);
+
+ timeout = jiffies_to_msecs(bfqd->bfq_timeout);
+
+ /*
+ * Use only long (> 20ms) intervals to filter out spikes for
+ * the peak rate estimation.
+ */
+ if (usecs > 20000) {
+ if (bw > bfqd->peak_rate) {
+ bfqd->peak_rate = bw;
+ update = 1;
+ bfq_log(bfqd, "new peak_rate=%llu", bw);
+ }
+
+ update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
+
+ if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
+ bfqd->peak_rate_samples++;
+
+ if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
+ update && bfqd->bfq_user_max_budget == 0) {
+ bfqd->bfq_max_budget =
+ bfq_calc_max_budget(bfqd->peak_rate,
+ timeout);
+ bfq_log(bfqd, "new max_budget=%d",
+ bfqd->bfq_max_budget);
+ }
+ }
+
+ /*
+ * A process is considered ``slow'' (i.e., seeky, so that we
+ * cannot treat it fairly in the service domain, as it would
+ * slow down too much the other processes) if, when a slice
+ * ends for whatever reason, it has received service at a
+ * rate that would not be high enough to complete the budget
+ * before the budget timeout expiration.
+ */
+ expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
+
+ /*
+ * Caveat: processes doing IO in the slower disk zones will
+ * tend to be slow(er) even if not seeky. And the estimated
+ * peak rate will actually be an average over the disk
+ * surface. Hence, to not be too harsh with unlucky processes,
+ * we keep a budget/3 margin of safety before declaring a
+ * process slow.
+ */
+ return expected > (4 * bfqq->entity.budget) / 3;
+}
+
+/*
+ * Return the farthest past time instant according to jiffies
+ * macros.
+ */
+static unsigned long bfq_smallest_from_now(void)
+{
+ return jiffies - MAX_JIFFY_OFFSET;
+}
+
+/**
+ * bfq_bfqq_expire - expire a queue.
+ * @bfqd: device owning the queue.
+ * @bfqq: the queue to expire.
+ * @compensate: if true, compensate for the time spent idling.
+ * @reason: the reason causing the expiration.
+ *
+ *
+ * If the process associated with the queue is slow (i.e., seeky), or
+ * in case of budget timeout, or, finally, if it is async, we
+ * artificially charge it an entire budget (independently of the
+ * actual service it received). As a consequence, the queue will get
+ * higher timestamps than the correct ones upon reactivation, and
+ * hence it will be rescheduled as if it had received more service
+ * than what it actually received. In the end, this class of processes
+ * will receive less service in proportion to how slowly they consume
+ * their budgets (and hence how seriously they tend to lower the
+ * throughput).
+ *
+ * In contrast, when a queue expires because it has been idling for
+ * too much or because it exhausted its budget, we do not touch the
+ * amount of service it has received. Hence when the queue will be
+ * reactivated and its timestamps updated, the latter will be in sync
+ * with the actual service received by the queue until expiration.
+ *
+ * Charging a full budget to the first type of queues and the exact
+ * service to the others has the effect of using the WF2Q+ policy to
+ * schedule the former on a timeslice basis, without violating the
+ * service domain guarantees of the latter.
+ */
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ bool compensate,
+ enum bfqq_expiration reason)
+{
+ bool slow;
+ int ref;
+
+ /*
+ * Update device peak rate for autotuning and check whether the
+ * process is slow (see bfq_update_peak_rate).
+ */
+ slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+
+ /*
+ * As above explained, 'punish' slow (i.e., seeky), timed-out
+ * and async queues, to favor sequential sync workloads.
+ */
+ if (slow || reason == BFQQE_BUDGET_TIMEOUT)
+ bfq_bfqq_charge_full_budget(bfqq);
+
+ if (reason == BFQQE_TOO_IDLE &&
+ bfqq->entity.service <= 2 * bfqq->entity.budget / 10)
+ bfq_clear_bfqq_IO_bound(bfqq);
+
+ bfq_log_bfqq(bfqd, bfqq,
+ "expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
+ slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+
+ /*
+ * Increase, decrease or leave budget unchanged according to
+ * reason.
+ */
+ __bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
+ ref = bfqq->ref;
+ __bfq_bfqq_expire(bfqd, bfqq);
+
+ /* mark bfqq as waiting a request only if a bic still points to it */
+ if (ref > 1 && !bfq_bfqq_busy(bfqq) &&
+ reason != BFQQE_BUDGET_TIMEOUT &&
+ reason != BFQQE_BUDGET_EXHAUSTED)
+ bfq_mark_bfqq_non_blocking_wait_rq(bfqq);
+}
+
+/*
+ * Budget timeout is not implemented through a dedicated timer, but
+ * just checked on request arrivals and completions, as well as on
+ * idle timer expirations.
+ */
+static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
+{
+ if (bfq_bfqq_budget_new(bfqq) ||
+ time_is_after_jiffies(bfqq->budget_timeout))
+ return false;
+ return true;
+}
+
+/*
+ * If we expire a queue that is actively waiting (i.e., with the
+ * device idled) for the arrival of a new request, then we may incur
+ * the timestamp misalignment problem described in the body of the
+ * function __bfq_activate_entity. Hence we return true only if this
+ * condition does not hold, or if the queue is slow enough to deserve
+ * only to be kicked off for preserving a high throughput.
+ */
+static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
+{
+ bfq_log_bfqq(bfqq->bfqd, bfqq,
+ "may_budget_timeout: wait_request %d left %d timeout %d",
+ bfq_bfqq_wait_request(bfqq),
+ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3,
+ bfq_bfqq_budget_timeout(bfqq));
+
+ return (!bfq_bfqq_wait_request(bfqq) ||
+ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)
+ &&
+ bfq_bfqq_budget_timeout(bfqq);
+}
+
+/*
+ * For a queue that becomes empty, device idling is allowed only if
+ * this function returns true for the queue. And this function returns
+ * true only if idling is beneficial for throughput.
+ */
+static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
+{
+ struct bfq_data *bfqd = bfqq->bfqd;
+ bool idling_boosts_thr;
+
+ if (bfqd->strict_guarantees)
+ return true;
+
+ /*
+ * The value of the next variable is computed considering that
+ * idling is usually beneficial for the throughput if:
+ * (a) the device is not NCQ-capable, or
+ * (b) regardless of the presence of NCQ, the request pattern
+ * for bfqq is I/O-bound (possible throughput losses
+ * caused by granting idling to seeky queues are mitigated
+ * by the fact that, in all scenarios where boosting
+ * throughput is the best thing to do, i.e., in all
+ * symmetric scenarios, only a minimal idle time is
+ * allowed to seeky queues).
+ */
+ idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
+
+ /*
+ * We have now the components we need to compute the return
+ * value of the function, which is true only if both the
+ * following conditions hold:
+ * 1) bfqq is sync, because idling make sense only for sync queues;
+ * 2) idling boosts the throughput.
+ */
+ return bfq_bfqq_sync(bfqq) && idling_boosts_thr;
+}
+
+/*
+ * If the in-service queue is empty but the function bfq_bfqq_may_idle
+ * returns true, then:
+ * 1) the queue must remain in service and cannot be expired, and
+ * 2) the device must be idled to wait for the possible arrival of a new
+ * request for the queue.
+ * See the comments on the function bfq_bfqq_may_idle for the reasons
+ * why performing device idling is the best choice to boost the throughput
+ * and preserve service guarantees when bfq_bfqq_may_idle itself
+ * returns true.
+ */
+static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+{
+ struct bfq_data *bfqd = bfqq->bfqd;
+
+ return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
+ bfq_bfqq_may_idle(bfqq);
+}
+
+/*
+ * Select a queue for service. If we have a current queue in service,
+ * check whether to continue servicing it, or retrieve and set a new one.
+ */
+static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+{
+ struct bfq_queue *bfqq;
+ struct request *next_rq;
+ enum bfqq_expiration reason = BFQQE_BUDGET_TIMEOUT;
+
+ bfqq = bfqd->in_service_queue;
+ if (!bfqq)
+ goto new_queue;
+
+ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+ if (bfq_may_expire_for_budg_timeout(bfqq) &&
+ !bfq_bfqq_wait_request(bfqq) &&
+ !bfq_bfqq_must_idle(bfqq))
+ goto expire;
+
+check_queue:
+ /*
+ * This loop is rarely executed more than once. Even when it
+ * happens, it is much more convenient to re-execute this loop
+ * than to return NULL and trigger a new dispatch to get a
+ * request served.
+ */
+ next_rq = bfqq->next_rq;
+ /*
+ * If bfqq has requests queued and it has enough budget left to
+ * serve them, keep the queue, otherwise expire it.
+ */
+ if (next_rq) {
+ if (bfq_serv_to_charge(next_rq, bfqq) >
+ bfq_bfqq_budget_left(bfqq)) {
+ /*
+ * Expire the queue for budget exhaustion,
+ * which makes sure that the next budget is
+ * enough to serve the next request, even if
+ * it comes from the fifo expired path.
+ */
+ reason = BFQQE_BUDGET_EXHAUSTED;
+ goto expire;
+ } else {
+ /*
+ * The idle timer may be pending because we may
+ * not disable disk idling even when a new request
+ * arrives.
+ */
+ if (bfq_bfqq_wait_request(bfqq)) {
+ /*
+ * If we get here: 1) at least a new request
+ * has arrived but we have not disabled the
+ * timer because the request was too small,
+ * 2) then the block layer has unplugged
+ * the device, causing the dispatch to be
+ * invoked.
+ *
+ * Since the device is unplugged, now the
+ * requests are probably large enough to
+ * provide a reasonable throughput.
+ * So we disable idling.
+ */
+ bfq_clear_bfqq_wait_request(bfqq);
+ hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+ }
+ goto keep_queue;
+ }
+ }
+
+ /*
+ * No requests pending. However, if the in-service queue is idling
+ * for a new request, or has requests waiting for a completion and
+ * may idle after their completion, then keep it anyway.
+ */
+ if (bfq_bfqq_wait_request(bfqq) ||
+ (bfqq->dispatched != 0 && bfq_bfqq_may_idle(bfqq))) {
+ bfqq = NULL;
+ goto keep_queue;
+ }
+
+ reason = BFQQE_NO_MORE_REQUESTS;
+expire:
+ bfq_bfqq_expire(bfqd, bfqq, false, reason);
+new_queue:
+ bfqq = bfq_set_in_service_queue(bfqd);
+ if (bfqq) {
+ bfq_log_bfqq(bfqd, bfqq, "select_queue: checking new queue");
+ goto check_queue;
+ }
+keep_queue:
+ if (bfqq)
+ bfq_log_bfqq(bfqd, bfqq, "select_queue: returned this queue");
+ else
+ bfq_log(bfqd, "select_queue: no queue returned");
+
+ return bfqq;
+}
+
+/*
+ * Dispatch next request from bfqq.
+ */
+static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ struct request *rq = bfqq->next_rq;
+ unsigned long service_to_charge;
+
+ service_to_charge = bfq_serv_to_charge(rq, bfqq);
+
+ bfq_bfqq_served(bfqq, service_to_charge);
+
+ bfq_dispatch_remove(bfqd->queue, rq);
+
+ if (!bfqd->in_service_bic) {
+ atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
+ bfqd->in_service_bic = RQ_BIC(rq);
+ }
+
+ /*
+ * Expire bfqq, pretending that its budget expired, if bfqq
+ * belongs to CLASS_IDLE and other queues are waiting for
+ * service.
+ */
+ if (bfqd->busy_queues > 1 && bfq_class_idle(bfqq))
+ goto expire;
+
+ return rq;
+
+expire:
+ bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
+ return rq;
+}
+
+static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
+
+ /*
+ * Avoiding lock: a race on bfqd->busy_queues should cause at
+ * most a call to dispatch for nothing
+ */
+ return !list_empty_careful(&bfqd->dispatch) ||
+ bfqd->busy_queues > 0;
+}
+
+static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
+ struct request *rq = NULL;
+ struct bfq_queue *bfqq = NULL;
+
+ if (!list_empty(&bfqd->dispatch)) {
+ rq = list_first_entry(&bfqd->dispatch, struct request,
+ queuelist);
+ list_del_init(&rq->queuelist);
+
+ bfqq = RQ_BFQQ(rq);
+
+ if (bfqq) {
+ /*
+ * Increment counters here, because this
+ * dispatch does not follow the standard
+ * dispatch flow (where counters are
+ * incremented)
+ */
+ bfqq->dispatched++;
+
+ goto inc_in_driver_start_rq;
+ }
+
+ /*
+ * We exploit the put_rq_private hook to decrement
+ * rq_in_driver, but put_rq_private will not be
+ * invoked on this request. So, to avoid unbalance,
+ * just start this request, without incrementing
+ * rq_in_driver. As a negative consequence,
+ * rq_in_driver is deceptively lower than it should be
+ * while this request is in service. This may cause
+ * bfq_schedule_dispatch to be invoked uselessly.
+ *
+ * As for implementing an exact solution, the
+ * put_request hook, if defined, is probably invoked
+ * also on this request. So, by exploiting this hook,
+ * we could 1) increment rq_in_driver here, and 2)
+ * decrement it in put_request. Such a solution would
+ * let the value of the counter be always accurate,
+ * but it would entail using an extra interface
+ * function. This cost seems higher than the benefit,
+ * being the frequency of non-elevator-private
+ * requests very low.
+ */
+ goto start_rq;
+ }
+
+ bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
+
+ if (bfqd->busy_queues == 0)
+ goto exit;
+
+ /*
+ * Force device to serve one request at a time if
+ * strict_guarantees is true. Forcing this service scheme is
+ * currently the ONLY way to guarantee that the request
+ * service order enforced by the scheduler is respected by a
+ * queueing device. Otherwise the device is free even to make
+ * some unlucky request wait for as long as the device
+ * wishes.
+ *
+ * Of course, serving one request at at time may cause loss of
+ * throughput.
+ */
+ if (bfqd->strict_guarantees && bfqd->rq_in_driver > 0)
+ goto exit;
+
+ bfqq = bfq_select_queue(bfqd);
+ if (!bfqq)
+ goto exit;
+
+ rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq);
+
+ if (rq) {
+inc_in_driver_start_rq:
+ bfqd->rq_in_driver++;
+start_rq:
+ rq->rq_flags |= RQF_STARTED;
+ }
+exit:
+ return rq;
+}
+
+static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
+ struct request *rq;
+
+ spin_lock_irq(&bfqd->lock);
+ rq = __bfq_dispatch_request(hctx);
+ spin_unlock_irq(&bfqd->lock);
+
+ return rq;
+}
+
+/*
+ * Task holds one reference to the queue, dropped when task exits. Each rq
+ * in-flight on this queue also holds a reference, dropped when rq is freed.
+ *
+ * Scheduler lock must be held here. Recall not to use bfqq after calling
+ * this function on it.
+ */
+static void bfq_put_queue(struct bfq_queue *bfqq)
+{
+ if (bfqq->bfqd)
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "put_queue: %p %d",
+ bfqq, bfqq->ref);
+
+ bfqq->ref--;
+ if (bfqq->ref)
+ return;
+
+ kmem_cache_free(bfq_pool, bfqq);
+}
+
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+ if (bfqq == bfqd->in_service_queue) {
+ __bfq_bfqq_expire(bfqd, bfqq);
+ bfq_schedule_dispatch(bfqd);
+ }
+
+ bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, bfqq->ref);
+
+ bfq_put_queue(bfqq); /* release process reference */
+}
+
+static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync)
+{
+ struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync);
+ struct bfq_data *bfqd;
+
+ if (bfqq)
+ bfqd = bfqq->bfqd; /* NULL if scheduler already exited */
+
+ if (bfqq && bfqd) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&bfqd->lock, flags);
+ bfq_exit_bfqq(bfqd, bfqq);
+ bic_set_bfqq(bic, NULL, is_sync);
+ spin_unlock_irq(&bfqd->lock);
+ }
+}
+
+static void bfq_exit_icq(struct io_cq *icq)
+{
+ struct bfq_io_cq *bic = icq_to_bic(icq);
+
+ bfq_exit_icq_bfqq(bic, true);
+ bfq_exit_icq_bfqq(bic, false);
+}
+
+/*
+ * Update the entity prio values; note that the new values will not
+ * be used until the next (re)activation.
+ */
+static void
+bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+ struct task_struct *tsk = current;
+ int ioprio_class;
+ struct bfq_data *bfqd = bfqq->bfqd;
+
+ if (!bfqd)
+ return;
+
+ ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+ switch (ioprio_class) {
+ default:
+ dev_err(bfqq->bfqd->queue->backing_dev_info->dev,
+ "bfq: bad prio class %d\n", ioprio_class);
+ case IOPRIO_CLASS_NONE:
+ /*
+ * No prio set, inherit CPU scheduling settings.
+ */
+ bfqq->new_ioprio = task_nice_ioprio(tsk);
+ bfqq->new_ioprio_class = task_nice_ioclass(tsk);
+ break;
+ case IOPRIO_CLASS_RT:
+ bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+ bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
+ break;
+ case IOPRIO_CLASS_BE:
+ bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+ bfqq->new_ioprio_class = IOPRIO_CLASS_BE;
+ break;
+ case IOPRIO_CLASS_IDLE:
+ bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
+ bfqq->new_ioprio = 7;
+ bfq_clear_bfqq_idle_window(bfqq);
+ break;
+ }
+
+ if (bfqq->new_ioprio >= IOPRIO_BE_NR) {
+ pr_crit("bfq_set_next_ioprio_data: new_ioprio %d\n",
+ bfqq->new_ioprio);
+ bfqq->new_ioprio = IOPRIO_BE_NR;
+ }
+
+ bfqq->entity.new_weight = bfq_ioprio_to_weight(bfqq->new_ioprio);
+ bfqq->entity.prio_changed = 1;
+}
+
+static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)
+{
+ struct bfq_data *bfqd = bic_to_bfqd(bic);
+ struct bfq_queue *bfqq;
+ int ioprio = bic->icq.ioc->ioprio;
+
+ /*
+ * This condition may trigger on a newly created bic, be sure to
+ * drop the lock before returning.
+ */
+ if (unlikely(!bfqd) || likely(bic->ioprio == ioprio))
+ return;
+
+ bic->ioprio = ioprio;
+
+ bfqq = bic_to_bfqq(bic, false);
+ if (bfqq) {
+ /* release process reference on this queue */
+ bfq_put_queue(bfqq);
+ bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic);
+ bic_set_bfqq(bic, bfqq, false);
+ }
+
+ bfqq = bic_to_bfqq(bic, true);
+ if (bfqq)
+ bfq_set_next_ioprio_data(bfqq, bic);
+}
+
+static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct bfq_io_cq *bic, pid_t pid, int is_sync)
+{
+ RB_CLEAR_NODE(&bfqq->entity.rb_node);
+ INIT_LIST_HEAD(&bfqq->fifo);
+
+ bfqq->ref = 0;
+ bfqq->bfqd = bfqd;
+
+ if (bic)
+ bfq_set_next_ioprio_data(bfqq, bic);
+
+ if (is_sync) {
+ if (!bfq_class_idle(bfqq))
+ bfq_mark_bfqq_idle_window(bfqq);
+ bfq_mark_bfqq_sync(bfqq);
+ } else
+ bfq_clear_bfqq_sync(bfqq);
+
+ /* set end request to minus infinity from now */
+ bfqq->ttime.last_end_request = ktime_get_ns() + 1;
+
+ bfq_mark_bfqq_IO_bound(bfqq);
+
+ bfqq->pid = pid;
+
+ /* Tentative initial value to trade off between thr and lat */
+ bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+ bfqq->budget_timeout = bfq_smallest_from_now();
+ bfqq->pid = pid;
+
+ /* first request is almost certainly seeky */
+ bfqq->seek_history = 1;
+}
+
+static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+ int ioprio_class, int ioprio)
+{
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ return &async_bfqq[0][ioprio];
+ case IOPRIO_CLASS_NONE:
+ ioprio = IOPRIO_NORM;
+ /* fall through */
+ case IOPRIO_CLASS_BE:
+ return &async_bfqq[1][ioprio];
+ case IOPRIO_CLASS_IDLE:
+ return &async_idle_bfqq;
+ default:
+ return NULL;
+ }
+}
+
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+ struct bio *bio, bool is_sync,
+ struct bfq_io_cq *bic)
+{
+ const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+ const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+ struct bfq_queue **async_bfqq = NULL;
+ struct bfq_queue *bfqq;
+
+ rcu_read_lock();
+
+ if (!is_sync) {
+ async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class,
+ ioprio);
+ bfqq = *async_bfqq;
+ if (bfqq)
+ goto out;
+ }
+
+ bfqq = kmem_cache_alloc_node(bfq_pool,
+ GFP_NOWAIT | __GFP_ZERO | __GFP_NOWARN,
+ bfqd->queue->node);
+
+ if (bfqq) {
+ bfq_init_bfqq(bfqd, bfqq, bic, current->pid,
+ is_sync);
+ bfq_init_entity(&bfqq->entity);
+ bfq_log_bfqq(bfqd, bfqq, "allocated");
+ } else {
+ bfqq = &bfqd->oom_bfqq;
+ bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
+ goto out;
+ }
+
+ /*
+ * Pin the queue now that it's allocated, scheduler exit will
+ * prune it.
+ */
+ if (async_bfqq) {
+ bfqq->ref++;
+ bfq_log_bfqq(bfqd, bfqq,
+ "get_queue, bfqq not in async: %p, %d",
+ bfqq, bfqq->ref);
+ *async_bfqq = bfqq;
+ }
+
+out:
+ bfqq->ref++; /* get a process reference to this queue */
+ bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq, bfqq->ref);
+ rcu_read_unlock();
+ return bfqq;
+}
+
+static void bfq_update_io_thinktime(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ struct bfq_ttime *ttime = &bfqq->ttime;
+ u64 elapsed = ktime_get_ns() - bfqq->ttime.last_end_request;
+
+ elapsed = min_t(u64, elapsed, 2ULL * bfqd->bfq_slice_idle);
+
+ ttime->ttime_samples = (7*bfqq->ttime.ttime_samples + 256) / 8;
+ ttime->ttime_total = div_u64(7*ttime->ttime_total + 256*elapsed, 8);
+ ttime->ttime_mean = div64_ul(ttime->ttime_total + 128,
+ ttime->ttime_samples);
+}
+
+static void
+bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct request *rq)
+{
+ sector_t sdist = 0;
+
+ if (bfqq->last_request_pos) {
+ if (bfqq->last_request_pos < blk_rq_pos(rq))
+ sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
+ else
+ sdist = bfqq->last_request_pos - blk_rq_pos(rq);
+ }
+
+ bfqq->seek_history <<= 1;
+ bfqq->seek_history |= sdist > BFQQ_SEEK_THR &&
+ (!blk_queue_nonrot(bfqd->queue) ||
+ blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT);
+}
+
+/*
+ * Disable idle window if the process thinks too long or seeks so much that
+ * it doesn't matter.
+ */
+static void bfq_update_idle_window(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq,
+ struct bfq_io_cq *bic)
+{
+ int enable_idle;
+
+ /* Don't idle for async or idle io prio class. */
+ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+ return;
+
+ enable_idle = bfq_bfqq_idle_window(bfqq);
+
+ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+ bfqd->bfq_slice_idle == 0 ||
+ (bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+ enable_idle = 0;
+ else if (bfq_sample_valid(bfqq->ttime.ttime_samples)) {
+ if (bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle)
+ enable_idle = 0;
+ else
+ enable_idle = 1;
+ }
+ bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
+ enable_idle);
+
+ if (enable_idle)
+ bfq_mark_bfqq_idle_window(bfqq);
+ else
+ bfq_clear_bfqq_idle_window(bfqq);
+}
+
+/*
+ * Called when a new fs request (rq) is added to bfqq. Check if there's
+ * something we should do about it.
+ */
+static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct request *rq)
+{
+ struct bfq_io_cq *bic = RQ_BIC(rq);
+
+ if (rq->cmd_flags & REQ_META)
+ bfqq->meta_pending++;
+
+ bfq_update_io_thinktime(bfqd, bfqq);
+ bfq_update_io_seektime(bfqd, bfqq, rq);
+ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+ !BFQQ_SEEKY(bfqq))
+ bfq_update_idle_window(bfqd, bfqq, bic);
+
+ bfq_log_bfqq(bfqd, bfqq,
+ "rq_enqueued: idle_window=%d (seeky %d)",
+ bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq));
+
+ bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
+
+ if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
+ bool small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
+ blk_rq_sectors(rq) < 32;
+ bool budget_timeout = bfq_bfqq_budget_timeout(bfqq);
+
+ /*
+ * There is just this request queued: if the request
+ * is small and the queue is not to be expired, then
+ * just exit.
+ *
+ * In this way, if the device is being idled to wait
+ * for a new request from the in-service queue, we
+ * avoid unplugging the device and committing the
+ * device to serve just a small request. On the
+ * contrary, we wait for the block layer to decide
+ * when to unplug the device: hopefully, new requests
+ * will be merged to this one quickly, then the device
+ * will be unplugged and larger requests will be
+ * dispatched.
+ */
+ if (small_req && !budget_timeout)
+ return;
+
+ /*
+ * A large enough request arrived, or the queue is to
+ * be expired: in both cases disk idling is to be
+ * stopped, so clear wait_request flag and reset
+ * timer.
+ */
+ bfq_clear_bfqq_wait_request(bfqq);
+ hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+
+ /*
+ * The queue is not empty, because a new request just
+ * arrived. Hence we can safely expire the queue, in
+ * case of budget timeout, without risking that the
+ * timestamps of the queue are not updated correctly.
+ * See [1] for more details.
+ */
+ if (budget_timeout)
+ bfq_bfqq_expire(bfqd, bfqq, false,
+ BFQQE_BUDGET_TIMEOUT);
+ }
+}
+
+static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
+{
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+ bfq_add_request(rq);
+
+ rq->fifo_time = ktime_get_ns() + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+ list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+ bfq_rq_enqueued(bfqd, bfqq, rq);
+}
+
+static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool at_head)
+{
+ struct request_queue *q = hctx->queue;
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+
+ spin_lock_irq(&bfqd->lock);
+ if (blk_mq_sched_try_insert_merge(q, rq)) {
+ spin_unlock_irq(&bfqd->lock);
+ return;
+ }
+
+ spin_unlock_irq(&bfqd->lock);
+
+ blk_mq_sched_request_inserted(rq);
+
+ spin_lock_irq(&bfqd->lock);
+ if (at_head || blk_rq_is_passthrough(rq)) {
+ if (at_head)
+ list_add(&rq->queuelist, &bfqd->dispatch);
+ else
+ list_add_tail(&rq->queuelist, &bfqd->dispatch);
+ } else {
+ __bfq_insert_request(bfqd, rq);
+
+ if (rq_mergeable(rq)) {
+ elv_rqhash_add(q, rq);
+ if (!q->last_merge)
+ q->last_merge = rq;
+ }
+ }
+
+ spin_unlock_irq(&bfqd->lock);
+}
+
+static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
+ struct list_head *list, bool at_head)
+{
+ while (!list_empty(list)) {
+ struct request *rq;
+
+ rq = list_first_entry(list, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+ bfq_insert_request(hctx, rq, at_head);
+ }
+}
+
+static void bfq_update_hw_tag(struct bfq_data *bfqd)
+{
+ bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
+ bfqd->rq_in_driver);
+
+ if (bfqd->hw_tag == 1)
+ return;
+
+ /*
+ * This sample is valid if the number of outstanding requests
+ * is large enough to allow a queueing behavior. Note that the
+ * sum is not exact, as it's not taking into account deactivated
+ * requests.
+ */
+ if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+ return;
+
+ if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
+ return;
+
+ bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
+ bfqd->max_rq_in_driver = 0;
+ bfqd->hw_tag_samples = 0;
+}
+
+static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
+{
+ bfq_update_hw_tag(bfqd);
+
+ bfqd->rq_in_driver--;
+ bfqq->dispatched--;
+
+ bfqq->ttime.last_end_request = ktime_get_ns();
+
+ /*
+ * If this is the in-service queue, check if it needs to be expired,
+ * or if we want to idle in case it has no pending requests.
+ */
+ if (bfqd->in_service_queue == bfqq) {
+ if (bfq_bfqq_budget_new(bfqq))
+ bfq_set_budget_timeout(bfqd);
+
+ if (bfq_bfqq_must_idle(bfqq)) {
+ bfq_arm_slice_timer(bfqd);
+ return;
+ } else if (bfq_may_expire_for_budg_timeout(bfqq))
+ bfq_bfqq_expire(bfqd, bfqq, false,
+ BFQQE_BUDGET_TIMEOUT);
+ else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
+ (bfqq->dispatched == 0 ||
+ !bfq_bfqq_may_idle(bfqq)))
+ bfq_bfqq_expire(bfqd, bfqq, false,
+ BFQQE_NO_MORE_REQUESTS);
+ }
+}
+
+static void bfq_put_rq_priv_body(struct bfq_queue *bfqq)
+{
+ bfqq->allocated--;
+
+ bfq_put_queue(bfqq);
+}
+
+static void bfq_put_rq_private(struct request_queue *q, struct request *rq)
+{
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+ struct bfq_data *bfqd = bfqq->bfqd;
+
+
+ if (likely(rq->rq_flags & RQF_STARTED)) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&bfqd->lock, flags);
+
+ bfq_completed_request(bfqq, bfqd);
+ bfq_put_rq_priv_body(bfqq);
+
+ spin_unlock_irqrestore(&bfqd->lock, flags);
+ } else {
+ /*
+ * Request rq may be still/already in the scheduler,
+ * in which case we need to remove it. And we cannot
+ * defer such a check and removal, to avoid
+ * inconsistencies in the time interval from the end
+ * of this function to the start of the deferred work.
+ * This situation seems to occur only in process
+ * context, as a consequence of a merge. In the
+ * current version of the code, this implies that the
+ * lock is held.
+ */
+
+ if (!RB_EMPTY_NODE(&rq->rb_node))
+ bfq_remove_request(q, rq);
+ bfq_put_rq_priv_body(bfqq);
+ }
+
+ rq->elv.priv[0] = NULL;
+ rq->elv.priv[1] = NULL;
+}
+
+/*
+ * Allocate bfq data structures associated with this request.
+ */
+static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
+ struct bio *bio)
+{
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+ struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
+ const int is_sync = rq_is_sync(rq);
+ struct bfq_queue *bfqq;
+
+ spin_lock_irq(&bfqd->lock);
+
+ bfq_check_ioprio_change(bic, bio);
+
+ if (!bic)
+ goto queue_fail;
+
+ bfqq = bic_to_bfqq(bic, is_sync);
+ if (!bfqq || bfqq == &bfqd->oom_bfqq) {
+ if (bfqq)
+ bfq_put_queue(bfqq);
+ bfqq = bfq_get_queue(bfqd, bio, is_sync, bic);
+ bic_set_bfqq(bic, bfqq, is_sync);
+ }
+
+ bfqq->allocated++;
+ bfqq->ref++;
+ bfq_log_bfqq(bfqd, bfqq, "get_request %p: bfqq %p, %d",
+ rq, bfqq, bfqq->ref);
+
+ rq->elv.priv[0] = bic;
+ rq->elv.priv[1] = bfqq;
+
+ spin_unlock_irq(&bfqd->lock);
+
+ return 0;
+
+queue_fail:
+ spin_unlock_irq(&bfqd->lock);
+
+ return 1;
+}
+
+static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq)
+{
+ struct bfq_data *bfqd = bfqq->bfqd;
+ enum bfqq_expiration reason;
+ unsigned long flags;
+
+ spin_lock_irqsave(&bfqd->lock, flags);
+ bfq_clear_bfqq_wait_request(bfqq);
+
+ if (bfqq != bfqd->in_service_queue) {
+ spin_unlock_irqrestore(&bfqd->lock, flags);
+ return;
+ }
+
+ if (bfq_bfqq_budget_timeout(bfqq))
+ /*
+ * Also here the queue can be safely expired
+ * for budget timeout without wasting
+ * guarantees
+ */
+ reason = BFQQE_BUDGET_TIMEOUT;
+ else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
+ /*
+ * The queue may not be empty upon timer expiration,
+ * because we may not disable the timer when the
+ * first request of the in-service queue arrives
+ * during disk idling.
+ */
+ reason = BFQQE_TOO_IDLE;
+ else
+ goto schedule_dispatch;
+
+ bfq_bfqq_expire(bfqd, bfqq, true, reason);
+
+schedule_dispatch:
+ spin_unlock_irqrestore(&bfqd->lock, flags);
+ bfq_schedule_dispatch(bfqd);
+}
+
+/*
+ * Handler of the expiration of the timer running if the in-service queue
+ * is idling inside its time slice.
+ */
+static enum hrtimer_restart bfq_idle_slice_timer(struct hrtimer *timer)
+{
+ struct bfq_data *bfqd = container_of(timer, struct bfq_data,
+ idle_slice_timer);
+ struct bfq_queue *bfqq = bfqd->in_service_queue;
+
+ /*
+ * Theoretical race here: the in-service queue can be NULL or
+ * different from the queue that was idling if a new request
+ * arrives for the current queue and there is a full dispatch
+ * cycle that changes the in-service queue. This can hardly
+ * happen, but in the worst case we just expire a queue too
+ * early.
+ */
+ if (bfqq)
+ bfq_idle_slice_timer_body(bfqq);
+
+ return HRTIMER_NORESTART;
+}
+
+static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
+ struct bfq_queue **bfqq_ptr)
+{
+ struct bfq_queue *bfqq = *bfqq_ptr;
+
+ bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
+ if (bfqq) {
+ bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
+ bfqq, bfqq->ref);
+ bfq_put_queue(bfqq);
+ *bfqq_ptr = NULL;
+ }
+}
+
+/*
+ * Release the extra reference of the async queues as the device
+ * goes away.
+ */
+static void bfq_put_async_queues(struct bfq_data *bfqd)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ __bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+
+ __bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+}
+
+static void bfq_exit_queue(struct elevator_queue *e)
+{
+ struct bfq_data *bfqd = e->elevator_data;
+ struct bfq_queue *bfqq, *n;
+
+ hrtimer_cancel(&bfqd->idle_slice_timer);
+
+ spin_lock_irq(&bfqd->lock);
+ list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
+ bfq_deactivate_bfqq(bfqd, bfqq, false);
+ bfq_put_async_queues(bfqd);
+ spin_unlock_irq(&bfqd->lock);
+
+ hrtimer_cancel(&bfqd->idle_slice_timer);
+
+ kfree(bfqd);
+}
+
+static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+ struct bfq_data *bfqd;
+ struct elevator_queue *eq;
+ int i;
+
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
+ bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
+ if (!bfqd) {
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+ eq->elevator_data = bfqd;
+
+ /*
+ * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
+ * Grab a permanent reference to it, so that the normal code flow
+ * will not attempt to free it.
+ */
+ bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, NULL, 1, 0);
+ bfqd->oom_bfqq.ref++;
+ bfqd->oom_bfqq.new_ioprio = BFQ_DEFAULT_QUEUE_IOPRIO;
+ bfqd->oom_bfqq.new_ioprio_class = IOPRIO_CLASS_BE;
+ bfqd->oom_bfqq.entity.new_weight =
+ bfq_ioprio_to_weight(bfqd->oom_bfqq.new_ioprio);
+ /*
+ * Trigger weight initialization, according to ioprio, at the
+ * oom_bfqq's first activation. The oom_bfqq's ioprio and ioprio
+ * class won't be changed any more.
+ */
+ bfqd->oom_bfqq.entity.prio_changed = 1;
+
+ bfqd->queue = q;
+
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+ bfqd->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+ hrtimer_init(&bfqd->idle_slice_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
+
+ INIT_LIST_HEAD(&bfqd->active_list);
+ INIT_LIST_HEAD(&bfqd->idle_list);
+
+ bfqd->hw_tag = -1;
+
+ bfqd->bfq_max_budget = bfq_default_max_budget;
+
+ bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
+ bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
+ bfqd->bfq_back_max = bfq_back_max;
+ bfqd->bfq_back_penalty = bfq_back_penalty;
+ bfqd->bfq_slice_idle = bfq_slice_idle;
+ bfqd->bfq_class_idle_last_service = 0;
+ bfqd->bfq_timeout = bfq_timeout;
+
+ bfqd->bfq_requests_within_timer = 120;
+
+ spin_lock_init(&bfqd->lock);
+ INIT_LIST_HEAD(&bfqd->dispatch);
+
+ q->elevator = eq;
+
+ return 0;
+}
+
+static void bfq_slab_kill(void)
+{
+ kmem_cache_destroy(bfq_pool);
+}
+
+static int __init bfq_slab_setup(void)
+{
+ bfq_pool = KMEM_CACHE(bfq_queue, 0);
+ if (!bfq_pool)
+ return -ENOMEM;
+ return 0;
+}
+
+static ssize_t bfq_var_show(unsigned int var, char *page)
+{
+ return sprintf(page, "%u\n", var);
+}
+
+static ssize_t bfq_var_store(unsigned long *var, const char *page,
+ size_t count)
+{
+ unsigned long new_val;
+ int ret = kstrtoul(page, 10, &new_val);
+
+ if (ret == 0)
+ *var = new_val;
+
+ return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
+static ssize_t __FUNC(struct elevator_queue *e, char *page) \
+{ \
+ struct bfq_data *bfqd = e->elevator_data; \
+ u64 __data = __VAR; \
+ if (__CONV == 1) \
+ __data = jiffies_to_msecs(__data); \
+ else if (__CONV == 2) \
+ __data = div_u64(__data, NSEC_PER_MSEC); \
+ return bfq_var_show(__data, (page)); \
+}
+SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 2);
+SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 2);
+SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
+SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
+SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 2);
+SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
+SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1);
+SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0);
+#undef SHOW_FUNCTION
+
+#define USEC_SHOW_FUNCTION(__FUNC, __VAR) \
+static ssize_t __FUNC(struct elevator_queue *e, char *page) \
+{ \
+ struct bfq_data *bfqd = e->elevator_data; \
+ u64 __data = __VAR; \
+ __data = div_u64(__data, NSEC_PER_USEC); \
+ return bfq_var_show(__data, (page)); \
+}
+USEC_SHOW_FUNCTION(bfq_slice_idle_us_show, bfqd->bfq_slice_idle);
+#undef USEC_SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
+static ssize_t \
+__FUNC(struct elevator_queue *e, const char *page, size_t count) \
+{ \
+ struct bfq_data *bfqd = e->elevator_data; \
+ unsigned long uninitialized_var(__data); \
+ int ret = bfq_var_store(&__data, (page), count); \
+ if (__data < (MIN)) \
+ __data = (MIN); \
+ else if (__data > (MAX)) \
+ __data = (MAX); \
+ if (__CONV == 1) \
+ *(__PTR) = msecs_to_jiffies(__data); \
+ else if (__CONV == 2) \
+ *(__PTR) = (u64)__data * NSEC_PER_MSEC; \
+ else \
+ *(__PTR) = __data; \
+ return ret; \
+}
+STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
+ INT_MAX, 2);
+STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
+ INT_MAX, 2);
+STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
+STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
+ INT_MAX, 0);
+STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
+#undef STORE_FUNCTION
+
+#define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX) \
+static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{ \
+ struct bfq_data *bfqd = e->elevator_data; \
+ unsigned long uninitialized_var(__data); \
+ int ret = bfq_var_store(&__data, (page), count); \
+ if (__data < (MIN)) \
+ __data = (MIN); \
+ else if (__data > (MAX)) \
+ __data = (MAX); \
+ *(__PTR) = (u64)__data * NSEC_PER_USEC; \
+ return ret; \
+}
+USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
+ UINT_MAX);
+#undef USEC_STORE_FUNCTION
+
+static unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
+{
+ u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout);
+
+ if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
+ return bfq_calc_max_budget(bfqd->peak_rate, timeout);
+ else
+ return bfq_default_max_budget;
+}
+
+static ssize_t bfq_max_budget_store(struct elevator_queue *e,
+ const char *page, size_t count)
+{
+ struct bfq_data *bfqd = e->elevator_data;
+ unsigned long uninitialized_var(__data);
+ int ret = bfq_var_store(&__data, (page), count);
+
+ if (__data == 0)
+ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+ else {
+ if (__data > INT_MAX)
+ __data = INT_MAX;
+ bfqd->bfq_max_budget = __data;
+ }
+
+ bfqd->bfq_user_max_budget = __data;
+
+ return ret;
+}
+
+/*
+ * Leaving this name to preserve name compatibility with cfq
+ * parameters, but this timeout is used for both sync and async.
+ */
+static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
+ const char *page, size_t count)
+{
+ struct bfq_data *bfqd = e->elevator_data;
+ unsigned long uninitialized_var(__data);
+ int ret = bfq_var_store(&__data, (page), count);
+
+ if (__data < 1)
+ __data = 1;
+ else if (__data > INT_MAX)
+ __data = INT_MAX;
+
+ bfqd->bfq_timeout = msecs_to_jiffies(__data);
+ if (bfqd->bfq_user_max_budget == 0)
+ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+
+ return ret;
+}
+
+static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,
+ const char *page, size_t count)
+{
+ struct bfq_data *bfqd = e->elevator_data;
+ unsigned long uninitialized_var(__data);
+ int ret = bfq_var_store(&__data, (page), count);
+
+ if (__data > 1)
+ __data = 1;
+ if (!bfqd->strict_guarantees && __data == 1
+ && bfqd->bfq_slice_idle < 8 * NSEC_PER_MSEC)
+ bfqd->bfq_slice_idle = 8 * NSEC_PER_MSEC;
+
+ bfqd->strict_guarantees = __data;
+
+ return ret;
+}
+
+#define BFQ_ATTR(name) \
+ __ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
+
+static struct elv_fs_entry bfq_attrs[] = {
+ BFQ_ATTR(fifo_expire_sync),
+ BFQ_ATTR(fifo_expire_async),
+ BFQ_ATTR(back_seek_max),
+ BFQ_ATTR(back_seek_penalty),
+ BFQ_ATTR(slice_idle),
+ BFQ_ATTR(slice_idle_us),
+ BFQ_ATTR(max_budget),
+ BFQ_ATTR(timeout_sync),
+ BFQ_ATTR(strict_guarantees),
+ __ATTR_NULL
+};
+
+static struct elevator_type iosched_bfq_mq = {
+ .ops.mq = {
+ .get_rq_priv = bfq_get_rq_private,
+ .put_rq_priv = bfq_put_rq_private,
+ .exit_icq = bfq_exit_icq,
+ .insert_requests = bfq_insert_requests,
+ .dispatch_request = bfq_dispatch_request,
+ .next_request = elv_rb_latter_request,
+ .former_request = elv_rb_former_request,
+ .allow_merge = bfq_allow_bio_merge,
+ .bio_merge = bfq_bio_merge,
+ .request_merge = bfq_request_merge,
+ .requests_merged = bfq_requests_merged,
+ .request_merged = bfq_request_merged,
+ .has_work = bfq_has_work,
+ .init_sched = bfq_init_queue,
+ .exit_sched = bfq_exit_queue,
+ },
+
+ .uses_mq = true,
+ .icq_size = sizeof(struct bfq_io_cq),
+ .icq_align = __alignof__(struct bfq_io_cq),
+ .elevator_attrs = bfq_attrs,
+ .elevator_name = "bfq",
+ .elevator_owner = THIS_MODULE,
+};
+
+static int __init bfq_init(void)
+{
+ int ret;
+
+ ret = -ENOMEM;
+ if (bfq_slab_setup())
+ goto err_pol_unreg;
+
+ ret = elv_register(&iosched_bfq_mq);
+ if (ret)
+ goto err_pol_unreg;
+
+ return 0;
+
+err_pol_unreg:
+ return ret;
+}
+
+static void __exit bfq_exit(void)
+{
+ elv_unregister(&iosched_bfq_mq);
+ bfq_slab_kill();
+}
+
+module_init(bfq_init);
+module_exit(bfq_exit);
+
+MODULE_AUTHOR("Paolo Valente");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("MQ Budget Fair Queueing I/O Scheduler");
--
2.10.0
^ permalink raw reply related
* [PATCH V4 00/16] Introduce the BFQ I/O scheduler
From: Paolo Valente @ 2017-04-12 16:23 UTC (permalink / raw)
To: Jens Axboe, Tejun Heo
Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
ulf.hansson, linus.walleij, broonie, Paolo Valente
Hi,
new patch series, addressing (both) issues raised by Bart [1], and
with block/Makefile fixed as suggested by Bart [2].
Thanks,
Paolo
[1] https://lkml.org/lkml/2017/3/31/393
[2] https://lkml.org/lkml/2017/4/12/502
Arianna Avanzini (4):
block, bfq: add full hierarchical scheduling and cgroups support
block, bfq: add Early Queue Merge (EQM)
block, bfq: reduce idling only in symmetric scenarios
block, bfq: handle bursts of queue activations
Paolo Valente (12):
block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
block, bfq: improve throughput boosting
block, bfq: modify the peak-rate estimator
block, bfq: add more fairness with writes and slow processes
block, bfq: improve responsiveness
block, bfq: reduce I/O latency for soft real-time applications
block, bfq: preserve a low latency also with NCQ-capable drives
block, bfq: reduce latency during request-pool saturation
block, bfq: boost the throughput on NCQ-capable flash-based devices
block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
block, bfq: remove all get and put of I/O contexts
block, bfq: split bfq-iosched.c into multiple source files
Documentation/block/00-INDEX | 2 +
Documentation/block/bfq-iosched.txt | 531 ++++
block/Kconfig.iosched | 21 +
block/Makefile | 2 +
block/bfq-cgroup.c | 1139 ++++++++
block/bfq-iosched.c | 5047 +++++++++++++++++++++++++++++++++++
block/bfq-iosched.h | 942 +++++++
block/bfq-wf2q.c | 1616 +++++++++++
include/linux/blkdev.h | 2 +-
9 files changed, 9301 insertions(+), 1 deletion(-)
create mode 100644 Documentation/block/bfq-iosched.txt
create mode 100644 block/bfq-cgroup.c
create mode 100644 block/bfq-iosched.c
create mode 100644 block/bfq-iosched.h
create mode 100644 block/bfq-wf2q.c
--
2.10.0
^ permalink raw reply
* Re: [PATCH V3 00/16] Introduce the BFQ I/O scheduler
From: Paolo Valente @ 2017-04-12 16:08 UTC (permalink / raw)
To: Bart Van Assche
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
Arianna Avanzini, broonie@kernel.org, tj@kernel.org,
ulf.hansson@linaro.org
In-Reply-To: <1492011029.2764.3.camel@sandisk.com>
> Il giorno 12 apr 2017, alle ore 17:30, Bart Van Assche =
<Bart.VanAssche@sandisk.com> ha scritto:
>=20
> On Wed, 2017-04-12 at 08:01 +0200, Paolo Valente wrote:
>> Where is my mistake?
>=20
> I think in the Makefile. How about the patch below? Please note that =
I'm no
> Kbuild expert.
>=20
Thank you very much for finding and fixing the bug. I was working
exactly on that, and had got to the same solution (which I guess is
the only correct one). I'll apply these changes and resubmit.
Thanks,
Paolo
> diff --git a/block/Makefile b/block/Makefile
> index 546066ee7fa6..b3711af6b637 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -20,7 +20,8 @@ obj-$(CONFIG_IOSCHED_NOOP) +=3D noop-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE) +=3D deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ) +=3D cfq-iosched.o
> obj-$(CONFIG_MQ_IOSCHED_DEADLINE) +=3D mq-deadline.o
> -obj-$(CONFIG_IOSCHED_BFQ) +=3D bfq-iosched.o bfq-wf2q.o =
bfq-cgroup.o
> +bfq-y :=3D bfq-iosched.o bfq-wf2q.o =
bfq-cgroup.o
> +obj-$(CONFIG_IOSCHED_BFQ) +=3D bfq.o
> obj-$(CONFIG_BLOCK_COMPAT) +=3D compat_ioctl.o
> obj-$(CONFIG_BLK_CMDLINE_PARSER) +=3D cmdline-parser.o
>=20
^ permalink raw reply
* Re: [kbuild-all] [PATCH V2 16/16] block, bfq: split bfq-iosched.c into multiple source files
From: Paolo Valente @ 2017-04-12 16:05 UTC (permalink / raw)
To: Ye Xiaolong
Cc: kbuild test robot, Jens Axboe, Ulf Hansson, Linus Walleij,
Linux-Kernal, linux-block, Fabio Checconi, Mark Brown, kbuild-all,
Arianna Avanzini, Tejun Heo
In-Reply-To: <0313304F-A998-4C5F-BA32-E34C6F85A8CF@linaro.org>
> Il giorno 12 apr 2017, alle ore 11:24, Paolo Valente =
<paolo.valente@linaro.org> ha scritto:
>=20
>>=20
>> Il giorno 12 apr 2017, alle ore 10:39, Ye Xiaolong =
<xiaolong.ye@intel.com> ha scritto:
>>=20
>> On 04/11, Paolo Valente wrote:
>>>=20
>>>> Il giorno 02 apr 2017, alle ore 12:02, kbuild test robot =
<lkp@intel.com> ha scritto:
>>>>=20
>>>> Hi Paolo,
>>>>=20
>>>> [auto build test ERROR on block/for-next]
>>>> [also build test ERROR on v4.11-rc4 next-20170331]
>>>> [if your patch is applied to the wrong git tree, please drop us a =
note to help improve the system]
>>>>=20
>>>=20
>>> Hi,
>>> this seems to be a false positive. Build is correct with the tested
>>> tree and the .config.
>>>=20
>>=20
>> Hmm, this error is reproducible in 0day side, and you patches were =
applied on
>> top of 803e16d "Merge branch 'for-4.12/block' into for-next", is it =
the same as
>> yours?
>>=20
>=20
> I have downloaded the offending tree directly from the github page.
>=20
> Here are my steps in super detail.
>=20
> I followed the url: =
https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce=
-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
> and downloaded the tree ("Browse the repository at this point in
> history" link on the top commit, then "Download ZIP"), plus the
> .config.gz attached to the email.
>=20
> Then I built with no error.
>=20
> To try to help understand where the mistake is, the compilation of the
> files of course fails because each of the offending files does not
> contain the definition of the reported functions. But that definition
> is contained in one of the other files for the same module. I mean
> one of the files listed in the following rule in block/Makefile:
> obj-$(CONFIG_IOSCHED_BFQ) +=3D bfq-iosched.o bfq-wf2q.o =
bfq-cgroup.o
>=20
> Maybe I'm making some mistake in the Makefile, or I forgot to modify
> some other configuration file?
>=20
> Help! :)
>=20
Ok, fortunately I've reproduced it on a different PC. block/Makefile
was flawed, but, for unknown (to me) reasons, my system was perfectly
happy with the flaw.
Thanks,
Paolo
> Thanks,
> Paolo
>=20
>> Thanks,
>> Xiaolong
>>=20
>>> Thanks,
>>> Paolo
>>>=20
>>>> url: =
https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce=
-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
>>>> base: =
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git =
for-next
>>>> config: i386-allmodconfig (attached as .config)
>>>> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
>>>> reproduce:
>>>> # save the attached .config to linux build tree
>>>> make ARCH=3Di386=20
>>>>=20
>>>> All errors (new ones prefixed by >>):
>>>>=20
>>>>>> ERROR: "bfq_mark_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfqg_stats_update_dequeue" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_clear_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_clear_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] =
undefined!
>>>>>> ERROR: "bfq_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] =
undefined!
>>>>>> ERROR: "bfq_clear_bfqq_wait_request" [block/bfq-wf2q.ko] =
undefined!
>>>>>> ERROR: "bfq_timeout" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfqg_stats_set_start_empty_time" [block/bfq-wf2q.ko] =
undefined!
>>>>>> ERROR: "bfq_weights_tree_add" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_put_queue" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_bfqq_sync" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfqg_to_blkg" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfqq_group" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_weights_tree_remove" [block/bfq-wf2q.ko] undefined!
>>>>>> ERROR: "bfq_bic_update_cgroup" [block/bfq-iosched.ko] undefined!
>>>>>> ERROR: "bfqg_stats_set_start_idle_time" [block/bfq-iosched.ko] =
undefined!
>>>>>> ERROR: "bfqg_stats_update_completion" [block/bfq-iosched.ko] =
undefined!
>>>>>> ERROR: "bfq_bfqq_move" [block/bfq-iosched.ko] undefined!
>>>>>> ERROR: "bfqg_put" [block/bfq-iosched.ko] undefined!
>>>>>> ERROR: "next_queue_may_preempt" [block/bfq-iosched.ko] undefined!
>>>>=20
>>>> ---
>>>> 0-DAY kernel test infrastructure Open Source =
Technology Center
>>>> https://lists.01.org/pipermail/kbuild-all Intel =
Corporation
>>>> <.config.gz>
>>>=20
>>> _______________________________________________
>>> kbuild-all mailing list
>>> kbuild-all@lists.01.org
>>> https://lists.01.org/mailman/listinfo/kbuild-all
^ permalink raw reply
* [PATCH 4/4] scsi: remove the SCSI OSD library
From: Christoph Hellwig @ 2017-04-12 16:01 UTC (permalink / raw)
To: ooo, martin.petersen, trond.myklebust, axboe
Cc: osd-dev, linux-nfs, linux-scsi, linux-block, linux-kernel
In-Reply-To: <20170412160109.10598-1-hch@lst.de>
Now that all the users are gone the SCSI OSD library can be removed
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
Documentation/scsi/osd.txt | 192 ----
drivers/scsi/Makefile | 1 -
drivers/scsi/osd/Kbuild | 20 -
drivers/scsi/osd/Kconfig | 49 -
drivers/scsi/osd/osd_debug.h | 30 -
drivers/scsi/osd/osd_initiator.c | 2076 --------------------------------------
drivers/scsi/osd/osd_uld.c | 595 -----------
include/scsi/osd_initiator.h | 515 ----------
include/scsi/osd_ore.h | 201 ----
9 files changed, 3679 deletions(-)
delete mode 100644 Documentation/scsi/osd.txt
delete mode 100644 drivers/scsi/osd/Kbuild
delete mode 100644 drivers/scsi/osd/Kconfig
delete mode 100644 drivers/scsi/osd/osd_debug.h
delete mode 100644 drivers/scsi/osd/osd_initiator.c
delete mode 100644 drivers/scsi/osd/osd_uld.c
delete mode 100644 include/scsi/osd_initiator.h
delete mode 100644 include/scsi/osd_ore.h
diff --git a/Documentation/scsi/osd.txt b/Documentation/scsi/osd.txt
deleted file mode 100644
index 2bc2ab06b0c0..000000000000
--- a/Documentation/scsi/osd.txt
+++ /dev/null
@@ -1,192 +0,0 @@
-The OSD Standard
-================
-OSD (Object-Based Storage Device) is a T10 SCSI command set that is designed
-to provide efficient operation of input/output logical units that manage the
-allocation, placement, and accessing of variable-size data-storage containers,
-called objects. Objects are intended to contain operating system and application
-constructs. Each object has associated attributes attached to it, which are
-integral part of the object and provide metadata about the object. The standard
-defines some common obligatory attributes, but user attributes can be added as
-needed.
-
-See: http://www.t10.org/ftp/t10/drafts/osd2/ for the latest draft for OSD 2
-or search the web for "OSD SCSI"
-
-OSD in the Linux Kernel
-=======================
-osd-initiator:
- The main component of OSD in Kernel is the osd-initiator library. Its main
-user is intended to be the pNFS-over-objects layout driver, which uses objects
-as its back-end data storage. Other clients are the other osd parts listed below.
-
-osd-uld:
- This is a SCSI ULD that registers for OSD type devices and provides a testing
-platform, both for the in-kernel initiator as well as connected targets. It
-currently has no useful user-mode API, though it could have if need be.
-
-osd target:
- There are no current plans for an OSD target implementation in kernel. For all
-needs, a user-mode target that is based on the scsi tgt target framework is
-available from Ohio Supercomputer Center (OSC) at:
-http://www.open-osd.org/bin/view/Main/OscOsdProject
-There are several other target implementations. See http://open-osd.org for more
-links.
-
-Files and Folders
-=================
-This is the complete list of files included in this work:
-include/scsi/
- osd_initiator.h Main API for the initiator library
- osd_types.h Common OSD types
- osd_sec.h Security Manager API
- osd_protocol.h Wire definitions of the OSD standard protocol
- osd_attributes.h Wire definitions of OSD attributes
-
-drivers/scsi/osd/
- osd_initiator.c OSD-Initiator library implementation
- osd_uld.c The OSD scsi ULD
- osd_ktest.{h,c} In-kernel test suite (called by osd_uld)
- osd_debug.h Some printk macros
- Makefile For both in-tree and out-of-tree compilation
- Kconfig Enables inclusion of the different pieces
- osd_test.c User-mode application to call the kernel tests
-
-The OSD-Initiator Library
-=========================
-osd_initiator is a low level implementation of an osd initiator encoder.
-But even though, it should be intuitive and easy to use. Perhaps over time an
-higher lever will form that automates some of the more common recipes.
-
-init/fini:
-- osd_dev_init() associates a scsi_device with an osd_dev structure
- and initializes some global pools. This should be done once per scsi_device
- (OSD LUN). The osd_dev structure is needed for calling osd_start_request().
-
-- osd_dev_fini() cleans up before a osd_dev/scsi_device destruction.
-
-OSD commands encoding, execution, and decoding of results:
-
-struct osd_request's is used to iteratively encode an OSD command and carry
-its state throughout execution. Each request goes through these stages:
-
-a. osd_start_request() allocates the request.
-
-b. Any of the osd_req_* methods is used to encode a request of the specified
- type.
-
-c. osd_req_add_{get,set}_attr_* may be called to add get/set attributes to the
- CDB. "List" or "Page" mode can be used exclusively. The attribute-list API
- can be called multiple times on the same request. However, only one
- attribute-page can be read, as mandated by the OSD standard.
-
-d. osd_finalize_request() computes offsets into the data-in and data-out buffers
- and signs the request using the provided capability key and integrity-
- check parameters.
-
-e. osd_execute_request() may be called to execute the request via the block
- layer and wait for its completion. The request can be executed
- asynchronously by calling the block layer API directly.
-
-f. After execution, osd_req_decode_sense() can be called to decode the request's
- sense information.
-
-g. osd_req_decode_get_attr() may be called to retrieve osd_add_get_attr_list()
- values.
-
-h. osd_end_request() must be called to deallocate the request and any resource
- associated with it. Note that osd_end_request cleans up the request at any
- stage and it must always be called after a successful osd_start_request().
-
-osd_request's structure:
-
-The OSD standard defines a complex structure of IO segments pointed to by
-members in the CDB. Up to 3 segments can be deployed in the IN-Buffer and up to
-4 in the OUT-Buffer. The ASCII illustration below depicts a secure-read with
-associated get+set of attributes-lists. Other combinations very on the same
-basic theme. From no-segments-used up to all-segments-used.
-
-|________OSD-CDB__________|
-| |
-|read_len (offset=0) -|---------\
-| | |
-|get_attrs_list_length | |
-|get_attrs_list_offset -|----\ |
-| | | |
-|retrieved_attrs_alloc_len| | |
-|retrieved_attrs_offset -|----|----|-\
-| | | | |
-|set_attrs_list_length | | | |
-|set_attrs_list_offset -|-\ | | |
-| | | | | |
-|in_data_integ_offset -|-|--|----|-|-\
-|out_data_integ_offset -|-|--|--\ | | |
-\_________________________/ | | | | | |
- | | | | | |
-|_______OUT-BUFFER________| | | | | | |
-| Set attr list |</ | | | | |
-| | | | | | |
-|-------------------------| | | | | |
-| Get attr descriptors |<---/ | | | |
-| | | | | |
-|-------------------------| | | | |
-| Out-data integrity |<------/ | | |
-| | | | |
-\_________________________/ | | |
- | | |
-|________IN-BUFFER________| | | |
-| In-Data read |<--------/ | |
-| | | |
-|-------------------------| | |
-| Get attr list |<----------/ |
-| | |
-|-------------------------| |
-| In-data integrity |<------------/
-| |
-\_________________________/
-
-A block device request can carry bidirectional payload by means of associating
-a bidi_read request with a main write-request. Each in/out request is described
-by a chain of BIOs associated with each request.
-The CDB is of a SCSI VARLEN CDB format, as described by OSD standard.
-The OSD standard also mandates alignment restrictions at start of each segment.
-
-In the code, in struct osd_request, there are two _osd_io_info structures to
-describe the IN/OUT buffers above, two BIOs for the data payload and up to five
-_osd_req_data_segment structures to hold the different segments allocation and
-information.
-
-Important: We have chosen to disregard the assumption that a BIO-chain (and
-the resulting sg-list) describes a linear memory buffer. Meaning only first and
-last scatter chain can be incomplete and all the middle chains are of PAGE_SIZE.
-For us, a scatter-gather-list, as its name implies and as used by the Networking
-layer, is to describe a vector of buffers that will be transferred to/from the
-wire. It works very well with current iSCSI transport. iSCSI is currently the
-only deployed OSD transport. In the future we anticipate SAS and FC attached OSD
-devices as well.
-
-The OSD Testing ULD
-===================
-TODO: More user-mode control on tests.
-
-Authors, Mailing list
-=====================
-Please communicate with us on any deployment of osd, whether using this code
-or not.
-
-Any problems, questions, bug reports, lonely OSD nights, please email:
- OSD Dev List <osd-dev@open-osd.org>
-
-More up-to-date information can be found on:
-http://open-osd.org
-
-Boaz Harrosh <ooo@electrozaur.com>
-
-References
-==========
-Weber, R., "SCSI Object-Based Storage Device Commands",
-T10/1355-D ANSI/INCITS 400-2004,
-http://www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf
-
-Weber, R., "SCSI Object-Based Storage Device Commands -2 (OSD-2)"
-T10/1729-D, Working Draft, rev. 3
-http://www.t10.org/ftp/t10/drafts/osd2/osd2r03.pdf
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index fc2855565a51..f5e75a3dac9e 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -152,7 +152,6 @@ obj-$(CONFIG_CHR_DEV_SG) += sg.o
obj-$(CONFIG_CHR_DEV_SCH) += ch.o
obj-$(CONFIG_SCSI_ENCLOSURE) += ses.o
-obj-$(CONFIG_SCSI_OSD_INITIATOR) += osd/
obj-$(CONFIG_SCSI_HISI_SAS) += hisi_sas/
# This goes last, so that "real" scsi devices probe earlier
diff --git a/drivers/scsi/osd/Kbuild b/drivers/scsi/osd/Kbuild
deleted file mode 100644
index 58cecd45b0f5..000000000000
--- a/drivers/scsi/osd/Kbuild
+++ /dev/null
@@ -1,20 +0,0 @@
-#
-# Kbuild for the OSD modules
-#
-# Copyright (C) 2008 Panasas Inc. All rights reserved.
-#
-# Authors:
-# Boaz Harrosh <ooo@electrozaur.com>
-# Benny Halevy <bhalevy@panasas.com>
-#
-# This program is free software; you can redistribute it and/or modify
-# it under the terms of the GNU General Public License version 2
-#
-
-# libosd.ko - osd-initiator library
-libosd-y := osd_initiator.o
-obj-$(CONFIG_SCSI_OSD_INITIATOR) += libosd.o
-
-# osd.ko - SCSI ULD and char-device
-osd-y := osd_uld.o
-obj-$(CONFIG_SCSI_OSD_ULD) += osd.o
diff --git a/drivers/scsi/osd/Kconfig b/drivers/scsi/osd/Kconfig
deleted file mode 100644
index 347cc5e33749..000000000000
--- a/drivers/scsi/osd/Kconfig
+++ /dev/null
@@ -1,49 +0,0 @@
-#
-# Kernel configuration file for the OSD scsi protocol
-#
-# Copyright (C) 2008 Panasas Inc. All rights reserved.
-#
-# Authors:
-# Boaz Harrosh <ooo@electrozaur.com>
-# Benny Halevy <bhalevy@panasas.com>
-#
-# This program is free software; you can redistribute it and/or modify
-# it under the terms of the GNU General Public version 2 License as
-# published by the Free Software Foundation
-#
-config SCSI_OSD_INITIATOR
- tristate "OSD-Initiator library"
- depends on SCSI
- help
- Enable the OSD-Initiator library (libosd.ko).
- NOTE: You must also select CRYPTO_SHA1 + CRYPTO_HMAC and their
- dependencies
-
-config SCSI_OSD_ULD
- tristate "OSD Upper Level driver"
- depends on SCSI_OSD_INITIATOR
- help
- Build a SCSI upper layer driver that exports /dev/osdX devices
- to user-mode for testing and controlling OSD devices. It is also
- needed by exofs, for mounting an OSD based file system.
-
-config SCSI_OSD_DPRINT_SENSE
- int "(0-2) When sense is returned, DEBUG print all sense descriptors"
- default 1
- depends on SCSI_OSD_INITIATOR
- help
- When a CHECK_CONDITION status is returned from a target, and a
- sense-buffer is retrieved, turning this on will dump a full
- sense-decoding message. Setting to 2 will also print recoverable
- errors that might be regularly returned for some filesystem
- operations.
-
-config SCSI_OSD_DEBUG
- bool "Compile All OSD modules with lots of DEBUG prints"
- default n
- depends on SCSI_OSD_INITIATOR
- help
- OSD Code is populated with lots of OSD_DEBUG(..) printouts to
- dmesg. Enable this if you found a bug and you want to help us
- track the problem (see also MAINTAINERS). Setting this will also
- force SCSI_OSD_DPRINT_SENSE=2.
diff --git a/drivers/scsi/osd/osd_debug.h b/drivers/scsi/osd/osd_debug.h
deleted file mode 100644
index 26341261bb5c..000000000000
--- a/drivers/scsi/osd/osd_debug.h
+++ /dev/null
@@ -1,30 +0,0 @@
-/*
- * osd_debug.h - Some kprintf macros
- *
- * Copyright (C) 2008 Panasas Inc. All rights reserved.
- *
- * Authors:
- * Boaz Harrosh <ooo@electrozaur.com>
- * Benny Halevy <bhalevy@panasas.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2
- *
- */
-#ifndef __OSD_DEBUG_H__
-#define __OSD_DEBUG_H__
-
-#define OSD_ERR(fmt, a...) printk(KERN_ERR "osd: " fmt, ##a)
-#define OSD_INFO(fmt, a...) printk(KERN_NOTICE "osd: " fmt, ##a)
-
-#ifdef CONFIG_SCSI_OSD_DEBUG
-#define OSD_DEBUG(fmt, a...) \
- printk(KERN_NOTICE "osd @%s:%d: " fmt, __func__, __LINE__, ##a)
-#else
-#define OSD_DEBUG(fmt, a...) do {} while (0)
-#endif
-
-/* u64 has problems with printk this will cast it to unsigned long long */
-#define _LLU(x) (unsigned long long)(x)
-
-#endif /* ndef __OSD_DEBUG_H__ */
diff --git a/drivers/scsi/osd/osd_initiator.c b/drivers/scsi/osd/osd_initiator.c
deleted file mode 100644
index 9d0727b2bdec..000000000000
--- a/drivers/scsi/osd/osd_initiator.c
+++ /dev/null
@@ -1,2076 +0,0 @@
-/*
- * osd_initiator - Main body of the osd initiator library.
- *
- * Note: The file does not contain the advanced security functionality which
- * is only needed by the security_manager's initiators.
- *
- * Copyright (C) 2008 Panasas Inc. All rights reserved.
- *
- * Authors:
- * Boaz Harrosh <ooo@electrozaur.com>
- * Benny Halevy <bhalevy@panasas.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. Neither the name of the Panasas company nor the names of its
- * contributors may be used to endorse or promote products derived
- * from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
- * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <linux/slab.h>
-#include <linux/module.h>
-
-#include <scsi/osd_initiator.h>
-#include <scsi/osd_sec.h>
-#include <scsi/osd_attributes.h>
-#include <scsi/osd_sense.h>
-
-#include <scsi/scsi_device.h>
-#include <scsi/scsi_request.h>
-
-#include "osd_debug.h"
-
-#ifndef __unused
-# define __unused __attribute__((unused))
-#endif
-
-enum { OSD_REQ_RETRIES = 1 };
-
-MODULE_AUTHOR("Boaz Harrosh <ooo@electrozaur.com>");
-MODULE_DESCRIPTION("open-osd initiator library libosd.ko");
-MODULE_LICENSE("GPL");
-
-static inline void build_test(void)
-{
- /* structures were not packed */
- BUILD_BUG_ON(sizeof(struct osd_capability) != OSD_CAP_LEN);
- BUILD_BUG_ON(sizeof(struct osdv2_cdb) != OSD_TOTAL_CDB_LEN);
- BUILD_BUG_ON(sizeof(struct osdv1_cdb) != OSDv1_TOTAL_CDB_LEN);
-}
-
-static const char *_osd_ver_desc(struct osd_request *or)
-{
- return osd_req_is_ver1(or) ? "OSD1" : "OSD2";
-}
-
-#define ATTR_DEF_RI(id, len) ATTR_DEF(OSD_APAGE_ROOT_INFORMATION, id, len)
-
-static int _osd_get_print_system_info(struct osd_dev *od,
- void *caps, struct osd_dev_info *odi)
-{
- struct osd_request *or;
- struct osd_attr get_attrs[] = {
- ATTR_DEF_RI(OSD_ATTR_RI_VENDOR_IDENTIFICATION, 8),
- ATTR_DEF_RI(OSD_ATTR_RI_PRODUCT_IDENTIFICATION, 16),
- ATTR_DEF_RI(OSD_ATTR_RI_PRODUCT_MODEL, 32),
- ATTR_DEF_RI(OSD_ATTR_RI_PRODUCT_REVISION_LEVEL, 4),
- ATTR_DEF_RI(OSD_ATTR_RI_PRODUCT_SERIAL_NUMBER, 64 /*variable*/),
- ATTR_DEF_RI(OSD_ATTR_RI_OSD_NAME, 64 /*variable*/),
- ATTR_DEF_RI(OSD_ATTR_RI_TOTAL_CAPACITY, 8),
- ATTR_DEF_RI(OSD_ATTR_RI_USED_CAPACITY, 8),
- ATTR_DEF_RI(OSD_ATTR_RI_NUMBER_OF_PARTITIONS, 8),
- ATTR_DEF_RI(OSD_ATTR_RI_CLOCK, 6),
- /* IBM-OSD-SIM Has a bug with this one put it last */
- ATTR_DEF_RI(OSD_ATTR_RI_OSD_SYSTEM_ID, 20),
- };
- void *iter = NULL, *pFirst;
- int nelem = ARRAY_SIZE(get_attrs), a = 0;
- int ret;
-
- or = osd_start_request(od, GFP_KERNEL);
- if (!or)
- return -ENOMEM;
-
- /* get attrs */
- osd_req_get_attributes(or, &osd_root_object);
- osd_req_add_get_attr_list(or, get_attrs, ARRAY_SIZE(get_attrs));
-
- ret = osd_finalize_request(or, 0, caps, NULL);
- if (ret)
- goto out;
-
- ret = osd_execute_request(or);
- if (ret) {
- OSD_ERR("Failed to detect %s => %d\n", _osd_ver_desc(or), ret);
- goto out;
- }
-
- osd_req_decode_get_attr_list(or, get_attrs, &nelem, &iter);
-
- OSD_INFO("Detected %s device\n",
- _osd_ver_desc(or));
-
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("VENDOR_IDENTIFICATION [%s]\n",
- (char *)pFirst);
-
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("PRODUCT_IDENTIFICATION [%s]\n",
- (char *)pFirst);
-
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("PRODUCT_MODEL [%s]\n",
- (char *)pFirst);
-
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("PRODUCT_REVISION_LEVEL [%u]\n",
- pFirst ? get_unaligned_be32(pFirst) : ~0U);
-
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("PRODUCT_SERIAL_NUMBER [%s]\n",
- (char *)pFirst);
-
- odi->osdname_len = get_attrs[a].len;
- /* Avoid NULL for memcmp optimization 0-length is good enough */
- odi->osdname = kzalloc(odi->osdname_len + 1, GFP_KERNEL);
- if (!odi->osdname) {
- ret = -ENOMEM;
- goto out;
- }
- if (odi->osdname_len)
- memcpy(odi->osdname, get_attrs[a].val_ptr, odi->osdname_len);
- OSD_INFO("OSD_NAME [%s]\n", odi->osdname);
- a++;
-
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("TOTAL_CAPACITY [0x%llx]\n",
- pFirst ? _LLU(get_unaligned_be64(pFirst)) : ~0ULL);
-
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("USED_CAPACITY [0x%llx]\n",
- pFirst ? _LLU(get_unaligned_be64(pFirst)) : ~0ULL);
-
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("NUMBER_OF_PARTITIONS [%llu]\n",
- pFirst ? _LLU(get_unaligned_be64(pFirst)) : ~0ULL);
-
- if (a >= nelem)
- goto out;
-
- /* FIXME: Where are the time utilities */
- pFirst = get_attrs[a++].val_ptr;
- OSD_INFO("CLOCK [0x%6phN]\n", pFirst);
-
- if (a < nelem) { /* IBM-OSD-SIM bug, Might not have it */
- unsigned len = get_attrs[a].len;
- char sid_dump[32*4 + 2]; /* 2nibbles+space+ASCII */
-
- hex_dump_to_buffer(get_attrs[a].val_ptr, len, 32, 1,
- sid_dump, sizeof(sid_dump), true);
- OSD_INFO("OSD_SYSTEM_ID(%d)\n"
- " [%s]\n", len, sid_dump);
-
- if (unlikely(len > sizeof(odi->systemid))) {
- OSD_ERR("OSD Target error: OSD_SYSTEM_ID too long(%d). "
- "device identification might not work\n", len);
- len = sizeof(odi->systemid);
- }
- odi->systemid_len = len;
- memcpy(odi->systemid, get_attrs[a].val_ptr, len);
- a++;
- }
-out:
- osd_end_request(or);
- return ret;
-}
-
-int osd_auto_detect_ver(struct osd_dev *od,
- void *caps, struct osd_dev_info *odi)
-{
- int ret;
-
- /* Auto-detect the osd version */
- ret = _osd_get_print_system_info(od, caps, odi);
- if (ret) {
- osd_dev_set_ver(od, OSD_VER1);
- OSD_DEBUG("converting to OSD1\n");
- ret = _osd_get_print_system_info(od, caps, odi);
- }
-
- return ret;
-}
-EXPORT_SYMBOL(osd_auto_detect_ver);
-
-static unsigned _osd_req_cdb_len(struct osd_request *or)
-{
- return osd_req_is_ver1(or) ? OSDv1_TOTAL_CDB_LEN : OSD_TOTAL_CDB_LEN;
-}
-
-static unsigned _osd_req_alist_elem_size(struct osd_request *or, unsigned len)
-{
- return osd_req_is_ver1(or) ?
- osdv1_attr_list_elem_size(len) :
- osdv2_attr_list_elem_size(len);
-}
-
-static void _osd_req_alist_elem_encode(struct osd_request *or,
- void *attr_last, const struct osd_attr *oa)
-{
- if (osd_req_is_ver1(or)) {
- struct osdv1_attributes_list_element *attr = attr_last;
-
- attr->attr_page = cpu_to_be32(oa->attr_page);
- attr->attr_id = cpu_to_be32(oa->attr_id);
- attr->attr_bytes = cpu_to_be16(oa->len);
- memcpy(attr->attr_val, oa->val_ptr, oa->len);
- } else {
- struct osdv2_attributes_list_element *attr = attr_last;
-
- attr->attr_page = cpu_to_be32(oa->attr_page);
- attr->attr_id = cpu_to_be32(oa->attr_id);
- attr->attr_bytes = cpu_to_be16(oa->len);
- memcpy(attr->attr_val, oa->val_ptr, oa->len);
- }
-}
-
-static int _osd_req_alist_elem_decode(struct osd_request *or,
- void *cur_p, struct osd_attr *oa, unsigned max_bytes)
-{
- unsigned inc;
- if (osd_req_is_ver1(or)) {
- struct osdv1_attributes_list_element *attr = cur_p;
-
- if (max_bytes < sizeof(*attr))
- return -1;
-
- oa->len = be16_to_cpu(attr->attr_bytes);
- inc = _osd_req_alist_elem_size(or, oa->len);
- if (inc > max_bytes)
- return -1;
-
- oa->attr_page = be32_to_cpu(attr->attr_page);
- oa->attr_id = be32_to_cpu(attr->attr_id);
-
- /* OSD1: On empty attributes we return a pointer to 2 bytes
- * of zeros. This keeps similar behaviour with OSD2.
- * (See below)
- */
- oa->val_ptr = likely(oa->len) ? attr->attr_val :
- (u8 *)&attr->attr_bytes;
- } else {
- struct osdv2_attributes_list_element *attr = cur_p;
-
- if (max_bytes < sizeof(*attr))
- return -1;
-
- oa->len = be16_to_cpu(attr->attr_bytes);
- inc = _osd_req_alist_elem_size(or, oa->len);
- if (inc > max_bytes)
- return -1;
-
- oa->attr_page = be32_to_cpu(attr->attr_page);
- oa->attr_id = be32_to_cpu(attr->attr_id);
-
- /* OSD2: For convenience, on empty attributes, we return 8 bytes
- * of zeros here. This keeps the same behaviour with OSD2r04,
- * and is nice with null terminating ASCII fields.
- * oa->val_ptr == NULL marks the end-of-list, or error.
- */
- oa->val_ptr = likely(oa->len) ? attr->attr_val : attr->reserved;
- }
- return inc;
-}
-
-static unsigned _osd_req_alist_size(struct osd_request *or, void *list_head)
-{
- return osd_req_is_ver1(or) ?
- osdv1_list_size(list_head) :
- osdv2_list_size(list_head);
-}
-
-static unsigned _osd_req_sizeof_alist_header(struct osd_request *or)
-{
- return osd_req_is_ver1(or) ?
- sizeof(struct osdv1_attributes_list_header) :
- sizeof(struct osdv2_attributes_list_header);
-}
-
-static void _osd_req_set_alist_type(struct osd_request *or,
- void *list, int list_type)
-{
- if (osd_req_is_ver1(or)) {
- struct osdv1_attributes_list_header *attr_list = list;
-
- memset(attr_list, 0, sizeof(*attr_list));
- attr_list->type = list_type;
- } else {
- struct osdv2_attributes_list_header *attr_list = list;
-
- memset(attr_list, 0, sizeof(*attr_list));
- attr_list->type = list_type;
- }
-}
-
-static bool _osd_req_is_alist_type(struct osd_request *or,
- void *list, int list_type)
-{
- if (!list)
- return false;
-
- if (osd_req_is_ver1(or)) {
- struct osdv1_attributes_list_header *attr_list = list;
-
- return attr_list->type == list_type;
- } else {
- struct osdv2_attributes_list_header *attr_list = list;
-
- return attr_list->type == list_type;
- }
-}
-
-/* This is for List-objects not Attributes-Lists */
-static void _osd_req_encode_olist(struct osd_request *or,
- struct osd_obj_id_list *list)
-{
- struct osd_cdb_head *cdbh = osd_cdb_head(&or->cdb);
-
- if (osd_req_is_ver1(or)) {
- cdbh->v1.list_identifier = list->list_identifier;
- cdbh->v1.start_address = list->continuation_id;
- } else {
- cdbh->v2.list_identifier = list->list_identifier;
- cdbh->v2.start_address = list->continuation_id;
- }
-}
-
-static osd_cdb_offset osd_req_encode_offset(struct osd_request *or,
- u64 offset, unsigned *padding)
-{
- return __osd_encode_offset(offset, padding,
- osd_req_is_ver1(or) ?
- OSDv1_OFFSET_MIN_SHIFT : OSD_OFFSET_MIN_SHIFT,
- OSD_OFFSET_MAX_SHIFT);
-}
-
-static struct osd_security_parameters *
-_osd_req_sec_params(struct osd_request *or)
-{
- struct osd_cdb *ocdb = &or->cdb;
-
- if (osd_req_is_ver1(or))
- return (struct osd_security_parameters *)&ocdb->v1.sec_params;
- else
- return (struct osd_security_parameters *)&ocdb->v2.sec_params;
-}
-
-void osd_dev_init(struct osd_dev *osdd, struct scsi_device *scsi_device)
-{
- memset(osdd, 0, sizeof(*osdd));
- osdd->scsi_device = scsi_device;
- osdd->def_timeout = BLK_DEFAULT_SG_TIMEOUT;
-#ifdef OSD_VER1_SUPPORT
- osdd->version = OSD_VER2;
-#endif
- /* TODO: Allocate pools for osd_request attributes ... */
-}
-EXPORT_SYMBOL(osd_dev_init);
-
-void osd_dev_fini(struct osd_dev *osdd)
-{
- /* TODO: De-allocate pools */
-
- osdd->scsi_device = NULL;
-}
-EXPORT_SYMBOL(osd_dev_fini);
-
-static struct osd_request *_osd_request_alloc(gfp_t gfp)
-{
- struct osd_request *or;
-
- /* TODO: Use mempool with one saved request */
- or = kzalloc(sizeof(*or), gfp);
- return or;
-}
-
-static void _osd_request_free(struct osd_request *or)
-{
- kfree(or);
-}
-
-struct osd_request *osd_start_request(struct osd_dev *dev, gfp_t gfp)
-{
- struct osd_request *or;
-
- or = _osd_request_alloc(gfp);
- if (!or)
- return NULL;
-
- or->osd_dev = dev;
- or->alloc_flags = gfp;
- or->timeout = dev->def_timeout;
- or->retries = OSD_REQ_RETRIES;
-
- return or;
-}
-EXPORT_SYMBOL(osd_start_request);
-
-static void _osd_free_seg(struct osd_request *or __unused,
- struct _osd_req_data_segment *seg)
-{
- if (!seg->buff || !seg->alloc_size)
- return;
-
- kfree(seg->buff);
- seg->buff = NULL;
- seg->alloc_size = 0;
-}
-
-static void _put_request(struct request *rq)
-{
- /*
- * If osd_finalize_request() was called but the request was not
- * executed through the block layer, then we must release BIOs.
- * TODO: Keep error code in or->async_error. Need to audit all
- * code paths.
- */
- if (unlikely(rq->bio))
- blk_end_request(rq, -ENOMEM, blk_rq_bytes(rq));
- else
- blk_put_request(rq);
-}
-
-void osd_end_request(struct osd_request *or)
-{
- struct request *rq = or->request;
-
- if (rq) {
- if (rq->next_rq) {
- _put_request(rq->next_rq);
- rq->next_rq = NULL;
- }
-
- _put_request(rq);
- }
-
- _osd_free_seg(or, &or->get_attr);
- _osd_free_seg(or, &or->enc_get_attr);
- _osd_free_seg(or, &or->set_attr);
- _osd_free_seg(or, &or->cdb_cont);
-
- _osd_request_free(or);
-}
-EXPORT_SYMBOL(osd_end_request);
-
-static void _set_error_resid(struct osd_request *or, struct request *req,
- int error)
-{
- or->async_error = error;
- or->req_errors = req->errors ? : error;
- or->sense_len = scsi_req(req)->sense_len;
- if (or->sense_len)
- memcpy(or->sense, scsi_req(req)->sense, or->sense_len);
- if (or->out.req)
- or->out.residual = scsi_req(or->out.req)->resid_len;
- if (or->in.req)
- or->in.residual = scsi_req(or->in.req)->resid_len;
-}
-
-int osd_execute_request(struct osd_request *or)
-{
- int error = blk_execute_rq(or->request->q, NULL, or->request, 0);
-
- _set_error_resid(or, or->request, error);
- return error;
-}
-EXPORT_SYMBOL(osd_execute_request);
-
-static void osd_request_async_done(struct request *req, int error)
-{
- struct osd_request *or = req->end_io_data;
-
- _set_error_resid(or, req, error);
- if (req->next_rq) {
- __blk_put_request(req->q, req->next_rq);
- req->next_rq = NULL;
- }
-
- __blk_put_request(req->q, req);
- or->request = NULL;
- or->in.req = NULL;
- or->out.req = NULL;
-
- if (or->async_done)
- or->async_done(or, or->async_private);
- else
- osd_end_request(or);
-}
-
-int osd_execute_request_async(struct osd_request *or,
- osd_req_done_fn *done, void *private)
-{
- or->request->end_io_data = or;
- or->async_private = private;
- or->async_done = done;
-
- blk_execute_rq_nowait(or->request->q, NULL, or->request, 0,
- osd_request_async_done);
- return 0;
-}
-EXPORT_SYMBOL(osd_execute_request_async);
-
-u8 sg_out_pad_buffer[1 << OSDv1_OFFSET_MIN_SHIFT];
-u8 sg_in_pad_buffer[1 << OSDv1_OFFSET_MIN_SHIFT];
-
-static int _osd_realloc_seg(struct osd_request *or,
- struct _osd_req_data_segment *seg, unsigned max_bytes)
-{
- void *buff;
-
- if (seg->alloc_size >= max_bytes)
- return 0;
-
- buff = krealloc(seg->buff, max_bytes, or->alloc_flags);
- if (!buff) {
- OSD_ERR("Failed to Realloc %d-bytes was-%d\n", max_bytes,
- seg->alloc_size);
- return -ENOMEM;
- }
-
- memset(buff + seg->alloc_size, 0, max_bytes - seg->alloc_size);
- seg->buff = buff;
- seg->alloc_size = max_bytes;
- return 0;
-}
-
-static int _alloc_cdb_cont(struct osd_request *or, unsigned total_bytes)
-{
- OSD_DEBUG("total_bytes=%d\n", total_bytes);
- return _osd_realloc_seg(or, &or->cdb_cont, total_bytes);
-}
-
-static int _alloc_set_attr_list(struct osd_request *or,
- const struct osd_attr *oa, unsigned nelem, unsigned add_bytes)
-{
- unsigned total_bytes = add_bytes;
-
- for (; nelem; --nelem, ++oa)
- total_bytes += _osd_req_alist_elem_size(or, oa->len);
-
- OSD_DEBUG("total_bytes=%d\n", total_bytes);
- return _osd_realloc_seg(or, &or->set_attr, total_bytes);
-}
-
-static int _alloc_get_attr_desc(struct osd_request *or, unsigned max_bytes)
-{
- OSD_DEBUG("total_bytes=%d\n", max_bytes);
- return _osd_realloc_seg(or, &or->enc_get_attr, max_bytes);
-}
-
-static int _alloc_get_attr_list(struct osd_request *or)
-{
- OSD_DEBUG("total_bytes=%d\n", or->get_attr.total_bytes);
- return _osd_realloc_seg(or, &or->get_attr, or->get_attr.total_bytes);
-}
-
-/*
- * Common to all OSD commands
- */
-
-static void _osdv1_req_encode_common(struct osd_request *or,
- __be16 act, const struct osd_obj_id *obj, u64 offset, u64 len)
-{
- struct osdv1_cdb *ocdb = &or->cdb.v1;
-
- /*
- * For speed, the commands
- * OSD_ACT_PERFORM_SCSI_COMMAND , V1 0x8F7E, V2 0x8F7C
- * OSD_ACT_SCSI_TASK_MANAGEMENT , V1 0x8F7F, V2 0x8F7D
- * are not supported here. Should pass zero and set after the call
- */
- act &= cpu_to_be16(~0x0080); /* V1 action code */
-
- OSD_DEBUG("OSDv1 execute opcode 0x%x\n", be16_to_cpu(act));
-
- ocdb->h.varlen_cdb.opcode = VARIABLE_LENGTH_CMD;
- ocdb->h.varlen_cdb.additional_cdb_length = OSD_ADDITIONAL_CDB_LENGTH;
- ocdb->h.varlen_cdb.service_action = act;
-
- ocdb->h.partition = cpu_to_be64(obj->partition);
- ocdb->h.object = cpu_to_be64(obj->id);
- ocdb->h.v1.length = cpu_to_be64(len);
- ocdb->h.v1.start_address = cpu_to_be64(offset);
-}
-
-static void _osdv2_req_encode_common(struct osd_request *or,
- __be16 act, const struct osd_obj_id *obj, u64 offset, u64 len)
-{
- struct osdv2_cdb *ocdb = &or->cdb.v2;
-
- OSD_DEBUG("OSDv2 execute opcode 0x%x\n", be16_to_cpu(act));
-
- ocdb->h.varlen_cdb.opcode = VARIABLE_LENGTH_CMD;
- ocdb->h.varlen_cdb.additional_cdb_length = OSD_ADDITIONAL_CDB_LENGTH;
- ocdb->h.varlen_cdb.service_action = act;
-
- ocdb->h.partition = cpu_to_be64(obj->partition);
- ocdb->h.object = cpu_to_be64(obj->id);
- ocdb->h.v2.length = cpu_to_be64(len);
- ocdb->h.v2.start_address = cpu_to_be64(offset);
-}
-
-static void _osd_req_encode_common(struct osd_request *or,
- __be16 act, const struct osd_obj_id *obj, u64 offset, u64 len)
-{
- if (osd_req_is_ver1(or))
- _osdv1_req_encode_common(or, act, obj, offset, len);
- else
- _osdv2_req_encode_common(or, act, obj, offset, len);
-}
-
-/*
- * Device commands
- */
-/*TODO: void osd_req_set_master_seed_xchg(struct osd_request *, ...); */
-/*TODO: void osd_req_set_master_key(struct osd_request *, ...); */
-
-void osd_req_format(struct osd_request *or, u64 tot_capacity)
-{
- _osd_req_encode_common(or, OSD_ACT_FORMAT_OSD, &osd_root_object, 0,
- tot_capacity);
-}
-EXPORT_SYMBOL(osd_req_format);
-
-int osd_req_list_dev_partitions(struct osd_request *or,
- osd_id initial_id, struct osd_obj_id_list *list, unsigned nelem)
-{
- return osd_req_list_partition_objects(or, 0, initial_id, list, nelem);
-}
-EXPORT_SYMBOL(osd_req_list_dev_partitions);
-
-static void _osd_req_encode_flush(struct osd_request *or,
- enum osd_options_flush_scope_values op)
-{
- struct osd_cdb_head *ocdb = osd_cdb_head(&or->cdb);
-
- ocdb->command_specific_options = op;
-}
-
-void osd_req_flush_obsd(struct osd_request *or,
- enum osd_options_flush_scope_values op)
-{
- _osd_req_encode_common(or, OSD_ACT_FLUSH_OSD, &osd_root_object, 0, 0);
- _osd_req_encode_flush(or, op);
-}
-EXPORT_SYMBOL(osd_req_flush_obsd);
-
-/*TODO: void osd_req_perform_scsi_command(struct osd_request *,
- const u8 *cdb, ...); */
-/*TODO: void osd_req_task_management(struct osd_request *, ...); */
-
-/*
- * Partition commands
- */
-static void _osd_req_encode_partition(struct osd_request *or,
- __be16 act, osd_id partition)
-{
- struct osd_obj_id par = {
- .partition = partition,
- .id = 0,
- };
-
- _osd_req_encode_common(or, act, &par, 0, 0);
-}
-
-void osd_req_create_partition(struct osd_request *or, osd_id partition)
-{
- _osd_req_encode_partition(or, OSD_ACT_CREATE_PARTITION, partition);
-}
-EXPORT_SYMBOL(osd_req_create_partition);
-
-void osd_req_remove_partition(struct osd_request *or, osd_id partition)
-{
- _osd_req_encode_partition(or, OSD_ACT_REMOVE_PARTITION, partition);
-}
-EXPORT_SYMBOL(osd_req_remove_partition);
-
-/*TODO: void osd_req_set_partition_key(struct osd_request *,
- osd_id partition, u8 new_key_id[OSD_CRYPTO_KEYID_SIZE],
- u8 seed[OSD_CRYPTO_SEED_SIZE]); */
-
-static int _osd_req_list_objects(struct osd_request *or,
- __be16 action, const struct osd_obj_id *obj, osd_id initial_id,
- struct osd_obj_id_list *list, unsigned nelem)
-{
- struct request_queue *q = osd_request_queue(or->osd_dev);
- u64 len = nelem * sizeof(osd_id) + sizeof(*list);
- struct bio *bio;
-
- _osd_req_encode_common(or, action, obj, (u64)initial_id, len);
-
- if (list->list_identifier)
- _osd_req_encode_olist(or, list);
-
- WARN_ON(or->in.bio);
- bio = bio_map_kern(q, list, len, or->alloc_flags);
- if (IS_ERR(bio)) {
- OSD_ERR("!!! Failed to allocate list_objects BIO\n");
- return PTR_ERR(bio);
- }
-
- bio_set_op_attrs(bio, REQ_OP_READ, 0);
- or->in.bio = bio;
- or->in.total_bytes = bio->bi_iter.bi_size;
- return 0;
-}
-
-int osd_req_list_partition_collections(struct osd_request *or,
- osd_id partition, osd_id initial_id, struct osd_obj_id_list *list,
- unsigned nelem)
-{
- struct osd_obj_id par = {
- .partition = partition,
- .id = 0,
- };
-
- return osd_req_list_collection_objects(or, &par, initial_id, list,
- nelem);
-}
-EXPORT_SYMBOL(osd_req_list_partition_collections);
-
-int osd_req_list_partition_objects(struct osd_request *or,
- osd_id partition, osd_id initial_id, struct osd_obj_id_list *list,
- unsigned nelem)
-{
- struct osd_obj_id par = {
- .partition = partition,
- .id = 0,
- };
-
- return _osd_req_list_objects(or, OSD_ACT_LIST, &par, initial_id, list,
- nelem);
-}
-EXPORT_SYMBOL(osd_req_list_partition_objects);
-
-void osd_req_flush_partition(struct osd_request *or,
- osd_id partition, enum osd_options_flush_scope_values op)
-{
- _osd_req_encode_partition(or, OSD_ACT_FLUSH_PARTITION, partition);
- _osd_req_encode_flush(or, op);
-}
-EXPORT_SYMBOL(osd_req_flush_partition);
-
-/*
- * Collection commands
- */
-/*TODO: void osd_req_create_collection(struct osd_request *,
- const struct osd_obj_id *); */
-/*TODO: void osd_req_remove_collection(struct osd_request *,
- const struct osd_obj_id *); */
-
-int osd_req_list_collection_objects(struct osd_request *or,
- const struct osd_obj_id *obj, osd_id initial_id,
- struct osd_obj_id_list *list, unsigned nelem)
-{
- return _osd_req_list_objects(or, OSD_ACT_LIST_COLLECTION, obj,
- initial_id, list, nelem);
-}
-EXPORT_SYMBOL(osd_req_list_collection_objects);
-
-/*TODO: void query(struct osd_request *, ...); V2 */
-
-void osd_req_flush_collection(struct osd_request *or,
- const struct osd_obj_id *obj, enum osd_options_flush_scope_values op)
-{
- _osd_req_encode_common(or, OSD_ACT_FLUSH_PARTITION, obj, 0, 0);
- _osd_req_encode_flush(or, op);
-}
-EXPORT_SYMBOL(osd_req_flush_collection);
-
-/*TODO: void get_member_attrs(struct osd_request *, ...); V2 */
-/*TODO: void set_member_attrs(struct osd_request *, ...); V2 */
-
-/*
- * Object commands
- */
-void osd_req_create_object(struct osd_request *or, struct osd_obj_id *obj)
-{
- _osd_req_encode_common(or, OSD_ACT_CREATE, obj, 0, 0);
-}
-EXPORT_SYMBOL(osd_req_create_object);
-
-void osd_req_remove_object(struct osd_request *or, struct osd_obj_id *obj)
-{
- _osd_req_encode_common(or, OSD_ACT_REMOVE, obj, 0, 0);
-}
-EXPORT_SYMBOL(osd_req_remove_object);
-
-
-/*TODO: void osd_req_create_multi(struct osd_request *or,
- struct osd_obj_id *first, struct osd_obj_id_list *list, unsigned nelem);
-*/
-
-void osd_req_write(struct osd_request *or,
- const struct osd_obj_id *obj, u64 offset,
- struct bio *bio, u64 len)
-{
- _osd_req_encode_common(or, OSD_ACT_WRITE, obj, offset, len);
- WARN_ON(or->out.bio || or->out.total_bytes);
- WARN_ON(!op_is_write(bio_op(bio)));
- or->out.bio = bio;
- or->out.total_bytes = len;
-}
-EXPORT_SYMBOL(osd_req_write);
-
-int osd_req_write_kern(struct osd_request *or,
- const struct osd_obj_id *obj, u64 offset, void* buff, u64 len)
-{
- struct request_queue *req_q = osd_request_queue(or->osd_dev);
- struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
-
- if (IS_ERR(bio))
- return PTR_ERR(bio);
-
- bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
- osd_req_write(or, obj, offset, bio, len);
- return 0;
-}
-EXPORT_SYMBOL(osd_req_write_kern);
-
-/*TODO: void osd_req_append(struct osd_request *,
- const struct osd_obj_id *, struct bio *data_out); */
-/*TODO: void osd_req_create_write(struct osd_request *,
- const struct osd_obj_id *, struct bio *data_out, u64 offset); */
-/*TODO: void osd_req_clear(struct osd_request *,
- const struct osd_obj_id *, u64 offset, u64 len); */
-/*TODO: void osd_req_punch(struct osd_request *,
- const struct osd_obj_id *, u64 offset, u64 len); V2 */
-
-void osd_req_flush_object(struct osd_request *or,
- const struct osd_obj_id *obj, enum osd_options_flush_scope_values op,
- /*V2*/ u64 offset, /*V2*/ u64 len)
-{
- if (unlikely(osd_req_is_ver1(or) && (offset || len))) {
- OSD_DEBUG("OSD Ver1 flush on specific range ignored\n");
- offset = 0;
- len = 0;
- }
-
- _osd_req_encode_common(or, OSD_ACT_FLUSH, obj, offset, len);
- _osd_req_encode_flush(or, op);
-}
-EXPORT_SYMBOL(osd_req_flush_object);
-
-void osd_req_read(struct osd_request *or,
- const struct osd_obj_id *obj, u64 offset,
- struct bio *bio, u64 len)
-{
- _osd_req_encode_common(or, OSD_ACT_READ, obj, offset, len);
- WARN_ON(or->in.bio || or->in.total_bytes);
- WARN_ON(op_is_write(bio_op(bio)));
- or->in.bio = bio;
- or->in.total_bytes = len;
-}
-EXPORT_SYMBOL(osd_req_read);
-
-int osd_req_read_kern(struct osd_request *or,
- const struct osd_obj_id *obj, u64 offset, void* buff, u64 len)
-{
- struct request_queue *req_q = osd_request_queue(or->osd_dev);
- struct bio *bio = bio_map_kern(req_q, buff, len, GFP_KERNEL);
-
- if (IS_ERR(bio))
- return PTR_ERR(bio);
-
- osd_req_read(or, obj, offset, bio, len);
- return 0;
-}
-EXPORT_SYMBOL(osd_req_read_kern);
-
-static int _add_sg_continuation_descriptor(struct osd_request *or,
- const struct osd_sg_entry *sglist, unsigned numentries, u64 *len)
-{
- struct osd_sg_continuation_descriptor *oscd;
- u32 oscd_size;
- unsigned i;
- int ret;
-
- oscd_size = sizeof(*oscd) + numentries * sizeof(oscd->entries[0]);
-
- if (!or->cdb_cont.total_bytes) {
- /* First time, jump over the header, we will write to:
- * cdb_cont.buff + cdb_cont.total_bytes
- */
- or->cdb_cont.total_bytes =
- sizeof(struct osd_continuation_segment_header);
- }
-
- ret = _alloc_cdb_cont(or, or->cdb_cont.total_bytes + oscd_size);
- if (unlikely(ret))
- return ret;
-
- oscd = or->cdb_cont.buff + or->cdb_cont.total_bytes;
- oscd->hdr.type = cpu_to_be16(SCATTER_GATHER_LIST);
- oscd->hdr.pad_length = 0;
- oscd->hdr.length = cpu_to_be32(oscd_size - sizeof(*oscd));
-
- *len = 0;
- /* copy the sg entries and convert to network byte order */
- for (i = 0; i < numentries; i++) {
- oscd->entries[i].offset = cpu_to_be64(sglist[i].offset);
- oscd->entries[i].len = cpu_to_be64(sglist[i].len);
- *len += sglist[i].len;
- }
-
- or->cdb_cont.total_bytes += oscd_size;
- OSD_DEBUG("total_bytes=%d oscd_size=%d numentries=%d\n",
- or->cdb_cont.total_bytes, oscd_size, numentries);
- return 0;
-}
-
-static int _osd_req_finalize_cdb_cont(struct osd_request *or, const u8 *cap_key)
-{
- struct request_queue *req_q = osd_request_queue(or->osd_dev);
- struct bio *bio;
- struct osd_cdb_head *cdbh = osd_cdb_head(&or->cdb);
- struct osd_continuation_segment_header *cont_seg_hdr;
-
- if (!or->cdb_cont.total_bytes)
- return 0;
-
- cont_seg_hdr = or->cdb_cont.buff;
- cont_seg_hdr->format = CDB_CONTINUATION_FORMAT_V2;
- cont_seg_hdr->service_action = cdbh->varlen_cdb.service_action;
-
- /* create a bio for continuation segment */
- bio = bio_map_kern(req_q, or->cdb_cont.buff, or->cdb_cont.total_bytes,
- GFP_KERNEL);
- if (IS_ERR(bio))
- return PTR_ERR(bio);
-
- bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
-
- /* integrity check the continuation before the bio is linked
- * with the other data segments since the continuation
- * integrity is separate from the other data segments.
- */
- osd_sec_sign_data(cont_seg_hdr->integrity_check, bio, cap_key);
-
- cdbh->v2.cdb_continuation_length = cpu_to_be32(or->cdb_cont.total_bytes);
-
- /* we can't use _req_append_segment, because we need to link in the
- * continuation bio to the head of the bio list - the
- * continuation segment (if it exists) is always the first segment in
- * the out data buffer.
- */
- bio->bi_next = or->out.bio;
- or->out.bio = bio;
- or->out.total_bytes += or->cdb_cont.total_bytes;
-
- return 0;
-}
-
-/* osd_req_write_sg: Takes a @bio that points to the data out buffer and an
- * @sglist that has the scatter gather entries. Scatter-gather enables a write
- * of multiple none-contiguous areas of an object, in a single call. The extents
- * may overlap and/or be in any order. The only constrain is that:
- * total_bytes(sglist) >= total_bytes(bio)
- */
-int osd_req_write_sg(struct osd_request *or,
- const struct osd_obj_id *obj, struct bio *bio,
- const struct osd_sg_entry *sglist, unsigned numentries)
-{
- u64 len;
- int ret = _add_sg_continuation_descriptor(or, sglist, numentries, &len);
-
- if (ret)
- return ret;
- osd_req_write(or, obj, 0, bio, len);
-
- return 0;
-}
-EXPORT_SYMBOL(osd_req_write_sg);
-
-/* osd_req_read_sg: Read multiple extents of an object into @bio
- * See osd_req_write_sg
- */
-int osd_req_read_sg(struct osd_request *or,
- const struct osd_obj_id *obj, struct bio *bio,
- const struct osd_sg_entry *sglist, unsigned numentries)
-{
- u64 len;
- u64 off;
- int ret;
-
- if (numentries > 1) {
- off = 0;
- ret = _add_sg_continuation_descriptor(or, sglist, numentries,
- &len);
- if (ret)
- return ret;
- } else {
- /* Optimize the case of single segment, read_sg is a
- * bidi operation.
- */
- len = sglist->len;
- off = sglist->offset;
- }
- osd_req_read(or, obj, off, bio, len);
-
- return 0;
-}
-EXPORT_SYMBOL(osd_req_read_sg);
-
-/* SG-list write/read Kern API
- *
- * osd_req_{write,read}_sg_kern takes an array of @buff pointers and an array
- * of sg_entries. @numentries indicates how many pointers and sg_entries there
- * are. By requiring an array of buff pointers. This allows a caller to do a
- * single write/read and scatter into multiple buffers.
- * NOTE: Each buffer + len should not cross a page boundary.
- */
-static struct bio *_create_sg_bios(struct osd_request *or,
- void **buff, const struct osd_sg_entry *sglist, unsigned numentries)
-{
- struct request_queue *q = osd_request_queue(or->osd_dev);
- struct bio *bio;
- unsigned i;
-
- bio = bio_kmalloc(GFP_KERNEL, numentries);
- if (unlikely(!bio)) {
- OSD_DEBUG("Failed to allocate BIO size=%u\n", numentries);
- return ERR_PTR(-ENOMEM);
- }
-
- for (i = 0; i < numentries; i++) {
- unsigned offset = offset_in_page(buff[i]);
- struct page *page = virt_to_page(buff[i]);
- unsigned len = sglist[i].len;
- unsigned added_len;
-
- BUG_ON(offset + len > PAGE_SIZE);
- added_len = bio_add_pc_page(q, bio, page, len, offset);
- if (unlikely(len != added_len)) {
- OSD_DEBUG("bio_add_pc_page len(%d) != added_len(%d)\n",
- len, added_len);
- bio_put(bio);
- return ERR_PTR(-ENOMEM);
- }
- }
-
- return bio;
-}
-
-int osd_req_write_sg_kern(struct osd_request *or,
- const struct osd_obj_id *obj, void **buff,
- const struct osd_sg_entry *sglist, unsigned numentries)
-{
- struct bio *bio = _create_sg_bios(or, buff, sglist, numentries);
- if (IS_ERR(bio))
- return PTR_ERR(bio);
-
- bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
- osd_req_write_sg(or, obj, bio, sglist, numentries);
-
- return 0;
-}
-EXPORT_SYMBOL(osd_req_write_sg_kern);
-
-int osd_req_read_sg_kern(struct osd_request *or,
- const struct osd_obj_id *obj, void **buff,
- const struct osd_sg_entry *sglist, unsigned numentries)
-{
- struct bio *bio = _create_sg_bios(or, buff, sglist, numentries);
- if (IS_ERR(bio))
- return PTR_ERR(bio);
-
- osd_req_read_sg(or, obj, bio, sglist, numentries);
-
- return 0;
-}
-EXPORT_SYMBOL(osd_req_read_sg_kern);
-
-
-
-void osd_req_get_attributes(struct osd_request *or,
- const struct osd_obj_id *obj)
-{
- _osd_req_encode_common(or, OSD_ACT_GET_ATTRIBUTES, obj, 0, 0);
-}
-EXPORT_SYMBOL(osd_req_get_attributes);
-
-void osd_req_set_attributes(struct osd_request *or,
- const struct osd_obj_id *obj)
-{
- _osd_req_encode_common(or, OSD_ACT_SET_ATTRIBUTES, obj, 0, 0);
-}
-EXPORT_SYMBOL(osd_req_set_attributes);
-
-/*
- * Attributes List-mode
- */
-
-int osd_req_add_set_attr_list(struct osd_request *or,
- const struct osd_attr *oa, unsigned nelem)
-{
- unsigned total_bytes = or->set_attr.total_bytes;
- void *attr_last;
- int ret;
-
- if (or->attributes_mode &&
- or->attributes_mode != OSD_CDB_GET_SET_ATTR_LISTS) {
- WARN_ON(1);
- return -EINVAL;
- }
- or->attributes_mode = OSD_CDB_GET_SET_ATTR_LISTS;
-
- if (!total_bytes) { /* first-time: allocate and put list header */
- total_bytes = _osd_req_sizeof_alist_header(or);
- ret = _alloc_set_attr_list(or, oa, nelem, total_bytes);
- if (ret)
- return ret;
- _osd_req_set_alist_type(or, or->set_attr.buff,
- OSD_ATTR_LIST_SET_RETRIEVE);
- }
- attr_last = or->set_attr.buff + total_bytes;
-
- for (; nelem; --nelem) {
- unsigned elem_size = _osd_req_alist_elem_size(or, oa->len);
-
- total_bytes += elem_size;
- if (unlikely(or->set_attr.alloc_size < total_bytes)) {
- or->set_attr.total_bytes = total_bytes - elem_size;
- ret = _alloc_set_attr_list(or, oa, nelem, total_bytes);
- if (ret)
- return ret;
- attr_last =
- or->set_attr.buff + or->set_attr.total_bytes;
- }
-
- _osd_req_alist_elem_encode(or, attr_last, oa);
-
- attr_last += elem_size;
- ++oa;
- }
-
- or->set_attr.total_bytes = total_bytes;
- return 0;
-}
-EXPORT_SYMBOL(osd_req_add_set_attr_list);
-
-static int _req_append_segment(struct osd_request *or,
- unsigned padding, struct _osd_req_data_segment *seg,
- struct _osd_req_data_segment *last_seg, struct _osd_io_info *io)
-{
- void *pad_buff;
- int ret;
-
- if (padding) {
- /* check if we can just add it to last buffer */
- if (last_seg &&
- (padding <= last_seg->alloc_size - last_seg->total_bytes))
- pad_buff = last_seg->buff + last_seg->total_bytes;
- else
- pad_buff = io->pad_buff;
-
- ret = blk_rq_map_kern(io->req->q, io->req, pad_buff, padding,
- or->alloc_flags);
- if (ret)
- return ret;
- io->total_bytes += padding;
- }
-
- ret = blk_rq_map_kern(io->req->q, io->req, seg->buff, seg->total_bytes,
- or->alloc_flags);
- if (ret)
- return ret;
-
- io->total_bytes += seg->total_bytes;
- OSD_DEBUG("padding=%d buff=%p total_bytes=%d\n", padding, seg->buff,
- seg->total_bytes);
- return 0;
-}
-
-static int _osd_req_finalize_set_attr_list(struct osd_request *or)
-{
- struct osd_cdb_head *cdbh = osd_cdb_head(&or->cdb);
- unsigned padding;
- int ret;
-
- if (!or->set_attr.total_bytes) {
- cdbh->attrs_list.set_attr_offset = OSD_OFFSET_UNUSED;
- return 0;
- }
-
- cdbh->attrs_list.set_attr_bytes = cpu_to_be32(or->set_attr.total_bytes);
- cdbh->attrs_list.set_attr_offset =
- osd_req_encode_offset(or, or->out.total_bytes, &padding);
-
- ret = _req_append_segment(or, padding, &or->set_attr,
- or->out.last_seg, &or->out);
- if (ret)
- return ret;
-
- or->out.last_seg = &or->set_attr;
- return 0;
-}
-
-int osd_req_add_get_attr_list(struct osd_request *or,
- const struct osd_attr *oa, unsigned nelem)
-{
- unsigned total_bytes = or->enc_get_attr.total_bytes;
- void *attr_last;
- int ret;
-
- if (or->attributes_mode &&
- or->attributes_mode != OSD_CDB_GET_SET_ATTR_LISTS) {
- WARN_ON(1);
- return -EINVAL;
- }
- or->attributes_mode = OSD_CDB_GET_SET_ATTR_LISTS;
-
- /* first time calc data-in list header size */
- if (!or->get_attr.total_bytes)
- or->get_attr.total_bytes = _osd_req_sizeof_alist_header(or);
-
- /* calc data-out info */
- if (!total_bytes) { /* first-time: allocate and put list header */
- unsigned max_bytes;
-
- total_bytes = _osd_req_sizeof_alist_header(or);
- max_bytes = total_bytes +
- nelem * sizeof(struct osd_attributes_list_attrid);
- ret = _alloc_get_attr_desc(or, max_bytes);
- if (ret)
- return ret;
-
- _osd_req_set_alist_type(or, or->enc_get_attr.buff,
- OSD_ATTR_LIST_GET);
- }
- attr_last = or->enc_get_attr.buff + total_bytes;
-
- for (; nelem; --nelem) {
- struct osd_attributes_list_attrid *attrid;
- const unsigned cur_size = sizeof(*attrid);
-
- total_bytes += cur_size;
- if (unlikely(or->enc_get_attr.alloc_size < total_bytes)) {
- or->enc_get_attr.total_bytes = total_bytes - cur_size;
- ret = _alloc_get_attr_desc(or,
- total_bytes + nelem * sizeof(*attrid));
- if (ret)
- return ret;
- attr_last = or->enc_get_attr.buff +
- or->enc_get_attr.total_bytes;
- }
-
- attrid = attr_last;
- attrid->attr_page = cpu_to_be32(oa->attr_page);
- attrid->attr_id = cpu_to_be32(oa->attr_id);
-
- attr_last += cur_size;
-
- /* calc data-in size */
- or->get_attr.total_bytes +=
- _osd_req_alist_elem_size(or, oa->len);
- ++oa;
- }
-
- or->enc_get_attr.total_bytes = total_bytes;
-
- OSD_DEBUG(
- "get_attr.total_bytes=%u(%u) enc_get_attr.total_bytes=%u(%zu)\n",
- or->get_attr.total_bytes,
- or->get_attr.total_bytes - _osd_req_sizeof_alist_header(or),
- or->enc_get_attr.total_bytes,
- (or->enc_get_attr.total_bytes - _osd_req_sizeof_alist_header(or))
- / sizeof(struct osd_attributes_list_attrid));
-
- return 0;
-}
-EXPORT_SYMBOL(osd_req_add_get_attr_list);
-
-static int _osd_req_finalize_get_attr_list(struct osd_request *or)
-{
- struct osd_cdb_head *cdbh = osd_cdb_head(&or->cdb);
- unsigned out_padding;
- unsigned in_padding;
- int ret;
-
- if (!or->enc_get_attr.total_bytes) {
- cdbh->attrs_list.get_attr_desc_offset = OSD_OFFSET_UNUSED;
- cdbh->attrs_list.get_attr_offset = OSD_OFFSET_UNUSED;
- return 0;
- }
-
- ret = _alloc_get_attr_list(or);
- if (ret)
- return ret;
-
- /* The out-going buffer info update */
- OSD_DEBUG("out-going\n");
- cdbh->attrs_list.get_attr_desc_bytes =
- cpu_to_be32(or->enc_get_attr.total_bytes);
-
- cdbh->attrs_list.get_attr_desc_offset =
- osd_req_encode_offset(or, or->out.total_bytes, &out_padding);
-
- ret = _req_append_segment(or, out_padding, &or->enc_get_attr,
- or->out.last_seg, &or->out);
- if (ret)
- return ret;
- or->out.last_seg = &or->enc_get_attr;
-
- /* The incoming buffer info update */
- OSD_DEBUG("in-coming\n");
- cdbh->attrs_list.get_attr_alloc_length =
- cpu_to_be32(or->get_attr.total_bytes);
-
- cdbh->attrs_list.get_attr_offset =
- osd_req_encode_offset(or, or->in.total_bytes, &in_padding);
-
- ret = _req_append_segment(or, in_padding, &or->get_attr, NULL,
- &or->in);
- if (ret)
- return ret;
- or->in.last_seg = &or->get_attr;
-
- return 0;
-}
-
-int osd_req_decode_get_attr_list(struct osd_request *or,
- struct osd_attr *oa, int *nelem, void **iterator)
-{
- unsigned cur_bytes, returned_bytes;
- int n;
- const unsigned sizeof_attr_list = _osd_req_sizeof_alist_header(or);
- void *cur_p;
-
- if (!_osd_req_is_alist_type(or, or->get_attr.buff,
- OSD_ATTR_LIST_SET_RETRIEVE)) {
- oa->attr_page = 0;
- oa->attr_id = 0;
- oa->val_ptr = NULL;
- oa->len = 0;
- *iterator = NULL;
- return 0;
- }
-
- if (*iterator) {
- BUG_ON((*iterator < or->get_attr.buff) ||
- (or->get_attr.buff + or->get_attr.alloc_size < *iterator));
- cur_p = *iterator;
- cur_bytes = (*iterator - or->get_attr.buff) - sizeof_attr_list;
- returned_bytes = or->get_attr.total_bytes;
- } else { /* first time decode the list header */
- cur_bytes = sizeof_attr_list;
- returned_bytes = _osd_req_alist_size(or, or->get_attr.buff) +
- sizeof_attr_list;
-
- cur_p = or->get_attr.buff + sizeof_attr_list;
-
- if (returned_bytes > or->get_attr.alloc_size) {
- OSD_DEBUG("target report: space was not big enough! "
- "Allocate=%u Needed=%u\n",
- or->get_attr.alloc_size,
- returned_bytes + sizeof_attr_list);
-
- returned_bytes =
- or->get_attr.alloc_size - sizeof_attr_list;
- }
- or->get_attr.total_bytes = returned_bytes;
- }
-
- for (n = 0; (n < *nelem) && (cur_bytes < returned_bytes); ++n) {
- int inc = _osd_req_alist_elem_decode(or, cur_p, oa,
- returned_bytes - cur_bytes);
-
- if (inc < 0) {
- OSD_ERR("BAD FOOD from target. list not valid!"
- "c=%d r=%d n=%d\n",
- cur_bytes, returned_bytes, n);
- oa->val_ptr = NULL;
- cur_bytes = returned_bytes; /* break the caller loop */
- break;
- }
-
- cur_bytes += inc;
- cur_p += inc;
- ++oa;
- }
-
- *iterator = (returned_bytes - cur_bytes) ? cur_p : NULL;
- *nelem = n;
- return returned_bytes - cur_bytes;
-}
-EXPORT_SYMBOL(osd_req_decode_get_attr_list);
-
-/*
- * Attributes Page-mode
- */
-
-int osd_req_add_get_attr_page(struct osd_request *or,
- u32 page_id, void *attar_page, unsigned max_page_len,
- const struct osd_attr *set_one_attr)
-{
- struct osd_cdb_head *cdbh = osd_cdb_head(&or->cdb);
-
- if (or->attributes_mode &&
- or->attributes_mode != OSD_CDB_GET_ATTR_PAGE_SET_ONE) {
- WARN_ON(1);
- return -EINVAL;
- }
- or->attributes_mode = OSD_CDB_GET_ATTR_PAGE_SET_ONE;
-
- or->get_attr.buff = attar_page;
- or->get_attr.total_bytes = max_page_len;
-
- cdbh->attrs_page.get_attr_page = cpu_to_be32(page_id);
- cdbh->attrs_page.get_attr_alloc_length = cpu_to_be32(max_page_len);
-
- if (!set_one_attr || !set_one_attr->attr_page)
- return 0; /* The set is optional */
-
- or->set_attr.buff = set_one_attr->val_ptr;
- or->set_attr.total_bytes = set_one_attr->len;
-
- cdbh->attrs_page.set_attr_page = cpu_to_be32(set_one_attr->attr_page);
- cdbh->attrs_page.set_attr_id = cpu_to_be32(set_one_attr->attr_id);
- cdbh->attrs_page.set_attr_length = cpu_to_be32(set_one_attr->len);
- return 0;
-}
-EXPORT_SYMBOL(osd_req_add_get_attr_page);
-
-static int _osd_req_finalize_attr_page(struct osd_request *or)
-{
- struct osd_cdb_head *cdbh = osd_cdb_head(&or->cdb);
- unsigned in_padding, out_padding;
- int ret;
-
- /* returned page */
- cdbh->attrs_page.get_attr_offset =
- osd_req_encode_offset(or, or->in.total_bytes, &in_padding);
-
- ret = _req_append_segment(or, in_padding, &or->get_attr, NULL,
- &or->in);
- if (ret)
- return ret;
-
- if (or->set_attr.total_bytes == 0)
- return 0;
-
- /* set one value */
- cdbh->attrs_page.set_attr_offset =
- osd_req_encode_offset(or, or->out.total_bytes, &out_padding);
-
- ret = _req_append_segment(or, out_padding, &or->set_attr, NULL,
- &or->out);
- return ret;
-}
-
-static inline void osd_sec_parms_set_out_offset(bool is_v1,
- struct osd_security_parameters *sec_parms, osd_cdb_offset offset)
-{
- if (is_v1)
- sec_parms->v1.data_out_integrity_check_offset = offset;
- else
- sec_parms->v2.data_out_integrity_check_offset = offset;
-}
-
-static inline void osd_sec_parms_set_in_offset(bool is_v1,
- struct osd_security_parameters *sec_parms, osd_cdb_offset offset)
-{
- if (is_v1)
- sec_parms->v1.data_in_integrity_check_offset = offset;
- else
- sec_parms->v2.data_in_integrity_check_offset = offset;
-}
-
-static int _osd_req_finalize_data_integrity(struct osd_request *or,
- bool has_in, bool has_out, struct bio *out_data_bio, u64 out_data_bytes,
- const u8 *cap_key)
-{
- struct osd_security_parameters *sec_parms = _osd_req_sec_params(or);
- int ret;
-
- if (!osd_is_sec_alldata(sec_parms))
- return 0;
-
- if (has_out) {
- struct _osd_req_data_segment seg = {
- .buff = &or->out_data_integ,
- .total_bytes = sizeof(or->out_data_integ),
- };
- unsigned pad;
-
- or->out_data_integ.data_bytes = cpu_to_be64(out_data_bytes);
- or->out_data_integ.set_attributes_bytes = cpu_to_be64(
- or->set_attr.total_bytes);
- or->out_data_integ.get_attributes_bytes = cpu_to_be64(
- or->enc_get_attr.total_bytes);
-
- osd_sec_parms_set_out_offset(osd_req_is_ver1(or), sec_parms,
- osd_req_encode_offset(or, or->out.total_bytes, &pad));
-
- ret = _req_append_segment(or, pad, &seg, or->out.last_seg,
- &or->out);
- if (ret)
- return ret;
- or->out.last_seg = NULL;
-
- /* they are now all chained to request sign them all together */
- osd_sec_sign_data(&or->out_data_integ, out_data_bio,
- cap_key);
- }
-
- if (has_in) {
- struct _osd_req_data_segment seg = {
- .buff = &or->in_data_integ,
- .total_bytes = sizeof(or->in_data_integ),
- };
- unsigned pad;
-
- osd_sec_parms_set_in_offset(osd_req_is_ver1(or), sec_parms,
- osd_req_encode_offset(or, or->in.total_bytes, &pad));
-
- ret = _req_append_segment(or, pad, &seg, or->in.last_seg,
- &or->in);
- if (ret)
- return ret;
-
- or->in.last_seg = NULL;
- }
-
- return 0;
-}
-
-/*
- * osd_finalize_request and helpers
- */
-static struct request *_make_request(struct request_queue *q, bool has_write,
- struct _osd_io_info *oii, gfp_t flags)
-{
- struct request *req;
- struct bio *bio = oii->bio;
- int ret;
-
- req = blk_get_request(q, has_write ? REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN,
- flags);
- if (IS_ERR(req))
- return req;
- scsi_req_init(req);
-
- for_each_bio(bio) {
- struct bio *bounce_bio = bio;
-
- blk_queue_bounce(req->q, &bounce_bio);
- ret = blk_rq_append_bio(req, bounce_bio);
- if (ret)
- return ERR_PTR(ret);
- }
-
- return req;
-}
-
-static int _init_blk_request(struct osd_request *or,
- bool has_in, bool has_out)
-{
- gfp_t flags = or->alloc_flags;
- struct scsi_device *scsi_device = or->osd_dev->scsi_device;
- struct request_queue *q = scsi_device->request_queue;
- struct request *req;
- int ret;
-
- req = _make_request(q, has_out, has_out ? &or->out : &or->in, flags);
- if (IS_ERR(req)) {
- ret = PTR_ERR(req);
- goto out;
- }
-
- or->request = req;
- req->rq_flags |= RQF_QUIET;
-
- req->timeout = or->timeout;
- scsi_req(req)->retries = or->retries;
-
- if (has_out) {
- or->out.req = req;
- if (has_in) {
- /* allocate bidi request */
- req = _make_request(q, false, &or->in, flags);
- if (IS_ERR(req)) {
- OSD_DEBUG("blk_get_request for bidi failed\n");
- ret = PTR_ERR(req);
- goto out;
- }
- scsi_req_init(req);
- or->in.req = or->request->next_rq = req;
- }
- } else if (has_in)
- or->in.req = req;
-
- ret = 0;
-out:
- OSD_DEBUG("or=%p has_in=%d has_out=%d => %d, %p\n",
- or, has_in, has_out, ret, or->request);
- return ret;
-}
-
-int osd_finalize_request(struct osd_request *or,
- u8 options, const void *cap, const u8 *cap_key)
-{
- struct osd_cdb_head *cdbh = osd_cdb_head(&or->cdb);
- bool has_in, has_out;
- /* Save for data_integrity without the cdb_continuation */
- struct bio *out_data_bio = or->out.bio;
- u64 out_data_bytes = or->out.total_bytes;
- int ret;
-
- if (options & OSD_REQ_FUA)
- cdbh->options |= OSD_CDB_FUA;
-
- if (options & OSD_REQ_DPO)
- cdbh->options |= OSD_CDB_DPO;
-
- if (options & OSD_REQ_BYPASS_TIMESTAMPS)
- cdbh->timestamp_control = OSD_CDB_BYPASS_TIMESTAMPS;
-
- osd_set_caps(&or->cdb, cap);
-
- has_in = or->in.bio || or->get_attr.total_bytes;
- has_out = or->out.bio || or->cdb_cont.total_bytes ||
- or->set_attr.total_bytes || or->enc_get_attr.total_bytes;
-
- ret = _osd_req_finalize_cdb_cont(or, cap_key);
- if (ret) {
- OSD_DEBUG("_osd_req_finalize_cdb_cont failed\n");
- return ret;
- }
- ret = _init_blk_request(or, has_in, has_out);
- if (ret) {
- OSD_DEBUG("_init_blk_request failed\n");
- return ret;
- }
-
- or->out.pad_buff = sg_out_pad_buffer;
- or->in.pad_buff = sg_in_pad_buffer;
-
- if (!or->attributes_mode)
- or->attributes_mode = OSD_CDB_GET_SET_ATTR_LISTS;
- cdbh->command_specific_options |= or->attributes_mode;
- if (or->attributes_mode == OSD_CDB_GET_ATTR_PAGE_SET_ONE) {
- ret = _osd_req_finalize_attr_page(or);
- if (ret) {
- OSD_DEBUG("_osd_req_finalize_attr_page failed\n");
- return ret;
- }
- } else {
- /* TODO: I think that for the GET_ATTR command these 2 should
- * be reversed to keep them in execution order (for embedded
- * targets with low memory footprint)
- */
- ret = _osd_req_finalize_set_attr_list(or);
- if (ret) {
- OSD_DEBUG("_osd_req_finalize_set_attr_list failed\n");
- return ret;
- }
-
- ret = _osd_req_finalize_get_attr_list(or);
- if (ret) {
- OSD_DEBUG("_osd_req_finalize_get_attr_list failed\n");
- return ret;
- }
- }
-
- ret = _osd_req_finalize_data_integrity(or, has_in, has_out,
- out_data_bio, out_data_bytes,
- cap_key);
- if (ret)
- return ret;
-
- osd_sec_sign_cdb(&or->cdb, cap_key);
-
- scsi_req(or->request)->cmd = or->cdb.buff;
- scsi_req(or->request)->cmd_len = _osd_req_cdb_len(or);
-
- return 0;
-}
-EXPORT_SYMBOL(osd_finalize_request);
-
-static bool _is_osd_security_code(int code)
-{
- return (code == osd_security_audit_value_frozen) ||
- (code == osd_security_working_key_frozen) ||
- (code == osd_nonce_not_unique) ||
- (code == osd_nonce_timestamp_out_of_range) ||
- (code == osd_invalid_dataout_buffer_integrity_check_value);
-}
-
-#define OSD_SENSE_PRINT1(fmt, a...) \
- do { \
- if (__cur_sense_need_output) \
- OSD_ERR(fmt, ##a); \
- } while (0)
-
-#define OSD_SENSE_PRINT2(fmt, a...) OSD_SENSE_PRINT1(" " fmt, ##a)
-
-int osd_req_decode_sense_full(struct osd_request *or,
- struct osd_sense_info *osi, bool silent,
- struct osd_obj_id *bad_obj_list __unused, int max_obj __unused,
- struct osd_attr *bad_attr_list, int max_attr)
-{
- int sense_len, original_sense_len;
- struct osd_sense_info local_osi;
- struct scsi_sense_descriptor_based *ssdb;
- void *cur_descriptor;
-#if (CONFIG_SCSI_OSD_DPRINT_SENSE == 0)
- const bool __cur_sense_need_output = false;
-#else
- bool __cur_sense_need_output = !silent;
-#endif
- int ret;
-
- if (likely(!or->req_errors))
- return 0;
-
- osi = osi ? : &local_osi;
- memset(osi, 0, sizeof(*osi));
-
- ssdb = (typeof(ssdb))or->sense;
- sense_len = or->sense_len;
- if ((sense_len < (int)sizeof(*ssdb) || !ssdb->sense_key)) {
- OSD_ERR("Block-layer returned error(0x%x) but "
- "sense_len(%u) || key(%d) is empty\n",
- or->req_errors, sense_len, ssdb->sense_key);
- goto analyze;
- }
-
- if ((ssdb->response_code != 0x72) && (ssdb->response_code != 0x73)) {
- OSD_ERR("Unrecognized scsi sense: rcode=%x length=%d\n",
- ssdb->response_code, sense_len);
- goto analyze;
- }
-
- osi->key = ssdb->sense_key;
- osi->additional_code = be16_to_cpu(ssdb->additional_sense_code);
- original_sense_len = ssdb->additional_sense_length + 8;
-
-#if (CONFIG_SCSI_OSD_DPRINT_SENSE == 1)
- if (__cur_sense_need_output)
- __cur_sense_need_output = (osi->key > scsi_sk_recovered_error);
-#endif
- OSD_SENSE_PRINT1("Main Sense information key=0x%x length(%d, %d) "
- "additional_code=0x%x async_error=%d errors=0x%x\n",
- osi->key, original_sense_len, sense_len,
- osi->additional_code, or->async_error,
- or->req_errors);
-
- if (original_sense_len < sense_len)
- sense_len = original_sense_len;
-
- cur_descriptor = ssdb->ssd;
- sense_len -= sizeof(*ssdb);
- while (sense_len > 0) {
- struct scsi_sense_descriptor *ssd = cur_descriptor;
- int cur_len = ssd->additional_length + 2;
-
- sense_len -= cur_len;
-
- if (sense_len < 0)
- break; /* sense was truncated */
-
- switch (ssd->descriptor_type) {
- case scsi_sense_information:
- case scsi_sense_command_specific_information:
- {
- struct scsi_sense_command_specific_data_descriptor
- *sscd = cur_descriptor;
-
- osi->command_info =
- get_unaligned_be64(&sscd->information) ;
- OSD_SENSE_PRINT2(
- "command_specific_information 0x%llx \n",
- _LLU(osi->command_info));
- break;
- }
- case scsi_sense_key_specific:
- {
- struct scsi_sense_key_specific_data_descriptor
- *ssks = cur_descriptor;
-
- osi->sense_info = get_unaligned_be16(&ssks->value);
- OSD_SENSE_PRINT2(
- "sense_key_specific_information %u"
- "sksv_cd_bpv_bp (0x%x)\n",
- osi->sense_info, ssks->sksv_cd_bpv_bp);
- break;
- }
- case osd_sense_object_identification:
- { /*FIXME: Keep first not last, Store in array*/
- struct osd_sense_identification_data_descriptor
- *osidd = cur_descriptor;
-
- osi->not_initiated_command_functions =
- le32_to_cpu(osidd->not_initiated_functions);
- osi->completed_command_functions =
- le32_to_cpu(osidd->completed_functions);
- osi->obj.partition = be64_to_cpu(osidd->partition_id);
- osi->obj.id = be64_to_cpu(osidd->object_id);
- OSD_SENSE_PRINT2(
- "object_identification pid=0x%llx oid=0x%llx\n",
- _LLU(osi->obj.partition), _LLU(osi->obj.id));
- OSD_SENSE_PRINT2(
- "not_initiated_bits(%x) "
- "completed_command_bits(%x)\n",
- osi->not_initiated_command_functions,
- osi->completed_command_functions);
- break;
- }
- case osd_sense_response_integrity_check:
- {
- struct osd_sense_response_integrity_check_descriptor
- *osricd = cur_descriptor;
- const unsigned len =
- sizeof(osricd->integrity_check_value);
- char key_dump[len*4 + 2]; /* 2nibbles+space+ASCII */
-
- hex_dump_to_buffer(osricd->integrity_check_value, len,
- 32, 1, key_dump, sizeof(key_dump), true);
- OSD_SENSE_PRINT2("response_integrity [%s]\n", key_dump);
- }
- case osd_sense_attribute_identification:
- {
- struct osd_sense_attributes_data_descriptor
- *osadd = cur_descriptor;
- unsigned len = min(cur_len, sense_len);
- struct osd_sense_attr *pattr = osadd->sense_attrs;
-
- while (len >= sizeof(*pattr)) {
- u32 attr_page = be32_to_cpu(pattr->attr_page);
- u32 attr_id = be32_to_cpu(pattr->attr_id);
-
- if (!osi->attr.attr_page) {
- osi->attr.attr_page = attr_page;
- osi->attr.attr_id = attr_id;
- }
-
- if (bad_attr_list && max_attr) {
- bad_attr_list->attr_page = attr_page;
- bad_attr_list->attr_id = attr_id;
- bad_attr_list++;
- max_attr--;
- }
-
- len -= sizeof(*pattr);
- OSD_SENSE_PRINT2(
- "osd_sense_attribute_identification"
- "attr_page=0x%x attr_id=0x%x\n",
- attr_page, attr_id);
- }
- }
- /*These are not legal for OSD*/
- case scsi_sense_field_replaceable_unit:
- OSD_SENSE_PRINT2("scsi_sense_field_replaceable_unit\n");
- break;
- case scsi_sense_stream_commands:
- OSD_SENSE_PRINT2("scsi_sense_stream_commands\n");
- break;
- case scsi_sense_block_commands:
- OSD_SENSE_PRINT2("scsi_sense_block_commands\n");
- break;
- case scsi_sense_ata_return:
- OSD_SENSE_PRINT2("scsi_sense_ata_return\n");
- break;
- default:
- if (ssd->descriptor_type <= scsi_sense_Reserved_last)
- OSD_SENSE_PRINT2(
- "scsi_sense Reserved descriptor (0x%x)",
- ssd->descriptor_type);
- else
- OSD_SENSE_PRINT2(
- "scsi_sense Vendor descriptor (0x%x)",
- ssd->descriptor_type);
- }
-
- cur_descriptor += cur_len;
- }
-
-analyze:
- if (!osi->key) {
- /* scsi sense is Empty, the request was never issued to target
- * linux return code might tell us what happened.
- */
- if (or->async_error == -ENOMEM)
- osi->osd_err_pri = OSD_ERR_PRI_RESOURCE;
- else
- osi->osd_err_pri = OSD_ERR_PRI_UNREACHABLE;
- ret = or->async_error;
- } else if (osi->key <= scsi_sk_recovered_error) {
- osi->osd_err_pri = 0;
- ret = 0;
- } else if (osi->additional_code == scsi_invalid_field_in_cdb) {
- if (osi->cdb_field_offset == OSD_CFO_STARTING_BYTE) {
- osi->osd_err_pri = OSD_ERR_PRI_CLEAR_PAGES;
- ret = -EFAULT; /* caller should recover from this */
- } else if (osi->cdb_field_offset == OSD_CFO_OBJECT_ID) {
- osi->osd_err_pri = OSD_ERR_PRI_NOT_FOUND;
- ret = -ENOENT;
- } else if (osi->cdb_field_offset == OSD_CFO_PERMISSIONS) {
- osi->osd_err_pri = OSD_ERR_PRI_NO_ACCESS;
- ret = -EACCES;
- } else {
- osi->osd_err_pri = OSD_ERR_PRI_BAD_CRED;
- ret = -EINVAL;
- }
- } else if (osi->additional_code == osd_quota_error) {
- osi->osd_err_pri = OSD_ERR_PRI_NO_SPACE;
- ret = -ENOSPC;
- } else if (_is_osd_security_code(osi->additional_code)) {
- osi->osd_err_pri = OSD_ERR_PRI_BAD_CRED;
- ret = -EINVAL;
- } else {
- osi->osd_err_pri = OSD_ERR_PRI_EIO;
- ret = -EIO;
- }
-
- if (!or->out.residual)
- or->out.residual = or->out.total_bytes;
- if (!or->in.residual)
- or->in.residual = or->in.total_bytes;
-
- return ret;
-}
-EXPORT_SYMBOL(osd_req_decode_sense_full);
-
-/*
- * Implementation of osd_sec.h API
- * TODO: Move to a separate osd_sec.c file at a later stage.
- */
-
-enum { OSD_SEC_CAP_V1_ALL_CAPS =
- OSD_SEC_CAP_APPEND | OSD_SEC_CAP_OBJ_MGMT | OSD_SEC_CAP_REMOVE |
- OSD_SEC_CAP_CREATE | OSD_SEC_CAP_SET_ATTR | OSD_SEC_CAP_GET_ATTR |
- OSD_SEC_CAP_WRITE | OSD_SEC_CAP_READ | OSD_SEC_CAP_POL_SEC |
- OSD_SEC_CAP_GLOBAL | OSD_SEC_CAP_DEV_MGMT
-};
-
-enum { OSD_SEC_CAP_V2_ALL_CAPS =
- OSD_SEC_CAP_V1_ALL_CAPS | OSD_SEC_CAP_QUERY | OSD_SEC_CAP_M_OBJECT
-};
-
-void osd_sec_init_nosec_doall_caps(void *caps,
- const struct osd_obj_id *obj, bool is_collection, const bool is_v1)
-{
- struct osd_capability *cap = caps;
- u8 type;
- u8 descriptor_type;
-
- if (likely(obj->id)) {
- if (unlikely(is_collection)) {
- type = OSD_SEC_OBJ_COLLECTION;
- descriptor_type = is_v1 ? OSD_SEC_OBJ_DESC_OBJ :
- OSD_SEC_OBJ_DESC_COL;
- } else {
- type = OSD_SEC_OBJ_USER;
- descriptor_type = OSD_SEC_OBJ_DESC_OBJ;
- }
- WARN_ON(!obj->partition);
- } else {
- type = obj->partition ? OSD_SEC_OBJ_PARTITION :
- OSD_SEC_OBJ_ROOT;
- descriptor_type = OSD_SEC_OBJ_DESC_PAR;
- }
-
- memset(cap, 0, sizeof(*cap));
-
- cap->h.format = OSD_SEC_CAP_FORMAT_VER1;
- cap->h.integrity_algorithm__key_version = 0; /* MAKE_BYTE(0, 0); */
- cap->h.security_method = OSD_SEC_NOSEC;
-/* cap->expiration_time;
- cap->AUDIT[30-10];
- cap->discriminator[42-30];
- cap->object_created_time; */
- cap->h.object_type = type;
- osd_sec_set_caps(&cap->h, OSD_SEC_CAP_V1_ALL_CAPS);
- cap->h.object_descriptor_type = descriptor_type;
- cap->od.obj_desc.policy_access_tag = 0;
- cap->od.obj_desc.allowed_partition_id = cpu_to_be64(obj->partition);
- cap->od.obj_desc.allowed_object_id = cpu_to_be64(obj->id);
-}
-EXPORT_SYMBOL(osd_sec_init_nosec_doall_caps);
-
-/* FIXME: Extract version from caps pointer.
- * Also Pete's target only supports caps from OSDv1 for now
- */
-void osd_set_caps(struct osd_cdb *cdb, const void *caps)
-{
- /* NOTE: They start at same address */
- memcpy(&cdb->v1.caps, caps, OSDv1_CAP_LEN);
-}
-
-bool osd_is_sec_alldata(struct osd_security_parameters *sec_parms __unused)
-{
- return false;
-}
-
-void osd_sec_sign_cdb(struct osd_cdb *ocdb __unused, const u8 *cap_key __unused)
-{
-}
-
-void osd_sec_sign_data(void *data_integ __unused,
- struct bio *bio __unused, const u8 *cap_key __unused)
-{
-}
-
-/*
- * Declared in osd_protocol.h
- * 4.12.5 Data-In and Data-Out buffer offsets
- * byte offset = mantissa * (2^(exponent+8))
- * Returns the smallest allowed encoded offset that contains given @offset
- * The actual encoded offset returned is @offset + *@padding.
- */
-osd_cdb_offset __osd_encode_offset(
- u64 offset, unsigned *padding, int min_shift, int max_shift)
-{
- u64 try_offset = -1, mod, align;
- osd_cdb_offset be32_offset;
- int shift;
-
- *padding = 0;
- if (!offset)
- return 0;
-
- for (shift = min_shift; shift < max_shift; ++shift) {
- try_offset = offset >> shift;
- if (try_offset < (1 << OSD_OFFSET_MAX_BITS))
- break;
- }
-
- BUG_ON(shift == max_shift);
-
- align = 1 << shift;
- mod = offset & (align - 1);
- if (mod) {
- *padding = align - mod;
- try_offset += 1;
- }
-
- try_offset |= ((shift - 8) & 0xf) << 28;
- be32_offset = cpu_to_be32((u32)try_offset);
-
- OSD_DEBUG("offset=%llu mantissa=%llu exp=%d encoded=%x pad=%d\n",
- _LLU(offset), _LLU(try_offset & 0x0FFFFFFF), shift,
- be32_offset, *padding);
- return be32_offset;
-}
diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c
deleted file mode 100644
index e0ce5d2fd14d..000000000000
--- a/drivers/scsi/osd/osd_uld.c
+++ /dev/null
@@ -1,595 +0,0 @@
-/*
- * osd_uld.c - OSD Upper Layer Driver
- *
- * A Linux driver module that registers as a SCSI ULD and probes
- * for OSD type SCSI devices.
- * It's main function is to export osd devices to in-kernel users like
- * osdfs and pNFS-objects-LD. It also provides one ioctl for running
- * in Kernel tests.
- *
- * Copyright (C) 2008 Panasas Inc. All rights reserved.
- *
- * Authors:
- * Boaz Harrosh <ooo@electrozaur.com>
- * Benny Halevy <bhalevy@panasas.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. Neither the name of the Panasas company nor the names of its
- * contributors may be used to endorse or promote products derived
- * from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
- * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <linux/namei.h>
-#include <linux/cdev.h>
-#include <linux/fs.h>
-#include <linux/module.h>
-#include <linux/device.h>
-#include <linux/idr.h>
-#include <linux/major.h>
-#include <linux/file.h>
-#include <linux/slab.h>
-
-#include <scsi/scsi.h>
-#include <scsi/scsi_driver.h>
-#include <scsi/scsi_device.h>
-#include <scsi/scsi_ioctl.h>
-
-#include <scsi/osd_initiator.h>
-#include <scsi/osd_sec.h>
-
-#include "osd_debug.h"
-
-#ifndef TYPE_OSD
-# define TYPE_OSD 0x11
-#endif
-
-#ifndef SCSI_OSD_MAJOR
-# define SCSI_OSD_MAJOR 260
-#endif
-#define SCSI_OSD_MAX_MINOR MINORMASK
-
-static const char osd_name[] = "osd";
-static const char *osd_version_string = "open-osd 0.2.1";
-
-MODULE_AUTHOR("Boaz Harrosh <ooo@electrozaur.com>");
-MODULE_DESCRIPTION("open-osd Upper-Layer-Driver osd.ko");
-MODULE_LICENSE("GPL");
-MODULE_ALIAS_CHARDEV_MAJOR(SCSI_OSD_MAJOR);
-MODULE_ALIAS_SCSI_DEVICE(TYPE_OSD);
-
-struct osd_uld_device {
- int minor;
- struct device class_dev;
- struct cdev cdev;
- struct osd_dev od;
- struct osd_dev_info odi;
- struct gendisk *disk;
-};
-
-struct osd_dev_handle {
- struct osd_dev od;
- struct file *file;
- struct osd_uld_device *oud;
-} ;
-
-static DEFINE_IDA(osd_minor_ida);
-
-/*
- * scsi sysfs attribute operations
- */
-static ssize_t osdname_show(struct device *dev, struct device_attribute *attr,
- char *buf)
-{
- struct osd_uld_device *ould = container_of(dev, struct osd_uld_device,
- class_dev);
- return sprintf(buf, "%s\n", ould->odi.osdname);
-}
-static DEVICE_ATTR_RO(osdname);
-
-static ssize_t systemid_show(struct device *dev, struct device_attribute *attr,
- char *buf)
-{
- struct osd_uld_device *ould = container_of(dev, struct osd_uld_device,
- class_dev);
-
- memcpy(buf, ould->odi.systemid, ould->odi.systemid_len);
- return ould->odi.systemid_len;
-}
-static DEVICE_ATTR_RO(systemid);
-
-static struct attribute *osd_uld_attrs[] = {
- &dev_attr_osdname.attr,
- &dev_attr_systemid.attr,
- NULL,
-};
-ATTRIBUTE_GROUPS(osd_uld);
-
-static struct class osd_uld_class = {
- .owner = THIS_MODULE,
- .name = "scsi_osd",
- .dev_groups = osd_uld_groups,
-};
-
-/*
- * Char Device operations
- */
-
-static int osd_uld_open(struct inode *inode, struct file *file)
-{
- struct osd_uld_device *oud = container_of(inode->i_cdev,
- struct osd_uld_device, cdev);
-
- get_device(&oud->class_dev);
- /* cache osd_uld_device on file handle */
- file->private_data = oud;
- OSD_DEBUG("osd_uld_open %p\n", oud);
- return 0;
-}
-
-static int osd_uld_release(struct inode *inode, struct file *file)
-{
- struct osd_uld_device *oud = file->private_data;
-
- OSD_DEBUG("osd_uld_release %p\n", file->private_data);
- file->private_data = NULL;
- put_device(&oud->class_dev);
- return 0;
-}
-
-/* FIXME: Only one vector for now */
-unsigned g_test_ioctl;
-do_test_fn *g_do_test;
-
-int osduld_register_test(unsigned ioctl, do_test_fn *do_test)
-{
- if (g_test_ioctl)
- return -EINVAL;
-
- g_test_ioctl = ioctl;
- g_do_test = do_test;
- return 0;
-}
-EXPORT_SYMBOL(osduld_register_test);
-
-void osduld_unregister_test(unsigned ioctl)
-{
- if (ioctl == g_test_ioctl) {
- g_test_ioctl = 0;
- g_do_test = NULL;
- }
-}
-EXPORT_SYMBOL(osduld_unregister_test);
-
-static do_test_fn *_find_ioctl(unsigned cmd)
-{
- if (g_test_ioctl == cmd)
- return g_do_test;
- else
- return NULL;
-}
-
-static long osd_uld_ioctl(struct file *file, unsigned int cmd,
- unsigned long arg)
-{
- struct osd_uld_device *oud = file->private_data;
- int ret;
- do_test_fn *do_test;
-
- do_test = _find_ioctl(cmd);
- if (do_test)
- ret = do_test(&oud->od, cmd, arg);
- else {
- OSD_ERR("Unknown ioctl %d: osd_uld_device=%p\n", cmd, oud);
- ret = -ENOIOCTLCMD;
- }
- return ret;
-}
-
-static const struct file_operations osd_fops = {
- .owner = THIS_MODULE,
- .open = osd_uld_open,
- .release = osd_uld_release,
- .unlocked_ioctl = osd_uld_ioctl,
- .llseek = noop_llseek,
-};
-
-struct osd_dev *osduld_path_lookup(const char *name)
-{
- struct osd_uld_device *oud;
- struct osd_dev_handle *odh;
- struct file *file;
- int error;
-
- if (!name || !*name) {
- OSD_ERR("Mount with !path || !*path\n");
- return ERR_PTR(-EINVAL);
- }
-
- odh = kzalloc(sizeof(*odh), GFP_KERNEL);
- if (unlikely(!odh))
- return ERR_PTR(-ENOMEM);
-
- file = filp_open(name, O_RDWR, 0);
- if (IS_ERR(file)) {
- error = PTR_ERR(file);
- goto free_od;
- }
-
- if (file->f_op != &osd_fops){
- error = -EINVAL;
- goto close_file;
- }
-
- oud = file->private_data;
-
- odh->od = oud->od;
- odh->file = file;
- odh->oud = oud;
-
- return &odh->od;
-
-close_file:
- fput(file);
-free_od:
- kfree(odh);
- return ERR_PTR(error);
-}
-EXPORT_SYMBOL(osduld_path_lookup);
-
-static inline bool _the_same_or_null(const u8 *a1, unsigned a1_len,
- const u8 *a2, unsigned a2_len)
-{
- if (!a2_len) /* User string is Empty means don't care */
- return true;
-
- if (a1_len != a2_len)
- return false;
-
- return 0 == memcmp(a1, a2, a1_len);
-}
-
-static int _match_odi(struct device *dev, const void *find_data)
-{
- struct osd_uld_device *oud = container_of(dev, struct osd_uld_device,
- class_dev);
- const struct osd_dev_info *odi = find_data;
-
- if (_the_same_or_null(oud->odi.systemid, oud->odi.systemid_len,
- odi->systemid, odi->systemid_len) &&
- _the_same_or_null(oud->odi.osdname, oud->odi.osdname_len,
- odi->osdname, odi->osdname_len)) {
- OSD_DEBUG("found device sysid_len=%d osdname=%d\n",
- odi->systemid_len, odi->osdname_len);
- return 1;
- } else {
- return 0;
- }
-}
-
-/* osduld_info_lookup - Loop through all devices, return the requested osd_dev.
- *
- * if @odi->systemid_len and/or @odi->osdname_len are zero, they act as a don't
- * care. .e.g if they're both zero /dev/osd0 is returned.
- */
-struct osd_dev *osduld_info_lookup(const struct osd_dev_info *odi)
-{
- struct device *dev = class_find_device(&osd_uld_class, NULL, odi, _match_odi);
- if (likely(dev)) {
- struct osd_dev_handle *odh = kzalloc(sizeof(*odh), GFP_KERNEL);
- struct osd_uld_device *oud = container_of(dev,
- struct osd_uld_device, class_dev);
-
- if (unlikely(!odh)) {
- put_device(dev);
- return ERR_PTR(-ENOMEM);
- }
-
- odh->od = oud->od;
- odh->oud = oud;
-
- return &odh->od;
- }
-
- return ERR_PTR(-ENODEV);
-}
-EXPORT_SYMBOL(osduld_info_lookup);
-
-void osduld_put_device(struct osd_dev *od)
-{
- if (od && !IS_ERR(od)) {
- struct osd_dev_handle *odh =
- container_of(od, struct osd_dev_handle, od);
- struct osd_uld_device *oud = odh->oud;
-
- BUG_ON(od->scsi_device != oud->od.scsi_device);
-
- /* If scsi has released the device (logout), and exofs has last
- * reference on oud it will be freed by above osd_uld_release
- * within fput below. But this will oops in cdev_release which
- * is called after the fops->release. A get_/put_ pair makes
- * sure we have a cdev for the duration of fput
- */
- if (odh->file) {
- get_device(&oud->class_dev);
- fput(odh->file);
- }
- put_device(&oud->class_dev);
- kfree(odh);
- }
-}
-EXPORT_SYMBOL(osduld_put_device);
-
-const struct osd_dev_info *osduld_device_info(struct osd_dev *od)
-{
- struct osd_dev_handle *odh =
- container_of(od, struct osd_dev_handle, od);
- return &odh->oud->odi;
-}
-EXPORT_SYMBOL(osduld_device_info);
-
-bool osduld_device_same(struct osd_dev *od, const struct osd_dev_info *odi)
-{
- struct osd_dev_handle *odh =
- container_of(od, struct osd_dev_handle, od);
- struct osd_uld_device *oud = odh->oud;
-
- return (oud->odi.systemid_len == odi->systemid_len) &&
- _the_same_or_null(oud->odi.systemid, oud->odi.systemid_len,
- odi->systemid, odi->systemid_len) &&
- (oud->odi.osdname_len == odi->osdname_len) &&
- _the_same_or_null(oud->odi.osdname, oud->odi.osdname_len,
- odi->osdname, odi->osdname_len);
-}
-EXPORT_SYMBOL(osduld_device_same);
-
-/*
- * Scsi Device operations
- */
-
-static int __detect_osd(struct osd_uld_device *oud)
-{
- struct scsi_device *scsi_device = oud->od.scsi_device;
- struct scsi_sense_hdr sense_hdr;
- char caps[OSD_CAP_LEN];
- int error;
-
- /* sending a test_unit_ready as first command seems to be needed
- * by some targets
- */
- OSD_DEBUG("start scsi_test_unit_ready %p %p %p\n",
- oud, scsi_device, scsi_device->request_queue);
- error = scsi_test_unit_ready(scsi_device, 10*HZ, 5, &sense_hdr);
- if (error)
- OSD_ERR("warning: scsi_test_unit_ready failed\n");
-
- osd_sec_init_nosec_doall_caps(caps, &osd_root_object, false, true);
- if (osd_auto_detect_ver(&oud->od, caps, &oud->odi))
- return -ENODEV;
-
- return 0;
-}
-
-static void __remove(struct device *dev)
-{
- struct osd_uld_device *oud = container_of(dev, struct osd_uld_device,
- class_dev);
- struct scsi_device *scsi_device = oud->od.scsi_device;
-
- kfree(oud->odi.osdname);
-
- if (oud->cdev.owner)
- cdev_del(&oud->cdev);
-
- osd_dev_fini(&oud->od);
- scsi_device_put(scsi_device);
-
- OSD_INFO("osd_remove %s\n",
- oud->disk ? oud->disk->disk_name : NULL);
-
- if (oud->disk)
- put_disk(oud->disk);
- ida_remove(&osd_minor_ida, oud->minor);
-
- kfree(oud);
-}
-
-static int osd_probe(struct device *dev)
-{
- struct scsi_device *scsi_device = to_scsi_device(dev);
- struct gendisk *disk;
- struct osd_uld_device *oud;
- int minor;
- int error;
-
- if (scsi_device->type != TYPE_OSD)
- return -ENODEV;
-
- do {
- if (!ida_pre_get(&osd_minor_ida, GFP_KERNEL))
- return -ENODEV;
-
- error = ida_get_new(&osd_minor_ida, &minor);
- } while (error == -EAGAIN);
-
- if (error)
- return error;
- if (minor >= SCSI_OSD_MAX_MINOR) {
- error = -EBUSY;
- goto err_retract_minor;
- }
-
- error = -ENOMEM;
- oud = kzalloc(sizeof(*oud), GFP_KERNEL);
- if (NULL == oud)
- goto err_retract_minor;
-
- dev_set_drvdata(dev, oud);
- oud->minor = minor;
-
- /* allocate a disk and set it up */
- /* FIXME: do we need this since sg has already done that */
- disk = alloc_disk(1);
- if (!disk) {
- OSD_ERR("alloc_disk failed\n");
- goto err_free_osd;
- }
- disk->major = SCSI_OSD_MAJOR;
- disk->first_minor = oud->minor;
- sprintf(disk->disk_name, "osd%d", oud->minor);
- oud->disk = disk;
-
- /* hold one more reference to the scsi_device that will get released
- * in __release, in case a logout is happening while fs is mounted
- */
- scsi_device_get(scsi_device);
- osd_dev_init(&oud->od, scsi_device);
-
- /* Detect the OSD Version */
- error = __detect_osd(oud);
- if (error) {
- OSD_ERR("osd detection failed, non-compatible OSD device\n");
- goto err_put_disk;
- }
-
- /* init the char-device for communication with user-mode */
- cdev_init(&oud->cdev, &osd_fops);
- oud->cdev.owner = THIS_MODULE;
- error = cdev_add(&oud->cdev,
- MKDEV(SCSI_OSD_MAJOR, oud->minor), 1);
- if (error) {
- OSD_ERR("cdev_add failed\n");
- goto err_put_disk;
- }
-
- /* class device member */
- oud->class_dev.devt = oud->cdev.dev;
- oud->class_dev.class = &osd_uld_class;
- oud->class_dev.parent = dev;
- oud->class_dev.release = __remove;
- error = dev_set_name(&oud->class_dev, "%s", disk->disk_name);
- if (error) {
- OSD_ERR("dev_set_name failed => %d\n", error);
- goto err_put_cdev;
- }
-
- error = device_register(&oud->class_dev);
- if (error) {
- OSD_ERR("device_register failed => %d\n", error);
- goto err_put_cdev;
- }
-
- get_device(&oud->class_dev);
-
- OSD_INFO("osd_probe %s\n", disk->disk_name);
- return 0;
-
-err_put_cdev:
- cdev_del(&oud->cdev);
-err_put_disk:
- scsi_device_put(scsi_device);
- put_disk(disk);
-err_free_osd:
- dev_set_drvdata(dev, NULL);
- kfree(oud);
-err_retract_minor:
- ida_remove(&osd_minor_ida, minor);
- return error;
-}
-
-static int osd_remove(struct device *dev)
-{
- struct scsi_device *scsi_device = to_scsi_device(dev);
- struct osd_uld_device *oud = dev_get_drvdata(dev);
-
- if (!oud || (oud->od.scsi_device != scsi_device)) {
- OSD_ERR("Half cooked osd-device %p,%p || %p!=%p",
- dev, oud, oud ? oud->od.scsi_device : NULL,
- scsi_device);
- }
-
- device_unregister(&oud->class_dev);
-
- put_device(&oud->class_dev);
- return 0;
-}
-
-/*
- * Global driver and scsi registration
- */
-
-static struct scsi_driver osd_driver = {
- .gendrv = {
- .name = osd_name,
- .owner = THIS_MODULE,
- .probe = osd_probe,
- .remove = osd_remove,
- }
-};
-
-static int __init osd_uld_init(void)
-{
- int err;
-
- err = class_register(&osd_uld_class);
- if (err) {
- OSD_ERR("Unable to register sysfs class => %d\n", err);
- return err;
- }
-
- err = register_chrdev_region(MKDEV(SCSI_OSD_MAJOR, 0),
- SCSI_OSD_MAX_MINOR, osd_name);
- if (err) {
- OSD_ERR("Unable to register major %d for osd ULD => %d\n",
- SCSI_OSD_MAJOR, err);
- goto err_out;
- }
-
- err = scsi_register_driver(&osd_driver.gendrv);
- if (err) {
- OSD_ERR("scsi_register_driver failed => %d\n", err);
- goto err_out_chrdev;
- }
-
- OSD_INFO("LOADED %s\n", osd_version_string);
- return 0;
-
-err_out_chrdev:
- unregister_chrdev_region(MKDEV(SCSI_OSD_MAJOR, 0), SCSI_OSD_MAX_MINOR);
-err_out:
- class_unregister(&osd_uld_class);
- return err;
-}
-
-static void __exit osd_uld_exit(void)
-{
- scsi_unregister_driver(&osd_driver.gendrv);
- unregister_chrdev_region(MKDEV(SCSI_OSD_MAJOR, 0), SCSI_OSD_MAX_MINOR);
- class_unregister(&osd_uld_class);
- OSD_INFO("UNLOADED %s\n", osd_version_string);
-}
-
-module_init(osd_uld_init);
-module_exit(osd_uld_exit);
diff --git a/include/scsi/osd_initiator.h b/include/scsi/osd_initiator.h
deleted file mode 100644
index a09cca829082..000000000000
--- a/include/scsi/osd_initiator.h
+++ /dev/null
@@ -1,515 +0,0 @@
-/*
- * osd_initiator.h - OSD initiator API definition
- *
- * Copyright (C) 2008 Panasas Inc. All rights reserved.
- *
- * Authors:
- * Boaz Harrosh <ooo@electrozaur.com>
- * Benny Halevy <bhalevy@panasas.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2
- *
- */
-#ifndef __OSD_INITIATOR_H__
-#define __OSD_INITIATOR_H__
-
-#include <scsi/osd_protocol.h>
-#include <scsi/osd_types.h>
-
-#include <linux/blkdev.h>
-#include <scsi/scsi_device.h>
-
-/* Note: "NI" in comments below means "Not Implemented yet" */
-
-/* Configure of code:
- * #undef if you *don't* want OSD v1 support in runtime.
- * If #defined the initiator will dynamically configure to encode OSD v1
- * CDB's if the target is detected to be OSD v1 only.
- * OSD v2 only commands, options, and attributes will be ignored if target
- * is v1 only.
- * If #defined will result in bigger/slower code (OK Slower maybe not)
- * Q: Should this be CONFIG_SCSI_OSD_VER1_SUPPORT and set from Kconfig?
- */
-#define OSD_VER1_SUPPORT y
-
-enum osd_std_version {
- OSD_VER_NONE = 0,
- OSD_VER1 = 1,
- OSD_VER2 = 2,
-};
-
-/*
- * Object-based Storage Device.
- * This object represents an OSD device.
- * It is not a full linux device in any way. It is only
- * a place to hang resources associated with a Linux
- * request Q and some default properties.
- */
-struct osd_dev {
- struct scsi_device *scsi_device;
- unsigned def_timeout;
-
-#ifdef OSD_VER1_SUPPORT
- enum osd_std_version version;
-#endif
-};
-
-/* Unique Identification of an OSD device */
-struct osd_dev_info {
- unsigned systemid_len;
- u8 systemid[OSD_SYSTEMID_LEN];
- unsigned osdname_len;
- u8 *osdname;
-};
-
-/* Retrieve/return osd_dev(s) for use by Kernel clients
- * Use IS_ERR/ERR_PTR on returned "osd_dev *".
- */
-struct osd_dev *osduld_path_lookup(const char *dev_name);
-struct osd_dev *osduld_info_lookup(const struct osd_dev_info *odi);
-void osduld_put_device(struct osd_dev *od);
-
-const struct osd_dev_info *osduld_device_info(struct osd_dev *od);
-bool osduld_device_same(struct osd_dev *od, const struct osd_dev_info *odi);
-
-/* Add/remove test ioctls from external modules */
-typedef int (do_test_fn)(struct osd_dev *od, unsigned cmd, unsigned long arg);
-int osduld_register_test(unsigned ioctl, do_test_fn *do_test);
-void osduld_unregister_test(unsigned ioctl);
-
-/* These are called by uld at probe time */
-void osd_dev_init(struct osd_dev *od, struct scsi_device *scsi_device);
-void osd_dev_fini(struct osd_dev *od);
-
-/**
- * osd_auto_detect_ver - Detect the OSD version, return Unique Identification
- *
- * @od: OSD target lun handle
- * @caps: Capabilities authorizing OSD root read attributes access
- * @odi: Retrieved information uniquely identifying the osd target lun
- * Note: odi->osdname must be kfreed by caller.
- *
- * Auto detects the OSD version of the OSD target and sets the @od
- * accordingly. Meanwhile also returns the "system id" and "osd name" root
- * attributes which uniquely identify the OSD target. This member is usually
- * called by the ULD. ULD users should call osduld_device_info().
- * This rutine allocates osd requests and memory at GFP_KERNEL level and might
- * sleep.
- */
-int osd_auto_detect_ver(struct osd_dev *od,
- void *caps, struct osd_dev_info *odi);
-
-static inline struct request_queue *osd_request_queue(struct osd_dev *od)
-{
- return od->scsi_device->request_queue;
-}
-
-/* we might want to use function vector in the future */
-static inline void osd_dev_set_ver(struct osd_dev *od, enum osd_std_version v)
-{
-#ifdef OSD_VER1_SUPPORT
- od->version = v;
-#endif
-}
-
-static inline bool osd_dev_is_ver1(struct osd_dev *od)
-{
-#ifdef OSD_VER1_SUPPORT
- return od->version == OSD_VER1;
-#else
- return false;
-#endif
-}
-
-struct osd_request;
-typedef void (osd_req_done_fn)(struct osd_request *or, void *private);
-
-struct osd_request {
- struct osd_cdb cdb;
- struct osd_data_out_integrity_info out_data_integ;
- struct osd_data_in_integrity_info in_data_integ;
-
- struct osd_dev *osd_dev;
- struct request *request;
-
- struct _osd_req_data_segment {
- void *buff;
- unsigned alloc_size; /* 0 here means: don't call kfree */
- unsigned total_bytes;
- } cdb_cont, set_attr, enc_get_attr, get_attr;
-
- struct _osd_io_info {
- struct bio *bio;
- u64 total_bytes;
- u64 residual;
- struct request *req;
- struct _osd_req_data_segment *last_seg;
- u8 *pad_buff;
- } out, in;
-
- gfp_t alloc_flags;
- unsigned timeout;
- unsigned retries;
- unsigned sense_len;
- u8 sense[OSD_MAX_SENSE_LEN];
- enum osd_attributes_mode attributes_mode;
-
- osd_req_done_fn *async_done;
- void *async_private;
- int async_error;
- int req_errors;
-};
-
-static inline bool osd_req_is_ver1(struct osd_request *or)
-{
- return osd_dev_is_ver1(or->osd_dev);
-}
-
-/*
- * How to use the osd library:
- *
- * osd_start_request
- * Allocates a request.
- *
- * osd_req_*
- * Call one of, to encode the desired operation.
- *
- * osd_add_{get,set}_attr
- * Optionally add attributes to the CDB, list or page mode.
- *
- * osd_finalize_request
- * Computes final data out/in offsets and signs the request,
- * making it ready for execution.
- *
- * osd_execute_request
- * May be called to execute it through the block layer. Other wise submit
- * the associated block request in some other way.
- *
- * After execution:
- * osd_req_decode_sense
- * Decodes sense information to verify execution results.
- *
- * osd_req_decode_get_attr
- * Retrieve osd_add_get_attr_list() values if used.
- *
- * osd_end_request
- * Must be called to deallocate the request.
- */
-
-/**
- * osd_start_request - Allocate and initialize an osd_request
- *
- * @osd_dev: OSD device that holds the scsi-device and default values
- * that the request is associated with.
- * @gfp: The allocation flags to use for request allocation, and all
- * subsequent allocations. This will be stored at
- * osd_request->alloc_flags, can be changed by user later
- *
- * Allocate osd_request and initialize all members to the
- * default/initial state.
- */
-struct osd_request *osd_start_request(struct osd_dev *od, gfp_t gfp);
-
-enum osd_req_options {
- OSD_REQ_FUA = 0x08, /* Force Unit Access */
- OSD_REQ_DPO = 0x10, /* Disable Page Out */
-
- OSD_REQ_BYPASS_TIMESTAMPS = 0x80,
-};
-
-/**
- * osd_finalize_request - Sign request and prepare request for execution
- *
- * @or: osd_request to prepare
- * @options: combination of osd_req_options bit flags or 0.
- * @cap: A Pointer to an OSD_CAP_LEN bytes buffer that is received from
- * The security manager as capabilities for this cdb.
- * @cap_key: The cryptographic key used to sign the cdb/data. Can be null
- * if NOSEC is used.
- *
- * The actual request and bios are only allocated here, so are the get_attr
- * buffers that will receive the returned attributes. Copy's @cap to cdb.
- * Sign the cdb/data with @cap_key.
- */
-int osd_finalize_request(struct osd_request *or,
- u8 options, const void *cap, const u8 *cap_key);
-
-/**
- * osd_execute_request - Execute the request synchronously through block-layer
- *
- * @or: osd_request to Executed
- *
- * Calls blk_execute_rq to q the command and waits for completion.
- */
-int osd_execute_request(struct osd_request *or);
-
-/**
- * osd_execute_request_async - Execute the request without waitting.
- *
- * @or: - osd_request to Executed
- * @done: (Optional) - Called at end of execution
- * @private: - Will be passed to @done function
- *
- * Calls blk_execute_rq_nowait to queue the command. When execution is done
- * optionally calls @done with @private as parameter. @or->async_error will
- * have the return code
- */
-int osd_execute_request_async(struct osd_request *or,
- osd_req_done_fn *done, void *private);
-
-/**
- * osd_req_decode_sense_full - Decode sense information after execution.
- *
- * @or: - osd_request to examine
- * @osi - Receives a more detailed error report information (optional).
- * @silent - Do not print to dmsg (Even if enabled)
- * @bad_obj_list - Some commands act on multiple objects. Failed objects will
- * be received here (optional)
- * @max_obj - Size of @bad_obj_list.
- * @bad_attr_list - List of failing attributes (optional)
- * @max_attr - Size of @bad_attr_list.
- *
- * After execution, osd_request results are analyzed using this function. The
- * return code is the final disposition on the error. So it is possible that a
- * CHECK_CONDITION was returned from target but this will return NO_ERROR, for
- * example on recovered errors. All parameters are optional if caller does
- * not need any returned information.
- * Note: This function will also dump the error to dmsg according to settings
- * of the SCSI_OSD_DPRINT_SENSE Kconfig value. Set @silent if you know the
- * command would routinely fail, to not spam the dmsg file.
- */
-
-/**
- * osd_err_priority - osd categorized return codes in ascending severity.
- *
- * The categories are borrowed from the pnfs_osd_errno enum.
- * See comments for translated Linux codes returned by osd_req_decode_sense.
- */
-enum osd_err_priority {
- OSD_ERR_PRI_NO_ERROR = 0,
- /* Recoverable, caller should clear_highpage() all pages */
- OSD_ERR_PRI_CLEAR_PAGES = 1, /* -EFAULT */
- OSD_ERR_PRI_RESOURCE = 2, /* -ENOMEM */
- OSD_ERR_PRI_BAD_CRED = 3, /* -EINVAL */
- OSD_ERR_PRI_NO_ACCESS = 4, /* -EACCES */
- OSD_ERR_PRI_UNREACHABLE = 5, /* any other */
- OSD_ERR_PRI_NOT_FOUND = 6, /* -ENOENT */
- OSD_ERR_PRI_NO_SPACE = 7, /* -ENOSPC */
- OSD_ERR_PRI_EIO = 8, /* -EIO */
-};
-
-struct osd_sense_info {
- enum osd_err_priority osd_err_pri;
-
- int key; /* one of enum scsi_sense_keys */
- int additional_code ; /* enum osd_additional_sense_codes */
- union { /* Sense specific information */
- u16 sense_info;
- u16 cdb_field_offset; /* scsi_invalid_field_in_cdb */
- };
- union { /* Command specific information */
- u64 command_info;
- };
-
- u32 not_initiated_command_functions; /* osd_command_functions_bits */
- u32 completed_command_functions; /* osd_command_functions_bits */
- struct osd_obj_id obj;
- struct osd_attr attr;
-};
-
-int osd_req_decode_sense_full(struct osd_request *or,
- struct osd_sense_info *osi, bool silent,
- struct osd_obj_id *bad_obj_list, int max_obj,
- struct osd_attr *bad_attr_list, int max_attr);
-
-static inline int osd_req_decode_sense(struct osd_request *or,
- struct osd_sense_info *osi)
-{
- return osd_req_decode_sense_full(or, osi, false, NULL, 0, NULL, 0);
-}
-
-/**
- * osd_end_request - return osd_request to free store
- *
- * @or: osd_request to free
- *
- * Deallocate all osd_request resources (struct req's, BIOs, buffers, etc.)
- */
-void osd_end_request(struct osd_request *or);
-
-/*
- * CDB Encoding
- *
- * Note: call only one of the following methods.
- */
-
-/*
- * Device commands
- */
-void osd_req_set_master_seed_xchg(struct osd_request *or, ...);/* NI */
-void osd_req_set_master_key(struct osd_request *or, ...);/* NI */
-
-void osd_req_format(struct osd_request *or, u64 tot_capacity);
-
-/* list all partitions
- * @list header must be initialized to zero on first run.
- *
- * Call osd_is_obj_list_done() to find if we got the complete list.
- */
-int osd_req_list_dev_partitions(struct osd_request *or,
- osd_id initial_id, struct osd_obj_id_list *list, unsigned nelem);
-
-void osd_req_flush_obsd(struct osd_request *or,
- enum osd_options_flush_scope_values);
-
-void osd_req_perform_scsi_command(struct osd_request *or,
- const u8 *cdb, ...);/* NI */
-void osd_req_task_management(struct osd_request *or, ...);/* NI */
-
-/*
- * Partition commands
- */
-void osd_req_create_partition(struct osd_request *or, osd_id partition);
-void osd_req_remove_partition(struct osd_request *or, osd_id partition);
-
-void osd_req_set_partition_key(struct osd_request *or,
- osd_id partition, u8 new_key_id[OSD_CRYPTO_KEYID_SIZE],
- u8 seed[OSD_CRYPTO_SEED_SIZE]);/* NI */
-
-/* list all collections in the partition
- * @list header must be init to zero on first run.
- *
- * Call osd_is_obj_list_done() to find if we got the complete list.
- */
-int osd_req_list_partition_collections(struct osd_request *or,
- osd_id partition, osd_id initial_id, struct osd_obj_id_list *list,
- unsigned nelem);
-
-/* list all objects in the partition
- * @list header must be init to zero on first run.
- *
- * Call osd_is_obj_list_done() to find if we got the complete list.
- */
-int osd_req_list_partition_objects(struct osd_request *or,
- osd_id partition, osd_id initial_id, struct osd_obj_id_list *list,
- unsigned nelem);
-
-void osd_req_flush_partition(struct osd_request *or,
- osd_id partition, enum osd_options_flush_scope_values);
-
-/*
- * Collection commands
- */
-void osd_req_create_collection(struct osd_request *or,
- const struct osd_obj_id *);/* NI */
-void osd_req_remove_collection(struct osd_request *or,
- const struct osd_obj_id *);/* NI */
-
-/* list all objects in the collection */
-int osd_req_list_collection_objects(struct osd_request *or,
- const struct osd_obj_id *, osd_id initial_id,
- struct osd_obj_id_list *list, unsigned nelem);
-
-/* V2 only filtered list of objects in the collection */
-void osd_req_query(struct osd_request *or, ...);/* NI */
-
-void osd_req_flush_collection(struct osd_request *or,
- const struct osd_obj_id *, enum osd_options_flush_scope_values);
-
-void osd_req_get_member_attrs(struct osd_request *or, ...);/* V2-only NI */
-void osd_req_set_member_attrs(struct osd_request *or, ...);/* V2-only NI */
-
-/*
- * Object commands
- */
-void osd_req_create_object(struct osd_request *or, struct osd_obj_id *);
-void osd_req_remove_object(struct osd_request *or, struct osd_obj_id *);
-
-void osd_req_write(struct osd_request *or,
- const struct osd_obj_id *obj, u64 offset, struct bio *bio, u64 len);
-int osd_req_write_kern(struct osd_request *or,
- const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
-void osd_req_append(struct osd_request *or,
- const struct osd_obj_id *, struct bio *data_out);/* NI */
-void osd_req_create_write(struct osd_request *or,
- const struct osd_obj_id *, struct bio *data_out, u64 offset);/* NI */
-void osd_req_clear(struct osd_request *or,
- const struct osd_obj_id *, u64 offset, u64 len);/* NI */
-void osd_req_punch(struct osd_request *or,
- const struct osd_obj_id *, u64 offset, u64 len);/* V2-only NI */
-
-void osd_req_flush_object(struct osd_request *or,
- const struct osd_obj_id *, enum osd_options_flush_scope_values,
- /*V2*/ u64 offset, /*V2*/ u64 len);
-
-void osd_req_read(struct osd_request *or,
- const struct osd_obj_id *obj, u64 offset, struct bio *bio, u64 len);
-int osd_req_read_kern(struct osd_request *or,
- const struct osd_obj_id *obj, u64 offset, void *buff, u64 len);
-
-/* Scatter/Gather write/read commands */
-int osd_req_write_sg(struct osd_request *or,
- const struct osd_obj_id *obj, struct bio *bio,
- const struct osd_sg_entry *sglist, unsigned numentries);
-int osd_req_read_sg(struct osd_request *or,
- const struct osd_obj_id *obj, struct bio *bio,
- const struct osd_sg_entry *sglist, unsigned numentries);
-int osd_req_write_sg_kern(struct osd_request *or,
- const struct osd_obj_id *obj, void **buff,
- const struct osd_sg_entry *sglist, unsigned numentries);
-int osd_req_read_sg_kern(struct osd_request *or,
- const struct osd_obj_id *obj, void **buff,
- const struct osd_sg_entry *sglist, unsigned numentries);
-
-/*
- * Root/Partition/Collection/Object Attributes commands
- */
-
-/* get before set */
-void osd_req_get_attributes(struct osd_request *or, const struct osd_obj_id *);
-
-/* set before get */
-void osd_req_set_attributes(struct osd_request *or, const struct osd_obj_id *);
-
-/*
- * Attributes appended to most commands
- */
-
-/* Attributes List mode (or V2 CDB) */
- /*
- * TODO: In ver2 if at finalize time only one attr was set and no gets,
- * then the Attributes CDB mode is used automatically to save IO.
- */
-
-/* set a list of attributes. */
-int osd_req_add_set_attr_list(struct osd_request *or,
- const struct osd_attr *, unsigned nelem);
-
-/* get a list of attributes */
-int osd_req_add_get_attr_list(struct osd_request *or,
- const struct osd_attr *, unsigned nelem);
-
-/*
- * Attributes list decoding
- * Must be called after osd_request.request was executed
- * It is called in a loop to decode the returned get_attr
- * (see osd_add_get_attr)
- */
-int osd_req_decode_get_attr_list(struct osd_request *or,
- struct osd_attr *, int *nelem, void **iterator);
-
-/* Attributes Page mode */
-
-/*
- * Read an attribute page and optionally set one attribute
- *
- * Retrieves the attribute page directly to a user buffer.
- * @attr_page_data shall stay valid until end of execution.
- * See osd_attributes.h for common page structures
- */
-int osd_req_add_get_attr_page(struct osd_request *or,
- u32 page_id, void *attr_page_data, unsigned max_page_len,
- const struct osd_attr *set_one);
-
-#endif /* __OSD_LIB_H__ */
diff --git a/include/scsi/osd_ore.h b/include/scsi/osd_ore.h
deleted file mode 100644
index 7a8d2cd30328..000000000000
--- a/include/scsi/osd_ore.h
+++ /dev/null
@@ -1,201 +0,0 @@
-/*
- * Copyright (C) 2011
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * Public Declarations of the ORE API
- *
- * This file is part of the ORE (Object Raid Engine) library.
- *
- * ORE is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation. (GPL v2)
- *
- * ORE is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with the ORE; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-#ifndef __ORE_H__
-#define __ORE_H__
-
-#include <scsi/osd_initiator.h>
-#include <scsi/osd_attributes.h>
-#include <scsi/osd_sec.h>
-#include <linux/pnfs_osd_xdr.h>
-#include <linux/bug.h>
-
-struct ore_comp {
- struct osd_obj_id obj;
- u8 cred[OSD_CAP_LEN];
-};
-
-struct ore_layout {
- /* Our way of looking at the data_map */
- enum pnfs_osd_raid_algorithm4
- raid_algorithm;
- unsigned stripe_unit;
- unsigned mirrors_p1;
-
- unsigned group_width;
- unsigned parity;
- u64 group_depth;
- unsigned group_count;
-
- /* Cached often needed calculations filled in by
- * ore_verify_layout
- */
- unsigned long max_io_length; /* Max length that should be passed to
- * ore_get_rw_state
- */
-};
-
-struct ore_dev {
- struct osd_dev *od;
-};
-
-struct ore_components {
- unsigned first_dev; /* First logical device no */
- unsigned numdevs; /* Num of devices in array */
- /* If @single_comp == EC_SINGLE_COMP, @comps points to a single
- * component. else there are @numdevs components
- */
- enum EC_COMP_USAGE {
- EC_SINGLE_COMP = 0, EC_MULTPLE_COMPS = 0xffffffff
- } single_comp;
- struct ore_comp *comps;
-
- /* Array of pointers to ore_dev-* . User will usually have these pointed
- * too a bigger struct which contain an "ore_dev ored" member and use
- * container_of(oc->ods[i], struct foo_dev, ored) to access the bigger
- * structure.
- */
- struct ore_dev **ods;
-};
-
-/* ore_comp_dev Recievies a logical device index */
-static inline struct osd_dev *ore_comp_dev(
- const struct ore_components *oc, unsigned i)
-{
- BUG_ON((i < oc->first_dev) || (oc->first_dev + oc->numdevs <= i));
- return oc->ods[i - oc->first_dev]->od;
-}
-
-static inline void ore_comp_set_dev(
- struct ore_components *oc, unsigned i, struct osd_dev *od)
-{
- oc->ods[i - oc->first_dev]->od = od;
-}
-
-struct ore_striping_info {
- u64 offset;
- u64 obj_offset;
- u64 length;
- u64 first_stripe_start; /* only used in raid writes */
- u64 M; /* for truncate */
- unsigned bytes_in_stripe;
- unsigned dev;
- unsigned par_dev;
- unsigned unit_off;
- unsigned cur_pg;
- unsigned cur_comp;
- unsigned maxdevUnits;
-};
-
-struct ore_io_state;
-typedef void (*ore_io_done_fn)(struct ore_io_state *ios, void *private);
-struct _ore_r4w_op {
- /* @Priv given here is passed ios->private */
- struct page * (*get_page)(void *priv, u64 page_index, bool *uptodate);
- void (*put_page)(void *priv, struct page *page);
-};
-
-struct ore_io_state {
- struct kref kref;
- struct ore_striping_info si;
-
- void *private;
- ore_io_done_fn done;
-
- struct ore_layout *layout;
- struct ore_components *oc;
-
- /* Global read/write IO*/
- loff_t offset;
- unsigned long length;
- void *kern_buff;
-
- struct page **pages;
- unsigned nr_pages;
- unsigned pgbase;
- unsigned pages_consumed;
-
- /* Attributes */
- unsigned in_attr_len;
- struct osd_attr *in_attr;
- unsigned out_attr_len;
- struct osd_attr *out_attr;
-
- bool reading;
-
- /* House keeping of Parity pages */
- bool extra_part_alloc;
- struct page **parity_pages;
- unsigned max_par_pages;
- unsigned cur_par_page;
- unsigned sgs_per_dev;
- struct __stripe_pages_2d *sp2d;
- struct ore_io_state *ios_read_4_write;
- const struct _ore_r4w_op *r4w;
-
- /* Variable array of size numdevs */
- unsigned numdevs;
- struct ore_per_dev_state {
- struct osd_request *or;
- struct bio *bio;
- loff_t offset;
- unsigned length;
- unsigned last_sgs_total;
- unsigned dev;
- struct osd_sg_entry *sglist;
- unsigned cur_sg;
- } per_dev[];
-};
-
-static inline unsigned ore_io_state_size(unsigned numdevs)
-{
- return sizeof(struct ore_io_state) +
- sizeof(struct ore_per_dev_state) * numdevs;
-}
-
-/* ore.c */
-int ore_verify_layout(unsigned total_comps, struct ore_layout *layout);
-void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
- u64 length, struct ore_striping_info *si);
-int ore_get_rw_state(struct ore_layout *layout, struct ore_components *comps,
- bool is_reading, u64 offset, u64 length,
- struct ore_io_state **ios);
-int ore_get_io_state(struct ore_layout *layout, struct ore_components *comps,
- struct ore_io_state **ios);
-void ore_put_io_state(struct ore_io_state *ios);
-
-typedef void (*ore_on_dev_error)(struct ore_io_state *ios, struct ore_dev *od,
- unsigned dev_index, enum osd_err_priority oep,
- u64 dev_offset, u64 dev_len);
-int ore_check_io(struct ore_io_state *ios, ore_on_dev_error rep);
-
-int ore_create(struct ore_io_state *ios);
-int ore_remove(struct ore_io_state *ios);
-int ore_write(struct ore_io_state *ios);
-int ore_read(struct ore_io_state *ios);
-int ore_truncate(struct ore_layout *layout, struct ore_components *comps,
- u64 size);
-
-int extract_attr_from_ios(struct ore_io_state *ios, struct osd_attr *attr);
-
-extern const struct osd_attr g_attr_logical_length;
-
-#endif
--
2.11.0
^ permalink raw reply related
* [PATCH 3/4] nfs: remove the objlayout driver
From: Christoph Hellwig @ 2017-04-12 16:01 UTC (permalink / raw)
To: ooo, martin.petersen, trond.myklebust, axboe
Cc: osd-dev, linux-nfs, linux-scsi, linux-block, linux-kernel
In-Reply-To: <20170412160109.10598-1-hch@lst.de>
The objlayout code has been in the tree, but it's been unmaintained and
no server product for it actually ever shipped.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
Documentation/admin-guide/kernel-parameters.txt | 6 -
Documentation/filesystems/nfs/pnfs.txt | 37 --
fs/nfs/Kconfig | 5 -
fs/nfs/Makefile | 1 -
fs/nfs/objlayout/Kbuild | 5 -
fs/nfs/objlayout/objio_osd.c | 675 ----------------------
fs/nfs/objlayout/objlayout.c | 706 ------------------------
fs/nfs/objlayout/objlayout.h | 183 ------
fs/nfs/objlayout/pnfs_osd_xdr_cli.c | 415 --------------
9 files changed, 2033 deletions(-)
delete mode 100644 fs/nfs/objlayout/Kbuild
delete mode 100644 fs/nfs/objlayout/objio_osd.c
delete mode 100644 fs/nfs/objlayout/objlayout.c
delete mode 100644 fs/nfs/objlayout/objlayout.h
delete mode 100644 fs/nfs/objlayout/pnfs_osd_xdr_cli.c
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index facc20a3f962..17156d66b124 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2419,12 +2419,6 @@
and gids from such clients. This is intended to ease
migration from NFSv2/v3.
- objlayoutdriver.osd_login_prog=
- [NFS] [OBJLAYOUT] sets the pathname to the program which
- is used to automatically discover and login into new
- osd-targets. Please see:
- Documentation/filesystems/pnfs.txt for more explanations
-
nmi_debug= [KNL,AVR32,SH] Specify one or more actions to take
when a NMI is triggered.
Format: [state][,regs][,debounce][,die]
diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt
index 8de578a98222..80dc0bdc302a 100644
--- a/Documentation/filesystems/nfs/pnfs.txt
+++ b/Documentation/filesystems/nfs/pnfs.txt
@@ -64,46 +64,9 @@ table which are called by the nfs-client pnfs-core to implement the
different layout types.
Files-layout-driver code is in: fs/nfs/filelayout/.. directory
-Objects-layout-driver code is in: fs/nfs/objlayout/.. directory
Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory
-objects-layout setup
---------------------
-
-As part of the full STD implementation the objlayoutdriver.ko needs, at times,
-to automatically login to yet undiscovered iscsi/osd devices. For this the
-driver makes up-calles to a user-mode script called *osd_login*
-
-The path_name of the script to use is by default:
- /sbin/osd_login.
-This name can be overridden by the Kernel module parameter:
- objlayoutdriver.osd_login_prog
-
-If Kernel does not find the osd_login_prog path it will zero it out
-and will not attempt farther logins. An admin can then write new value
-to the objlayoutdriver.osd_login_prog Kernel parameter to re-enable it.
-
-The /sbin/osd_login is part of the nfs-utils package, and should usually
-be installed on distributions that support this Kernel version.
-
-The API to the login script is as follows:
- Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID>
- Options:
- -u target uri e.g. iscsi://<ip>:<port>
- (always exists)
- (More protocols can be defined in the future.
- The client does not interpret this string it is
- passed unchanged as received from the Server)
- -o osdname of the requested target OSD
- (Might be empty)
- (A string which denotes the OSD name, there is a
- limit of 64 chars on this string)
- -s systemid of the requested target OSD
- (Might be empty)
- (This string, if not empty is always an hex
- representation of the 20 bytes osd_system_id)
-
blocks-layout setup
-------------------
diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index f31fd0dd92c6..69d02cf8cf37 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -123,11 +123,6 @@ config PNFS_BLOCK
depends on NFS_V4_1 && BLK_DEV_DM
default NFS_V4
-config PNFS_OBJLAYOUT
- tristate
- depends on NFS_V4_1 && SCSI_OSD_ULD
- default NFS_V4
-
config PNFS_FLEXFILE_LAYOUT
tristate
depends on NFS_V4_1 && NFS_V3
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 6abdda209642..98f4e5728a67 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -31,6 +31,5 @@ nfsv4-$(CONFIG_NFS_V4_1) += pnfs.o pnfs_dev.o pnfs_nfs.o
nfsv4-$(CONFIG_NFS_V4_2) += nfs42proc.o
obj-$(CONFIG_PNFS_FILE_LAYOUT) += filelayout/
-obj-$(CONFIG_PNFS_OBJLAYOUT) += objlayout/
obj-$(CONFIG_PNFS_BLOCK) += blocklayout/
obj-$(CONFIG_PNFS_FLEXFILE_LAYOUT) += flexfilelayout/
diff --git a/fs/nfs/objlayout/Kbuild b/fs/nfs/objlayout/Kbuild
deleted file mode 100644
index ed30ea072bb8..000000000000
--- a/fs/nfs/objlayout/Kbuild
+++ /dev/null
@@ -1,5 +0,0 @@
-#
-# Makefile for the pNFS Objects Layout Driver kernel module
-#
-objlayoutdriver-y := objio_osd.o pnfs_osd_xdr_cli.o objlayout.o
-obj-$(CONFIG_PNFS_OBJLAYOUT) += objlayoutdriver.o
diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
deleted file mode 100644
index 049c1b1f2932..000000000000
--- a/fs/nfs/objlayout/objio_osd.c
+++ /dev/null
@@ -1,675 +0,0 @@
-/*
- * pNFS Objects layout implementation over open-osd initiator library
- *
- * Copyright (C) 2009 Panasas Inc. [year of first publication]
- * All rights reserved.
- *
- * Benny Halevy <bhalevy@panasas.com>
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2
- * See the file COPYING included with this distribution for more details.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. Neither the name of the Panasas company nor the names of its
- * contributors may be used to endorse or promote products derived
- * from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
- * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <linux/module.h>
-#include <scsi/osd_ore.h>
-
-#include "objlayout.h"
-#include "../internal.h"
-
-#define NFSDBG_FACILITY NFSDBG_PNFS_LD
-
-struct objio_dev_ent {
- struct nfs4_deviceid_node id_node;
- struct ore_dev od;
-};
-
-static void
-objio_free_deviceid_node(struct nfs4_deviceid_node *d)
-{
- struct objio_dev_ent *de = container_of(d, struct objio_dev_ent, id_node);
-
- dprintk("%s: free od=%p\n", __func__, de->od.od);
- osduld_put_device(de->od.od);
- kfree_rcu(d, rcu);
-}
-
-struct objio_segment {
- struct pnfs_layout_segment lseg;
-
- struct ore_layout layout;
- struct ore_components oc;
-};
-
-static inline struct objio_segment *
-OBJIO_LSEG(struct pnfs_layout_segment *lseg)
-{
- return container_of(lseg, struct objio_segment, lseg);
-}
-
-struct objio_state {
- /* Generic layer */
- struct objlayout_io_res oir;
-
- bool sync;
- /*FIXME: Support for extra_bytes at ore_get_rw_state() */
- struct ore_io_state *ios;
-};
-
-/* Send and wait for a get_device_info of devices in the layout,
- then look them up with the osd_initiator library */
-struct nfs4_deviceid_node *
-objio_alloc_deviceid_node(struct nfs_server *server, struct pnfs_device *pdev,
- gfp_t gfp_flags)
-{
- struct pnfs_osd_deviceaddr *deviceaddr;
- struct objio_dev_ent *ode = NULL;
- struct osd_dev *od;
- struct osd_dev_info odi;
- bool retry_flag = true;
- __be32 *p;
- int err;
-
- deviceaddr = kzalloc(sizeof(*deviceaddr), gfp_flags);
- if (!deviceaddr)
- return NULL;
-
- p = page_address(pdev->pages[0]);
- pnfs_osd_xdr_decode_deviceaddr(deviceaddr, p);
-
- odi.systemid_len = deviceaddr->oda_systemid.len;
- if (odi.systemid_len > sizeof(odi.systemid)) {
- dprintk("%s: odi.systemid_len > sizeof(systemid=%zd)\n",
- __func__, sizeof(odi.systemid));
- err = -EINVAL;
- goto out;
- } else if (odi.systemid_len)
- memcpy(odi.systemid, deviceaddr->oda_systemid.data,
- odi.systemid_len);
- odi.osdname_len = deviceaddr->oda_osdname.len;
- odi.osdname = (u8 *)deviceaddr->oda_osdname.data;
-
- if (!odi.osdname_len && !odi.systemid_len) {
- dprintk("%s: !odi.osdname_len && !odi.systemid_len\n",
- __func__);
- err = -ENODEV;
- goto out;
- }
-
-retry_lookup:
- od = osduld_info_lookup(&odi);
- if (IS_ERR(od)) {
- err = PTR_ERR(od);
- dprintk("%s: osduld_info_lookup => %d\n", __func__, err);
- if (err == -ENODEV && retry_flag) {
- err = objlayout_autologin(deviceaddr);
- if (likely(!err)) {
- retry_flag = false;
- goto retry_lookup;
- }
- }
- goto out;
- }
-
- dprintk("Adding new dev_id(%llx:%llx)\n",
- _DEVID_LO(&pdev->dev_id), _DEVID_HI(&pdev->dev_id));
-
- ode = kzalloc(sizeof(*ode), gfp_flags);
- if (!ode) {
- dprintk("%s: -ENOMEM od=%p\n", __func__, od);
- goto out;
- }
-
- nfs4_init_deviceid_node(&ode->id_node, server, &pdev->dev_id);
- kfree(deviceaddr);
-
- ode->od.od = od;
- return &ode->id_node;
-
-out:
- kfree(deviceaddr);
- return NULL;
-}
-
-static void copy_single_comp(struct ore_components *oc, unsigned c,
- struct pnfs_osd_object_cred *src_comp)
-{
- struct ore_comp *ocomp = &oc->comps[c];
-
- WARN_ON(src_comp->oc_cap_key.cred_len > 0); /* libosd is NO_SEC only */
- WARN_ON(src_comp->oc_cap.cred_len > sizeof(ocomp->cred));
-
- ocomp->obj.partition = src_comp->oc_object_id.oid_partition_id;
- ocomp->obj.id = src_comp->oc_object_id.oid_object_id;
-
- memcpy(ocomp->cred, src_comp->oc_cap.cred, sizeof(ocomp->cred));
-}
-
-static int __alloc_objio_seg(unsigned numdevs, gfp_t gfp_flags,
- struct objio_segment **pseg)
-{
-/* This is the in memory structure of the objio_segment
- *
- * struct __alloc_objio_segment {
- * struct objio_segment olseg;
- * struct ore_dev *ods[numdevs];
- * struct ore_comp comps[numdevs];
- * } *aolseg;
- * NOTE: The code as above compiles and runs perfectly. It is elegant,
- * type safe and compact. At some Past time Linus has decided he does not
- * like variable length arrays, For the sake of this principal we uglify
- * the code as below.
- */
- struct objio_segment *lseg;
- size_t lseg_size = sizeof(*lseg) +
- numdevs * sizeof(lseg->oc.ods[0]) +
- numdevs * sizeof(*lseg->oc.comps);
-
- lseg = kzalloc(lseg_size, gfp_flags);
- if (unlikely(!lseg)) {
- dprintk("%s: Failed allocation numdevs=%d size=%zd\n", __func__,
- numdevs, lseg_size);
- return -ENOMEM;
- }
-
- lseg->oc.numdevs = numdevs;
- lseg->oc.single_comp = EC_MULTPLE_COMPS;
- lseg->oc.ods = (void *)(lseg + 1);
- lseg->oc.comps = (void *)(lseg->oc.ods + numdevs);
-
- *pseg = lseg;
- return 0;
-}
-
-int objio_alloc_lseg(struct pnfs_layout_segment **outp,
- struct pnfs_layout_hdr *pnfslay,
- struct pnfs_layout_range *range,
- struct xdr_stream *xdr,
- gfp_t gfp_flags)
-{
- struct nfs_server *server = NFS_SERVER(pnfslay->plh_inode);
- struct objio_segment *objio_seg;
- struct pnfs_osd_xdr_decode_layout_iter iter;
- struct pnfs_osd_layout layout;
- struct pnfs_osd_object_cred src_comp;
- unsigned cur_comp;
- int err;
-
- err = pnfs_osd_xdr_decode_layout_map(&layout, &iter, xdr);
- if (unlikely(err))
- return err;
-
- err = __alloc_objio_seg(layout.olo_num_comps, gfp_flags, &objio_seg);
- if (unlikely(err))
- return err;
-
- objio_seg->layout.stripe_unit = layout.olo_map.odm_stripe_unit;
- objio_seg->layout.group_width = layout.olo_map.odm_group_width;
- objio_seg->layout.group_depth = layout.olo_map.odm_group_depth;
- objio_seg->layout.mirrors_p1 = layout.olo_map.odm_mirror_cnt + 1;
- objio_seg->layout.raid_algorithm = layout.olo_map.odm_raid_algorithm;
-
- err = ore_verify_layout(layout.olo_map.odm_num_comps,
- &objio_seg->layout);
- if (unlikely(err))
- goto err;
-
- objio_seg->oc.first_dev = layout.olo_comps_index;
- cur_comp = 0;
- while (pnfs_osd_xdr_decode_layout_comp(&src_comp, &iter, xdr, &err)) {
- struct nfs4_deviceid_node *d;
- struct objio_dev_ent *ode;
-
- copy_single_comp(&objio_seg->oc, cur_comp, &src_comp);
-
- d = nfs4_find_get_deviceid(server,
- &src_comp.oc_object_id.oid_device_id,
- pnfslay->plh_lc_cred, gfp_flags);
- if (!d) {
- err = -ENXIO;
- goto err;
- }
-
- ode = container_of(d, struct objio_dev_ent, id_node);
- objio_seg->oc.ods[cur_comp++] = &ode->od;
- }
- /* pnfs_osd_xdr_decode_layout_comp returns false on error */
- if (unlikely(err))
- goto err;
-
- *outp = &objio_seg->lseg;
- return 0;
-
-err:
- kfree(objio_seg);
- dprintk("%s: Error: return %d\n", __func__, err);
- *outp = NULL;
- return err;
-}
-
-void objio_free_lseg(struct pnfs_layout_segment *lseg)
-{
- int i;
- struct objio_segment *objio_seg = OBJIO_LSEG(lseg);
-
- for (i = 0; i < objio_seg->oc.numdevs; i++) {
- struct ore_dev *od = objio_seg->oc.ods[i];
- struct objio_dev_ent *ode;
-
- if (!od)
- break;
- ode = container_of(od, typeof(*ode), od);
- nfs4_put_deviceid_node(&ode->id_node);
- }
- kfree(objio_seg);
-}
-
-static int
-objio_alloc_io_state(struct pnfs_layout_hdr *pnfs_layout_type, bool is_reading,
- struct pnfs_layout_segment *lseg, struct page **pages, unsigned pgbase,
- loff_t offset, size_t count, void *rpcdata, gfp_t gfp_flags,
- struct objio_state **outp)
-{
- struct objio_segment *objio_seg = OBJIO_LSEG(lseg);
- struct ore_io_state *ios;
- int ret;
- struct __alloc_objio_state {
- struct objio_state objios;
- struct pnfs_osd_ioerr ioerrs[objio_seg->oc.numdevs];
- } *aos;
-
- aos = kzalloc(sizeof(*aos), gfp_flags);
- if (unlikely(!aos))
- return -ENOMEM;
-
- objlayout_init_ioerrs(&aos->objios.oir, objio_seg->oc.numdevs,
- aos->ioerrs, rpcdata, pnfs_layout_type);
-
- ret = ore_get_rw_state(&objio_seg->layout, &objio_seg->oc, is_reading,
- offset, count, &ios);
- if (unlikely(ret)) {
- kfree(aos);
- return ret;
- }
-
- ios->pages = pages;
- ios->pgbase = pgbase;
- ios->private = aos;
- BUG_ON(ios->nr_pages > (pgbase + count + PAGE_SIZE - 1) >> PAGE_SHIFT);
-
- aos->objios.sync = 0;
- aos->objios.ios = ios;
- *outp = &aos->objios;
- return 0;
-}
-
-void objio_free_result(struct objlayout_io_res *oir)
-{
- struct objio_state *objios = container_of(oir, struct objio_state, oir);
-
- ore_put_io_state(objios->ios);
- kfree(objios);
-}
-
-static enum pnfs_osd_errno osd_pri_2_pnfs_err(enum osd_err_priority oep)
-{
- switch (oep) {
- case OSD_ERR_PRI_NO_ERROR:
- return (enum pnfs_osd_errno)0;
-
- case OSD_ERR_PRI_CLEAR_PAGES:
- BUG_ON(1);
- return 0;
-
- case OSD_ERR_PRI_RESOURCE:
- return PNFS_OSD_ERR_RESOURCE;
- case OSD_ERR_PRI_BAD_CRED:
- return PNFS_OSD_ERR_BAD_CRED;
- case OSD_ERR_PRI_NO_ACCESS:
- return PNFS_OSD_ERR_NO_ACCESS;
- case OSD_ERR_PRI_UNREACHABLE:
- return PNFS_OSD_ERR_UNREACHABLE;
- case OSD_ERR_PRI_NOT_FOUND:
- return PNFS_OSD_ERR_NOT_FOUND;
- case OSD_ERR_PRI_NO_SPACE:
- return PNFS_OSD_ERR_NO_SPACE;
- default:
- WARN_ON(1);
- /* fallthrough */
- case OSD_ERR_PRI_EIO:
- return PNFS_OSD_ERR_EIO;
- }
-}
-
-static void __on_dev_error(struct ore_io_state *ios,
- struct ore_dev *od, unsigned dev_index, enum osd_err_priority oep,
- u64 dev_offset, u64 dev_len)
-{
- struct objio_state *objios = ios->private;
- struct pnfs_osd_objid pooid;
- struct objio_dev_ent *ode = container_of(od, typeof(*ode), od);
- /* FIXME: what to do with more-then-one-group layouts. We need to
- * translate from ore_io_state index to oc->comps index
- */
- unsigned comp = dev_index;
-
- pooid.oid_device_id = ode->id_node.deviceid;
- pooid.oid_partition_id = ios->oc->comps[comp].obj.partition;
- pooid.oid_object_id = ios->oc->comps[comp].obj.id;
-
- objlayout_io_set_result(&objios->oir, comp,
- &pooid, osd_pri_2_pnfs_err(oep),
- dev_offset, dev_len, !ios->reading);
-}
-
-/*
- * read
- */
-static void _read_done(struct ore_io_state *ios, void *private)
-{
- struct objio_state *objios = private;
- ssize_t status;
- int ret = ore_check_io(ios, &__on_dev_error);
-
- /* FIXME: _io_free(ios) can we dealocate the libosd resources; */
-
- if (likely(!ret))
- status = ios->length;
- else
- status = ret;
-
- objlayout_read_done(&objios->oir, status, objios->sync);
-}
-
-int objio_read_pagelist(struct nfs_pgio_header *hdr)
-{
- struct objio_state *objios;
- int ret;
-
- ret = objio_alloc_io_state(NFS_I(hdr->inode)->layout, true,
- hdr->lseg, hdr->args.pages, hdr->args.pgbase,
- hdr->args.offset, hdr->args.count, hdr,
- GFP_KERNEL, &objios);
- if (unlikely(ret))
- return ret;
-
- objios->ios->done = _read_done;
- dprintk("%s: offset=0x%llx length=0x%x\n", __func__,
- hdr->args.offset, hdr->args.count);
- ret = ore_read(objios->ios);
- if (unlikely(ret))
- objio_free_result(&objios->oir);
- return ret;
-}
-
-/*
- * write
- */
-static void _write_done(struct ore_io_state *ios, void *private)
-{
- struct objio_state *objios = private;
- ssize_t status;
- int ret = ore_check_io(ios, &__on_dev_error);
-
- /* FIXME: _io_free(ios) can we dealocate the libosd resources; */
-
- if (likely(!ret)) {
- /* FIXME: should be based on the OSD's persistence model
- * See OSD2r05 Section 4.13 Data persistence model */
- objios->oir.committed = NFS_FILE_SYNC;
- status = ios->length;
- } else {
- status = ret;
- }
-
- objlayout_write_done(&objios->oir, status, objios->sync);
-}
-
-static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
-{
- struct objio_state *objios = priv;
- struct nfs_pgio_header *hdr = objios->oir.rpcdata;
- struct address_space *mapping = hdr->inode->i_mapping;
- pgoff_t index = offset / PAGE_SIZE;
- struct page *page;
- loff_t i_size = i_size_read(hdr->inode);
-
- if (offset >= i_size) {
- *uptodate = true;
- dprintk("%s: g_zero_page index=0x%lx\n", __func__, index);
- return ZERO_PAGE(0);
- }
-
- page = find_get_page(mapping, index);
- if (!page) {
- page = find_or_create_page(mapping, index, GFP_NOFS);
- if (unlikely(!page)) {
- dprintk("%s: grab_cache_page Failed index=0x%lx\n",
- __func__, index);
- return NULL;
- }
- unlock_page(page);
- }
- *uptodate = PageUptodate(page);
- dprintk("%s: index=0x%lx uptodate=%d\n", __func__, index, *uptodate);
- return page;
-}
-
-static void __r4w_put_page(void *priv, struct page *page)
-{
- dprintk("%s: index=0x%lx\n", __func__,
- (page == ZERO_PAGE(0)) ? -1UL : page->index);
- if (ZERO_PAGE(0) != page)
- put_page(page);
- return;
-}
-
-static const struct _ore_r4w_op _r4w_op = {
- .get_page = &__r4w_get_page,
- .put_page = &__r4w_put_page,
-};
-
-int objio_write_pagelist(struct nfs_pgio_header *hdr, int how)
-{
- struct objio_state *objios;
- int ret;
-
- ret = objio_alloc_io_state(NFS_I(hdr->inode)->layout, false,
- hdr->lseg, hdr->args.pages, hdr->args.pgbase,
- hdr->args.offset, hdr->args.count, hdr, GFP_NOFS,
- &objios);
- if (unlikely(ret))
- return ret;
-
- objios->sync = 0 != (how & FLUSH_SYNC);
- objios->ios->r4w = &_r4w_op;
-
- if (!objios->sync)
- objios->ios->done = _write_done;
-
- dprintk("%s: offset=0x%llx length=0x%x\n", __func__,
- hdr->args.offset, hdr->args.count);
- ret = ore_write(objios->ios);
- if (unlikely(ret)) {
- objio_free_result(&objios->oir);
- return ret;
- }
-
- if (objios->sync)
- _write_done(objios->ios, objios);
-
- return 0;
-}
-
-/*
- * Return 0 if @req cannot be coalesced into @pgio, otherwise return the number
- * of bytes (maximum @req->wb_bytes) that can be coalesced.
- */
-static size_t objio_pg_test(struct nfs_pageio_descriptor *pgio,
- struct nfs_page *prev, struct nfs_page *req)
-{
- struct nfs_pgio_mirror *mirror = nfs_pgio_current_mirror(pgio);
- unsigned int size;
-
- size = pnfs_generic_pg_test(pgio, prev, req);
-
- if (!size || mirror->pg_count + req->wb_bytes >
- (unsigned long)pgio->pg_layout_private)
- return 0;
-
- return min(size, req->wb_bytes);
-}
-
-static void objio_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
-{
- pnfs_generic_pg_init_read(pgio, req);
- if (unlikely(pgio->pg_lseg == NULL))
- return; /* Not pNFS */
-
- pgio->pg_layout_private = (void *)
- OBJIO_LSEG(pgio->pg_lseg)->layout.max_io_length;
-}
-
-static bool aligned_on_raid_stripe(u64 offset, struct ore_layout *layout,
- unsigned long *stripe_end)
-{
- u32 stripe_off;
- unsigned stripe_size;
-
- if (layout->raid_algorithm == PNFS_OSD_RAID_0)
- return true;
-
- stripe_size = layout->stripe_unit *
- (layout->group_width - layout->parity);
-
- div_u64_rem(offset, stripe_size, &stripe_off);
- if (!stripe_off)
- return true;
-
- *stripe_end = stripe_size - stripe_off;
- return false;
-}
-
-static void objio_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
-{
- unsigned long stripe_end = 0;
- u64 wb_size;
-
- if (pgio->pg_dreq == NULL)
- wb_size = i_size_read(pgio->pg_inode) - req_offset(req);
- else
- wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
-
- pnfs_generic_pg_init_write(pgio, req, wb_size);
- if (unlikely(pgio->pg_lseg == NULL))
- return; /* Not pNFS */
-
- if (req->wb_offset ||
- !aligned_on_raid_stripe(req->wb_index * PAGE_SIZE,
- &OBJIO_LSEG(pgio->pg_lseg)->layout,
- &stripe_end)) {
- pgio->pg_layout_private = (void *)stripe_end;
- } else {
- pgio->pg_layout_private = (void *)
- OBJIO_LSEG(pgio->pg_lseg)->layout.max_io_length;
- }
-}
-
-static const struct nfs_pageio_ops objio_pg_read_ops = {
- .pg_init = objio_init_read,
- .pg_test = objio_pg_test,
- .pg_doio = pnfs_generic_pg_readpages,
- .pg_cleanup = pnfs_generic_pg_cleanup,
-};
-
-static const struct nfs_pageio_ops objio_pg_write_ops = {
- .pg_init = objio_init_write,
- .pg_test = objio_pg_test,
- .pg_doio = pnfs_generic_pg_writepages,
- .pg_cleanup = pnfs_generic_pg_cleanup,
-};
-
-static struct pnfs_layoutdriver_type objlayout_type = {
- .id = LAYOUT_OSD2_OBJECTS,
- .name = "LAYOUT_OSD2_OBJECTS",
- .flags = PNFS_LAYOUTRET_ON_SETATTR |
- PNFS_LAYOUTRET_ON_ERROR,
-
- .max_deviceinfo_size = PAGE_SIZE,
- .owner = THIS_MODULE,
- .alloc_layout_hdr = objlayout_alloc_layout_hdr,
- .free_layout_hdr = objlayout_free_layout_hdr,
-
- .alloc_lseg = objlayout_alloc_lseg,
- .free_lseg = objlayout_free_lseg,
-
- .read_pagelist = objlayout_read_pagelist,
- .write_pagelist = objlayout_write_pagelist,
- .pg_read_ops = &objio_pg_read_ops,
- .pg_write_ops = &objio_pg_write_ops,
-
- .sync = pnfs_generic_sync,
-
- .free_deviceid_node = objio_free_deviceid_node,
-
- .encode_layoutcommit = objlayout_encode_layoutcommit,
- .encode_layoutreturn = objlayout_encode_layoutreturn,
-};
-
-MODULE_DESCRIPTION("pNFS Layout Driver for OSD2 objects");
-MODULE_AUTHOR("Benny Halevy <bhalevy@panasas.com>");
-MODULE_LICENSE("GPL");
-
-static int __init
-objlayout_init(void)
-{
- int ret = pnfs_register_layoutdriver(&objlayout_type);
-
- if (ret)
- printk(KERN_INFO
- "NFS: %s: Registering OSD pNFS Layout Driver failed: error=%d\n",
- __func__, ret);
- else
- printk(KERN_INFO "NFS: %s: Registered OSD pNFS Layout Driver\n",
- __func__);
- return ret;
-}
-
-static void __exit
-objlayout_exit(void)
-{
- pnfs_unregister_layoutdriver(&objlayout_type);
- printk(KERN_INFO "NFS: %s: Unregistered OSD pNFS Layout Driver\n",
- __func__);
-}
-
-MODULE_ALIAS("nfs-layouttype4-2");
-
-module_init(objlayout_init);
-module_exit(objlayout_exit);
diff --git a/fs/nfs/objlayout/objlayout.c b/fs/nfs/objlayout/objlayout.c
deleted file mode 100644
index 8f3d2acb81c3..000000000000
--- a/fs/nfs/objlayout/objlayout.c
+++ /dev/null
@@ -1,706 +0,0 @@
-/*
- * pNFS Objects layout driver high level definitions
- *
- * Copyright (C) 2007 Panasas Inc. [year of first publication]
- * All rights reserved.
- *
- * Benny Halevy <bhalevy@panasas.com>
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2
- * See the file COPYING included with this distribution for more details.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. Neither the name of the Panasas company nor the names of its
- * contributors may be used to endorse or promote products derived
- * from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
- * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <linux/kmod.h>
-#include <linux/moduleparam.h>
-#include <linux/ratelimit.h>
-#include <scsi/osd_initiator.h>
-#include "objlayout.h"
-
-#define NFSDBG_FACILITY NFSDBG_PNFS_LD
-/*
- * Create a objlayout layout structure for the given inode and return it.
- */
-struct pnfs_layout_hdr *
-objlayout_alloc_layout_hdr(struct inode *inode, gfp_t gfp_flags)
-{
- struct objlayout *objlay;
-
- objlay = kzalloc(sizeof(struct objlayout), gfp_flags);
- if (!objlay)
- return NULL;
- spin_lock_init(&objlay->lock);
- INIT_LIST_HEAD(&objlay->err_list);
- dprintk("%s: Return %p\n", __func__, objlay);
- return &objlay->pnfs_layout;
-}
-
-/*
- * Free an objlayout layout structure
- */
-void
-objlayout_free_layout_hdr(struct pnfs_layout_hdr *lo)
-{
- struct objlayout *objlay = OBJLAYOUT(lo);
-
- dprintk("%s: objlay %p\n", __func__, objlay);
-
- WARN_ON(!list_empty(&objlay->err_list));
- kfree(objlay);
-}
-
-/*
- * Unmarshall layout and store it in pnfslay.
- */
-struct pnfs_layout_segment *
-objlayout_alloc_lseg(struct pnfs_layout_hdr *pnfslay,
- struct nfs4_layoutget_res *lgr,
- gfp_t gfp_flags)
-{
- int status = -ENOMEM;
- struct xdr_stream stream;
- struct xdr_buf buf = {
- .pages = lgr->layoutp->pages,
- .page_len = lgr->layoutp->len,
- .buflen = lgr->layoutp->len,
- .len = lgr->layoutp->len,
- };
- struct page *scratch;
- struct pnfs_layout_segment *lseg;
-
- dprintk("%s: Begin pnfslay %p\n", __func__, pnfslay);
-
- scratch = alloc_page(gfp_flags);
- if (!scratch)
- goto err_nofree;
-
- xdr_init_decode(&stream, &buf, NULL);
- xdr_set_scratch_buffer(&stream, page_address(scratch), PAGE_SIZE);
-
- status = objio_alloc_lseg(&lseg, pnfslay, &lgr->range, &stream, gfp_flags);
- if (unlikely(status)) {
- dprintk("%s: objio_alloc_lseg Return err %d\n", __func__,
- status);
- goto err;
- }
-
- __free_page(scratch);
-
- dprintk("%s: Return %p\n", __func__, lseg);
- return lseg;
-
-err:
- __free_page(scratch);
-err_nofree:
- dprintk("%s: Err Return=>%d\n", __func__, status);
- return ERR_PTR(status);
-}
-
-/*
- * Free a layout segement
- */
-void
-objlayout_free_lseg(struct pnfs_layout_segment *lseg)
-{
- dprintk("%s: freeing layout segment %p\n", __func__, lseg);
-
- if (unlikely(!lseg))
- return;
-
- objio_free_lseg(lseg);
-}
-
-/*
- * I/O Operations
- */
-static inline u64
-end_offset(u64 start, u64 len)
-{
- u64 end;
-
- end = start + len;
- return end >= start ? end : NFS4_MAX_UINT64;
-}
-
-static void _fix_verify_io_params(struct pnfs_layout_segment *lseg,
- struct page ***p_pages, unsigned *p_pgbase,
- u64 offset, unsigned long count)
-{
- u64 lseg_end_offset;
-
- BUG_ON(offset < lseg->pls_range.offset);
- lseg_end_offset = end_offset(lseg->pls_range.offset,
- lseg->pls_range.length);
- BUG_ON(offset >= lseg_end_offset);
- WARN_ON(offset + count > lseg_end_offset);
-
- if (*p_pgbase > PAGE_SIZE) {
- dprintk("%s: pgbase(0x%x) > PAGE_SIZE\n", __func__, *p_pgbase);
- *p_pages += *p_pgbase >> PAGE_SHIFT;
- *p_pgbase &= ~PAGE_MASK;
- }
-}
-
-/*
- * I/O done common code
- */
-static void
-objlayout_iodone(struct objlayout_io_res *oir)
-{
- if (likely(oir->status >= 0)) {
- objio_free_result(oir);
- } else {
- struct objlayout *objlay = oir->objlay;
-
- spin_lock(&objlay->lock);
- objlay->delta_space_valid = OBJ_DSU_INVALID;
- list_add(&objlay->err_list, &oir->err_list);
- spin_unlock(&objlay->lock);
- }
-}
-
-/*
- * objlayout_io_set_result - Set an osd_error code on a specific osd comp.
- *
- * The @index component IO failed (error returned from target). Register
- * the error for later reporting at layout-return.
- */
-void
-objlayout_io_set_result(struct objlayout_io_res *oir, unsigned index,
- struct pnfs_osd_objid *pooid, int osd_error,
- u64 offset, u64 length, bool is_write)
-{
- struct pnfs_osd_ioerr *ioerr = &oir->ioerrs[index];
-
- BUG_ON(index >= oir->num_comps);
- if (osd_error) {
- ioerr->oer_component = *pooid;
- ioerr->oer_comp_offset = offset;
- ioerr->oer_comp_length = length;
- ioerr->oer_iswrite = is_write;
- ioerr->oer_errno = osd_error;
-
- dprintk("%s: err[%d]: errno=%d is_write=%d dev(%llx:%llx) "
- "par=0x%llx obj=0x%llx offset=0x%llx length=0x%llx\n",
- __func__, index, ioerr->oer_errno,
- ioerr->oer_iswrite,
- _DEVID_LO(&ioerr->oer_component.oid_device_id),
- _DEVID_HI(&ioerr->oer_component.oid_device_id),
- ioerr->oer_component.oid_partition_id,
- ioerr->oer_component.oid_object_id,
- ioerr->oer_comp_offset,
- ioerr->oer_comp_length);
- } else {
- /* User need not call if no error is reported */
- ioerr->oer_errno = 0;
- }
-}
-
-/* Function scheduled on rpc workqueue to call ->nfs_readlist_complete().
- * This is because the osd completion is called with ints-off from
- * the block layer
- */
-static void _rpc_read_complete(struct work_struct *work)
-{
- struct rpc_task *task;
- struct nfs_pgio_header *hdr;
-
- dprintk("%s enter\n", __func__);
- task = container_of(work, struct rpc_task, u.tk_work);
- hdr = container_of(task, struct nfs_pgio_header, task);
-
- pnfs_ld_read_done(hdr);
-}
-
-void
-objlayout_read_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
-{
- struct nfs_pgio_header *hdr = oir->rpcdata;
-
- oir->status = hdr->task.tk_status = status;
- if (status >= 0)
- hdr->res.count = status;
- else
- hdr->pnfs_error = status;
- objlayout_iodone(oir);
- /* must not use oir after this point */
-
- dprintk("%s: Return status=%zd eof=%d sync=%d\n", __func__,
- status, hdr->res.eof, sync);
-
- if (sync)
- pnfs_ld_read_done(hdr);
- else {
- INIT_WORK(&hdr->task.u.tk_work, _rpc_read_complete);
- schedule_work(&hdr->task.u.tk_work);
- }
-}
-
-/*
- * Perform sync or async reads.
- */
-enum pnfs_try_status
-objlayout_read_pagelist(struct nfs_pgio_header *hdr)
-{
- struct inode *inode = hdr->inode;
- loff_t offset = hdr->args.offset;
- size_t count = hdr->args.count;
- int err;
- loff_t eof;
-
- eof = i_size_read(inode);
- if (unlikely(offset + count > eof)) {
- if (offset >= eof) {
- err = 0;
- hdr->res.count = 0;
- hdr->res.eof = 1;
- /*FIXME: do we need to call pnfs_ld_read_done() */
- goto out;
- }
- count = eof - offset;
- }
-
- hdr->res.eof = (offset + count) >= eof;
- _fix_verify_io_params(hdr->lseg, &hdr->args.pages,
- &hdr->args.pgbase,
- hdr->args.offset, hdr->args.count);
-
- dprintk("%s: inode(%lx) offset 0x%llx count 0x%zx eof=%d\n",
- __func__, inode->i_ino, offset, count, hdr->res.eof);
-
- err = objio_read_pagelist(hdr);
- out:
- if (unlikely(err)) {
- hdr->pnfs_error = err;
- dprintk("%s: Returned Error %d\n", __func__, err);
- return PNFS_NOT_ATTEMPTED;
- }
- return PNFS_ATTEMPTED;
-}
-
-/* Function scheduled on rpc workqueue to call ->nfs_writelist_complete().
- * This is because the osd completion is called with ints-off from
- * the block layer
- */
-static void _rpc_write_complete(struct work_struct *work)
-{
- struct rpc_task *task;
- struct nfs_pgio_header *hdr;
-
- dprintk("%s enter\n", __func__);
- task = container_of(work, struct rpc_task, u.tk_work);
- hdr = container_of(task, struct nfs_pgio_header, task);
-
- pnfs_ld_write_done(hdr);
-}
-
-void
-objlayout_write_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
-{
- struct nfs_pgio_header *hdr = oir->rpcdata;
-
- oir->status = hdr->task.tk_status = status;
- if (status >= 0) {
- hdr->res.count = status;
- hdr->verf.committed = oir->committed;
- } else {
- hdr->pnfs_error = status;
- }
- objlayout_iodone(oir);
- /* must not use oir after this point */
-
- dprintk("%s: Return status %zd committed %d sync=%d\n", __func__,
- status, hdr->verf.committed, sync);
-
- if (sync)
- pnfs_ld_write_done(hdr);
- else {
- INIT_WORK(&hdr->task.u.tk_work, _rpc_write_complete);
- schedule_work(&hdr->task.u.tk_work);
- }
-}
-
-/*
- * Perform sync or async writes.
- */
-enum pnfs_try_status
-objlayout_write_pagelist(struct nfs_pgio_header *hdr, int how)
-{
- int err;
-
- _fix_verify_io_params(hdr->lseg, &hdr->args.pages,
- &hdr->args.pgbase,
- hdr->args.offset, hdr->args.count);
-
- err = objio_write_pagelist(hdr, how);
- if (unlikely(err)) {
- hdr->pnfs_error = err;
- dprintk("%s: Returned Error %d\n", __func__, err);
- return PNFS_NOT_ATTEMPTED;
- }
- return PNFS_ATTEMPTED;
-}
-
-void
-objlayout_encode_layoutcommit(struct pnfs_layout_hdr *pnfslay,
- struct xdr_stream *xdr,
- const struct nfs4_layoutcommit_args *args)
-{
- struct objlayout *objlay = OBJLAYOUT(pnfslay);
- struct pnfs_osd_layoutupdate lou;
- __be32 *start;
-
- dprintk("%s: Begin\n", __func__);
-
- spin_lock(&objlay->lock);
- lou.dsu_valid = (objlay->delta_space_valid == OBJ_DSU_VALID);
- lou.dsu_delta = objlay->delta_space_used;
- objlay->delta_space_used = 0;
- objlay->delta_space_valid = OBJ_DSU_INIT;
- lou.olu_ioerr_flag = !list_empty(&objlay->err_list);
- spin_unlock(&objlay->lock);
-
- start = xdr_reserve_space(xdr, 4);
-
- BUG_ON(pnfs_osd_xdr_encode_layoutupdate(xdr, &lou));
-
- *start = cpu_to_be32((xdr->p - start - 1) * 4);
-
- dprintk("%s: Return delta_space_used %lld err %d\n", __func__,
- lou.dsu_delta, lou.olu_ioerr_flag);
-}
-
-static int
-err_prio(u32 oer_errno)
-{
- switch (oer_errno) {
- case 0:
- return 0;
-
- case PNFS_OSD_ERR_RESOURCE:
- return OSD_ERR_PRI_RESOURCE;
- case PNFS_OSD_ERR_BAD_CRED:
- return OSD_ERR_PRI_BAD_CRED;
- case PNFS_OSD_ERR_NO_ACCESS:
- return OSD_ERR_PRI_NO_ACCESS;
- case PNFS_OSD_ERR_UNREACHABLE:
- return OSD_ERR_PRI_UNREACHABLE;
- case PNFS_OSD_ERR_NOT_FOUND:
- return OSD_ERR_PRI_NOT_FOUND;
- case PNFS_OSD_ERR_NO_SPACE:
- return OSD_ERR_PRI_NO_SPACE;
- default:
- WARN_ON(1);
- /* fallthrough */
- case PNFS_OSD_ERR_EIO:
- return OSD_ERR_PRI_EIO;
- }
-}
-
-static void
-merge_ioerr(struct pnfs_osd_ioerr *dest_err,
- const struct pnfs_osd_ioerr *src_err)
-{
- u64 dest_end, src_end;
-
- if (!dest_err->oer_errno) {
- *dest_err = *src_err;
- /* accumulated device must be blank */
- memset(&dest_err->oer_component.oid_device_id, 0,
- sizeof(dest_err->oer_component.oid_device_id));
-
- return;
- }
-
- if (dest_err->oer_component.oid_partition_id !=
- src_err->oer_component.oid_partition_id)
- dest_err->oer_component.oid_partition_id = 0;
-
- if (dest_err->oer_component.oid_object_id !=
- src_err->oer_component.oid_object_id)
- dest_err->oer_component.oid_object_id = 0;
-
- if (dest_err->oer_comp_offset > src_err->oer_comp_offset)
- dest_err->oer_comp_offset = src_err->oer_comp_offset;
-
- dest_end = end_offset(dest_err->oer_comp_offset,
- dest_err->oer_comp_length);
- src_end = end_offset(src_err->oer_comp_offset,
- src_err->oer_comp_length);
- if (dest_end < src_end)
- dest_end = src_end;
-
- dest_err->oer_comp_length = dest_end - dest_err->oer_comp_offset;
-
- if ((src_err->oer_iswrite == dest_err->oer_iswrite) &&
- (err_prio(src_err->oer_errno) > err_prio(dest_err->oer_errno))) {
- dest_err->oer_errno = src_err->oer_errno;
- } else if (src_err->oer_iswrite) {
- dest_err->oer_iswrite = true;
- dest_err->oer_errno = src_err->oer_errno;
- }
-}
-
-static void
-encode_accumulated_error(struct objlayout *objlay, __be32 *p)
-{
- struct objlayout_io_res *oir, *tmp;
- struct pnfs_osd_ioerr accumulated_err = {.oer_errno = 0};
-
- list_for_each_entry_safe(oir, tmp, &objlay->err_list, err_list) {
- unsigned i;
-
- for (i = 0; i < oir->num_comps; i++) {
- struct pnfs_osd_ioerr *ioerr = &oir->ioerrs[i];
-
- if (!ioerr->oer_errno)
- continue;
-
- printk(KERN_ERR "NFS: %s: err[%d]: errno=%d "
- "is_write=%d dev(%llx:%llx) par=0x%llx "
- "obj=0x%llx offset=0x%llx length=0x%llx\n",
- __func__, i, ioerr->oer_errno,
- ioerr->oer_iswrite,
- _DEVID_LO(&ioerr->oer_component.oid_device_id),
- _DEVID_HI(&ioerr->oer_component.oid_device_id),
- ioerr->oer_component.oid_partition_id,
- ioerr->oer_component.oid_object_id,
- ioerr->oer_comp_offset,
- ioerr->oer_comp_length);
-
- merge_ioerr(&accumulated_err, ioerr);
- }
- list_del(&oir->err_list);
- objio_free_result(oir);
- }
-
- pnfs_osd_xdr_encode_ioerr(p, &accumulated_err);
-}
-
-void
-objlayout_encode_layoutreturn(struct xdr_stream *xdr,
- const struct nfs4_layoutreturn_args *args)
-{
- struct pnfs_layout_hdr *pnfslay = args->layout;
- struct objlayout *objlay = OBJLAYOUT(pnfslay);
- struct objlayout_io_res *oir, *tmp;
- __be32 *start;
-
- dprintk("%s: Begin\n", __func__);
- start = xdr_reserve_space(xdr, 4);
- BUG_ON(!start);
-
- spin_lock(&objlay->lock);
-
- list_for_each_entry_safe(oir, tmp, &objlay->err_list, err_list) {
- __be32 *last_xdr = NULL, *p;
- unsigned i;
- int res = 0;
-
- for (i = 0; i < oir->num_comps; i++) {
- struct pnfs_osd_ioerr *ioerr = &oir->ioerrs[i];
-
- if (!ioerr->oer_errno)
- continue;
-
- dprintk("%s: err[%d]: errno=%d is_write=%d "
- "dev(%llx:%llx) par=0x%llx obj=0x%llx "
- "offset=0x%llx length=0x%llx\n",
- __func__, i, ioerr->oer_errno,
- ioerr->oer_iswrite,
- _DEVID_LO(&ioerr->oer_component.oid_device_id),
- _DEVID_HI(&ioerr->oer_component.oid_device_id),
- ioerr->oer_component.oid_partition_id,
- ioerr->oer_component.oid_object_id,
- ioerr->oer_comp_offset,
- ioerr->oer_comp_length);
-
- p = pnfs_osd_xdr_ioerr_reserve_space(xdr);
- if (unlikely(!p)) {
- res = -E2BIG;
- break; /* accumulated_error */
- }
-
- last_xdr = p;
- pnfs_osd_xdr_encode_ioerr(p, &oir->ioerrs[i]);
- }
-
- /* TODO: use xdr_write_pages */
- if (unlikely(res)) {
- /* no space for even one error descriptor */
- BUG_ON(!last_xdr);
-
- /* we've encountered a situation with lots and lots of
- * errors and no space to encode them all. Use the last
- * available slot to report the union of all the
- * remaining errors.
- */
- encode_accumulated_error(objlay, last_xdr);
- goto loop_done;
- }
- list_del(&oir->err_list);
- objio_free_result(oir);
- }
-loop_done:
- spin_unlock(&objlay->lock);
-
- *start = cpu_to_be32((xdr->p - start - 1) * 4);
- dprintk("%s: Return\n", __func__);
-}
-
-enum {
- OBJLAYOUT_MAX_URI_LEN = 256, OBJLAYOUT_MAX_OSDNAME_LEN = 64,
- OBJLAYOUT_MAX_SYSID_HEX_LEN = OSD_SYSTEMID_LEN * 2 + 1,
- OSD_LOGIN_UPCALL_PATHLEN = 256
-};
-
-static char osd_login_prog[OSD_LOGIN_UPCALL_PATHLEN] = "/sbin/osd_login";
-
-module_param_string(osd_login_prog, osd_login_prog, sizeof(osd_login_prog),
- 0600);
-MODULE_PARM_DESC(osd_login_prog, "Path to the osd_login upcall program");
-
-struct __auto_login {
- char uri[OBJLAYOUT_MAX_URI_LEN];
- char osdname[OBJLAYOUT_MAX_OSDNAME_LEN];
- char systemid_hex[OBJLAYOUT_MAX_SYSID_HEX_LEN];
-};
-
-static int __objlayout_upcall(struct __auto_login *login)
-{
- static char *envp[] = { "HOME=/",
- "TERM=linux",
- "PATH=/sbin:/usr/sbin:/bin:/usr/bin",
- NULL
- };
- char *argv[8];
- int ret;
-
- if (unlikely(!osd_login_prog[0])) {
- dprintk("%s: osd_login_prog is disabled\n", __func__);
- return -EACCES;
- }
-
- dprintk("%s uri: %s\n", __func__, login->uri);
- dprintk("%s osdname %s\n", __func__, login->osdname);
- dprintk("%s systemid_hex %s\n", __func__, login->systemid_hex);
-
- argv[0] = (char *)osd_login_prog;
- argv[1] = "-u";
- argv[2] = login->uri;
- argv[3] = "-o";
- argv[4] = login->osdname;
- argv[5] = "-s";
- argv[6] = login->systemid_hex;
- argv[7] = NULL;
-
- ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
- /*
- * Disable the upcall mechanism if we're getting an ENOENT or
- * EACCES error. The admin can re-enable it on the fly by using
- * sysfs to set the objlayoutdriver.osd_login_prog module parameter once
- * the problem has been fixed.
- */
- if (ret == -ENOENT || ret == -EACCES) {
- printk(KERN_ERR "PNFS-OBJ: %s was not found please set "
- "objlayoutdriver.osd_login_prog kernel parameter!\n",
- osd_login_prog);
- osd_login_prog[0] = '\0';
- }
- dprintk("%s %s return value: %d\n", __func__, osd_login_prog, ret);
-
- return ret;
-}
-
-/* Assume dest is all zeros */
-static void __copy_nfsS_and_zero_terminate(struct nfs4_string s,
- char *dest, int max_len,
- const char *var_name)
-{
- if (!s.len)
- return;
-
- if (s.len >= max_len) {
- pr_warn_ratelimited(
- "objlayout_autologin: %s: s.len(%d) >= max_len(%d)",
- var_name, s.len, max_len);
- s.len = max_len - 1; /* space for null terminator */
- }
-
- memcpy(dest, s.data, s.len);
-}
-
-/* Assume sysid is all zeros */
-static void _sysid_2_hex(struct nfs4_string s,
- char sysid[OBJLAYOUT_MAX_SYSID_HEX_LEN])
-{
- int i;
- char *cur;
-
- if (!s.len)
- return;
-
- if (s.len != OSD_SYSTEMID_LEN) {
- pr_warn_ratelimited(
- "objlayout_autologin: systemid_len(%d) != OSD_SYSTEMID_LEN",
- s.len);
- if (s.len > OSD_SYSTEMID_LEN)
- s.len = OSD_SYSTEMID_LEN;
- }
-
- cur = sysid;
- for (i = 0; i < s.len; i++)
- cur = hex_byte_pack(cur, s.data[i]);
-}
-
-int objlayout_autologin(struct pnfs_osd_deviceaddr *deviceaddr)
-{
- int rc;
- struct __auto_login login;
-
- if (!deviceaddr->oda_targetaddr.ota_netaddr.r_addr.len)
- return -ENODEV;
-
- memset(&login, 0, sizeof(login));
- __copy_nfsS_and_zero_terminate(
- deviceaddr->oda_targetaddr.ota_netaddr.r_addr,
- login.uri, sizeof(login.uri), "URI");
-
- __copy_nfsS_and_zero_terminate(
- deviceaddr->oda_osdname,
- login.osdname, sizeof(login.osdname), "OSDNAME");
-
- _sysid_2_hex(deviceaddr->oda_systemid, login.systemid_hex);
-
- rc = __objlayout_upcall(&login);
- if (rc > 0) /* script returns positive values */
- rc = -ENODEV;
-
- return rc;
-}
diff --git a/fs/nfs/objlayout/objlayout.h b/fs/nfs/objlayout/objlayout.h
deleted file mode 100644
index fc94a5872ed4..000000000000
--- a/fs/nfs/objlayout/objlayout.h
+++ /dev/null
@@ -1,183 +0,0 @@
-/*
- * Data types and function declerations for interfacing with the
- * pNFS standard object layout driver.
- *
- * Copyright (C) 2007 Panasas Inc. [year of first publication]
- * All rights reserved.
- *
- * Benny Halevy <bhalevy@panasas.com>
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2
- * See the file COPYING included with this distribution for more details.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. Neither the name of the Panasas company nor the names of its
- * contributors may be used to endorse or promote products derived
- * from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
- * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#ifndef _OBJLAYOUT_H
-#define _OBJLAYOUT_H
-
-#include <linux/nfs_fs.h>
-#include <linux/pnfs_osd_xdr.h>
-#include "../pnfs.h"
-
-/*
- * per-inode layout
- */
-struct objlayout {
- struct pnfs_layout_hdr pnfs_layout;
-
- /* for layout_commit */
- enum osd_delta_space_valid_enum {
- OBJ_DSU_INIT = 0,
- OBJ_DSU_VALID,
- OBJ_DSU_INVALID,
- } delta_space_valid;
- s64 delta_space_used; /* consumed by write ops */
-
- /* for layout_return */
- spinlock_t lock;
- struct list_head err_list;
-};
-
-static inline struct objlayout *
-OBJLAYOUT(struct pnfs_layout_hdr *lo)
-{
- return container_of(lo, struct objlayout, pnfs_layout);
-}
-
-/*
- * per-I/O operation state
- * embedded in objects provider io_state data structure
- */
-struct objlayout_io_res {
- struct objlayout *objlay;
-
- void *rpcdata;
- int status; /* res */
- int committed; /* res */
-
- /* Error reporting (layout_return) */
- struct list_head err_list;
- unsigned num_comps;
- /* Pointer to array of error descriptors of size num_comps.
- * It should contain as many entries as devices in the osd_layout
- * that participate in the I/O. It is up to the io_engine to allocate
- * needed space and set num_comps.
- */
- struct pnfs_osd_ioerr *ioerrs;
-};
-
-static inline
-void objlayout_init_ioerrs(struct objlayout_io_res *oir, unsigned num_comps,
- struct pnfs_osd_ioerr *ioerrs, void *rpcdata,
- struct pnfs_layout_hdr *pnfs_layout_type)
-{
- oir->objlay = OBJLAYOUT(pnfs_layout_type);
- oir->rpcdata = rpcdata;
- INIT_LIST_HEAD(&oir->err_list);
- oir->num_comps = num_comps;
- oir->ioerrs = ioerrs;
-}
-
-/*
- * Raid engine I/O API
- */
-extern int objio_alloc_lseg(struct pnfs_layout_segment **outp,
- struct pnfs_layout_hdr *pnfslay,
- struct pnfs_layout_range *range,
- struct xdr_stream *xdr,
- gfp_t gfp_flags);
-extern void objio_free_lseg(struct pnfs_layout_segment *lseg);
-
-/* objio_free_result will free these @oir structs received from
- * objlayout_{read,write}_done
- */
-extern void objio_free_result(struct objlayout_io_res *oir);
-
-extern int objio_read_pagelist(struct nfs_pgio_header *rdata);
-extern int objio_write_pagelist(struct nfs_pgio_header *wdata, int how);
-
-/*
- * callback API
- */
-extern void objlayout_io_set_result(struct objlayout_io_res *oir,
- unsigned index, struct pnfs_osd_objid *pooid,
- int osd_error, u64 offset, u64 length, bool is_write);
-
-static inline void
-objlayout_add_delta_space_used(struct objlayout *objlay, s64 space_used)
-{
- /* If one of the I/Os errored out and the delta_space_used was
- * invalid we render the complete report as invalid. Protocol mandate
- * the DSU be accurate or not reported.
- */
- spin_lock(&objlay->lock);
- if (objlay->delta_space_valid != OBJ_DSU_INVALID) {
- objlay->delta_space_valid = OBJ_DSU_VALID;
- objlay->delta_space_used += space_used;
- }
- spin_unlock(&objlay->lock);
-}
-
-extern void objlayout_read_done(struct objlayout_io_res *oir,
- ssize_t status, bool sync);
-extern void objlayout_write_done(struct objlayout_io_res *oir,
- ssize_t status, bool sync);
-
-/*
- * exported generic objects function vectors
- */
-
-extern struct pnfs_layout_hdr *objlayout_alloc_layout_hdr(struct inode *, gfp_t gfp_flags);
-extern void objlayout_free_layout_hdr(struct pnfs_layout_hdr *);
-
-extern struct pnfs_layout_segment *objlayout_alloc_lseg(
- struct pnfs_layout_hdr *,
- struct nfs4_layoutget_res *,
- gfp_t gfp_flags);
-extern void objlayout_free_lseg(struct pnfs_layout_segment *);
-
-extern enum pnfs_try_status objlayout_read_pagelist(
- struct nfs_pgio_header *);
-
-extern enum pnfs_try_status objlayout_write_pagelist(
- struct nfs_pgio_header *,
- int how);
-
-extern void objlayout_encode_layoutcommit(
- struct pnfs_layout_hdr *,
- struct xdr_stream *,
- const struct nfs4_layoutcommit_args *);
-
-extern void objlayout_encode_layoutreturn(
- struct xdr_stream *,
- const struct nfs4_layoutreturn_args *);
-
-extern int objlayout_autologin(struct pnfs_osd_deviceaddr *deviceaddr);
-
-#endif /* _OBJLAYOUT_H */
diff --git a/fs/nfs/objlayout/pnfs_osd_xdr_cli.c b/fs/nfs/objlayout/pnfs_osd_xdr_cli.c
deleted file mode 100644
index f093c7ec983b..000000000000
--- a/fs/nfs/objlayout/pnfs_osd_xdr_cli.c
+++ /dev/null
@@ -1,415 +0,0 @@
-/*
- * Object-Based pNFS Layout XDR layer
- *
- * Copyright (C) 2007 Panasas Inc. [year of first publication]
- * All rights reserved.
- *
- * Benny Halevy <bhalevy@panasas.com>
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2
- * See the file COPYING included with this distribution for more details.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. Neither the name of the Panasas company nor the names of its
- * contributors may be used to endorse or promote products derived
- * from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
- * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <linux/pnfs_osd_xdr.h>
-
-#define NFSDBG_FACILITY NFSDBG_PNFS_LD
-
-/*
- * The following implementation is based on RFC5664
- */
-
-/*
- * struct pnfs_osd_objid {
- * struct nfs4_deviceid oid_device_id;
- * u64 oid_partition_id;
- * u64 oid_object_id;
- * }; // xdr size 32 bytes
- */
-static __be32 *
-_osd_xdr_decode_objid(__be32 *p, struct pnfs_osd_objid *objid)
-{
- p = xdr_decode_opaque_fixed(p, objid->oid_device_id.data,
- sizeof(objid->oid_device_id.data));
-
- p = xdr_decode_hyper(p, &objid->oid_partition_id);
- p = xdr_decode_hyper(p, &objid->oid_object_id);
- return p;
-}
-/*
- * struct pnfs_osd_opaque_cred {
- * u32 cred_len;
- * void *cred;
- * }; // xdr size [variable]
- * The return pointers are from the xdr buffer
- */
-static int
-_osd_xdr_decode_opaque_cred(struct pnfs_osd_opaque_cred *opaque_cred,
- struct xdr_stream *xdr)
-{
- __be32 *p = xdr_inline_decode(xdr, 1);
-
- if (!p)
- return -EINVAL;
-
- opaque_cred->cred_len = be32_to_cpu(*p++);
-
- p = xdr_inline_decode(xdr, opaque_cred->cred_len);
- if (!p)
- return -EINVAL;
-
- opaque_cred->cred = p;
- return 0;
-}
-
-/*
- * struct pnfs_osd_object_cred {
- * struct pnfs_osd_objid oc_object_id;
- * u32 oc_osd_version;
- * u32 oc_cap_key_sec;
- * struct pnfs_osd_opaque_cred oc_cap_key
- * struct pnfs_osd_opaque_cred oc_cap;
- * }; // xdr size 32 + 4 + 4 + [variable] + [variable]
- */
-static int
-_osd_xdr_decode_object_cred(struct pnfs_osd_object_cred *comp,
- struct xdr_stream *xdr)
-{
- __be32 *p = xdr_inline_decode(xdr, 32 + 4 + 4);
- int ret;
-
- if (!p)
- return -EIO;
-
- p = _osd_xdr_decode_objid(p, &comp->oc_object_id);
- comp->oc_osd_version = be32_to_cpup(p++);
- comp->oc_cap_key_sec = be32_to_cpup(p);
-
- ret = _osd_xdr_decode_opaque_cred(&comp->oc_cap_key, xdr);
- if (unlikely(ret))
- return ret;
-
- ret = _osd_xdr_decode_opaque_cred(&comp->oc_cap, xdr);
- return ret;
-}
-
-/*
- * struct pnfs_osd_data_map {
- * u32 odm_num_comps;
- * u64 odm_stripe_unit;
- * u32 odm_group_width;
- * u32 odm_group_depth;
- * u32 odm_mirror_cnt;
- * u32 odm_raid_algorithm;
- * }; // xdr size 4 + 8 + 4 + 4 + 4 + 4
- */
-static inline int
-_osd_data_map_xdr_sz(void)
-{
- return 4 + 8 + 4 + 4 + 4 + 4;
-}
-
-static __be32 *
-_osd_xdr_decode_data_map(__be32 *p, struct pnfs_osd_data_map *data_map)
-{
- data_map->odm_num_comps = be32_to_cpup(p++);
- p = xdr_decode_hyper(p, &data_map->odm_stripe_unit);
- data_map->odm_group_width = be32_to_cpup(p++);
- data_map->odm_group_depth = be32_to_cpup(p++);
- data_map->odm_mirror_cnt = be32_to_cpup(p++);
- data_map->odm_raid_algorithm = be32_to_cpup(p++);
- dprintk("%s: odm_num_comps=%u odm_stripe_unit=%llu odm_group_width=%u "
- "odm_group_depth=%u odm_mirror_cnt=%u odm_raid_algorithm=%u\n",
- __func__,
- data_map->odm_num_comps,
- (unsigned long long)data_map->odm_stripe_unit,
- data_map->odm_group_width,
- data_map->odm_group_depth,
- data_map->odm_mirror_cnt,
- data_map->odm_raid_algorithm);
- return p;
-}
-
-int pnfs_osd_xdr_decode_layout_map(struct pnfs_osd_layout *layout,
- struct pnfs_osd_xdr_decode_layout_iter *iter, struct xdr_stream *xdr)
-{
- __be32 *p;
-
- memset(iter, 0, sizeof(*iter));
-
- p = xdr_inline_decode(xdr, _osd_data_map_xdr_sz() + 4 + 4);
- if (unlikely(!p))
- return -EINVAL;
-
- p = _osd_xdr_decode_data_map(p, &layout->olo_map);
- layout->olo_comps_index = be32_to_cpup(p++);
- layout->olo_num_comps = be32_to_cpup(p++);
- dprintk("%s: olo_comps_index=%d olo_num_comps=%d\n", __func__,
- layout->olo_comps_index, layout->olo_num_comps);
-
- iter->total_comps = layout->olo_num_comps;
- return 0;
-}
-
-bool pnfs_osd_xdr_decode_layout_comp(struct pnfs_osd_object_cred *comp,
- struct pnfs_osd_xdr_decode_layout_iter *iter, struct xdr_stream *xdr,
- int *err)
-{
- BUG_ON(iter->decoded_comps > iter->total_comps);
- if (iter->decoded_comps == iter->total_comps)
- return false;
-
- *err = _osd_xdr_decode_object_cred(comp, xdr);
- if (unlikely(*err)) {
- dprintk("%s: _osd_xdr_decode_object_cred=>%d decoded_comps=%d "
- "total_comps=%d\n", __func__, *err,
- iter->decoded_comps, iter->total_comps);
- return false; /* stop the loop */
- }
- dprintk("%s: dev(%llx:%llx) par=0x%llx obj=0x%llx "
- "key_len=%u cap_len=%u\n",
- __func__,
- _DEVID_LO(&comp->oc_object_id.oid_device_id),
- _DEVID_HI(&comp->oc_object_id.oid_device_id),
- comp->oc_object_id.oid_partition_id,
- comp->oc_object_id.oid_object_id,
- comp->oc_cap_key.cred_len, comp->oc_cap.cred_len);
-
- iter->decoded_comps++;
- return true;
-}
-
-/*
- * Get Device Information Decoding
- *
- * Note: since Device Information is currently done synchronously, all
- * variable strings fields are left inside the rpc buffer and are only
- * pointed to by the pnfs_osd_deviceaddr members. So the read buffer
- * should not be freed while the returned information is in use.
- */
-/*
- *struct nfs4_string {
- * unsigned int len;
- * char *data;
- *}; // size [variable]
- * NOTE: Returned string points to inside the XDR buffer
- */
-static __be32 *
-__read_u8_opaque(__be32 *p, struct nfs4_string *str)
-{
- str->len = be32_to_cpup(p++);
- str->data = (char *)p;
-
- p += XDR_QUADLEN(str->len);
- return p;
-}
-
-/*
- * struct pnfs_osd_targetid {
- * u32 oti_type;
- * struct nfs4_string oti_scsi_device_id;
- * };// size 4 + [variable]
- */
-static __be32 *
-__read_targetid(__be32 *p, struct pnfs_osd_targetid* targetid)
-{
- u32 oti_type;
-
- oti_type = be32_to_cpup(p++);
- targetid->oti_type = oti_type;
-
- switch (oti_type) {
- case OBJ_TARGET_SCSI_NAME:
- case OBJ_TARGET_SCSI_DEVICE_ID:
- p = __read_u8_opaque(p, &targetid->oti_scsi_device_id);
- }
-
- return p;
-}
-
-/*
- * struct pnfs_osd_net_addr {
- * struct nfs4_string r_netid;
- * struct nfs4_string r_addr;
- * };
- */
-static __be32 *
-__read_net_addr(__be32 *p, struct pnfs_osd_net_addr* netaddr)
-{
- p = __read_u8_opaque(p, &netaddr->r_netid);
- p = __read_u8_opaque(p, &netaddr->r_addr);
-
- return p;
-}
-
-/*
- * struct pnfs_osd_targetaddr {
- * u32 ota_available;
- * struct pnfs_osd_net_addr ota_netaddr;
- * };
- */
-static __be32 *
-__read_targetaddr(__be32 *p, struct pnfs_osd_targetaddr *targetaddr)
-{
- u32 ota_available;
-
- ota_available = be32_to_cpup(p++);
- targetaddr->ota_available = ota_available;
-
- if (ota_available)
- p = __read_net_addr(p, &targetaddr->ota_netaddr);
-
-
- return p;
-}
-
-/*
- * struct pnfs_osd_deviceaddr {
- * struct pnfs_osd_targetid oda_targetid;
- * struct pnfs_osd_targetaddr oda_targetaddr;
- * u8 oda_lun[8];
- * struct nfs4_string oda_systemid;
- * struct pnfs_osd_object_cred oda_root_obj_cred;
- * struct nfs4_string oda_osdname;
- * };
- */
-
-/* We need this version for the pnfs_osd_xdr_decode_deviceaddr which does
- * not have an xdr_stream
- */
-static __be32 *
-__read_opaque_cred(__be32 *p,
- struct pnfs_osd_opaque_cred *opaque_cred)
-{
- opaque_cred->cred_len = be32_to_cpu(*p++);
- opaque_cred->cred = p;
- return p + XDR_QUADLEN(opaque_cred->cred_len);
-}
-
-static __be32 *
-__read_object_cred(__be32 *p, struct pnfs_osd_object_cred *comp)
-{
- p = _osd_xdr_decode_objid(p, &comp->oc_object_id);
- comp->oc_osd_version = be32_to_cpup(p++);
- comp->oc_cap_key_sec = be32_to_cpup(p++);
-
- p = __read_opaque_cred(p, &comp->oc_cap_key);
- p = __read_opaque_cred(p, &comp->oc_cap);
- return p;
-}
-
-void pnfs_osd_xdr_decode_deviceaddr(
- struct pnfs_osd_deviceaddr *deviceaddr, __be32 *p)
-{
- p = __read_targetid(p, &deviceaddr->oda_targetid);
-
- p = __read_targetaddr(p, &deviceaddr->oda_targetaddr);
-
- p = xdr_decode_opaque_fixed(p, deviceaddr->oda_lun,
- sizeof(deviceaddr->oda_lun));
-
- p = __read_u8_opaque(p, &deviceaddr->oda_systemid);
-
- p = __read_object_cred(p, &deviceaddr->oda_root_obj_cred);
-
- p = __read_u8_opaque(p, &deviceaddr->oda_osdname);
-
- /* libosd likes this terminated in dbg. It's last, so no problems */
- deviceaddr->oda_osdname.data[deviceaddr->oda_osdname.len] = 0;
-}
-
-/*
- * struct pnfs_osd_layoutupdate {
- * u32 dsu_valid;
- * s64 dsu_delta;
- * u32 olu_ioerr_flag;
- * }; xdr size 4 + 8 + 4
- */
-int
-pnfs_osd_xdr_encode_layoutupdate(struct xdr_stream *xdr,
- struct pnfs_osd_layoutupdate *lou)
-{
- __be32 *p = xdr_reserve_space(xdr, 4 + 8 + 4);
-
- if (!p)
- return -E2BIG;
-
- *p++ = cpu_to_be32(lou->dsu_valid);
- if (lou->dsu_valid)
- p = xdr_encode_hyper(p, lou->dsu_delta);
- *p++ = cpu_to_be32(lou->olu_ioerr_flag);
- return 0;
-}
-
-/*
- * struct pnfs_osd_objid {
- * struct nfs4_deviceid oid_device_id;
- * u64 oid_partition_id;
- * u64 oid_object_id;
- * }; // xdr size 32 bytes
- */
-static inline __be32 *
-pnfs_osd_xdr_encode_objid(__be32 *p, struct pnfs_osd_objid *object_id)
-{
- p = xdr_encode_opaque_fixed(p, &object_id->oid_device_id.data,
- sizeof(object_id->oid_device_id.data));
- p = xdr_encode_hyper(p, object_id->oid_partition_id);
- p = xdr_encode_hyper(p, object_id->oid_object_id);
-
- return p;
-}
-
-/*
- * struct pnfs_osd_ioerr {
- * struct pnfs_osd_objid oer_component;
- * u64 oer_comp_offset;
- * u64 oer_comp_length;
- * u32 oer_iswrite;
- * u32 oer_errno;
- * }; // xdr size 32 + 24 bytes
- */
-void pnfs_osd_xdr_encode_ioerr(__be32 *p, struct pnfs_osd_ioerr *ioerr)
-{
- p = pnfs_osd_xdr_encode_objid(p, &ioerr->oer_component);
- p = xdr_encode_hyper(p, ioerr->oer_comp_offset);
- p = xdr_encode_hyper(p, ioerr->oer_comp_length);
- *p++ = cpu_to_be32(ioerr->oer_iswrite);
- *p = cpu_to_be32(ioerr->oer_errno);
-}
-
-__be32 *pnfs_osd_xdr_ioerr_reserve_space(struct xdr_stream *xdr)
-{
- __be32 *p;
-
- p = xdr_reserve_space(xdr, 32 + 24);
- if (unlikely(!p))
- dprintk("%s: out of xdr space\n", __func__);
-
- return p;
-}
--
2.11.0
^ permalink raw reply related
* [PATCH 2/4] fs: remove exofs
From: Christoph Hellwig @ 2017-04-12 16:01 UTC (permalink / raw)
To: ooo, martin.petersen, trond.myklebust, axboe
Cc: osd-dev, linux-nfs, linux-scsi, linux-block, linux-kernel
In-Reply-To: <20170412160109.10598-1-hch@lst.de>
This was an example for using the SCSI OSD protocol, which we're trying
to remove.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
Documentation/filesystems/00-INDEX | 2 -
Documentation/filesystems/exofs.txt | 185 -----
Documentation/scsi/osd.txt | 5 -
MAINTAINERS | 1 -
fs/Kconfig | 3 -
fs/Makefile | 1 -
fs/exofs/BUGS | 3 -
fs/exofs/Kbuild | 20 -
fs/exofs/Kconfig | 13 -
fs/exofs/Kconfig.ore | 14 -
fs/exofs/common.h | 262 ------
fs/exofs/dir.c | 661 ---------------
fs/exofs/exofs.h | 241 ------
fs/exofs/file.c | 83 --
fs/exofs/inode.c | 1514 -----------------------------------
fs/exofs/namei.c | 323 --------
fs/exofs/ore.c | 1164 ---------------------------
fs/exofs/ore_raid.c | 721 -----------------
fs/exofs/ore_raid.h | 62 --
fs/exofs/super.c | 1049 ------------------------
fs/exofs/sys.c | 205 -----
21 files changed, 6532 deletions(-)
delete mode 100644 Documentation/filesystems/exofs.txt
delete mode 100644 fs/exofs/BUGS
delete mode 100644 fs/exofs/Kbuild
delete mode 100644 fs/exofs/Kconfig
delete mode 100644 fs/exofs/Kconfig.ore
delete mode 100644 fs/exofs/common.h
delete mode 100644 fs/exofs/dir.c
delete mode 100644 fs/exofs/exofs.h
delete mode 100644 fs/exofs/file.c
delete mode 100644 fs/exofs/inode.c
delete mode 100644 fs/exofs/namei.c
delete mode 100644 fs/exofs/ore.c
delete mode 100644 fs/exofs/ore_raid.c
delete mode 100644 fs/exofs/ore_raid.h
delete mode 100644 fs/exofs/super.c
delete mode 100644 fs/exofs/sys.c
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index b7bd6c9009cc..b3c6b9b1d716 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -51,8 +51,6 @@ ecryptfs.txt
- docs on eCryptfs: stacked cryptographic filesystem for Linux.
efivarfs.txt
- info for the efivarfs filesystem.
-exofs.txt
- - info, usage, mount options, design about EXOFS.
ext2.txt
- info, mount options and specifications for the Ext2 filesystem.
ext3.txt
diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt
deleted file mode 100644
index 23583a136975..000000000000
--- a/Documentation/filesystems/exofs.txt
+++ /dev/null
@@ -1,185 +0,0 @@
-===============================================================================
-WHAT IS EXOFS?
-===============================================================================
-
-exofs is a file system that uses an OSD and exports the API of a normal Linux
-file system. Users access exofs like any other local file system, and exofs
-will in turn issue commands to the local OSD initiator.
-
-OSD is a new T10 command set that views storage devices not as a large/flat
-array of sectors but as a container of objects, each having a length, quota,
-time attributes and more. Each object is addressed by a 64bit ID, and is
-contained in a 64bit ID partition. Each object has associated attributes
-attached to it, which are integral part of the object and provide metadata about
-the object. The standard defines some common obligatory attributes, but user
-attributes can be added as needed.
-
-===============================================================================
-ENVIRONMENT
-===============================================================================
-
-To use this file system, you need to have an object store to run it on. You
-may download a target from:
-http://open-osd.org
-
-See Documentation/scsi/osd.txt for how to setup a working osd environment.
-
-===============================================================================
-USAGE
-===============================================================================
-
-1. Download and compile exofs and open-osd initiator:
- You need an external Kernel source tree or kernel headers from your
- distribution. (anything based on 2.6.26 or later).
-
- a. download open-osd including exofs source using:
- [parent-directory]$ git clone git://git.open-osd.org/open-osd.git
-
- b. Build the library module like this:
- [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd
-
- This will build both the open-osd initiator as well as the exofs kernel
- module. Use whatever parameters you compiled your Kernel with and
- $(KER_DIR) above pointing to the Kernel you compile against. See the file
- open-osd/top-level-Makefile for an example.
-
-2. Get the OSD initiator and target set up properly, and login to the target.
- See Documentation/scsi/osd.txt for farther instructions. Also see ./do-osd
- for example script that does all these steps.
-
-3. Insmod the exofs.ko module:
- [exofs]$ insmod exofs.ko
-
-4. Make sure the directory where you want to mount exists. If not, create it.
- (For example, mkdir /mnt/exofs)
-
-5. At first run you will need to invoke the mkfs.exofs application
-
- As an example, this will create the file system on:
- /dev/osd0 partition ID 65536
-
- mkfs.exofs --pid=65536 --format /dev/osd0
-
- The --format is optional. If not specified, no OSD_FORMAT will be
- performed and a clean file system will be created in the specified pid,
- in the available space of the target. (Use --format=size_in_meg to limit
- the total LUN space available)
-
- If pid already exists, it will be deleted and a new one will be created in
- its place. Be careful.
-
- An exofs lives inside a single OSD partition. You can create multiple exofs
- filesystems on the same device using multiple pids.
-
- (run mkfs.exofs without any parameters for usage help message)
-
-6. Mount the file system.
-
- For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs:
-
- mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/
-
-7. For reference (See do-exofs example script):
- do-exofs start - an example of how to perform the above steps.
- do-exofs stop - an example of how to unmount the file system.
- do-exofs format - an example of how to format and mkfs a new exofs.
-
-8. Extra compilation flags (uncomment in fs/exofs/Kbuild):
- CONFIG_EXOFS_DEBUG - for debug messages and extra checks.
-
-===============================================================================
-exofs mount options
-===============================================================================
-Similar to any mount command:
- mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory
-
-Where:
- -t exofs: specifies the exofs file system
-
- /dev/osdX: X is a decimal number. /dev/osdX was created after a successful
- login into an OSD target.
-
- mount_exofs_directory: The directory to mount the file system on
-
- exofs specific options: Options are separated by commas (,)
- pid=<integer> - The partition number to mount/create as
- container of the filesystem.
- This option is mandatory. integer can be
- Hex by pre-pending an 0x to the number.
- osdname=<id> - Mount by a device's osdname.
- osdname is usually a 36 character uuid of the
- form "d2683732-c906-4ee1-9dbd-c10c27bb40df".
- It is one of the device's uuid specified in the
- mkfs.exofs format command.
- If this option is specified then the /dev/osdX
- above can be empty and is ignored.
- to=<integer> - Timeout in ticks for a single command.
- default is (60 * HZ) [for debugging only]
-
-===============================================================================
-DESIGN
-===============================================================================
-
-* The file system control block (AKA on-disk superblock) resides in an object
- with a special ID (defined in common.h).
- Information included in the file system control block is used to fill the
- in-memory superblock structure at mount time. This object is created before
- the file system is used by mkexofs.c. It contains information such as:
- - The file system's magic number
- - The next inode number to be allocated
-
-* Each file resides in its own object and contains the data (and it will be
- possible to extend the file over multiple objects, though this has not been
- implemented yet).
-
-* A directory is treated as a file, and essentially contains a list of <file
- name, inode #> pairs for files that are found in that directory. The object
- IDs correspond to the files' inode numbers and will be allocated according to
- a bitmap (stored in a separate object). Now they are allocated using a
- counter.
-
-* Each file's control block (AKA on-disk inode) is stored in its object's
- attributes. This applies to both regular files and other types (directories,
- device files, symlinks, etc.).
-
-* Credentials are generated per object (inode and superblock) when they are
- created in memory (read from disk or created). The credential works for all
- operations and is used as long as the object remains in memory.
-
-* Async OSD operations are used whenever possible, but the target may execute
- them out of order. The operations that concern us are create, delete,
- readpage, writepage, update_inode, and truncate. The following pairs of
- operations should execute in the order written, and we need to prevent them
- from executing in reverse order:
- - The following are handled with the OBJ_CREATED and OBJ_2BCREATED
- flags. OBJ_CREATED is set when we know the object exists on the OSD -
- in create's callback function, and when we successfully do a
- read_inode.
- OBJ_2BCREATED is set in the beginning of the create function, so we
- know that we should wait.
- - create/delete: delete should wait until the object is created
- on the OSD.
- - create/readpage: readpage should be able to return a page
- full of zeroes in this case. If there was a write already
- en-route (i.e. create, writepage, readpage) then the page
- would be locked, and so it would really be the same as
- create/writepage.
- - create/writepage: if writepage is called for a sync write, it
- should wait until the object is created on the OSD.
- Otherwise, it should just return.
- - create/truncate: truncate should wait until the object is
- created on the OSD.
- - create/update_inode: update_inode should wait until the
- object is created on the OSD.
- - Handled by VFS locks:
- - readpage/delete: shouldn't happen because of page lock.
- - writepage/delete: shouldn't happen because of page lock.
- - readpage/writepage: shouldn't happen because of page lock.
-
-===============================================================================
-LICENSE/COPYRIGHT
-===============================================================================
-The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel
-version 2.6.10). All files include the original copyrights, and the license
-is GPL version 2 (only version 2, as is true for the Linux kernel). The
-Linux kernel can be downloaded from www.kernel.org.
diff --git a/Documentation/scsi/osd.txt b/Documentation/scsi/osd.txt
index 5a9879bad073..2bc2ab06b0c0 100644
--- a/Documentation/scsi/osd.txt
+++ b/Documentation/scsi/osd.txt
@@ -24,11 +24,6 @@ osd-uld:
platform, both for the in-kernel initiator as well as connected targets. It
currently has no useful user-mode API, though it could have if need be.
-exofs:
- Is an OSD based Linux file system. It uses the osd-initiator and osd-uld,
-to export a usable file system for users.
-See Documentation/filesystems/exofs.txt for more details
-
osd target:
There are no current plans for an OSD target implementation in kernel. For all
needs, a user-mode target that is based on the scsi tgt target framework is
diff --git a/MAINTAINERS b/MAINTAINERS
index 882ea01b4efe..0ffaf69becce 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9395,7 +9395,6 @@ T: git git://git.open-osd.org/open-osd.git
S: Maintained
F: drivers/scsi/osd/
F: include/scsi/osd_*
-F: fs/exofs/
OVERLAY FILESYSTEM
M: Miklos Szeredi <miklos@szeredi.hu>
diff --git a/fs/Kconfig b/fs/Kconfig
index 83eab52fb3f6..abee7e4468c6 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -247,12 +247,9 @@ source "fs/romfs/Kconfig"
source "fs/pstore/Kconfig"
source "fs/sysv/Kconfig"
source "fs/ufs/Kconfig"
-source "fs/exofs/Kconfig"
endif # MISC_FILESYSTEMS
-source "fs/exofs/Kconfig.ore"
-
menuconfig NETWORK_FILESYSTEMS
bool "Network File Systems"
default y
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..58ae3c3e6792 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -124,7 +124,6 @@ obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
obj-$(CONFIG_F2FS_FS) += f2fs/
-obj-y += exofs/ # Multiple modules
obj-$(CONFIG_CEPH_FS) += ceph/
obj-$(CONFIG_PSTORE) += pstore/
obj-$(CONFIG_EFIVAR_FS) += efivarfs/
diff --git a/fs/exofs/BUGS b/fs/exofs/BUGS
deleted file mode 100644
index 1b2d4c63a579..000000000000
--- a/fs/exofs/BUGS
+++ /dev/null
@@ -1,3 +0,0 @@
-- Out-of-space may cause a severe problem if the object (and directory entry)
- were written, but the inode attributes failed. Then if the filesystem was
- unmounted and mounted the kernel can get into an endless loop doing a readdir.
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
deleted file mode 100644
index a364fd0965ec..000000000000
--- a/fs/exofs/Kbuild
+++ /dev/null
@@ -1,20 +0,0 @@
-#
-# Kbuild for the EXOFS module
-#
-# Copyright (C) 2008 Panasas Inc. All rights reserved.
-#
-# Authors:
-# Boaz Harrosh <ooo@electrozaur.com>
-#
-# This program is free software; you can redistribute it and/or modify
-# it under the terms of the GNU General Public License version 2
-#
-# Kbuild - Gets included from the Kernels Makefile and build system
-#
-
-# ore module library
-libore-y := ore.o ore_raid.o
-obj-$(CONFIG_ORE) += libore.o
-
-exofs-y := inode.o file.o namei.o dir.o super.o sys.o
-obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
deleted file mode 100644
index 86194b2f799d..000000000000
--- a/fs/exofs/Kconfig
+++ /dev/null
@@ -1,13 +0,0 @@
-config EXOFS_FS
- tristate "exofs: OSD based file system support"
- depends on SCSI_OSD_ULD
- help
- EXOFS is a file system that uses an OSD storage device,
- as its backing storage.
-
-# Debugging-related stuff
-config EXOFS_DEBUG
- bool "Enable debugging"
- depends on EXOFS_FS
- help
- This option enables EXOFS debug prints.
diff --git a/fs/exofs/Kconfig.ore b/fs/exofs/Kconfig.ore
deleted file mode 100644
index 2daf2329c28d..000000000000
--- a/fs/exofs/Kconfig.ore
+++ /dev/null
@@ -1,14 +0,0 @@
-# ORE - Objects Raid Engine (libore.ko)
-#
-# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
-# for every ORE user we do it like this. Any user should add itself here
-# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
-# selected here, and we default to "ON". So in effect it is like been
-# selected by any of the users.
-config ORE
- tristate
- depends on EXOFS_FS || PNFS_OBJLAYOUT
- select ASYNC_XOR
- select RAID6_PQ
- select ASYNC_PQ
- default SCSI_OSD_ULD
diff --git a/fs/exofs/common.h b/fs/exofs/common.h
deleted file mode 100644
index 7d88ef566213..000000000000
--- a/fs/exofs/common.h
+++ /dev/null
@@ -1,262 +0,0 @@
-/*
- * common.h - Common definitions for both Kernel and user-mode utilities
- *
- * Copyright (C) 2005, 2006
- * Avishay Traeger (avishay@gmail.com)
- * Copyright (C) 2008, 2009
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * Copyrights for code taken from ext2:
- * Copyright (C) 1992, 1993, 1994, 1995
- * Remy Card (card@masi.ibp.fr)
- * Laboratoire MASI - Institut Blaise Pascal
- * Universite Pierre et Marie Curie (Paris VI)
- * from
- * linux/fs/minix/inode.c
- * Copyright (C) 1991, 1992 Linus Torvalds
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation. Since it is based on ext2, and the only
- * valid version of GPL for the Linux kernel is version 2, the only valid
- * version of GPL for exofs is version 2.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-
-#ifndef __EXOFS_COM_H__
-#define __EXOFS_COM_H__
-
-#include <linux/types.h>
-
-#include <scsi/osd_attributes.h>
-#include <scsi/osd_initiator.h>
-#include <scsi/osd_sec.h>
-
-/****************************************************************************
- * Object ID related defines
- * NOTE: inode# = object ID - EXOFS_OBJ_OFF
- ****************************************************************************/
-#define EXOFS_MIN_PID 0x10000 /* Smallest partition ID */
-#define EXOFS_OBJ_OFF 0x10000 /* offset for objects */
-#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
-#define EXOFS_DEVTABLE_ID 0x10001 /* object ID for on-disk device table */
-#define EXOFS_ROOT_ID 0x10002 /* object ID for root directory */
-
-/* exofs Application specific page/attribute */
-/* Inode attrs */
-# define EXOFS_APAGE_FS_DATA (OSD_APAGE_APP_DEFINED_FIRST + 3)
-# define EXOFS_ATTR_INODE_DATA 1
-# define EXOFS_ATTR_INODE_FILE_LAYOUT 2
-# define EXOFS_ATTR_INODE_DIR_LAYOUT 3
-/* Partition attrs */
-# define EXOFS_APAGE_SB_DATA (0xF0000000U + 3)
-# define EXOFS_ATTR_SB_STATS 1
-
-/*
- * The maximum number of files we can have is limited by the size of the
- * inode number. This is the largest object ID that the file system supports.
- * Object IDs 0, 1, and 2 are always in use (see above defines).
- */
-enum {
- EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? ULLONG_MAX :
- (1ULL << (sizeof(ino_t) * 8ULL - 1ULL)),
- EXOFS_MAX_ID = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
-};
-
-/****************************************************************************
- * Misc.
- ****************************************************************************/
-#define EXOFS_BLKSHIFT 12
-#define EXOFS_BLKSIZE (1UL << EXOFS_BLKSHIFT)
-
-/****************************************************************************
- * superblock-related things
- ****************************************************************************/
-#define EXOFS_SUPER_MAGIC 0x5DF5
-
-/*
- * The file system control block - stored in object EXOFS_SUPER_ID's data.
- * This is where the in-memory superblock is stored on disk.
- */
-enum {EXOFS_FSCB_VER = 1, EXOFS_DT_VER = 1};
-struct exofs_fscb {
- __le64 s_nextid; /* Only used after mkfs */
- __le64 s_numfiles; /* Only used after mkfs */
- __le32 s_version; /* == EXOFS_FSCB_VER */
- __le16 s_magic; /* Magic signature */
- __le16 s_newfs; /* Non-zero if this is a new fs */
-
- /* From here on it's a static part, only written by mkexofs */
- __le64 s_dev_table_oid; /* Resurved, not used */
- __le64 s_dev_table_count; /* == 0 means no dev_table */
-} __packed;
-
-/*
- * This struct is set on the FS partition's attributes.
- * [EXOFS_APAGE_SB_DATA, EXOFS_ATTR_SB_STATS] and is written together
- * with the create command, to atomically persist the sb writeable information.
- */
-struct exofs_sb_stats {
- __le64 s_nextid; /* Highest object ID used */
- __le64 s_numfiles; /* Number of files on fs */
-} __packed;
-
-/*
- * Describes the raid used in the FS. It is part of the device table.
- * This here is taken from the pNFS-objects definition. In exofs we
- * use one raid policy through-out the filesystem. (NOTE: the funny
- * alignment at beginning. We take care of it at exofs_device_table.
- */
-struct exofs_dt_data_map {
- __le32 cb_num_comps;
- __le64 cb_stripe_unit;
- __le32 cb_group_width;
- __le32 cb_group_depth;
- __le32 cb_mirror_cnt;
- __le32 cb_raid_algorithm;
-} __packed;
-
-/*
- * This is an osd device information descriptor. It is a single entry in
- * the exofs device table. It describes an osd target lun which
- * contains data belonging to this FS. (Same partition_id on all devices)
- */
-struct exofs_dt_device_info {
- __le32 systemid_len;
- u8 systemid[OSD_SYSTEMID_LEN];
- __le64 long_name_offset; /* If !0 then offset-in-file */
- __le32 osdname_len; /* */
- u8 osdname[44]; /* Embbeded, Usually an asci uuid */
-} __packed;
-
-/*
- * The EXOFS device table - stored in object EXOFS_DEVTABLE_ID's data.
- * It contains the raid used for this multy-device FS and an array of
- * participating devices.
- */
-struct exofs_device_table {
- __le32 dt_version; /* == EXOFS_DT_VER */
- struct exofs_dt_data_map dt_data_map; /* Raid policy to use */
-
- /* Resurved space For future use. Total includeing this:
- * (8 * sizeof(le64))
- */
- __le64 __Resurved[4];
-
- __le64 dt_num_devices; /* Array size */
- struct exofs_dt_device_info dt_dev_table[]; /* Array of devices */
-} __packed;
-
-/****************************************************************************
- * inode-related things
- ****************************************************************************/
-#define EXOFS_IDATA 5
-
-/*
- * The file control block - stored in an object's attributes. This is where
- * the in-memory inode is stored on disk.
- */
-struct exofs_fcb {
- __le64 i_size; /* Size of the file */
- __le16 i_mode; /* File mode */
- __le16 i_links_count; /* Links count */
- __le32 i_uid; /* Owner Uid */
- __le32 i_gid; /* Group Id */
- __le32 i_atime; /* Access time */
- __le32 i_ctime; /* Creation time */
- __le32 i_mtime; /* Modification time */
- __le32 i_flags; /* File flags (unused for now)*/
- __le32 i_generation; /* File version (for NFS) */
- __le32 i_data[EXOFS_IDATA]; /* Short symlink names and device #s */
-};
-
-#define EXOFS_INO_ATTR_SIZE sizeof(struct exofs_fcb)
-
-/* This is the Attribute the fcb is stored in */
-static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
- EXOFS_APAGE_FS_DATA,
- EXOFS_ATTR_INODE_DATA,
- EXOFS_INO_ATTR_SIZE);
-
-/****************************************************************************
- * dentry-related things
- ****************************************************************************/
-#define EXOFS_NAME_LEN 255
-
-/*
- * The on-disk directory entry
- */
-struct exofs_dir_entry {
- __le64 inode_no; /* inode number */
- __le16 rec_len; /* directory entry length */
- u8 name_len; /* name length */
- u8 file_type; /* umm...file type */
- char name[EXOFS_NAME_LEN]; /* file name */
-};
-
-enum {
- EXOFS_FT_UNKNOWN,
- EXOFS_FT_REG_FILE,
- EXOFS_FT_DIR,
- EXOFS_FT_CHRDEV,
- EXOFS_FT_BLKDEV,
- EXOFS_FT_FIFO,
- EXOFS_FT_SOCK,
- EXOFS_FT_SYMLINK,
- EXOFS_FT_MAX
-};
-
-#define EXOFS_DIR_PAD 4
-#define EXOFS_DIR_ROUND (EXOFS_DIR_PAD - 1)
-#define EXOFS_DIR_REC_LEN(name_len) \
- (((name_len) + offsetof(struct exofs_dir_entry, name) + \
- EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)
-
-/*
- * The on-disk (optional) layout structure.
- * sits in an EXOFS_ATTR_INODE_FILE_LAYOUT or EXOFS_ATTR_INODE_DIR_LAYOUT
- * attribute, attached to any inode, usually to a directory.
- */
-
-enum exofs_inode_layout_gen_functions {
- LAYOUT_MOVING_WINDOW = 0,
- LAYOUT_IMPLICT = 1,
-};
-
-struct exofs_on_disk_inode_layout {
- __le16 gen_func; /* One of enum exofs_inode_layout_gen_functions */
- __le16 pad;
- union {
- /* gen_func == LAYOUT_MOVING_WINDOW (default) */
- struct exofs_layout_sliding_window {
- __le32 num_devices; /* first n devices in global-table*/
- } sliding_window __packed;
-
- /* gen_func == LAYOUT_IMPLICT */
- struct exofs_layout_implict_list {
- struct exofs_dt_data_map data_map;
- /* Variable array of size data_map.cb_num_comps. These
- * are device indexes of the devices in the global table
- */
- __le32 dev_indexes[];
- } implict __packed;
- };
-} __packed;
-
-static inline size_t exofs_on_disk_inode_layout_size(unsigned max_devs)
-{
- return sizeof(struct exofs_on_disk_inode_layout) +
- max_devs * sizeof(__le32);
-}
-
-#endif /*ifndef __EXOFS_COM_H__*/
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
deleted file mode 100644
index 42f9a0a0c4ca..000000000000
--- a/fs/exofs/dir.c
+++ /dev/null
@@ -1,661 +0,0 @@
-/*
- * Copyright (C) 2005, 2006
- * Avishay Traeger (avishay@gmail.com)
- * Copyright (C) 2008, 2009
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * Copyrights for code taken from ext2:
- * Copyright (C) 1992, 1993, 1994, 1995
- * Remy Card (card@masi.ibp.fr)
- * Laboratoire MASI - Institut Blaise Pascal
- * Universite Pierre et Marie Curie (Paris VI)
- * from
- * linux/fs/minix/inode.c
- * Copyright (C) 1991, 1992 Linus Torvalds
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation. Since it is based on ext2, and the only
- * valid version of GPL for the Linux kernel is version 2, the only valid
- * version of GPL for exofs is version 2.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-
-#include "exofs.h"
-
-static inline unsigned exofs_chunk_size(struct inode *inode)
-{
- return inode->i_sb->s_blocksize;
-}
-
-static inline void exofs_put_page(struct page *page)
-{
- kunmap(page);
- put_page(page);
-}
-
-static unsigned exofs_last_byte(struct inode *inode, unsigned long page_nr)
-{
- loff_t last_byte = inode->i_size;
-
- last_byte -= page_nr << PAGE_SHIFT;
- if (last_byte > PAGE_SIZE)
- last_byte = PAGE_SIZE;
- return last_byte;
-}
-
-static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
-{
- struct address_space *mapping = page->mapping;
- struct inode *dir = mapping->host;
- int err = 0;
-
- dir->i_version++;
-
- if (!PageUptodate(page))
- SetPageUptodate(page);
-
- if (pos+len > dir->i_size) {
- i_size_write(dir, pos+len);
- mark_inode_dirty(dir);
- }
- set_page_dirty(page);
-
- if (IS_DIRSYNC(dir))
- err = write_one_page(page, 1);
- else
- unlock_page(page);
-
- return err;
-}
-
-static bool exofs_check_page(struct page *page)
-{
- struct inode *dir = page->mapping->host;
- unsigned chunk_size = exofs_chunk_size(dir);
- char *kaddr = page_address(page);
- unsigned offs, rec_len;
- unsigned limit = PAGE_SIZE;
- struct exofs_dir_entry *p;
- char *error;
-
- /* if the page is the last one in the directory */
- if ((dir->i_size >> PAGE_SHIFT) == page->index) {
- limit = dir->i_size & ~PAGE_MASK;
- if (limit & (chunk_size - 1))
- goto Ebadsize;
- if (!limit)
- goto out;
- }
- for (offs = 0; offs <= limit - EXOFS_DIR_REC_LEN(1); offs += rec_len) {
- p = (struct exofs_dir_entry *)(kaddr + offs);
- rec_len = le16_to_cpu(p->rec_len);
-
- if (rec_len < EXOFS_DIR_REC_LEN(1))
- goto Eshort;
- if (rec_len & 3)
- goto Ealign;
- if (rec_len < EXOFS_DIR_REC_LEN(p->name_len))
- goto Enamelen;
- if (((offs + rec_len - 1) ^ offs) & ~(chunk_size-1))
- goto Espan;
- }
- if (offs != limit)
- goto Eend;
-out:
- SetPageChecked(page);
- return true;
-
-Ebadsize:
- EXOFS_ERR("ERROR [exofs_check_page]: "
- "size of directory(0x%lx) is not a multiple of chunk size\n",
- dir->i_ino
- );
- goto fail;
-Eshort:
- error = "rec_len is smaller than minimal";
- goto bad_entry;
-Ealign:
- error = "unaligned directory entry";
- goto bad_entry;
-Enamelen:
- error = "rec_len is too small for name_len";
- goto bad_entry;
-Espan:
- error = "directory entry across blocks";
- goto bad_entry;
-bad_entry:
- EXOFS_ERR(
- "ERROR [exofs_check_page]: bad entry in directory(0x%lx): %s - "
- "offset=%lu, inode=0x%llx, rec_len=%d, name_len=%d\n",
- dir->i_ino, error, (page->index<<PAGE_SHIFT)+offs,
- _LLU(le64_to_cpu(p->inode_no)),
- rec_len, p->name_len);
- goto fail;
-Eend:
- p = (struct exofs_dir_entry *)(kaddr + offs);
- EXOFS_ERR("ERROR [exofs_check_page]: "
- "entry in directory(0x%lx) spans the page boundary"
- "offset=%lu, inode=0x%llx\n",
- dir->i_ino, (page->index<<PAGE_SHIFT)+offs,
- _LLU(le64_to_cpu(p->inode_no)));
-fail:
- SetPageError(page);
- return false;
-}
-
-static struct page *exofs_get_page(struct inode *dir, unsigned long n)
-{
- struct address_space *mapping = dir->i_mapping;
- struct page *page = read_mapping_page(mapping, n, NULL);
-
- if (!IS_ERR(page)) {
- kmap(page);
- if (unlikely(!PageChecked(page))) {
- if (PageError(page) || !exofs_check_page(page))
- goto fail;
- }
- }
- return page;
-
-fail:
- exofs_put_page(page);
- return ERR_PTR(-EIO);
-}
-
-static inline int exofs_match(int len, const unsigned char *name,
- struct exofs_dir_entry *de)
-{
- if (len != de->name_len)
- return 0;
- if (!de->inode_no)
- return 0;
- return !memcmp(name, de->name, len);
-}
-
-static inline
-struct exofs_dir_entry *exofs_next_entry(struct exofs_dir_entry *p)
-{
- return (struct exofs_dir_entry *)((char *)p + le16_to_cpu(p->rec_len));
-}
-
-static inline unsigned
-exofs_validate_entry(char *base, unsigned offset, unsigned mask)
-{
- struct exofs_dir_entry *de = (struct exofs_dir_entry *)(base + offset);
- struct exofs_dir_entry *p =
- (struct exofs_dir_entry *)(base + (offset&mask));
- while ((char *)p < (char *)de) {
- if (p->rec_len == 0)
- break;
- p = exofs_next_entry(p);
- }
- return (char *)p - base;
-}
-
-static unsigned char exofs_filetype_table[EXOFS_FT_MAX] = {
- [EXOFS_FT_UNKNOWN] = DT_UNKNOWN,
- [EXOFS_FT_REG_FILE] = DT_REG,
- [EXOFS_FT_DIR] = DT_DIR,
- [EXOFS_FT_CHRDEV] = DT_CHR,
- [EXOFS_FT_BLKDEV] = DT_BLK,
- [EXOFS_FT_FIFO] = DT_FIFO,
- [EXOFS_FT_SOCK] = DT_SOCK,
- [EXOFS_FT_SYMLINK] = DT_LNK,
-};
-
-#define S_SHIFT 12
-static unsigned char exofs_type_by_mode[S_IFMT >> S_SHIFT] = {
- [S_IFREG >> S_SHIFT] = EXOFS_FT_REG_FILE,
- [S_IFDIR >> S_SHIFT] = EXOFS_FT_DIR,
- [S_IFCHR >> S_SHIFT] = EXOFS_FT_CHRDEV,
- [S_IFBLK >> S_SHIFT] = EXOFS_FT_BLKDEV,
- [S_IFIFO >> S_SHIFT] = EXOFS_FT_FIFO,
- [S_IFSOCK >> S_SHIFT] = EXOFS_FT_SOCK,
- [S_IFLNK >> S_SHIFT] = EXOFS_FT_SYMLINK,
-};
-
-static inline
-void exofs_set_de_type(struct exofs_dir_entry *de, struct inode *inode)
-{
- umode_t mode = inode->i_mode;
- de->file_type = exofs_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
-}
-
-static int
-exofs_readdir(struct file *file, struct dir_context *ctx)
-{
- loff_t pos = ctx->pos;
- struct inode *inode = file_inode(file);
- unsigned int offset = pos & ~PAGE_MASK;
- unsigned long n = pos >> PAGE_SHIFT;
- unsigned long npages = dir_pages(inode);
- unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
- int need_revalidate = (file->f_version != inode->i_version);
-
- if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
- return 0;
-
- for ( ; n < npages; n++, offset = 0) {
- char *kaddr, *limit;
- struct exofs_dir_entry *de;
- struct page *page = exofs_get_page(inode, n);
-
- if (IS_ERR(page)) {
- EXOFS_ERR("ERROR: bad page in directory(0x%lx)\n",
- inode->i_ino);
- ctx->pos += PAGE_SIZE - offset;
- return PTR_ERR(page);
- }
- kaddr = page_address(page);
- if (unlikely(need_revalidate)) {
- if (offset) {
- offset = exofs_validate_entry(kaddr, offset,
- chunk_mask);
- ctx->pos = (n<<PAGE_SHIFT) + offset;
- }
- file->f_version = inode->i_version;
- need_revalidate = 0;
- }
- de = (struct exofs_dir_entry *)(kaddr + offset);
- limit = kaddr + exofs_last_byte(inode, n) -
- EXOFS_DIR_REC_LEN(1);
- for (; (char *)de <= limit; de = exofs_next_entry(de)) {
- if (de->rec_len == 0) {
- EXOFS_ERR("ERROR: "
- "zero-length entry in directory(0x%lx)\n",
- inode->i_ino);
- exofs_put_page(page);
- return -EIO;
- }
- if (de->inode_no) {
- unsigned char t;
-
- if (de->file_type < EXOFS_FT_MAX)
- t = exofs_filetype_table[de->file_type];
- else
- t = DT_UNKNOWN;
-
- if (!dir_emit(ctx, de->name, de->name_len,
- le64_to_cpu(de->inode_no),
- t)) {
- exofs_put_page(page);
- return 0;
- }
- }
- ctx->pos += le16_to_cpu(de->rec_len);
- }
- exofs_put_page(page);
- }
- return 0;
-}
-
-struct exofs_dir_entry *exofs_find_entry(struct inode *dir,
- struct dentry *dentry, struct page **res_page)
-{
- const unsigned char *name = dentry->d_name.name;
- int namelen = dentry->d_name.len;
- unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
- unsigned long start, n;
- unsigned long npages = dir_pages(dir);
- struct page *page = NULL;
- struct exofs_i_info *oi = exofs_i(dir);
- struct exofs_dir_entry *de;
-
- if (npages == 0)
- goto out;
-
- *res_page = NULL;
-
- start = oi->i_dir_start_lookup;
- if (start >= npages)
- start = 0;
- n = start;
- do {
- char *kaddr;
- page = exofs_get_page(dir, n);
- if (!IS_ERR(page)) {
- kaddr = page_address(page);
- de = (struct exofs_dir_entry *) kaddr;
- kaddr += exofs_last_byte(dir, n) - reclen;
- while ((char *) de <= kaddr) {
- if (de->rec_len == 0) {
- EXOFS_ERR("ERROR: zero-length entry in "
- "directory(0x%lx)\n",
- dir->i_ino);
- exofs_put_page(page);
- goto out;
- }
- if (exofs_match(namelen, name, de))
- goto found;
- de = exofs_next_entry(de);
- }
- exofs_put_page(page);
- }
- if (++n >= npages)
- n = 0;
- } while (n != start);
-out:
- return NULL;
-
-found:
- *res_page = page;
- oi->i_dir_start_lookup = n;
- return de;
-}
-
-struct exofs_dir_entry *exofs_dotdot(struct inode *dir, struct page **p)
-{
- struct page *page = exofs_get_page(dir, 0);
- struct exofs_dir_entry *de = NULL;
-
- if (!IS_ERR(page)) {
- de = exofs_next_entry(
- (struct exofs_dir_entry *)page_address(page));
- *p = page;
- }
- return de;
-}
-
-ino_t exofs_parent_ino(struct dentry *child)
-{
- struct page *page;
- struct exofs_dir_entry *de;
- ino_t ino;
-
- de = exofs_dotdot(d_inode(child), &page);
- if (!de)
- return 0;
-
- ino = le64_to_cpu(de->inode_no);
- exofs_put_page(page);
- return ino;
-}
-
-ino_t exofs_inode_by_name(struct inode *dir, struct dentry *dentry)
-{
- ino_t res = 0;
- struct exofs_dir_entry *de;
- struct page *page;
-
- de = exofs_find_entry(dir, dentry, &page);
- if (de) {
- res = le64_to_cpu(de->inode_no);
- exofs_put_page(page);
- }
- return res;
-}
-
-int exofs_set_link(struct inode *dir, struct exofs_dir_entry *de,
- struct page *page, struct inode *inode)
-{
- loff_t pos = page_offset(page) +
- (char *) de - (char *) page_address(page);
- unsigned len = le16_to_cpu(de->rec_len);
- int err;
-
- lock_page(page);
- err = exofs_write_begin(NULL, page->mapping, pos, len,
- AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
- if (err)
- EXOFS_ERR("exofs_set_link: exofs_write_begin FAILED => %d\n",
- err);
-
- de->inode_no = cpu_to_le64(inode->i_ino);
- exofs_set_de_type(de, inode);
- if (likely(!err))
- err = exofs_commit_chunk(page, pos, len);
- exofs_put_page(page);
- dir->i_mtime = dir->i_ctime = current_time(dir);
- mark_inode_dirty(dir);
- return err;
-}
-
-int exofs_add_link(struct dentry *dentry, struct inode *inode)
-{
- struct inode *dir = d_inode(dentry->d_parent);
- const unsigned char *name = dentry->d_name.name;
- int namelen = dentry->d_name.len;
- unsigned chunk_size = exofs_chunk_size(dir);
- unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
- unsigned short rec_len, name_len;
- struct page *page = NULL;
- struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
- struct exofs_dir_entry *de;
- unsigned long npages = dir_pages(dir);
- unsigned long n;
- char *kaddr;
- loff_t pos;
- int err;
-
- for (n = 0; n <= npages; n++) {
- char *dir_end;
-
- page = exofs_get_page(dir, n);
- err = PTR_ERR(page);
- if (IS_ERR(page))
- goto out;
- lock_page(page);
- kaddr = page_address(page);
- dir_end = kaddr + exofs_last_byte(dir, n);
- de = (struct exofs_dir_entry *)kaddr;
- kaddr += PAGE_SIZE - reclen;
- while ((char *)de <= kaddr) {
- if ((char *)de == dir_end) {
- name_len = 0;
- rec_len = chunk_size;
- de->rec_len = cpu_to_le16(chunk_size);
- de->inode_no = 0;
- goto got_it;
- }
- if (de->rec_len == 0) {
- EXOFS_ERR("ERROR: exofs_add_link: "
- "zero-length entry in directory(0x%lx)\n",
- inode->i_ino);
- err = -EIO;
- goto out_unlock;
- }
- err = -EEXIST;
- if (exofs_match(namelen, name, de))
- goto out_unlock;
- name_len = EXOFS_DIR_REC_LEN(de->name_len);
- rec_len = le16_to_cpu(de->rec_len);
- if (!de->inode_no && rec_len >= reclen)
- goto got_it;
- if (rec_len >= name_len + reclen)
- goto got_it;
- de = (struct exofs_dir_entry *) ((char *) de + rec_len);
- }
- unlock_page(page);
- exofs_put_page(page);
- }
-
- EXOFS_ERR("exofs_add_link: BAD dentry=%p or inode=0x%lx\n",
- dentry, inode->i_ino);
- return -EINVAL;
-
-got_it:
- pos = page_offset(page) +
- (char *)de - (char *)page_address(page);
- err = exofs_write_begin(NULL, page->mapping, pos, rec_len, 0,
- &page, NULL);
- if (err)
- goto out_unlock;
- if (de->inode_no) {
- struct exofs_dir_entry *de1 =
- (struct exofs_dir_entry *)((char *)de + name_len);
- de1->rec_len = cpu_to_le16(rec_len - name_len);
- de->rec_len = cpu_to_le16(name_len);
- de = de1;
- }
- de->name_len = namelen;
- memcpy(de->name, name, namelen);
- de->inode_no = cpu_to_le64(inode->i_ino);
- exofs_set_de_type(de, inode);
- err = exofs_commit_chunk(page, pos, rec_len);
- dir->i_mtime = dir->i_ctime = current_time(dir);
- mark_inode_dirty(dir);
- sbi->s_numfiles++;
-
-out_put:
- exofs_put_page(page);
-out:
- return err;
-out_unlock:
- unlock_page(page);
- goto out_put;
-}
-
-int exofs_delete_entry(struct exofs_dir_entry *dir, struct page *page)
-{
- struct address_space *mapping = page->mapping;
- struct inode *inode = mapping->host;
- struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
- char *kaddr = page_address(page);
- unsigned from = ((char *)dir - kaddr) & ~(exofs_chunk_size(inode)-1);
- unsigned to = ((char *)dir - kaddr) + le16_to_cpu(dir->rec_len);
- loff_t pos;
- struct exofs_dir_entry *pde = NULL;
- struct exofs_dir_entry *de = (struct exofs_dir_entry *) (kaddr + from);
- int err;
-
- while (de < dir) {
- if (de->rec_len == 0) {
- EXOFS_ERR("ERROR: exofs_delete_entry:"
- "zero-length entry in directory(0x%lx)\n",
- inode->i_ino);
- err = -EIO;
- goto out;
- }
- pde = de;
- de = exofs_next_entry(de);
- }
- if (pde)
- from = (char *)pde - (char *)page_address(page);
- pos = page_offset(page) + from;
- lock_page(page);
- err = exofs_write_begin(NULL, page->mapping, pos, to - from, 0,
- &page, NULL);
- if (err)
- EXOFS_ERR("exofs_delete_entry: exofs_write_begin FAILED => %d\n",
- err);
- if (pde)
- pde->rec_len = cpu_to_le16(to - from);
- dir->inode_no = 0;
- if (likely(!err))
- err = exofs_commit_chunk(page, pos, to - from);
- inode->i_ctime = inode->i_mtime = current_time(inode);
- mark_inode_dirty(inode);
- sbi->s_numfiles--;
-out:
- exofs_put_page(page);
- return err;
-}
-
-/* kept aligned on 4 bytes */
-#define THIS_DIR ".\0\0"
-#define PARENT_DIR "..\0"
-
-int exofs_make_empty(struct inode *inode, struct inode *parent)
-{
- struct address_space *mapping = inode->i_mapping;
- struct page *page = grab_cache_page(mapping, 0);
- unsigned chunk_size = exofs_chunk_size(inode);
- struct exofs_dir_entry *de;
- int err;
- void *kaddr;
-
- if (!page)
- return -ENOMEM;
-
- err = exofs_write_begin(NULL, page->mapping, 0, chunk_size, 0,
- &page, NULL);
- if (err) {
- unlock_page(page);
- goto fail;
- }
-
- kaddr = kmap_atomic(page);
- de = (struct exofs_dir_entry *)kaddr;
- de->name_len = 1;
- de->rec_len = cpu_to_le16(EXOFS_DIR_REC_LEN(1));
- memcpy(de->name, THIS_DIR, sizeof(THIS_DIR));
- de->inode_no = cpu_to_le64(inode->i_ino);
- exofs_set_de_type(de, inode);
-
- de = (struct exofs_dir_entry *)(kaddr + EXOFS_DIR_REC_LEN(1));
- de->name_len = 2;
- de->rec_len = cpu_to_le16(chunk_size - EXOFS_DIR_REC_LEN(1));
- de->inode_no = cpu_to_le64(parent->i_ino);
- memcpy(de->name, PARENT_DIR, sizeof(PARENT_DIR));
- exofs_set_de_type(de, inode);
- kunmap_atomic(kaddr);
- err = exofs_commit_chunk(page, 0, chunk_size);
-fail:
- put_page(page);
- return err;
-}
-
-int exofs_empty_dir(struct inode *inode)
-{
- struct page *page = NULL;
- unsigned long i, npages = dir_pages(inode);
-
- for (i = 0; i < npages; i++) {
- char *kaddr;
- struct exofs_dir_entry *de;
- page = exofs_get_page(inode, i);
-
- if (IS_ERR(page))
- continue;
-
- kaddr = page_address(page);
- de = (struct exofs_dir_entry *)kaddr;
- kaddr += exofs_last_byte(inode, i) - EXOFS_DIR_REC_LEN(1);
-
- while ((char *)de <= kaddr) {
- if (de->rec_len == 0) {
- EXOFS_ERR("ERROR: exofs_empty_dir: "
- "zero-length directory entry"
- "kaddr=%p, de=%p\n", kaddr, de);
- goto not_empty;
- }
- if (de->inode_no != 0) {
- /* check for . and .. */
- if (de->name[0] != '.')
- goto not_empty;
- if (de->name_len > 2)
- goto not_empty;
- if (de->name_len < 2) {
- if (le64_to_cpu(de->inode_no) !=
- inode->i_ino)
- goto not_empty;
- } else if (de->name[1] != '.')
- goto not_empty;
- }
- de = exofs_next_entry(de);
- }
- exofs_put_page(page);
- }
- return 1;
-
-not_empty:
- exofs_put_page(page);
- return 0;
-}
-
-const struct file_operations exofs_dir_operations = {
- .llseek = generic_file_llseek,
- .read = generic_read_dir,
- .iterate_shared = exofs_readdir,
-};
diff --git a/fs/exofs/exofs.h b/fs/exofs/exofs.h
deleted file mode 100644
index 2e86086bc940..000000000000
--- a/fs/exofs/exofs.h
+++ /dev/null
@@ -1,241 +0,0 @@
-/*
- * Copyright (C) 2005, 2006
- * Avishay Traeger (avishay@gmail.com)
- * Copyright (C) 2008, 2009
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * Copyrights for code taken from ext2:
- * Copyright (C) 1992, 1993, 1994, 1995
- * Remy Card (card@masi.ibp.fr)
- * Laboratoire MASI - Institut Blaise Pascal
- * Universite Pierre et Marie Curie (Paris VI)
- * from
- * linux/fs/minix/inode.c
- * Copyright (C) 1991, 1992 Linus Torvalds
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation. Since it is based on ext2, and the only
- * valid version of GPL for the Linux kernel is version 2, the only valid
- * version of GPL for exofs is version 2.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-#ifndef __EXOFS_H__
-#define __EXOFS_H__
-
-#include <linux/fs.h>
-#include <linux/time.h>
-#include <linux/backing-dev.h>
-#include <scsi/osd_ore.h>
-
-#include "common.h"
-
-#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
-
-#ifdef CONFIG_EXOFS_DEBUG
-#define EXOFS_DBGMSG(fmt, a...) \
- printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
-#else
-#define EXOFS_DBGMSG(fmt, a...) \
- do { if (0) printk(fmt, ##a); } while (0)
-#endif
-
-/* u64 has problems with printk this will cast it to unsigned long long */
-#define _LLU(x) (unsigned long long)(x)
-
-struct exofs_dev {
- struct ore_dev ored;
- unsigned did;
- unsigned urilen;
- uint8_t *uri;
- struct kobject ed_kobj;
-};
-/*
- * our extension to the in-memory superblock
- */
-struct exofs_sb_info {
- struct backing_dev_info bdi; /* register our bdi with VFS */
- struct exofs_sb_stats s_ess; /* Written often, pre-allocate*/
- int s_timeout; /* timeout for OSD operations */
- uint64_t s_nextid; /* highest object ID used */
- uint32_t s_numfiles; /* number of files on fs */
- spinlock_t s_next_gen_lock; /* spinlock for gen # update */
- u32 s_next_generation; /* next gen # to use */
- atomic_t s_curr_pending; /* number of pending commands */
-
- struct ore_layout layout; /* Default files layout */
- struct ore_comp one_comp; /* id & cred of partition id=0*/
- struct ore_components oc; /* comps for the partition */
- struct kobject s_kobj; /* holds per-sbi kobject */
-};
-
-/*
- * our extension to the in-memory inode
- */
-struct exofs_i_info {
- struct inode vfs_inode; /* normal in-memory inode */
- wait_queue_head_t i_wq; /* wait queue for inode */
- unsigned long i_flags; /* various atomic flags */
- uint32_t i_data[EXOFS_IDATA];/*short symlink names and device #s*/
- uint32_t i_dir_start_lookup; /* which page to start lookup */
- uint64_t i_commit_size; /* the object's written length */
- struct ore_comp one_comp; /* same component for all devices */
- struct ore_components oc; /* inode view of the device table */
-};
-
-static inline osd_id exofs_oi_objno(struct exofs_i_info *oi)
-{
- return oi->vfs_inode.i_ino + EXOFS_OBJ_OFF;
-}
-
-/*
- * our inode flags
- */
-#define OBJ_2BCREATED 0 /* object will be created soon*/
-#define OBJ_CREATED 1 /* object has been created on the osd*/
-
-static inline int obj_2bcreated(struct exofs_i_info *oi)
-{
- return test_bit(OBJ_2BCREATED, &oi->i_flags);
-}
-
-static inline void set_obj_2bcreated(struct exofs_i_info *oi)
-{
- set_bit(OBJ_2BCREATED, &oi->i_flags);
-}
-
-static inline int obj_created(struct exofs_i_info *oi)
-{
- return test_bit(OBJ_CREATED, &oi->i_flags);
-}
-
-static inline void set_obj_created(struct exofs_i_info *oi)
-{
- set_bit(OBJ_CREATED, &oi->i_flags);
-}
-
-int __exofs_wait_obj_created(struct exofs_i_info *oi);
-static inline int wait_obj_created(struct exofs_i_info *oi)
-{
- if (likely(obj_created(oi)))
- return 0;
-
- return __exofs_wait_obj_created(oi);
-}
-
-/*
- * get to our inode from the vfs inode
- */
-static inline struct exofs_i_info *exofs_i(struct inode *inode)
-{
- return container_of(inode, struct exofs_i_info, vfs_inode);
-}
-
-/*
- * Maximum count of links to a file
- */
-#define EXOFS_LINK_MAX 32000
-
-/*************************
- * function declarations *
- *************************/
-
-/* inode.c */
-unsigned exofs_max_io_pages(struct ore_layout *layout,
- unsigned expected_pages);
-int exofs_setattr(struct dentry *, struct iattr *);
-int exofs_write_begin(struct file *file, struct address_space *mapping,
- loff_t pos, unsigned len, unsigned flags,
- struct page **pagep, void **fsdata);
-extern struct inode *exofs_iget(struct super_block *, unsigned long);
-struct inode *exofs_new_inode(struct inode *, umode_t);
-extern int exofs_write_inode(struct inode *, struct writeback_control *wbc);
-extern void exofs_evict_inode(struct inode *);
-
-/* dir.c: */
-int exofs_add_link(struct dentry *, struct inode *);
-ino_t exofs_inode_by_name(struct inode *, struct dentry *);
-int exofs_delete_entry(struct exofs_dir_entry *, struct page *);
-int exofs_make_empty(struct inode *, struct inode *);
-struct exofs_dir_entry *exofs_find_entry(struct inode *, struct dentry *,
- struct page **);
-int exofs_empty_dir(struct inode *);
-struct exofs_dir_entry *exofs_dotdot(struct inode *, struct page **);
-ino_t exofs_parent_ino(struct dentry *child);
-int exofs_set_link(struct inode *, struct exofs_dir_entry *, struct page *,
- struct inode *);
-
-/* super.c */
-void exofs_make_credential(u8 cred_a[OSD_CAP_LEN],
- const struct osd_obj_id *obj);
-int exofs_sbi_write_stats(struct exofs_sb_info *sbi);
-
-/* sys.c */
-int exofs_sysfs_init(void);
-void exofs_sysfs_uninit(void);
-int exofs_sysfs_sb_add(struct exofs_sb_info *sbi,
- struct exofs_dt_device_info *dt_dev);
-void exofs_sysfs_sb_del(struct exofs_sb_info *sbi);
-int exofs_sysfs_odev_add(struct exofs_dev *edev,
- struct exofs_sb_info *sbi);
-void exofs_sysfs_dbg_print(void);
-
-/*********************
- * operation vectors *
- *********************/
-/* dir.c: */
-extern const struct file_operations exofs_dir_operations;
-
-/* file.c */
-extern const struct inode_operations exofs_file_inode_operations;
-extern const struct file_operations exofs_file_operations;
-
-/* inode.c */
-extern const struct address_space_operations exofs_aops;
-
-/* namei.c */
-extern const struct inode_operations exofs_dir_inode_operations;
-extern const struct inode_operations exofs_special_inode_operations;
-
-/* exofs_init_comps will initialize an ore_components device array
- * pointing to a single ore_comp struct, and a round-robin view
- * of the device table.
- * The first device of each inode is the [inode->ino % num_devices]
- * and the rest of the devices sequentially following where the
- * first device is after the last device.
- * It is assumed that the global device array at @sbi is twice
- * bigger and that the device table repeats twice.
- * See: exofs_read_lookup_dev_table()
- */
-static inline void exofs_init_comps(struct ore_components *oc,
- struct ore_comp *one_comp,
- struct exofs_sb_info *sbi, osd_id oid)
-{
- unsigned dev_mod = (unsigned)oid, first_dev;
-
- one_comp->obj.partition = sbi->one_comp.obj.partition;
- one_comp->obj.id = oid;
- exofs_make_credential(one_comp->cred, &one_comp->obj);
-
- oc->first_dev = 0;
- oc->numdevs = sbi->layout.group_width * sbi->layout.mirrors_p1 *
- sbi->layout.group_count;
- oc->single_comp = EC_SINGLE_COMP;
- oc->comps = one_comp;
-
- /* Round robin device view of the table */
- first_dev = (dev_mod * sbi->layout.mirrors_p1) % sbi->oc.numdevs;
- oc->ods = &sbi->oc.ods[first_dev];
-}
-
-#endif
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
deleted file mode 100644
index 28645f0640f7..000000000000
--- a/fs/exofs/file.c
+++ /dev/null
@@ -1,83 +0,0 @@
-/*
- * Copyright (C) 2005, 2006
- * Avishay Traeger (avishay@gmail.com)
- * Copyright (C) 2008, 2009
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * Copyrights for code taken from ext2:
- * Copyright (C) 1992, 1993, 1994, 1995
- * Remy Card (card@masi.ibp.fr)
- * Laboratoire MASI - Institut Blaise Pascal
- * Universite Pierre et Marie Curie (Paris VI)
- * from
- * linux/fs/minix/inode.c
- * Copyright (C) 1991, 1992 Linus Torvalds
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation. Since it is based on ext2, and the only
- * valid version of GPL for the Linux kernel is version 2, the only valid
- * version of GPL for exofs is version 2.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-#include "exofs.h"
-
-static int exofs_release_file(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
-/* exofs_file_fsync - flush the inode to disk
- *
- * Note, in exofs all metadata is written as part of inode, regardless.
- * The writeout is synchronous
- */
-static int exofs_file_fsync(struct file *filp, loff_t start, loff_t end,
- int datasync)
-{
- struct inode *inode = filp->f_mapping->host;
- int ret;
-
- ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
- if (ret)
- return ret;
-
- inode_lock(inode);
- ret = sync_inode_metadata(filp->f_mapping->host, 1);
- inode_unlock(inode);
- return ret;
-}
-
-static int exofs_flush(struct file *file, fl_owner_t id)
-{
- int ret = vfs_fsync(file, 0);
- /* TODO: Flush the OSD target */
- return ret;
-}
-
-const struct file_operations exofs_file_operations = {
- .llseek = generic_file_llseek,
- .read_iter = generic_file_read_iter,
- .write_iter = generic_file_write_iter,
- .mmap = generic_file_mmap,
- .open = generic_file_open,
- .release = exofs_release_file,
- .fsync = exofs_file_fsync,
- .flush = exofs_flush,
- .splice_read = generic_file_splice_read,
- .splice_write = iter_file_splice_write,
-};
-
-const struct inode_operations exofs_file_inode_operations = {
- .setattr = exofs_setattr,
-};
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
deleted file mode 100644
index 0ac62811b341..000000000000
--- a/fs/exofs/inode.c
+++ /dev/null
@@ -1,1514 +0,0 @@
-/*
- * Copyright (C) 2005, 2006
- * Avishay Traeger (avishay@gmail.com)
- * Copyright (C) 2008, 2009
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * Copyrights for code taken from ext2:
- * Copyright (C) 1992, 1993, 1994, 1995
- * Remy Card (card@masi.ibp.fr)
- * Laboratoire MASI - Institut Blaise Pascal
- * Universite Pierre et Marie Curie (Paris VI)
- * from
- * linux/fs/minix/inode.c
- * Copyright (C) 1991, 1992 Linus Torvalds
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation. Since it is based on ext2, and the only
- * valid version of GPL for the Linux kernel is version 2, the only valid
- * version of GPL for exofs is version 2.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-
-#include <linux/slab.h>
-
-#include "exofs.h"
-
-#define EXOFS_DBGMSG2(M...) do {} while (0)
-
-unsigned exofs_max_io_pages(struct ore_layout *layout,
- unsigned expected_pages)
-{
- unsigned pages = min_t(unsigned, expected_pages,
- layout->max_io_length / PAGE_SIZE);
-
- return pages;
-}
-
-struct page_collect {
- struct exofs_sb_info *sbi;
- struct inode *inode;
- unsigned expected_pages;
- struct ore_io_state *ios;
-
- struct page **pages;
- unsigned alloc_pages;
- unsigned nr_pages;
- unsigned long length;
- loff_t pg_first; /* keep 64bit also in 32-arches */
- bool read_4_write; /* This means two things: that the read is sync
- * And the pages should not be unlocked.
- */
- struct page *that_locked_page;
-};
-
-static void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
- struct inode *inode)
-{
- struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
-
- pcol->sbi = sbi;
- pcol->inode = inode;
- pcol->expected_pages = expected_pages;
-
- pcol->ios = NULL;
- pcol->pages = NULL;
- pcol->alloc_pages = 0;
- pcol->nr_pages = 0;
- pcol->length = 0;
- pcol->pg_first = -1;
- pcol->read_4_write = false;
- pcol->that_locked_page = NULL;
-}
-
-static void _pcol_reset(struct page_collect *pcol)
-{
- pcol->expected_pages -= min(pcol->nr_pages, pcol->expected_pages);
-
- pcol->pages = NULL;
- pcol->alloc_pages = 0;
- pcol->nr_pages = 0;
- pcol->length = 0;
- pcol->pg_first = -1;
- pcol->ios = NULL;
- pcol->that_locked_page = NULL;
-
- /* this is probably the end of the loop but in writes
- * it might not end here. don't be left with nothing
- */
- if (!pcol->expected_pages)
- pcol->expected_pages =
- exofs_max_io_pages(&pcol->sbi->layout, ~0);
-}
-
-static int pcol_try_alloc(struct page_collect *pcol)
-{
- unsigned pages;
-
- /* TODO: easily support bio chaining */
- pages = exofs_max_io_pages(&pcol->sbi->layout, pcol->expected_pages);
-
- for (; pages; pages >>= 1) {
- pcol->pages = kmalloc(pages * sizeof(struct page *),
- GFP_KERNEL);
- if (likely(pcol->pages)) {
- pcol->alloc_pages = pages;
- return 0;
- }
- }
-
- EXOFS_ERR("Failed to kmalloc expected_pages=%u\n",
- pcol->expected_pages);
- return -ENOMEM;
-}
-
-static void pcol_free(struct page_collect *pcol)
-{
- kfree(pcol->pages);
- pcol->pages = NULL;
-
- if (pcol->ios) {
- ore_put_io_state(pcol->ios);
- pcol->ios = NULL;
- }
-}
-
-static int pcol_add_page(struct page_collect *pcol, struct page *page,
- unsigned len)
-{
- if (unlikely(pcol->nr_pages >= pcol->alloc_pages))
- return -ENOMEM;
-
- pcol->pages[pcol->nr_pages++] = page;
- pcol->length += len;
- return 0;
-}
-
-enum {PAGE_WAS_NOT_IN_IO = 17};
-static int update_read_page(struct page *page, int ret)
-{
- switch (ret) {
- case 0:
- /* Everything is OK */
- SetPageUptodate(page);
- if (PageError(page))
- ClearPageError(page);
- break;
- case -EFAULT:
- /* In this case we were trying to read something that wasn't on
- * disk yet - return a page full of zeroes. This should be OK,
- * because the object should be empty (if there was a write
- * before this read, the read would be waiting with the page
- * locked */
- clear_highpage(page);
-
- SetPageUptodate(page);
- if (PageError(page))
- ClearPageError(page);
- EXOFS_DBGMSG("recovered read error\n");
- /* fall through */
- case PAGE_WAS_NOT_IN_IO:
- ret = 0; /* recovered error */
- break;
- default:
- SetPageError(page);
- }
- return ret;
-}
-
-static void update_write_page(struct page *page, int ret)
-{
- if (unlikely(ret == PAGE_WAS_NOT_IN_IO))
- return; /* don't pass start don't collect $200 */
-
- if (ret) {
- mapping_set_error(page->mapping, ret);
- SetPageError(page);
- }
- end_page_writeback(page);
-}
-
-/* Called at the end of reads, to optionally unlock pages and update their
- * status.
- */
-static int __readpages_done(struct page_collect *pcol)
-{
- int i;
- u64 good_bytes;
- u64 length = 0;
- int ret = ore_check_io(pcol->ios, NULL);
-
- if (likely(!ret)) {
- good_bytes = pcol->length;
- ret = PAGE_WAS_NOT_IN_IO;
- } else {
- good_bytes = 0;
- }
-
- EXOFS_DBGMSG2("readpages_done(0x%lx) good_bytes=0x%llx"
- " length=0x%lx nr_pages=%u\n",
- pcol->inode->i_ino, _LLU(good_bytes), pcol->length,
- pcol->nr_pages);
-
- for (i = 0; i < pcol->nr_pages; i++) {
- struct page *page = pcol->pages[i];
- struct inode *inode = page->mapping->host;
- int page_stat;
-
- if (inode != pcol->inode)
- continue; /* osd might add more pages at end */
-
- if (likely(length < good_bytes))
- page_stat = 0;
- else
- page_stat = ret;
-
- EXOFS_DBGMSG2(" readpages_done(0x%lx, 0x%lx) %s\n",
- inode->i_ino, page->index,
- page_stat ? "bad_bytes" : "good_bytes");
-
- ret = update_read_page(page, page_stat);
- if (!pcol->read_4_write)
- unlock_page(page);
- length += PAGE_SIZE;
- }
-
- pcol_free(pcol);
- EXOFS_DBGMSG2("readpages_done END\n");
- return ret;
-}
-
-/* callback of async reads */
-static void readpages_done(struct ore_io_state *ios, void *p)
-{
- struct page_collect *pcol = p;
-
- __readpages_done(pcol);
- atomic_dec(&pcol->sbi->s_curr_pending);
- kfree(pcol);
-}
-
-static void _unlock_pcol_pages(struct page_collect *pcol, int ret, int rw)
-{
- int i;
-
- for (i = 0; i < pcol->nr_pages; i++) {
- struct page *page = pcol->pages[i];
-
- if (rw == READ)
- update_read_page(page, ret);
- else
- update_write_page(page, ret);
-
- unlock_page(page);
- }
-}
-
-static int _maybe_not_all_in_one_io(struct ore_io_state *ios,
- struct page_collect *pcol_src, struct page_collect *pcol)
-{
- /* length was wrong or offset was not page aligned */
- BUG_ON(pcol_src->nr_pages < ios->nr_pages);
-
- if (pcol_src->nr_pages > ios->nr_pages) {
- struct page **src_page;
- unsigned pages_less = pcol_src->nr_pages - ios->nr_pages;
- unsigned long len_less = pcol_src->length - ios->length;
- unsigned i;
- int ret;
-
- /* This IO was trimmed */
- pcol_src->nr_pages = ios->nr_pages;
- pcol_src->length = ios->length;
-
- /* Left over pages are passed to the next io */
- pcol->expected_pages += pages_less;
- pcol->nr_pages = pages_less;
- pcol->length = len_less;
- src_page = pcol_src->pages + pcol_src->nr_pages;
- pcol->pg_first = (*src_page)->index;
-
- ret = pcol_try_alloc(pcol);
- if (unlikely(ret))
- return ret;
-
- for (i = 0; i < pages_less; ++i)
- pcol->pages[i] = *src_page++;
-
- EXOFS_DBGMSG("Length was adjusted nr_pages=0x%x "
- "pages_less=0x%x expected_pages=0x%x "
- "next_offset=0x%llx next_len=0x%lx\n",
- pcol_src->nr_pages, pages_less, pcol->expected_pages,
- pcol->pg_first * PAGE_SIZE, pcol->length);
- }
- return 0;
-}
-
-static int read_exec(struct page_collect *pcol)
-{
- struct exofs_i_info *oi = exofs_i(pcol->inode);
- struct ore_io_state *ios;
- struct page_collect *pcol_copy = NULL;
- int ret;
-
- if (!pcol->pages)
- return 0;
-
- if (!pcol->ios) {
- int ret = ore_get_rw_state(&pcol->sbi->layout, &oi->oc, true,
- pcol->pg_first << PAGE_SHIFT,
- pcol->length, &pcol->ios);
-
- if (ret)
- return ret;
- }
-
- ios = pcol->ios;
- ios->pages = pcol->pages;
-
- if (pcol->read_4_write) {
- ore_read(pcol->ios);
- return __readpages_done(pcol);
- }
-
- pcol_copy = kmalloc(sizeof(*pcol_copy), GFP_KERNEL);
- if (!pcol_copy) {
- ret = -ENOMEM;
- goto err;
- }
-
- *pcol_copy = *pcol;
- ios->done = readpages_done;
- ios->private = pcol_copy;
-
- /* pages ownership was passed to pcol_copy */
- _pcol_reset(pcol);
-
- ret = _maybe_not_all_in_one_io(ios, pcol_copy, pcol);
- if (unlikely(ret))
- goto err;
-
- EXOFS_DBGMSG2("read_exec(0x%lx) offset=0x%llx length=0x%llx\n",
- pcol->inode->i_ino, _LLU(ios->offset), _LLU(ios->length));
-
- ret = ore_read(ios);
- if (unlikely(ret))
- goto err;
-
- atomic_inc(&pcol->sbi->s_curr_pending);
-
- return 0;
-
-err:
- if (!pcol_copy) /* Failed before ownership transfer */
- pcol_copy = pcol;
- _unlock_pcol_pages(pcol_copy, ret, READ);
- pcol_free(pcol_copy);
- kfree(pcol_copy);
-
- return ret;
-}
-
-/* readpage_strip is called either directly from readpage() or by the VFS from
- * within read_cache_pages(), to add one more page to be read. It will try to
- * collect as many contiguous pages as posible. If a discontinuity is
- * encountered, or it runs out of resources, it will submit the previous segment
- * and will start a new collection. Eventually caller must submit the last
- * segment if present.
- */
-static int readpage_strip(void *data, struct page *page)
-{
- struct page_collect *pcol = data;
- struct inode *inode = pcol->inode;
- struct exofs_i_info *oi = exofs_i(inode);
- loff_t i_size = i_size_read(inode);
- pgoff_t end_index = i_size >> PAGE_SHIFT;
- size_t len;
- int ret;
-
- BUG_ON(!PageLocked(page));
-
- /* FIXME: Just for debugging, will be removed */
- if (PageUptodate(page))
- EXOFS_ERR("PageUptodate(0x%lx, 0x%lx)\n", pcol->inode->i_ino,
- page->index);
-
- pcol->that_locked_page = page;
-
- if (page->index < end_index)
- len = PAGE_SIZE;
- else if (page->index == end_index)
- len = i_size & ~PAGE_MASK;
- else
- len = 0;
-
- if (!len || !obj_created(oi)) {
- /* this will be out of bounds, or doesn't exist yet.
- * Current page is cleared and the request is split
- */
- clear_highpage(page);
-
- SetPageUptodate(page);
- if (PageError(page))
- ClearPageError(page);
-
- if (!pcol->read_4_write)
- unlock_page(page);
- EXOFS_DBGMSG("readpage_strip(0x%lx) empty page len=%zx "
- "read_4_write=%d index=0x%lx end_index=0x%lx "
- "splitting\n", inode->i_ino, len,
- pcol->read_4_write, page->index, end_index);
-
- return read_exec(pcol);
- }
-
-try_again:
-
- if (unlikely(pcol->pg_first == -1)) {
- pcol->pg_first = page->index;
- } else if (unlikely((pcol->pg_first + pcol->nr_pages) !=
- page->index)) {
- /* Discontinuity detected, split the request */
- ret = read_exec(pcol);
- if (unlikely(ret))
- goto fail;
- goto try_again;
- }
-
- if (!pcol->pages) {
- ret = pcol_try_alloc(pcol);
- if (unlikely(ret))
- goto fail;
- }
-
- if (len != PAGE_SIZE)
- zero_user(page, len, PAGE_SIZE - len);
-
- EXOFS_DBGMSG2(" readpage_strip(0x%lx, 0x%lx) len=0x%zx\n",
- inode->i_ino, page->index, len);
-
- ret = pcol_add_page(pcol, page, len);
- if (ret) {
- EXOFS_DBGMSG2("Failed pcol_add_page pages[i]=%p "
- "this_len=0x%zx nr_pages=%u length=0x%lx\n",
- page, len, pcol->nr_pages, pcol->length);
-
- /* split the request, and start again with current page */
- ret = read_exec(pcol);
- if (unlikely(ret))
- goto fail;
-
- goto try_again;
- }
-
- return 0;
-
-fail:
- /* SetPageError(page); ??? */
- unlock_page(page);
- return ret;
-}
-
-static int exofs_readpages(struct file *file, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages)
-{
- struct page_collect pcol;
- int ret;
-
- _pcol_init(&pcol, nr_pages, mapping->host);
-
- ret = read_cache_pages(mapping, pages, readpage_strip, &pcol);
- if (ret) {
- EXOFS_ERR("read_cache_pages => %d\n", ret);
- return ret;
- }
-
- ret = read_exec(&pcol);
- if (unlikely(ret))
- return ret;
-
- return read_exec(&pcol);
-}
-
-static int _readpage(struct page *page, bool read_4_write)
-{
- struct page_collect pcol;
- int ret;
-
- _pcol_init(&pcol, 1, page->mapping->host);
-
- pcol.read_4_write = read_4_write;
- ret = readpage_strip(&pcol, page);
- if (ret) {
- EXOFS_ERR("_readpage => %d\n", ret);
- return ret;
- }
-
- return read_exec(&pcol);
-}
-
-/*
- * We don't need the file
- */
-static int exofs_readpage(struct file *file, struct page *page)
-{
- return _readpage(page, false);
-}
-
-/* Callback for osd_write. All writes are asynchronous */
-static void writepages_done(struct ore_io_state *ios, void *p)
-{
- struct page_collect *pcol = p;
- int i;
- u64 good_bytes;
- u64 length = 0;
- int ret = ore_check_io(ios, NULL);
-
- atomic_dec(&pcol->sbi->s_curr_pending);
-
- if (likely(!ret)) {
- good_bytes = pcol->length;
- ret = PAGE_WAS_NOT_IN_IO;
- } else {
- good_bytes = 0;
- }
-
- EXOFS_DBGMSG2("writepages_done(0x%lx) good_bytes=0x%llx"
- " length=0x%lx nr_pages=%u\n",
- pcol->inode->i_ino, _LLU(good_bytes), pcol->length,
- pcol->nr_pages);
-
- for (i = 0; i < pcol->nr_pages; i++) {
- struct page *page = pcol->pages[i];
- struct inode *inode = page->mapping->host;
- int page_stat;
-
- if (inode != pcol->inode)
- continue; /* osd might add more pages to a bio */
-
- if (likely(length < good_bytes))
- page_stat = 0;
- else
- page_stat = ret;
-
- update_write_page(page, page_stat);
- unlock_page(page);
- EXOFS_DBGMSG2(" writepages_done(0x%lx, 0x%lx) status=%d\n",
- inode->i_ino, page->index, page_stat);
-
- length += PAGE_SIZE;
- }
-
- pcol_free(pcol);
- kfree(pcol);
- EXOFS_DBGMSG2("writepages_done END\n");
-}
-
-static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
-{
- struct page_collect *pcol = priv;
- pgoff_t index = offset / PAGE_SIZE;
-
- if (!pcol->that_locked_page ||
- (pcol->that_locked_page->index != index)) {
- struct page *page;
- loff_t i_size = i_size_read(pcol->inode);
-
- if (offset >= i_size) {
- *uptodate = true;
- EXOFS_DBGMSG2("offset >= i_size index=0x%lx\n", index);
- return ZERO_PAGE(0);
- }
-
- page = find_get_page(pcol->inode->i_mapping, index);
- if (!page) {
- page = find_or_create_page(pcol->inode->i_mapping,
- index, GFP_NOFS);
- if (unlikely(!page)) {
- EXOFS_DBGMSG("grab_cache_page Failed "
- "index=0x%llx\n", _LLU(index));
- return NULL;
- }
- unlock_page(page);
- }
- *uptodate = PageUptodate(page);
- EXOFS_DBGMSG2("index=0x%lx uptodate=%d\n", index, *uptodate);
- return page;
- } else {
- EXOFS_DBGMSG2("YES that_locked_page index=0x%lx\n",
- pcol->that_locked_page->index);
- *uptodate = true;
- return pcol->that_locked_page;
- }
-}
-
-static void __r4w_put_page(void *priv, struct page *page)
-{
- struct page_collect *pcol = priv;
-
- if ((pcol->that_locked_page != page) && (ZERO_PAGE(0) != page)) {
- EXOFS_DBGMSG2("index=0x%lx\n", page->index);
- put_page(page);
- return;
- }
- EXOFS_DBGMSG2("that_locked_page index=0x%lx\n",
- ZERO_PAGE(0) == page ? -1 : page->index);
-}
-
-static const struct _ore_r4w_op _r4w_op = {
- .get_page = &__r4w_get_page,
- .put_page = &__r4w_put_page,
-};
-
-static int write_exec(struct page_collect *pcol)
-{
- struct exofs_i_info *oi = exofs_i(pcol->inode);
- struct ore_io_state *ios;
- struct page_collect *pcol_copy = NULL;
- int ret;
-
- if (!pcol->pages)
- return 0;
-
- BUG_ON(pcol->ios);
- ret = ore_get_rw_state(&pcol->sbi->layout, &oi->oc, false,
- pcol->pg_first << PAGE_SHIFT,
- pcol->length, &pcol->ios);
- if (unlikely(ret))
- goto err;
-
- pcol_copy = kmalloc(sizeof(*pcol_copy), GFP_KERNEL);
- if (!pcol_copy) {
- EXOFS_ERR("write_exec: Failed to kmalloc(pcol)\n");
- ret = -ENOMEM;
- goto err;
- }
-
- *pcol_copy = *pcol;
-
- ios = pcol->ios;
- ios->pages = pcol_copy->pages;
- ios->done = writepages_done;
- ios->r4w = &_r4w_op;
- ios->private = pcol_copy;
-
- /* pages ownership was passed to pcol_copy */
- _pcol_reset(pcol);
-
- ret = _maybe_not_all_in_one_io(ios, pcol_copy, pcol);
- if (unlikely(ret))
- goto err;
-
- EXOFS_DBGMSG2("write_exec(0x%lx) offset=0x%llx length=0x%llx\n",
- pcol->inode->i_ino, _LLU(ios->offset), _LLU(ios->length));
-
- ret = ore_write(ios);
- if (unlikely(ret)) {
- EXOFS_ERR("write_exec: ore_write() Failed\n");
- goto err;
- }
-
- atomic_inc(&pcol->sbi->s_curr_pending);
- return 0;
-
-err:
- if (!pcol_copy) /* Failed before ownership transfer */
- pcol_copy = pcol;
- _unlock_pcol_pages(pcol_copy, ret, WRITE);
- pcol_free(pcol_copy);
- kfree(pcol_copy);
-
- return ret;
-}
-
-/* writepage_strip is called either directly from writepage() or by the VFS from
- * within write_cache_pages(), to add one more page to be written to storage.
- * It will try to collect as many contiguous pages as possible. If a
- * discontinuity is encountered or it runs out of resources it will submit the
- * previous segment and will start a new collection.
- * Eventually caller must submit the last segment if present.
- */
-static int writepage_strip(struct page *page,
- struct writeback_control *wbc_unused, void *data)
-{
- struct page_collect *pcol = data;
- struct inode *inode = pcol->inode;
- struct exofs_i_info *oi = exofs_i(inode);
- loff_t i_size = i_size_read(inode);
- pgoff_t end_index = i_size >> PAGE_SHIFT;
- size_t len;
- int ret;
-
- BUG_ON(!PageLocked(page));
-
- ret = wait_obj_created(oi);
- if (unlikely(ret))
- goto fail;
-
- if (page->index < end_index)
- /* in this case, the page is within the limits of the file */
- len = PAGE_SIZE;
- else {
- len = i_size & ~PAGE_MASK;
-
- if (page->index > end_index || !len) {
- /* in this case, the page is outside the limits
- * (truncate in progress)
- */
- ret = write_exec(pcol);
- if (unlikely(ret))
- goto fail;
- if (PageError(page))
- ClearPageError(page);
- unlock_page(page);
- EXOFS_DBGMSG("writepage_strip(0x%lx, 0x%lx) "
- "outside the limits\n",
- inode->i_ino, page->index);
- return 0;
- }
- }
-
-try_again:
-
- if (unlikely(pcol->pg_first == -1)) {
- pcol->pg_first = page->index;
- } else if (unlikely((pcol->pg_first + pcol->nr_pages) !=
- page->index)) {
- /* Discontinuity detected, split the request */
- ret = write_exec(pcol);
- if (unlikely(ret))
- goto fail;
-
- EXOFS_DBGMSG("writepage_strip(0x%lx, 0x%lx) Discontinuity\n",
- inode->i_ino, page->index);
- goto try_again;
- }
-
- if (!pcol->pages) {
- ret = pcol_try_alloc(pcol);
- if (unlikely(ret))
- goto fail;
- }
-
- EXOFS_DBGMSG2(" writepage_strip(0x%lx, 0x%lx) len=0x%zx\n",
- inode->i_ino, page->index, len);
-
- ret = pcol_add_page(pcol, page, len);
- if (unlikely(ret)) {
- EXOFS_DBGMSG2("Failed pcol_add_page "
- "nr_pages=%u total_length=0x%lx\n",
- pcol->nr_pages, pcol->length);
-
- /* split the request, next loop will start again */
- ret = write_exec(pcol);
- if (unlikely(ret)) {
- EXOFS_DBGMSG("write_exec failed => %d", ret);
- goto fail;
- }
-
- goto try_again;
- }
-
- BUG_ON(PageWriteback(page));
- set_page_writeback(page);
-
- return 0;
-
-fail:
- EXOFS_DBGMSG("Error: writepage_strip(0x%lx, 0x%lx)=>%d\n",
- inode->i_ino, page->index, ret);
- mapping_set_error(page->mapping, -EIO);
- unlock_page(page);
- return ret;
-}
-
-static int exofs_writepages(struct address_space *mapping,
- struct writeback_control *wbc)
-{
- struct page_collect pcol;
- long start, end, expected_pages;
- int ret;
-
- start = wbc->range_start >> PAGE_SHIFT;
- end = (wbc->range_end == LLONG_MAX) ?
- start + mapping->nrpages :
- wbc->range_end >> PAGE_SHIFT;
-
- if (start || end)
- expected_pages = end - start + 1;
- else
- expected_pages = mapping->nrpages;
-
- if (expected_pages < 32L)
- expected_pages = 32L;
-
- EXOFS_DBGMSG2("inode(0x%lx) wbc->start=0x%llx wbc->end=0x%llx "
- "nrpages=%lu start=0x%lx end=0x%lx expected_pages=%ld\n",
- mapping->host->i_ino, wbc->range_start, wbc->range_end,
- mapping->nrpages, start, end, expected_pages);
-
- _pcol_init(&pcol, expected_pages, mapping->host);
-
- ret = write_cache_pages(mapping, wbc, writepage_strip, &pcol);
- if (unlikely(ret)) {
- EXOFS_ERR("write_cache_pages => %d\n", ret);
- return ret;
- }
-
- ret = write_exec(&pcol);
- if (unlikely(ret))
- return ret;
-
- if (wbc->sync_mode == WB_SYNC_ALL) {
- return write_exec(&pcol); /* pump the last reminder */
- } else if (pcol.nr_pages) {
- /* not SYNC let the reminder join the next writeout */
- unsigned i;
-
- for (i = 0; i < pcol.nr_pages; i++) {
- struct page *page = pcol.pages[i];
-
- end_page_writeback(page);
- set_page_dirty(page);
- unlock_page(page);
- }
- }
- return 0;
-}
-
-/*
-static int exofs_writepage(struct page *page, struct writeback_control *wbc)
-{
- struct page_collect pcol;
- int ret;
-
- _pcol_init(&pcol, 1, page->mapping->host);
-
- ret = writepage_strip(page, NULL, &pcol);
- if (ret) {
- EXOFS_ERR("exofs_writepage => %d\n", ret);
- return ret;
- }
-
- return write_exec(&pcol);
-}
-*/
-/* i_mutex held using inode->i_size directly */
-static void _write_failed(struct inode *inode, loff_t to)
-{
- if (to > inode->i_size)
- truncate_pagecache(inode, inode->i_size);
-}
-
-int exofs_write_begin(struct file *file, struct address_space *mapping,
- loff_t pos, unsigned len, unsigned flags,
- struct page **pagep, void **fsdata)
-{
- int ret = 0;
- struct page *page;
-
- page = *pagep;
- if (page == NULL) {
- page = grab_cache_page_write_begin(mapping, pos >> PAGE_SHIFT,
- flags);
- if (!page) {
- EXOFS_DBGMSG("grab_cache_page_write_begin failed\n");
- return -ENOMEM;
- }
- *pagep = page;
- }
-
- /* read modify write */
- if (!PageUptodate(page) && (len != PAGE_SIZE)) {
- loff_t i_size = i_size_read(mapping->host);
- pgoff_t end_index = i_size >> PAGE_SHIFT;
-
- if (page->index > end_index) {
- clear_highpage(page);
- SetPageUptodate(page);
- } else {
- ret = _readpage(page, true);
- if (ret) {
- unlock_page(page);
- EXOFS_DBGMSG("__readpage failed\n");
- }
- }
- }
- return ret;
-}
-
-static int exofs_write_begin_export(struct file *file,
- struct address_space *mapping,
- loff_t pos, unsigned len, unsigned flags,
- struct page **pagep, void **fsdata)
-{
- *pagep = NULL;
-
- return exofs_write_begin(file, mapping, pos, len, flags, pagep,
- fsdata);
-}
-
-static int exofs_write_end(struct file *file, struct address_space *mapping,
- loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata)
-{
- struct inode *inode = mapping->host;
- loff_t last_pos = pos + copied;
-
- if (!PageUptodate(page)) {
- if (copied < len) {
- _write_failed(inode, pos + len);
- copied = 0;
- goto out;
- }
- SetPageUptodate(page);
- }
- if (last_pos > inode->i_size) {
- i_size_write(inode, last_pos);
- mark_inode_dirty(inode);
- }
- set_page_dirty(page);
-out:
- unlock_page(page);
- put_page(page);
- return copied;
-}
-
-static int exofs_releasepage(struct page *page, gfp_t gfp)
-{
- EXOFS_DBGMSG("page 0x%lx\n", page->index);
- WARN_ON(1);
- return 0;
-}
-
-static void exofs_invalidatepage(struct page *page, unsigned int offset,
- unsigned int length)
-{
- EXOFS_DBGMSG("page 0x%lx offset 0x%x length 0x%x\n",
- page->index, offset, length);
- WARN_ON(1);
-}
-
-
- /* TODO: Should be easy enough to do proprly */
-static ssize_t exofs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- return 0;
-}
-
-const struct address_space_operations exofs_aops = {
- .readpage = exofs_readpage,
- .readpages = exofs_readpages,
- .writepage = NULL,
- .writepages = exofs_writepages,
- .write_begin = exofs_write_begin_export,
- .write_end = exofs_write_end,
- .releasepage = exofs_releasepage,
- .set_page_dirty = __set_page_dirty_nobuffers,
- .invalidatepage = exofs_invalidatepage,
-
- /* Not implemented Yet */
- .bmap = NULL, /* TODO: use osd's OSD_ACT_READ_MAP */
- .direct_IO = exofs_direct_IO,
-
- /* With these NULL has special meaning or default is not exported */
- .migratepage = NULL,
- .launder_page = NULL,
- .is_partially_uptodate = NULL,
- .error_remove_page = NULL,
-};
-
-/******************************************************************************
- * INODE OPERATIONS
- *****************************************************************************/
-
-/*
- * Test whether an inode is a fast symlink.
- */
-static inline int exofs_inode_is_fast_symlink(struct inode *inode)
-{
- struct exofs_i_info *oi = exofs_i(inode);
-
- return S_ISLNK(inode->i_mode) && (oi->i_data[0] != 0);
-}
-
-static int _do_truncate(struct inode *inode, loff_t newsize)
-{
- struct exofs_i_info *oi = exofs_i(inode);
- struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
- int ret;
-
- inode->i_mtime = inode->i_ctime = current_time(inode);
-
- ret = ore_truncate(&sbi->layout, &oi->oc, (u64)newsize);
- if (likely(!ret))
- truncate_setsize(inode, newsize);
-
- EXOFS_DBGMSG2("(0x%lx) size=0x%llx ret=>%d\n",
- inode->i_ino, newsize, ret);
- return ret;
-}
-
-/*
- * Set inode attributes - update size attribute on OSD if needed,
- * otherwise just call generic functions.
- */
-int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
-{
- struct inode *inode = d_inode(dentry);
- int error;
-
- /* if we are about to modify an object, and it hasn't been
- * created yet, wait
- */
- error = wait_obj_created(exofs_i(inode));
- if (unlikely(error))
- return error;
-
- error = setattr_prepare(dentry, iattr);
- if (unlikely(error))
- return error;
-
- if ((iattr->ia_valid & ATTR_SIZE) &&
- iattr->ia_size != i_size_read(inode)) {
- error = _do_truncate(inode, iattr->ia_size);
- if (unlikely(error))
- return error;
- }
-
- setattr_copy(inode, iattr);
- mark_inode_dirty(inode);
- return 0;
-}
-
-static const struct osd_attr g_attr_inode_file_layout = ATTR_DEF(
- EXOFS_APAGE_FS_DATA,
- EXOFS_ATTR_INODE_FILE_LAYOUT,
- 0);
-static const struct osd_attr g_attr_inode_dir_layout = ATTR_DEF(
- EXOFS_APAGE_FS_DATA,
- EXOFS_ATTR_INODE_DIR_LAYOUT,
- 0);
-
-/*
- * Read the Linux inode info from the OSD, and return it as is. In exofs the
- * inode info is in an application specific page/attribute of the osd-object.
- */
-static int exofs_get_inode(struct super_block *sb, struct exofs_i_info *oi,
- struct exofs_fcb *inode)
-{
- struct exofs_sb_info *sbi = sb->s_fs_info;
- struct osd_attr attrs[] = {
- [0] = g_attr_inode_data,
- [1] = g_attr_inode_file_layout,
- [2] = g_attr_inode_dir_layout,
- };
- struct ore_io_state *ios;
- struct exofs_on_disk_inode_layout *layout;
- int ret;
-
- ret = ore_get_io_state(&sbi->layout, &oi->oc, &ios);
- if (unlikely(ret)) {
- EXOFS_ERR("%s: ore_get_io_state failed.\n", __func__);
- return ret;
- }
-
- attrs[1].len = exofs_on_disk_inode_layout_size(sbi->oc.numdevs);
- attrs[2].len = exofs_on_disk_inode_layout_size(sbi->oc.numdevs);
-
- ios->in_attr = attrs;
- ios->in_attr_len = ARRAY_SIZE(attrs);
-
- ret = ore_read(ios);
- if (unlikely(ret)) {
- EXOFS_ERR("object(0x%llx) corrupted, return empty file=>%d\n",
- _LLU(oi->one_comp.obj.id), ret);
- memset(inode, 0, sizeof(*inode));
- inode->i_mode = 0040000 | (0777 & ~022);
- /* If object is lost on target we might as well enable it's
- * delete.
- */
- ret = 0;
- goto out;
- }
-
- ret = extract_attr_from_ios(ios, &attrs[0]);
- if (ret) {
- EXOFS_ERR("%s: extract_attr 0 of inode failed\n", __func__);
- goto out;
- }
- WARN_ON(attrs[0].len != EXOFS_INO_ATTR_SIZE);
- memcpy(inode, attrs[0].val_ptr, EXOFS_INO_ATTR_SIZE);
-
- ret = extract_attr_from_ios(ios, &attrs[1]);
- if (ret) {
- EXOFS_ERR("%s: extract_attr 1 of inode failed\n", __func__);
- goto out;
- }
- if (attrs[1].len) {
- layout = attrs[1].val_ptr;
- if (layout->gen_func != cpu_to_le16(LAYOUT_MOVING_WINDOW)) {
- EXOFS_ERR("%s: unsupported files layout %d\n",
- __func__, layout->gen_func);
- ret = -ENOTSUPP;
- goto out;
- }
- }
-
- ret = extract_attr_from_ios(ios, &attrs[2]);
- if (ret) {
- EXOFS_ERR("%s: extract_attr 2 of inode failed\n", __func__);
- goto out;
- }
- if (attrs[2].len) {
- layout = attrs[2].val_ptr;
- if (layout->gen_func != cpu_to_le16(LAYOUT_MOVING_WINDOW)) {
- EXOFS_ERR("%s: unsupported meta-data layout %d\n",
- __func__, layout->gen_func);
- ret = -ENOTSUPP;
- goto out;
- }
- }
-
-out:
- ore_put_io_state(ios);
- return ret;
-}
-
-static void __oi_init(struct exofs_i_info *oi)
-{
- init_waitqueue_head(&oi->i_wq);
- oi->i_flags = 0;
-}
-/*
- * Fill in an inode read from the OSD and set it up for use
- */
-struct inode *exofs_iget(struct super_block *sb, unsigned long ino)
-{
- struct exofs_i_info *oi;
- struct exofs_fcb fcb;
- struct inode *inode;
- int ret;
-
- inode = iget_locked(sb, ino);
- if (!inode)
- return ERR_PTR(-ENOMEM);
- if (!(inode->i_state & I_NEW))
- return inode;
- oi = exofs_i(inode);
- __oi_init(oi);
- exofs_init_comps(&oi->oc, &oi->one_comp, sb->s_fs_info,
- exofs_oi_objno(oi));
-
- /* read the inode from the osd */
- ret = exofs_get_inode(sb, oi, &fcb);
- if (ret)
- goto bad_inode;
-
- set_obj_created(oi);
-
- /* copy stuff from on-disk struct to in-memory struct */
- inode->i_mode = le16_to_cpu(fcb.i_mode);
- i_uid_write(inode, le32_to_cpu(fcb.i_uid));
- i_gid_write(inode, le32_to_cpu(fcb.i_gid));
- set_nlink(inode, le16_to_cpu(fcb.i_links_count));
- inode->i_ctime.tv_sec = (signed)le32_to_cpu(fcb.i_ctime);
- inode->i_atime.tv_sec = (signed)le32_to_cpu(fcb.i_atime);
- inode->i_mtime.tv_sec = (signed)le32_to_cpu(fcb.i_mtime);
- inode->i_ctime.tv_nsec =
- inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec = 0;
- oi->i_commit_size = le64_to_cpu(fcb.i_size);
- i_size_write(inode, oi->i_commit_size);
- inode->i_blkbits = EXOFS_BLKSHIFT;
- inode->i_generation = le32_to_cpu(fcb.i_generation);
-
- oi->i_dir_start_lookup = 0;
-
- if ((inode->i_nlink == 0) && (inode->i_mode == 0)) {
- ret = -ESTALE;
- goto bad_inode;
- }
-
- if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
- if (fcb.i_data[0])
- inode->i_rdev =
- old_decode_dev(le32_to_cpu(fcb.i_data[0]));
- else
- inode->i_rdev =
- new_decode_dev(le32_to_cpu(fcb.i_data[1]));
- } else {
- memcpy(oi->i_data, fcb.i_data, sizeof(fcb.i_data));
- }
-
- if (S_ISREG(inode->i_mode)) {
- inode->i_op = &exofs_file_inode_operations;
- inode->i_fop = &exofs_file_operations;
- inode->i_mapping->a_ops = &exofs_aops;
- } else if (S_ISDIR(inode->i_mode)) {
- inode->i_op = &exofs_dir_inode_operations;
- inode->i_fop = &exofs_dir_operations;
- inode->i_mapping->a_ops = &exofs_aops;
- } else if (S_ISLNK(inode->i_mode)) {
- if (exofs_inode_is_fast_symlink(inode)) {
- inode->i_op = &simple_symlink_inode_operations;
- inode->i_link = (char *)oi->i_data;
- } else {
- inode->i_op = &page_symlink_inode_operations;
- inode_nohighmem(inode);
- inode->i_mapping->a_ops = &exofs_aops;
- }
- } else {
- inode->i_op = &exofs_special_inode_operations;
- if (fcb.i_data[0])
- init_special_inode(inode, inode->i_mode,
- old_decode_dev(le32_to_cpu(fcb.i_data[0])));
- else
- init_special_inode(inode, inode->i_mode,
- new_decode_dev(le32_to_cpu(fcb.i_data[1])));
- }
-
- unlock_new_inode(inode);
- return inode;
-
-bad_inode:
- iget_failed(inode);
- return ERR_PTR(ret);
-}
-
-int __exofs_wait_obj_created(struct exofs_i_info *oi)
-{
- if (!obj_created(oi)) {
- EXOFS_DBGMSG("!obj_created\n");
- BUG_ON(!obj_2bcreated(oi));
- wait_event(oi->i_wq, obj_created(oi));
- EXOFS_DBGMSG("wait_event done\n");
- }
- return unlikely(is_bad_inode(&oi->vfs_inode)) ? -EIO : 0;
-}
-
-/*
- * Callback function from exofs_new_inode(). The important thing is that we
- * set the obj_created flag so that other methods know that the object exists on
- * the OSD.
- */
-static void create_done(struct ore_io_state *ios, void *p)
-{
- struct inode *inode = p;
- struct exofs_i_info *oi = exofs_i(inode);
- struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
- int ret;
-
- ret = ore_check_io(ios, NULL);
- ore_put_io_state(ios);
-
- atomic_dec(&sbi->s_curr_pending);
-
- if (unlikely(ret)) {
- EXOFS_ERR("object=0x%llx creation failed in pid=0x%llx",
- _LLU(exofs_oi_objno(oi)),
- _LLU(oi->one_comp.obj.partition));
- /*TODO: When FS is corrupted creation can fail, object already
- * exist. Get rid of this asynchronous creation, if exist
- * increment the obj counter and try the next object. Until we
- * succeed. All these dangling objects will be made into lost
- * files by chkfs.exofs
- */
- }
-
- set_obj_created(oi);
-
- wake_up(&oi->i_wq);
-}
-
-/*
- * Set up a new inode and create an object for it on the OSD
- */
-struct inode *exofs_new_inode(struct inode *dir, umode_t mode)
-{
- struct super_block *sb = dir->i_sb;
- struct exofs_sb_info *sbi = sb->s_fs_info;
- struct inode *inode;
- struct exofs_i_info *oi;
- struct ore_io_state *ios;
- int ret;
-
- inode = new_inode(sb);
- if (!inode)
- return ERR_PTR(-ENOMEM);
-
- oi = exofs_i(inode);
- __oi_init(oi);
-
- set_obj_2bcreated(oi);
-
- inode_init_owner(inode, dir, mode);
- inode->i_ino = sbi->s_nextid++;
- inode->i_blkbits = EXOFS_BLKSHIFT;
- inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
- oi->i_commit_size = inode->i_size = 0;
- spin_lock(&sbi->s_next_gen_lock);
- inode->i_generation = sbi->s_next_generation++;
- spin_unlock(&sbi->s_next_gen_lock);
- insert_inode_hash(inode);
-
- exofs_init_comps(&oi->oc, &oi->one_comp, sb->s_fs_info,
- exofs_oi_objno(oi));
- exofs_sbi_write_stats(sbi); /* Make sure new sbi->s_nextid is on disk */
-
- mark_inode_dirty(inode);
-
- ret = ore_get_io_state(&sbi->layout, &oi->oc, &ios);
- if (unlikely(ret)) {
- EXOFS_ERR("exofs_new_inode: ore_get_io_state failed\n");
- return ERR_PTR(ret);
- }
-
- ios->done = create_done;
- ios->private = inode;
-
- ret = ore_create(ios);
- if (ret) {
- ore_put_io_state(ios);
- return ERR_PTR(ret);
- }
- atomic_inc(&sbi->s_curr_pending);
-
- return inode;
-}
-
-/*
- * struct to pass two arguments to update_inode's callback
- */
-struct updatei_args {
- struct exofs_sb_info *sbi;
- struct exofs_fcb fcb;
-};
-
-/*
- * Callback function from exofs_update_inode().
- */
-static void updatei_done(struct ore_io_state *ios, void *p)
-{
- struct updatei_args *args = p;
-
- ore_put_io_state(ios);
-
- atomic_dec(&args->sbi->s_curr_pending);
-
- kfree(args);
-}
-
-/*
- * Write the inode to the OSD. Just fill up the struct, and set the attribute
- * synchronously or asynchronously depending on the do_sync flag.
- */
-static int exofs_update_inode(struct inode *inode, int do_sync)
-{
- struct exofs_i_info *oi = exofs_i(inode);
- struct super_block *sb = inode->i_sb;
- struct exofs_sb_info *sbi = sb->s_fs_info;
- struct ore_io_state *ios;
- struct osd_attr attr;
- struct exofs_fcb *fcb;
- struct updatei_args *args;
- int ret;
-
- args = kzalloc(sizeof(*args), GFP_KERNEL);
- if (!args) {
- EXOFS_DBGMSG("Failed kzalloc of args\n");
- return -ENOMEM;
- }
-
- fcb = &args->fcb;
-
- fcb->i_mode = cpu_to_le16(inode->i_mode);
- fcb->i_uid = cpu_to_le32(i_uid_read(inode));
- fcb->i_gid = cpu_to_le32(i_gid_read(inode));
- fcb->i_links_count = cpu_to_le16(inode->i_nlink);
- fcb->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
- fcb->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
- fcb->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
- oi->i_commit_size = i_size_read(inode);
- fcb->i_size = cpu_to_le64(oi->i_commit_size);
- fcb->i_generation = cpu_to_le32(inode->i_generation);
-
- if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
- if (old_valid_dev(inode->i_rdev)) {
- fcb->i_data[0] =
- cpu_to_le32(old_encode_dev(inode->i_rdev));
- fcb->i_data[1] = 0;
- } else {
- fcb->i_data[0] = 0;
- fcb->i_data[1] =
- cpu_to_le32(new_encode_dev(inode->i_rdev));
- fcb->i_data[2] = 0;
- }
- } else
- memcpy(fcb->i_data, oi->i_data, sizeof(fcb->i_data));
-
- ret = ore_get_io_state(&sbi->layout, &oi->oc, &ios);
- if (unlikely(ret)) {
- EXOFS_ERR("%s: ore_get_io_state failed.\n", __func__);
- goto free_args;
- }
-
- attr = g_attr_inode_data;
- attr.val_ptr = fcb;
- ios->out_attr_len = 1;
- ios->out_attr = &attr;
-
- wait_obj_created(oi);
-
- if (!do_sync) {
- args->sbi = sbi;
- ios->done = updatei_done;
- ios->private = args;
- }
-
- ret = ore_write(ios);
- if (!do_sync && !ret) {
- atomic_inc(&sbi->s_curr_pending);
- goto out; /* deallocation in updatei_done */
- }
-
- ore_put_io_state(ios);
-free_args:
- kfree(args);
-out:
- EXOFS_DBGMSG("(0x%lx) do_sync=%d ret=>%d\n",
- inode->i_ino, do_sync, ret);
- return ret;
-}
-
-int exofs_write_inode(struct inode *inode, struct writeback_control *wbc)
-{
- /* FIXME: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */
- return exofs_update_inode(inode, 1);
-}
-
-/*
- * Callback function from exofs_delete_inode() - don't have much cleaning up to
- * do.
- */
-static void delete_done(struct ore_io_state *ios, void *p)
-{
- struct exofs_sb_info *sbi = p;
-
- ore_put_io_state(ios);
-
- atomic_dec(&sbi->s_curr_pending);
-}
-
-/*
- * Called when the refcount of an inode reaches zero. We remove the object
- * from the OSD here. We make sure the object was created before we try and
- * delete it.
- */
-void exofs_evict_inode(struct inode *inode)
-{
- struct exofs_i_info *oi = exofs_i(inode);
- struct super_block *sb = inode->i_sb;
- struct exofs_sb_info *sbi = sb->s_fs_info;
- struct ore_io_state *ios;
- int ret;
-
- truncate_inode_pages_final(&inode->i_data);
-
- /* TODO: should do better here */
- if (inode->i_nlink || is_bad_inode(inode))
- goto no_delete;
-
- inode->i_size = 0;
- clear_inode(inode);
-
- /* if we are deleting an obj that hasn't been created yet, wait.
- * This also makes sure that create_done cannot be called with an
- * already evicted inode.
- */
- wait_obj_created(oi);
- /* ignore the error, attempt a remove anyway */
-
- /* Now Remove the OSD objects */
- ret = ore_get_io_state(&sbi->layout, &oi->oc, &ios);
- if (unlikely(ret)) {
- EXOFS_ERR("%s: ore_get_io_state failed\n", __func__);
- return;
- }
-
- ios->done = delete_done;
- ios->private = sbi;
-
- ret = ore_remove(ios);
- if (ret) {
- EXOFS_ERR("%s: ore_remove failed\n", __func__);
- ore_put_io_state(ios);
- return;
- }
- atomic_inc(&sbi->s_curr_pending);
-
- return;
-
-no_delete:
- clear_inode(inode);
-}
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
deleted file mode 100644
index 7295cd722770..000000000000
--- a/fs/exofs/namei.c
+++ /dev/null
@@ -1,323 +0,0 @@
-/*
- * Copyright (C) 2005, 2006
- * Avishay Traeger (avishay@gmail.com)
- * Copyright (C) 2008, 2009
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * Copyrights for code taken from ext2:
- * Copyright (C) 1992, 1993, 1994, 1995
- * Remy Card (card@masi.ibp.fr)
- * Laboratoire MASI - Institut Blaise Pascal
- * Universite Pierre et Marie Curie (Paris VI)
- * from
- * linux/fs/minix/inode.c
- * Copyright (C) 1991, 1992 Linus Torvalds
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation. Since it is based on ext2, and the only
- * valid version of GPL for the Linux kernel is version 2, the only valid
- * version of GPL for exofs is version 2.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-
-#include "exofs.h"
-
-static inline int exofs_add_nondir(struct dentry *dentry, struct inode *inode)
-{
- int err = exofs_add_link(dentry, inode);
- if (!err) {
- d_instantiate(dentry, inode);
- return 0;
- }
- inode_dec_link_count(inode);
- iput(inode);
- return err;
-}
-
-static struct dentry *exofs_lookup(struct inode *dir, struct dentry *dentry,
- unsigned int flags)
-{
- struct inode *inode;
- ino_t ino;
-
- if (dentry->d_name.len > EXOFS_NAME_LEN)
- return ERR_PTR(-ENAMETOOLONG);
-
- ino = exofs_inode_by_name(dir, dentry);
- inode = ino ? exofs_iget(dir->i_sb, ino) : NULL;
- return d_splice_alias(inode, dentry);
-}
-
-static int exofs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
- bool excl)
-{
- struct inode *inode = exofs_new_inode(dir, mode);
- int err = PTR_ERR(inode);
- if (!IS_ERR(inode)) {
- inode->i_op = &exofs_file_inode_operations;
- inode->i_fop = &exofs_file_operations;
- inode->i_mapping->a_ops = &exofs_aops;
- mark_inode_dirty(inode);
- err = exofs_add_nondir(dentry, inode);
- }
- return err;
-}
-
-static int exofs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
- dev_t rdev)
-{
- struct inode *inode;
- int err;
-
- inode = exofs_new_inode(dir, mode);
- err = PTR_ERR(inode);
- if (!IS_ERR(inode)) {
- init_special_inode(inode, inode->i_mode, rdev);
- mark_inode_dirty(inode);
- err = exofs_add_nondir(dentry, inode);
- }
- return err;
-}
-
-static int exofs_symlink(struct inode *dir, struct dentry *dentry,
- const char *symname)
-{
- struct super_block *sb = dir->i_sb;
- int err = -ENAMETOOLONG;
- unsigned l = strlen(symname)+1;
- struct inode *inode;
- struct exofs_i_info *oi;
-
- if (l > sb->s_blocksize)
- goto out;
-
- inode = exofs_new_inode(dir, S_IFLNK | S_IRWXUGO);
- err = PTR_ERR(inode);
- if (IS_ERR(inode))
- goto out;
-
- oi = exofs_i(inode);
- if (l > sizeof(oi->i_data)) {
- /* slow symlink */
- inode->i_op = &page_symlink_inode_operations;
- inode_nohighmem(inode);
- inode->i_mapping->a_ops = &exofs_aops;
- memset(oi->i_data, 0, sizeof(oi->i_data));
-
- err = page_symlink(inode, symname, l);
- if (err)
- goto out_fail;
- } else {
- /* fast symlink */
- inode->i_op = &simple_symlink_inode_operations;
- inode->i_link = (char *)oi->i_data;
- memcpy(oi->i_data, symname, l);
- inode->i_size = l-1;
- }
- mark_inode_dirty(inode);
-
- err = exofs_add_nondir(dentry, inode);
-out:
- return err;
-
-out_fail:
- inode_dec_link_count(inode);
- iput(inode);
- goto out;
-}
-
-static int exofs_link(struct dentry *old_dentry, struct inode *dir,
- struct dentry *dentry)
-{
- struct inode *inode = d_inode(old_dentry);
-
- inode->i_ctime = current_time(inode);
- inode_inc_link_count(inode);
- ihold(inode);
-
- return exofs_add_nondir(dentry, inode);
-}
-
-static int exofs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
-{
- struct inode *inode;
- int err;
-
- inode_inc_link_count(dir);
-
- inode = exofs_new_inode(dir, S_IFDIR | mode);
- err = PTR_ERR(inode);
- if (IS_ERR(inode))
- goto out_dir;
-
- inode->i_op = &exofs_dir_inode_operations;
- inode->i_fop = &exofs_dir_operations;
- inode->i_mapping->a_ops = &exofs_aops;
-
- inode_inc_link_count(inode);
-
- err = exofs_make_empty(inode, dir);
- if (err)
- goto out_fail;
-
- err = exofs_add_link(dentry, inode);
- if (err)
- goto out_fail;
-
- d_instantiate(dentry, inode);
-out:
- return err;
-
-out_fail:
- inode_dec_link_count(inode);
- inode_dec_link_count(inode);
- iput(inode);
-out_dir:
- inode_dec_link_count(dir);
- goto out;
-}
-
-static int exofs_unlink(struct inode *dir, struct dentry *dentry)
-{
- struct inode *inode = d_inode(dentry);
- struct exofs_dir_entry *de;
- struct page *page;
- int err = -ENOENT;
-
- de = exofs_find_entry(dir, dentry, &page);
- if (!de)
- goto out;
-
- err = exofs_delete_entry(de, page);
- if (err)
- goto out;
-
- inode->i_ctime = dir->i_ctime;
- inode_dec_link_count(inode);
- err = 0;
-out:
- return err;
-}
-
-static int exofs_rmdir(struct inode *dir, struct dentry *dentry)
-{
- struct inode *inode = d_inode(dentry);
- int err = -ENOTEMPTY;
-
- if (exofs_empty_dir(inode)) {
- err = exofs_unlink(dir, dentry);
- if (!err) {
- inode->i_size = 0;
- inode_dec_link_count(inode);
- inode_dec_link_count(dir);
- }
- }
- return err;
-}
-
-static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry,
- unsigned int flags)
-{
- struct inode *old_inode = d_inode(old_dentry);
- struct inode *new_inode = d_inode(new_dentry);
- struct page *dir_page = NULL;
- struct exofs_dir_entry *dir_de = NULL;
- struct page *old_page;
- struct exofs_dir_entry *old_de;
- int err = -ENOENT;
-
- if (flags & ~RENAME_NOREPLACE)
- return -EINVAL;
-
- old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
- if (!old_de)
- goto out;
-
- if (S_ISDIR(old_inode->i_mode)) {
- err = -EIO;
- dir_de = exofs_dotdot(old_inode, &dir_page);
- if (!dir_de)
- goto out_old;
- }
-
- if (new_inode) {
- struct page *new_page;
- struct exofs_dir_entry *new_de;
-
- err = -ENOTEMPTY;
- if (dir_de && !exofs_empty_dir(new_inode))
- goto out_dir;
-
- err = -ENOENT;
- new_de = exofs_find_entry(new_dir, new_dentry, &new_page);
- if (!new_de)
- goto out_dir;
- err = exofs_set_link(new_dir, new_de, new_page, old_inode);
- new_inode->i_ctime = current_time(new_inode);
- if (dir_de)
- drop_nlink(new_inode);
- inode_dec_link_count(new_inode);
- if (err)
- goto out_dir;
- } else {
- err = exofs_add_link(new_dentry, old_inode);
- if (err)
- goto out_dir;
- if (dir_de)
- inode_inc_link_count(new_dir);
- }
-
- old_inode->i_ctime = current_time(old_inode);
-
- exofs_delete_entry(old_de, old_page);
- mark_inode_dirty(old_inode);
-
- if (dir_de) {
- err = exofs_set_link(old_inode, dir_de, dir_page, new_dir);
- inode_dec_link_count(old_dir);
- if (err)
- goto out_dir;
- }
- return 0;
-
-
-out_dir:
- if (dir_de) {
- kunmap(dir_page);
- put_page(dir_page);
- }
-out_old:
- kunmap(old_page);
- put_page(old_page);
-out:
- return err;
-}
-
-const struct inode_operations exofs_dir_inode_operations = {
- .create = exofs_create,
- .lookup = exofs_lookup,
- .link = exofs_link,
- .unlink = exofs_unlink,
- .symlink = exofs_symlink,
- .mkdir = exofs_mkdir,
- .rmdir = exofs_rmdir,
- .mknod = exofs_mknod,
- .rename = exofs_rename,
- .setattr = exofs_setattr,
-};
-
-const struct inode_operations exofs_special_inode_operations = {
- .setattr = exofs_setattr,
-};
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
deleted file mode 100644
index 8bb72807e70d..000000000000
--- a/fs/exofs/ore.c
+++ /dev/null
@@ -1,1164 +0,0 @@
-/*
- * Copyright (C) 2005, 2006
- * Avishay Traeger (avishay@gmail.com)
- * Copyright (C) 2008, 2009
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation. Since it is based on ext2, and the only
- * valid version of GPL for the Linux kernel is version 2, the only valid
- * version of GPL for exofs is version 2.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-
-#include <linux/slab.h>
-#include <linux/module.h>
-#include <asm/div64.h>
-#include <linux/lcm.h>
-
-#include "ore_raid.h"
-
-MODULE_AUTHOR("Boaz Harrosh <ooo@electrozaur.com>");
-MODULE_DESCRIPTION("Objects Raid Engine ore.ko");
-MODULE_LICENSE("GPL");
-
-/* ore_verify_layout does a couple of things:
- * 1. Given a minimum number of needed parameters fixes up the rest of the
- * members to be operatonals for the ore. The needed parameters are those
- * that are defined by the pnfs-objects layout STD.
- * 2. Check to see if the current ore code actually supports these parameters
- * for example stripe_unit must be a multple of the system PAGE_SIZE,
- * and etc...
- * 3. Cache some havily used calculations that will be needed by users.
- */
-
-enum { BIO_MAX_PAGES_KMALLOC =
- (PAGE_SIZE - sizeof(struct bio)) / sizeof(struct bio_vec),};
-
-int ore_verify_layout(unsigned total_comps, struct ore_layout *layout)
-{
- u64 stripe_length;
-
- switch (layout->raid_algorithm) {
- case PNFS_OSD_RAID_0:
- layout->parity = 0;
- break;
- case PNFS_OSD_RAID_5:
- layout->parity = 1;
- break;
- case PNFS_OSD_RAID_PQ:
- layout->parity = 2;
- break;
- case PNFS_OSD_RAID_4:
- default:
- ORE_ERR("Only RAID_0/5/6 for now received-enum=%d\n",
- layout->raid_algorithm);
- return -EINVAL;
- }
- if (0 != (layout->stripe_unit & ~PAGE_MASK)) {
- ORE_ERR("Stripe Unit(0x%llx)"
- " must be Multples of PAGE_SIZE(0x%lx)\n",
- _LLU(layout->stripe_unit), PAGE_SIZE);
- return -EINVAL;
- }
- if (layout->group_width) {
- if (!layout->group_depth) {
- ORE_ERR("group_depth == 0 && group_width != 0\n");
- return -EINVAL;
- }
- if (total_comps < (layout->group_width * layout->mirrors_p1)) {
- ORE_ERR("Data Map wrong, "
- "numdevs=%d < group_width=%d * mirrors=%d\n",
- total_comps, layout->group_width,
- layout->mirrors_p1);
- return -EINVAL;
- }
- layout->group_count = total_comps / layout->mirrors_p1 /
- layout->group_width;
- } else {
- if (layout->group_depth) {
- printk(KERN_NOTICE "Warning: group_depth ignored "
- "group_width == 0 && group_depth == %lld\n",
- _LLU(layout->group_depth));
- }
- layout->group_width = total_comps / layout->mirrors_p1;
- layout->group_depth = -1;
- layout->group_count = 1;
- }
-
- stripe_length = (u64)layout->group_width * layout->stripe_unit;
- if (stripe_length >= (1ULL << 32)) {
- ORE_ERR("Stripe_length(0x%llx) >= 32bit is not supported\n",
- _LLU(stripe_length));
- return -EINVAL;
- }
-
- layout->max_io_length =
- (BIO_MAX_PAGES_KMALLOC * PAGE_SIZE - layout->stripe_unit) *
- (layout->group_width - layout->parity);
- if (layout->parity) {
- unsigned stripe_length =
- (layout->group_width - layout->parity) *
- layout->stripe_unit;
-
- layout->max_io_length /= stripe_length;
- layout->max_io_length *= stripe_length;
- }
- ORE_DBGMSG("max_io_length=0x%lx\n", layout->max_io_length);
-
- return 0;
-}
-EXPORT_SYMBOL(ore_verify_layout);
-
-static u8 *_ios_cred(struct ore_io_state *ios, unsigned index)
-{
- return ios->oc->comps[index & ios->oc->single_comp].cred;
-}
-
-static struct osd_obj_id *_ios_obj(struct ore_io_state *ios, unsigned index)
-{
- return &ios->oc->comps[index & ios->oc->single_comp].obj;
-}
-
-static struct osd_dev *_ios_od(struct ore_io_state *ios, unsigned index)
-{
- ORE_DBGMSG2("oc->first_dev=%d oc->numdevs=%d i=%d oc->ods=%p\n",
- ios->oc->first_dev, ios->oc->numdevs, index,
- ios->oc->ods);
-
- return ore_comp_dev(ios->oc, index);
-}
-
-int _ore_get_io_state(struct ore_layout *layout,
- struct ore_components *oc, unsigned numdevs,
- unsigned sgs_per_dev, unsigned num_par_pages,
- struct ore_io_state **pios)
-{
- struct ore_io_state *ios;
- struct page **pages;
- struct osd_sg_entry *sgilist;
- struct __alloc_all_io_state {
- struct ore_io_state ios;
- struct ore_per_dev_state per_dev[numdevs];
- union {
- struct osd_sg_entry sglist[sgs_per_dev * numdevs];
- struct page *pages[num_par_pages];
- };
- } *_aios;
-
- if (likely(sizeof(*_aios) <= PAGE_SIZE)) {
- _aios = kzalloc(sizeof(*_aios), GFP_KERNEL);
- if (unlikely(!_aios)) {
- ORE_DBGMSG("Failed kzalloc bytes=%zd\n",
- sizeof(*_aios));
- *pios = NULL;
- return -ENOMEM;
- }
- pages = num_par_pages ? _aios->pages : NULL;
- sgilist = sgs_per_dev ? _aios->sglist : NULL;
- ios = &_aios->ios;
- } else {
- struct __alloc_small_io_state {
- struct ore_io_state ios;
- struct ore_per_dev_state per_dev[numdevs];
- } *_aio_small;
- union __extra_part {
- struct osd_sg_entry sglist[sgs_per_dev * numdevs];
- struct page *pages[num_par_pages];
- } *extra_part;
-
- _aio_small = kzalloc(sizeof(*_aio_small), GFP_KERNEL);
- if (unlikely(!_aio_small)) {
- ORE_DBGMSG("Failed alloc first part bytes=%zd\n",
- sizeof(*_aio_small));
- *pios = NULL;
- return -ENOMEM;
- }
- extra_part = kzalloc(sizeof(*extra_part), GFP_KERNEL);
- if (unlikely(!extra_part)) {
- ORE_DBGMSG("Failed alloc second part bytes=%zd\n",
- sizeof(*extra_part));
- kfree(_aio_small);
- *pios = NULL;
- return -ENOMEM;
- }
-
- pages = num_par_pages ? extra_part->pages : NULL;
- sgilist = sgs_per_dev ? extra_part->sglist : NULL;
- /* In this case the per_dev[0].sgilist holds the pointer to
- * be freed
- */
- ios = &_aio_small->ios;
- ios->extra_part_alloc = true;
- }
-
- if (pages) {
- ios->parity_pages = pages;
- ios->max_par_pages = num_par_pages;
- }
- if (sgilist) {
- unsigned d;
-
- for (d = 0; d < numdevs; ++d) {
- ios->per_dev[d].sglist = sgilist;
- sgilist += sgs_per_dev;
- }
- ios->sgs_per_dev = sgs_per_dev;
- }
-
- ios->layout = layout;
- ios->oc = oc;
- *pios = ios;
- return 0;
-}
-
-/* Allocate an io_state for only a single group of devices
- *
- * If a user needs to call ore_read/write() this version must be used becase it
- * allocates extra stuff for striping and raid.
- * The ore might decide to only IO less then @length bytes do to alignmets
- * and constrains as follows:
- * - The IO cannot cross group boundary.
- * - In raid5/6 The end of the IO must align at end of a stripe eg.
- * (@offset + @length) % strip_size == 0. Or the complete range is within a
- * single stripe.
- * - Memory condition only permitted a shorter IO. (A user can use @length=~0
- * And check the returned ios->length for max_io_size.)
- *
- * The caller must check returned ios->length (and/or ios->nr_pages) and
- * re-issue these pages that fall outside of ios->length
- */
-int ore_get_rw_state(struct ore_layout *layout, struct ore_components *oc,
- bool is_reading, u64 offset, u64 length,
- struct ore_io_state **pios)
-{
- struct ore_io_state *ios;
- unsigned numdevs = layout->group_width * layout->mirrors_p1;
- unsigned sgs_per_dev = 0, max_par_pages = 0;
- int ret;
-
- if (layout->parity && length) {
- unsigned data_devs = layout->group_width - layout->parity;
- unsigned stripe_size = layout->stripe_unit * data_devs;
- unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
- u32 remainder;
- u64 num_stripes;
- u64 num_raid_units;
-
- num_stripes = div_u64_rem(length, stripe_size, &remainder);
- if (remainder)
- ++num_stripes;
-
- num_raid_units = num_stripes * layout->parity;
-
- if (is_reading) {
- /* For reads add per_dev sglist array */
- /* TODO: Raid 6 we need twice more. Actually:
- * num_stripes / LCMdP(W,P);
- * if (W%P != 0) num_stripes *= parity;
- */
-
- /* first/last seg is split */
- num_raid_units += layout->group_width;
- sgs_per_dev = div_u64(num_raid_units, data_devs) + 2;
- } else {
- /* For Writes add parity pages array. */
- max_par_pages = num_raid_units * pages_in_unit *
- sizeof(struct page *);
- }
- }
-
- ret = _ore_get_io_state(layout, oc, numdevs, sgs_per_dev, max_par_pages,
- pios);
- if (unlikely(ret))
- return ret;
-
- ios = *pios;
- ios->reading = is_reading;
- ios->offset = offset;
-
- if (length) {
- ore_calc_stripe_info(layout, offset, length, &ios->si);
- ios->length = ios->si.length;
- ios->nr_pages = ((ios->offset & (PAGE_SIZE - 1)) +
- ios->length + PAGE_SIZE - 1) / PAGE_SIZE;
- if (layout->parity)
- _ore_post_alloc_raid_stuff(ios);
- }
-
- return 0;
-}
-EXPORT_SYMBOL(ore_get_rw_state);
-
-/* Allocate an io_state for all the devices in the comps array
- *
- * This version of io_state allocation is used mostly by create/remove
- * and trunc where we currently need all the devices. The only wastful
- * bit is the read/write_attributes with no IO. Those sites should
- * be converted to use ore_get_rw_state() with length=0
- */
-int ore_get_io_state(struct ore_layout *layout, struct ore_components *oc,
- struct ore_io_state **pios)
-{
- return _ore_get_io_state(layout, oc, oc->numdevs, 0, 0, pios);
-}
-EXPORT_SYMBOL(ore_get_io_state);
-
-void ore_put_io_state(struct ore_io_state *ios)
-{
- if (ios) {
- unsigned i;
-
- for (i = 0; i < ios->numdevs; i++) {
- struct ore_per_dev_state *per_dev = &ios->per_dev[i];
-
- if (per_dev->or)
- osd_end_request(per_dev->or);
- if (per_dev->bio)
- bio_put(per_dev->bio);
- }
-
- _ore_free_raid_stuff(ios);
- kfree(ios);
- }
-}
-EXPORT_SYMBOL(ore_put_io_state);
-
-static void _sync_done(struct ore_io_state *ios, void *p)
-{
- struct completion *waiting = p;
-
- complete(waiting);
-}
-
-static void _last_io(struct kref *kref)
-{
- struct ore_io_state *ios = container_of(
- kref, struct ore_io_state, kref);
-
- ios->done(ios, ios->private);
-}
-
-static void _done_io(struct osd_request *or, void *p)
-{
- struct ore_io_state *ios = p;
-
- kref_put(&ios->kref, _last_io);
-}
-
-int ore_io_execute(struct ore_io_state *ios)
-{
- DECLARE_COMPLETION_ONSTACK(wait);
- bool sync = (ios->done == NULL);
- int i, ret;
-
- if (sync) {
- ios->done = _sync_done;
- ios->private = &wait;
- }
-
- for (i = 0; i < ios->numdevs; i++) {
- struct osd_request *or = ios->per_dev[i].or;
- if (unlikely(!or))
- continue;
-
- ret = osd_finalize_request(or, 0, _ios_cred(ios, i), NULL);
- if (unlikely(ret)) {
- ORE_DBGMSG("Failed to osd_finalize_request() => %d\n",
- ret);
- return ret;
- }
- }
-
- kref_init(&ios->kref);
-
- for (i = 0; i < ios->numdevs; i++) {
- struct osd_request *or = ios->per_dev[i].or;
- if (unlikely(!or))
- continue;
-
- kref_get(&ios->kref);
- osd_execute_request_async(or, _done_io, ios);
- }
-
- kref_put(&ios->kref, _last_io);
- ret = 0;
-
- if (sync) {
- wait_for_completion(&wait);
- ret = ore_check_io(ios, NULL);
- }
- return ret;
-}
-
-static void _clear_bio(struct bio *bio)
-{
- struct bio_vec *bv;
- unsigned i;
-
- bio_for_each_segment_all(bv, bio, i) {
- unsigned this_count = bv->bv_len;
-
- if (likely(PAGE_SIZE == this_count))
- clear_highpage(bv->bv_page);
- else
- zero_user(bv->bv_page, bv->bv_offset, this_count);
- }
-}
-
-int ore_check_io(struct ore_io_state *ios, ore_on_dev_error on_dev_error)
-{
- enum osd_err_priority acumulated_osd_err = 0;
- int acumulated_lin_err = 0;
- int i;
-
- for (i = 0; i < ios->numdevs; i++) {
- struct osd_sense_info osi;
- struct ore_per_dev_state *per_dev = &ios->per_dev[i];
- struct osd_request *or = per_dev->or;
- int ret;
-
- if (unlikely(!or))
- continue;
-
- ret = osd_req_decode_sense(or, &osi);
- if (likely(!ret))
- continue;
-
- if ((OSD_ERR_PRI_CLEAR_PAGES == osi.osd_err_pri) &&
- per_dev->bio) {
- /* start read offset passed endof file.
- * Note: if we do not have bio it means read-attributes
- * In this case we should return error to caller.
- */
- _clear_bio(per_dev->bio);
- ORE_DBGMSG("start read offset passed end of file "
- "offset=0x%llx, length=0x%llx\n",
- _LLU(per_dev->offset),
- _LLU(per_dev->length));
-
- continue; /* we recovered */
- }
-
- if (on_dev_error) {
- u64 residual = ios->reading ?
- or->in.residual : or->out.residual;
- u64 offset = (ios->offset + ios->length) - residual;
- unsigned dev = per_dev->dev - ios->oc->first_dev;
- struct ore_dev *od = ios->oc->ods[dev];
-
- on_dev_error(ios, od, dev, osi.osd_err_pri,
- offset, residual);
- }
- if (osi.osd_err_pri >= acumulated_osd_err) {
- acumulated_osd_err = osi.osd_err_pri;
- acumulated_lin_err = ret;
- }
- }
-
- return acumulated_lin_err;
-}
-EXPORT_SYMBOL(ore_check_io);
-
-/*
- * L - logical offset into the file
- *
- * D - number of Data devices
- * D = group_width - parity
- *
- * U - The number of bytes in a stripe within a group
- * U = stripe_unit * D
- *
- * T - The number of bytes striped within a group of component objects
- * (before advancing to the next group)
- * T = U * group_depth
- *
- * S - The number of bytes striped across all component objects
- * before the pattern repeats
- * S = T * group_count
- *
- * M - The "major" (i.e., across all components) cycle number
- * M = L / S
- *
- * G - Counts the groups from the beginning of the major cycle
- * G = (L - (M * S)) / T [or (L % S) / T]
- *
- * H - The byte offset within the group
- * H = (L - (M * S)) % T [or (L % S) % T]
- *
- * N - The "minor" (i.e., across the group) stripe number
- * N = H / U
- *
- * C - The component index coresponding to L
- *
- * C = (H - (N * U)) / stripe_unit + G * D
- * [or (L % U) / stripe_unit + G * D]
- *
- * O - The component offset coresponding to L
- * O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
- *
- * LCMdP – Parity cycle: Lowest Common Multiple of group_width, parity
- * divide by parity
- * LCMdP = lcm(group_width, parity) / parity
- *
- * R - The parity Rotation stripe
- * (Note parity cycle always starts at a group's boundary)
- * R = N % LCMdP
- *
- * I = the first parity device index
- * I = (group_width + group_width - R*parity - parity) % group_width
- *
- * Craid - The component index Rotated
- * Craid = (group_width + C - R*parity) % group_width
- * (We add the group_width to avoid negative numbers modulo math)
- */
-void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
- u64 length, struct ore_striping_info *si)
-{
- u32 stripe_unit = layout->stripe_unit;
- u32 group_width = layout->group_width;
- u64 group_depth = layout->group_depth;
- u32 parity = layout->parity;
-
- u32 D = group_width - parity;
- u32 U = D * stripe_unit;
- u64 T = U * group_depth;
- u64 S = T * layout->group_count;
- u64 M = div64_u64(file_offset, S);
-
- /*
- G = (L - (M * S)) / T
- H = (L - (M * S)) % T
- */
- u64 LmodS = file_offset - M * S;
- u32 G = div64_u64(LmodS, T);
- u64 H = LmodS - G * T;
-
- u32 N = div_u64(H, U);
- u32 Nlast;
-
- /* "H - (N * U)" is just "H % U" so it's bound to u32 */
- u32 C = (u32)(H - (N * U)) / stripe_unit + G * group_width;
- u32 first_dev = C - C % group_width;
-
- div_u64_rem(file_offset, stripe_unit, &si->unit_off);
-
- si->obj_offset = si->unit_off + (N * stripe_unit) +
- (M * group_depth * stripe_unit);
- si->cur_comp = C - first_dev;
- si->cur_pg = si->unit_off / PAGE_SIZE;
-
- if (parity) {
- u32 LCMdP = lcm(group_width, parity) / parity;
- /* R = N % LCMdP; */
- u32 RxP = (N % LCMdP) * parity;
-
- si->par_dev = (group_width + group_width - parity - RxP) %
- group_width + first_dev;
- si->dev = (group_width + group_width + C - RxP) %
- group_width + first_dev;
- si->bytes_in_stripe = U;
- si->first_stripe_start = M * S + G * T + N * U;
- } else {
- /* Make the math correct see _prepare_one_group */
- si->par_dev = group_width;
- si->dev = C;
- }
-
- si->dev *= layout->mirrors_p1;
- si->par_dev *= layout->mirrors_p1;
- si->offset = file_offset;
- si->length = T - H;
- if (si->length > length)
- si->length = length;
-
- Nlast = div_u64(H + si->length + U - 1, U);
- si->maxdevUnits = Nlast - N;
-
- si->M = M;
-}
-EXPORT_SYMBOL(ore_calc_stripe_info);
-
-int _ore_add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
- unsigned pgbase, struct page **pages,
- struct ore_per_dev_state *per_dev, int cur_len)
-{
- unsigned pg = *cur_pg;
- struct request_queue *q =
- osd_request_queue(_ios_od(ios, per_dev->dev));
- unsigned len = cur_len;
- int ret;
-
- if (per_dev->bio == NULL) {
- unsigned bio_size;
-
- if (!ios->reading) {
- bio_size = ios->si.maxdevUnits;
- } else {
- bio_size = (ios->si.maxdevUnits + 1) *
- (ios->layout->group_width - ios->layout->parity) /
- ios->layout->group_width;
- }
- bio_size *= (ios->layout->stripe_unit / PAGE_SIZE);
-
- per_dev->bio = bio_kmalloc(GFP_KERNEL, bio_size);
- if (unlikely(!per_dev->bio)) {
- ORE_DBGMSG("Failed to allocate BIO size=%u\n",
- bio_size);
- ret = -ENOMEM;
- goto out;
- }
- }
-
- while (cur_len > 0) {
- unsigned pglen = min_t(unsigned, PAGE_SIZE - pgbase, cur_len);
- unsigned added_len;
-
- cur_len -= pglen;
-
- added_len = bio_add_pc_page(q, per_dev->bio, pages[pg],
- pglen, pgbase);
- if (unlikely(pglen != added_len)) {
- /* If bi_vcnt == bi_max then this is a SW BUG */
- ORE_DBGMSG("Failed bio_add_pc_page bi_vcnt=0x%x "
- "bi_max=0x%x BIO_MAX=0x%x cur_len=0x%x\n",
- per_dev->bio->bi_vcnt,
- per_dev->bio->bi_max_vecs,
- BIO_MAX_PAGES_KMALLOC, cur_len);
- ret = -ENOMEM;
- goto out;
- }
- _add_stripe_page(ios->sp2d, &ios->si, pages[pg]);
-
- pgbase = 0;
- ++pg;
- }
- BUG_ON(cur_len);
-
- per_dev->length += len;
- *cur_pg = pg;
- ret = 0;
-out: /* we fail the complete unit on an error eg don't advance
- * per_dev->length and cur_pg. This means that we might have a bigger
- * bio than the CDB requested length (per_dev->length). That's fine
- * only the oposite is fatal.
- */
- return ret;
-}
-
-static int _add_parity_units(struct ore_io_state *ios,
- struct ore_striping_info *si,
- unsigned dev, unsigned first_dev,
- unsigned mirrors_p1, unsigned devs_in_group,
- unsigned cur_len)
-{
- unsigned do_parity;
- int ret = 0;
-
- for (do_parity = ios->layout->parity; do_parity; --do_parity) {
- struct ore_per_dev_state *per_dev;
-
- per_dev = &ios->per_dev[dev - first_dev];
- if (!per_dev->length && !per_dev->offset) {
- /* Only/always the parity unit of the first
- * stripe will be empty. So this is a chance to
- * initialize the per_dev info.
- */
- per_dev->dev = dev;
- per_dev->offset = si->obj_offset - si->unit_off;
- }
-
- ret = _ore_add_parity_unit(ios, si, per_dev, cur_len,
- do_parity == 1);
- if (unlikely(ret))
- break;
-
- if (do_parity != 1) {
- dev = ((dev + mirrors_p1) % devs_in_group) + first_dev;
- si->cur_comp = (si->cur_comp + 1) %
- ios->layout->group_width;
- }
- }
-
- return ret;
-}
-
-static int _prepare_for_striping(struct ore_io_state *ios)
-{
- struct ore_striping_info *si = &ios->si;
- unsigned stripe_unit = ios->layout->stripe_unit;
- unsigned mirrors_p1 = ios->layout->mirrors_p1;
- unsigned group_width = ios->layout->group_width;
- unsigned devs_in_group = group_width * mirrors_p1;
- unsigned dev = si->dev;
- unsigned first_dev = dev - (dev % devs_in_group);
- unsigned cur_pg = ios->pages_consumed;
- u64 length = ios->length;
- int ret = 0;
-
- if (!ios->pages) {
- ios->numdevs = ios->layout->mirrors_p1;
- return 0;
- }
-
- BUG_ON(length > si->length);
-
- while (length) {
- struct ore_per_dev_state *per_dev =
- &ios->per_dev[dev - first_dev];
- unsigned cur_len, page_off = 0;
-
- if (!per_dev->length && !per_dev->offset) {
- /* First time initialize the per_dev info. */
- per_dev->dev = dev;
- if (dev == si->dev) {
- WARN_ON(dev == si->par_dev);
- per_dev->offset = si->obj_offset;
- cur_len = stripe_unit - si->unit_off;
- page_off = si->unit_off & ~PAGE_MASK;
- BUG_ON(page_off && (page_off != ios->pgbase));
- } else {
- per_dev->offset = si->obj_offset - si->unit_off;
- cur_len = stripe_unit;
- }
- } else {
- cur_len = stripe_unit;
- }
- if (cur_len >= length)
- cur_len = length;
-
- ret = _ore_add_stripe_unit(ios, &cur_pg, page_off, ios->pages,
- per_dev, cur_len);
- if (unlikely(ret))
- goto out;
-
- length -= cur_len;
-
- dev = ((dev + mirrors_p1) % devs_in_group) + first_dev;
- si->cur_comp = (si->cur_comp + 1) % group_width;
- if (unlikely((dev == si->par_dev) || (!length && ios->sp2d))) {
- if (!length && ios->sp2d) {
- /* If we are writing and this is the very last
- * stripe. then operate on parity dev.
- */
- dev = si->par_dev;
- /* If last stripe operate on parity comp */
- si->cur_comp = group_width - ios->layout->parity;
- }
-
- /* In writes cur_len just means if it's the
- * last one. See _ore_add_parity_unit.
- */
- ret = _add_parity_units(ios, si, dev, first_dev,
- mirrors_p1, devs_in_group,
- ios->sp2d ? length : cur_len);
- if (unlikely(ret))
- goto out;
-
- /* Rotate next par_dev backwards with wraping */
- si->par_dev = (devs_in_group + si->par_dev -
- ios->layout->parity * mirrors_p1) %
- devs_in_group + first_dev;
- /* Next stripe, start fresh */
- si->cur_comp = 0;
- si->cur_pg = 0;
- si->obj_offset += cur_len;
- si->unit_off = 0;
- }
- }
-out:
- ios->numdevs = devs_in_group;
- ios->pages_consumed = cur_pg;
- return ret;
-}
-
-int ore_create(struct ore_io_state *ios)
-{
- int i, ret;
-
- for (i = 0; i < ios->oc->numdevs; i++) {
- struct osd_request *or;
-
- or = osd_start_request(_ios_od(ios, i), GFP_KERNEL);
- if (unlikely(!or)) {
- ORE_ERR("%s: osd_start_request failed\n", __func__);
- ret = -ENOMEM;
- goto out;
- }
- ios->per_dev[i].or = or;
- ios->numdevs++;
-
- osd_req_create_object(or, _ios_obj(ios, i));
- }
- ret = ore_io_execute(ios);
-
-out:
- return ret;
-}
-EXPORT_SYMBOL(ore_create);
-
-int ore_remove(struct ore_io_state *ios)
-{
- int i, ret;
-
- for (i = 0; i < ios->oc->numdevs; i++) {
- struct osd_request *or;
-
- or = osd_start_request(_ios_od(ios, i), GFP_KERNEL);
- if (unlikely(!or)) {
- ORE_ERR("%s: osd_start_request failed\n", __func__);
- ret = -ENOMEM;
- goto out;
- }
- ios->per_dev[i].or = or;
- ios->numdevs++;
-
- osd_req_remove_object(or, _ios_obj(ios, i));
- }
- ret = ore_io_execute(ios);
-
-out:
- return ret;
-}
-EXPORT_SYMBOL(ore_remove);
-
-static int _write_mirror(struct ore_io_state *ios, int cur_comp)
-{
- struct ore_per_dev_state *master_dev = &ios->per_dev[cur_comp];
- unsigned dev = ios->per_dev[cur_comp].dev;
- unsigned last_comp = cur_comp + ios->layout->mirrors_p1;
- int ret = 0;
-
- if (ios->pages && !master_dev->length)
- return 0; /* Just an empty slot */
-
- for (; cur_comp < last_comp; ++cur_comp, ++dev) {
- struct ore_per_dev_state *per_dev = &ios->per_dev[cur_comp];
- struct osd_request *or;
-
- or = osd_start_request(_ios_od(ios, dev), GFP_KERNEL);
- if (unlikely(!or)) {
- ORE_ERR("%s: osd_start_request failed\n", __func__);
- ret = -ENOMEM;
- goto out;
- }
- per_dev->or = or;
-
- if (ios->pages) {
- struct bio *bio;
-
- if (per_dev != master_dev) {
- bio = bio_clone_kmalloc(master_dev->bio,
- GFP_KERNEL);
- if (unlikely(!bio)) {
- ORE_DBGMSG(
- "Failed to allocate BIO size=%u\n",
- master_dev->bio->bi_max_vecs);
- ret = -ENOMEM;
- goto out;
- }
-
- bio->bi_bdev = NULL;
- bio->bi_next = NULL;
- per_dev->offset = master_dev->offset;
- per_dev->length = master_dev->length;
- per_dev->bio = bio;
- per_dev->dev = dev;
- } else {
- bio = master_dev->bio;
- /* FIXME: bio_set_dir() */
- bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
- }
-
- osd_req_write(or, _ios_obj(ios, cur_comp),
- per_dev->offset, bio, per_dev->length);
- ORE_DBGMSG("write(0x%llx) offset=0x%llx "
- "length=0x%llx dev=%d\n",
- _LLU(_ios_obj(ios, cur_comp)->id),
- _LLU(per_dev->offset),
- _LLU(per_dev->length), dev);
- } else if (ios->kern_buff) {
- per_dev->offset = ios->si.obj_offset;
- per_dev->dev = ios->si.dev + dev;
-
- /* no cross device without page array */
- BUG_ON((ios->layout->group_width > 1) &&
- (ios->si.unit_off + ios->length >
- ios->layout->stripe_unit));
-
- ret = osd_req_write_kern(or, _ios_obj(ios, cur_comp),
- per_dev->offset,
- ios->kern_buff, ios->length);
- if (unlikely(ret))
- goto out;
- ORE_DBGMSG2("write_kern(0x%llx) offset=0x%llx "
- "length=0x%llx dev=%d\n",
- _LLU(_ios_obj(ios, cur_comp)->id),
- _LLU(per_dev->offset),
- _LLU(ios->length), per_dev->dev);
- } else {
- osd_req_set_attributes(or, _ios_obj(ios, cur_comp));
- ORE_DBGMSG2("obj(0x%llx) set_attributes=%d dev=%d\n",
- _LLU(_ios_obj(ios, cur_comp)->id),
- ios->out_attr_len, dev);
- }
-
- if (ios->out_attr)
- osd_req_add_set_attr_list(or, ios->out_attr,
- ios->out_attr_len);
-
- if (ios->in_attr)
- osd_req_add_get_attr_list(or, ios->in_attr,
- ios->in_attr_len);
- }
-
-out:
- return ret;
-}
-
-int ore_write(struct ore_io_state *ios)
-{
- int i;
- int ret;
-
- if (unlikely(ios->sp2d && !ios->r4w)) {
- /* A library is attempting a RAID-write without providing
- * a pages lock interface.
- */
- WARN_ON_ONCE(1);
- return -ENOTSUPP;
- }
-
- ret = _prepare_for_striping(ios);
- if (unlikely(ret))
- return ret;
-
- for (i = 0; i < ios->numdevs; i += ios->layout->mirrors_p1) {
- ret = _write_mirror(ios, i);
- if (unlikely(ret))
- return ret;
- }
-
- ret = ore_io_execute(ios);
- return ret;
-}
-EXPORT_SYMBOL(ore_write);
-
-int _ore_read_mirror(struct ore_io_state *ios, unsigned cur_comp)
-{
- struct osd_request *or;
- struct ore_per_dev_state *per_dev = &ios->per_dev[cur_comp];
- struct osd_obj_id *obj = _ios_obj(ios, cur_comp);
- unsigned first_dev = (unsigned)obj->id;
-
- if (ios->pages && !per_dev->length)
- return 0; /* Just an empty slot */
-
- first_dev = per_dev->dev + first_dev % ios->layout->mirrors_p1;
- or = osd_start_request(_ios_od(ios, first_dev), GFP_KERNEL);
- if (unlikely(!or)) {
- ORE_ERR("%s: osd_start_request failed\n", __func__);
- return -ENOMEM;
- }
- per_dev->or = or;
-
- if (ios->pages) {
- if (per_dev->cur_sg) {
- /* finalize the last sg_entry */
- _ore_add_sg_seg(per_dev, 0, false);
- if (unlikely(!per_dev->cur_sg))
- return 0; /* Skip parity only device */
-
- osd_req_read_sg(or, obj, per_dev->bio,
- per_dev->sglist, per_dev->cur_sg);
- } else {
- /* The no raid case */
- osd_req_read(or, obj, per_dev->offset,
- per_dev->bio, per_dev->length);
- }
-
- ORE_DBGMSG("read(0x%llx) offset=0x%llx length=0x%llx"
- " dev=%d sg_len=%d\n", _LLU(obj->id),
- _LLU(per_dev->offset), _LLU(per_dev->length),
- first_dev, per_dev->cur_sg);
- } else {
- BUG_ON(ios->kern_buff);
-
- osd_req_get_attributes(or, obj);
- ORE_DBGMSG2("obj(0x%llx) get_attributes=%d dev=%d\n",
- _LLU(obj->id),
- ios->in_attr_len, first_dev);
- }
- if (ios->out_attr)
- osd_req_add_set_attr_list(or, ios->out_attr, ios->out_attr_len);
-
- if (ios->in_attr)
- osd_req_add_get_attr_list(or, ios->in_attr, ios->in_attr_len);
-
- return 0;
-}
-
-int ore_read(struct ore_io_state *ios)
-{
- int i;
- int ret;
-
- ret = _prepare_for_striping(ios);
- if (unlikely(ret))
- return ret;
-
- for (i = 0; i < ios->numdevs; i += ios->layout->mirrors_p1) {
- ret = _ore_read_mirror(ios, i);
- if (unlikely(ret))
- return ret;
- }
-
- ret = ore_io_execute(ios);
- return ret;
-}
-EXPORT_SYMBOL(ore_read);
-
-int extract_attr_from_ios(struct ore_io_state *ios, struct osd_attr *attr)
-{
- struct osd_attr cur_attr = {.attr_page = 0}; /* start with zeros */
- void *iter = NULL;
- int nelem;
-
- do {
- nelem = 1;
- osd_req_decode_get_attr_list(ios->per_dev[0].or,
- &cur_attr, &nelem, &iter);
- if ((cur_attr.attr_page == attr->attr_page) &&
- (cur_attr.attr_id == attr->attr_id)) {
- attr->len = cur_attr.len;
- attr->val_ptr = cur_attr.val_ptr;
- return 0;
- }
- } while (iter);
-
- return -EIO;
-}
-EXPORT_SYMBOL(extract_attr_from_ios);
-
-static int _truncate_mirrors(struct ore_io_state *ios, unsigned cur_comp,
- struct osd_attr *attr)
-{
- int last_comp = cur_comp + ios->layout->mirrors_p1;
-
- for (; cur_comp < last_comp; ++cur_comp) {
- struct ore_per_dev_state *per_dev = &ios->per_dev[cur_comp];
- struct osd_request *or;
-
- or = osd_start_request(_ios_od(ios, cur_comp), GFP_KERNEL);
- if (unlikely(!or)) {
- ORE_ERR("%s: osd_start_request failed\n", __func__);
- return -ENOMEM;
- }
- per_dev->or = or;
-
- osd_req_set_attributes(or, _ios_obj(ios, cur_comp));
- osd_req_add_set_attr_list(or, attr, 1);
- }
-
- return 0;
-}
-
-struct _trunc_info {
- struct ore_striping_info si;
- u64 prev_group_obj_off;
- u64 next_group_obj_off;
-
- unsigned first_group_dev;
- unsigned nex_group_dev;
-};
-
-static void _calc_trunk_info(struct ore_layout *layout, u64 file_offset,
- struct _trunc_info *ti)
-{
- unsigned stripe_unit = layout->stripe_unit;
-
- ore_calc_stripe_info(layout, file_offset, 0, &ti->si);
-
- ti->prev_group_obj_off = ti->si.M * stripe_unit;
- ti->next_group_obj_off = ti->si.M ? (ti->si.M - 1) * stripe_unit : 0;
-
- ti->first_group_dev = ti->si.dev - (ti->si.dev % layout->group_width);
- ti->nex_group_dev = ti->first_group_dev + layout->group_width;
-}
-
-int ore_truncate(struct ore_layout *layout, struct ore_components *oc,
- u64 size)
-{
- struct ore_io_state *ios;
- struct exofs_trunc_attr {
- struct osd_attr attr;
- __be64 newsize;
- } *size_attrs;
- struct _trunc_info ti;
- int i, ret;
-
- ret = ore_get_io_state(layout, oc, &ios);
- if (unlikely(ret))
- return ret;
-
- _calc_trunk_info(ios->layout, size, &ti);
-
- size_attrs = kcalloc(ios->oc->numdevs, sizeof(*size_attrs),
- GFP_KERNEL);
- if (unlikely(!size_attrs)) {
- ret = -ENOMEM;
- goto out;
- }
-
- ios->numdevs = ios->oc->numdevs;
-
- for (i = 0; i < ios->numdevs; ++i) {
- struct exofs_trunc_attr *size_attr = &size_attrs[i];
- u64 obj_size;
-
- if (i < ti.first_group_dev)
- obj_size = ti.prev_group_obj_off;
- else if (i >= ti.nex_group_dev)
- obj_size = ti.next_group_obj_off;
- else if (i < ti.si.dev) /* dev within this group */
- obj_size = ti.si.obj_offset +
- ios->layout->stripe_unit - ti.si.unit_off;
- else if (i == ti.si.dev)
- obj_size = ti.si.obj_offset;
- else /* i > ti.dev */
- obj_size = ti.si.obj_offset - ti.si.unit_off;
-
- size_attr->newsize = cpu_to_be64(obj_size);
- size_attr->attr = g_attr_logical_length;
- size_attr->attr.val_ptr = &size_attr->newsize;
-
- ORE_DBGMSG2("trunc(0x%llx) obj_offset=0x%llx dev=%d\n",
- _LLU(oc->comps->obj.id), _LLU(obj_size), i);
- ret = _truncate_mirrors(ios, i * ios->layout->mirrors_p1,
- &size_attr->attr);
- if (unlikely(ret))
- goto out;
- }
- ret = ore_io_execute(ios);
-
-out:
- kfree(size_attrs);
- ore_put_io_state(ios);
- return ret;
-}
-EXPORT_SYMBOL(ore_truncate);
-
-const struct osd_attr g_attr_logical_length = ATTR_DEF(
- OSD_APAGE_OBJECT_INFORMATION, OSD_ATTR_OI_LOGICAL_LENGTH, 8);
-EXPORT_SYMBOL(g_attr_logical_length);
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
deleted file mode 100644
index 27cbdb697649..000000000000
--- a/fs/exofs/ore_raid.c
+++ /dev/null
@@ -1,721 +0,0 @@
-/*
- * Copyright (C) 2011
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * This file is part of the objects raid engine (ore).
- *
- * It is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
- *
- * You should have received a copy of the GNU General Public License
- * along with "ore". If not, write to the Free Software Foundation, Inc:
- * "Free Software Foundation <info@fsf.org>"
- */
-
-#include <linux/gfp.h>
-#include <linux/async_tx.h>
-
-#include "ore_raid.h"
-
-#undef ORE_DBGMSG2
-#define ORE_DBGMSG2 ORE_DBGMSG
-
-static struct page *_raid_page_alloc(void)
-{
- return alloc_page(GFP_KERNEL);
-}
-
-static void _raid_page_free(struct page *p)
-{
- __free_page(p);
-}
-
-/* This struct is forward declare in ore_io_state, but is private to here.
- * It is put on ios->sp2d for RAID5/6 writes only. See _gen_xor_unit.
- *
- * __stripe_pages_2d is a 2d array of pages, and it is also a corner turn.
- * Ascending page index access is sp2d(p-minor, c-major). But storage is
- * sp2d[p-minor][c-major], so it can be properlly presented to the async-xor
- * API.
- */
-struct __stripe_pages_2d {
- /* Cache some hot path repeated calculations */
- unsigned parity;
- unsigned data_devs;
- unsigned pages_in_unit;
-
- bool needed ;
-
- /* Array size is pages_in_unit (layout->stripe_unit / PAGE_SIZE) */
- struct __1_page_stripe {
- bool alloc;
- unsigned write_count;
- struct async_submit_ctl submit;
- struct dma_async_tx_descriptor *tx;
-
- /* The size of this array is data_devs + parity */
- struct page **pages;
- struct page **scribble;
- /* bool array, size of this array is data_devs */
- char *page_is_read;
- } _1p_stripes[];
-};
-
-/* This can get bigger then a page. So support multiple page allocations
- * _sp2d_free should be called even if _sp2d_alloc fails (by returning
- * none-zero).
- */
-static int _sp2d_alloc(unsigned pages_in_unit, unsigned group_width,
- unsigned parity, struct __stripe_pages_2d **psp2d)
-{
- struct __stripe_pages_2d *sp2d;
- unsigned data_devs = group_width - parity;
- struct _alloc_all_bytes {
- struct __alloc_stripe_pages_2d {
- struct __stripe_pages_2d sp2d;
- struct __1_page_stripe _1p_stripes[pages_in_unit];
- } __asp2d;
- struct __alloc_1p_arrays {
- struct page *pages[group_width];
- struct page *scribble[group_width];
- char page_is_read[data_devs];
- } __a1pa[pages_in_unit];
- } *_aab;
- struct __alloc_1p_arrays *__a1pa;
- struct __alloc_1p_arrays *__a1pa_end;
- const unsigned sizeof__a1pa = sizeof(_aab->__a1pa[0]);
- unsigned num_a1pa, alloc_size, i;
-
- /* FIXME: check these numbers in ore_verify_layout */
- BUG_ON(sizeof(_aab->__asp2d) > PAGE_SIZE);
- BUG_ON(sizeof__a1pa > PAGE_SIZE);
-
- if (sizeof(*_aab) > PAGE_SIZE) {
- num_a1pa = (PAGE_SIZE - sizeof(_aab->__asp2d)) / sizeof__a1pa;
- alloc_size = sizeof(_aab->__asp2d) + sizeof__a1pa * num_a1pa;
- } else {
- num_a1pa = pages_in_unit;
- alloc_size = sizeof(*_aab);
- }
-
- _aab = kzalloc(alloc_size, GFP_KERNEL);
- if (unlikely(!_aab)) {
- ORE_DBGMSG("!! Failed to alloc sp2d size=%d\n", alloc_size);
- return -ENOMEM;
- }
-
- sp2d = &_aab->__asp2d.sp2d;
- *psp2d = sp2d; /* From here Just call _sp2d_free */
-
- __a1pa = _aab->__a1pa;
- __a1pa_end = __a1pa + num_a1pa;
-
- for (i = 0; i < pages_in_unit; ++i) {
- if (unlikely(__a1pa >= __a1pa_end)) {
- num_a1pa = min_t(unsigned, PAGE_SIZE / sizeof__a1pa,
- pages_in_unit - i);
-
- __a1pa = kcalloc(num_a1pa, sizeof__a1pa, GFP_KERNEL);
- if (unlikely(!__a1pa)) {
- ORE_DBGMSG("!! Failed to _alloc_1p_arrays=%d\n",
- num_a1pa);
- return -ENOMEM;
- }
- __a1pa_end = __a1pa + num_a1pa;
- /* First *pages is marked for kfree of the buffer */
- sp2d->_1p_stripes[i].alloc = true;
- }
-
- sp2d->_1p_stripes[i].pages = __a1pa->pages;
- sp2d->_1p_stripes[i].scribble = __a1pa->scribble ;
- sp2d->_1p_stripes[i].page_is_read = __a1pa->page_is_read;
- ++__a1pa;
- }
-
- sp2d->parity = parity;
- sp2d->data_devs = data_devs;
- sp2d->pages_in_unit = pages_in_unit;
- return 0;
-}
-
-static void _sp2d_reset(struct __stripe_pages_2d *sp2d,
- const struct _ore_r4w_op *r4w, void *priv)
-{
- unsigned data_devs = sp2d->data_devs;
- unsigned group_width = data_devs + sp2d->parity;
- int p, c;
-
- if (!sp2d->needed)
- return;
-
- for (c = data_devs - 1; c >= 0; --c)
- for (p = sp2d->pages_in_unit - 1; p >= 0; --p) {
- struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
-
- if (_1ps->page_is_read[c]) {
- struct page *page = _1ps->pages[c];
-
- r4w->put_page(priv, page);
- _1ps->page_is_read[c] = false;
- }
- }
-
- for (p = 0; p < sp2d->pages_in_unit; p++) {
- struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
-
- memset(_1ps->pages, 0, group_width * sizeof(*_1ps->pages));
- _1ps->write_count = 0;
- _1ps->tx = NULL;
- }
-
- sp2d->needed = false;
-}
-
-static void _sp2d_free(struct __stripe_pages_2d *sp2d)
-{
- unsigned i;
-
- if (!sp2d)
- return;
-
- for (i = 0; i < sp2d->pages_in_unit; ++i) {
- if (sp2d->_1p_stripes[i].alloc)
- kfree(sp2d->_1p_stripes[i].pages);
- }
-
- kfree(sp2d);
-}
-
-static unsigned _sp2d_min_pg(struct __stripe_pages_2d *sp2d)
-{
- unsigned p;
-
- for (p = 0; p < sp2d->pages_in_unit; p++) {
- struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
-
- if (_1ps->write_count)
- return p;
- }
-
- return ~0;
-}
-
-static unsigned _sp2d_max_pg(struct __stripe_pages_2d *sp2d)
-{
- int p;
-
- for (p = sp2d->pages_in_unit - 1; p >= 0; --p) {
- struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
-
- if (_1ps->write_count)
- return p;
- }
-
- return ~0;
-}
-
-static void _gen_xor_unit(struct __stripe_pages_2d *sp2d)
-{
- unsigned p;
- unsigned tx_flags = ASYNC_TX_ACK;
-
- if (sp2d->parity == 1)
- tx_flags |= ASYNC_TX_XOR_ZERO_DST;
-
- for (p = 0; p < sp2d->pages_in_unit; p++) {
- struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
-
- if (!_1ps->write_count)
- continue;
-
- init_async_submit(&_1ps->submit, tx_flags,
- NULL, NULL, NULL, (addr_conv_t *)_1ps->scribble);
-
- if (sp2d->parity == 1)
- _1ps->tx = async_xor(_1ps->pages[sp2d->data_devs],
- _1ps->pages, 0, sp2d->data_devs,
- PAGE_SIZE, &_1ps->submit);
- else /* parity == 2 */
- _1ps->tx = async_gen_syndrome(_1ps->pages, 0,
- sp2d->data_devs + sp2d->parity,
- PAGE_SIZE, &_1ps->submit);
- }
-
- for (p = 0; p < sp2d->pages_in_unit; p++) {
- struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
- /* NOTE: We wait for HW synchronously (I don't have such HW
- * to test with.) Is parallelism needed with today's multi
- * cores?
- */
- async_tx_issue_pending(_1ps->tx);
- }
-}
-
-void _ore_add_stripe_page(struct __stripe_pages_2d *sp2d,
- struct ore_striping_info *si, struct page *page)
-{
- struct __1_page_stripe *_1ps;
-
- sp2d->needed = true;
-
- _1ps = &sp2d->_1p_stripes[si->cur_pg];
- _1ps->pages[si->cur_comp] = page;
- ++_1ps->write_count;
-
- si->cur_pg = (si->cur_pg + 1) % sp2d->pages_in_unit;
- /* si->cur_comp is advanced outside at main loop */
-}
-
-void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
- bool not_last)
-{
- struct osd_sg_entry *sge;
-
- ORE_DBGMSG("dev=%d cur_len=0x%x not_last=%d cur_sg=%d "
- "offset=0x%llx length=0x%x last_sgs_total=0x%x\n",
- per_dev->dev, cur_len, not_last, per_dev->cur_sg,
- _LLU(per_dev->offset), per_dev->length,
- per_dev->last_sgs_total);
-
- if (!per_dev->cur_sg) {
- sge = per_dev->sglist;
-
- /* First time we prepare two entries */
- if (per_dev->length) {
- ++per_dev->cur_sg;
- sge->offset = per_dev->offset;
- sge->len = per_dev->length;
- } else {
- /* Here the parity is the first unit of this object.
- * This happens every time we reach a parity device on
- * the same stripe as the per_dev->offset. We need to
- * just skip this unit.
- */
- per_dev->offset += cur_len;
- return;
- }
- } else {
- /* finalize the last one */
- sge = &per_dev->sglist[per_dev->cur_sg - 1];
- sge->len = per_dev->length - per_dev->last_sgs_total;
- }
-
- if (not_last) {
- /* Partly prepare the next one */
- struct osd_sg_entry *next_sge = sge + 1;
-
- ++per_dev->cur_sg;
- next_sge->offset = sge->offset + sge->len + cur_len;
- /* Save cur len so we know how mutch was added next time */
- per_dev->last_sgs_total = per_dev->length;
- next_sge->len = 0;
- } else if (!sge->len) {
- /* Optimize for when the last unit is a parity */
- --per_dev->cur_sg;
- }
-}
-
-static int _alloc_read_4_write(struct ore_io_state *ios)
-{
- struct ore_layout *layout = ios->layout;
- int ret;
- /* We want to only read those pages not in cache so worst case
- * is a stripe populated with every other page
- */
- unsigned sgs_per_dev = ios->sp2d->pages_in_unit + 2;
-
- ret = _ore_get_io_state(layout, ios->oc,
- layout->group_width * layout->mirrors_p1,
- sgs_per_dev, 0, &ios->ios_read_4_write);
- return ret;
-}
-
-/* @si contains info of the to-be-inserted page. Update of @si should be
- * maintained by caller. Specificaly si->dev, si->obj_offset, ...
- */
-static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
- struct page *page, unsigned pg_len)
-{
- struct request_queue *q;
- struct ore_per_dev_state *per_dev;
- struct ore_io_state *read_ios;
- unsigned first_dev = si->dev - (si->dev %
- (ios->layout->group_width * ios->layout->mirrors_p1));
- unsigned comp = si->dev - first_dev;
- unsigned added_len;
-
- if (!ios->ios_read_4_write) {
- int ret = _alloc_read_4_write(ios);
-
- if (unlikely(ret))
- return ret;
- }
-
- read_ios = ios->ios_read_4_write;
- read_ios->numdevs = ios->layout->group_width * ios->layout->mirrors_p1;
-
- per_dev = &read_ios->per_dev[comp];
- if (!per_dev->length) {
- per_dev->bio = bio_kmalloc(GFP_KERNEL,
- ios->sp2d->pages_in_unit);
- if (unlikely(!per_dev->bio)) {
- ORE_DBGMSG("Failed to allocate BIO size=%u\n",
- ios->sp2d->pages_in_unit);
- return -ENOMEM;
- }
- per_dev->offset = si->obj_offset;
- per_dev->dev = si->dev;
- } else if (si->obj_offset != (per_dev->offset + per_dev->length)) {
- u64 gap = si->obj_offset - (per_dev->offset + per_dev->length);
-
- _ore_add_sg_seg(per_dev, gap, true);
- }
- q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
- added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len,
- si->obj_offset % PAGE_SIZE);
- if (unlikely(added_len != pg_len)) {
- ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
- per_dev->bio->bi_vcnt);
- return -ENOMEM;
- }
-
- per_dev->length += pg_len;
- return 0;
-}
-
-/* read the beginning of an unaligned first page */
-static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
-{
- struct ore_striping_info si;
- unsigned pg_len;
-
- ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
-
- pg_len = si.obj_offset % PAGE_SIZE;
- si.obj_offset -= pg_len;
-
- ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
- _LLU(si.obj_offset), pg_len, page->index, si.dev);
-
- return _add_to_r4w(ios, &si, page, pg_len);
-}
-
-/* read the end of an incomplete last page */
-static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
-{
- struct ore_striping_info si;
- struct page *page;
- unsigned pg_len, p, c;
-
- ore_calc_stripe_info(ios->layout, *offset, 0, &si);
-
- p = si.cur_pg;
- c = si.cur_comp;
- page = ios->sp2d->_1p_stripes[p].pages[c];
-
- pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
- *offset += pg_len;
-
- ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
- p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
-
- BUG_ON(!page);
-
- return _add_to_r4w(ios, &si, page, pg_len);
-}
-
-static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
-{
- struct bio_vec *bv;
- unsigned i, d;
-
- /* loop on all devices all pages */
- for (d = 0; d < ios->numdevs; d++) {
- struct bio *bio = ios->per_dev[d].bio;
-
- if (!bio)
- continue;
-
- bio_for_each_segment_all(bv, bio, i) {
- struct page *page = bv->bv_page;
-
- SetPageUptodate(page);
- if (PageError(page))
- ClearPageError(page);
- }
- }
-}
-
-/* read_4_write is hacked to read the start of the first stripe and/or
- * the end of the last stripe. If needed, with an sg-gap at each device/page.
- * It is assumed to be called after the to_be_written pages of the first stripe
- * are populating ios->sp2d[][]
- *
- * NOTE: We call ios->r4w->lock_fn for all pages needed for parity calculations
- * These pages are held at sp2d[p].pages[c] but with
- * sp2d[p].page_is_read[c] = true. At _sp2d_reset these pages are
- * ios->r4w->lock_fn(). The ios->r4w->lock_fn might signal that the page is
- * @uptodate=true, so we don't need to read it, only unlock, after IO.
- *
- * TODO: The read_4_write should calc a need_to_read_pages_count, if bigger then
- * to-be-written count, we should consider the xor-in-place mode.
- * need_to_read_pages_count is the actual number of pages not present in cache.
- * maybe "devs_in_group - ios->sp2d[p].write_count" is a good enough
- * approximation? In this mode the read pages are put in the empty places of
- * ios->sp2d[p][*], xor is calculated the same way. These pages are
- * allocated/freed and don't go through cache
- */
-static int _read_4_write_first_stripe(struct ore_io_state *ios)
-{
- struct ore_striping_info read_si;
- struct __stripe_pages_2d *sp2d = ios->sp2d;
- u64 offset = ios->si.first_stripe_start;
- unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
-
- if (offset == ios->offset) /* Go to start collect $200 */
- goto read_last_stripe;
-
- min_p = _sp2d_min_pg(sp2d);
- max_p = _sp2d_max_pg(sp2d);
-
- ORE_DBGMSG("stripe_start=0x%llx ios->offset=0x%llx min_p=%d max_p=%d\n",
- offset, ios->offset, min_p, max_p);
-
- for (c = 0; ; c++) {
- ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
- read_si.obj_offset += min_p * PAGE_SIZE;
- offset += min_p * PAGE_SIZE;
- for (p = min_p; p <= max_p; p++) {
- struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
- struct page **pp = &_1ps->pages[c];
- bool uptodate;
-
- if (*pp) {
- if (ios->offset % PAGE_SIZE)
- /* Read the remainder of the page */
- _add_to_r4w_first_page(ios, *pp);
- /* to-be-written pages start here */
- goto read_last_stripe;
- }
-
- *pp = ios->r4w->get_page(ios->private, offset,
- &uptodate);
- if (unlikely(!*pp))
- return -ENOMEM;
-
- if (!uptodate)
- _add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
-
- /* Mark read-pages to be cache_released */
- _1ps->page_is_read[c] = true;
- read_si.obj_offset += PAGE_SIZE;
- offset += PAGE_SIZE;
- }
- offset += (sp2d->pages_in_unit - p) * PAGE_SIZE;
- }
-
-read_last_stripe:
- return 0;
-}
-
-static int _read_4_write_last_stripe(struct ore_io_state *ios)
-{
- struct ore_striping_info read_si;
- struct __stripe_pages_2d *sp2d = ios->sp2d;
- u64 offset;
- u64 last_stripe_end;
- unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
- unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
-
- offset = ios->offset + ios->length;
- if (offset % PAGE_SIZE)
- _add_to_r4w_last_page(ios, &offset);
- /* offset will be aligned to next page */
-
- last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
- * bytes_in_stripe;
- if (offset == last_stripe_end) /* Optimize for the aligned case */
- goto read_it;
-
- ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
- p = read_si.cur_pg;
- c = read_si.cur_comp;
-
- if (min_p == sp2d->pages_in_unit) {
- /* Didn't do it yet */
- min_p = _sp2d_min_pg(sp2d);
- max_p = _sp2d_max_pg(sp2d);
- }
-
- ORE_DBGMSG("offset=0x%llx stripe_end=0x%llx min_p=%d max_p=%d\n",
- offset, last_stripe_end, min_p, max_p);
-
- while (offset < last_stripe_end) {
- struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
-
- if ((min_p <= p) && (p <= max_p)) {
- struct page *page;
- bool uptodate;
-
- BUG_ON(_1ps->pages[c]);
- page = ios->r4w->get_page(ios->private, offset,
- &uptodate);
- if (unlikely(!page))
- return -ENOMEM;
-
- _1ps->pages[c] = page;
- /* Mark read-pages to be cache_released */
- _1ps->page_is_read[c] = true;
- if (!uptodate)
- _add_to_r4w(ios, &read_si, page, PAGE_SIZE);
- }
-
- offset += PAGE_SIZE;
- if (p == (sp2d->pages_in_unit - 1)) {
- ++c;
- p = 0;
- ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
- } else {
- read_si.obj_offset += PAGE_SIZE;
- ++p;
- }
- }
-
-read_it:
- return 0;
-}
-
-static int _read_4_write_execute(struct ore_io_state *ios)
-{
- struct ore_io_state *ios_read;
- unsigned i;
- int ret;
-
- ios_read = ios->ios_read_4_write;
- if (!ios_read)
- return 0;
-
- /* FIXME: Ugly to signal _sbi_read_mirror that we have bio(s). Change
- * to check for per_dev->bio
- */
- ios_read->pages = ios->pages;
-
- /* Now read these devices */
- for (i = 0; i < ios_read->numdevs; i += ios_read->layout->mirrors_p1) {
- ret = _ore_read_mirror(ios_read, i);
- if (unlikely(ret))
- return ret;
- }
-
- ret = ore_io_execute(ios_read); /* Synchronus execution */
- if (unlikely(ret)) {
- ORE_DBGMSG("!! ore_io_execute => %d\n", ret);
- return ret;
- }
-
- _mark_read4write_pages_uptodate(ios_read, ret);
- ore_put_io_state(ios_read);
- ios->ios_read_4_write = NULL; /* Might need a reuse at last stripe */
- return 0;
-}
-
-/* In writes @cur_len means length left. .i.e cur_len==0 is the last parity U */
-int _ore_add_parity_unit(struct ore_io_state *ios,
- struct ore_striping_info *si,
- struct ore_per_dev_state *per_dev,
- unsigned cur_len, bool do_xor)
-{
- if (ios->reading) {
- if (per_dev->cur_sg >= ios->sgs_per_dev) {
- ORE_DBGMSG("cur_sg(%d) >= sgs_per_dev(%d)\n" ,
- per_dev->cur_sg, ios->sgs_per_dev);
- return -ENOMEM;
- }
- _ore_add_sg_seg(per_dev, cur_len, true);
- } else {
- struct __stripe_pages_2d *sp2d = ios->sp2d;
- struct page **pages = ios->parity_pages + ios->cur_par_page;
- unsigned num_pages;
- unsigned array_start = 0;
- unsigned i;
- int ret;
-
- si->cur_pg = _sp2d_min_pg(sp2d);
- num_pages = _sp2d_max_pg(sp2d) + 1 - si->cur_pg;
-
- if (!per_dev->length) {
- per_dev->offset += si->cur_pg * PAGE_SIZE;
- /* If first stripe, Read in all read4write pages
- * (if needed) before we calculate the first parity.
- */
- if (do_xor)
- _read_4_write_first_stripe(ios);
- }
- if (!cur_len && do_xor)
- /* If last stripe r4w pages of last stripe */
- _read_4_write_last_stripe(ios);
- _read_4_write_execute(ios);
-
- for (i = 0; i < num_pages; i++) {
- pages[i] = _raid_page_alloc();
- if (unlikely(!pages[i]))
- return -ENOMEM;
-
- ++(ios->cur_par_page);
- }
-
- BUG_ON(si->cur_comp < sp2d->data_devs);
- BUG_ON(si->cur_pg + num_pages > sp2d->pages_in_unit);
-
- ret = _ore_add_stripe_unit(ios, &array_start, 0, pages,
- per_dev, num_pages * PAGE_SIZE);
- if (unlikely(ret))
- return ret;
-
- if (do_xor) {
- _gen_xor_unit(sp2d);
- _sp2d_reset(sp2d, ios->r4w, ios->private);
- }
- }
- return 0;
-}
-
-int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
-{
- if (ios->parity_pages) {
- struct ore_layout *layout = ios->layout;
- unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
-
- if (_sp2d_alloc(pages_in_unit, layout->group_width,
- layout->parity, &ios->sp2d)) {
- return -ENOMEM;
- }
- }
- return 0;
-}
-
-void _ore_free_raid_stuff(struct ore_io_state *ios)
-{
- if (ios->sp2d) { /* writing and raid */
- unsigned i;
-
- for (i = 0; i < ios->cur_par_page; i++) {
- struct page *page = ios->parity_pages[i];
-
- if (page)
- _raid_page_free(page);
- }
- if (ios->extra_part_alloc)
- kfree(ios->parity_pages);
- /* If IO returned an error pages might need unlocking */
- _sp2d_reset(ios->sp2d, ios->r4w, ios->private);
- _sp2d_free(ios->sp2d);
- } else {
- /* Will only be set if raid reading && sglist is big */
- if (ios->extra_part_alloc)
- kfree(ios->per_dev[0].sglist);
- }
- if (ios->ios_read_4_write)
- ore_put_io_state(ios->ios_read_4_write);
-}
diff --git a/fs/exofs/ore_raid.h b/fs/exofs/ore_raid.h
deleted file mode 100644
index a6e746775570..000000000000
--- a/fs/exofs/ore_raid.h
+++ /dev/null
@@ -1,62 +0,0 @@
-/*
- * Copyright (C) from 2011
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * This file is part of the objects raid engine (ore).
- *
- * It is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
- *
- * You should have received a copy of the GNU General Public License
- * along with "ore". If not, write to the Free Software Foundation, Inc:
- * "Free Software Foundation <info@fsf.org>"
- */
-
-#include <scsi/osd_ore.h>
-
-#define ORE_ERR(fmt, a...) printk(KERN_ERR "ore: " fmt, ##a)
-
-#ifdef CONFIG_EXOFS_DEBUG
-#define ORE_DBGMSG(fmt, a...) \
- printk(KERN_NOTICE "ore @%s:%d: " fmt, __func__, __LINE__, ##a)
-#else
-#define ORE_DBGMSG(fmt, a...) \
- do { if (0) printk(fmt, ##a); } while (0)
-#endif
-
-/* u64 has problems with printk this will cast it to unsigned long long */
-#define _LLU(x) (unsigned long long)(x)
-
-#define ORE_DBGMSG2(M...) do {} while (0)
-/* #define ORE_DBGMSG2 ORE_DBGMSG */
-
-/* ios_raid.c stuff needed by ios.c */
-int _ore_post_alloc_raid_stuff(struct ore_io_state *ios);
-void _ore_free_raid_stuff(struct ore_io_state *ios);
-
-void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
- bool not_last);
-int _ore_add_parity_unit(struct ore_io_state *ios, struct ore_striping_info *si,
- struct ore_per_dev_state *per_dev, unsigned cur_len,
- bool do_xor);
-void _ore_add_stripe_page(struct __stripe_pages_2d *sp2d,
- struct ore_striping_info *si, struct page *page);
-static inline void _add_stripe_page(struct __stripe_pages_2d *sp2d,
- struct ore_striping_info *si, struct page *page)
-{
- if (!sp2d) /* Inline the fast path */
- return; /* Hay no raid stuff */
- _ore_add_stripe_page(sp2d, si, page);
-}
-
-/* ios.c stuff needed by ios_raid.c */
-int _ore_get_io_state(struct ore_layout *layout,
- struct ore_components *oc, unsigned numdevs,
- unsigned sgs_per_dev, unsigned num_par_pages,
- struct ore_io_state **pios);
-int _ore_add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
- unsigned pgbase, struct page **pages,
- struct ore_per_dev_state *per_dev, int cur_len);
-int _ore_read_mirror(struct ore_io_state *ios, unsigned cur_comp);
-int ore_io_execute(struct ore_io_state *ios);
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
deleted file mode 100644
index 1076a4233b39..000000000000
--- a/fs/exofs/super.c
+++ /dev/null
@@ -1,1049 +0,0 @@
-/*
- * Copyright (C) 2005, 2006
- * Avishay Traeger (avishay@gmail.com)
- * Copyright (C) 2008, 2009
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * Copyrights for code taken from ext2:
- * Copyright (C) 1992, 1993, 1994, 1995
- * Remy Card (card@masi.ibp.fr)
- * Laboratoire MASI - Institut Blaise Pascal
- * Universite Pierre et Marie Curie (Paris VI)
- * from
- * linux/fs/minix/inode.c
- * Copyright (C) 1991, 1992 Linus Torvalds
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation. Since it is based on ext2, and the only
- * valid version of GPL for the Linux kernel is version 2, the only valid
- * version of GPL for exofs is version 2.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-
-#include <linux/string.h>
-#include <linux/parser.h>
-#include <linux/vfs.h>
-#include <linux/random.h>
-#include <linux/module.h>
-#include <linux/exportfs.h>
-#include <linux/slab.h>
-
-#include "exofs.h"
-
-#define EXOFS_DBGMSG2(M...) do {} while (0)
-
-/******************************************************************************
- * MOUNT OPTIONS
- *****************************************************************************/
-
-/*
- * struct to hold what we get from mount options
- */
-struct exofs_mountopt {
- bool is_osdname;
- const char *dev_name;
- uint64_t pid;
- int timeout;
-};
-
-/*
- * exofs-specific mount-time options.
- */
-enum { Opt_name, Opt_pid, Opt_to, Opt_err };
-
-/*
- * Our mount-time options. These should ideally be 64-bit unsigned, but the
- * kernel's parsing functions do not currently support that. 32-bit should be
- * sufficient for most applications now.
- */
-static match_table_t tokens = {
- {Opt_name, "osdname=%s"},
- {Opt_pid, "pid=%u"},
- {Opt_to, "to=%u"},
- {Opt_err, NULL}
-};
-
-/*
- * The main option parsing method. Also makes sure that all of the mandatory
- * mount options were set.
- */
-static int parse_options(char *options, struct exofs_mountopt *opts)
-{
- char *p;
- substring_t args[MAX_OPT_ARGS];
- int option;
- bool s_pid = false;
-
- EXOFS_DBGMSG("parse_options %s\n", options);
- /* defaults */
- memset(opts, 0, sizeof(*opts));
- opts->timeout = BLK_DEFAULT_SG_TIMEOUT;
-
- while ((p = strsep(&options, ",")) != NULL) {
- int token;
- char str[32];
-
- if (!*p)
- continue;
-
- token = match_token(p, tokens, args);
- switch (token) {
- case Opt_name:
- opts->dev_name = match_strdup(&args[0]);
- if (unlikely(!opts->dev_name)) {
- EXOFS_ERR("Error allocating dev_name");
- return -ENOMEM;
- }
- opts->is_osdname = true;
- break;
- case Opt_pid:
- if (0 == match_strlcpy(str, &args[0], sizeof(str)))
- return -EINVAL;
- opts->pid = simple_strtoull(str, NULL, 0);
- if (opts->pid < EXOFS_MIN_PID) {
- EXOFS_ERR("Partition ID must be >= %u",
- EXOFS_MIN_PID);
- return -EINVAL;
- }
- s_pid = 1;
- break;
- case Opt_to:
- if (match_int(&args[0], &option))
- return -EINVAL;
- if (option <= 0) {
- EXOFS_ERR("Timeout must be > 0");
- return -EINVAL;
- }
- opts->timeout = option * HZ;
- break;
- }
- }
-
- if (!s_pid) {
- EXOFS_ERR("Need to specify the following options:\n");
- EXOFS_ERR(" -o pid=pid_no_to_use\n");
- return -EINVAL;
- }
-
- return 0;
-}
-
-/******************************************************************************
- * INODE CACHE
- *****************************************************************************/
-
-/*
- * Our inode cache. Isn't it pretty?
- */
-static struct kmem_cache *exofs_inode_cachep;
-
-/*
- * Allocate an inode in the cache
- */
-static struct inode *exofs_alloc_inode(struct super_block *sb)
-{
- struct exofs_i_info *oi;
-
- oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);
- if (!oi)
- return NULL;
-
- oi->vfs_inode.i_version = 1;
- return &oi->vfs_inode;
-}
-
-static void exofs_i_callback(struct rcu_head *head)
-{
- struct inode *inode = container_of(head, struct inode, i_rcu);
- kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
-}
-
-/*
- * Remove an inode from the cache
- */
-static void exofs_destroy_inode(struct inode *inode)
-{
- call_rcu(&inode->i_rcu, exofs_i_callback);
-}
-
-/*
- * Initialize the inode
- */
-static void exofs_init_once(void *foo)
-{
- struct exofs_i_info *oi = foo;
-
- inode_init_once(&oi->vfs_inode);
-}
-
-/*
- * Create and initialize the inode cache
- */
-static int init_inodecache(void)
-{
- exofs_inode_cachep = kmem_cache_create("exofs_inode_cache",
- sizeof(struct exofs_i_info), 0,
- SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD |
- SLAB_ACCOUNT, exofs_init_once);
- if (exofs_inode_cachep == NULL)
- return -ENOMEM;
- return 0;
-}
-
-/*
- * Destroy the inode cache
- */
-static void destroy_inodecache(void)
-{
- /*
- * Make sure all delayed rcu free inodes are flushed before we
- * destroy cache.
- */
- rcu_barrier();
- kmem_cache_destroy(exofs_inode_cachep);
-}
-
-/******************************************************************************
- * Some osd helpers
- *****************************************************************************/
-void exofs_make_credential(u8 cred_a[OSD_CAP_LEN], const struct osd_obj_id *obj)
-{
- osd_sec_init_nosec_doall_caps(cred_a, obj, false, true);
-}
-
-static int exofs_read_kern(struct osd_dev *od, u8 *cred, struct osd_obj_id *obj,
- u64 offset, void *p, unsigned length)
-{
- struct osd_request *or = osd_start_request(od, GFP_KERNEL);
-/* struct osd_sense_info osi = {.key = 0};*/
- int ret;
-
- if (unlikely(!or)) {
- EXOFS_DBGMSG("%s: osd_start_request failed.\n", __func__);
- return -ENOMEM;
- }
- ret = osd_req_read_kern(or, obj, offset, p, length);
- if (unlikely(ret)) {
- EXOFS_DBGMSG("%s: osd_req_read_kern failed.\n", __func__);
- goto out;
- }
-
- ret = osd_finalize_request(or, 0, cred, NULL);
- if (unlikely(ret)) {
- EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
- goto out;
- }
-
- ret = osd_execute_request(or);
- if (unlikely(ret))
- EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
- /* osd_req_decode_sense(or, ret); */
-
-out:
- osd_end_request(or);
- EXOFS_DBGMSG2("read_kern(0x%llx) offset=0x%llx "
- "length=0x%llx dev=%p ret=>%d\n",
- _LLU(obj->id), _LLU(offset), _LLU(length), od, ret);
- return ret;
-}
-
-static const struct osd_attr g_attr_sb_stats = ATTR_DEF(
- EXOFS_APAGE_SB_DATA,
- EXOFS_ATTR_SB_STATS,
- sizeof(struct exofs_sb_stats));
-
-static int __sbi_read_stats(struct exofs_sb_info *sbi)
-{
- struct osd_attr attrs[] = {
- [0] = g_attr_sb_stats,
- };
- struct ore_io_state *ios;
- int ret;
-
- ret = ore_get_io_state(&sbi->layout, &sbi->oc, &ios);
- if (unlikely(ret)) {
- EXOFS_ERR("%s: ore_get_io_state failed.\n", __func__);
- return ret;
- }
-
- ios->in_attr = attrs;
- ios->in_attr_len = ARRAY_SIZE(attrs);
-
- ret = ore_read(ios);
- if (unlikely(ret)) {
- EXOFS_ERR("Error reading super_block stats => %d\n", ret);
- goto out;
- }
-
- ret = extract_attr_from_ios(ios, &attrs[0]);
- if (ret) {
- EXOFS_ERR("%s: extract_attr of sb_stats failed\n", __func__);
- goto out;
- }
- if (attrs[0].len) {
- struct exofs_sb_stats *ess;
-
- if (unlikely(attrs[0].len != sizeof(*ess))) {
- EXOFS_ERR("%s: Wrong version of exofs_sb_stats "
- "size(%d) != expected(%zd)\n",
- __func__, attrs[0].len, sizeof(*ess));
- goto out;
- }
-
- ess = attrs[0].val_ptr;
- sbi->s_nextid = le64_to_cpu(ess->s_nextid);
- sbi->s_numfiles = le32_to_cpu(ess->s_numfiles);
- }
-
-out:
- ore_put_io_state(ios);
- return ret;
-}
-
-static void stats_done(struct ore_io_state *ios, void *p)
-{
- ore_put_io_state(ios);
- /* Good thanks nothing to do anymore */
-}
-
-/* Asynchronously write the stats attribute */
-int exofs_sbi_write_stats(struct exofs_sb_info *sbi)
-{
- struct osd_attr attrs[] = {
- [0] = g_attr_sb_stats,
- };
- struct ore_io_state *ios;
- int ret;
-
- ret = ore_get_io_state(&sbi->layout, &sbi->oc, &ios);
- if (unlikely(ret)) {
- EXOFS_ERR("%s: ore_get_io_state failed.\n", __func__);
- return ret;
- }
-
- sbi->s_ess.s_nextid = cpu_to_le64(sbi->s_nextid);
- sbi->s_ess.s_numfiles = cpu_to_le64(sbi->s_numfiles);
- attrs[0].val_ptr = &sbi->s_ess;
-
-
- ios->done = stats_done;
- ios->private = sbi;
- ios->out_attr = attrs;
- ios->out_attr_len = ARRAY_SIZE(attrs);
-
- ret = ore_write(ios);
- if (unlikely(ret)) {
- EXOFS_ERR("%s: ore_write failed.\n", __func__);
- ore_put_io_state(ios);
- }
-
- return ret;
-}
-
-/******************************************************************************
- * SUPERBLOCK FUNCTIONS
- *****************************************************************************/
-static const struct super_operations exofs_sops;
-static const struct export_operations exofs_export_ops;
-
-/*
- * Write the superblock to the OSD
- */
-static int exofs_sync_fs(struct super_block *sb, int wait)
-{
- struct exofs_sb_info *sbi;
- struct exofs_fscb *fscb;
- struct ore_comp one_comp;
- struct ore_components oc;
- struct ore_io_state *ios;
- int ret = -ENOMEM;
-
- fscb = kmalloc(sizeof(*fscb), GFP_KERNEL);
- if (unlikely(!fscb))
- return -ENOMEM;
-
- sbi = sb->s_fs_info;
-
- /* NOTE: We no longer dirty the super_block anywhere in exofs. The
- * reason we write the fscb here on unmount is so we can stay backwards
- * compatible with fscb->s_version == 1. (What we are not compatible
- * with is if a new version FS crashed and then we try to mount an old
- * version). Otherwise the exofs_fscb is read-only from mkfs time. All
- * the writeable info is set in exofs_sbi_write_stats() above.
- */
-
- exofs_init_comps(&oc, &one_comp, sbi, EXOFS_SUPER_ID);
-
- ret = ore_get_io_state(&sbi->layout, &oc, &ios);
- if (unlikely(ret))
- goto out;
-
- ios->length = offsetof(struct exofs_fscb, s_dev_table_oid);
- memset(fscb, 0, ios->length);
- fscb->s_nextid = cpu_to_le64(sbi->s_nextid);
- fscb->s_numfiles = cpu_to_le64(sbi->s_numfiles);
- fscb->s_magic = cpu_to_le16(sb->s_magic);
- fscb->s_newfs = 0;
- fscb->s_version = EXOFS_FSCB_VER;
-
- ios->offset = 0;
- ios->kern_buff = fscb;
-
- ret = ore_write(ios);
- if (unlikely(ret))
- EXOFS_ERR("%s: ore_write failed.\n", __func__);
-
-out:
- EXOFS_DBGMSG("s_nextid=0x%llx ret=%d\n", _LLU(sbi->s_nextid), ret);
- ore_put_io_state(ios);
- kfree(fscb);
- return ret;
-}
-
-static void _exofs_print_device(const char *msg, const char *dev_path,
- struct osd_dev *od, u64 pid)
-{
- const struct osd_dev_info *odi = osduld_device_info(od);
-
- printk(KERN_NOTICE "exofs: %s %s osd_name-%s pid-0x%llx\n",
- msg, dev_path ?: "", odi->osdname, _LLU(pid));
-}
-
-static void exofs_free_sbi(struct exofs_sb_info *sbi)
-{
- unsigned numdevs = sbi->oc.numdevs;
-
- while (numdevs) {
- unsigned i = --numdevs;
- struct osd_dev *od = ore_comp_dev(&sbi->oc, i);
-
- if (od) {
- ore_comp_set_dev(&sbi->oc, i, NULL);
- osduld_put_device(od);
- }
- }
- kfree(sbi->oc.ods);
- kfree(sbi);
-}
-
-/*
- * This function is called when the vfs is freeing the superblock. We just
- * need to free our own part.
- */
-static void exofs_put_super(struct super_block *sb)
-{
- int num_pend;
- struct exofs_sb_info *sbi = sb->s_fs_info;
-
- /* make sure there are no pending commands */
- for (num_pend = atomic_read(&sbi->s_curr_pending); num_pend > 0;
- num_pend = atomic_read(&sbi->s_curr_pending)) {
- wait_queue_head_t wq;
-
- printk(KERN_NOTICE "%s: !!Pending operations in flight. "
- "This is a BUG. please report to osd-dev@open-osd.org\n",
- __func__);
- init_waitqueue_head(&wq);
- wait_event_timeout(wq,
- (atomic_read(&sbi->s_curr_pending) == 0),
- msecs_to_jiffies(100));
- }
-
- _exofs_print_device("Unmounting", NULL, ore_comp_dev(&sbi->oc, 0),
- sbi->one_comp.obj.partition);
-
- exofs_sysfs_sb_del(sbi);
- bdi_destroy(&sbi->bdi);
- exofs_free_sbi(sbi);
- sb->s_fs_info = NULL;
-}
-
-static int _read_and_match_data_map(struct exofs_sb_info *sbi, unsigned numdevs,
- struct exofs_device_table *dt)
-{
- int ret;
-
- sbi->layout.stripe_unit =
- le64_to_cpu(dt->dt_data_map.cb_stripe_unit);
- sbi->layout.group_width =
- le32_to_cpu(dt->dt_data_map.cb_group_width);
- sbi->layout.group_depth =
- le32_to_cpu(dt->dt_data_map.cb_group_depth);
- sbi->layout.mirrors_p1 =
- le32_to_cpu(dt->dt_data_map.cb_mirror_cnt) + 1;
- sbi->layout.raid_algorithm =
- le32_to_cpu(dt->dt_data_map.cb_raid_algorithm);
-
- ret = ore_verify_layout(numdevs, &sbi->layout);
-
- EXOFS_DBGMSG("exofs: layout: "
- "num_comps=%u stripe_unit=0x%x group_width=%u "
- "group_depth=0x%llx mirrors_p1=%u raid_algorithm=%u\n",
- numdevs,
- sbi->layout.stripe_unit,
- sbi->layout.group_width,
- _LLU(sbi->layout.group_depth),
- sbi->layout.mirrors_p1,
- sbi->layout.raid_algorithm);
- return ret;
-}
-
-static unsigned __ra_pages(struct ore_layout *layout)
-{
- const unsigned _MIN_RA = 32; /* min 128K read-ahead */
- unsigned ra_pages = layout->group_width * layout->stripe_unit /
- PAGE_SIZE;
- unsigned max_io_pages = exofs_max_io_pages(layout, ~0);
-
- ra_pages *= 2; /* two stripes */
- if (ra_pages < _MIN_RA)
- ra_pages = roundup(_MIN_RA, ra_pages / 2);
-
- if (ra_pages > max_io_pages)
- ra_pages = max_io_pages;
-
- return ra_pages;
-}
-
-/* @odi is valid only as long as @fscb_dev is valid */
-static int exofs_devs_2_odi(struct exofs_dt_device_info *dt_dev,
- struct osd_dev_info *odi)
-{
- odi->systemid_len = le32_to_cpu(dt_dev->systemid_len);
- if (likely(odi->systemid_len))
- memcpy(odi->systemid, dt_dev->systemid, OSD_SYSTEMID_LEN);
-
- odi->osdname_len = le32_to_cpu(dt_dev->osdname_len);
- odi->osdname = dt_dev->osdname;
-
- /* FIXME support long names. Will need a _put function */
- if (dt_dev->long_name_offset)
- return -EINVAL;
-
- /* Make sure osdname is printable!
- * mkexofs should give us space for a null-terminator else the
- * device-table is invalid.
- */
- if (unlikely(odi->osdname_len >= sizeof(dt_dev->osdname)))
- odi->osdname_len = sizeof(dt_dev->osdname) - 1;
- dt_dev->osdname[odi->osdname_len] = 0;
-
- /* If it's all zeros something is bad we read past end-of-obj */
- return !(odi->systemid_len || odi->osdname_len);
-}
-
-static int __alloc_dev_table(struct exofs_sb_info *sbi, unsigned numdevs,
- struct exofs_dev **peds)
-{
- struct __alloc_ore_devs_and_exofs_devs {
- /* Twice bigger table: See exofs_init_comps() and comment at
- * exofs_read_lookup_dev_table()
- */
- struct ore_dev *oreds[numdevs * 2 - 1];
- struct exofs_dev eds[numdevs];
- } *aoded;
- struct exofs_dev *eds;
- unsigned i;
-
- aoded = kzalloc(sizeof(*aoded), GFP_KERNEL);
- if (unlikely(!aoded)) {
- EXOFS_ERR("ERROR: failed allocating Device array[%d]\n",
- numdevs);
- return -ENOMEM;
- }
-
- sbi->oc.ods = aoded->oreds;
- *peds = eds = aoded->eds;
- for (i = 0; i < numdevs; ++i)
- aoded->oreds[i] = &eds[i].ored;
- return 0;
-}
-
-static int exofs_read_lookup_dev_table(struct exofs_sb_info *sbi,
- struct osd_dev *fscb_od,
- unsigned table_count)
-{
- struct ore_comp comp;
- struct exofs_device_table *dt;
- struct exofs_dev *eds;
- unsigned table_bytes = table_count * sizeof(dt->dt_dev_table[0]) +
- sizeof(*dt);
- unsigned numdevs, i;
- int ret;
-
- dt = kmalloc(table_bytes, GFP_KERNEL);
- if (unlikely(!dt)) {
- EXOFS_ERR("ERROR: allocating %x bytes for device table\n",
- table_bytes);
- return -ENOMEM;
- }
-
- sbi->oc.numdevs = 0;
-
- comp.obj.partition = sbi->one_comp.obj.partition;
- comp.obj.id = EXOFS_DEVTABLE_ID;
- exofs_make_credential(comp.cred, &comp.obj);
-
- ret = exofs_read_kern(fscb_od, comp.cred, &comp.obj, 0, dt,
- table_bytes);
- if (unlikely(ret)) {
- EXOFS_ERR("ERROR: reading device table\n");
- goto out;
- }
-
- numdevs = le64_to_cpu(dt->dt_num_devices);
- if (unlikely(!numdevs)) {
- ret = -EINVAL;
- goto out;
- }
- WARN_ON(table_count != numdevs);
-
- ret = _read_and_match_data_map(sbi, numdevs, dt);
- if (unlikely(ret))
- goto out;
-
- ret = __alloc_dev_table(sbi, numdevs, &eds);
- if (unlikely(ret))
- goto out;
- /* exofs round-robins the device table view according to inode
- * number. We hold a: twice bigger table hence inodes can point
- * to any device and have a sequential view of the table
- * starting at this device. See exofs_init_comps()
- */
- memcpy(&sbi->oc.ods[numdevs], &sbi->oc.ods[0],
- (numdevs - 1) * sizeof(sbi->oc.ods[0]));
-
- /* create sysfs subdir under which we put the device table
- * And cluster layout. A Superblock is identified by the string:
- * "dev[0].osdname"_"pid"
- */
- exofs_sysfs_sb_add(sbi, &dt->dt_dev_table[0]);
-
- for (i = 0; i < numdevs; i++) {
- struct exofs_fscb fscb;
- struct osd_dev_info odi;
- struct osd_dev *od;
-
- if (exofs_devs_2_odi(&dt->dt_dev_table[i], &odi)) {
- EXOFS_ERR("ERROR: Read all-zeros device entry\n");
- ret = -EINVAL;
- goto out;
- }
-
- printk(KERN_NOTICE "Add device[%d]: osd_name-%s\n",
- i, odi.osdname);
-
- /* the exofs id is currently the table index */
- eds[i].did = i;
-
- /* On all devices the device table is identical. The user can
- * specify any one of the participating devices on the command
- * line. We always keep them in device-table order.
- */
- if (fscb_od && osduld_device_same(fscb_od, &odi)) {
- eds[i].ored.od = fscb_od;
- ++sbi->oc.numdevs;
- fscb_od = NULL;
- exofs_sysfs_odev_add(&eds[i], sbi);
- continue;
- }
-
- od = osduld_info_lookup(&odi);
- if (IS_ERR(od)) {
- ret = PTR_ERR(od);
- EXOFS_ERR("ERROR: device requested is not found "
- "osd_name-%s =>%d\n", odi.osdname, ret);
- goto out;
- }
-
- eds[i].ored.od = od;
- ++sbi->oc.numdevs;
-
- /* Read the fscb of the other devices to make sure the FS
- * partition is there.
- */
- ret = exofs_read_kern(od, comp.cred, &comp.obj, 0, &fscb,
- sizeof(fscb));
- if (unlikely(ret)) {
- EXOFS_ERR("ERROR: Malformed participating device "
- "error reading fscb osd_name-%s\n",
- odi.osdname);
- goto out;
- }
- exofs_sysfs_odev_add(&eds[i], sbi);
-
- /* TODO: verify other information is correct and FS-uuid
- * matches. Benny what did you say about device table
- * generation and old devices?
- */
- }
-
-out:
- kfree(dt);
- if (unlikely(fscb_od && !ret)) {
- EXOFS_ERR("ERROR: Bad device-table container device not present\n");
- osduld_put_device(fscb_od);
- return -EINVAL;
- }
- return ret;
-}
-
-/*
- * Read the superblock from the OSD and fill in the fields
- */
-static int exofs_fill_super(struct super_block *sb, void *data, int silent)
-{
- struct inode *root;
- struct exofs_mountopt *opts = data;
- struct exofs_sb_info *sbi; /*extended info */
- struct osd_dev *od; /* Master device */
- struct exofs_fscb fscb; /*on-disk superblock info */
- struct ore_comp comp;
- unsigned table_count;
- int ret;
-
- sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
- if (!sbi)
- return -ENOMEM;
-
- /* use mount options to fill superblock */
- if (opts->is_osdname) {
- struct osd_dev_info odi = {.systemid_len = 0};
-
- odi.osdname_len = strlen(opts->dev_name);
- odi.osdname = (u8 *)opts->dev_name;
- od = osduld_info_lookup(&odi);
- kfree(opts->dev_name);
- opts->dev_name = NULL;
- } else {
- od = osduld_path_lookup(opts->dev_name);
- }
- if (IS_ERR(od)) {
- ret = -EINVAL;
- goto free_sbi;
- }
-
- /* Default layout in case we do not have a device-table */
- sbi->layout.stripe_unit = PAGE_SIZE;
- sbi->layout.mirrors_p1 = 1;
- sbi->layout.group_width = 1;
- sbi->layout.group_depth = -1;
- sbi->layout.group_count = 1;
- sbi->s_timeout = opts->timeout;
-
- sbi->one_comp.obj.partition = opts->pid;
- sbi->one_comp.obj.id = 0;
- exofs_make_credential(sbi->one_comp.cred, &sbi->one_comp.obj);
- sbi->oc.single_comp = EC_SINGLE_COMP;
- sbi->oc.comps = &sbi->one_comp;
-
- /* fill in some other data by hand */
- memset(sb->s_id, 0, sizeof(sb->s_id));
- strcpy(sb->s_id, "exofs");
- sb->s_blocksize = EXOFS_BLKSIZE;
- sb->s_blocksize_bits = EXOFS_BLKSHIFT;
- sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_max_links = EXOFS_LINK_MAX;
- atomic_set(&sbi->s_curr_pending, 0);
- sb->s_bdev = NULL;
- sb->s_dev = 0;
-
- comp.obj.partition = sbi->one_comp.obj.partition;
- comp.obj.id = EXOFS_SUPER_ID;
- exofs_make_credential(comp.cred, &comp.obj);
-
- ret = exofs_read_kern(od, comp.cred, &comp.obj, 0, &fscb, sizeof(fscb));
- if (unlikely(ret))
- goto free_sbi;
-
- sb->s_magic = le16_to_cpu(fscb.s_magic);
- /* NOTE: we read below to be backward compatible with old versions */
- sbi->s_nextid = le64_to_cpu(fscb.s_nextid);
- sbi->s_numfiles = le32_to_cpu(fscb.s_numfiles);
-
- /* make sure what we read from the object store is correct */
- if (sb->s_magic != EXOFS_SUPER_MAGIC) {
- if (!silent)
- EXOFS_ERR("ERROR: Bad magic value\n");
- ret = -EINVAL;
- goto free_sbi;
- }
- if (le32_to_cpu(fscb.s_version) > EXOFS_FSCB_VER) {
- EXOFS_ERR("ERROR: Bad FSCB version expected-%d got-%d\n",
- EXOFS_FSCB_VER, le32_to_cpu(fscb.s_version));
- ret = -EINVAL;
- goto free_sbi;
- }
-
- /* start generation numbers from a random point */
- get_random_bytes(&sbi->s_next_generation, sizeof(u32));
- spin_lock_init(&sbi->s_next_gen_lock);
-
- table_count = le64_to_cpu(fscb.s_dev_table_count);
- if (table_count) {
- ret = exofs_read_lookup_dev_table(sbi, od, table_count);
- if (unlikely(ret))
- goto free_sbi;
- } else {
- struct exofs_dev *eds;
-
- ret = __alloc_dev_table(sbi, 1, &eds);
- if (unlikely(ret))
- goto free_sbi;
-
- ore_comp_set_dev(&sbi->oc, 0, od);
- sbi->oc.numdevs = 1;
- }
-
- __sbi_read_stats(sbi);
-
- /* set up operation vectors */
- sbi->bdi.ra_pages = __ra_pages(&sbi->layout);
- sb->s_bdi = &sbi->bdi;
- sb->s_fs_info = sbi;
- sb->s_op = &exofs_sops;
- sb->s_export_op = &exofs_export_ops;
- root = exofs_iget(sb, EXOFS_ROOT_ID - EXOFS_OBJ_OFF);
- if (IS_ERR(root)) {
- EXOFS_ERR("ERROR: exofs_iget failed\n");
- ret = PTR_ERR(root);
- goto free_sbi;
- }
- sb->s_root = d_make_root(root);
- if (!sb->s_root) {
- EXOFS_ERR("ERROR: get root inode failed\n");
- ret = -ENOMEM;
- goto free_sbi;
- }
-
- if (!S_ISDIR(root->i_mode)) {
- dput(sb->s_root);
- sb->s_root = NULL;
- EXOFS_ERR("ERROR: corrupt root inode (mode = %hd)\n",
- root->i_mode);
- ret = -EINVAL;
- goto free_sbi;
- }
-
- ret = bdi_setup_and_register(&sbi->bdi, "exofs");
- if (ret) {
- EXOFS_DBGMSG("Failed to bdi_setup_and_register\n");
- dput(sb->s_root);
- sb->s_root = NULL;
- goto free_sbi;
- }
-
- exofs_sysfs_dbg_print();
- _exofs_print_device("Mounting", opts->dev_name,
- ore_comp_dev(&sbi->oc, 0),
- sbi->one_comp.obj.partition);
- return 0;
-
-free_sbi:
- EXOFS_ERR("Unable to mount exofs on %s pid=0x%llx err=%d\n",
- opts->dev_name, sbi->one_comp.obj.partition, ret);
- exofs_free_sbi(sbi);
- return ret;
-}
-
-/*
- * Set up the superblock (calls exofs_fill_super eventually)
- */
-static struct dentry *exofs_mount(struct file_system_type *type,
- int flags, const char *dev_name,
- void *data)
-{
- struct exofs_mountopt opts;
- int ret;
-
- ret = parse_options(data, &opts);
- if (ret)
- return ERR_PTR(ret);
-
- if (!opts.dev_name)
- opts.dev_name = dev_name;
- return mount_nodev(type, flags, &opts, exofs_fill_super);
-}
-
-/*
- * Return information about the file system state in the buffer. This is used
- * by the 'df' command, for example.
- */
-static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
-{
- struct super_block *sb = dentry->d_sb;
- struct exofs_sb_info *sbi = sb->s_fs_info;
- struct ore_io_state *ios;
- struct osd_attr attrs[] = {
- ATTR_DEF(OSD_APAGE_PARTITION_QUOTAS,
- OSD_ATTR_PQ_CAPACITY_QUOTA, sizeof(__be64)),
- ATTR_DEF(OSD_APAGE_PARTITION_INFORMATION,
- OSD_ATTR_PI_USED_CAPACITY, sizeof(__be64)),
- };
- uint64_t capacity = ULLONG_MAX;
- uint64_t used = ULLONG_MAX;
- int ret;
-
- ret = ore_get_io_state(&sbi->layout, &sbi->oc, &ios);
- if (ret) {
- EXOFS_DBGMSG("ore_get_io_state failed.\n");
- return ret;
- }
-
- ios->in_attr = attrs;
- ios->in_attr_len = ARRAY_SIZE(attrs);
-
- ret = ore_read(ios);
- if (unlikely(ret))
- goto out;
-
- ret = extract_attr_from_ios(ios, &attrs[0]);
- if (likely(!ret)) {
- capacity = get_unaligned_be64(attrs[0].val_ptr);
- if (unlikely(!capacity))
- capacity = ULLONG_MAX;
- } else
- EXOFS_DBGMSG("exofs_statfs: get capacity failed.\n");
-
- ret = extract_attr_from_ios(ios, &attrs[1]);
- if (likely(!ret))
- used = get_unaligned_be64(attrs[1].val_ptr);
- else
- EXOFS_DBGMSG("exofs_statfs: get used-space failed.\n");
-
- /* fill in the stats buffer */
- buf->f_type = EXOFS_SUPER_MAGIC;
- buf->f_bsize = EXOFS_BLKSIZE;
- buf->f_blocks = capacity >> 9;
- buf->f_bfree = (capacity - used) >> 9;
- buf->f_bavail = buf->f_bfree;
- buf->f_files = sbi->s_numfiles;
- buf->f_ffree = EXOFS_MAX_ID - sbi->s_numfiles;
- buf->f_namelen = EXOFS_NAME_LEN;
-
-out:
- ore_put_io_state(ios);
- return ret;
-}
-
-static const struct super_operations exofs_sops = {
- .alloc_inode = exofs_alloc_inode,
- .destroy_inode = exofs_destroy_inode,
- .write_inode = exofs_write_inode,
- .evict_inode = exofs_evict_inode,
- .put_super = exofs_put_super,
- .sync_fs = exofs_sync_fs,
- .statfs = exofs_statfs,
-};
-
-/******************************************************************************
- * EXPORT OPERATIONS
- *****************************************************************************/
-
-static struct dentry *exofs_get_parent(struct dentry *child)
-{
- unsigned long ino = exofs_parent_ino(child);
-
- if (!ino)
- return ERR_PTR(-ESTALE);
-
- return d_obtain_alias(exofs_iget(child->d_sb, ino));
-}
-
-static struct inode *exofs_nfs_get_inode(struct super_block *sb,
- u64 ino, u32 generation)
-{
- struct inode *inode;
-
- inode = exofs_iget(sb, ino);
- if (IS_ERR(inode))
- return ERR_CAST(inode);
- if (generation && inode->i_generation != generation) {
- /* we didn't find the right inode.. */
- iput(inode);
- return ERR_PTR(-ESTALE);
- }
- return inode;
-}
-
-static struct dentry *exofs_fh_to_dentry(struct super_block *sb,
- struct fid *fid, int fh_len, int fh_type)
-{
- return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
- exofs_nfs_get_inode);
-}
-
-static struct dentry *exofs_fh_to_parent(struct super_block *sb,
- struct fid *fid, int fh_len, int fh_type)
-{
- return generic_fh_to_parent(sb, fid, fh_len, fh_type,
- exofs_nfs_get_inode);
-}
-
-static const struct export_operations exofs_export_ops = {
- .fh_to_dentry = exofs_fh_to_dentry,
- .fh_to_parent = exofs_fh_to_parent,
- .get_parent = exofs_get_parent,
-};
-
-/******************************************************************************
- * INSMOD/RMMOD
- *****************************************************************************/
-
-/*
- * struct that describes this file system
- */
-static struct file_system_type exofs_type = {
- .owner = THIS_MODULE,
- .name = "exofs",
- .mount = exofs_mount,
- .kill_sb = generic_shutdown_super,
-};
-MODULE_ALIAS_FS("exofs");
-
-static int __init init_exofs(void)
-{
- int err;
-
- err = init_inodecache();
- if (err)
- goto out;
-
- err = register_filesystem(&exofs_type);
- if (err)
- goto out_d;
-
- /* We don't fail if sysfs creation failed */
- exofs_sysfs_init();
-
- return 0;
-out_d:
- destroy_inodecache();
-out:
- return err;
-}
-
-static void __exit exit_exofs(void)
-{
- exofs_sysfs_uninit();
- unregister_filesystem(&exofs_type);
- destroy_inodecache();
-}
-
-MODULE_AUTHOR("Avishay Traeger <avishay@gmail.com>");
-MODULE_DESCRIPTION("exofs");
-MODULE_LICENSE("GPL");
-
-module_init(init_exofs)
-module_exit(exit_exofs)
diff --git a/fs/exofs/sys.c b/fs/exofs/sys.c
deleted file mode 100644
index 1f7d5e46cdda..000000000000
--- a/fs/exofs/sys.c
+++ /dev/null
@@ -1,205 +0,0 @@
-/*
- * Copyright (C) 2012
- * Sachin Bhamare <sbhamare@panasas.com>
- * Boaz Harrosh <ooo@electrozaur.com>
- *
- * This file is part of exofs.
- *
- * exofs is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License 2 as published by
- * the Free Software Foundation.
- *
- * exofs is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with exofs; if not, write to the:
- * Free Software Foundation <licensing@fsf.org>
- */
-
-#include <linux/kobject.h>
-#include <linux/device.h>
-
-#include "exofs.h"
-
-struct odev_attr {
- struct attribute attr;
- ssize_t (*show)(struct exofs_dev *, char *);
- ssize_t (*store)(struct exofs_dev *, const char *, size_t);
-};
-
-static ssize_t odev_attr_show(struct kobject *kobj, struct attribute *attr,
- char *buf)
-{
- struct exofs_dev *edp = container_of(kobj, struct exofs_dev, ed_kobj);
- struct odev_attr *a = container_of(attr, struct odev_attr, attr);
-
- return a->show ? a->show(edp, buf) : 0;
-}
-
-static ssize_t odev_attr_store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t len)
-{
- struct exofs_dev *edp = container_of(kobj, struct exofs_dev, ed_kobj);
- struct odev_attr *a = container_of(attr, struct odev_attr, attr);
-
- return a->store ? a->store(edp, buf, len) : len;
-}
-
-static const struct sysfs_ops odev_attr_ops = {
- .show = odev_attr_show,
- .store = odev_attr_store,
-};
-
-
-static struct kset *exofs_kset;
-
-static ssize_t osdname_show(struct exofs_dev *edp, char *buf)
-{
- struct osd_dev *odev = edp->ored.od;
- const struct osd_dev_info *odi = osduld_device_info(odev);
-
- return snprintf(buf, odi->osdname_len + 1, "%s", odi->osdname);
-}
-
-static ssize_t systemid_show(struct exofs_dev *edp, char *buf)
-{
- struct osd_dev *odev = edp->ored.od;
- const struct osd_dev_info *odi = osduld_device_info(odev);
-
- memcpy(buf, odi->systemid, odi->systemid_len);
- return odi->systemid_len;
-}
-
-static ssize_t uri_show(struct exofs_dev *edp, char *buf)
-{
- return snprintf(buf, edp->urilen, "%s", edp->uri);
-}
-
-static ssize_t uri_store(struct exofs_dev *edp, const char *buf, size_t len)
-{
- uint8_t *new_uri;
-
- edp->urilen = strlen(buf) + 1;
- new_uri = krealloc(edp->uri, edp->urilen, GFP_KERNEL);
- if (new_uri == NULL)
- return -ENOMEM;
- edp->uri = new_uri;
- strncpy(edp->uri, buf, edp->urilen);
- return edp->urilen;
-}
-
-#define OSD_ATTR(name, mode, show, store) \
- static struct odev_attr odev_attr_##name = \
- __ATTR(name, mode, show, store)
-
-OSD_ATTR(osdname, S_IRUGO, osdname_show, NULL);
-OSD_ATTR(systemid, S_IRUGO, systemid_show, NULL);
-OSD_ATTR(uri, S_IRWXU, uri_show, uri_store);
-
-static struct attribute *odev_attrs[] = {
- &odev_attr_osdname.attr,
- &odev_attr_systemid.attr,
- &odev_attr_uri.attr,
- NULL,
-};
-
-static struct kobj_type odev_ktype = {
- .default_attrs = odev_attrs,
- .sysfs_ops = &odev_attr_ops,
-};
-
-static struct kobj_type uuid_ktype = {
-};
-
-void exofs_sysfs_dbg_print(void)
-{
-#ifdef CONFIG_EXOFS_DEBUG
- struct kobject *k_name, *k_tmp;
-
- list_for_each_entry_safe(k_name, k_tmp, &exofs_kset->list, entry) {
- printk(KERN_INFO "%s: name %s ref %d\n",
- __func__, kobject_name(k_name),
- (int)kref_read(&k_name->kref));
- }
-#endif
-}
-/*
- * This function removes all kobjects under exofs_kset
- * At the end of it, exofs_kset kobject will have a refcount
- * of 1 which gets decremented only on exofs module unload
- */
-void exofs_sysfs_sb_del(struct exofs_sb_info *sbi)
-{
- struct kobject *k_name, *k_tmp;
- struct kobject *s_kobj = &sbi->s_kobj;
-
- list_for_each_entry_safe(k_name, k_tmp, &exofs_kset->list, entry) {
- /* Remove all that are children of this SBI */
- if (k_name->parent == s_kobj)
- kobject_put(k_name);
- }
- kobject_put(s_kobj);
-}
-
-/*
- * This function creates sysfs entries to hold the current exofs cluster
- * instance (uniquely identified by osdname,pid tuple).
- * This function gets called once per exofs mount instance.
- */
-int exofs_sysfs_sb_add(struct exofs_sb_info *sbi,
- struct exofs_dt_device_info *dt_dev)
-{
- struct kobject *s_kobj;
- int retval = 0;
- uint64_t pid = sbi->one_comp.obj.partition;
-
- /* allocate new uuid dirent */
- s_kobj = &sbi->s_kobj;
- s_kobj->kset = exofs_kset;
- retval = kobject_init_and_add(s_kobj, &uuid_ktype,
- &exofs_kset->kobj, "%s_%llx", dt_dev->osdname, pid);
- if (retval) {
- EXOFS_ERR("ERROR: Failed to create sysfs entry for "
- "uuid-%s_%llx => %d\n", dt_dev->osdname, pid, retval);
- return -ENOMEM;
- }
- return 0;
-}
-
-int exofs_sysfs_odev_add(struct exofs_dev *edev, struct exofs_sb_info *sbi)
-{
- struct kobject *d_kobj;
- int retval = 0;
-
- /* create osd device group which contains following attributes
- * osdname, systemid & uri
- */
- d_kobj = &edev->ed_kobj;
- d_kobj->kset = exofs_kset;
- retval = kobject_init_and_add(d_kobj, &odev_ktype,
- &sbi->s_kobj, "dev%u", edev->did);
- if (retval) {
- EXOFS_ERR("ERROR: Failed to create sysfs entry for "
- "device dev%u\n", edev->did);
- return retval;
- }
- return 0;
-}
-
-int exofs_sysfs_init(void)
-{
- exofs_kset = kset_create_and_add("exofs", NULL, fs_kobj);
- if (!exofs_kset) {
- EXOFS_ERR("ERROR: kset_create_and_add exofs failed\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void exofs_sysfs_uninit(void)
-{
- kset_unregister(exofs_kset);
-}
--
2.11.0
^ permalink raw reply related
* [PATCH 1/4] block: remove the osdblk driver
From: Christoph Hellwig @ 2017-04-12 16:01 UTC (permalink / raw)
To: ooo, martin.petersen, trond.myklebust, axboe
Cc: osd-dev, linux-nfs, linux-scsi, linux-block, linux-kernel
In-Reply-To: <20170412160109.10598-1-hch@lst.de>
This was just a proof of concept user for the SCSI OSD library, and
never had any real users.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/block/Kconfig | 16 --
drivers/block/Makefile | 1 -
drivers/block/osdblk.c | 693 -------------------------------------------------
3 files changed, 710 deletions(-)
delete mode 100644 drivers/block/osdblk.c
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index a1c2e816128f..58e24c354933 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -312,22 +312,6 @@ config BLK_DEV_SKD
Use device /dev/skd$N amd /dev/skd$Np$M.
-config BLK_DEV_OSD
- tristate "OSD object-as-blkdev support"
- depends on SCSI_OSD_ULD
- ---help---
- Saying Y or M here will allow the exporting of a single SCSI
- OSD (object-based storage) object as a Linux block device.
-
- For example, if you create a 2G object on an OSD device,
- you can then use this module to present that 2G object as
- a Linux block device.
-
- To compile this driver as a module, choose M here: the
- module will be called osdblk.
-
- If unsure, say N.
-
config BLK_DEV_SX8
tristate "Promise SATA SX8 support"
depends on PCI
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index b12c772bbeb3..42e8cee1cbc7 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -22,7 +22,6 @@ obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o
obj-$(CONFIG_MG_DISK) += mg_disk.o
obj-$(CONFIG_SUNVDC) += sunvdc.o
obj-$(CONFIG_BLK_DEV_SKD) += skd.o
-obj-$(CONFIG_BLK_DEV_OSD) += osdblk.o
obj-$(CONFIG_BLK_DEV_UMEM) += umem.o
obj-$(CONFIG_BLK_DEV_NBD) += nbd.o
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
deleted file mode 100644
index 8127b8201a01..000000000000
--- a/drivers/block/osdblk.c
+++ /dev/null
@@ -1,693 +0,0 @@
-
-/*
- osdblk.c -- Export a single SCSI OSD object as a Linux block device
-
-
- Copyright 2009 Red Hat, Inc.
-
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; see the file COPYING. If not, write to
- the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
-
-
- Instructions for use
- --------------------
-
- 1) Map a Linux block device to an existing OSD object.
-
- In this example, we will use partition id 1234, object id 5678,
- OSD device /dev/osd1.
-
- $ echo "1234 5678 /dev/osd1" > /sys/class/osdblk/add
-
-
- 2) List all active blkdev<->object mappings.
-
- In this example, we have performed step #1 twice, creating two blkdevs,
- mapped to two separate OSD objects.
-
- $ cat /sys/class/osdblk/list
- 0 174 1234 5678 /dev/osd1
- 1 179 1994 897123 /dev/osd0
-
- The columns, in order, are:
- - blkdev unique id
- - blkdev assigned major
- - OSD object partition id
- - OSD object id
- - OSD device
-
-
- 3) Remove an active blkdev<->object mapping.
-
- In this example, we remove the mapping with blkdev unique id 1.
-
- $ echo 1 > /sys/class/osdblk/remove
-
-
- NOTE: The actual creation and deletion of OSD objects is outside the scope
- of this driver.
-
- */
-
-#include <linux/kernel.h>
-#include <linux/device.h>
-#include <linux/module.h>
-#include <linux/fs.h>
-#include <linux/slab.h>
-#include <scsi/osd_initiator.h>
-#include <scsi/osd_attributes.h>
-#include <scsi/osd_sec.h>
-#include <scsi/scsi_device.h>
-
-#define DRV_NAME "osdblk"
-#define PFX DRV_NAME ": "
-
-/* #define _OSDBLK_DEBUG */
-#ifdef _OSDBLK_DEBUG
-#define OSDBLK_DEBUG(fmt, a...) \
- printk(KERN_NOTICE "osdblk @%s:%d: " fmt, __func__, __LINE__, ##a)
-#else
-#define OSDBLK_DEBUG(fmt, a...) \
- do { if (0) printk(fmt, ##a); } while (0)
-#endif
-
-MODULE_AUTHOR("Jeff Garzik <jeff@garzik.org>");
-MODULE_DESCRIPTION("block device inside an OSD object osdblk.ko");
-MODULE_LICENSE("GPL");
-
-struct osdblk_device;
-
-enum {
- OSDBLK_MINORS_PER_MAJOR = 256, /* max minors per blkdev */
- OSDBLK_MAX_REQ = 32, /* max parallel requests */
- OSDBLK_OP_TIMEOUT = 4 * 60, /* sync OSD req timeout */
-};
-
-struct osdblk_request {
- struct request *rq; /* blk layer request */
- struct bio *bio; /* cloned bio */
- struct osdblk_device *osdev; /* associated blkdev */
-};
-
-struct osdblk_device {
- int id; /* blkdev unique id */
-
- int major; /* blkdev assigned major */
- struct gendisk *disk; /* blkdev's gendisk and rq */
- struct request_queue *q;
-
- struct osd_dev *osd; /* associated OSD */
-
- char name[32]; /* blkdev name, e.g. osdblk34 */
-
- spinlock_t lock; /* queue lock */
-
- struct osd_obj_id obj; /* OSD partition, obj id */
- uint8_t obj_cred[OSD_CAP_LEN]; /* OSD cred */
-
- struct osdblk_request req[OSDBLK_MAX_REQ]; /* request table */
-
- struct list_head node;
-
- char osd_path[0]; /* OSD device path */
-};
-
-static struct class *class_osdblk; /* /sys/class/osdblk */
-static DEFINE_MUTEX(ctl_mutex); /* Serialize open/close/setup/teardown */
-static LIST_HEAD(osdblkdev_list);
-
-static const struct block_device_operations osdblk_bd_ops = {
- .owner = THIS_MODULE,
-};
-
-static const struct osd_attr g_attr_logical_length = ATTR_DEF(
- OSD_APAGE_OBJECT_INFORMATION, OSD_ATTR_OI_LOGICAL_LENGTH, 8);
-
-static void osdblk_make_credential(u8 cred_a[OSD_CAP_LEN],
- const struct osd_obj_id *obj)
-{
- osd_sec_init_nosec_doall_caps(cred_a, obj, false, true);
-}
-
-/* copied from exofs; move to libosd? */
-/*
- * Perform a synchronous OSD operation. copied from exofs; move to libosd?
- */
-static int osd_sync_op(struct osd_request *or, int timeout, uint8_t *credential)
-{
- int ret;
-
- or->timeout = timeout;
- ret = osd_finalize_request(or, 0, credential, NULL);
- if (ret)
- return ret;
-
- ret = osd_execute_request(or);
-
- /* osd_req_decode_sense(or, ret); */
- return ret;
-}
-
-/*
- * Perform an asynchronous OSD operation. copied from exofs; move to libosd?
- */
-static int osd_async_op(struct osd_request *or, osd_req_done_fn *async_done,
- void *caller_context, u8 *cred)
-{
- int ret;
-
- ret = osd_finalize_request(or, 0, cred, NULL);
- if (ret)
- return ret;
-
- ret = osd_execute_request_async(or, async_done, caller_context);
-
- return ret;
-}
-
-/* copied from exofs; move to libosd? */
-static int extract_attr_from_req(struct osd_request *or, struct osd_attr *attr)
-{
- struct osd_attr cur_attr = {.attr_page = 0}; /* start with zeros */
- void *iter = NULL;
- int nelem;
-
- do {
- nelem = 1;
- osd_req_decode_get_attr_list(or, &cur_attr, &nelem, &iter);
- if ((cur_attr.attr_page == attr->attr_page) &&
- (cur_attr.attr_id == attr->attr_id)) {
- attr->len = cur_attr.len;
- attr->val_ptr = cur_attr.val_ptr;
- return 0;
- }
- } while (iter);
-
- return -EIO;
-}
-
-static int osdblk_get_obj_size(struct osdblk_device *osdev, u64 *size_out)
-{
- struct osd_request *or;
- struct osd_attr attr;
- int ret;
-
- /* start request */
- or = osd_start_request(osdev->osd, GFP_KERNEL);
- if (!or)
- return -ENOMEM;
-
- /* create a get-attributes(length) request */
- osd_req_get_attributes(or, &osdev->obj);
-
- osd_req_add_get_attr_list(or, &g_attr_logical_length, 1);
-
- /* execute op synchronously */
- ret = osd_sync_op(or, OSDBLK_OP_TIMEOUT, osdev->obj_cred);
- if (ret)
- goto out;
-
- /* extract length from returned attribute info */
- attr = g_attr_logical_length;
- ret = extract_attr_from_req(or, &attr);
- if (ret)
- goto out;
-
- *size_out = get_unaligned_be64(attr.val_ptr);
-
-out:
- osd_end_request(or);
- return ret;
-
-}
-
-static void osdblk_osd_complete(struct osd_request *or, void *private)
-{
- struct osdblk_request *orq = private;
- struct osd_sense_info osi;
- int ret = osd_req_decode_sense(or, &osi);
-
- if (ret) {
- ret = -EIO;
- OSDBLK_DEBUG("osdblk_osd_complete with err=%d\n", ret);
- }
-
- /* complete OSD request */
- osd_end_request(or);
-
- /* complete request passed to osdblk by block layer */
- __blk_end_request_all(orq->rq, ret);
-}
-
-static void bio_chain_put(struct bio *chain)
-{
- struct bio *tmp;
-
- while (chain) {
- tmp = chain;
- chain = chain->bi_next;
-
- bio_put(tmp);
- }
-}
-
-static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
-{
- struct bio *tmp, *new_chain = NULL, *tail = NULL;
-
- while (old_chain) {
- tmp = bio_clone_kmalloc(old_chain, gfpmask);
- if (!tmp)
- goto err_out;
-
- tmp->bi_bdev = NULL;
- gfpmask &= ~__GFP_DIRECT_RECLAIM;
- tmp->bi_next = NULL;
-
- if (!new_chain)
- new_chain = tail = tmp;
- else {
- tail->bi_next = tmp;
- tail = tmp;
- }
-
- old_chain = old_chain->bi_next;
- }
-
- return new_chain;
-
-err_out:
- OSDBLK_DEBUG("bio_chain_clone with err\n");
- bio_chain_put(new_chain);
- return NULL;
-}
-
-static void osdblk_rq_fn(struct request_queue *q)
-{
- struct osdblk_device *osdev = q->queuedata;
-
- while (1) {
- struct request *rq;
- struct osdblk_request *orq;
- struct osd_request *or;
- struct bio *bio;
- bool do_write, do_flush;
-
- /* peek at request from block layer */
- rq = blk_fetch_request(q);
- if (!rq)
- break;
-
- /* deduce our operation (read, write, flush) */
- /* I wish the block layer simplified cmd_type/cmd_flags/cmd[]
- * into a clearly defined set of RPC commands:
- * read, write, flush, scsi command, power mgmt req,
- * driver-specific, etc.
- */
-
- do_flush = (req_op(rq) == REQ_OP_FLUSH);
- do_write = (rq_data_dir(rq) == WRITE);
-
- if (!do_flush) { /* osd_flush does not use a bio */
- /* a bio clone to be passed down to OSD request */
- bio = bio_chain_clone(rq->bio, GFP_ATOMIC);
- if (!bio)
- break;
- } else
- bio = NULL;
-
- /* alloc internal OSD request, for OSD command execution */
- or = osd_start_request(osdev->osd, GFP_ATOMIC);
- if (!or) {
- bio_chain_put(bio);
- OSDBLK_DEBUG("osd_start_request with err\n");
- break;
- }
-
- orq = &osdev->req[rq->tag];
- orq->rq = rq;
- orq->bio = bio;
- orq->osdev = osdev;
-
- /* init OSD command: flush, write or read */
- if (do_flush)
- osd_req_flush_object(or, &osdev->obj,
- OSD_CDB_FLUSH_ALL, 0, 0);
- else if (do_write)
- osd_req_write(or, &osdev->obj, blk_rq_pos(rq) * 512ULL,
- bio, blk_rq_bytes(rq));
- else
- osd_req_read(or, &osdev->obj, blk_rq_pos(rq) * 512ULL,
- bio, blk_rq_bytes(rq));
-
- OSDBLK_DEBUG("%s 0x%x bytes at 0x%llx\n",
- do_flush ? "flush" : do_write ?
- "write" : "read", blk_rq_bytes(rq),
- blk_rq_pos(rq) * 512ULL);
-
- /* begin OSD command execution */
- if (osd_async_op(or, osdblk_osd_complete, orq,
- osdev->obj_cred)) {
- osd_end_request(or);
- blk_requeue_request(q, rq);
- bio_chain_put(bio);
- OSDBLK_DEBUG("osd_execute_request_async with err\n");
- break;
- }
-
- /* remove the special 'flush' marker, now that the command
- * is executing
- */
- rq->special = NULL;
- }
-}
-
-static void osdblk_free_disk(struct osdblk_device *osdev)
-{
- struct gendisk *disk = osdev->disk;
-
- if (!disk)
- return;
-
- if (disk->flags & GENHD_FL_UP)
- del_gendisk(disk);
- if (disk->queue)
- blk_cleanup_queue(disk->queue);
- put_disk(disk);
-}
-
-static int osdblk_init_disk(struct osdblk_device *osdev)
-{
- struct gendisk *disk;
- struct request_queue *q;
- int rc;
- u64 obj_size = 0;
-
- /* contact OSD, request size info about the object being mapped */
- rc = osdblk_get_obj_size(osdev, &obj_size);
- if (rc)
- return rc;
-
- /* create gendisk info */
- disk = alloc_disk(OSDBLK_MINORS_PER_MAJOR);
- if (!disk)
- return -ENOMEM;
-
- sprintf(disk->disk_name, DRV_NAME "%d", osdev->id);
- disk->major = osdev->major;
- disk->first_minor = 0;
- disk->fops = &osdblk_bd_ops;
- disk->private_data = osdev;
-
- /* init rq */
- q = blk_init_queue(osdblk_rq_fn, &osdev->lock);
- if (!q) {
- put_disk(disk);
- return -ENOMEM;
- }
-
- /* switch queue to TCQ mode; allocate tag map */
- rc = blk_queue_init_tags(q, OSDBLK_MAX_REQ, NULL, BLK_TAG_ALLOC_FIFO);
- if (rc) {
- blk_cleanup_queue(q);
- put_disk(disk);
- return rc;
- }
-
- /* Set our limits to the lower device limits, because osdblk cannot
- * sleep when allocating a lower-request and therefore cannot be
- * bouncing.
- */
- blk_queue_stack_limits(q, osd_request_queue(osdev->osd));
-
- blk_queue_prep_rq(q, blk_queue_start_tag);
- blk_queue_write_cache(q, true, false);
-
- disk->queue = q;
-
- q->queuedata = osdev;
-
- osdev->disk = disk;
- osdev->q = q;
-
- /* finally, announce the disk to the world */
- set_capacity(disk, obj_size / 512ULL);
- add_disk(disk);
-
- printk(KERN_INFO "%s: Added of size 0x%llx\n",
- disk->disk_name, (unsigned long long)obj_size);
-
- return 0;
-}
-
-/********************************************************************
- * /sys/class/osdblk/
- * add map OSD object to blkdev
- * remove unmap OSD object
- * list show mappings
- *******************************************************************/
-
-static void class_osdblk_release(struct class *cls)
-{
- kfree(cls);
-}
-
-static ssize_t class_osdblk_list(struct class *c,
- struct class_attribute *attr,
- char *data)
-{
- int n = 0;
- struct list_head *tmp;
-
- mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
-
- list_for_each(tmp, &osdblkdev_list) {
- struct osdblk_device *osdev;
-
- osdev = list_entry(tmp, struct osdblk_device, node);
-
- n += sprintf(data+n, "%d %d %llu %llu %s\n",
- osdev->id,
- osdev->major,
- osdev->obj.partition,
- osdev->obj.id,
- osdev->osd_path);
- }
-
- mutex_unlock(&ctl_mutex);
- return n;
-}
-
-static ssize_t class_osdblk_add(struct class *c,
- struct class_attribute *attr,
- const char *buf, size_t count)
-{
- struct osdblk_device *osdev;
- ssize_t rc;
- int irc, new_id = 0;
- struct list_head *tmp;
-
- if (!try_module_get(THIS_MODULE))
- return -ENODEV;
-
- /* new osdblk_device object */
- osdev = kzalloc(sizeof(*osdev) + strlen(buf) + 1, GFP_KERNEL);
- if (!osdev) {
- rc = -ENOMEM;
- goto err_out_mod;
- }
-
- /* static osdblk_device initialization */
- spin_lock_init(&osdev->lock);
- INIT_LIST_HEAD(&osdev->node);
-
- /* generate unique id: find highest unique id, add one */
-
- mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
-
- list_for_each(tmp, &osdblkdev_list) {
- struct osdblk_device *osdev;
-
- osdev = list_entry(tmp, struct osdblk_device, node);
- if (osdev->id > new_id)
- new_id = osdev->id + 1;
- }
-
- osdev->id = new_id;
-
- /* add to global list */
- list_add_tail(&osdev->node, &osdblkdev_list);
-
- mutex_unlock(&ctl_mutex);
-
- /* parse add command */
- if (sscanf(buf, "%llu %llu %s", &osdev->obj.partition, &osdev->obj.id,
- osdev->osd_path) != 3) {
- rc = -EINVAL;
- goto err_out_slot;
- }
-
- /* initialize rest of new object */
- sprintf(osdev->name, DRV_NAME "%d", osdev->id);
-
- /* contact requested OSD */
- osdev->osd = osduld_path_lookup(osdev->osd_path);
- if (IS_ERR(osdev->osd)) {
- rc = PTR_ERR(osdev->osd);
- goto err_out_slot;
- }
-
- /* build OSD credential */
- osdblk_make_credential(osdev->obj_cred, &osdev->obj);
-
- /* register our block device */
- irc = register_blkdev(0, osdev->name);
- if (irc < 0) {
- rc = irc;
- goto err_out_osd;
- }
-
- osdev->major = irc;
-
- /* set up and announce blkdev mapping */
- rc = osdblk_init_disk(osdev);
- if (rc)
- goto err_out_blkdev;
-
- return count;
-
-err_out_blkdev:
- unregister_blkdev(osdev->major, osdev->name);
-err_out_osd:
- osduld_put_device(osdev->osd);
-err_out_slot:
- mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
- list_del_init(&osdev->node);
- mutex_unlock(&ctl_mutex);
-
- kfree(osdev);
-err_out_mod:
- OSDBLK_DEBUG("Error adding device %s\n", buf);
- module_put(THIS_MODULE);
- return rc;
-}
-
-static ssize_t class_osdblk_remove(struct class *c,
- struct class_attribute *attr,
- const char *buf,
- size_t count)
-{
- struct osdblk_device *osdev = NULL;
- int target_id, rc;
- unsigned long ul;
- struct list_head *tmp;
-
- rc = kstrtoul(buf, 10, &ul);
- if (rc)
- return rc;
-
- /* convert to int; abort if we lost anything in the conversion */
- target_id = (int) ul;
- if (target_id != ul)
- return -EINVAL;
-
- /* remove object from list immediately */
- mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
-
- list_for_each(tmp, &osdblkdev_list) {
- osdev = list_entry(tmp, struct osdblk_device, node);
- if (osdev->id == target_id) {
- list_del_init(&osdev->node);
- break;
- }
- osdev = NULL;
- }
-
- mutex_unlock(&ctl_mutex);
-
- if (!osdev)
- return -ENOENT;
-
- /* clean up and free blkdev and associated OSD connection */
- osdblk_free_disk(osdev);
- unregister_blkdev(osdev->major, osdev->name);
- osduld_put_device(osdev->osd);
- kfree(osdev);
-
- /* release module ref */
- module_put(THIS_MODULE);
-
- return count;
-}
-
-static struct class_attribute class_osdblk_attrs[] = {
- __ATTR(add, 0200, NULL, class_osdblk_add),
- __ATTR(remove, 0200, NULL, class_osdblk_remove),
- __ATTR(list, 0444, class_osdblk_list, NULL),
- __ATTR_NULL
-};
-
-static int osdblk_sysfs_init(void)
-{
- int ret = 0;
-
- /*
- * create control files in sysfs
- * /sys/class/osdblk/...
- */
- class_osdblk = kzalloc(sizeof(*class_osdblk), GFP_KERNEL);
- if (!class_osdblk)
- return -ENOMEM;
-
- class_osdblk->name = DRV_NAME;
- class_osdblk->owner = THIS_MODULE;
- class_osdblk->class_release = class_osdblk_release;
- class_osdblk->class_attrs = class_osdblk_attrs;
-
- ret = class_register(class_osdblk);
- if (ret) {
- kfree(class_osdblk);
- class_osdblk = NULL;
- printk(PFX "failed to create class osdblk\n");
- return ret;
- }
-
- return 0;
-}
-
-static void osdblk_sysfs_cleanup(void)
-{
- if (class_osdblk)
- class_destroy(class_osdblk);
- class_osdblk = NULL;
-}
-
-static int __init osdblk_init(void)
-{
- int rc;
-
- rc = osdblk_sysfs_init();
- if (rc)
- return rc;
-
- return 0;
-}
-
-static void __exit osdblk_exit(void)
-{
- osdblk_sysfs_cleanup();
-}
-
-module_init(osdblk_init);
-module_exit(osdblk_exit);
-
--
2.11.0
^ permalink raw reply related
* RFC: drop the T10 OSD code and its users
From: Christoph Hellwig @ 2017-04-12 16:01 UTC (permalink / raw)
To: ooo, martin.petersen, trond.myklebust, axboe
Cc: osd-dev, linux-nfs, linux-scsi, linux-block, linux-kernel
The only real user of the T10 OSD protocol, the pNFS object layout
driver never went to the point of having shipping products, and the
other two users (osdblk and exofs) were simple example of it's usage.
The code has been mostly unmaintained for years and is getting in the
way of block / SCSI changes, so I think it's finally time to drop it.
These patches are against Jens' block for-next tree as that already
has various modifications of the SCSI code.
^ permalink raw reply
* Re: [PATCH V3 00/16] Introduce the BFQ I/O scheduler
From: Bart Van Assche @ 2017-04-12 15:30 UTC (permalink / raw)
To: paolo.valente@linaro.org
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
avanzini.arianna@gmail.com, broonie@kernel.org, tj@kernel.org,
ulf.hansson@linaro.org
In-Reply-To: <17416A05-8401-4B03-B46C-78A8976C5B8E@linaro.org>
On Wed, 2017-04-12 at 08:01 +0200, Paolo Valente wrote:
> Where is my mistake?
I think in the Makefile. How about the patch below? Please note that I'm no
Kbuild expert.
diff --git a/block/Makefile b/block/Makefile
index 546066ee7fa6..b3711af6b637 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,7 +20,8 @@ obj-$(CONFIG_IOSCHED_NOOP) =A0=A0=A0+=3D noop-iosched.o
=A0obj-$(CONFIG_IOSCHED_DEADLINE) +=3D deadline-iosched.o
=A0obj-$(CONFIG_IOSCHED_CFQ) =A0=A0=A0=A0=A0+=3D cfq-iosched.o
=A0obj-$(CONFIG_MQ_IOSCHED_DEADLINE) =A0=A0=A0=A0=A0+=3D mq-deadline.o
-obj-$(CONFIG_IOSCHED_BFQ) =A0=A0=A0=A0=A0+=3D bfq-iosched.o bfq-wf2q.o bfq=
-cgroup.o
+bfq-y =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=
=A0=A0=A0:=3D bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
+obj-$(CONFIG_IOSCHED_BFQ) =A0=A0=A0=A0=A0+=3D bfq.o
=A0=A0=A0obj-$(CONFIG_BLOCK_COMPAT) =A0=A0=A0=A0+=3D compat_ioctl.o
=A0obj-$(CONFIG_BLK_CMDLINE_PARSER) =A0=A0=A0=A0=A0=A0+=3D cmdline-parser.o
^ permalink raw reply related
* Re: [PATCH 19/25] nilfs2: Convert to properly refcounting bdi
From: Ryusuke Konishi @ 2017-04-12 13:47 UTC (permalink / raw)
To: Jan Kara, Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-nilfs
In-Reply-To: <20170412102449.16901-20-jack@suse.cz>
On 2017/04/12 19:24, Jan Kara wrote:
> Similarly to set_bdev_super() NILFS2 just used block device reference to
> bdi. Convert it to properly getting bdi reference. The reference will
> get automatically dropped on superblock destruction.
>
> CC: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
> CC: linux-nilfs@vger.kernel.org
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jan Kara <jack@suse.cz>
Looks fine, thanks.
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
> ---
> fs/nilfs2/super.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
> index e1872f36147f..feb796a38b8d 100644
> --- a/fs/nilfs2/super.c
> +++ b/fs/nilfs2/super.c
> @@ -1068,7 +1068,8 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
> sb->s_time_gran = 1;
> sb->s_max_links = NILFS_LINK_MAX;
>
> - sb->s_bdi = bdev_get_queue(sb->s_bdev)->backing_dev_info;
> + sb->s_bdi = bdi_get(sb->s_bdev->bd_bdi);
> + sb->s_iflags |= SB_I_DYNBDI;
>
> err = load_nilfs(nilfs, sb);
> if (err)
>
^ permalink raw reply
* [PATCH v6] lightnvm: pblk
From: Javier González @ 2017-04-12 13:19 UTC (permalink / raw)
To: mb; +Cc: linux-block, linux-kernel, Bart.VanAssche, Javier González
Hi Matias,
A last spin to fix a regression that I introduced yesterday on v4. This
should be the one.
Thanks,
Javier
Changes since v5:
* Fix regression on the erase scheduler introduced on v4
Changes since v4:
* Rebase on top of Matias' for-4.12/core
* Fix type implicit conversions reported by sparse (reported by Bart Van
Assche)
* Make error and debug statistics long atomic variables.
Changes since v3:
* Apply Bart's feedback [1]
* Implement dynamic L2P optimizations for > 32-bit physical media
geometry (from Matias Bjørling)
* Fix memory leak on GC (Reported by Simon A. F. Lund)
* 8064 is a perfectly round number of lines :)
[1] https://lkml.org/lkml/2017/4/8/172
Changes since v2:
* Rebase on top of Matias' for-4.12/core
* Implement L2P scan recovery to recover L2P table in case of power
failure.
* Re-design disk format to be more flexible in future versions (from
Matias Bjørling)
* Implement per-instance uuid to allow correct recovery without
forcing line erases (from Matias Bjørling)
* Re-design GC threading to have several GC readers and a single
writer that places data on the write buffer. This allows to maximize
the GC write buffer budget without having unnecessary GC writers
competing for the write buffer lock.
* Simplify sysfs interface.
* Refactoring and several code improvements (together with Matias
Bjørling)
Changes since v1:
* Rebase on top of Matias' for-4.12/core
* Move from per-LUN block allocation to a line model. This means that a
whole lines across all LUNs is allocated at a time. Data is still
stripped in a round-robin fashion at a page granurality.
* Implement new disk format scheme, where metadata is stored per line
instead of per LUN. This allows for space optimizations.
* Improvements on GC workqueue management and victim selection.
* Implement sysfs interface to query pblk's operation and statistics.
* Implement a user - GC I/O rate-limiter
* Various bug fixes
Javier González (1):
lightnvm: physical block device (pblk) target
Documentation/lightnvm/pblk.txt | 21 +
drivers/lightnvm/Kconfig | 9 +
drivers/lightnvm/Makefile | 5 +
drivers/lightnvm/pblk-cache.c | 114 +++
drivers/lightnvm/pblk-core.c | 1655 ++++++++++++++++++++++++++++++++++++++
drivers/lightnvm/pblk-gc.c | 555 +++++++++++++
drivers/lightnvm/pblk-init.c | 949 ++++++++++++++++++++++
drivers/lightnvm/pblk-map.c | 136 ++++
drivers/lightnvm/pblk-rb.c | 852 ++++++++++++++++++++
drivers/lightnvm/pblk-read.c | 529 ++++++++++++
drivers/lightnvm/pblk-recovery.c | 998 +++++++++++++++++++++++
drivers/lightnvm/pblk-rl.c | 182 +++++
drivers/lightnvm/pblk-sysfs.c | 507 ++++++++++++
drivers/lightnvm/pblk-write.c | 411 ++++++++++
drivers/lightnvm/pblk.h | 1121 ++++++++++++++++++++++++++
15 files changed, 8044 insertions(+)
create mode 100644 Documentation/lightnvm/pblk.txt
create mode 100644 drivers/lightnvm/pblk-cache.c
create mode 100644 drivers/lightnvm/pblk-core.c
create mode 100644 drivers/lightnvm/pblk-gc.c
create mode 100644 drivers/lightnvm/pblk-init.c
create mode 100644 drivers/lightnvm/pblk-map.c
create mode 100644 drivers/lightnvm/pblk-rb.c
create mode 100644 drivers/lightnvm/pblk-read.c
create mode 100644 drivers/lightnvm/pblk-recovery.c
create mode 100644 drivers/lightnvm/pblk-rl.c
create mode 100644 drivers/lightnvm/pblk-sysfs.c
create mode 100644 drivers/lightnvm/pblk-write.c
create mode 100644 drivers/lightnvm/pblk.h
--
2.7.4
^ permalink raw reply
* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
From: Benjamin Block @ 2017-04-12 10:55 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Jens Axboe, linux-block, linux-scsi
In-Reply-To: <20170407181654.27836-1-bart.vanassche@sandisk.com>
On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote:
> Hello Jens,
>
> The six patches in this patch series fix the queue lockup I reported
> recently on the linux-block mailing list. Please consider these patches
> for inclusion in the upstream kernel.
>
Hey Bart,
just out of curiosity. Is this maybe related to similar stuff happening
when CPUs are hot plugged - at least in that the stack gets stuck? Like
in this thread here:
https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html
Would be interesting, because we recently saw similar stuff happening.
Beste Gr��e / Best regards,
- Benjamin Block
--
Linux on z Systems Development / IBM Systems & Technology Group
IBM Deutschland Research & Development GmbH
Vorsitz. AufsR.: Martina Koederitz / Gesch�ftsf�hrung: Dirk Wittkopp
Sitz der Gesellschaft: B�blingen / Registergericht: AmtsG Stuttgart, HRB 243294
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox