From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Paolo Valente <paolo.valente@unimore.it>,
"David S. Miller" <davem@davemloft.net>
Subject: [ 38/79] pkt_sched: sch_qfq: remove a source of high packet delay/jitter
Date: Fri, 26 Jul 2013 13:47:28 -0700 [thread overview]
Message-ID: <20130726204725.907756224@linuxfoundation.org> (raw)
In-Reply-To: <20130726204721.849052763@linuxfoundation.org>
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Paolo Valente <paolo.valente@unimore.it>
[ Upstream commit 87f40dd6ce7042caca0b3b557e8923127f51f902 ]
QFQ+ inherits from QFQ a design choice that may cause a high packet
delay/jitter and a severe short-term unfairness. As QFQ, QFQ+ uses a
special quantity, the system virtual time, to track the service
provided by the ideal system it approximates. When a packet is
dequeued, this quantity must be incremented by the size of the packet,
divided by the sum of the weights of the aggregates waiting to be
served. Tracking this sum correctly is a non-trivial task, because, to
preserve tight service guarantees, the decrement of this sum must be
delayed in a special way [1]: this sum can be decremented only after
that its value would decrease also in the ideal system approximated by
QFQ+. For efficiency, QFQ+ keeps track only of the 'instantaneous'
weight sum, increased and decreased immediately as the weight of an
aggregate changes, and as an aggregate is created or destroyed (which,
in its turn, happens as a consequence of some class being
created/destroyed/changed). However, to avoid the problems caused to
service guarantees by these immediate decreases, QFQ+ increments the
system virtual time using the maximum value allowed for the weight
sum, 2^10, in place of the dynamic, instantaneous value. The
instantaneous value of the weight sum is used only to check whether a
request of weight increase or a class creation can be satisfied.
Unfortunately, the problems caused by this choice are worse than the
temporary degradation of the service guarantees that may occur, when a
class is changed or destroyed, if the instantaneous value of the
weight sum was used to update the system virtual time. In fact, the
fraction of the link bandwidth guaranteed by QFQ+ to each aggregate is
equal to the ratio between the weight of the aggregate and the sum of
the weights of the competing aggregates. The packet delay guaranteed
to the aggregate is instead inversely proportional to the guaranteed
bandwidth. By using the maximum possible value, and not the actual
value of the weight sum, QFQ+ provides each aggregate with the worst
possible service guarantees, and not with service guarantees related
to the actual set of competing aggregates. To see the consequences of
this fact, consider the following simple example.
Suppose that only the following aggregates are backlogged, i.e., that
only the classes in the following aggregates have packets to transmit:
one aggregate with weight 10, say A, and ten aggregates with weight 1,
say B1, B2, ..., B10. In particular, suppose that these aggregates are
always backlogged. Given the weight distribution, the smoothest and
fairest service order would be:
A B1 A B2 A B3 A B4 A B5 A B6 A B7 A B8 A B9 A B10 A B1 A B2 ...
QFQ+ would provide exactly this optimal service if it used the actual
value for the weight sum instead of the maximum possible value, i.e.,
11 instead of 2^10. In contrast, since QFQ+ uses the latter value, it
serves aggregates as follows (easy to prove and to reproduce
experimentally):
A B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 A A A A A A A A A A B1 B2 ... B10 A A ...
By replacing 10 with N in the above example, and by increasing N, one
can increase at will the maximum packet delay and the jitter
experienced by the classes in aggregate A.
This patch addresses this issue by just using the above
'instantaneous' value of the weight sum, instead of the maximum
possible value, when updating the system virtual time. After the
instantaneous weight sum is decreased, QFQ+ may deviate from the ideal
service for a time interval in the order of the time to serve one
maximum-size packet for each backlogged class. The worst-case extent
of the deviation exhibited by QFQ+ during this time interval [1] is
basically the same as of the deviation described above (but, without
this patch, QFQ+ suffers from such a deviation all the time). Finally,
this patch modifies the comment to the function qfq_slot_insert, to
make it coherent with the fact that the weight sum used by QFQ+ can
now be lower than the maximum possible value.
[1] P. Valente, "Extending WF2Q+ to support a dynamic traffic mix",
Proceedings of AAA-IDEA'05, June 2005.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
net/sched/sch_qfq.c | 85 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 56 insertions(+), 29 deletions(-)
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -113,7 +113,6 @@
#define FRAC_BITS 30 /* fixed point arithmetic */
#define ONE_FP (1UL << FRAC_BITS)
-#define IWSUM (ONE_FP/QFQ_MAX_WSUM)
#define QFQ_MTU_SHIFT 16 /* to support TSO/GSO */
#define QFQ_MIN_LMAX 512 /* see qfq_slot_insert */
@@ -189,6 +188,7 @@ struct qfq_sched {
struct qfq_aggregate *in_serv_agg; /* Aggregate being served. */
u32 num_active_agg; /* Num. of active aggregates */
u32 wsum; /* weight sum */
+ u32 iwsum; /* inverse weight sum */
unsigned long bitmaps[QFQ_MAX_STATE]; /* Group bitmaps. */
struct qfq_group groups[QFQ_MAX_INDEX + 1]; /* The groups. */
@@ -314,6 +314,7 @@ static void qfq_update_agg(struct qfq_sc
q->wsum +=
(int) agg->class_weight * (new_num_classes - agg->num_classes);
+ q->iwsum = ONE_FP / q->wsum;
agg->num_classes = new_num_classes;
}
@@ -340,6 +341,10 @@ static void qfq_destroy_agg(struct qfq_s
{
if (!hlist_unhashed(&agg->nonfull_next))
hlist_del_init(&agg->nonfull_next);
+ q->wsum -= agg->class_weight;
+ if (q->wsum != 0)
+ q->iwsum = ONE_FP / q->wsum;
+
if (q->in_serv_agg == agg)
q->in_serv_agg = qfq_choose_next_agg(q);
kfree(agg);
@@ -827,38 +832,60 @@ static void qfq_make_eligible(struct qfq
}
}
-
/*
- * The index of the slot in which the aggregate is to be inserted must
- * not be higher than QFQ_MAX_SLOTS-2. There is a '-2' and not a '-1'
- * because the start time of the group may be moved backward by one
- * slot after the aggregate has been inserted, and this would cause
- * non-empty slots to be right-shifted by one position.
+ * The index of the slot in which the input aggregate agg is to be
+ * inserted must not be higher than QFQ_MAX_SLOTS-2. There is a '-2'
+ * and not a '-1' because the start time of the group may be moved
+ * backward by one slot after the aggregate has been inserted, and
+ * this would cause non-empty slots to be right-shifted by one
+ * position.
+ *
+ * QFQ+ fully satisfies this bound to the slot index if the parameters
+ * of the classes are not changed dynamically, and if QFQ+ never
+ * happens to postpone the service of agg unjustly, i.e., it never
+ * happens that the aggregate becomes backlogged and eligible, or just
+ * eligible, while an aggregate with a higher approximated finish time
+ * is being served. In particular, in this case QFQ+ guarantees that
+ * the timestamps of agg are low enough that the slot index is never
+ * higher than 2. Unfortunately, QFQ+ cannot provide the same
+ * guarantee if it happens to unjustly postpone the service of agg, or
+ * if the parameters of some class are changed.
+ *
+ * As for the first event, i.e., an out-of-order service, the
+ * upper bound to the slot index guaranteed by QFQ+ grows to
+ * 2 +
+ * QFQ_MAX_AGG_CLASSES * ((1<<QFQ_MTU_SHIFT)/QFQ_MIN_LMAX) *
+ * (current_max_weight/current_wsum) <= 2 + 8 * 128 * 1.
*
- * If the weight and lmax (max_pkt_size) of the classes do not change,
- * then QFQ+ does meet the above contraint according to the current
- * values of its parameters. In fact, if the weight and lmax of the
- * classes do not change, then, from the theory, QFQ+ guarantees that
- * the slot index is never higher than
- * 2 + QFQ_MAX_AGG_CLASSES * ((1<<QFQ_MTU_SHIFT)/QFQ_MIN_LMAX) *
- * (QFQ_MAX_WEIGHT/QFQ_MAX_WSUM) = 2 + 8 * 128 * (1 / 64) = 18
+ * The following function deals with this problem by backward-shifting
+ * the timestamps of agg, if needed, so as to guarantee that the slot
+ * index is never higher than QFQ_MAX_SLOTS-2. This backward-shift may
+ * cause the service of other aggregates to be postponed, yet the
+ * worst-case guarantees of these aggregates are not violated. In
+ * fact, in case of no out-of-order service, the timestamps of agg
+ * would have been even lower than they are after the backward shift,
+ * because QFQ+ would have guaranteed a maximum value equal to 2 for
+ * the slot index, and 2 < QFQ_MAX_SLOTS-2. Hence the aggregates whose
+ * service is postponed because of the backward-shift would have
+ * however waited for the service of agg before being served.
*
- * When the weight of a class is increased or the lmax of the class is
- * decreased, a new aggregate with smaller slot size than the original
- * parent aggregate of the class may happen to be activated. The
- * activation of this aggregate should be properly delayed to when the
- * service of the class has finished in the ideal system tracked by
- * QFQ+. If the activation of the aggregate is not delayed to this
- * reference time instant, then this aggregate may be unjustly served
- * before other aggregates waiting for service. This may cause the
- * above bound to the slot index to be violated for some of these
- * unlucky aggregates.
+ * The other event that may cause the slot index to be higher than 2
+ * for agg is a recent change of the parameters of some class. If the
+ * weight of a class is increased or the lmax (max_pkt_size) of the
+ * class is decreased, then a new aggregate with smaller slot size
+ * than the original parent aggregate of the class may happen to be
+ * activated. The activation of this aggregate should be properly
+ * delayed to when the service of the class has finished in the ideal
+ * system tracked by QFQ+. If the activation of the aggregate is not
+ * delayed to this reference time instant, then this aggregate may be
+ * unjustly served before other aggregates waiting for service. This
+ * may cause the above bound to the slot index to be violated for some
+ * of these unlucky aggregates.
*
* Instead of delaying the activation of the new aggregate, which is
- * quite complex, the following inaccurate but simple solution is used:
- * if the slot index is higher than QFQ_MAX_SLOTS-2, then the
- * timestamps of the aggregate are shifted backward so as to let the
- * slot index become equal to QFQ_MAX_SLOTS-2.
+ * quite complex, the above-discussed capping of the slot index is
+ * used to handle also the consequences of a change of the parameters
+ * of a class.
*/
static void qfq_slot_insert(struct qfq_group *grp, struct qfq_aggregate *agg,
u64 roundedS)
@@ -1077,7 +1104,7 @@ static struct sk_buff *qfq_dequeue(struc
else
in_serv_agg->budget -= len;
- q->V += (u64)len * IWSUM;
+ q->V += (u64)len * q->iwsum;
pr_debug("qfq dequeue: len %u F %lld now %lld\n",
len, (unsigned long long) in_serv_agg->F,
(unsigned long long) q->V);
next prev parent reply other threads:[~2013-07-26 21:36 UTC|newest]
Thread overview: 98+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-07-26 20:46 [ 00/79] 3.10.4-stable review Greg Kroah-Hartman
2013-07-26 20:46 ` [ 01/79] writeback: Fix periodic writeback after fs mount Greg Kroah-Hartman
2013-07-26 20:46 ` [ 02/79] sparc32: vm_area_struct access for old Sun SPARCs Greg Kroah-Hartman
2013-07-27 15:39 ` Ben Hutchings
2013-07-26 20:46 ` [ 03/79] ipv6: only apply anti-spoofing checks to not-pointopoint tunnels Greg Kroah-Hartman
2013-07-26 20:46 ` [ 04/79] neighbour: fix a race in neigh_destroy() Greg Kroah-Hartman
2013-07-26 20:46 ` [ 05/79] x25: Fix broken locking in ioctl error paths Greg Kroah-Hartman
2013-07-26 20:46 ` [ 06/79] net: Swap ver and type in pppoe_hdr Greg Kroah-Hartman
2013-07-27 15:58 ` Ben Hutchings
2013-07-28 0:55 ` David Miller
2013-07-28 3:14 ` Ben Hutchings
2013-07-28 4:16 ` Greg KH
2013-07-28 4:39 ` David Miller
2013-07-28 18:29 ` Greg KH
2013-07-26 20:46 ` [ 07/79] gre: fix a regression in ioctl Greg Kroah-Hartman
2013-07-26 20:46 ` [ 08/79] vti: remove duplicated code to fix a memory leak Greg Kroah-Hartman
2013-07-26 20:46 ` [ 09/79] ipv6,mcast: always hold idev->lock before mca_lock Greg Kroah-Hartman
2013-07-26 20:47 ` [ 10/79] ip_tunnels: Use skb-len to PMTU check Greg Kroah-Hartman
2013-07-26 20:47 ` [ 11/79] l2tp: add missing .owner to struct pppox_proto Greg Kroah-Hartman
2013-07-26 20:47 ` [ 12/79] ipip: fix a regression in ioctl Greg Kroah-Hartman
2013-07-26 20:47 ` [ 13/79] ipv6: call udp_push_pending_frames when uncorking a socket with AF_INET pending data Greg Kroah-Hartman
2013-07-26 20:47 ` [ 14/79] ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size Greg Kroah-Hartman
2013-07-26 20:47 ` [ 15/79] ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available Greg Kroah-Hartman
2013-07-26 20:47 ` [ 16/79] sfc: Fix memory leak when discarding scattered packets Greg Kroah-Hartman
2013-07-26 20:47 ` [ 17/79] net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit Greg Kroah-Hartman
2013-07-26 20:47 ` [ 18/79] virtio: support unlocked queue poll Greg Kroah-Hartman
2013-07-26 20:47 ` [ 19/79] virtio_net: fix race in RX VQ processing Greg Kroah-Hartman
2013-07-26 20:47 ` [ 20/79] vhost-net: fix use-after-free in vhost_net_flush Greg Kroah-Hartman
2013-07-26 20:47 ` [ 21/79] sunvnet: vnet_port_remove must call unregister_netdev Greg Kroah-Hartman
2013-07-26 20:47 ` [ 22/79] ifb: fix rcu_sched self-detected stalls Greg Kroah-Hartman
2013-07-26 20:47 ` [ 23/79] tuntap: correctly linearize skb when zerocopy is used Greg Kroah-Hartman
2013-07-26 20:47 ` [ 24/79] macvtap: " Greg Kroah-Hartman
2013-07-26 20:47 ` [ 25/79] ipv6: in case of link failure remove route directly instead of letting it expire Greg Kroah-Hartman
2013-07-26 20:47 ` [ 26/79] 9p: fix off by one causing access violations and memory corruption Greg Kroah-Hartman
2013-07-26 20:47 ` [ 27/79] alx: fix lockdep annotation Greg Kroah-Hartman
2013-07-26 20:47 ` [ 28/79] ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF Greg Kroah-Hartman
2013-07-26 20:47 ` [ 29/79] dummy: fix oops when loading the dummy failed Greg Kroah-Hartman
2013-07-26 20:47 ` [ 30/79] ifb: fix oops when loading the ifb failed Greg Kroah-Hartman
2013-07-26 20:47 ` [ 31/79] gre: Fix MTU sizing check for gretap tunnels Greg Kroah-Hartman
2013-07-26 20:47 ` [ 32/79] ipv6: only static routes qualify for equal cost multipathing Greg Kroah-Hartman
2013-07-26 20:47 ` [ 33/79] atl1e: fix dma mapping warnings Greg Kroah-Hartman
2013-07-26 20:47 ` [ 34/79] atl1e: unmap partially mapped skb on dma error and free skb Greg Kroah-Hartman
2013-07-26 20:47 ` [ 35/79] ipv4: set transport header earlier Greg Kroah-Hartman
2013-07-26 20:47 ` [ 36/79] be2net: Fix to avoid hardware workaround when not needed Greg Kroah-Hartman
2013-07-26 20:47 ` [ 37/79] hyperv: Fix the NETIF_F_SG flag setting in netvsc Greg Kroah-Hartman
2013-07-26 20:47 ` Greg Kroah-Hartman [this message]
2013-07-26 20:47 ` [ 39/79] tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS Greg Kroah-Hartman
2013-07-26 20:47 ` [ 40/79] macvtap: " Greg Kroah-Hartman
2013-07-26 20:47 ` [ 41/79] vlan: mask vlan prio bits Greg Kroah-Hartman
2013-07-26 20:47 ` [ 42/79] vlan: fix a race in egress prio management Greg Kroah-Hartman
2013-07-27 16:55 ` Ben Hutchings
2013-07-27 17:38 ` Eric Dumazet
2013-07-27 17:58 ` Ben Hutchings
2013-07-26 20:47 ` [ 43/79] MIPS: Oceton: Fix build error Greg Kroah-Hartman
2013-07-26 20:47 ` [ 44/79] RAPIDIO: IDT_GEN2: " Greg Kroah-Hartman
2013-07-26 20:47 ` [ 45/79] fuse: readdirplus: fix dentry leak Greg Kroah-Hartman
2013-07-26 20:47 ` [ 46/79] fuse: readdirplus: fix instantiate Greg Kroah-Hartman
2013-07-26 20:47 ` [ 47/79] fuse: readdirplus: sanity checks Greg Kroah-Hartman
2013-07-26 20:47 ` [ 48/79] bcache: Fix a dumb race Greg Kroah-Hartman
2013-07-26 20:47 ` [ 49/79] bcache: Advertise that flushes are supported Greg Kroah-Hartman
2013-07-26 20:47 ` [ 50/79] bcache: Shutdown fix Greg Kroah-Hartman
2013-07-26 20:47 ` [ 51/79] bcache: Fix a sysfs splat on shutdown Greg Kroah-Hartman
2013-07-26 20:47 ` [ 52/79] bcache: Fix GC_SECTORS_USED() calculation Greg Kroah-Hartman
2013-07-26 20:47 ` [ 53/79] bcache: Journal replay fix Greg Kroah-Hartman
2013-07-26 20:47 ` [ 54/79] EDAC: Fix lockdep splat Greg Kroah-Hartman
2013-07-26 20:47 ` [ 55/79] SCSI: mpt3sas: Infinite loops can occur if MPI2_IOCSTATUS_CONFIG_INVALID_PAGE is not returned Greg Kroah-Hartman
2013-07-26 20:47 ` [ 56/79] SCSI: mpt3sas: fix for kernel panic when driver loads with HBA conected to non LUN 0 configured expander Greg Kroah-Hartman
2013-07-26 20:47 ` [ 57/79] SCSI: megaraid_sas: fix memory leak if SGL has zero length entries Greg Kroah-Hartman
2013-07-26 20:47 ` [ 58/79] lib/Kconfig.debug: Restrict FRAME_POINTER for MIPS Greg Kroah-Hartman
2013-07-26 20:47 ` [ 59/79] usb: serial: option: blacklist ONDA MT689DC QMI interface Greg Kroah-Hartman
2013-07-26 20:47 ` [ 60/79] usb: option: add TP-LINK MA260 Greg Kroah-Hartman
2013-07-26 20:47 ` [ 61/79] usb: serial: option: add Olivetti Olicard 200 Greg Kroah-Hartman
2013-07-26 20:47 ` [ 62/79] usb: serial: option.c: remove ONDA MT825UP product ID fromdriver Greg Kroah-Hartman
2013-07-26 20:47 ` [ 63/79] USB: option: append Petatel NP10T device to GSM modems list Greg Kroah-Hartman
2013-07-26 20:47 ` [ 64/79] USB: option: add D-Link DWM-152/C1 and DWM-156/C1 Greg Kroah-Hartman
2013-07-26 20:47 ` [ 65/79] usb: serial: option: Add ONYX 3G device support Greg Kroah-Hartman
2013-07-26 20:47 ` [ 66/79] ARM: S3C24XX: Add missing clkdev entries for s3c2440 UART Greg Kroah-Hartman
2013-07-26 20:47 ` [ 67/79] ARM: footbridge: fix overlapping PCI mappings Greg Kroah-Hartman
2013-07-26 20:47 ` [ 68/79] usb: serial: cp210x: Add USB ID for Netgear Switches embedded serial adapter Greg Kroah-Hartman
2013-07-26 20:47 ` [ 69/79] USB: cp210x: add MMB and PI ZigBee USB Device Support Greg Kroah-Hartman
2013-07-26 20:48 ` [ 70/79] usb: cp210x support SEL C662 Vendor/Device Greg Kroah-Hartman
2013-07-26 20:48 ` [ 71/79] ext4: fix error handling in ext4_ext_truncate() Greg Kroah-Hartman
2013-07-27 21:33 ` Ben Hutchings
2013-07-28 11:40 ` Theodore Ts'o
2013-07-28 18:27 ` Greg Kroah-Hartman
2013-07-28 21:15 ` Ben Hutchings
2013-07-26 20:48 ` [ 72/79] PM / Sleep: avoid autosleep in shutdown progress Greg Kroah-Hartman
2013-07-26 20:48 ` [ 73/79] media: saa7134: Fix unlocked snd_pcm_stop() call Greg Kroah-Hartman
2013-07-26 20:48 ` [ 74/79] media: dmxdev: remove dvb_ringbuffer_flush() on writer side Greg Kroah-Hartman
2013-07-26 20:48 ` [ 75/79] lockd: protect nlm_blocked access in nlmsvc_retry_blocked Greg Kroah-Hartman
2013-07-26 20:48 ` [ 76/79] hrtimers: Move SMP function call to thread context Greg Kroah-Hartman
2013-07-26 20:48 ` [ 77/79] ALSA: hda - Remove NO_PRESENCE bit override for Dell 1420n Laptop Greg Kroah-Hartman
2013-07-26 20:48 ` [ 78/79] ALSA: usb-audio: 6fire: return correct XRUN indication Greg Kroah-Hartman
2013-07-26 20:48 ` [ 79/79] ALSA: hda - Fix EAPD GPIO control for Sigmatel codecs Greg Kroah-Hartman
2013-07-27 0:19 ` [ 00/79] 3.10.4-stable review Shuah Khan
2013-07-27 0:57 ` Greg Kroah-Hartman
2013-07-28 0:48 ` linux
2013-07-28 18:26 ` Greg Kroah-Hartman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130726204725.907756224@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=davem@davemloft.net \
--cc=linux-kernel@vger.kernel.org \
--cc=paolo.valente@unimore.it \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox