* [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
From: sandr8 @ 2004-08-13  0:48 UTC (permalink / raw)
  To: hadi, kuznet, davem, devik, shemminger, kaber, rusty, laforge
  Cc: netdev, netfilter-devel

2) The second patch eliminates the need for the deprecated __parent 
field, along with some tricks such as the rx_class field... to do so it 
introduces a slight change in the interface of the enqueue() and 
requeue() operations: the "struct sk_buff *skb" argument now becomes a 
"struct sk_buff ** const skb". All the packet schedulers in the 2.6.8 
kernel and net/core/dev.c are updated accordingly... the changes are 
quite trivial, so out-of-tree schedulers should have no trouble following.
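
On the caller side the change boils down to passing the address of the 
local skb pointer and freeing whatever victim it designates afterwards. 
A minimal usage sketch, modeled on the net/core/dev.c hunk below (not 
itself part of the patch):

    struct sk_buff *skb = ...;        /* packet to transmit */
    int rc = q->enqueue(&skb, q);     /* the qdisc may rewrite 'skb' */

    if (rc != NET_XMIT_SUCCESS) {
        /* 'skb' now points at the designated victim, which may or
         * may not be the packet we passed in; free it here, at the
         * outermost level */
        kfree_skb(skb);
    }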

To make this clearer I should give a broader overview... I hate 
describing the gory details, but this part is a bit delicate.

a) the socket buffer of a dropped packet is freed as late as possible 
(lazy lazy lazy!). In other words, locally a packet is only a 
_candidate_ for dropping; the final word belongs to the most global level.

Who has the most global viewpoint? Elementary, my dear Watson: the 
outermost enqueue operation. OK... now you get it... the call stack is 
your friend and discriminates for free :) the pointer to the socket 
buffer always stays at the same stack location, but is updated by the 
enqueue operations along the way.

This means that whichever packet "Barabba" is chosen to be dropped 
internally, it can still be saved by the caller, who can exchange it 
for another victim... (I think there's no need to say which one)
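
A minimal sketch of the idea (the toy qdisc and its rescue policy are 
hypothetical, not part of the patch; any_dropped() comes from the first 
patch of this series):

    static int toy_enqueue(struct sk_buff ** const skb, struct Qdisc *sch)
    {
        /* assume the child qdisc is the first member of our priv area */
        struct Qdisc *child = *(struct Qdisc **)qdisc_priv(sch);
        struct sk_buff *candidate = *skb;       /* the packet we were handed */
        int ret = child->enqueue(skb, child);   /* may rewrite *skb */

        if (any_dropped(ret) && *skb != candidate) {
            /* the inner level designated another victim ("Barabba");
             * here we could requeue it somewhere and save it, instead
             * of letting the outermost caller free it */
        }
        return ret;
    }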

b) the changes in (a) made it possible to remove the deprecated 
__parent field that cbq required. This is done by handling the reshape 
operation from the outside (in a very dumb loop), no matter who drops 
what.

This way we also avoid saving the class in the rx_class field, calling 
a function to retrieve it, and playing complex tricks to cope with the 
old by-value API.
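
In condensed form (the real thing is in the cbq_enqueue() hunk below), 
the outside reshape loop is simply:

    while (cl) {
        cbq_mark_toplevel(q, cl);
        if (no_dropped(ret = cl->q->enqueue(skb, cl->q)))
            return ret;                     /* enqueued: done */
        cl->stats.drops++;
        if (cl->police != TC_POLICE_RECLASSIFY)
            break;                          /* no reshaping configured */
        cl = cbq_reclassify(*skb, cl);      /* retry with the new class */
    }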

c) it is now possible to have (almost) a single point (net/core/dev.c) 
where packets are dropped. Though a few other places remain, the inline 
function before_explicit_drop() eases the centralization of the drop 
handling. This comes in very handy for patch 4.

Also, the empty macro IMPLICIT_DROP() acts as a marker for the reader 
and the coder, helping maintainability and readability.
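
As a sketch of the convention (modeled on fifo_drop() in the patch 
below): the victim is merely unlinked and left in *skb, and 
IMPLICIT_DROP() marks the spot where the free is deliberately deferred 
to the caller:

    static unsigned int toy_drop(struct Qdisc *sch, struct sk_buff ** const skb)
    {
        *skb = __skb_dequeue_tail(&sch->q);   /* designate the victim */
        if (*skb) {
            IMPLICIT_DROP();   /* marker only: the outermost caller will
                                * run before_explicit_drop() and
                                * kfree_skb() on *skb */
            return (*skb)->len;
        }
        return 0;
    }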

d) the reshape_fail field is nevertheless kept, and so is the code that 
handled it in the leaf disciplines (slightly reworked, but kept). This 
is just in case some out-of-tree classful schedulers use it: we don't 
want to break them. (As far as the in-kernel schedulers are concerned, 
you can wipe out all the reshape_fail-related code without any harm, it 
is not needed any more... speaking of which, what about wrapping it in 
some #ifdef CONFIG_TC_DONT_BREAK_OUT_OF_THE_KERNEL_CLASSFUL_SCHEDULERS ?)
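
Just to illustrate the suggestion, the bfifo overlimit path from the 
patch below would then read (the Kconfig symbol is of course 
hypothetical):

    sch->stats.drops++;
    #ifdef CONFIG_TC_DONT_BREAK_OUT_OF_THE_KERNEL_CLASSFUL_SCHEDULERS
        if (sch->reshape_fail==NULL || sch->reshape_fail(skb, sch)){
    #endif
            IMPLICIT_DROP();
            return NET_XMIT_DROP;
    #ifdef CONFIG_TC_DONT_BREAK_OUT_OF_THE_KERNEL_CLASSFUL_SCHEDULERS
        }
        return NET_XMIT_RESHAPED;   /* an out-of-tree handler saved it */
    #endif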

NOTA BENE
e) it is darn important that, whenever any packet is dropped, the 
return value of the outermost queueing discipline's enqueue() says so!

NOTA BENE
f) the kfree_skb() calls for locally allocated socket buffers are still 
there and must stay there.

NOTA BENE
g) a kfree_skb() whose semantics are just to decrement the user count 
is not a packet drop. Furthermore, the very same kfree_skb() call site 
can have a different run-time semantic, and is hence split in two, as 
in the following code excerpt:

         }
 
         if (terminal) {
-            kfree_skb(skb);
+            if( any_dropped(*qres) ){
+                before_explicit_drop(*skb);
+                IMPLICIT_DROP();
+            } else
+                kfree_skb(*skb);
             return NULL;
         }
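
For reference: any_dropped() and no_dropped(), used above and 
throughout the diff, come from the first patch of this series 
("netxmitcodes"); in essence they are just parity tests on the 
re-encoded NET_XMIT_* values, along these lines (a sketch; see also the 
sch_api.c documentation hunk below):

    /* drops are encoded as odd return values, successes as even ones */
    static inline int any_dropped(int net_xmit_code)
    {
        return net_xmit_code & 0x1;
    }

    static inline int no_dropped(int net_xmit_code)
    {
        return !(net_xmit_code & 0x1);
    }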

Alessandro Salvatori
--
the _NOSPAM_ account is the one I am subscribed with; please remove 
_NOSPAM_ for personal replies

diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/include/net/pkt_sched.h linux-2.6.8-rc4-apichanged/include/net/pkt_sched.h
--- linux-2.6.8-rc4-netxmitcodes/include/net/pkt_sched.h    2004-08-12 13:31:12.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/include/net/pkt_sched.h    2004-08-12 16:01:09.134153672 +0200
@@ -52,10 +52,10 @@
     char            id[IFNAMSIZ];
     int            priv_size;
 
-    int             (*enqueue)(struct sk_buff *, struct Qdisc *);
+    int             (*enqueue)(struct sk_buff ** const, struct Qdisc *);
     struct sk_buff *    (*dequeue)(struct Qdisc *);
-    int             (*requeue)(struct sk_buff *, struct Qdisc *);
-    unsigned int        (*drop)(struct Qdisc *);
+    int             (*requeue)(struct sk_buff ** const, struct Qdisc *);
+    unsigned int        (*drop)(struct Qdisc *, struct sk_buff ** const);
 
     int            (*init)(struct Qdisc *, struct rtattr *arg);
     void            (*reset)(struct Qdisc *);
@@ -71,7 +71,7 @@
 
 struct Qdisc
 {
-    int             (*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
+    int             (*enqueue)(struct sk_buff ** const skb, struct Qdisc *dev);
     struct sk_buff *    (*dequeue)(struct Qdisc *dev);
     unsigned        flags;
 #define TCQ_F_BUILTIN    1
@@ -88,14 +88,17 @@
     struct tc_stats        stats;
     spinlock_t        *stats_lock;
     struct rcu_head     q_rcu;
-    int            (*reshape_fail)(struct sk_buff *skb, struct Qdisc *q);
-
-    /* This field is deprecated, but it is still used by CBQ
-     * and it will live until better solution will be invented.
-     */
-    struct Qdisc        *__parent;
+    int            (*reshape_fail)(struct sk_buff ** const skb,
+                    struct Qdisc *q);
 };
 
+#define IMPLICIT_DROP() do; while (0) /* readability: just to be aware of what you are doing!!! */
+
+static inline void before_explicit_drop(const struct sk_buff * skb)
+{
+    /* for the moment there's nothing to do. see next patch!!! */
+}
+
 #define    QDISC_ALIGN        32
 #define    QDISC_ALIGN_CONST    (QDISC_ALIGN - 1)
 
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/core/dev.c linux-2.6.8-rc4-apichanged/net/core/dev.c
--- linux-2.6.8-rc4-netxmitcodes/net/core/dev.c    2004-08-10 12:27:35.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/core/dev.c    2004-08-12 17:23:43.682947768 +0200
@@ -1341,13 +1341,18 @@
         /* Grab device queue */
         spin_lock_bh(&dev->queue_lock);
 
-        rc = q->enqueue(skb, q);
+        rc = q->enqueue(&skb, q);
 
         qdisc_run(dev);
 
         spin_unlock_bh(&dev->queue_lock);
         rcu_read_unlock();
         rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
+
+        if(rc!=NET_XMIT_SUCCESS){ /* unlikely()? IMHO better left to dynamic prediction */
+            before_explicit_drop(skb);
+            goto out_kfree_skb;
+        }
         goto out;
     }
     rcu_read_unlock();
@@ -1747,7 +1752,7 @@
         }
         spin_lock(&dev->ingress_lock);
         if ((q = dev->qdisc_ingress) != NULL)
-            result = q->enqueue(skb, q);
+            result = q->enqueue(&skb, q);
         spin_unlock(&dev->ingress_lock);
 
     }
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_api.c linux-2.6.8-rc4-apichanged/net/sched/sch_api.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_api.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_api.c    2004-08-12 17:00:27.707168344 +0200
@@ -13,6 +13,7 @@
 * Rani Assaf <rani@magic.metawire.com> :980802: JIFFIES and CPU clock sources are repaired.
  * Eduardo J. Blanco <ejbs@netlabs.com.uy> :990222: kmod support
  * Jamal Hadi Salim <hadi@nortelnetworks.com>: 990601: ingress support
+ * sandr8 <alessandro.salvatori@eurecom.fr> :040812: api change, deferred drop, __parent workaround
  */
 
 #include <linux/config.h>
@@ -95,9 +96,9 @@
 
    ---enqueue
 
-   enqueue returns 0, if packet was enqueued successfully.
+   enqueue returns an even number, if packet was enqueued successfully.
    If packet (this one or another one) was dropped, it returns
-   not zero error code.
+   an odd error code.
    NET_XMIT_DROP     - this packet dropped
      Expected action: do not backoff, but wait until queue will clear.
   NET_XMIT_CN         - probably this packet enqueued, but another one dropped.
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_atm.c linux-2.6.8-rc4-apichanged/net/sched/sch_atm.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_atm.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_atm.c    2004-08-12 16:39:10.000000000 +0200
@@ -398,7 +398,7 @@
 /* --------------------------- Qdisc operations ---------------------------- */
 
 
-static int atm_tc_enqueue(struct sk_buff *skb,struct Qdisc *sch)
+static int atm_tc_enqueue(struct sk_buff ** const skb,struct Qdisc *sch)
 {
     struct atm_qdisc_data *p = PRIV(sch);
     struct atm_flow_data *flow = NULL ; /* @@@ */
@@ -406,13 +406,13 @@
     int result;
     int ret = NET_XMIT_POLICED;
 
-    D2PRINTK("atm_tc_enqueue(skb %p,sch %p,[qdisc %p])\n",skb,sch,p);
+    D2PRINTK("atm_tc_enqueue(skb %p,sch %p,[qdisc %p])\n",*skb,sch,p);
     result = TC_POLICE_OK; /* be nice to gcc */
-    if (TC_H_MAJ(skb->priority) != sch->handle ||
-        !(flow = (struct atm_flow_data *) atm_tc_get(sch,skb->priority)))
+    if (TC_H_MAJ((*skb)->priority) != sch->handle ||
+        !(flow = (struct atm_flow_data *) atm_tc_get(sch,(*skb)->priority)))
         for (flow = p->flows; flow; flow = flow->next)
             if (flow->filter_list) {
-                result = tc_classify(skb,flow->filter_list,
+                result = tc_classify((*skb),flow->filter_list,
                     &res);
                 if (result < 0) continue;
                 flow = (struct atm_flow_data *) res.class;
@@ -422,17 +422,17 @@
     if (!flow) flow = &p->link;
     else {
         if (flow->vcc)
-            ATM_SKB(skb)->atm_options = flow->vcc->atm_options;
+            ATM_SKB(*skb)->atm_options = flow->vcc->atm_options;
             /*@@@ looks good ... but it's not supposed to work :-)*/
 #ifdef CONFIG_NET_CLS_POLICE
         switch (result) {
             case TC_POLICE_SHOT:
-                kfree_skb(skb);
+                IMPLICIT_DROP();
                 break;
             case TC_POLICE_RECLASSIFY:
                 if (flow->excess) flow = flow->excess;
                 else {
-                    ATM_SKB(skb)->atm_options |=
+                    ATM_SKB(*skb)->atm_options |=
                         ATM_ATMOPT_CLP;
                     break;
                 }
@@ -508,8 +508,11 @@
                 struct sk_buff *new;
 
                 new = skb_realloc_headroom(skb,flow->hdr_len);
+                if(!new)
+                    before_explicit_drop(skb);
                 dev_kfree_skb(skb);
-                if (!new) continue;
+                if (!new)
+                    continue;
                 skb = new;
             }
             D2PRINTK("sch_atm_dequeue: ip %p, data %p\n",
@@ -538,12 +541,12 @@
 }
 
 
-static int atm_tc_requeue(struct sk_buff *skb,struct Qdisc *sch)
+static int atm_tc_requeue(struct sk_buff ** const skb,struct Qdisc *sch)
 {
     struct atm_qdisc_data *p = PRIV(sch);
     int ret;
 
-    D2PRINTK("atm_tc_requeue(skb %p,sch %p,[qdisc %p])\n",skb,sch,p);
+    D2PRINTK("atm_tc_requeue(skb %p,sch %p,[qdisc %p])\n",*skb,sch,p);
     ret = p->link.q->ops->requeue(skb,p->link.q);
     if (!ret) sch->q.qlen++;
     else {
@@ -554,7 +557,7 @@
 }
 
 
-static unsigned int atm_tc_drop(struct Qdisc *sch)
+static unsigned int atm_tc_drop(struct Qdisc *sch, struct sk_buff ** const skb)
 {
     struct atm_qdisc_data *p = PRIV(sch);
     struct atm_flow_data *flow;
@@ -562,7 +565,7 @@
 
     DPRINTK("atm_tc_drop(sch %p,[qdisc %p])\n",sch,p);
     for (flow = p->flows; flow; flow = flow->next)
-        if (flow->q->ops->drop && (len = flow->q->ops->drop(flow->q)))
+        if (flow->q->ops->drop && (len = flow->q->ops->drop(flow->q, skb)))
             return len;
     return 0;
 }
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_cbq.c linux-2.6.8-rc4-apichanged/net/sched/sch_cbq.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_cbq.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_cbq.c    2004-08-12 17:03:46.257984032 +0200
@@ -8,6 +8,8 @@
  *
  * Authors:    Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
  *
+ * Fixes:
+ * sandr8 <alessandro.salvatori@eurecom.fr> __parent workaround, rx_class removal
  */
 
 #include <linux/config.h>
@@ -170,9 +172,6 @@
     struct cbq_class    *active[TC_CBQ_MAXPRIO+1];    /* List of all classes
                                    with backlog */
 
-#ifdef CONFIG_NET_CLS_POLICE
-    struct cbq_class    *rx_class;
-#endif
     struct cbq_class    *tx_class;
     struct cbq_class    *tx_borrowed;
     int            tx_len;
@@ -239,17 +238,17 @@
  */
 
 static struct cbq_class *
-cbq_classify(struct sk_buff *skb, struct Qdisc *sch, int *qres)
+cbq_classify(struct sk_buff ** const skb, struct Qdisc *sch, int *qres)
 {
     struct cbq_sched_data *q = qdisc_priv(sch);
     struct cbq_class *head = &q->link;
     struct cbq_class **defmap;
     struct cbq_class *cl = NULL;
-    u32 prio = skb->priority;
+    u32 prio = (*skb)->priority;
     struct tcf_result res;
 
     /*
-     *  Step 1. If skb->priority points to one of our classes, use it.
+     *  Step 1. If (*skb)->priority points to one of our classes, use it.
      */
     if (TC_H_MAJ(prio^sch->handle) == 0 &&
         (cl = cbq_class_lookup(q, prio)) != NULL)
@@ -265,7 +264,7 @@
         /*
          * Step 2+n. Apply classifier.
          */
-        if (!head->filter_list || (result = tc_classify(skb, head->filter_list, &res)) < 0)
+        if (!head->filter_list || (result = tc_classify(*skb, head->filter_list, &res)) < 0)
             goto fallback;
 
         if ((cl = (void*)res.class) == NULL) {
@@ -296,14 +295,18 @@
         }
 
         if (terminal) {
-            kfree_skb(skb);
+            if( any_dropped(*qres) ){
+                before_explicit_drop(*skb);
+                IMPLICIT_DROP();
+            } else
+                kfree_skb(*skb);
             return NULL;
         }
 #else
 #ifdef CONFIG_NET_CLS_POLICE
         switch (result) {
         case TC_POLICE_RECLASSIFY:
-            return cbq_reclassify(skb, cl);
+            return cbq_reclassify(*skb, cl);
         case TC_POLICE_SHOT:
             return NULL;
         default:
@@ -417,61 +420,61 @@
 }
 
 static int
-cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+cbq_enqueue(struct sk_buff ** const skb, struct Qdisc *sch)
 {
     struct cbq_sched_data *q = qdisc_priv(sch);
-    int len = skb->len;
     int ret = NET_XMIT_SUCCESS;
     struct cbq_class *cl = cbq_classify(skb, sch,&ret);
 
-#ifdef CONFIG_NET_CLS_POLICE
-    q->rx_class = cl;
-#endif
-    if (cl) {
-#ifdef CONFIG_NET_CLS_POLICE
-        cl->q->__parent = sch;
-#endif
-        if ((ret = cl->q->enqueue(skb, cl->q)) == NET_XMIT_SUCCESS) {
+    while (cl) {
+        cbq_mark_toplevel(q, cl);
+        if ( no_dropped(ret = cl->q->enqueue(skb, cl->q)) ) {
             sch->q.qlen++;
             sch->stats.packets++;
-            sch->stats.bytes+=len;
-            cbq_mark_toplevel(q, cl);
+            sch->stats.bytes+=(*skb)->len;
             if (!cl->next_alive)
                 cbq_activate_class(cl);
             return ret;
+#ifdef CONFIG_NET_CLS_POLICE
+        } else {
+            /* we re-enqueue the (latest) dropped packet */
+            cl->stats.drops++;
+           
+            if(cl->police != TC_POLICE_RECLASSIFY)
+                break;
+            /* just one line follows: it replaces the old
+             * reshape_fail, without needing the rx_class
+             * and __parent tricks */
+            cl=cbq_reclassify(*skb, cl);
+#endif
         }
     }
 
 #ifndef CONFIG_NET_CLS_ACT
     sch->stats.drops++;
     if (cl == NULL)
-        kfree_skb(skb);
+        IMPLICIT_DROP();
     else {
         cbq_mark_toplevel(q, cl);
         cl->stats.drops++;
     }
 #else
-    if ( NET_XMIT_DROP == ret) {
+    if ( any_dropped(ret) ) {
         sch->stats.drops++;
     }
-
-    if (cl != NULL) {
-        cbq_mark_toplevel(q, cl);
-        cl->stats.drops++;
-    }
 #endif
     return ret;
 }
 
 static int
-cbq_requeue(struct sk_buff *skb, struct Qdisc *sch)
+cbq_requeue(struct sk_buff ** const skb, struct Qdisc *sch)
 {
     struct cbq_sched_data *q = qdisc_priv(sch);
     struct cbq_class *cl;
     int ret;
 
     if ((cl = q->tx_class) == NULL) {
-        kfree_skb(skb);
+        IMPLICIT_DROP();
         sch->stats.drops++;
         return NET_XMIT_CN;
     }
@@ -479,10 +482,6 @@
 
     cbq_mark_toplevel(q, cl);
 
-#ifdef CONFIG_NET_CLS_POLICE
-    q->rx_class = cl;
-    cl->q->__parent = sch;
-#endif
     if ((ret = cl->q->ops->requeue(skb, cl->q)) == 0) {
         sch->q.qlen++;
         if (!cl->next_alive)
@@ -625,9 +624,13 @@
 
 static void cbq_ovl_drop(struct cbq_class *cl)
 {
+    struct sk_buff * skb;
+
     if (cl->q->ops->drop)
-        if (cl->q->ops->drop(cl->q))
+        if (cl->q->ops->drop(cl->q, &skb)){
+            before_explicit_drop(skb);
             cl->qdisc->q.qlen--;
+        }
     cl->xstats.overactions++;
     cbq_ovl_classic(cl);
 }
@@ -708,42 +711,6 @@
     netif_schedule(sch->dev);
 }
 
-
-#ifdef CONFIG_NET_CLS_POLICE
-
-static int cbq_reshape_fail(struct sk_buff *skb, struct Qdisc *child)
-{
-    int len = skb->len;
-    struct Qdisc *sch = child->__parent;
-    struct cbq_sched_data *q = qdisc_priv(sch);
-    struct cbq_class *cl = q->rx_class;
-
-    q->rx_class = NULL;
-
-    if (cl && (cl = cbq_reclassify(skb, cl)) != NULL) {
-
-        cbq_mark_toplevel(q, cl);
-
-        q->rx_class = cl;
-        cl->q->__parent = sch;
-
-        if (cl->q->enqueue(skb, cl->q) == 0) {
-            sch->q.qlen++;
-            sch->stats.packets++;
-            sch->stats.bytes+=len;
-            if (!cl->next_alive)
-                cbq_activate_class(cl);
-            return 0;
-        }
-        sch->stats.drops++;
-        return 0;
-    }
-
-    sch->stats.drops++;
-    return -1;
-}
-#endif
-
 /*
    It is mission critical procedure.
 
@@ -1268,7 +1235,7 @@
     }
 }
 
-static unsigned int cbq_drop(struct Qdisc* sch)
+static unsigned int cbq_drop(struct Qdisc* sch, struct sk_buff ** const skb)
 {
     struct cbq_sched_data *q = qdisc_priv(sch);
     struct cbq_class *cl, *cl_head;
@@ -1281,7 +1248,7 @@
 
         cl = cl_head;
         do {
-            if (cl->q->ops->drop && (len = cl->q->ops->drop(cl->q))) {
+            if (cl->q->ops->drop && (len = cl->q->ops->drop(cl->q, skb))) {
                 sch->q.qlen--;
                 return len;
             }
@@ -1413,13 +1380,7 @@
 static int cbq_set_police(struct cbq_class *cl, struct tc_cbq_police *p)
 {
     cl->police = p->police;
-
-    if (cl->q->handle) {
-        if (p->police == TC_POLICE_RECLASSIFY)
-            cl->q->reshape_fail = cbq_reshape_fail;
-        else
-            cl->q->reshape_fail = NULL;
-    }
+printk("police set to: %d", cl->police);
     return 0;
 }
 #endif
@@ -1698,11 +1659,6 @@
         if (new == NULL) {
            if ((new = qdisc_create_dflt(sch->dev, &pfifo_qdisc_ops)) == NULL)
                 return -ENOBUFS;
-        } else {
-#ifdef CONFIG_NET_CLS_POLICE
-            if (cl->police == TC_POLICE_RECLASSIFY)
-                new->reshape_fail = cbq_reshape_fail;
-#endif
         }
         sch_tree_lock(sch);
         *old = cl->q;
@@ -1764,9 +1720,6 @@
     struct cbq_class *cl;
     unsigned h;
 
-#ifdef CONFIG_NET_CLS_POLICE
-    q->rx_class = NULL;
-#endif
     for (h = 0; h < 16; h++) {
         for (cl = q->classes[h]; cl; cl = cl->next)
             cbq_destroy_filters(cl);
@@ -1790,15 +1743,6 @@
     struct cbq_class *cl = (struct cbq_class*)arg;
 
     if (--cl->refcnt == 0) {
-#ifdef CONFIG_NET_CLS_POLICE
-        struct cbq_sched_data *q = qdisc_priv(sch);
-
-        spin_lock_bh(&sch->dev->queue_lock);
-        if (q->rx_class == cl)
-            q->rx_class = NULL;
-        spin_unlock_bh(&sch->dev->queue_lock);
-#endif
-
         cbq_destroy_class(cl);
     }
 }
@@ -2021,10 +1965,6 @@
         q->tx_class = NULL;
         q->tx_borrowed = NULL;
     }
-#ifdef CONFIG_NET_CLS_POLICE
-    if (q->rx_class == cl)
-        q->rx_class = NULL;
-#endif
 
     cbq_unlink_class(cl);
     cbq_adjust_levels(cl->tparent);
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_dsmark.c linux-2.6.8-rc4-apichanged/net/sched/sch_dsmark.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_dsmark.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_dsmark.c    2004-08-12 16:39:11.000000000 +0200
@@ -186,38 +186,38 @@
 /* --------------------------- Qdisc operations ---------------------------- */
 
 
-static int dsmark_enqueue(struct sk_buff *skb,struct Qdisc *sch)
+static int dsmark_enqueue(struct sk_buff ** const skb,struct Qdisc *sch)
 {
     struct dsmark_qdisc_data *p = PRIV(sch);
     struct tcf_result res;
     int result;
     int ret = NET_XMIT_POLICED;
 
-    D2PRINTK("dsmark_enqueue(skb %p,sch %p,[qdisc %p])\n",skb,sch,p);
+    D2PRINTK("dsmark_enqueue(skb %p,sch %p,[qdisc %p])\n",*skb,sch,p);
     if (p->set_tc_index) {
         /* FIXME: Safe with non-linear skbs? --RR */
-        switch (skb->protocol) {
+        switch ((*skb)->protocol) {
             case __constant_htons(ETH_P_IP):
-                skb->tc_index = ipv4_get_dsfield(skb->nh.iph);
+                (*skb)->tc_index = ipv4_get_dsfield((*skb)->nh.iph);
                 break;
             case __constant_htons(ETH_P_IPV6):
-                skb->tc_index = ipv6_get_dsfield(skb->nh.ipv6h);
+                (*skb)->tc_index = ipv6_get_dsfield((*skb)->nh.ipv6h);
                 break;
             default:
-                skb->tc_index = 0;
+                (*skb)->tc_index = 0;
                 break;
         };
     }
     result = TC_POLICE_OK; /* be nice to gcc */
-    if (TC_H_MAJ(skb->priority) == sch->handle) {
-        skb->tc_index = TC_H_MIN(skb->priority);
+    if (TC_H_MAJ((*skb)->priority) == sch->handle) {
+        (*skb)->tc_index = TC_H_MIN((*skb)->priority);
     } else {
-        result = tc_classify(skb,p->filter_list,&res);
+        result = tc_classify(*skb,p->filter_list,&res);
         D2PRINTK("result %d class 0x%04x\n",result,res.classid);
         switch (result) {
 #ifdef CONFIG_NET_CLS_POLICE
             case TC_POLICE_SHOT:
-                kfree_skb(skb);
+                IMPLICIT_DROP(); /* this whole ifdef will never be coded! */
                 break;
 #if 0
             case TC_POLICE_RECLASSIFY:
@@ -225,13 +225,13 @@
 #endif
 #endif
             case TC_POLICE_OK:
-                skb->tc_index = TC_H_MIN(res.classid);
+                (*skb)->tc_index = TC_H_MIN(res.classid);
                 break;
             case TC_POLICE_UNSPEC:
                 /* fall through */
             default:
                 if (p->default_index != NO_DEFAULT_INDEX)
-                    skb->tc_index = p->default_index;
+                    (*skb)->tc_index = p->default_index;
                 break;
         };
     }
@@ -240,11 +240,11 @@
         result == TC_POLICE_SHOT ||
 #endif
 
-        ((ret = p->q->enqueue(skb,p->q)) != 0)) {
+        (0x1 & (ret = p->q->enqueue(skb,p->q))) ) {
         sch->stats.drops++;
         return ret;
     }
-    sch->stats.bytes += skb->len;
+    sch->stats.bytes += (*skb)->len;
     sch->stats.packets++;
     sch->q.qlen++;
     return ret;
@@ -289,12 +289,12 @@
 }
 
 
-static int dsmark_requeue(struct sk_buff *skb,struct Qdisc *sch)
+static int dsmark_requeue(struct sk_buff ** const skb,struct Qdisc *sch)
 {
     int ret;
     struct dsmark_qdisc_data *p = PRIV(sch);
 
-    D2PRINTK("dsmark_requeue(skb %p,sch %p,[qdisc %p])\n",skb,sch,p);
+    D2PRINTK("dsmark_requeue(skb %p,sch %p,[qdisc %p])\n",*skb,sch,p);
         if ((ret = p->q->ops->requeue(skb, p->q)) == 0) {
         sch->q.qlen++;
         return 0;
@@ -304,7 +304,7 @@
 }
 
 
-static unsigned int dsmark_drop(struct Qdisc *sch)
+static unsigned int dsmark_drop(struct Qdisc *sch, struct sk_buff ** const skb)
 {
     struct dsmark_qdisc_data *p = PRIV(sch);
     unsigned int len;
@@ -312,7 +312,7 @@
     DPRINTK("dsmark_reset(sch %p,[qdisc %p])\n",sch,p);
     if (!p->q->ops->drop)
         return 0;
-    if (!(len = p->q->ops->drop(p->q)))
+    if (!(len = p->q->ops->drop(p->q, skb)))
         return 0;
     sch->q.qlen--;
     return len;
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_fifo.c linux-2.6.8-rc4-apichanged/net/sched/sch_fifo.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_fifo.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_fifo.c    2004-08-12 16:39:11.000000000 +0200
@@ -43,30 +43,34 @@
 };
 
 static int
-bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+bfifo_enqueue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct fifo_sched_data *q = qdisc_priv(sch);
 
-    if (sch->stats.backlog + skb->len <= q->limit) {
-        __skb_queue_tail(&sch->q, skb);
-        sch->stats.backlog += skb->len;
-        sch->stats.bytes += skb->len;
+    if (sch->stats.backlog + (*skb)->len <= q->limit) {
+        __skb_queue_tail(&sch->q, *skb);
+        sch->stats.backlog += (*skb)->len;
+        sch->stats.bytes += (*skb)->len;
         sch->stats.packets++;
         return 0;
     }
     sch->stats.drops++;
 #ifdef CONFIG_NET_CLS_POLICE
-    if (sch->reshape_fail==NULL || sch->reshape_fail(skb, sch))
+    if (sch->reshape_fail==NULL || sch->reshape_fail(skb, sch)){
+#endif
+        IMPLICIT_DROP();
+        return NET_XMIT_DROP;
+#ifdef CONFIG_NET_CLS_POLICE
+    }
+    return NET_XMIT_RESHAPED;
 #endif
-        kfree_skb(skb);
-    return NET_XMIT_DROP;
 }
 
 static int
-bfifo_requeue(struct sk_buff *skb, struct Qdisc* sch)
+bfifo_requeue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
-    __skb_queue_head(&sch->q, skb);
-    sch->stats.backlog += skb->len;
+    __skb_queue_head(&sch->q, *skb);
+    sch->stats.backlog += (*skb)->len;
     return 0;
 }
 
@@ -82,15 +86,13 @@
 }
 
 static unsigned int
-fifo_drop(struct Qdisc* sch)
+fifo_drop(struct Qdisc* sch, struct sk_buff ** const  skb)
 {
-    struct sk_buff *skb;
-
-    skb = __skb_dequeue_tail(&sch->q);
-    if (skb) {
-        unsigned int len = skb->len;
+    *skb = __skb_dequeue_tail(&sch->q);
+    if (*skb) {
+        unsigned int len = (*skb)->len;
         sch->stats.backlog -= len;
-        kfree_skb(skb);
+        IMPLICIT_DROP();
         return len;
     }
     return 0;
@@ -104,28 +106,33 @@
 }
 
 static int
-pfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+pfifo_enqueue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct fifo_sched_data *q = qdisc_priv(sch);
 
     if (sch->q.qlen < q->limit) {
-        __skb_queue_tail(&sch->q, skb);
-        sch->stats.bytes += skb->len;
+        __skb_queue_tail(&sch->q, *skb);
+        sch->stats.bytes += (*skb)->len;
         sch->stats.packets++;
         return 0;
     }
     sch->stats.drops++;
+
+#ifdef CONFIG_NET_CLS_POLICE
+    if (sch->reshape_fail==NULL || sch->reshape_fail(skb, sch)){
+#endif
+        IMPLICIT_DROP();
+        return NET_XMIT_DROP;
 #ifdef CONFIG_NET_CLS_POLICE
-    if (sch->reshape_fail==NULL || sch->reshape_fail(skb, sch))
+    }
+    return NET_XMIT_RESHAPED;
 #endif
-        kfree_skb(skb);
-    return NET_XMIT_DROP;
 }
 
 static int
-pfifo_requeue(struct sk_buff *skb, struct Qdisc* sch)
+pfifo_requeue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
-    __skb_queue_head(&sch->q, skb);
+    __skb_queue_head(&sch->q, *skb);
     return 0;
 }
 
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_generic.c linux-2.6.8-rc4-apichanged/net/sched/sch_generic.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_generic.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_generic.c    2004-08-12 16:39:11.000000000 +0200
@@ -131,6 +131,7 @@
                packet when deadloop is detected.
              */
             if (dev->xmit_lock_owner == smp_processor_id()) {
+                before_explicit_drop(skb);
                 kfree_skb(skb);
                 if (net_ratelimit())
                    printk(KERN_DEBUG "Dead loop on netdevice %s, fix it urgently!\n", dev->name);
@@ -149,7 +150,7 @@
            3. device is buggy (ppp)
          */
 
-        q->ops->requeue(skb, q);
+        q->ops->requeue(&skb, q);
         netif_schedule(dev);
         return 1;
     }
@@ -217,9 +218,9 @@
  */
 
 static int
-noop_enqueue(struct sk_buff *skb, struct Qdisc * qdisc)
+noop_enqueue(struct sk_buff ** const skb, struct Qdisc * qdisc)
 {
-    kfree_skb(skb);
+    IMPLICIT_DROP();
     return NET_XMIT_CN;
 }
 
@@ -230,11 +231,11 @@
 }
 
 static int
-noop_requeue(struct sk_buff *skb, struct Qdisc* qdisc)
+noop_requeue(struct sk_buff ** const  skb, struct Qdisc* qdisc)
 {
     if (net_ratelimit())
-        printk(KERN_DEBUG "%s deferred output. It is buggy.\n", skb->dev->name);
-    kfree_skb(skb);
+        printk(KERN_DEBUG "%s deferred output. It is buggy.\n", (*skb)->dev->name);
+    IMPLICIT_DROP();
     return NET_XMIT_CN;
 }
 
@@ -283,21 +284,21 @@
  */
 
 static int
-pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc* qdisc)
+pfifo_fast_enqueue(struct sk_buff ** const skb, struct Qdisc* qdisc)
 {
     struct sk_buff_head *list = qdisc_priv(qdisc);
 
-    list += prio2band[skb->priority&TC_PRIO_MAX];
+    list += prio2band[(*skb)->priority&TC_PRIO_MAX];
 
     if (list->qlen < qdisc->dev->tx_queue_len) {
-        __skb_queue_tail(list, skb);
+        __skb_queue_tail(list, (*skb));
         qdisc->q.qlen++;
-        qdisc->stats.bytes += skb->len;
+        qdisc->stats.bytes += (*skb)->len;
         qdisc->stats.packets++;
         return 0;
     }
     qdisc->stats.drops++;
-    kfree_skb(skb);
+    IMPLICIT_DROP();
     return NET_XMIT_DROP;
 }
 
@@ -319,13 +320,13 @@
 }
 
 static int
-pfifo_fast_requeue(struct sk_buff *skb, struct Qdisc* qdisc)
+pfifo_fast_requeue(struct sk_buff ** const  skb, struct Qdisc* qdisc)
 {
     struct sk_buff_head *list = qdisc_priv(qdisc);
 
-    list += prio2band[skb->priority&TC_PRIO_MAX];
+    list += prio2band[(*skb)->priority&TC_PRIO_MAX];
 
-    __skb_queue_head(list, skb);
+    __skb_queue_head(list, *skb);
     qdisc->q.qlen++;
     return 0;
 }
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_gred.c linux-2.6.8-rc4-apichanged/net/sched/sch_gred.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_gred.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_gred.c    2004-08-12 16:39:11.000000000 +0200
@@ -102,7 +102,7 @@
 };
 
 static int
-gred_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+gred_enqueue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     psched_time_t now;
     struct gred_sched_data *q=NULL;
@@ -116,7 +116,7 @@
     }
 
 
-    if ( ((skb->tc_index&0xf) > (t->DPs -1)) || !(q=t->tab[skb->tc_index&0xf])) {
+    if ( (((*skb)->tc_index&0xf) > (t->DPs -1)) || !(q=t->tab[(*skb)->tc_index&0xf])) {
         printk("GRED: setting to default (%d)\n ",t->def);
         if (!(q=t->tab[t->def])) {
             DPRINTK("GRED: setting to default FAILED! dropping!! "
@@ -125,11 +125,11 @@
         }
         /* fix tc_index? --could be controvesial but needed for
            requeueing */
-        skb->tc_index=(skb->tc_index&0xfffffff0) | t->def;
+        (*skb)->tc_index=((*skb)->tc_index&0xfffffff0) | t->def;
     }
 
     D2PRINTK("gred_enqueue virtualQ 0x%x classid %x backlog %d "
-        "general backlog %d\n",skb->tc_index&0xf,sch->handle,q->backlog,
+        "general backlog %d\n",(*skb)->tc_index&0xf,sch->handle,q->backlog,
         sch->stats.backlog);
     /* sum up all the qaves of prios <= to ours to get the new qave*/
     if (!t->eqp && t->grio) {
@@ -144,7 +144,7 @@
     }
 
     q->packetsin++;
-    q->bytesin+=skb->len;
+    q->bytesin+=(*skb)->len;
 
     if (t->eqp && t->grio) {
         qave=0;
@@ -175,12 +175,12 @@
     if ((q->qave+qave) < q->qth_min) {
         q->qcount = -1;
 enqueue:
-        if (q->backlog + skb->len <= q->limit) {
-            q->backlog += skb->len;
+        if (q->backlog + (*skb)->len <= q->limit) {
+            q->backlog += (*skb)->len;
 do_enqueue:
-            __skb_queue_tail(&sch->q, skb);
-            sch->stats.backlog += skb->len;
-            sch->stats.bytes += skb->len;
+            __skb_queue_tail(&sch->q, *skb);
+            sch->stats.backlog += (*skb)->len;
+            sch->stats.bytes += (*skb)->len;
             sch->stats.packets++;
             return 0;
         } else {
@@ -188,7 +188,7 @@
         }
 
 drop:
-        kfree_skb(skb);
+        IMPLICIT_DROP();
         sch->stats.drops++;
         return NET_XMIT_DROP;
     }
@@ -212,17 +212,17 @@
 }
 
 static int
-gred_requeue(struct sk_buff *skb, struct Qdisc* sch)
+gred_requeue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct gred_sched_data *q;
     struct gred_sched *t= qdisc_priv(sch);
-    q= t->tab[(skb->tc_index&0xf)];
+    q= t->tab[((*skb)->tc_index&0xf)];
 /* error checking here -- probably unnecessary */
     PSCHED_SET_PASTPERFECT(q->qidlestart);
 
-    __skb_queue_head(&sch->q, skb);
-    sch->stats.backlog += skb->len;
-    q->backlog += skb->len;
+    __skb_queue_head(&sch->q, *skb);
+    sch->stats.backlog += (*skb)->len;
+    q->backlog += (*skb)->len;
     return 0;
 }
 
@@ -259,29 +259,27 @@
     return NULL;
 }
 
-static unsigned int gred_drop(struct Qdisc* sch)
+static unsigned int gred_drop(struct Qdisc* sch, struct sk_buff ** const skb)
 {
-    struct sk_buff *skb;
-
     struct gred_sched_data *q;
     struct gred_sched *t= qdisc_priv(sch);
 
-    skb = __skb_dequeue_tail(&sch->q);
-    if (skb) {
-        unsigned int len = skb->len;
+    *skb = __skb_dequeue_tail(&sch->q);
+    if (*skb) {
+        unsigned int len = (*skb)->len;
         sch->stats.backlog -= len;
         sch->stats.drops++;
-        q= t->tab[(skb->tc_index&0xf)];
+        q= t->tab[((*skb)->tc_index&0xf)];
         if (q) {
             q->backlog -= len;
             q->other++;
             if (!q->backlog && !t->eqp)
                 PSCHED_GET_TIME(q->qidlestart);
         } else {
-            D2PRINTK("gred_dequeue: skb has bad tcindex %x\n",skb->tc_index&0xf);
+            D2PRINTK("gred_dequeue: skb has bad tcindex %x\n",(*skb)->tc_index&0xf);
         }
 
-        kfree_skb(skb);
+        IMPLICIT_DROP();
         return len;
     }
 
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_hfsc.c linux-2.6.8-rc4-apichanged/net/sched/sch_hfsc.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_hfsc.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_hfsc.c    2004-08-12 16:39:11.000000000 +0200
@@ -967,7 +967,8 @@
         return 0;
     }
     len = skb->len;
-    if (unlikely(sch->ops->requeue(skb, sch) != NET_XMIT_SUCCESS)) {
+    if (unlikely(sch->ops->requeue(&skb, sch) != NET_XMIT_SUCCESS)) {
+        before_explicit_drop(skb);
         if (net_ratelimit())
             printk("qdisc_peek_len: failed to requeue\n");
         return 0;
@@ -1238,7 +1239,7 @@
 }
 
 static struct hfsc_class *
-hfsc_classify(struct sk_buff *skb, struct Qdisc *sch, int *qres)
+hfsc_classify(struct sk_buff ** const skb, struct Qdisc *sch, int *qres)
 {
     struct hfsc_sched *q = qdisc_priv(sch);
     struct hfsc_class *cl;
@@ -1246,13 +1247,13 @@
     struct tcf_proto *tcf;
     int result;
 
-    if (TC_H_MAJ(skb->priority ^ sch->handle) == 0 &&
-        (cl = hfsc_find_class(skb->priority, sch)) != NULL)
+    if (TC_H_MAJ((*skb)->priority ^ sch->handle) == 0 &&
+        (cl = hfsc_find_class((*skb)->priority, sch)) != NULL)
         if (cl->level == 0)
             return cl;
 
     tcf = q->root.filter_list;
-    while (tcf && (result = tc_classify(skb, tcf, &res)) >= 0) {
+    while (tcf && (result = tc_classify(*skb, tcf, &res)) >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
         int terminal = 0;
         switch (result) {
@@ -1272,7 +1273,11 @@
         }
 
         if (terminal) {
-            kfree_skb(skb);
+            if( any_dropped(*qres) ){
+                before_explicit_drop(*skb);
+                IMPLICIT_DROP();
+            } else
+                kfree_skb(*skb);
             return NULL;
         }
 #else
@@ -1685,11 +1690,11 @@
 }
 
 static int
-hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+hfsc_enqueue(struct sk_buff ** const skb, struct Qdisc *sch)
 {
     int ret = NET_XMIT_SUCCESS;
     struct hfsc_class *cl = hfsc_classify(skb, sch, &ret);
-    unsigned int len = skb->len;
+    unsigned int len = (*skb)->len;
     int err;
 
 
@@ -1702,14 +1707,14 @@
     }
 #else
     if (cl == NULL) {
-        kfree_skb(skb);
+        IMPLICIT_DROP();
         sch->stats.drops++;
         return NET_XMIT_DROP;
     }
 #endif
 
     err = cl->qdisc->enqueue(skb, cl->qdisc);
-    if (unlikely(err != NET_XMIT_SUCCESS)) {
+    if (unlikely(any_dropped(err))) {
         cl->stats.drops++;
         sch->stats.drops++;
         return err;
@@ -1797,17 +1802,17 @@
 }
 
 static int
-hfsc_requeue(struct sk_buff *skb, struct Qdisc *sch)
+hfsc_requeue(struct sk_buff ** const skb, struct Qdisc *sch)
 {
     struct hfsc_sched *q = qdisc_priv(sch);
 
-    __skb_queue_head(&q->requeue, skb);
+    __skb_queue_head(&q->requeue, *skb);
     sch->q.qlen++;
     return NET_XMIT_SUCCESS;
 }
 
 static unsigned int
-hfsc_drop(struct Qdisc *sch)
+hfsc_drop(struct Qdisc *sch, struct sk_buff ** const  skb)
 {
     struct hfsc_sched *q = qdisc_priv(sch);
     struct hfsc_class *cl;
@@ -1815,7 +1820,7 @@
 
     list_for_each_entry(cl, &q->droplist, dlist) {
         if (cl->qdisc->ops->drop != NULL &&
-            (len = cl->qdisc->ops->drop(cl->qdisc)) > 0) {
+            (len = cl->qdisc->ops->drop(cl->qdisc, skb)) > 0) {
             if (cl->qdisc->q.qlen == 0) {
                 update_vf(cl, 0, 0);
                 set_passive(cl);
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_htb.c linux-2.6.8-rc4-apichanged/net/sched/sch_htb.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_htb.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_htb.c    2004-08-12 16:39:11.000000000 +0200
@@ -298,7 +298,7 @@
     return (cl && cl != HTB_DIRECT) ? cl->classid : TC_H_UNSPEC;
 }
 
-static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch, int *qres)
+static struct htb_class *htb_classify(struct sk_buff ** const skb, struct Qdisc *sch, int *qres)
 {
     struct htb_sched *q = qdisc_priv(sch);
     struct htb_class *cl;
@@ -306,16 +306,16 @@
     struct tcf_proto *tcf;
     int result;
 
-    /* allow to select class by setting skb->priority to valid classid;
+    /* allow to select class by setting (*skb)->priority to valid classid;
        note that nfmark can be used too by attaching filter fw with no
        rules in it */
-    if (skb->priority == sch->handle)
+    if ((*skb)->priority == sch->handle)
         return HTB_DIRECT;  /* X:0 (direct flow) selected */
-    if ((cl = htb_find(skb->priority,sch)) != NULL && cl->level == 0)
+    if ((cl = htb_find((*skb)->priority,sch)) != NULL && cl->level == 0)
         return cl;
 
     tcf = q->filter_list;
-    while (tcf && (result = tc_classify(skb, tcf, &res)) >= 0) {
+    while (tcf && (result = tc_classify(*skb, tcf, &res)) >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
         int terminal = 0;
         switch (result) {
@@ -335,7 +335,11 @@
         }
 
         if (terminal) {
-            kfree_skb(skb);
+            if( any_dropped(*qres) ){
+                before_explicit_drop(*skb);
+                IMPLICIT_DROP();
+            } else
+                kfree_skb(*skb);
             return NULL;
         }
 #else
@@ -709,7 +713,7 @@
     list_del_init(&cl->un.leaf.drop_list);
 }
 
-static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int htb_enqueue(struct sk_buff ** const skb, struct Qdisc *sch)
 {
     int ret = NET_XMIT_SUCCESS;
     struct htb_sched *q = qdisc_priv(sch);
@@ -719,7 +723,7 @@
 #ifdef CONFIG_NET_CLS_ACT
     if (cl == HTB_DIRECT ) {
     if (q->direct_queue.qlen < q->direct_qlen ) {
-        __skb_queue_tail(&q->direct_queue, skb);
+        __skb_queue_tail(&q->direct_queue, *skb);
         q->direct_pkts++;
     }
     } else if (!cl) {
@@ -732,10 +736,10 @@
     if (cl == HTB_DIRECT || !cl) {
     /* enqueue to helper queue */
     if (q->direct_queue.qlen < q->direct_qlen && cl) {
-        __skb_queue_tail(&q->direct_queue, skb);
+        __skb_queue_tail(&q->direct_queue, *skb);
         q->direct_pkts++;
     } else {
-        kfree_skb (skb);
+        IMPLICIT_DROP();
         sch->stats.drops++;
         return NET_XMIT_DROP;
     }
@@ -746,32 +750,31 @@
     cl->stats.drops++;
     return NET_XMIT_DROP;
     } else {
-    cl->stats.packets++; cl->stats.bytes += skb->len;
+    cl->stats.packets++; cl->stats.bytes += (*skb)->len;
     htb_activate (q,cl);
     }
 
     sch->q.qlen++;
-    sch->stats.packets++; sch->stats.bytes += skb->len;
-    HTB_DBG(1,1,"htb_enq_ok cl=%X skb=%p\n",(cl && cl != 
HTB_DIRECT)?cl->classid:0,skb);
+    sch->stats.packets++; sch->stats.bytes += (*skb)->len;
+    HTB_DBG(1,1,"htb_enq_ok cl=%X skb=%p\n",(cl && cl != 
HTB_DIRECT)?cl->classid:0,*skb);
     return NET_XMIT_SUCCESS;
 }
 
 /* TODO: requeuing packet charges it to policers again !! */
-static int htb_requeue(struct sk_buff *skb, struct Qdisc *sch)
+static int htb_requeue(struct sk_buff ** const skb, struct Qdisc *sch)
 {
     struct htb_sched *q = qdisc_priv(sch);
     int ret =  NET_XMIT_SUCCESS;
     struct htb_class *cl = htb_classify(skb,sch, &ret);
-    struct sk_buff *tskb;
 
     if (cl == HTB_DIRECT || !cl) {
     /* enqueue to helper queue */
     if (q->direct_queue.qlen < q->direct_qlen && cl) {
-        __skb_queue_head(&q->direct_queue, skb);
+        __skb_queue_head(&q->direct_queue, *skb);
     } else {
-            __skb_queue_head(&q->direct_queue, skb);
-            tskb = __skb_dequeue_tail(&q->direct_queue);
-            kfree_skb (tskb);
+            __skb_queue_head(&q->direct_queue, *skb);
+            *skb = __skb_dequeue_tail(&q->direct_queue);
+            IMPLICIT_DROP();
             sch->stats.drops++;
             return NET_XMIT_CN;   
     }
@@ -783,7 +786,7 @@
         htb_activate (q,cl);
 
     sch->q.qlen++;
-    HTB_DBG(1,1,"htb_req_ok cl=%X skb=%p\n",(cl && cl != 
HTB_DIRECT)?cl->classid:0,skb);
+    HTB_DBG(1,1,"htb_req_ok cl=%X skb=%p\n",(cl && cl != 
HTB_DIRECT)?cl->classid:0,*skb);
     return NET_XMIT_SUCCESS;
 }
 
@@ -1145,7 +1148,7 @@
 }
 
 /* try to drop from each class (by prio) until one succeed */
-static unsigned int htb_drop(struct Qdisc* sch)
+static unsigned int htb_drop(struct Qdisc* sch, struct sk_buff ** const skb)
 {
     struct htb_sched *q = qdisc_priv(sch);
     int prio;
@@ -1157,7 +1160,7 @@
                               un.leaf.drop_list);
             unsigned int len;
             if (cl->un.leaf.q->ops->drop &&
-                (len = cl->un.leaf.q->ops->drop(cl->un.leaf.q))) {
+                (len = cl->un.leaf.q->ops->drop(cl->un.leaf.q, skb))) {
                 sch->q.qlen--;
                 if (!cl->un.leaf.q->q.qlen)
                     htb_deactivate (q,cl);
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_ingress.c linux-2.6.8-rc4-apichanged/net/sched/sch_ingress.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_ingress.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_ingress.c    2004-08-12 16:39:11.000000000 +0200
@@ -137,14 +137,14 @@
 /* --------------------------- Qdisc operations ---------------------------- */
 
 
-static int ingress_enqueue(struct sk_buff *skb,struct Qdisc *sch)
+static int ingress_enqueue(struct sk_buff ** const skb,struct Qdisc *sch)
 {
     struct ingress_qdisc_data *p = PRIV(sch);
     struct tcf_result res;
     int result;
 
-    D2PRINTK("ingress_enqueue(skb %p,sch %p,[qdisc %p])\n", skb, sch, p);
-    result = tc_classify(skb, p->filter_list, &res);
+    D2PRINTK("ingress_enqueue(skb %p,sch %p,[qdisc %p])\n", *skb, sch, p);
+    result = tc_classify(*skb, p->filter_list, &res);
     D2PRINTK("result %d class 0x%04x\n", result, res.classid);
     /*
      * Unlike normal "enqueue" functions, ingress_enqueue returns a
@@ -152,7 +152,7 @@
      */
 #ifdef CONFIG_NET_CLS_ACT
     sch->stats.packets++;
-    sch->stats.bytes += skb->len;
+    sch->stats.bytes += (*skb)->len;
     switch (result) {
         case TC_ACT_SHOT:
             result = TC_ACT_SHOT;
@@ -166,7 +166,7 @@
         case TC_ACT_OK:
         case TC_ACT_UNSPEC:
         default:
-            skb->tc_index = TC_H_MIN(res.classid);
+            (*skb)->tc_index = TC_H_MIN(res.classid);
             result = TC_ACT_OK;
             break;
     };
@@ -183,7 +183,7 @@
         case TC_POLICE_UNSPEC:
         default:
         sch->stats.packets++;
-        sch->stats.bytes += skb->len;
+        sch->stats.bytes += (*skb)->len;
         result = NF_ACCEPT;
         break;
     };
@@ -192,7 +192,7 @@
     D2PRINTK("Overriding result to ACCEPT\n");
     result = NF_ACCEPT;
     sch->stats.packets++;
-    sch->stats.bytes += skb->len;
+    sch->stats.bytes += (*skb)->len;
 #endif
 #endif
 
@@ -210,21 +210,24 @@
 }
 
 
-static int ingress_requeue(struct sk_buff *skb,struct Qdisc *sch)
+static int ingress_requeue(struct sk_buff ** const skb,struct Qdisc *sch)
 {
 /*
     struct ingress_qdisc_data *p = PRIV(sch);
-    D2PRINTK("ingress_requeue(skb %p,sch %p,[qdisc 
%p])\n",skb,sch,PRIV(p));
+    D2PRINTK("ingress_requeue(skb %p,sch %p,[qdisc 
%p])\n",*skb,sch,PRIV(p));
 */
     return 0;
 }
 
-static unsigned int ingress_drop(struct Qdisc *sch)
+static unsigned int ingress_drop(struct Qdisc *sch, struct sk_buff ** const skb)
 {
 #ifdef DEBUG_INGRESS
     struct ingress_qdisc_data *p = PRIV(sch);
 #endif
     DPRINTK("ingress_drop(sch %p,[qdisc %p])\n", sch, p);
+   
+    *skb=NULL; // if somebody tries to free me... :>
+   
     return 0;
 }
 
@@ -254,8 +257,12 @@
 
     if (dev->qdisc_ingress) {
         spin_lock(&dev->queue_lock);
-        if ((q = dev->qdisc_ingress) != NULL)
-            fwres = q->enqueue(skb, q);
+        if ((q = dev->qdisc_ingress) != NULL){
+            fwres = q->enqueue(pskb, q);
+            if(any_dropped(fwres)){
+                before_explicit_drop(*pskb);
+            }
+        }
         spin_unlock(&dev->queue_lock);
         }
            
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_netem.c linux-2.6.8-rc4-apichanged/net/sched/sch_netem.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_netem.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_netem.c    2004-08-12 16:39:11.000000000 +0200
@@ -601,14 +601,14 @@
 /* Enqueue packets with underlying discipline (fifo)
  * but mark them with current time first.
  */
-static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int netem_enqueue(struct sk_buff ** const skb, struct Qdisc *sch)
 {
     struct netem_sched_data *q = qdisc_priv(sch);
-    struct netem_skb_cb *cb = (struct netem_skb_cb *)skb->cb;
+    struct netem_skb_cb *cb = (struct netem_skb_cb *)(*skb)->cb;
     psched_time_t now;
     long delay;
 
-    pr_debug("netem_enqueue skb=%p @%lu\n", skb, jiffies);
+    pr_debug("netem_enqueue skb=%p @%lu\n", *skb, jiffies);
 
     /* Random packet drop 0 => none, ~0 => all */
     if (q->loss && q->loss >= net_random()) {
@@ -644,20 +644,20 @@
    
     /* Always queue at tail to keep packets in order */
     if (likely(q->delayed.qlen < q->limit)) {
-        __skb_queue_tail(&q->delayed, skb);
+        __skb_queue_tail(&q->delayed, *skb);
         sch->q.qlen++;
-        sch->stats.bytes += skb->len;
+        sch->stats.bytes += (*skb)->len;
         sch->stats.packets++;
         return 0;
     }
 
     sch->stats.drops++;
-    kfree_skb(skb);
+    IMPLICIT_DROP();
     return NET_XMIT_DROP;
 }
 
 /* Requeue packets but don't change time stamp */
-static int netem_requeue(struct sk_buff *skb, struct Qdisc *sch)
+static int netem_requeue(struct sk_buff ** const skb, struct Qdisc *sch)
 {
     struct netem_sched_data *q = qdisc_priv(sch);
     int ret;
@@ -668,12 +668,12 @@
     return ret;
 }
 
-static unsigned int netem_drop(struct Qdisc* sch)
+static unsigned int netem_drop(struct Qdisc* sch, struct sk_buff ** const skb)
 {
     struct netem_sched_data *q = qdisc_priv(sch);
     unsigned int len;
 
-    if ((len = q->qdisc->ops->drop(q->qdisc)) != 0) {
+    if ((len = q->qdisc->ops->drop(q->qdisc, skb)) != 0) {
         sch->q.qlen--;
         sch->stats.drops++;
     }
@@ -706,7 +706,7 @@
         }
         __skb_unlink(skb, &q->delayed);
 
-        if (q->qdisc->enqueue(skb, q->qdisc))
+        if (q->qdisc->enqueue(&skb, q->qdisc))
             sch->stats.drops++;
     }
 
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_prio.c linux-2.6.8-rc4-apichanged/net/sched/sch_prio.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_prio.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_prio.c    2004-08-12 16:41:53.936487224 +0200
@@ -47,16 +47,16 @@
 };
 
 
-struct Qdisc *prio_classify(struct sk_buff *skb, struct Qdisc *sch,int *r)
+struct Qdisc *prio_classify(struct sk_buff ** const skb, struct Qdisc *sch,int *r)
 {
     struct prio_sched_data *q = qdisc_priv(sch);
-    u32 band = skb->priority;
+    u32 band = (*skb)->priority;
     struct tcf_result res;
 
-    if (TC_H_MAJ(skb->priority) != sch->handle) {
+    if (TC_H_MAJ((*skb)->priority) != sch->handle) {
 #ifdef CONFIG_NET_CLS_ACT
         int result = 0, terminal = 0;
-        result = tc_classify(skb, q->filter_list, &res);
+        result = tc_classify(*skb, q->filter_list, &res);
 
         switch (result) {
             case TC_ACT_SHOT:
@@ -74,13 +74,17 @@
             break;
         };
         if (terminal) {
-            kfree_skb(skb);
+            if( any_dropped(*r) ){
+                before_explicit_drop(*skb);
+                IMPLICIT_DROP();
+            } else
+                kfree_skb(*skb);
             return NULL;
-        }
+        }
 
         if (!q->filter_list ) {
 #else
-        if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) {
+        if (!q->filter_list || tc_classify(*skb, q->filter_list, &res)) {
 #endif
             if (TC_H_MAJ(band))
                 band = 0;
@@ -96,7 +100,7 @@
 }
 
 static int
-prio_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+prio_enqueue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct Qdisc *qdisc;
     int ret = NET_XMIT_SUCCESS;
@@ -107,7 +111,7 @@
         goto dropped;
 
     if ((ret = qdisc->enqueue(skb, qdisc)) == NET_XMIT_SUCCESS) {
-        sch->stats.bytes += skb->len;
+        sch->stats.bytes += (*skb)->len;
         sch->stats.packets++;
         sch->q.qlen++;
         return NET_XMIT_SUCCESS;
@@ -128,7 +132,7 @@
 
 
 static int
-prio_requeue(struct sk_buff *skb, struct Qdisc* sch)
+prio_requeue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct Qdisc *qdisc;
     int ret = NET_XMIT_DROP;
@@ -167,7 +171,7 @@
 
 }
 
-static unsigned int prio_drop(struct Qdisc* sch)
+static unsigned int prio_drop(struct Qdisc* sch, struct sk_buff ** const skb)
 {
     struct prio_sched_data *q = qdisc_priv(sch);
     int prio;
@@ -176,7 +180,7 @@
 
     for (prio = q->bands-1; prio >= 0; prio--) {
         qdisc = q->queues[prio];
-        if ((len = qdisc->ops->drop(qdisc)) != 0) {
+        if ((len = qdisc->ops->drop(qdisc, skb)) != 0) {
             sch->q.qlen--;
             return len;
         }
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_red.c linux-2.6.8-rc4-apichanged/net/sched/sch_red.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_red.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_red.c    2004-08-12 16:39:11.000000000 +0200
@@ -178,7 +178,7 @@
 }
 
 static int
-red_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+red_enqueue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct red_sched_data *q = qdisc_priv(sch);
 
@@ -242,16 +242,16 @@
     if (q->qave < q->qth_min) {
         q->qcount = -1;
 enqueue:
-        if (sch->stats.backlog + skb->len <= q->limit) {
-            __skb_queue_tail(&sch->q, skb);
-            sch->stats.backlog += skb->len;
-            sch->stats.bytes += skb->len;
+        if (sch->stats.backlog + (*skb)->len <= q->limit) {
+            __skb_queue_tail(&sch->q, *skb);
+            sch->stats.backlog += (*skb)->len;
+            sch->stats.bytes += (*skb)->len;
             sch->stats.packets++;
             return NET_XMIT_SUCCESS;
         } else {
             q->st.pdrop++;
         }
-        kfree_skb(skb);
+        IMPLICIT_DROP();
         sch->stats.drops++;
         return NET_XMIT_DROP;
     }
@@ -259,7 +259,7 @@
         q->qcount = -1;
         sch->stats.overlimits++;
 mark:
-        if  (!(q->flags&TC_RED_ECN) || !red_ecn_mark(skb)) {
+        if  (!(q->flags&TC_RED_ECN) || !red_ecn_mark(*skb)) {
             q->st.early++;
             goto drop;
         }
@@ -295,20 +295,20 @@
     goto enqueue;
 
 drop:
-    kfree_skb(skb);
+    IMPLICIT_DROP();
     sch->stats.drops++;
     return NET_XMIT_CN;
 }
 
 static int
-red_requeue(struct sk_buff *skb, struct Qdisc* sch)
+red_requeue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct red_sched_data *q = qdisc_priv(sch);
 
     PSCHED_SET_PASTPERFECT(q->qidlestart);
 
-    __skb_queue_head(&sch->q, skb);
-    sch->stats.backlog += skb->len;
+    __skb_queue_head(&sch->q, *skb);
+    sch->stats.backlog += (*skb)->len;
     return 0;
 }
 
@@ -327,18 +327,17 @@
     return NULL;
 }
 
-static unsigned int red_drop(struct Qdisc* sch)
+static unsigned int red_drop(struct Qdisc* sch, struct sk_buff ** const skb)
 {
-    struct sk_buff *skb;
     struct red_sched_data *q = qdisc_priv(sch);
 
-    skb = __skb_dequeue_tail(&sch->q);
-    if (skb) {
-        unsigned int len = skb->len;
+    *skb = __skb_dequeue_tail(&sch->q);
+    if (*skb) {
+        unsigned int len = (*skb)->len;
         sch->stats.backlog -= len;
         sch->stats.drops++;
         q->st.other++;
-        kfree_skb(skb);
+        IMPLICIT_DROP();
         return len;
     }
     PSCHED_GET_TIME(q->qidlestart);
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_sfq.c linux-2.6.8-rc4-apichanged/net/sched/sch_sfq.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_sfq.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_sfq.c    2004-08-12 16:39:11.000000000 +0200
@@ -209,11 +209,10 @@
     sfq_link(q, x);
 }
 
-static unsigned int sfq_drop(struct Qdisc *sch)
+static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff ** const skb)
 {
     struct sfq_sched_data *q = qdisc_priv(sch);
     sfq_index d = q->max_depth;
-    struct sk_buff *skb;
     unsigned int len;
 
     /* Queue is full! Find the longest slot and
@@ -221,10 +220,10 @@
 
     if (d > 1) {
         sfq_index x = q->dep[d+SFQ_DEPTH].next;
-        skb = q->qs[x].prev;
-        len = skb->len;
-        __skb_unlink(skb, &q->qs[x]);
-        kfree_skb(skb);
+        *skb = q->qs[x].prev;
+        len = (*skb)->len;
+        __skb_unlink(*skb, &q->qs[x]);
+        IMPLICIT_DROP();
         sfq_dec(q, x);
         sch->q.qlen--;
         sch->stats.drops++;
@@ -236,10 +235,10 @@
         d = q->next[q->tail];
         q->next[q->tail] = q->next[d];
         q->allot[q->next[d]] += q->quantum;
-        skb = q->qs[d].prev;
-        len = skb->len;
-        __skb_unlink(skb, &q->qs[d]);
-        kfree_skb(skb);
+        *skb = q->qs[d].prev;
+        len = (*skb)->len;
+        __skb_unlink(*skb, &q->qs[d]);
+        IMPLICIT_DROP();
         sfq_dec(q, d);
         sch->q.qlen--;
         q->ht[q->hash[d]] = SFQ_DEPTH;
@@ -251,10 +250,10 @@
 }
 
 static int
-sfq_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+sfq_enqueue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct sfq_sched_data *q = qdisc_priv(sch);
-    unsigned hash = sfq_hash(q, skb);
+    unsigned hash = sfq_hash(q, *skb);
     sfq_index x;
 
     x = q->ht[hash];
@@ -262,7 +261,7 @@
         q->ht[hash] = x = q->dep[SFQ_DEPTH].next;
         q->hash[x] = hash;
     }
-    __skb_queue_tail(&q->qs[x], skb);
+    __skb_queue_tail(&q->qs[x], *skb);
     sfq_inc(q, x);
     if (q->qs[x].qlen == 1) {        /* The flow is new */
         if (q->tail == SFQ_DEPTH) {    /* It is the first flow */
@@ -276,20 +275,20 @@
         }
     }
     if (++sch->q.qlen < q->limit-1) {
-        sch->stats.bytes += skb->len;
+        sch->stats.bytes += (*skb)->len;
         sch->stats.packets++;
         return 0;
     }
 
-    sfq_drop(sch);
+    sfq_drop(sch, skb);
     return NET_XMIT_CN;
 }
 
 static int
-sfq_requeue(struct sk_buff *skb, struct Qdisc* sch)
+sfq_requeue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct sfq_sched_data *q = qdisc_priv(sch);
-    unsigned hash = sfq_hash(q, skb);
+    unsigned hash = sfq_hash(q, *skb);
     sfq_index x;
 
     x = q->ht[hash];
@@ -297,7 +296,7 @@
         q->ht[hash] = x = q->dep[SFQ_DEPTH].next;
         q->hash[x] = hash;
     }
-    __skb_queue_head(&q->qs[x], skb);
+    __skb_queue_head(&q->qs[x], *skb);
     sfq_inc(q, x);
     if (q->qs[x].qlen == 1) {        /* The flow is new */
         if (q->tail == SFQ_DEPTH) {    /* It is the first flow */
@@ -314,7 +313,7 @@
         return 0;
 
     sch->stats.drops++;
-    sfq_drop(sch);
+    sfq_drop(sch, skb);
     return NET_XMIT_CN;
 }
 
@@ -362,8 +361,10 @@
 {
     struct sk_buff *skb;
 
-    while ((skb = sfq_dequeue(sch)) != NULL)
+    while ((skb = sfq_dequeue(sch)) != NULL){
+        before_explicit_drop(skb);
         kfree_skb(skb);
+    }
 }
 
 static void sfq_perturbation(unsigned long arg)
@@ -394,8 +395,11 @@
     if (ctl->limit)
         q->limit = min_t(u32, ctl->limit, SFQ_DEPTH);
 
-    while (sch->q.qlen >= q->limit-1)
-        sfq_drop(sch);
+    struct sk_buff * skb;
+    while (sch->q.qlen >= q->limit-1){
+        sfq_drop(sch, &skb);
+        before_explicit_drop(skb);
+    }
 
     del_timer(&q->perturb_timer);
     if (q->perturb_period) {
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_tbf.c linux-2.6.8-rc4-apichanged/net/sched/sch_tbf.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_tbf.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_tbf.c    2004-08-12 16:39:11.000000000 +0200
@@ -135,19 +135,23 @@
 #define L2T(q,L)   ((q)->R_tab->data[(L)>>(q)->R_tab->rate.cell_log])
 #define L2T_P(q,L) ((q)->P_tab->data[(L)>>(q)->P_tab->rate.cell_log])
 
-static int tbf_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+static int tbf_enqueue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct tbf_sched_data *q = qdisc_priv(sch);
     int ret;
 
-    if (skb->len > q->max_size) {
+    if ((*skb)->len > q->max_size) {
         sch->stats.drops++;
+
 #ifdef CONFIG_NET_CLS_POLICE
-        if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch))
+        if (sch->reshape_fail==NULL || sch->reshape_fail(skb, sch)){
+#endif
+            IMPLICIT_DROP();
+            return NET_XMIT_DROP;
+#ifdef CONFIG_NET_CLS_POLICE
+        }
+        return NET_XMIT_RESHAPED;
 #endif
-            kfree_skb(skb);
-
-        return NET_XMIT_DROP;
     }
 
     if ((ret = q->qdisc->enqueue(skb, q->qdisc)) != 0) {
@@ -156,12 +160,12 @@
     }
 
     sch->q.qlen++;
-    sch->stats.bytes += skb->len;
+    sch->stats.bytes += (*skb)->len;
     sch->stats.packets++;
     return 0;
 }
 
-static int tbf_requeue(struct sk_buff *skb, struct Qdisc* sch)
+static int tbf_requeue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct tbf_sched_data *q = qdisc_priv(sch);
     int ret;
@@ -172,12 +176,12 @@
     return ret;
 }
 
-static unsigned int tbf_drop(struct Qdisc* sch)
+static unsigned int tbf_drop(struct Qdisc* sch, struct sk_buff ** const skb)
 {
     struct tbf_sched_data *q = qdisc_priv(sch);
     unsigned int len;
 
-    if ((len = q->qdisc->ops->drop(q->qdisc)) != 0) {
+    if ((len = q->qdisc->ops->drop(q->qdisc, skb)) != 0) {
         sch->q.qlen--;
         sch->stats.drops++;
     }
@@ -247,8 +251,9 @@
            (cf. CSZ, HPFQ, HFSC)
          */
 
-        if (q->qdisc->ops->requeue(skb, q->qdisc) != NET_XMIT_SUCCESS) {
+        if (q->qdisc->ops->requeue(&skb, q->qdisc) != NET_XMIT_SUCCESS) {
             /* When requeue fails skb is dropped */
+            before_explicit_drop(skb);
             sch->q.qlen--;
             sch->stats.drops++;
         }
diff -NaurX dontdiff linux-2.6.8-rc4-netxmitcodes/net/sched/sch_teql.c linux-2.6.8-rc4-apichanged/net/sched/sch_teql.c
--- linux-2.6.8-rc4-netxmitcodes/net/sched/sch_teql.c    2004-08-10 12:27:36.000000000 +0200
+++ linux-2.6.8-rc4-apichanged/net/sched/sch_teql.c    2004-08-12 16:39:43.808269720 +0200
@@ -88,30 +88,30 @@
 /* "teql*" qdisc routines */
 
 static int
-teql_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+teql_enqueue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct net_device *dev = sch->dev;
     struct teql_sched_data *q = qdisc_priv(sch);
 
-    __skb_queue_tail(&q->q, skb);
+    __skb_queue_tail(&q->q, *skb);
     if (q->q.qlen <= dev->tx_queue_len) {
-        sch->stats.bytes += skb->len;
+        sch->stats.bytes += (*skb)->len;
         sch->stats.packets++;
         return 0;
     }
 
-    __skb_unlink(skb, &q->q);
-    kfree_skb(skb);
+    __skb_unlink(*skb, &q->q);
+    IMPLICIT_DROP();
     sch->stats.drops++;
     return NET_XMIT_DROP;
 }
 
 static int
-teql_requeue(struct sk_buff *skb, struct Qdisc* sch)
+teql_requeue(struct sk_buff ** const skb, struct Qdisc* sch)
 {
     struct teql_sched_data *q = qdisc_priv(sch);
 
-    __skb_queue_head(&q->q, skb);
+    __skb_queue_head(&q->q, *skb);
     return 0;
 }
 
@@ -340,6 +340,7 @@
 
 drop:
     master->stats.tx_dropped++;
+    before_explicit_drop(skb);
     dev_kfree_skb(skb);
     return 0;
 }


* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-13  0:48 [PATCH 2/4] deferred drop, __parent workaround, reshape_fail sandr8
@ 2004-08-13 12:51 ` jamal
  2004-08-13 14:09   ` sandr8
  2004-08-16  7:20   ` Harald Welte
  0 siblings, 2 replies; 30+ messages in thread
From: jamal @ 2004-08-13 12:51 UTC (permalink / raw)
  To: sandr8_NOSPAM_
  Cc: Alexey, David S. Miller, devik, shemminger, kaber, rusty,
	Harald Welte, netdev, netfilter-devel

Alessandro,

This summary applies to all your patches: Too many changes that seem
unnecessary. Take a deep breath.

1) You cant change the enqueue/dequeue API just because one qdisc
doesnt seem to use it right. Whats all that conntrack stuff doing in 
dev.c and pkt sched areas?
2) incr/decrementing ref cnt is the right way to go. skbs are shared.
Try running tcpdump with your changes (and what i suspect your qdisc is
doing) to see the effect. If you want to make changes to an skb, get
your own copy (or a copy of parts of the skb; look at things like clone
and skb_expand and relatives)
3) I have a feeling what you are trying to do is best achieved by
using a tc action and not a qdisc. As an example (based on a
conversation with Harald at OLS) i plan on doing selective conntracking
at the ingress/egress - but as a clean separate action and not touching
any other files. If this is what you are trying to do, i can sit on
the sideline and cheer you on with advice. It is not complicated to do,
time is the issue.
Maybe what you are trying to do is use conntracking for accounting?
In which case there will be two independent actions: one to track and
another to account.

cheers,
jamal

On Thu, 2004-08-12 at 20:48, sandr8 wrote:
> 2) The second patch eliminates the need for the deprecated __parent 
> field and eliminates some tricks such as the rx_class field...
> to do so it introduces a slight change in the interface of the enqueue() 
> and requeue() operations... the "struct sk_buff * skb" becomes now a 
> "struct sk_buff ** const skb". All the in-2.6.8-kernel packet schedulers 
> and the core/dev.c are updated consequently... changes are quite trivial 
> so there's no issue for out-of-the-kernel-tree schedulers.
> 
> in order to make it clearer i should give a broader insight... i hate to 
> describe the gory details but it's a bit delicate.
> 
> a) the socket buffer for a dropped packet is freed as late as possible 
> (lazy lazy lazy!). in other words, locally, a packet is _candidated_ to 
> be dropped, but the latest word is the most global one.
> 
> who's got the most global viewpoint? Elementary! My Dear Watson! the 
> most external enqueuing operation. Ok... now you get it... the stack is 
> your friend and will discriminate for free :) the pointer to the socket 
> buffer structure always stays at the same location, but is updated by 
> the enqueueing operations.
> 
> This means that whatever packet "Barabba" is choosen to be dropped 
> internally, it can be further saved by the caller, who can exchange it 
> with an other victim... (i think there's no need to say who)
> 
> b) the changes in (a) made it possible to remove the deprecated
> __parent field due to cbq. this is done by handling the reshape 
> operation from the outside (in a very stupid loop cycle), whoever
> drops whatever.
> 
> this way we also avoid saving the class in the rx_class field, calling a 
> function that retrieves it and doing complex tricks to cope with the old 
> byval api
> 
> c) now it's possible to have (almost) a single point (core/dev.c) where 
> packets are dropped. Though there are some other few places, the inline 
> function before_explicit_drop() easies the centralization of the 
> operations. this comes into great help for patch 4.
> 
> Also, the empty macro IMPLICIT_DROP() behaves as a marker for the 
> reader   and the coder, helping maintainance and readability.
> 
> d) yet the reshape_fail field is kept, and the code that handled it in 
> the leaf disciplines is kept as well (slightly reworked but kept). this 
> just in case some out-of-the-kernel classful schedulers use it: we don't 
> want to break them (for what concerns the in-kernel-schedulers, you can 
> wipe out all the code related to reshape_fail without any harm, it is 
> not any more needed... with regard to it, what about wrapping it in some 
> #ifdef CONFIG_TC_DONT_BREAK_OUT_OF_THE_KERNEL_CLASSFUL_SCHEDULERS ?)
> 
> NOTA BENE
> e) it is darn important that if a packet is dropped all the outmost 
> queueing discipline's enqueue()'s return value tells it!
> 
> NOTA BENE
> f) kfree_skb() for locally allocated socket buffers are still there and 
> must stay there
> 
> NOTA BENE
> g) kfree_skb() whose semantic is to just decrement the user count are 
> not a packet drop. furthermore, the same kfree_skb() could have a 
> different run-time semantic, and is hence splitted in two, as in the 
> following excerpt of code:
> 
>          }
>  
>          if (terminal) {
> -            kfree_skb(skb);
> +            if( any_dropped(*qres) ){
> +                before_explicit_drop(*skb);
> +                IMPLICIT_DROP();
> +            } else
> +                kfree_skb(*skb);
>              return NULL;
>          }
> 


* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-13 12:51 ` jamal
@ 2004-08-13 14:09   ` sandr8
  2004-08-14 21:21     ` jamal
  2004-08-16  7:20   ` Harald Welte
  1 sibling, 1 reply; 30+ messages in thread
From: sandr8 @ 2004-08-13 14:09 UTC (permalink / raw)
  To: hadi
  Cc: Alexey, David S. Miller, devik, shemminger, kaber, rusty,
	Harald Welte, netdev, netfilter-devel

jamal wrote:

>Alessandro,
>
>This summary applies to all your patches: Too many changes that seem
>unnecessary. Take a deep breath.
>
>1) You cant change the enqueue/dequeue API just because one qdisc
>doesnt seem to use it right. Whats all that contrack stuff doing in 
>dev.c and pkt sched areas?
>
the conntrack stuff (actually that little part of code that adds unbilling
to ACCT in case of a drop) is something that comes later (patch 4) and is
not in the pkt sched part. please, don't let it divert your mind from the
real point. just forget it for the moment... [in any case it is _not_ in
the pkt sched area and as i said in mail 4/4 i don't like to put the
variables into dev.c, that's why i am there asking for alternatives]

the api change is there to tell the outside which packet was dropped from
the qdisc queue as a consequence of the enqueueing operation. the very
first and most important consequence is that instead of having to call the
callback function reshape_fail() from the inside of the inner qdisc, and
having to save information about the last visited class into rx_class and
the parent qdisc into __parent, you just have to *use only one line*
inside a loop in the outer qdisc, isn't it neat?
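
to make it concrete, the outer loop can look more or less like this (a
rough sketch, not the literal sch_cbq.c code from the patch; classify()
stands for the usual class selection and cbq_reclassify() for the
fallback-class logic):

    /* sketch: with the by-reference API, when the inner enqueue returns
     * NET_XMIT_CN, *skb points at the victim the inner qdisc chose, so
     * the outer qdisc can reshape it into another class in a plain loop
     * instead of going through reshape_fail()/__parent/rx_class.
     */
    static int outer_enqueue(struct sk_buff ** const skb, struct Qdisc *sch)
    {
        struct cbq_class *cl = classify(*skb, sch);   /* hypothetical */
        int ret;

        while ((ret = cl->q->enqueue(skb, cl->q)) == NET_XMIT_CN) {
            cl = cbq_reclassify(*skb, cl);   /* pick another class */
            if (cl == NULL)
                break;   /* nobody saved the victim: it stays dropped */
        }
        return ret;
    }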

the main aim of all those changes in the API is not this accounting
stuff but having less and simpler code, which imho is more
straightforward, but this is just my viewpoint.

have a look at how many lines disappeared from sch_cbq.c.... as i said,
in all the other sources all code that has to do with the reshape_fail
can immediately disappear. just imagine how many lines (and ifdef
blocks!) can disappear from all the inner queueing policies that use
the reshape_fail(). asking the maintainer of an inner qdisc to add some
(furthermore conditionally compiled) code to help the outer qdisc makes
it harder to write and maintain, doesn't it?

yes, the patch looks big but there are many changes just because i wanted
to update every single scheduler so that they comply with the slightly
changed API. but now everything is ready to go and much code and overhead
has become unnecessary. i mean, if what i have just written above is not a
rave, then i'd say

"it is better not to overcomplicate the code of any non-root policy just
because one classful qdisc doesnt seem to use the API right"

rather than

"You cant change the enqueue/dequeue API just because one qdisc
doesnt seem to use it right."

and similarly i'd say

"Too many things have now become unnecessary."

rather than

"Too many changes that seem unnecessary"

...isn't this simply amazing?

>2) incr/decrementing ref cnt is the right way to go. skbs are shared.
>Try running tcpdump with your changes (and what i suspect your qdisc is
>doing) to see the effect. If you want to make changes to an skb, get
>your own copy (or copy of parts of the skb look at things like clone and
>skb_expand and relatives)
>  
>
i'm not making changes to any socket buffer... sorry i can't get the 
point about "If you want to make changes to an skb"

i could interpret your suggestion as to increment the user count when 
passing a skb to the enqueue() of a qdisc, so that the kfree_skb() inside 
will just decrement the counter. but still there's the same problem as 
always: how can you know from the outside which packet the 
kfree_skb() applied to?
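
just to show what i mean (hypothetical caller code, not from any patch,
and using the old byval API on purpose):

    /* the caller keeps its own reference, so a kfree_skb() inside the
     * qdisc only decrements the user count... but the caller still
     * cannot tell WHICH skb the inner qdisc freed.
     */
    static int refcounted_enqueue(struct sk_buff *skb, struct Qdisc *q)
    {
        int ret;

        skb_get(skb);                   /* user count goes to 2 */
        ret = q->enqueue(skb, q);       /* may kfree_skb() *some* skb inside */
        kfree_skb(skb);                 /* release our extra reference */
        return ret;
    }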

>3) I have a feeling what you are trying to do is best achieved by
>using a tc action and not a qdisc. As an example (based on a
>conversation with Harald at OLS) i plan on doing selective contracking
>at the ingress/egress - but as a clean separate action and not touching
>any other files. If this is what you are trying to do, i can help sit on
>the sideline and cheer you own with advice. It is not complicated to do,
>time is the issue.
>Maybe what you are trying to do is use contracking for accounting?
>In which case there will be two independent actions: one to track and
>another to account.
>  
>
this is what that conntrack stuff in patch 4 was there for... better to 
say: patch 4 is there to unbill in case of a drop due to traffic 
control. the billing is already done by patch 3 by Harald Welte, 
which, as far as i understood, was already on its way to inclusion.
packet action was not there in 2.6.7 when i started coding this patch. 
now i might as well try to dig into it and see if it can help. but, i 
repeat, in any case the main aim of all those changes in the API is not 
this accounting stuff but having less and simpler code.

>cheers,
>jamal
ciao and thank you for the attention!
sandr8)



* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-13 14:09   ` sandr8
@ 2004-08-14 21:21     ` jamal
  2004-08-16  7:35       ` Harald Welte
  0 siblings, 1 reply; 30+ messages in thread
From: jamal @ 2004-08-14 21:21 UTC (permalink / raw)
  To: sandr8
  Cc: Alexey, David S. Miller, devik, shemminger, kaber, rusty,
	Harald Welte, netdev, netfilter-devel

[please wrap your email at about 76 characters; makes it easier to
respond to]

On Fri, 2004-08-13 at 10:09, sandr8 wrote:
> jamal wrote:

> >1) You cant change the enqueue/dequeue API just because one qdisc
> >doesnt seem to use it right. Whats all that contrack stuff doing in 
> >dev.c and pkt sched areas?
> >
> the conntrack stuff (actually that little part of code that adds 
> unbilling to ACCT in case of a drop) is something that comes later 
> (patch 4) and is not in the pkt sched part. please, don't let it 
> divert your mind from the real point. just forget it for the moment...
>  [in any case it is _not_ in the pkt sched area and as i said in mail 
> 4/4 i don't like to put the variables into dev.c, that's why i am 
> there asking for alternatives]

This actually seems to be the core issue. 
Correct me if i misunderstood what you are trying to achieve: 
Somewhere above, the netfilter code bills some packet. Packet gets 
all the way to the scheduler on egress.
Scheduler drops packet although it has been billed already.
You, being a man looking for justice ;->, decide that was unfair and
are trying to undo it. Is this accurate?

Also, is there a corrective factor that happens once the _accounting_ 
data has been shipped? Example: 
- account for packet
- ship accounting data to some billing server
- oops, unbill
- what now?

BTW, what happens if you clone the packet below netfilter and send
several copies of it possibly over several different interfaces? This
may happen with tc extensions.
I think accounting is important - especially if it is almost free with
conntracking.
 
Lets talk about this issue first instead of confusing it with everything
else you have in other patches.

cheers,
jamal


* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-13 12:51 ` jamal
  2004-08-13 14:09   ` sandr8
@ 2004-08-16  7:20   ` Harald Welte
  2004-08-16 13:00     ` jamal
  1 sibling, 1 reply; 30+ messages in thread
From: Harald Welte @ 2004-08-16  7:20 UTC (permalink / raw)
  To: jamal; +Cc: sandr8, devik, netdev, netfilter-devel


[removing lots of people from the Cc, since they are definitely on
 netdev and don't need to be Cc'ed at this stage]

Hi Jamal!

On Fri, Aug 13, 2004 at 08:51:24AM -0400, jamal wrote:
> Alessandro,
> 
> This summary applies to all your patches: Too many changes that seem
> unnecessary. Take a deep breath.

I'm actually not as pessimistic about all his changes.

Alessandro's ultimate goal seems to be connection-based accounting that
accounts precisely which packets have actually hit the outgoing wire.
While I'm quite happy with the now in-kernel conntrack accounting
(based on Rx rather than Tx packets/bytes), this is a different
definition of accounting.

Let's discuss the individual patches separately.

1) Is certainly not a huge issue, no debate here

2) I am not as familiar with the tc/scheduler code as you are, but I
also think that what he is trying to achieve is a valid goal.  He tries
to make all tc-related packet drops go to a single code path for packet
dropping.  Independent of Alessandro's implementation, I would really
like to see something like this.   We once had an experimental patch
called the 'dropped hook' that would be traversed for all packets
dropped somewhere in the stack (for auditing in userspace, whatever).
Having a single packet drop point makes such a change less intrusive.

3) Is already in davem's tree, no need for discussion ;)

4) This is the part you are complaining about, right?  I agree, I don't
like conntrack specific stuff in dev.c and packet scheduler areas.

> cheers,
> jamal

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie



* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-14 21:21     ` jamal
@ 2004-08-16  7:35       ` Harald Welte
  2004-08-16 13:29         ` jamal
  0 siblings, 1 reply; 30+ messages in thread
From: Harald Welte @ 2004-08-16  7:35 UTC (permalink / raw)
  To: jamal; +Cc: sandr8, devik, netdev, netfilter-devel


On Sat, Aug 14, 2004 at 05:21:31PM -0400, jamal wrote:

> > the conntrack stuff (actually that little part of code that adds 
> > unbilling to ACCT in case of a drop) is something that comes later 
> > (patch 4) and is not in the pkt sched part. please, don't let it 
> > divert your mind from the real point. just forget it for the moment...
> >  [in any case it is _not_ in the pkt sched area and as i said in mail 
> > 4/4 i don't like to put the variables into dev.c, that's why i am 
> > there asking for alternatives]
> 
> This actually seems to be the core issue. 
> Correct me if i misunderstood what you are trying to achieve: 
> Somewhere above, the netfilter code bills some packet. Packet gets 
> all the way to the scheduler on egress.
> Scheduler drops packet although it has been billed already.
> You being a man looking for justice ;-> decides that was unfair and
> you are trying to undo it. Is this accurate?

Yes, that's how I understood Alessandro's efforts.  While I don't think
it's worth being that precise (I would rather change the definition of
what is being accounted to 'number of packets/bytes received in this flow').

> Also is their a corrective factor that happens once the _accounting_ 
> data has been shipped? Example: 
> - account for packet
> - ship accounting data to some billing server
> - oops, unbill
> - what now? 

Yes, this is a race condition.  However, I don't think this is very likely
to occur, since the accounting data is by default only sent to
userspace via ctnetlink once the connection tracking entry is deleted.

Yes, you can read it while the connection is still alive, but this will
not reset/update the counters, but rather give you a current snapshot.
If you send this to your accounting server, the accounting server has to
cope with the fact that this intermediate snapshot can be updated by
some later data.  It SHOULD not care whether this later data for the
same flow has bigger or smaller byte/packet counters (and it is very
unlikely that the total will be lower, since that would mean that in the
timeframe [snapshot, termination] more packets were dropped than
accepted).  Still, if this is documented with ctnetlink I'm perfectly
fine with such behaviour.

> BTW, what happens if you clone the packet below netfilter and send
> several copies of it possibly over several different interfaces? This
> may happen with tc extensions.

oh yes, I think somebody has written a similar iptables target, too. I'm
not sure whether there is a good solution for the 'unbill' feature.  Do
you have any thoughts/recommendations for this?

> I think accounting is important - especially if it is almost free with
> contracking.

Totally agreed.  

The reason for not delaying the accounting update until the qdisc has run
is locking.  Then we would have to re-grab the conntrack write lock to
make the counter update... whereas in my patch counter updates happen
while we are already under the write lock for the timer/timeout update.

> Lets talk about this issue first instead of confusing it with everything
> else you have in other patches.

Also, if this 'unbill' feature gets into the kernel in some form, I
would definitely make it a CONFIG_ or sysctl... after all people could
be interested in the Rx side only...
 
> cheers,
> jamal

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie



* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-16  7:20   ` Harald Welte
@ 2004-08-16 13:00     ` jamal
  2004-08-16 13:08       ` Harald Welte
  2004-08-16 15:19       ` sandr8
  0 siblings, 2 replies; 30+ messages in thread
From: jamal @ 2004-08-16 13:00 UTC (permalink / raw)
  To: Harald Welte; +Cc: sandr8, devik, netdev, netfilter-devel

On Mon, 2004-08-16 at 03:20, Harald Welte wrote:

> On Fri, Aug 13, 2004 at 08:51:24AM -0400, jamal wrote:
> > Alessandro,
> > 
> > This summary applies to all your patches: Too many changes that seem
> > unnecessary. Take a deep breath.
> 
> I'm actually not as pessimistic about all his changes.
> 
> Allesandro's ultimate goal seems to be connection-based accounting that
> accounts precisely which packets have actually hit the outgoing wire.
> While I'm quite happy with the now in-kernel conntrack accounting
> (basedo on Rx rather than Tx packets/bytes), this is a different
> definition of accounting.

When did that code go in?

> Let's discuss the individual patches seperately.
> 
> 1) Is certainly not a huge issue, no debate here

Although the patch does look innocent,
actually it may break a lot of things, since that code is particularly
tricky and the patch changes the environmental rules. It needs study.
Since it doesnt add anything other than cosmetics, holding off on it
might be the wisest thing.

> 2) I am not as familiar with the tc/scheduler code as you are, but I
> also think that what he is trying to achieve is a valid goal.  He tries
> to make all tc-related packet drops go to a single code path for packet
> dropping.  Independent of Allesandro's implementation, I would really
> like to see something like this.   We once had an experimental patch
> called the 'dropped hook' that would be traversed for all packets
> dropped somewhere in the stack (for auditing in userspace, whatever).
> Having a single packet drop point makes such a change less intrusive.

Seems to me the interest is more in having a single point for stats
collection.
Let me think about it for a bit and get back to you guys.

> 3) Is already in davem's tree, no need for discusion ;)

Didnt know that ;-> Is it in released kernels already?
Actually that patch should be fine.

> 4) This is the part you are complaining about, right?  I agree, I don't
> like conntrack specific stuff in dev.c and packet scheduler areas.

Let me think about it and get back to you. Solution to #2 should apply
here.

cheers,
jamal


* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-16 13:00     ` jamal
@ 2004-08-16 13:08       ` Harald Welte
  2004-08-16 15:19       ` sandr8
  1 sibling, 0 replies; 30+ messages in thread
From: Harald Welte @ 2004-08-16 13:08 UTC (permalink / raw)
  To: jamal; +Cc: sandr8, devik, netdev, netfilter-devel


On Mon, Aug 16, 2004 at 09:00:35AM -0400, jamal wrote:
> When did that code go in?

sorry that should have been 'now submitted and accepted by davem'

> Let me think about it for a bit and get back to you guys.

sure.

> > 3) Is already in davem's tree, no need for discusion ;)
> 
> Didnt know that ;-> Is it in released kernels already?

no, sorry. 2.6.9 stuff

> cheers,
> jamal

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie



* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-16  7:35       ` Harald Welte
@ 2004-08-16 13:29         ` jamal
  2004-08-24 18:57           ` Harald Welte
  0 siblings, 1 reply; 30+ messages in thread
From: jamal @ 2004-08-16 13:29 UTC (permalink / raw)
  To: Harald Welte; +Cc: sandr8, devik, netdev, netfilter-devel


On Mon, 2004-08-16 at 03:35, Harald Welte wrote:
> On Sat, Aug 14, 2004 at 05:21:31PM -0400, jamal wrote:
> 

> > Also, is there a corrective factor that happens once the _accounting_ 
> > data has been shipped? Example: 
> > - account for packet
> > - ship accounting data to some billing server
> > - oops, unbill
> > - what now? 
> 
> Yes, this is a race condition.  However, I don't think this is very likely
> to occur, since the accounting data is by default only sent to
> userspace via ctnetlink once the connection tracking entry is deleted.

ah, ok. so problem solved then. 

> Yes, you can read it while the connection is still alive, but this will
> not reset/update the counters, but rather give you a current snapshot.
> If you send this to your accounting server, the accounting server has to
> cope with the fact that this intermediate snapshot can be updated by
> some later data.  It SHOULD not care whether this later data for the
> same flow has bigger or smaller byte/packet counters (and it is very
> unlikely that the total will be lower, since that would mean that in the
> timeframe [snapshot, termination] more packets were dropped than
> accepted).  Still, if this is documented with ctnetlink I'm perfectly
> fine with such behaviour.

I am too. Good stuff.
I think 99.9% of accounting would be happy with getting data after the
call is done; the other 0.1% may have to live with extra packets later
which undo things. 
Are you working on something along the lines of the IPFIX protocol for
transport?

> > BTW, what happens if you clone the packet below netfilter and send
> > several copies of it possibly over several different interfaces? This
> > may happen with tc extensions.
> 
> oh yes, I think somebody has written a similar iptables target, too. I'm
> not sure whether there is a good solution for the 'unbill' feature.  Do
> you have any thoughts/recommendations for this?

Let me think about it.
Clearly the best place to account for things is on the wire once the
packet has left the box ;-> So the closer you are to the wire, the
better. How open are you to moving accounting further down? My thoughts
are along the lines of incrementing the conntrack counters at the qdisc
level. Since you transport after the structure has been deleted, it
should work out fine and fair billing will be taken care of.

> The reason for not delaying accounting update until qdisc has happened
> is locking.  Then we would have to re-grab the conntrack write lock to
> make the counter update... whrereas in my patch counter updates happen
> while we are already under write lock for the timer/timeout update.

Yikes. That sort of kills my thought above ;->
Has someone experimented and figured out how expensive it would be to
do it at the qdisc level? Note, you can probably have a low-level,
finer-grained lock just for stats.

> > Lets talk about this issue first instead of confusing it with everything
> > else you have in other patches.
> 
> Also, if this 'unbill' feature gets into the kernel in some form, I
> would definitely make it a CONFIG_ or sysctl... after all people could
> be interested in the Rx side only...

Agreed.

Heres thinking developed while responding to you.
Conntracking could use the generic stats counters that we plan to use
for MPLS (amongst other things). I have attached a patch i was going to
shoot to Dave - it needs testing against the latest kernels. The code is
reproduced from net/sched.

My thinking is that at the qdisc level instead of saying things like:
sch->stats.packets++;
you do:
INC_STATS(skb,sch->stats,reason_code)

INC_STATS is generic (it goes in the generic stats code in the attached
patch) and will have an ifdef for conntrack billing (which includes
unbilling depending on the reason code). The reason code could be the
results that are now returned.
As an example, NET_XMIT_DROP definitely means unbilling while
NET_XMIT_SUCCESS implies bill. 
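
Roughly something like this (just a sketch of the idea, not code from
the attached patch; CONFIG_NET_CT_BILLING and ct_unbill() are
placeholder names):

#ifdef CONFIG_NET_CT_BILLING                    /* placeholder symbol */
#define CT_ACCOUNT(skb, reason)                                 \
        do {                                                    \
                if ((reason) == NET_XMIT_DROP)                  \
                        ct_unbill(skb);  /* placeholder hook */ \
        } while (0)
#else
#define CT_ACCOUNT(skb, reason) do { } while (0)
#endif

#define INC_STATS(skb, st, reason)                              \
        do {                                                    \
                (st).packets++;                                 \
                (st).bytes += (skb)->len;                       \
                if ((reason) == NET_XMIT_DROP)                  \
                        (st).drops++;                           \
                CT_ACCOUNT(skb, reason);                        \
        } while (0)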

I think the current stats structure may not cover all cases but we can
discuss that before i push patch to Dave.

cheers,
jamal




[-- Attachment #2: stats1.patch --]
[-- Type: text/plain, Size: 9342 bytes --]

--- 268rc3/net/core/Makefile	2004/08/09 02:44:08	1.1
+++ 268rc3/net/core/Makefile	2004/08/09 02:46:01
@@ -2,7 +2,7 @@
 # Makefile for the Linux networking core.
 #
 
-obj-y := sock.o skbuff.o iovec.o datagram.o stream.o scm.o
+obj-y := sock.o skbuff.o iovec.o datagram.o stream.o scm.o gen_stats.o gen_estimator.o
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
--- /dev/null	1998-05-05 16:32:27.000000000 -0400
+++ 268rc3/net/core/gen_stats.c	2004-08-09 09:22:21.000000000 -0400
@@ -0,0 +1,105 @@
+/*
+ * net/core/gen_stats.c
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:  Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
+ *
+ * Changes:
+ * Jamal Hadi Salim adapted from net_sched_api for gen purpose use
+ *
+ */
+
+#include <asm/uaccess.h>
+#include <asm/system.h>
+#include <asm/bitops.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <linux/in.h>
+#include <linux/errno.h>
+#include <linux/interrupt.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/init.h>
+#include <net/sock.h>
+#include <linux/gen_stats.h>
+
+
+/* 
+ * USAGE:
+ *
+ * declare in mystruct:
+ * struct gnet_stats    mystats;
+ *
+ * increment as appropriate,eg :
+ *
+ * mystruct->mystats.packets++;
+ *
+ * update is lockless
+ *
+ * passing to user space:
+ *
+ * in routine my_dump():
+ *
+ *  if (gen_copy_stats(skb, &mystruct->mystats, MYSTAT_V, my_lock))
+ *                         goto rtattr_failure;
+ *
+ *
+ * locks:
+ *
+ * You are responsible for making sure that stats lock is
+ * initialized to something valid 
+ * (typically the table lock -- i.e updates happen only when
+ * you are dumping like here) 
+ * */
+int gen_copy_stats(struct sk_buff *skb, struct gnet_stats *st,int type, spinlock_t *lock)
+{
+	spin_lock_bh(lock);
+	RTA_PUT(skb, type, sizeof(struct gnet_stats), st);
+	spin_unlock_bh(lock);
+	return 0;
+
+rtattr_failure:
+	spin_unlock_bh(lock);
+	return -1;
+}
+
+/* 
+ * USAGE:
+ *
+ * declare your own privately formatted stats in mystruct:
+ * struct mypriv_stats    mystats;
+ *
+ * passing to user space:
+ *
+ * in routine my_dump():
+ *
+ *  if (gen_copy_xstats(skb, (void *)&mystruct->mystats, sizeof(struct mypriv_stats), MYPSTAT_V, my_lock))
+ *                         goto rtattr_failure;
+ *
+ *   Lock rules apply the same as in general stats
+ */
+int gen_copy_xstats(struct sk_buff *skb, void *st, int size, int type, spinlock_t *lock)
+{
+	spin_lock_bh(lock);
+	RTA_PUT(skb, type, size, st);
+	spin_unlock_bh(lock);
+	return 0;
+
+rtattr_failure:
+	spin_unlock_bh(lock);
+	return -1;
+}
+
+EXPORT_SYMBOL(gen_copy_stats);
+EXPORT_SYMBOL(gen_copy_xstats);
--- /dev/null	1998-05-05 16:32:27.000000000 -0400
+++ 268rc3/net/core/gen_estimator.c	2004-08-09 09:00:56.000000000 -0400
@@ -0,0 +1,207 @@
+/*
+ * net/sched/estimator.c	Simple rate estimator.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
+ *
+ * Changes:
+ *              Jamal Hadi Salim - moved it to net/core and reshuffled
+ *              names to make it usable in the general net subsystem.
+ *    
+ *
+ */
+
+#include <asm/uaccess.h>
+#include <asm/system.h>
+#include <asm/bitops.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <linux/in.h>
+#include <linux/errno.h>
+#include <linux/interrupt.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/init.h>
+#include <net/sock.h>
+#include <linux/gen_stats.h>
+
+/*
+   This code is NOT intended to be used for statistics collection,
+   its purpose is to provide a base for statistical multiplexing
+   for controlled load service.
+   If you need only statistics, run a user level daemon which
+   periodically reads byte counters.
+
+   Unfortunately, rate estimation is not a very easy task.
+   F.e. I did not find a simple way to estimate the current peak rate
+   and even failed to formulate the problem 8)8)
+
+   So I preferred not to build an estimator into the scheduler,
+   but run this task separately.
+   Ideally, it should be kernel thread(s), but for now it runs
+   from timers, which puts apparent top bounds on the number of rated
+   flows, has minimal overhead on small, but is enough
+   to handle controlled load service, sets of aggregates.
+
+   We measure rate over A=(1<<interval) seconds and evaluate EWMA:
+
+   avrate = avrate*(1-W) + rate*W
+
+   where W is chosen as negative power of 2: W = 2^(-ewma_log)
+
+   The resulting time constant is:
+
+   T = A/(-ln(1-W))
+
+
+   NOTES.
+
+   * The stored value for avbps is scaled by 2^5, so that maximal
+     rate is ~1Gbit, avpps is scaled by 2^10.
+
+   * Minimal interval is HZ/4=250msec (it is the greatest common divisor
+     for HZ=100 and HZ=1024 8)), maximal interval
+     is (HZ/4)*2^EST_MAX_INTERVAL = 8sec. Shorter intervals
+     are too expensive, longer ones can be implemented
+     at user level painlessly.
+ */
+
+#if (HZ%4) != 0
+#error Bad HZ value.
+#endif
+
+#define EST_MAX_INTERVAL	5
+
+struct gen_estimator
+{
+	struct gen_estimator	*next;
+	struct gnet_stats	*stats;
+	spinlock_t		*stats_lock;
+	unsigned		interval;
+	int			ewma_log;
+	u64			last_bytes;
+	u32			last_packets;
+	u32			avpps;
+	u32			avbps;
+};
+
+struct gen_estimator_head
+{
+	struct timer_list	timer;
+	struct gen_estimator	*list;
+};
+
+static struct gen_estimator_head elist[EST_MAX_INTERVAL+1];
+
+/* Estimator array lock */
+static rwlock_t est_lock = RW_LOCK_UNLOCKED;
+
+static void est_timer(unsigned long arg)
+{
+	int idx = (int)arg;
+	struct gen_estimator *e;
+
+	read_lock(&est_lock);
+	for (e = elist[idx].list; e; e = e->next) {
+		struct gnet_stats *st = e->stats;
+		u64 nbytes;
+		u32 npackets;
+		u32 rate;
+
+		spin_lock(e->stats_lock);
+		nbytes = st->bytes;
+		npackets = st->packets;
+		rate = (nbytes - e->last_bytes)<<(7 - idx);
+		e->last_bytes = nbytes;
+		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
+		st->bps = (e->avbps+0xF)>>5;
+
+		rate = (npackets - e->last_packets)<<(12 - idx);
+		e->last_packets = npackets;
+		e->avpps += ((long)rate - (long)e->avpps) >> e->ewma_log;
+		e->stats->pps = (e->avpps+0x1FF)>>10;
+		spin_unlock(e->stats_lock);
+	}
+
+	mod_timer(&elist[idx].timer, jiffies + ((HZ/4)<<idx));
+	read_unlock(&est_lock);
+}
+
+int gen_new_estimator(struct gnet_stats *stats, spinlock_t *stats_lock, struct rtattr *opt)
+{
+	struct gen_estimator *est;
+	struct gnet_estimator *parm = RTA_DATA(opt);
+
+	if (RTA_PAYLOAD(opt) < sizeof(*parm))
+		return -EINVAL;
+
+	if (parm->interval < -2 || parm->interval > 3)
+		return -EINVAL;
+
+	est = kmalloc(sizeof(*est), GFP_KERNEL);
+	if (est == NULL)
+		return -ENOBUFS;
+
+	memset(est, 0, sizeof(*est));
+	est->interval = parm->interval + 2;
+	est->stats = stats;
+	est->stats_lock = stats_lock;
+	est->ewma_log = parm->ewma_log;
+	est->last_bytes = stats->bytes;
+	est->avbps = stats->bps<<5;
+	est->last_packets = stats->packets;
+	est->avpps = stats->pps<<10;
+
+	est->next = elist[est->interval].list;
+	if (est->next == NULL) {
+		init_timer(&elist[est->interval].timer);
+		elist[est->interval].timer.data = est->interval;
+		elist[est->interval].timer.expires = jiffies + ((HZ/4)<<est->interval);
+		elist[est->interval].timer.function = est_timer;
+		add_timer(&elist[est->interval].timer);
+	}
+	write_lock_bh(&est_lock);
+	elist[est->interval].list = est;
+	write_unlock_bh(&est_lock);
+	return 0;
+}
+
+void gen_kill_estimator(struct gnet_stats *stats)
+{
+	int idx;
+	struct gen_estimator *est, **pest;
+
+	for (idx=0; idx <= EST_MAX_INTERVAL; idx++) {
+		int killed = 0;
+		pest = &elist[idx].list;
+		while ((est=*pest) != NULL) {
+			if (est->stats != stats) {
+				pest = &est->next;
+				continue;
+			}
+
+			write_lock_bh(&est_lock);
+			*pest = est->next;
+			write_unlock_bh(&est_lock);
+
+			kfree(est);
+			killed++;
+		}
+		if (killed && elist[idx].list == NULL)
+			del_timer(&elist[idx].timer);
+	}
+}
+
+EXPORT_SYMBOL(gen_kill_estimator);
+EXPORT_SYMBOL(gen_new_estimator);
--- /dev/null	1998-05-05 16:32:27.000000000 -0400
+++ 268rc3/include/linux/gen_stats.h	2004-08-09 09:06:29.000000000 -0400
@@ -0,0 +1,21 @@
+#ifndef __LINUX_GEN_STATS_H
+#define __LINUX_GEN_STATS_H
+
+struct gnet_stats
+{
+	__u64	bytes;			/* Number of seen bytes */
+	__u32	packets;		/* Number of seen packets	*/
+	__u32	drops;			/* Packets dropped */
+	__u32	bps;			/* Current flow byte rate */
+	__u32	pps;			/* Current flow packet rate */
+	__u32   qlen;
+	__u32   backlog;
+};
+
+struct gnet_estimator
+{
+	signed char	interval;
+	unsigned char	ewma_log;
+};
+
+#endif
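
An illustrative registration sequence for the estimator API above (the
private struct and the init/destroy pair are invented for the example):

	#include <linux/gen_stats.h>

	struct my_sched_data {
		struct gnet_stats	stats;
		spinlock_t		stats_lock;
	};

	/* opt carries a struct gnet_estimator rtattr from userspace */
	static int my_init(struct my_sched_data *q, struct rtattr *opt)
	{
		spin_lock_init(&q->stats_lock);
		return gen_new_estimator(&q->stats, &q->stats_lock, opt);
	}

	static void my_destroy(struct my_sched_data *q)
	{
		/* unlinks from the timer-driven estimator list */
		gen_kill_estimator(&q->stats);
	}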


* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-16 13:00     ` jamal
  2004-08-16 13:08       ` Harald Welte
@ 2004-08-16 15:19       ` sandr8
  2004-08-17 11:52         ` jamal
  1 sibling, 1 reply; 30+ messages in thread
From: sandr8 @ 2004-08-16 15:19 UTC (permalink / raw)
  To: hadi; +Cc: Harald Welte, sandr8, devik, netdev, netfilter-devel

jamal wrote:

>On Mon, 2004-08-16 at 03:20, Harald Welte wrote:
>  
>
>>Let's discuss the individual patches seperately.
>>
>>1) Is certainly not a huge issue, no debate here
>>    
>>
>
>Although the patch does look innocent,
>actually it may break a lot of things since that code is particulary
>tricky and the patch changes the environmental rules. It needs study.
>since it doesnt add anything other than cosmetics, holding on it
>might be the wisest thing.
>  
>
the danger would be where it is used incorrectly.
you are right, but for the moment it is not used
widely.

if instead you are talking about breaking some
code that relied on the old numerical values
themselves (besides the trivial case of 0, which many
times seems to be confused with being the only success
return value, which is not true) then i must have
definitely missed that code.

btw, wouldn't it be nice to have something similar for
the tc_classify results (eg. setting a "terminal" bit in the
defines and adding an inline terminal(result) function
that masks with that bit, so that we avoid switching
and hence save some lines of code and branches)?
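
something along these lines (made-up values, just to sketch it):

    #define TC_CLS_TERMINAL 0x100   /* placeholder bit; the real defines
                                     * would need renumbering */

    static inline int terminal(int res)
    {
        return res & TC_CLS_TERMINAL;   /* non-zero: classification ends */
    }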

>>2)... 
>>
>>Having a single packet drop point makes such a change less intrusive.
>>    
>>
>
>Seems to me the interest is more having single point for stats
>collection.
>Let me think about it for a bit and get back to you guys.
>
>  
>
well, i didn't dare do that too, because
all those * and & insertions themselves made
up a 50k patch... but in fact that would be
nice and would lighten the code inside the
many enqueue()...

needless to say, the more (common) code
is pulled outside of the per-discipline
code, the smaller the kernel is and the
easier it is to maintain/debug/introduce
further changes.

>>4) This is the part you are complaining about, right?  I agree, I don't
>>like conntrack specific stuff in dev.c and packet scheduler areas.
>>    
>>
>
>Let me think about it and get back to you. Solution to #2 should apply
>here.
>  
>
as i said from the beginning, i don't like that either.
yes, it works fine for me, but that is not the
logical place to do it...

please correct me if i am wrong: the two big questions are:

1) moving that code away from net/core/dev.c

and

2) what to do in case of packets sent over multiple interfaces?

for what concerns (1), i was wondering if it would be
less nasty to have a device enqueue operation that
wraps the outermost qdisc enqueue and does something
around it... this could give a good abstraction to
answer question 2 as well...

for what concerns (2), i see that jamal takes literally
the meaning of 'billing' a connection... well, i was
thinking just of traffic policing and not of money
accounting and billing servers...

in any case... regarding what to do if a packet sent over
multiple interfaces is dropped only on some of them and
not on all of them... this could also happen for a broadcast
or multicast packet... and thinking about it, the most coherent
thing to do from my viewpoint (traffic policing, not getting
money for the service given) would be to have a separate view
of the same packet flow at every different interface... this
would get more complex, but the separate device-level enqueue
could be the place to do that. it would also be the place for
the single point for drops.

in an other message, jamal wrote:

>Let me think about it.
>Clearly the best place to account for things is on the wire once the
>packet has left the box ;-> So the closest you are to the wire, the
>better. How open are you to move accounting further down? My thoughts
>are along the lines of incrementing the contrack counters at the qdisc
>level. Since you transport after the structure has been deleted, it
>should work out fine and fair billing will be taken care of.
>  
>
accounting when the packet goes to the wire would mean at the
dequeue level?

besides, as harald says, grabbing a lock every time a packet is sent and
not only when a packet is dropped... this would also imply, from my
particular viewpoint, that when enqueueing a packet we will not yet know
how much its flow/connection has been billed till now. we'll know that
only once the previous packet has been dequeued. and for me it would
be _much_ more cpu-intensive to act on that information if
i get it that late :''''(

this is because it would force more complexity into the enqueue
operation, which in the scheduler i'm trying to write does need that
information to put packets correctly into the queue.

i think that in that case, i'd better duplicate the work and account that
information on my own... the speedup i'd get would definitely be worth
having the same info twice... even though that would not be elegant at
all... :(

i should stop, take another long breath, and think a little bit about 
this ;)

cheers
alessandro


* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-16 15:19       ` sandr8
@ 2004-08-17 11:52         ` jamal
  2004-08-17 13:40           ` [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com , sandr8
  2004-08-17 13:49           ` [PATCH 2/4] deferred drop, __parent workaround, reshape_fail sandr8
  0 siblings, 2 replies; 30+ messages in thread
From: jamal @ 2004-08-17 11:52 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, sandr8, devik, netdev, netfilter-devel

On Mon, 2004-08-16 at 11:19, sandr8 wrote:
> jamal wrote:
> 
[..]
> the danger should be where it is used incorrectly.
> you are right, but for the moment it is not used
> spreadly.

It is used all over the stack. Lets defer this part of the 
discussion - even if we never "fix" this it doesnt matter.

> please correct me if i am wrong: the two big questions are:
> 
> 1) moving that code away from net/core/dev.c
> 
> and
> 
> 2) what to do in case of packets sent over multiple interfaces?
>
> for what concerns (1), i was wondering if it would be
> less nasty to have a device enqueue operation that will
> interact (wraps it and does something around that) with
> the outmost qdisc enqueue... this could give a good
> abstraction to answer to question 2 as well...

I am not sure i followed.

> for what concerns (2), i see that jamal takes literally
> the meaning of 'billing' a connection... well i was
> thinking just at traffic policing and not at money
> accounting and billing servers...

The policer (or any other action) accounts for what passes through it. 
Dropping packets at the policer is policy definition. Dropping packets
at the qdisc due to a full queue is an accident. An accident that in
a good system shouldnt happen. For the accident part i agree with
the unbilling/recompensation feature.
I do agree with you - I think we have some bad accounting practices at
the qdisc level. Look at the requeue accounting for another example.
I also have issues with some of the other stats; for example,
stats.packets should really be incremented on enqueue to account for
all the packets enqueue has seen (instead it is an accounting of
how many successes have happened)

> in any case... regarding what to do if a packet sent over
> multiple interfaces is dropped only in some of them and
> not on all of them... this could also happen for a broadcast
> or multicast packet... and thinking about it, the most coerent
> thing to do from my viewpoint (traffic policing, not getting
> money for the service given) would be to have a separate view
> of the same packet flow at every different interface... this
> would get more complex, but the separate device-level enqueue
> could be the place to do that. there would also have place the
> single point for drops.

Yes, this is a hard question. Did you see the suggestion i proposed
to Harald?

> in an other message, jamal wrote:
> 
> >Let me think about it.
> >Clearly the best place to account for things is on the wire once the
> >packet has left the box ;-> So the closest you are to the wire, the
> >better. How open are you to move accounting further down? My thoughts
> >are along the lines of incrementing the contrack counters at the qdisc
> >level. Since you transport after the structure has been deleted, it
> >should work out fine and fair billing will be taken care of.
> >  
> >
> accounting when the packet goes to the wire would mean at the
> dequeue level?

I mean it is grabbed from the qdisc and a DMA of the packet is
attempted.

> besides, as harald sais, grabbing a lock every time a packet is sent and
> not only when a packet is dropped... this would also imply from my
> particular viewpoint, that when enqueing a packet we will not yet know how
> much its flow/connection has been billed till know. we'll know that, only
> once the previous packet will have been dequeued. and for me it would
> be _much_ more cpu-intensive to behave consequently to that information if
> i get it that late :''''(

I have been thinking about the lock cost. I believe the cost of using
a stats lock at the qdisc level is the same as what you have currently
with unbilling.

> this because it would force me to have more complexity in the enqueue
> operation, that in the scheduler i'm trying to write does need to have that
> information to put packets correctly into the queue.

Ok, now you mention the other piece. What are you trying to do on said
qdisc?

> i think that in that case, i'd better duplicate the work and account that
> information on my own... the speedup i'd get would be definitely worth
> having twice the same info... even though that would not be elegant at
> all... :(

Explain what your qdisc is doing.

cheers,
jamal


* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-17 11:52         ` jamal
@ 2004-08-17 13:40           ` sandr8
  2004-08-22 15:17             ` Billing 1: WAS (Re: " jamal
                               ` (2 more replies)
  2004-08-17 13:49           ` [PATCH 2/4] deferred drop, __parent workaround, reshape_fail sandr8
  1 sibling, 3 replies; 30+ messages in thread
From: sandr8 @ 2004-08-17 13:40 UTC (permalink / raw)
  To: hadi; +Cc: laforge, devik, netdev, netfilter-devel

jamal wrote:

>It is used all over the stack. Lets defer this part of the 
>discussion - even if we never "fix" this it doesnt matter.
>  
>
sorry, i meant the two new inline functions

>>i was wondering if it would be
>>less nasty to have a device enqueue operation that would
>>interact with the outermost qdisc enqueue (wrap it and do
>>something around it)... this could give a good
>>abstraction to answer question 2 as well...
>>    
>>
>
>I am not sure i followed.
>  
>
something like enqueue(dev) that would indirectly call dev->qdisc->enqueue
and handle, in that single place, the stuff that does not fit well in
net/core/dev.c
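
for instance, something like this (purely illustrative; the wrapper and
handle_enqueue_result() are hypothetical names, not part of the patch,
and i use the stock enqueue signature for simplicity):

        /* Hypothetical device-level enqueue wrapper: one single place
         * around the root qdisc's enqueue where drop accounting,
         * unbilling hooks, etc. could live. */
        static inline int dev_enqueue(struct sk_buff *skb,
                                      struct net_device *dev)
        {
                int rc = dev->qdisc->enqueue(skb, dev->qdisc);

                /* in the deferred-drop scheme the skb is still valid
                 * here, so this spot can centralize the handling */
                if (rc != NET_XMIT_SUCCESS)
                        handle_enqueue_result(skb, dev, rc); /* hypothetical */
                return rc;
        }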

>Dropping packets at the policer is policy definition. Dropping packets
>at the qdisc due to full queue is an accident. An accident that in
>a good system shouldnt happen.
>
why should it not happen in a good system?
it is an accident that is a symptom of something. when we
encounter that accident we detect that "symptom" at the
scheduler. the way the scheduler reacts to that symptom
is imho part of the policy. i'm somehow advocating that
the policer is something more than the mere filter: it is the
filter + that part of the scheduler that decides what to drop...
from that viewpoint there is no big difference between
the filter drop and the "accidental drop" that is performed,
nevertheless, in compliance with a given policy.

>For the accident part i agree with
>the unbilling/recompensation feature.
>  
>
why not in the other case? :'''(
well, since later on you ask me what i have
in mind, it will be clearer there why i
personally would need it in any case.

>Yes, this is a hard question. Did you see the suggestion i proposed
>to Harald?
>  
>
if it is the centralization of the stats with the reason code that,
as far as the ACCT is concerned, says whether to bill or unbill, i
think it is _really_ great :)
still, as far as the multiple-interface delivery of the
same packet is concerned, i don't see how it would be solved...

would there be any magic to have some conntrack data per device
without having to execute the actual tracking twice, but without locking
the whole conntrack either? what could be the "magic" to let the
conntrack do the hard work just once and handle the additional
traffic-policing information separately, in another data structure that
is maintained on a per-device basis? that could also be the place to
count how much a given flow is backlogged on a given interface... which
could help in choosing the dropping action... sorry, am i going too far?

>I mean it is grabbed from the qdisc and a DMA of the packet is
>attempted.
>  
>
so, after (or maybe better to say: while :) the qdisc is run and
dequeues the packet.

well, your approach seems to be the most coherent one...

>I believe the cost of using
>stats lock at qdisc is the same as what you have currently with
>unbilling.
>  
>
you mean having a fine-grained lock just for the stats?

>>this because it would force me to have more complexity in the enqueue
>>operation, which, in the scheduler i'm trying to write, needs that
>>information to put packets correctly into the queue.
>>    
>>
>
>Ok, now you mention the other piece. What are you trying to do on said
>qdisc?
>  
>
it is not ready, but to put it shortly, i'm trying to serve first
whoever has been _served_ the least.

from the first experiments i have made, this behaves pretty well and
smoothly, but i've noticed that _not_ unbilling can be pretty unfair
towards udp flows, since they always keep sending.

>>i think that in that case, i'd better duplicate the work and account that
>>information on my own... the speedup i'd get would definitely be worth
>>keeping the same info twice... even though that would not be elegant at
>>all... :(
>>    
>>
>
>Explain what your qdisc is doing.
>  
>
it simply has a priority queue whose dequeue order is maintained
sorted on the attained service.
if no drop occurs, then accounting before enqueueing simply forecasts
the service that will have been attained, up to the packet currently
being enqueued, by the time it is dequeued.
[ much easier to code than to say... a sketch is below ]
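
something like this, purely as an illustration (the flow lookup and all
names are made up, not the actual qdisc):

        /* Sketch of "serve first whoever was served least": each flow
         * carries an attained-service counter, and enqueue pre-bills
         * the flow with the service it will have attained once this
         * packet is dequeued. flow_classify() is a hypothetical helper. */
        struct flow {
                u64 attained;           /* forecast bytes served */
                struct sk_buff_head q;
        };

        static int lss_enqueue(struct sk_buff *skb, struct Qdisc *sch)
        {
                struct flow *f = flow_classify(skb, sch); /* hypothetical */

                if (sch->q.qlen >= sch->dev->tx_queue_len) {
                        sch->stats.drops++;
                        return NET_XMIT_DROP;
                }
                f->attained += skb->len;        /* the forecast */
                __skb_queue_tail(&f->q, skb);
                sch->q.qlen++;
                /* dequeue (not shown) always picks the flow with the
                 * smallest "attained" value */
                return NET_XMIT_SUCCESS;
        }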

>cheers,
>jamal
>
ciao ;)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Billing 1: WAS (Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-17 13:40           ` [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com , sandr8
@ 2004-08-22 15:17             ` jamal
  2004-08-23  9:33               ` sandr8
  2004-08-24 18:38               ` Harald Welte
  2004-08-22 15:38             ` Billing 2: WAS(Re: " jamal
  2004-08-22 16:12             ` Billing 3: " jamal
  2 siblings, 2 replies; 30+ messages in thread
From: jamal @ 2004-08-22 15:17 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, devik, netdev, netfilter-devel

Apologies for latency - was busy at (my real) work.

Let me break this email into several ones since it is getting long.

On Tue, 2004-08-17 at 09:40, sandr8 wrote:

> something like enqueue(dev) that will indirectly call dev->qdisc->enqueue
> and handle in that single place that stuff that does not fit well in 
> net/core/dev.c

Enqueue of the _root_ qdisc is the place to do it.
Maybe even have dev.c call into it. Let's defer this to the next email.

Let me say this:
I am happy with Harald's billing patch, which is already in, as is.
In other words, although there is an accounting discrepancy, it is not
that big.
What does that mean? unbilling is not something to rush in and patch in
if it's going to have an impact on other pieces. It doesn't matter, as
far as i am concerned, whether it goes in in 2.6.20 or doesn't go in at
all. However, this shouldn't discourage you, because you have actually
opened an issue we need to resolve. So please keep up the discussions.

cheers,
jamal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Billing 2: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-17 13:40           ` [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com , sandr8
  2004-08-22 15:17             ` Billing 1: WAS (Re: " jamal
@ 2004-08-22 15:38             ` jamal
  2004-08-22 16:12             ` Billing 3: " jamal
  2 siblings, 0 replies; 30+ messages in thread
From: jamal @ 2004-08-22 15:38 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, devik, netdev, netfilter-devel

On Tue, 2004-08-17 at 09:40, sandr8 wrote:
> jamal wrote:

> >Dropping packets at the policer is policy definition. Dropping packets
> >at the qdisc due to full queue is an accident. An accident that in
> >a good system shouldnt happen.
> >
> why it should not happen in a good system?
> it is an accident that is a sympthom of something. when we
> encounter that accident we detect that "sympthom" at the
> scheduler. the way the scheduler reacts to that sympthom
> is imho part of the policy. 

By "good system" i mean one with no resource bottlenecks or one that
deals elegantly deals with bottlenecks.

1) A packet that is dropped because a queue is found to be full on
enqueue() is exception handling. Typically such a drop is the result of
a bottleneck - in this specific case, more than likely a bus overload
(at least where i have seen it, 100% of the time). My talk at SUCON will
touch on some experiences in this area.
In such a case you want the top layer to treat such behavior differently,
perhaps by backing off for a short while - i.e. this is internal
resource congestion. That's why TCP needs to be lied to with _BYPASS.

2) I see that as different from policing or firewalling, where i have
actually managed the resource myself and told the system to drop the
packet.

Essentially in #1 i am providing more intelligence to TCP so it can
do smarter congestion control than it would have done otherwise
(using whatever the standard congestion algorithms ask of it). This is
extremely valuable (and i think Linux is the only OS which does this).
In #2 i don't want TCP to bypass its standard congestion control
algorithm; at some point it discovers the loss and thinks there's
congestion on the network - we just don't help.

Not sure if that makes sense.
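
To make the two cases concrete, here is a sketch of how a caller of
enqueue() could tell them apart (the switch is illustrative, not the
actual dev.c code; the NET_XMIT_* names are the 2.6-era return codes):

        switch (q->enqueue(skb, q)) {
        case NET_XMIT_SUCCESS:
                break;          /* queued, will hit the wire */
        case NET_XMIT_DROP:
                /* case #1: full queue, an internal bottleneck - a
                 * local sender may back off for a short while */
                break;
        case NET_XMIT_POLICED:
                /* case #2: policy drop - no special hint; standard
                 * congestion control will discover the loss itself */
                break;
        case NET_XMIT_BYPASS:
                /* reported upward as success, i.e. how TCP gets
                 * "lied to" */
                break;
        }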

> i'm somehow advocating that
> the policer is something more than the mere filter, but the
> filter + that part of the scheduler that decides what to drop...
> from that viewpoint there is no big difference between
> the filter drop and the "accidental drop" performed
> nevertheless in compliance with a given policy.

Refer to what i said above.


cheers,
jamal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Billing 3: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-17 13:40           ` [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com , sandr8
  2004-08-22 15:17             ` Billing 1: WAS (Re: " jamal
  2004-08-22 15:38             ` Billing 2: WAS(Re: " jamal
@ 2004-08-22 16:12             ` jamal
  2004-08-23  9:39               ` sandr8
  2004-08-24 18:46               ` Billing 3: " Harald Welte
  2 siblings, 2 replies; 30+ messages in thread
From: jamal @ 2004-08-22 16:12 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, devik, netdev, netfilter-devel

On Tue, 2004-08-17 at 09:40, sandr8 wrote:
> jamal wrote:

> 
> >Yes, this is a hard question. Did you see the suggestion i proposed
> >to Harald?
> >  
> >
> if it is the centralization of the stats with the reason code that,
> as far as the ACCT is concerned, says whether to bill or unbill, i
> think it is _really_ great :)
> still, as far as the multiple-interface delivery of the
> same packet is concerned, i don't see how it would be solved...

Such packets are cloned or copied. I am going to assume the conntrack
data remains intact in both cases. LaForge?
BTW, although i mentioned the multiple interfaces as an issue - thinking
a little more, i see retransmissions from TCP (when enqueue drops
because of a full queue) being a problem as well.
Harald's patch bills in that case too, for each retransmitted packet
that gets dropped because of a full Q.
So the best place to really unbill is at the qdisc level.
The only place for now where i see that happening is in the case of
drops, i.e. sch->stats.drops++
The dev.c area after the enqueue() attempt is a more dangerous place to
do it (in case the skb no longer exists because something freed it when
enqueue was called; also because that's one area open for returning
more intelligent congestion-level-indicating codes).

> would there be any magic to have some conntrack data per device
> without having to execute the actual tracking twice but without locking
> the whole conntrack either? 

That is the challenge at the moment.
For starters, i don't see it as an issue right now to do locking.
It's a cost of the feature, since Harald's patch is in.

In the future we should make accounting a feature that could be turned
on independently of conntracking, and skbs should carry accounting
metadata with them.

> what could be the "magic" to let the
> conntrack do the hard work just once and handle the additional
> traffic-policing information separately, in another data structure
> that is maintained on a per-device basis? that could also be the place
> to count how much a given flow is backlogged on a given interface...
> which could help in choosing the dropping action... sorry, am i going
> too far?

No, i think your idea is valuable.
The challenge is: say you have a million connections, do you then
have a million locks (one per structure)? I think we could reduce that
by having a pool of stats share a lock (maybe by placing them in a
shared hash table with each bucket having a lock).
You can't have too many locks and you can't have too few ;->
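
A minimal sketch of that compromise (all names and sizes here are made
up, just to show the shape of it):

        /* A pool of per-flow stats sharing per-bucket locks; nothing
         * here is existing kernel code. */
        #define ACCT_HASH_SIZE 1024

        struct flow_stats {
                u32 flow_id;            /* e.g. the conntrack mark */
                u64 bytes;
                u32 packets;
                struct hlist_node node;
        };

        struct acct_bucket {
                spinlock_t lock;        /* shared by all flows in bucket */
                struct hlist_head flows;
        };

        static struct acct_bucket acct_table[ACCT_HASH_SIZE];

        static void acct_bill(u32 flow_id, unsigned int len)
        {
                struct acct_bucket *b =
                        &acct_table[flow_id % ACCT_HASH_SIZE];
                struct flow_stats *fs;

                spin_lock_bh(&b->lock);
                fs = acct_find(b, flow_id);     /* hypothetical lookup */
                fs->bytes += len;
                fs->packets++;
                spin_unlock_bh(&b->lock);
        }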


On your qdisc you say:

> >
> it is not ready, but to put it shortly, i'm trying to serve first
> whoever has been _served_ the least.
> 
> from the first experiments i have made, this behaves pretty well and
> smoothly, but i've noticed that _not_ unbilling can be pretty unfair
> towards udp flows, since they always keep sending.

If the qdisc drops on a full Q and unbills, i think it should work, no?
If it drops because they abused a bandwidth level, shouldn't you punish
them still? I think you should, but your mileage may vary.
Note you also don't want to unbill more than once. To avoid that, maybe
you can introduce something on the skb to indicate unbilling has
happened (if done by the policer) so the root qdisc doesn't unbill again.

I think the issue starts with defining what resource is being accounted
for. In my view, you are accounting for both CPU and bandwidth.
Let's start by asking: what is the resource being accounted for?


> it simply has a priority queue whose dequeue order is maintained
> sorted on the attained service.
> if no drop occurs, then accounting before enqueueing simply forecasts
> the service that will have been attained, up to the packet currently
> being enqueued, by the time it is dequeued.
> [ much easier to code than to say... ]

I think i understand.
A packet that gets enqueued is _guaranteed_ to be transmitted unless
overruled by admin policy.
Ok, how about the idea of adding skb->unbilled, which gets set when
unbilling happens (in the aggregated stats_incr())? skb->unbilled gets
zeroed at the root qdisc after return from enqueueing.
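
Roughly like this (skb->unbilled and stats_incr() are proposals from
this thread, not existing kernel API, and unbill_flow() is a made-up
helper):

        /* Only the first component reporting a drop gets to unbill;
         * later stats_incr() calls for the same skb just count. */
        static void stats_incr(struct sk_buff *skb, int reason)
        {
                if (reason == NET_XMIT_DROP && !skb->unbilled) {
                        unbill_flow(skb);   /* subtract from ct_acct */
                        skb->unbilled = 1;
                }
                /* each component still bumps its own private stats */
        }

        /* at the root qdisc, after enqueue() returns (and only if the
         * skb is still valid there), skb->unbilled is zeroed */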

cheers,
jamal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 1: WAS (Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-22 15:17             ` Billing 1: WAS (Re: " jamal
@ 2004-08-23  9:33               ` sandr8
  2004-08-24 18:38               ` Harald Welte
  1 sibling, 0 replies; 30+ messages in thread
From: sandr8 @ 2004-08-23  9:33 UTC (permalink / raw)
  To: hadi; +Cc: Harald Welte, devik, netdev, netfilter-devel

jamal wrote:

>Apologies for latency - was busy at (my real) work.
>  
>
you don't have to apologize for that :) it's something
everybody can easily understand :)

>Let me break this email into several ones since it is getting long.
>
>On Tue, 2004-08-17 at 09:40, sandr8 wrote:
>  
>
>>something like enqueue(dev) that will indirectly call dev->qdisc->enqueue
>>and handle in that single place that stuff that does not fit well in 
>>net/core/dev.c
>>    
>>
>
>Enqueue of _root_ qdisc is the place to do it.
>Maybe even dev.c calls to it. Lets defer this to the next email.
>  
>
then i should add some code to every qdisc that can
be root, and execute it only if it is actually root?

wouldn't that be more complex?

>Let me say this:
>I am happy with Harald's billing patch, which is already in, as is.
>In other words, although there is an accounting discrepancy, it is not
>that big.
>What does that mean? unbilling is not something to rush in and patch in
>if it's going to have an impact on other pieces. It doesn't matter, as
>far as i am concerned, whether it goes in in 2.6.20 or doesn't go in at
>all. However, this shouldn't discourage you, because you have actually
>opened an issue we need to resolve. So please keep up the discussions.
>  
>
sure, for the time being i thought i'd simply concentrate
on the rest and keep it as independent as possible
from whatever choice is made.

cheers
alessandro

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 3: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-22 16:12             ` Billing 3: " jamal
@ 2004-08-23  9:39               ` sandr8
  2004-08-23 11:38                 ` Billing 3-1: " jamal
                                   ` (2 more replies)
  2004-08-24 18:46               ` Billing 3: " Harald Welte
  1 sibling, 3 replies; 30+ messages in thread
From: sandr8 @ 2004-08-23  9:39 UTC (permalink / raw)
  To: hadi; +Cc: Harald Welte, devik, netdev, netfilter-devel

jamal wrote:

>On Tue, 2004-08-17 at 09:40, sandr8 wrote:
>  
>
>>jamal wrote:
>>    
>>
>
>  
>
>>>Yes, this is a hard question. Did you see the suggestion i proposed
>>>to Harald?
>>>
>>if it is the centralization of the stats with the reason code that,
>>for what concerns the ACCT, says wheter to bill or unbill i
>>think it is _really_ great :)
>>still, for what concerns the multiple interface delivery of the
>>same packet i don't see how it would be solved...
>>    
>>
>
>Such packets are cloned or copied. I am going to assume the contrack
>data remains intact in both cases. LaForge?
>BTW, although i mentioned the multiple interfaces as an issue - thinking
>a little more i see retransmissions from TCP as well (when enqueue drops
>because of full queue) being a problem.
>  
>
imho, the point maybe is that a scheduler should work at layer 3,
am i wrong?
i mean: i asked myself the same question and answered that
the right level should be the third... this would account for tcp
retransmissions as well as forward error correction packets
added by some application on top of udp, or retransmissions by
applications on udp... or... whatever...

maybe my reasoning is less foggy if i first answer another
question:

>I think the issue starts with defining what resource is being accounted
>for. In my view, you are accounting for both CPU and bandwidth.
>Let's start by asking: what is the resource being accounted for?
>  
>
i would like to account for the number of bytes sent to the wire
on behalf of each flow :)

>Harald's patch bills in that case too, for each retransmitted packet
>that gets dropped because of a full Q.
>So the best place to really unbill is at the qdisc level.
>The only place for now where i see that happening is in the case of
>drops, i.e. sch->stats.drops++
>The dev.c area after the enqueue() attempt is a more dangerous place to
>do it (in case the skb no longer exists because something freed it when
>enqueue was called; also because that's one area open for returning
>more intelligent congestion-level-indicating codes).
>  
>
this seems not to be possible, afaik, with that patch i wrote. the only
skbs that are freed are those that are internally allocated. and the
only kfree_skb() that can happen on skbs that are enqueued in dev.c
should be those in the case of a TC_ACT_QUEUED or TC_ACT_STOLEN,
where they should just decrement the user counter. i say "should" since
this is the most reasonable assumption i managed to make, but
this is your field and you definitely know it much better than i do :)
in that case, btw, dev.c doesn't get any drop return code...

if a drop return code is given, the packet is not freed internally, but
only "externally".  (for the "where"... the question is open in "billing 1")

where could a skb be freed then?

[ i'm not insisting on that patch, i'm just trying to say that, unless
i'm raving, it should not be dangerous to do that after the enqueue()...
it's just that for the moment i can't imagine a different
way to do things in that place :) yes, there could be a slight
variation with a skb_get() right before the enqueue and all the
kfree_skb() calls at their place inside the code, but then somewhere
we would always have to add a kfree_skb... ouch... and we would need
a by-ref skb anyway to get the packet that has been dropped,
and if it's not the same as the enqueued one, the enqueued
one would also have to pass through one more kfree_skb()... horrible,
more complex and slower, i'd say... ]

>>would there be any magic to have some conntrack data per device
>>without having to execute the actual tracking twice but without locking
>>the whole conntrack either? 
>>    
>>
>
>That is the challenge at the moment.
>For starters i dont see it as an issue right now to do locking.
>Its a cost for the feature since Haralds patch is in. 
>
>In the future we should make accounting a feature that could be turned
>on despite contracking and skbs should carry an accounting metadata with
>them. 
>  
>
i need to think about it thoroughly... depending on where that
information is kept, the complexity of some operations can change a
lot... and i should not only selfishly think of the operations i need,
but look at it from the outside to have a less partisan viewpoint on
the problem and find the most generally good solution possible.

>>what could be the "magic" to let the
>>conntrack do the hard work just once and handle the additional
>>traffic-policing information separately, in another data structure
>>that is maintained on a per-device basis? that could also be the place
>>to count how much a given flow is backlogged on a given interface...
>>which could help in choosing the dropping action... sorry, am i going
>>too far?
>>    
>>
>No, i think your idea is valuable.
>The challenge is: say you have a million connections, do you then
>have a million locks (one per structure)? I think we could reduce that
>by having a pool of stats share a lock (maybe by placing them in a
>shared hash table with each bucket having a lock).
>  
>
yeah, that could be the right compromise :)

>You can't have too many locks and you can't have too few ;->
>
>
>On your qdisc you say:
>
>>it is not ready, but to put it shortly, i'm trying to serve first
>>whoever has been _served_ the least.
>>
>>from the first experiments i have made, this behaves pretty well and
>>smoothly, but i've noticed that _not_ unbilling can be pretty unfair
>>towards udp flows, since they always keep sending.
>>    
>>
>
>If qdisc drops on full Q and unbills i think it should work, no?
>  
>
this is the case. i could do it on my own from inside my code, but then
i would "pollute" the information seen by other parts of the kernel code
and i would introduce a _new_ unfairness between those flows that pass
through my qdisc and those that don't... to sum it up... it would be
pretty unclean

>If it drops because they abused a bandwidth level, shouldn't you punish
>them still? I think you should, but your mileage may vary.
>Note you also don't want to unbill more than once. To avoid that, maybe
>you can introduce something on the skb to indicate unbilling has
>happened (if done by the policer) so the root qdisc doesn't unbill again.
>  
>
you are thinking in that perspective because of tcp? as i said above, i
would stop at layer 3...
btw, if i don't misunderstand what you mean, i guess it's when tcp is
retransmitting that that field should somehow be set... is it as
feasible when we are not an end-point as when we are an end-point? btw,
we should then do the same with the other protocols and, for example
with udp, it would become application-dependent... a suicide?

>>it simply has a priority queue whose dequeue order is maintained
>>sorted on the attained service.
>>if no drop occurs, then accounting before enqueueing simply forecasts
>>the service that will have been attained, up to the packet currently
>>being enqueued, by the time it is dequeued.
>>[ much easier to code than to say... ]
>>    
>>
>
>I think i understand.
>A packet that gets enqueued is _guaranteed_ to be transmitted unless
>overruled by admin policy.
>Ok, how about the idea of adding skb->unbilled which gets set when
>unbilling happens (in the aggregated stats_incr()). skb->unbilled gets
>zeroed at the root qdisc after return from enqueueing.
>  
>
sorry?? i'm lost... maybe there's something implied i can't get...
do you agree it's not the same skb that will be re-billed
afterwards?

>cheers,
>jamal
>
ciao
Alessandro :)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 3-1: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-23  9:39               ` sandr8
@ 2004-08-23 11:38                 ` jamal
  2004-08-23 12:04                   ` sandr8
  2004-08-23 11:58                 ` Billing 3: " jamal
  2004-08-23 12:15                 ` Billing 3-3: " jamal
  2 siblings, 1 reply; 30+ messages in thread
From: jamal @ 2004-08-23 11:38 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, devik, netdev, netfilter-devel

On Mon, 2004-08-23 at 05:39, sandr8 wrote:
> jamal wrote:
> 
> >On Tue, 2004-08-17 at 09:40, sandr8 wrote:
> >  
> >
> >Such packets are cloned or copied. I am going to assume the conntrack
> >data remains intact in both cases. LaForge?
> >BTW, although i mentioned the multiple interfaces as an issue - thinking
> >a little more, i see retransmissions from TCP (when enqueue drops
> >because of a full queue) being a problem as well.
> >  
> >
> imho, the point maybe is that a scheduler should work at layer 3,
> am i wrong?
> i mean: i asked myself the same question and answered that
> the right level should be the third... this would account for tcp
> retransmissions as well as forward error correction packets
> added by some application on top of udp, or retransmissions by
> applications on udp... or... whatever...

Maybe state ought to be carried somehow from where the packet "starts"
(e.g. application/L4 level) or from the ingress of the system. This may
complicate things - it will need a lot more thinking (but is worth
thinking about).

> maybe my reasoning is less foggy if i first answer to an other
> question:
> 
> >I think the issue starts with defining what resource is being accounted
> >for. In my view, you are accounting for both CPU and bandwidth.
> >Let's start by asking: what is the resource being accounted for?
> >  
> >
> i would like to account for the number of bytes sent to the wire
> on behalf of each flow :)

Ok, in this case, retransmissions have to be unbilled.
To rewind to what i said a few emails ago:
The best place to bill is by looking at what comes out of the box ;->
Ok, we don't have that luxury in this case. So the next best place
is to do it at the qdisc level, because only at that level do you
know for sure whether packets made it out or not.
Since conntracking already does the job of marking the flow,
that's the second part of your requirement "on behalf of each flow".
What we are doing now is hacking around to try and reduce the injustice.

Conclusion: the current way of billing is _wrong_. The better way is to
have conntracking just mark, and the qdisc decide on billing or
unbilling. Have a billing table somewhere, indexed by flow, that
increments these stats.

For now i think that focusing on just sch->stats.drops++ in the case of
a full queue will help.

Let me cut email here for readability.

cheers,
jamal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 3: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-23  9:39               ` sandr8
  2004-08-23 11:38                 ` Billing 3-1: " jamal
@ 2004-08-23 11:58                 ` jamal
  2004-08-23 12:27                   ` sandr8
  2004-08-23 12:15                 ` Billing 3-3: " jamal
  2 siblings, 1 reply; 30+ messages in thread
From: jamal @ 2004-08-23 11:58 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, devik, netdev, netfilter-devel

On Mon, 2004-08-23 at 05:39, sandr8 wrote:
> jamal wrote:

> >Harald's patch bills in that case too, for each retransmitted packet
> >that gets dropped because of a full Q.
> >So the best place to really unbill is at the qdisc level.
> >The only place for now where i see that happening is in the case of
> >drops, i.e. sch->stats.drops++
> >The dev.c area after the enqueue() attempt is a more dangerous place to
> >do it (in case the skb no longer exists because something freed it when
> >enqueue was called; also because that's one area open for returning
> >more intelligent congestion-level-indicating codes).
> >  
> >
> this seems not to be possible, afaik, with that patch i wrote. the only
> skbs that are freed are those that are internally allocated. and the
> only kfree_skb() that can happen on skbs that are enqueued in dev.c
> should be those in the case of a TC_ACT_QUEUED or TC_ACT_STOLEN,
> where they should just decrement the user counter. i say "should" since
> this is the most reasonable assumption i managed to make, but
> this is your field and you definitely know it much better than i do :)
> in that case, btw, dev.c doesn't get any drop return code...
> 
> if a drop return code is given, the packet is not freed internally, but
> only "externally".  (for the "where"... the question is open in "billing 1")
> 
> where could a skb be freed then?

Let me lay out a few things:

    ..-->classification --> tcaction
return code1 --> enqueue
return code2 ...(packet may be freed here)--> dev.c

The rules are:
1) A qdisc receiving a STOLEN/QUEUED/SHOT signal from the classification
result (return code 1) MUST free the packet and immediately stop
processing that packet.
The infrastructure code will clone packets if it wants to steal or
queue them. In fact the skb flow may even change during this path (e.g.
the packet is rewritten).
Billing issues: in such a case multiple packets may later be reinjected
(after replicating the one that was stolen), which totally bypasses the
conntracking code. Multiple packets may make it out to the wire.
You need to find a way to bill them ;-> (the opposite of what you are
trying to do).

2) The return code of the qdisc's enqueue function (return code 2)
needs to be handled with care if the packet is locally generated, so
it doesn't confuse TCP (SCTP and DCCP sound like two other protocols
that could be added).

Here we have two paths:
i) policy-dropped packets.
Billing issues: you want to unbill dropped but not stolen packets.

ii) packets that are dropped because of a full Q should continue to
return an XMIT_DROP.
Billing: MUST unbill.

Again, as the above shows, billing would work better at the qdisc level.

> [ i'm not insisting with that patch, i'm just trying to say that, if i don't
> rave, it should not be dangerous to do that after the enqueue()...
> then, it's just that for the moment i can't immagine a different
> way to do things in that place :) yes, there could be a slight
> variation with a skb_get() right before the enqueue and all the
> kfree_skb() at their place inside the code, but then somewhere
> we should always add a kfree_skb... ouch... and we would need
> a by ref skb anyway to get the packet that has been dropped
> and if it's not the same as the enqueued one also the enqueued
> one should pass through one more kfree_skb()... horrible, more
> complex and slower i'd say... ]

Linux is being efficient by sharing skbs. One of the most expensive
things in a stack is copying packets around (which is avoided).
If this were a simple system where allocating and freeing could be
mapped exactly to a flow, then what you suggest could be done. In Linux
you can't.

> >>
> >That is the challenge at the moment.
> >For starters, i don't see it as an issue right now to do locking.
> >It's a cost of the feature, since Harald's patch is in.
> >
> >In the future we should make accounting a feature that could be turned
> >on independently of conntracking, and skbs should carry accounting
> >metadata with them.
> >  
> >
> i need to think about it thoroughly... depending on where that
> information is kept, the complexity of some operations can change a
> lot... and i should not only selfishly think of the operations i need,
> but look at it from the outside to have a less partisan viewpoint on
> the problem and find the most generally good solution possible.

I am beginning to think that all conntrack should do is mark the
packets, then let the qdisc take care of things.

Cutting the email here, since there's another topic below and this
email is too long.

cheers,
jamal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 3-1: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-23 11:38                 ` Billing 3-1: " jamal
@ 2004-08-23 12:04                   ` sandr8
  2004-08-23 12:31                     ` jamal
  0 siblings, 1 reply; 30+ messages in thread
From: sandr8 @ 2004-08-23 12:04 UTC (permalink / raw)
  To: hadi; +Cc: Harald Welte, devik, netdev, netfilter-devel

jamal wrote:

>On Mon, 2004-08-23 at 05:39, sandr8 wrote:
>  
>
>>jamal wrote:
>>    
>>
>Ok, in this case, retransmissions have to be unbilled.
>To rewind to what i said a few emails ago:
>The best place to bill is by looking at what comes out of the box;->
>Ok, we dont have that luxury in this case. So the next best place
>is to do it at the qdisc level. Because only at that level do you
>know for sure if packets made it out or not. 
>Since contracking already does the job of marking the flow, then
>thats the second part of your requirement "on behalf of each flow".
>What we are doing now is hacking around to try and reduce the injustice.
>
>Conclusion: The current way of billing is _wrong_. The better way is to
>have contracking just mark and the qdisc decide on billing or unbilling.
>Have a billing table somewhere indexed by flow that increments these
>stats.
>
>For now i think that focussing on just sch.drops++ in case of full 
>queue will help.
>
>Let me cut email here for readability.
>  
>
so, maybe we are saying the same thing but in different words :)

if we blindly look at layer 3 and unbill when a packet is dropped,
then the retransmission is already unbilled :)
it will be billed when it takes place, but the first transmission that
underwent a drop has been unbilled and hence we are square.
this without looking at layer 4.

what i was thinking about was mimicking the conntracking at
a device level, having for each device a singleton object that
has the same buckets as the connection tracking. it could
store a lot of interesting information that would allow queuing
disciplines to better share the pain of drops and also to perform
per-connection head drops instead of connection-unaware
tail drops. [ a sketch of the shape i have in mind is below ]
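
purely as an illustration, something like (none of these names exist
anywhere; DEV_ACCT_HASH is an arbitrary size):

        /* Hypothetical per-device flow table, hashed like conntrack. */
        #define DEV_ACCT_HASH 1024

        struct dev_flow {
                u32 flow_id;    /* same key/mark as the conntrack entry */
                u32 backlogged; /* bytes of this flow queued on this dev */
                u64 sent;       /* bytes of this flow sent on this dev */
                struct hlist_node node;
        };

        struct dev_flow_table {
                struct hlist_head buckets[DEV_ACCT_HASH];
        };

        /* one singleton per net_device; enqueue/dequeue on a device are
         * already serialized by dev->queue_lock, so per-bucket locks
         * might not even be needed here */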

this would improve fairness and shorten the time tcp sources
need to get the feedback, in a better way than random early
drop does.

having this structure at a device level would be an answer
to the issue of packets cloned to multiple interfaces, as we
would be able to perform separate accounting for
each interface (which seems, afaik, reasonable... in most
cases we would account on a single interface, and we should
also likely get fewer hash collisions... no more than in the
centralized conntrack).

furthermore, the per-bucket lock you suggested, which should
be a good compromise, would also not "interfere" between one
interface and another. well... maybe, as long as enqueues
and dequeues on the same device stay serialized (thanks to
dev->queue_lock), we should not need that further lock
either.

does it make sense?

>cheers,
>jamal
>
ciao ciao!
alessandro

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 3-3: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-23  9:39               ` sandr8
  2004-08-23 11:38                 ` Billing 3-1: " jamal
  2004-08-23 11:58                 ` Billing 3: " jamal
@ 2004-08-23 12:15                 ` jamal
  2 siblings, 0 replies; 30+ messages in thread
From: jamal @ 2004-08-23 12:15 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, devik, netdev, netfilter-devel

On Mon, 2004-08-23 at 05:39, sandr8 wrote:
>jamal wrote:

Let me just get to the bottom of your email:

[..]
> >A packet that gets enqueued is _guaranteed_ to be transmitted unless
> >overruled by admin policy. 
> >Ok, how about the idea of adding skb->unbilled which gets set when
> >unbilling happens (in the aggregated stats_incr()). skb->unbilled gets
> >zeroed at the root qdisc after return from enqueueing.
> >  
> >
> sorry?? i'm lost... maybe there's something implied i can't get...
> do you agree it's not the same skb that will be re-billed
> afterwards?

Sure, it's a cloned packet - from a billing perspective we can look at
it as a different flow. Conntrack bills, the qdisc unbills.
What i meant is a packet still flowing within the qdisc infrastructure,
what i drew as:
   ..-->classification --> tcaction
return code1 --> enqueue
return code2 ...(packet may be freed here)--> dev.c

That's the only place where i see skb->unbilled as useful.

So:
all components need to increment their own private stats and should
call stats_incr(), but only the first one gets to unbill;
e.g. if an action calls stats_incr() to indicate a drop, you unbill.
When the qdisc enqueue then calls stats_incr() with a drop reason, you
don't want to unbill again.

We could talk in more detail about code later - just discussing
concepts right now.

cheers,
jamal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 3: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-23 11:58                 ` Billing 3: " jamal
@ 2004-08-23 12:27                   ` sandr8
  2004-08-25 11:34                     ` jamal
  0 siblings, 1 reply; 30+ messages in thread
From: sandr8 @ 2004-08-23 12:27 UTC (permalink / raw)
  To: hadi; +Cc: Harald Welte, devik, netdev, netfilter-devel

jamal wrote:

>Let me lay out a few things:
>
>    ..-->classification --> tcaction
>return code1 --> enqueue
>return code2 ...(packet may be freed here)--> dev.c
>
>The rules are:
>1) A qdisc receiving a STOLEN/QUEUED/SHOT signal from the classification
>result (return code 1) MUST free the packet and immediately stop
>processing that packet.
>The infrastructure code will clone packets if it wants to steal or
>queue them. In fact the skb flow may even change during this path (e.g.
>the packet is rewritten).
>Billing issues: in such a case multiple packets may later be reinjected
>(after replicating the one that was stolen), which totally bypasses the
>conntracking code. Multiple packets may make it out to the wire.
>You need to find a way to bill them ;-> (the opposite of what you are
>trying to do).
>  
>
i don't know the code that does that, since afaik it
was not yet there in 2.6.8... if it is, please tell me
because i'm eager to have a full viewpoint of the
forthcoming packet action framework :)

>2) The return code of the qdisc's enqueue function (return code 2)
>needs to be handled with care if the packet is locally generated, so
>it doesn't confuse TCP (SCTP and DCCP sound like two other protocols
>that could be added).
>
>Here we have two paths:
>i) policy-dropped packets.
>Billing issues: you want to unbill dropped but not stolen packets.
>
>ii) packets that are dropped because of a full Q should continue to
>return an XMIT_DROP.
>Billing: MUST unbill.
>
>Again, as the above shows, billing would work better at the qdisc level.
>  
>
yes, sure...
but for the moment i personally don't know which path
STOLEN and QUEUED packets will follow... it's not
a reproach :) it's just that i simply can't know for the
moment, but i don't want to rush you, i perfectly
understand that you've got plenty of things to do :)

>>[ i'm not insisting on that patch, i'm just trying to say that, unless
>>i'm raving, it should not be dangerous to do that after the enqueue()...
>>it's just that for the moment i can't imagine a different
>>way to do things in that place :) yes, there could be a slight
>>variation with a skb_get() right before the enqueue and all the
>>kfree_skb() calls at their place inside the code, but then somewhere
>>we would always have to add a kfree_skb... ouch... and we would need
>>a by-ref skb anyway to get the packet that has been dropped,
>>and if it's not the same as the enqueued one, the enqueued
>>one would also have to pass through one more kfree_skb()... horrible,
>>more complex and slower, i'd say... ]
>>    
>>
>
>Linux is being efficient by sharing skbs. One of the most expensive
>things in a stack is copying packets around (which is avoided).
>If this were a simple system where allocating and freeing could be
>mapped exactly to a flow, then what you suggest could be done. In Linux
>you can't.
>
>  
>
in that patch i was not copying them... just passing them by reference 
through a
"struct sk_buff ** const skb" parameter.

what is the difference from before...?

calling:
before) when calling we passed a pointer on the stack.
now) when calling we pass a pointer on the stack

returning:
before) when returning we didn't tell the caller which
packet was dropped
now) when returning we tell the caller which packet
was not queued, or was thrown away from our queue to
make room for the packet being enqueued.

what happens on the stack now?

for the calling: more or less the same as before :)
for the returning...

1) the external pointer is const, so it cannot be changed. this
is good to avoid stupid bugs.
2) the external pointer being const, the internal one always
lies in the stack frame of the outermost caller of an enqueue
operation; just the external one is passed to callees. when
a requeue operation is called, that pointer nevertheless stays
in the stack frame of the outermost caller of an enqueue
operation. that is to say: it never moves from the stack of the
caller of dev->qdisc->enqueue()... [in that case it was dev.c,
but maybe it would be nicer to have it in a sort of
dev->enqueue() that would have to do with device-level
conntracking].

additional cost for the function call? z-e-r-o !!! :)
performance issues? maybe some improvement, due
to the elimination of many internal jumps and conditions.

furthermore, telling the caller which packet we chose to
drop allows it to reshape it without the need for every
qdisc to resort to any callback function. [ a sketch of the
convention is below ]
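
to make the convention concrete (the signature is the one proposed in
this thread, not the mainline API, and the caller shown is only
illustrative):

        /* the enqueue operations take the skb by reference... */
        int some_enqueue(struct sk_buff ** const skb, struct Qdisc *sch);

        /* ...so the victim pointer lives in the outermost caller's
         * frame; inner enqueue/requeue calls may swap it for a
         * different packet, and a drop return code then tells this
         * single place exactly which skb to free. */
        static int outer_xmit(struct sk_buff *skb, struct net_device *dev)
        {
                struct sk_buff *victim = skb;
                int rc = dev->qdisc->enqueue(&victim, dev->qdisc);

                if (rc == NET_XMIT_DROP)
                        kfree_skb(victim);
                return rc;
        }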

>>>That is the challenge at the moment.
>>>For starters, i don't see it as an issue right now to do locking.
>>>It's a cost of the feature, since Harald's patch is in.
>>>
>>>In the future we should make accounting a feature that could be turned
>>>on independently of conntracking, and skbs should carry accounting
>>>metadata with them.
>>> 
>>>
>>>      
>>>
>>i need to think about it thoroughly... depending on where that
>>information is kept, the complexity of some operations can change a
>>lot... and i should not only selfishly think of the operations i need,
>>but look at it from the outside to have a less partisan viewpoint on
>>the problem and find the most generally good solution possible.
>>    
>>
>
>I am beginning to think that all conntrack should do is mark the
>packets, then let the qdisc take care of things.
>
>Cutting email here, since another topic below and this email is too
>long.
>
>cheers,
>jamal
>  
>
ciao :)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 3-1: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-23 12:04                   ` sandr8
@ 2004-08-23 12:31                     ` jamal
  0 siblings, 0 replies; 30+ messages in thread
From: jamal @ 2004-08-23 12:31 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, devik, netdev, netfilter-devel

On Mon, 2004-08-23 at 08:04, sandr8 wrote:
> jamal wrote:
> 

> so, maybe we are saying the same thing but in different words :)

Probably ;->

> if we blindly look at layer 3 and unbill when a packet is dropped,
> then the retransmission is already unbilled :)
> it will be billed when it takes place, but the first transmission that
> underwent a drop has been unbilled and hence we are square.
> this without looking at layer 4.

Agreed.

> what i was thinking about was mimicking the conntracking at
> a device level, having for each device a singleton object that
> has the same buckets as the connection tracking. it could
> store a lot of interesting information that would allow queuing
> disciplines to better share the pain of drops and also to perform
> per-connection head drops instead of connection-unaware
> tail drops.

This connection exists today in the form of marking. Let conntracking
mark, then use the fw classifier at the qdisc. If i am not mistaken,
there's something more powerful these days called connmark.
Now if you did it like that, then you don't need Harald's code and it
will be much easier to maintain (refer to my earlier email).

> this would improve fairness and shorten the time tcp sources
> need to get the feedback, in a better way than random early
> drop does.

Conntracking is a horrible performance pig. I don't even turn it on.
Don't make it a requirement to turn it on.

> having this structure at a device level would be an answer
> to the issue of packets cloned to multiple interfaces, as we
> would be able to perform separate accounting for
> each interface (which seems, afaik, reasonable... in most
> cases we would account on a single interface, and we should
> also likely get fewer hash collisions... no more than in the
> centralized conntrack).

Here's a thought:
Make it a tc action.
--> netfilter: conntrack --> connmark
--> qdisc: classify via fw -> billing action bill
-------->                      attach table index + lock to skb
-------->   deal with any unfairness on enqueue by using the index
-------->   and fine-grained lock found in the skb

on skb freeing, make sure you delete the index and lock.
On cloning, what to do?

Note i do plan to have a conntracking action at the qdisc level.
Let me know if that is useful to you, then i can prioritize.

> furthermore, the per-bucket lock you suggested, which should
> be a good compromise, would also not "interfere" between one
> interface and another. well... maybe, as long as enqueues
> and dequeues on the same device stay serialized (thanks to
> dev->queue_lock), we should not need that further lock
> either.

Possibly yes.

> does it make sense?

I think it does. Make sure you double check the rules i posted earlier.

cheers,
jamal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 1: WAS (Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-22 15:17             ` Billing 1: WAS (Re: " jamal
  2004-08-23  9:33               ` sandr8
@ 2004-08-24 18:38               ` Harald Welte
  1 sibling, 0 replies; 30+ messages in thread
From: Harald Welte @ 2004-08-24 18:38 UTC (permalink / raw)
  To: jamal; +Cc: sandr8, devik, netdev, netfilter-devel

On Sun, Aug 22, 2004 at 11:17:15AM -0400, jamal wrote:
> Apologies for latency - was busy at (my real) work.

Same for me, sorry.

> Let me say this:
> I am happy with Harald's billing patch, which is already in, as is.
> In other words, although there is an accounting discrepancy, it is not
> that big.
> What does that mean? unbilling is not something to rush in and patch in
> if it's going to have an impact on other pieces. It doesn't matter, as
> far as i am concerned, whether it goes in in 2.6.20 or doesn't go in at
> all.

Same for me, although I would have used a less explicit wording ;)

> cheers,
> jamal

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Billing 3: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-22 16:12             ` Billing 3: " jamal
  2004-08-23  9:39               ` sandr8
@ 2004-08-24 18:46               ` Harald Welte
  2004-08-25 11:50                 ` jamal
  1 sibling, 1 reply; 30+ messages in thread
From: Harald Welte @ 2004-08-24 18:46 UTC (permalink / raw)
  To: jamal; +Cc: sandr8, devik, netdev, netfilter-devel

On Sun, Aug 22, 2004 at 12:12:04PM -0400, jamal wrote:
> On Tue, 2004-08-17 at 09:40, sandr8 wrote:
> > jamal wrote:
> 
> > 
> > >Yes, this is a hard question. Did you see the suggestion i proposed
> > >to Harald?
> > >  
> > >
> > if it is the centralization of the stats with the reason code that,
> > as far as the ACCT is concerned, says whether to bill or unbill, i
> > think it is _really_ great :)
> > still, as far as the multiple-interface delivery of the
> > same packet is concerned, i don't see how it would be solved...
> 
> Such packets are cloned or copied. I am going to assume the conntrack
> data remains intact in both cases. LaForge?

Yes. But it is still a question of viewpoint which kind of behaviour is
correct.  Let's say a single packet is accounted in ct_acct, and then
sent to multiple interfaces, where on more than one of them it gets
unbilled.  So we add once, but 'unbill' (i prefer the term subtract)
more than once.  In the end the ct_acct counter will be lower than it
was when it first encountered the packet.

> In the future we should make accounting a feature that could be turned
> on independently of conntracking, and skbs should carry accounting
> metadata with them.

I don't really understand what you want to say.  You want accounting
that is not conntrack-based?  well, then you should maybe look at one
of the many methods, ranging from iptables rule counters to the
ipt_acct match/target, nacctd, pcap-mmap, PF_RING, ULOG-based, ...

btw, they all account the amount of RX packets on the inbound interface
and do not 'unbill' ;)

-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-16 13:29         ` jamal
@ 2004-08-24 18:57           ` Harald Welte
  2004-08-25 12:12             ` jamal
  0 siblings, 1 reply; 30+ messages in thread
From: Harald Welte @ 2004-08-24 18:57 UTC (permalink / raw)
  To: jamal; +Cc: sandr8, devik, netdev, netfilter-devel

On Mon, Aug 16, 2004 at 09:29:59AM -0400, jamal wrote:
> > some later data.  It SHOULD not care whether this later data for the
> > same flow has bigger or smaller byte/packet counters. [and
> > it is very unlikely that the total will be lower, since then in the
> > timeframe [snapshot, terminations] more packets have to be dropped than
> > accepted.]  Still, if this is documented with ctnetlink I'm perfectly
> > fine with such behaviour.
> 
> I am too. Good stuff.
> I think 99.9% of accounting would be happy with getting data after the
> call is done; the other 0.1% may have to live with extra packets later
> which undo things.
> Are you working on something along the IPFIX protocol for transport?

Yes, I am working on the next generation of ulogd (ulogd-2).  It will
allow you to stack multiple plugins on top of each other, such as:

ctnetlink -> ipfix

or 

ulog -> packet parser -> flow_aggregation -> ipfix

or even exporting per-packet data in ipfix 

ulog -> packet parser -> ipfix

It's far from complete, but I'm almost finished with the core and can
start to write the plugins:
http://svn.gnumonks.org/branches/ulog/ulogd2/

> Let me think about it.
> Clearly the best place to account for things is on the wire once the
> packet has left the box ;-> 

This is arguable.  Why not once the packet was received on the incoming
wire?  If you are the ingress router of your network, your ISP bills you
for the amount of traffic it sends to your ingress router.

OTOH, for egress traffic I agree.

> How open are you to move accounting further down? My thoughts
> are along the lines of incrementing the conntrack counters at the qdisc
> level. Since you transport after the structure has been deleted, it
> should work out fine and fair billing will be taken care of.

Sure, it can be done... but you need to grab ip_conntrack_lock in order
to do so.

> Has someone experimented and figured out how expensive it would be to
> do it at the qdisc level? Note, you can probably have a low-level,
> fine-grained lock just for stats.

It doesn't help.  Currently we still have one global rwlock for all
conntracks.  There are patches for per-bucket locking.

But well, as long as the usage count of the ip_conntrack doesn't drop to
zero (which we're guaranteed, since our skb still references it), and we
don't need to walk the hash table, we don't actually need to grab
ip_conntrack_lock.  A separate accounting lock would be possible.

Somebody needs to implement this and run profiles...

> INC_STATS is generic (it goes in the generic stats code in the attached
> patch) and will have an ifdef for conntrack billing (which includes
> unbilling depending on the reason code). The reason codes could be the
> results that are now returned.
> As an example, NET_XMIT_DROP definitely means unbilling while
> NET_XMIT_SUCCESS implies billing.

Sounds fine to me.  But this basically means that we would hide the
conntrack counters behind a generic API - the counters themselves are
stored in our own data structure.
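
A guess at the shape being described -- the config symbol and the
ct_bill()/ct_unbill() helpers are invented here, since the attached
patch itself is not shown in this thread, while the NET_XMIT_* codes
are the kernel's:

  #define NET_XMIT_SUCCESS 0
  #define NET_XMIT_DROP    1

  /* hypothetical billing hooks, hidden behind a generic stats macro */
  #ifdef CONFIG_NET_CT_BILLING
  #define INC_STATS(sch, skb, ret)                                \
          do {                                                    \
                  if ((ret) == NET_XMIT_DROP)                     \
                          ct_unbill(skb);         /* refund */    \
                  else if ((ret) == NET_XMIT_SUCCESS)             \
                          ct_bill(skb);                           \
          } while (0)
  #else
  #define INC_STATS(sch, skb, ret) do { } while (0)
  #endif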

> cheers,
> jamal
-- 
- Harald Welte <laforge@netfilter.org>             http://www.netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

* Re: Billing 3: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-23 12:27                   ` sandr8
@ 2004-08-25 11:34                     ` jamal
  0 siblings, 0 replies; 30+ messages in thread
From: jamal @ 2004-08-25 11:34 UTC (permalink / raw)
  To: sandr8; +Cc: Harald Welte, devik, netfilter-devel, netdev


On Mon, 2004-08-23 at 08:27, sandr8 wrote:
> jamal wrote:
> 
> >Let me layout a few things:
> >
> >    ..-->classification --> tcaction
> >return code1 --> enqueue
> >return code2 ...(packet may be freed here)--> dev.c
> >
> >The rules are:

[..]
> >
> i don't know the code that does that, since afaik it
> was not yet there in 2.6.8... if it is, please tell me
> because i'm eager to have a full viewpoint of the
> forthcoming packet action framework :)

Look at the policer as an example - it is already in the kernel. It may
decide to shoot a packet.
The framework has been in since pre-2.6.8. My goal was to wait and
have it stabilize, i.e. no complaints, and then start sending
patches for new actions. I can send you a few if you want to scrutinize
them (although to understand the rules posted you don't need to see the
patches).

> >Again, as the above shows, billing would work better at the qdisc level.
> >  
> >
> yes, sure...
> but for the moment i personally don't know which path
> STOLEN and QUEUED packets will follow... it's not
> a reproach :) it's just that i simply can't know for the
> moment, and i don't want to rush you, i perfectly
> understand that you've got plenty of things to do :)
> 

Let me know if you want me to send you an action or two ;->
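
For reference, a sketch of how a caller might consume such a verdict;
the TC_ACT_* names follow the kernel's action framework, but the
surrounding function is illustrative only:

  struct sk_buff;
  void kfree_skb(struct sk_buff *skb);

  #define TC_ACT_OK     0
  #define TC_ACT_SHOT   2         /* action decided to drop the packet */
  #define TC_ACT_STOLEN 4         /* action took the packet over */
  #define TC_ACT_QUEUED 5         /* action queued it for later */

  /* what a caller might do with an action verdict (sketch only) */
  int handle_verdict(int verdict, struct sk_buff *skb)
  {
          switch (verdict) {
          case TC_ACT_SHOT:
                  kfree_skb(skb); /* drop; a billing hook could unbill */
                  return -1;
          case TC_ACT_STOLEN:
          case TC_ACT_QUEUED:
                  return 0;       /* skb is no longer ours to free */
          default:
                  return 1;       /* proceed to classify/enqueue */
          }
  }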


> >
> >Linux is being efficient by sharing skbs. One of the most expensive
> >things in a stack is copying packets around (which is avoided).
> >If this were a simple system where allocating and freeing could be
> >mapped exactly to a flow, then what you suggest could be done. In
> >Linux you can't.
> >
> >  
> >
> in that patch i was not copying them... just passing them by
> reference through a "struct sk_buff ** const skb" parameter.
> 
> what is the difference from before...?
> 
> calling:
> before) when calling we passed a pointer on the stack.
> now) when calling we pass a pointer on the stack.
> 
> returning:
> before) when returning we didn't tell the caller which
> packet was dropped.
> now) when returning we tell the caller which packet
> is not queued or is thrown out of our queue to
> make room for the packet being enqueued.

The issue i had was twofold:
1) If your plan is to change things in the skb, you have to
remember that you must copy or clone it (because you don't know who
else is holding references to the skb and for what purpose),
in which case the skb you manipulate is different from the one you
passed, i.e. no need to pass the reference to achieve that.
2) You are making a lot of changes to pass **skb
in order to achieve a goal which doesn't seem strong to begin with.

> what happens on the stack now?
> 
> for the calling: more or less the same as before :)
> for the returning...
> 
> 1) the external pointer is const and cannot be changed. this
> is good to avoid stupid bugs.
> 2) since the external pointer is const, the internal one always
> lives in the stack frame of the outermost caller of an enqueue
> operation. just the external one is passed to callees. when
> a requeue operation is called, that pointer nevertheless stays
> in the stack frame of the outermost caller of an enqueue
> operation. that's to say: it never moves from the stack of the
> caller of dev->qdisc->enqueue()... [in that case it was dev.c,
> but maybe it would be nicer to have it in a sort of
> dev->enqueue() that would have to do with device-level
> conntracking].
> 
> additional cost for the function call? z-e-r-o !!! :)
> performance issues? maybe some improvement due
> to the elimination of many internal jumps and conditions.
> 
> furthermore, telling the caller which packet we chose to
> drop allows it to reshape it without the need for every
> qdisc to resort to any callback function.
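
In code terms, the convention described above -- per the patch's
"struct sk_buff ** const skb" interface -- would look roughly like this
sketch (stub types, not the full kernel signatures):

  struct sk_buff;
  void kfree_skb(struct sk_buff *skb);

  #define NET_XMIT_SUCCESS 0
  #define NET_XMIT_DROP    1

  struct Qdisc {
          int (*enqueue)(struct sk_buff ** const skb, struct Qdisc *sch);
  };

  /* outermost caller, as in dev.c: the victim pointer lives in this
   * stack frame, and an inner qdisc may have swapped it for a
   * different skb before returning */
  int dev_enqueue(struct Qdisc *q, struct sk_buff *skb)
  {
          int ret = q->enqueue(&skb, q);

          if (ret == NET_XMIT_DROP)
                  kfree_skb(skb); /* may not be the skb we passed in */
          return ret;
  }

The point being that only the outermost frame ever frees; inner qdiscs
just update what *skb points to.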

I didn't quite parse what you said above - we may be saying the same
thing again. Did my last email (where we seem to be in sync) cover this?

cheers,
jamal

* Re: Billing 3: WAS(Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com ,
  2004-08-24 18:46               ` Billing 3: " Harald Welte
@ 2004-08-25 11:50                 ` jamal
  0 siblings, 0 replies; 30+ messages in thread
From: jamal @ 2004-08-25 11:50 UTC (permalink / raw)
  To: Harald Welte; +Cc: sandr8, devik, netfilter-devel, netdev

On Tue, 2004-08-24 at 14:46, Harald Welte wrote:
> On Sun, Aug 22, 2004 at 12:12:04PM -0400, jamal wrote:

> > Such packets are cloned or copied. I am going to assume the conntrack
> > data remains intact in both cases. LaForge?
> 
> Yes. But it is still a question of viewpoint which kind of behaviour
> is correct.  Let's say a single packet is accounted in ct_acct, and
> then sent to multiple interfaces, where on more than one of them it
> gets unbilled.  So we add once, but 'unbill' (i prefer the term
> subtract) more than once.  In the end the ct_acct counter will be
> lower than it was when it first encountered the packet.
> 

These are some of the challenges i posed to Alessandro as well.

> > In the future we should make accounting a feature that can be turned
> > on independently of conntracking, and skbs should carry accounting
> > metadata with them.
> 
> I don't really understand what you want to say.  You want accounting
> that is not conntrack-based?  well, then you should maybe look at
> one of the many methods, ranging from iptables rule counters to the 
> ipt_acct match/target, nacctd, pcap-mmap, PF_RING, ULOG-based, ...
> 

The discussion about fairness led to my assertion above.
None of the above takes care of the fairness factor that Alessandro is
trying to achieve. So my thinking is: if we wanna fix the unbill
problem, why not address the whole issue and make accounting an
embedded piece.

> btw, they all account the amount of RX packet on inbound interface and
> do not 'unbill' ;)

I suspect in such a case you have a passive box which just gets a copy
of the packet from the wire ;-> Someone else in the real box, where the
packet gets forwarded, will drop the packet and it will still be paid
for ;->
In the case of what you have introduced though, you are billing for
locally generated packets as well (if i am not mistaken this is where
the issues are).

In general i don't think we need to be 100% accurate. PSAMP, for
example, operates by merely sampling, i.e. not taking a precise
measurement. People have written tons of papers to prove it is good
enough.

cheers,
jamal

* Re: [PATCH 2/4] deferred drop, __parent workaround, reshape_fail
  2004-08-24 18:57           ` Harald Welte
@ 2004-08-25 12:12             ` jamal
  0 siblings, 0 replies; 30+ messages in thread
From: jamal @ 2004-08-25 12:12 UTC (permalink / raw)
  To: Harald Welte; +Cc: sandr8, devik, netfilter-devel, netdev

On Tue, 2004-08-24 at 14:57, Harald Welte wrote:
> On Mon, Aug 16, 2004 at 09:29:59AM -0400, jamal wrote:

> Yes, I am working on the next generation of ulogd (ulogd-2).  It will
> allow you to stack multiple plugins on top of each other, such as:
> 
> ctnetlink -> ipfix
> 
> or 
> 
> ulog -> packet parser -> flow_aggregation -> ipfix
> 
> or even exporting per-packet data in ipfix 
> 
> ulog -> packet parser -> ipfix
> 
> It's far from complete, but I'm almost finished with the core and can
> start to write the plugins:
> http://svn.gnumonks.org/branches/ulog/ulogd2/

cool. Looks nice.

> > Let me think about it.
> > Clearly the best place to account for things is on the wire once the
> > packet has left the box ;-> 
> 
> This is arguable.  Why not once the packet is received on the incoming
> wire?  If you are the ingress router of your network, your ISP bills
> you for the amount of traffic it sends to your ingress router.
> 
> OTOH, for egress traffic I agree.

I think maybe the issue is to define what the end goal is.
For a passive box getting copies of the packets on the wire, ingress
would be fine. Unfairness would still be there in case the forwarding
box (not the accounting box) immediately drops a packet
that has already been accounted for.
OTOH, in what you have introduced you are also accounting for
locally generated packets in a box that is not dedicated to just
accounting.
In such a case, we can (if the cost of making the change is acceptable)
make more intelligent decisions.

> > How open are you to moving accounting further down? My thoughts
> > are along the lines of incrementing the conntrack counters at the qdisc
> > level. Since you transport after the structure has been deleted, it
> > should work out fine and fair billing will be taken care of.
> 
> Sure, it can be done... but you need to grab ip_conntrack_lock in order
> to do so.

The suggestion is that you just do the marking and let the qdisc do the
accounting in a possibly independent accounting table with much
finer-grained locks.
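
Purely as a sketch of that idea -- made-up names, and a userspace mutex
standing in for a per-slot spinlock -- the tc side could own a small
mark-indexed table:

  #include <pthread.h>

  #define ACCT_SLOTS 256

  struct acct_slot {
          pthread_mutex_t lock;           /* stands in for a spinlock */
          unsigned long long packets, bytes;
  };

  static struct acct_slot acct_tab[ACCT_SLOTS];

  void acct_init(void)
  {
          int i;

          for (i = 0; i < ACCT_SLOTS; i++)
                  pthread_mutex_init(&acct_tab[i].lock, NULL);
  }

  /* called at enqueue time: bill by the mark conntrack stamped on the
   * packet, without touching the global conntrack lock */
  void qdisc_account(unsigned int mark, unsigned int len)
  {
          struct acct_slot *s = &acct_tab[mark % ACCT_SLOTS];

          pthread_mutex_lock(&s->lock);
          s->packets++;
          s->bytes += len;
          pthread_mutex_unlock(&s->lock);
  }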

> > Has anyone experimented and figured out how expensive it would be to
> > do it at the qdisc level? Note, you can probably have a low-level,
> > fine-grained lock just for stats.
> 
> It doesn't help.  Currently we still have one global rwlock for all
> conntracks.  There are patches for per-bucket locking.
> 
> But well, as long as the usage count of ip_conntrack doesn't drop to
> zero (which we're guaranteed since our skb still references it), and we
> don't need to walk the hash table, we don't actually need to grab
> ip_conntrack_lock.  A separate accounting lock would be possible.
> 
> Somebody needs to implement this and run profiles...
> 

We should probably avoid this approach totally if we are going to do
all that work.

> > INC_STATS is generic (it goes in the generic stats code in the
> > attached patch) and will have an ifdef for conntrack billing (which
> > includes unbilling depending on the reason code). The reason codes
> > could be the results that are now returned.
> > As an example, NET_XMIT_DROP definitely means unbilling while
> > NET_XMIT_SUCCESS implies billing.
> 
> Sounds fine to me.  But this basically means that we would hide the
> conntrack counters behind a generic API - the counters themselves are
> stored in our own data structure.
> 

Yes, that would also be a good starting compromise.
I am beginning to question the wisdom of "fixing" this. The clear
solution is to have the conntracking code do nothing other
than mark the packets. This is the most valuable thing we can
get out of conntracking. Let the tc bits do the accounting. You can
still transport this over ctnetlink. There will also need to be some
signaling from conntrack to indicate the addition or deletion of marks
(the other alternative is to have the admin select marks).

Of course, all this is big bait so that we can have the energetic
Alessandro do all the work ;-> <wink, wink>

cheers,
jamal

Thread overview: 30+ messages
2004-08-13  0:48 [PATCH 2/4] deferred drop, __parent workaround, reshape_fail sandr8
2004-08-13 12:51 ` jamal
2004-08-13 14:09   ` sandr8
2004-08-14 21:21     ` jamal
2004-08-16  7:35       ` Harald Welte
2004-08-16 13:29         ` jamal
2004-08-24 18:57           ` Harald Welte
2004-08-25 12:12             ` jamal
2004-08-16  7:20   ` Harald Welte
2004-08-16 13:00     ` jamal
2004-08-16 13:08       ` Harald Welte
2004-08-16 15:19       ` sandr8
2004-08-17 11:52         ` jamal
2004-08-17 13:40           ` [PATCH 2/4] deferred drop, __parent workaround, reshape_fail , netdev@oss.sgi.com , sandr8
2004-08-22 15:17             ` Billing 1: WAS (Re: " jamal
2004-08-23  9:33               ` sandr8
2004-08-24 18:38               ` Harald Welte
2004-08-22 15:38             ` Billing 2: WAS(Re: " jamal
2004-08-22 16:12             ` Billing 3: " jamal
2004-08-23  9:39               ` sandr8
2004-08-23 11:38                 ` Billing 3-1: " jamal
2004-08-23 12:04                   ` sandr8
2004-08-23 12:31                     ` jamal
2004-08-23 11:58                 ` Billing 3: " jamal
2004-08-23 12:27                   ` sandr8
2004-08-25 11:34                     ` jamal
2004-08-23 12:15                 ` Billing 3-3: " jamal
2004-08-24 18:46               ` Billing 3: " Harald Welte
2004-08-25 11:50                 ` jamal
2004-08-17 13:49           ` [PATCH 2/4] deferred drop, __parent workaround, reshape_fail sandr8
