* Re: omap4-panda-es boot issues with v3.15-rc4
From: Santosh Shilimkar @ 2014-05-12 21:40 UTC
To: Tony Lindgren, Kevin Hilman
Cc: Roger Quadros, Menon, Nishanth, Grygorii Strashko, Paul Walmsley,
Taras Kondratiuk, linux-omap@vger.kernel.org,
Linux ARM Kernel Mailing List, Kristo, Tero, Paul Burton,
Daniel Lezcano, Rafael J. Wysocki
In-Reply-To: <20140511155542.GD28266@atomide.com>
On Sunday 11 May 2014 11:55 AM, Tony Lindgren wrote:
> * Kevin Hilman <khilman@linaro.org> [140509 16:46]:
>> Roger Quadros <rogerq@ti.com> writes:
>>
>>> Kevin,
>>>
>>> On 05/09/2014 01:15 AM, Kevin Hilman wrote:
>>>> Tony Lindgren <tony@atomide.com> writes:
>>>>
>>>> [...]
>>>>
>>>>> ..but I think I found the cause for recent hangs on panda, just a wild
>>>>> guess based on looking at the recent cpuidle patches after v3.14.
>>>>>
>>>>> Looks like reverting 0b89e9aa2856 (cpuidle: delay enabling interrupts
>>>>> until all coupled CPUs leave idle) makes booting work reliably again
>>>>> on panda.
>>>>>
>>>>> Can you guys confirm, so far no issues here after few boot tests,
>>>>> but it might be too early to tell.
>>>>
>>>> Reverting that makes things a bit more stable, but it still eventually
>>>> fails in the same way. For me it took 8 boots for it to eventually
>>>> fail.
>>>>
>>>> However, if I build with CONFIG_CPU_IDLE=n, it becomes much more stable
>>>> (20+ boots in a row and still going.)
>>>>
>>>
>>> Can you please test with CPU_IDLE enabled but C3 disabled as in below patch?
>>> It worked for me 10/10 boots.
>>
>> Yup, it worked for me too for 10/10 boots in a row.
>
> But what has caused this regression, does it work reliably with let's
> say v3.13 or v3.12?
>
IIRC things were stable until some CPUIDLE code consolidation happened.
I don't recall the details, but someone did discuss it a while back.
Can you re-run your test cases with the patch at the end of this email?
This is just a hunch, so don't blame me if I waste your time testing the
patch.
regards,
Santosh
From bdd30d68f8fa659aa0e3ce436f94029a7719036b Mon Sep 17 00:00:00 2001
From: Santosh Shilimkar <santosh.shilimkar@ti.com>
Date: Mon, 12 May 2014 17:37:59 -0400
Subject: [PATCH] Revert "cpuidle / omap4 : use CPUIDLE_FLAG_TIMER_STOP flag"
This reverts commit cb7094e848f7bcaa0a4cda3db4b232f08dbf5b78.
Conflicts:
arch/arm/mach-omap2/cpuidle44xx.c
---
arch/arm/mach-omap2/cpuidle44xx.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/arm/mach-omap2/cpuidle44xx.c b/arch/arm/mach-omap2/cpuidle44xx.c
index 01fc710..aae3606 100644
--- a/arch/arm/mach-omap2/cpuidle44xx.c
+++ b/arch/arm/mach-omap2/cpuidle44xx.c
@@ -83,6 +83,7 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev,
{
struct idle_statedata *cx = state_ptr + index;
u32 mpuss_can_lose_context = 0;
+ int cpu_id = smp_processor_id();
/*
* CPU0 has to wait and stay ON until CPU1 is OFF state.
@@ -110,6 +111,8 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev,
mpuss_can_lose_context = (cx->mpu_state == PWRDM_POWER_RET) &&
(cx->mpu_logic_state == PWRDM_POWER_OFF);
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu_id);
+
/*
* Call idle CPU PM enter notifier chain so that
* VFP and per CPU interrupt context is saved.
@@ -165,6 +168,8 @@ static int omap_enter_idle_coupled(struct cpuidle_device *dev,
if (dev->cpu == 0 && mpuss_can_lose_context)
cpu_cluster_pm_exit();
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu_id);
+
fail:
cpuidle_coupled_parallel_barrier(dev, &abort_barrier);
cpu_done[dev->cpu] = false;
@@ -189,8 +194,7 @@ static struct cpuidle_driver omap4_idle_driver = {
/* C2 - CPU0 OFF + CPU1 OFF + MPU CSWR */
.exit_latency = 328 + 440,
.target_residency = 960,
- .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED |
- CPUIDLE_FLAG_TIMER_STOP,
+ .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED,
.enter = omap_enter_idle_coupled,
.name = "C2",
.desc = "CPUx OFF, MPUSS CSWR",
@@ -199,8 +203,7 @@ static struct cpuidle_driver omap4_idle_driver = {
/* C3 - CPU0 OFF + CPU1 OFF + MPU OSWR */
.exit_latency = 460 + 518,
.target_residency = 1100,
- .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED |
- CPUIDLE_FLAG_TIMER_STOP,
+ .flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_COUPLED,
.enter = omap_enter_idle_coupled,
.name = "C3",
.desc = "CPUx OFF, MPUSS OSWR",
--
1.7.9.5
* Re: [PATCH 06/10 V2] workqueue: convert worker_idr to worker_ida
From: Tejun Heo @ 2014-05-12 21:40 UTC
To: Lai Jiangshan; +Cc: linux-kernel
In-Reply-To: <1399877792-13046-7-git-send-email-laijs@cn.fujitsu.com>
On Mon, May 12, 2014 at 02:56:18PM +0800, Lai Jiangshan wrote:
> @@ -1681,7 +1682,6 @@ static void worker_detach_from_pool(struct worker *worker,
> struct completion *detach_completion = NULL;
>
> mutex_lock(&pool->manager_mutex);
> - idr_remove(&pool->worker_idr, worker->id);
> list_del(&worker->node);
> if (list_empty(&pool->workers))
> detach_completion = pool->detach_completion;
Why are we moving ida removal to the caller here? Does
worker_detach_from_pool() get used for something else later? Up until
this point, the distinction seems pretty arbitrary.
> @@ -1757,8 +1754,6 @@ static struct worker *create_worker(struct worker_pool *pool)
> if (pool->flags & POOL_DISASSOCIATED)
> worker->flags |= WORKER_UNBOUND;
>
> - /* successful, commit the pointer to idr */
> - idr_replace(&pool->worker_idr, worker, worker->id);
Ah, the comment is removed altogether. Please disregard my previous
review on the comment.
Thanks.
--
tejun
* + mm-compaction-do-not-count-migratepages-when-unnecessary-fix.patch added to -mm tree
From: akpm @ 2014-05-12 21:40 UTC
To: mm-commits, mina86, vbabka
Subject: + mm-compaction-do-not-count-migratepages-when-unnecessary-fix.patch added to -mm tree
To: vbabka@suse.cz,mina86@mina86.com
From: akpm@linux-foundation.org
Date: Mon, 12 May 2014 14:40:27 -0700
The patch titled
Subject: mm-compaction-do-not-count-migratepages-when-unnecessary-fix
has been added to the -mm tree. Its filename is
mm-compaction-do-not-count-migratepages-when-unnecessary-fix.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-do-not-count-migratepages-when-unnecessary-fix.patch
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-compaction-do-not-count-migratepages-when-unnecessary-fix.patch
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm-compaction-do-not-count-migratepages-when-unnecessary-fix
list_for_each is enough for counting the list length. We also avoid
including struct page definition this way.
Suggested-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/trace/events/compaction.h | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff -puN include/trace/events/compaction.h~mm-compaction-do-not-count-migratepages-when-unnecessary-fix include/trace/events/compaction.h
--- a/include/trace/events/compaction.h~mm-compaction-do-not-count-migratepages-when-unnecessary-fix
+++ a/include/trace/events/compaction.h
@@ -7,7 +7,6 @@
#include <linux/types.h>
#include <linux/list.h>
#include <linux/tracepoint.h>
-#include <linux/mm_types.h>
#include <trace/events/gfpflags.h>
DECLARE_EVENT_CLASS(mm_compaction_isolate_template,
@@ -62,7 +61,7 @@ TRACE_EVENT(mm_compaction_migratepages,
TP_fast_assign(
unsigned long nr_failed = 0;
- struct page *page;
+ struct list_head *page_lru;
/*
* migrate_pages() returns either a non-negative number
@@ -73,7 +72,7 @@ TRACE_EVENT(mm_compaction_migratepages,
if (migrate_rc >= 0)
nr_failed = migrate_rc;
else
- list_for_each_entry(page, migratepages, lru)
+ list_for_each(page_lru, migratepages)
nr_failed++;
__entry->nr_migrated = nr_all - nr_failed;
_
Patches currently in -mm which might be from vbabka@suse.cz are
mm-compactionc-isolate_freepages_block-small-tuneup.patch
mm-page_alloc-prevent-migrate_reserve-pages-from-being-misplaced.patch
mm-compaction-clean-up-unused-code-lines.patch
mm-compaction-cleanup-isolate_freepages.patch
mm-compaction-cleanup-isolate_freepages-fix.patch
mm-compaction-cleanup-isolate_freepages-fix-2.patch
mm-migration-add-destination-page-freeing-callback.patch
mm-compaction-return-failed-migration-target-pages-back-to-freelist.patch
mm-compaction-add-per-zone-migration-pfn-cache-for-async-compaction.patch
mm-compaction-embed-migration-mode-in-compact_control.patch
mm-compaction-embed-migration-mode-in-compact_control-fix.patch
mm-thp-avoid-excessive-compaction-latency-during-fault.patch
mm-thp-avoid-excessive-compaction-latency-during-fault-fix.patch
mm-compaction-terminate-async-compaction-when-rescheduling.patch
mm-compaction-do-not-count-migratepages-when-unnecessary.patch
mm-compaction-do-not-count-migratepages-when-unnecessary-fix.patch
mm-compaction-avoid-rescanning-pageblocks-in-isolate_freepages.patch
mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages.patch
mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages-fix.patch
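
The point of the fix above (counting a list needs only the links, not the
containing structure) can be illustrated outside the kernel. The following
is a minimal userspace sketch, not the kernel's list.h: struct list_head and
list_for_each are reduced stand-ins, and init_list_head, struct fake_page
and count_list are hypothetical names introduced only for this example.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

/* Make @head an empty list that points at itself. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link @node in just before @head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Same shape as the kernel macro: visits bare nodes only. */
#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)

/* Hypothetical payload type, standing in for struct page. */
struct fake_page {
	int id;
	struct list_head lru;
};

/* Count entries without ever touching the payload type. */
static unsigned long count_list(struct list_head *head)
{
	struct list_head *pos;
	unsigned long n = 0;

	list_for_each(pos, head)
		n++;
	return n;
}
```

Because count_list() never names struct fake_page, a header that only counts
entries would not need the payload's definition, which is the same reason
the patch can drop the mm_types.h include.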
* Re: [PATCH] powerpc/powernv: Correctly set hypervisor interrupt little endian bit on POWER8
From: Benjamin Herrenschmidt @ 2014-05-12 21:40 UTC
To: Anton Blanchard; +Cc: linuxppc-dev, paulus
In-Reply-To: <20140508223138.4d600b60@kryten>
On Thu, 2014-05-08 at 22:31 +1000, Anton Blanchard wrote:
> HID0 IBM bit 19 is the HILE bit on POWER8. Set it to 0 to take
> exceptions in big endian and to 1 to take them in little endian.
>
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
Let's stick to the variant involving a FW call instead.
Cheers,
Ben.
>
> Index: b/arch/powerpc/include/asm/reg.h
> ===================================================================
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -397,6 +397,7 @@
> #define SPRN_HASH1 0x3D2 /* Primary Hash Address Register */
> #define SPRN_HASH2 0x3D3 /* Secondary Hash Address Resgister */
> #define SPRN_HID0 0x3F0 /* Hardware Implementation Register 0 */
> +#define HID0_HILE_SH (63 - 19) /* Hypervisor interrupt little endian */
> #define HID0_HDICE_SH (63 - 23) /* 970 HDEC interrupt enable */
> #define HID0_EMCP (1<<31) /* Enable Machine Check pin */
> #define HID0_EBA (1<<29) /* Enable Bus Address Parity */
> Index: b/arch/powerpc/kernel/cpu_setup_power.S
> ===================================================================
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -60,6 +60,7 @@ _GLOBAL(__setup_cpu_power8)
> bl __init_HFSCR
> bl __init_tlb_power8
> bl __init_PMU_HV
> + bl __init_HILE
> mtlr r11
> blr
>
> @@ -78,6 +79,7 @@ _GLOBAL(__restore_cpu_power8)
> bl __init_HFSCR
> bl __init_tlb_power8
> bl __init_PMU_HV
> + bl __init_HILE
> mtlr r11
> blr
>
> @@ -132,6 +134,26 @@ __init_HFSCR:
> mtspr SPRN_HFSCR,r3
> blr
>
> +__init_HILE:
> + mfspr r3,SPRN_HID0
> + li r4,1
> + sldi r4,r4,HID0_HILE_SH
> +#ifdef __LITTLE_ENDIAN__
> + or r3,r3,r4
> +#else
> + andc r3,r3,r4
> +#endif
> + sync
> + mtspr SPRN_HID0,r3
> + mfspr r3,SPRN_HID0
> + mfspr r3,SPRN_HID0
> + mfspr r3,SPRN_HID0
> + mfspr r3,SPRN_HID0
> + mfspr r3,SPRN_HID0
> + mfspr r3,SPRN_HID0
> + isync
> + blr
> +
> /*
> * Clear the TLB using the specified IS form of tlbiel instruction
> * (invalidate by congruence class). P7 has 128 CCs., P8 has 512.
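
For readers unfamiliar with IBM bit numbering, the shift used in the quoted
patch can be checked with plain C. This is only a sketch of the bit
arithmetic, assuming a 64-bit register value held in a variable; it does not
access any SPR, and hid0_with_hile is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * IBM bit numbering counts from the most significant bit, so IBM bit 19
 * of a 64-bit register is conventional bit (63 - 19) = 44.
 */
#define HID0_HILE_SH (63 - 19)

/*
 * Return @hid0 with HILE set (exceptions taken little endian) or cleared
 * (exceptions taken big endian). Models the or/andc pair in the patch.
 */
static uint64_t hid0_with_hile(uint64_t hid0, bool little_endian)
{
	uint64_t mask = UINT64_C(1) << HID0_HILE_SH;

	return little_endian ? (hid0 | mask) : (hid0 & ~mask);
}
```

The sync, repeated mfspr and isync sequence around the real mtspr is a
hardware requirement that this sketch deliberately leaves out.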
* Re: [PATCH i-g-t 1/2] lib: add igt_set_timeout
From: Daniel Vetter @ 2014-05-12 21:39 UTC
To: Thomas Wood; +Cc: intel-gfx
In-Reply-To: <1399890921-13068-1-git-send-email-thomas.wood@intel.com>
On Mon, May 12, 2014 at 11:35:20AM +0100, Thomas Wood wrote:
> Add a function to stop and fail a test after the specified number of
> seconds have elapsed.
>
> Signed-off-by: Thomas Wood <thomas.wood@intel.com>
> ---
> lib/igt_core.c | 44 +++++++++++++++++++++++++++++++++++++++++---
> lib/igt_core.h | 2 ++
> 2 files changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/lib/igt_core.c b/lib/igt_core.c
> index 6f137ab..238068c 100644
> --- a/lib/igt_core.c
> +++ b/lib/igt_core.c
> @@ -618,9 +618,12 @@ void igt_fail(int exitcode)
Right above this is the igt_exitcode = exitcode assignment. I wonder
whether we should upgrade a TIMEOUT to a FAIL if we run more than one
subtest.
> if (test_child)
> exit(exitcode);
>
> - if (in_subtest)
> - exit_subtest("FAIL");
> - else {
> + if (in_subtest) {
> + if (exitcode == 78)
> + exit_subtest("TIMEOUT");
> + else
> + exit_subtest("FAIL");
> + } else {
> assert(!test_with_subtests || in_fixture);
>
> if (in_fixture) {
> @@ -1201,3 +1204,38 @@ void igt_vlog(enum igt_log_level level, const char *format, va_list args)
> } else
> vprintf(format, args);
> }
> +
> +static void igt_alarm_handler(int signal)
> +{
> + /* subsequent tests are skipped */
> + skip_subtests_henceforth = SKIP;
> +
> + /* exit with status 78 to indicate timeout */
> + igt_fail(78);
Hm, I didn't notice that your piglit testcase encodes 78 as "timeout". Or
I've forgotten about it again ;-) Iirc my first attempt didn't
special-case exit codes beyond the 77 "skip", but this works. One useful
thing though would be to add a bunch of #defines for these special values
somewhere in a header, e.g.
#define IGT_EXIT_TIMEOUT 78
#define IGT_EXIT_SKIP 77
#define IGT_EXIT_SUCCESS 0
But this is a follow-up patch.
-Daniel
> +}
> +
> +/**
> + * igt_set_timeout:
> + * @seconds: seconds before timeout
> + *
> + * Stop the current test and skip any subsequent tests after the specified
> + * number of seconds have elapsed. The test will exit with "timeout" status
> + * (78). Any previous timer is cancelled and no timeout is scheduled if @seconds
> + * is zero.
> + *
> + */
> +void igt_set_timeout(unsigned int seconds)
> +{
> + struct sigaction sa;
> +
> + sa.sa_handler = igt_alarm_handler;
> + sigemptyset(&sa.sa_mask);
> + sa.sa_flags = 0;
> +
> + if (seconds == 0)
> + sigaction(SIGALRM, NULL, NULL);
> + else
> + sigaction(SIGALRM, &sa, NULL);
> +
> + alarm(seconds);
> +}
> diff --git a/lib/igt_core.h b/lib/igt_core.h
> index 7ede0d3..63ed9a5 100644
> --- a/lib/igt_core.h
> +++ b/lib/igt_core.h
> @@ -465,4 +465,6 @@ extern enum igt_log_level igt_log_level;
> } while (0)
>
>
> +void igt_set_timeout(unsigned int timeout);
> +
> #endif /* IGT_CORE_H */
> --
> 1.9.0
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
* Re: [PATCH 05/10 V2] workqueue: separate iteration role from worker_idr
From: Tejun Heo @ 2014-05-12 21:37 UTC
To: Lai Jiangshan; +Cc: linux-kernel, Peter Zijlstra, Viresh Kumar, Ingo Molnar
In-Reply-To: <1399877792-13046-6-git-send-email-laijs@cn.fujitsu.com>
Oops, one more thing.
On Mon, May 12, 2014 at 02:56:17PM +0800, Lai Jiangshan wrote:
> - struct idr worker_idr; /* M: worker IDs and iteration */
> + struct idr worker_idr; /* M: worker IDs */
> + struct list_head workers; /* M: attached workers */
Something like the following is probably more useful.
/* M: runs through worker->node */
--
tejun
* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
From: Steven Rostedt @ 2014-05-12 21:37 UTC
To: Thomas Gleixner
Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Ingo Molnar, Clark Williams, Paul McKenney,
Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
Michael Kerrisk, Sebastian Andrzej Siewior
In-Reply-To: <20140512190438.314125476@linutronix.de>
On Mon, 12 May 2014 20:45:32 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:
> Steven, I told you more than once, that your damned performance and
> feature mania is completely unacceptable.
Yes you have, and now you are yelling at me for something I did 3 years
ago. But this isn't one of my "performance and feature mania" things
you bitched at me about. This is something *you* asked me to look at!
http://marc.info/?l=linux-kernel&m=129242481625261
The "stress test" you recommended back then was just to port it to -rt
and test it there. Which I did, and it found various issues, which I
helped Lai fix.
>
> When the patch was posted, you were reviewing it. When I wanted to
> test it, it did not apply and you told me you'll fix it and pick it
> up. Sure you made it apply and you sticked it into a git tree and
> let Ingo pull it.
>
> I didn't pay much attention as I was happy with the idea in
> general. I trusted your review and judgement. My problem.
>
> Why on earth did you merge that patch even if the changelog mentions
> that Documentation was left stale and the tester broken?
Sure, I should have pushed Lai for that at the time. Lessons learned. I
was expecting he would do so after we got to a final version of the
patch but I was also working on other things at the time and forgot to
remind him.
I'm surprised you didn't mention to me to make sure the documentation
and test case got fixed before this went further. In fact, that was
never mentioned. These patches went through 4 revisions over a month,
and I even posted another RFC for them to get into the -rt tree.
>
> This is beyond sloppy and outright stupid.
Thomas, listen to yourself. You are calling something that got in
without updating the documentation and test case "beyond sloppy and
outright stupid". Sure, I'll agree it was a little sloppy, and today I'm
more on the ball to get things like this right (as we push much harder
today on documentation coming in along with code changes). But the
change log had:
need updated after this patch is accepted
1) Document/*
2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst
I agree this should be fixed, but your temper tantrum over this seems a
bit over the top.
>
> Get this fixed now!
>
> And those fixes are not going to be merged without my Reviewed-by.
>
> I'm not going to let your multi-reader patches anywhere near RT,
> unless this Documentation and tester issue is resolved.
>
> After that I'm going through them with a fine comb, as I really lost
> the last modicom of hope, that you can do something without leaving
> trainwrecks left and right.
And it boils down to this. In the past, I have screwed up where you
were the one (and usually the only one) that got hit by that mess. But
now you seem to take any little mistake done by me as a sign of
complete incompetence.
I'm sorry you feel this way.
I'll try to fix this up as I get a chance (I still have a day job to
do).
-- Steve
* + fix-_ioc_typecheck-sparse-error.patch added to -mm tree
From: akpm @ 2014-05-12 21:37 UTC
To: mm-commits, josh, hans.verkuil, hverkuil
Subject: + fix-_ioc_typecheck-sparse-error.patch added to -mm tree
To: hverkuil@xs4all.nl,hans.verkuil@cisco.com,josh@joshtriplett.org
From: akpm@linux-foundation.org
Date: Mon, 12 May 2014 14:37:05 -0700
The patch titled
Subject: include/asm-generic/ioctl.h: fix _IOC_TYPECHECK sparse error
has been added to the -mm tree. Its filename is
fix-_ioc_typecheck-sparse-error.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/fix-_ioc_typecheck-sparse-error.patch
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/fix-_ioc_typecheck-sparse-error.patch
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Hans Verkuil <hverkuil@xs4all.nl>
Subject: include/asm-generic/ioctl.h: fix _IOC_TYPECHECK sparse error
When running sparse over drivers/media/v4l2-core/v4l2-ioctl.c I get these
errors:
drivers/media/v4l2-core/v4l2-ioctl.c:2043:9: error: bad integer constant expression
drivers/media/v4l2-core/v4l2-ioctl.c:2044:9: error: bad integer constant expression
drivers/media/v4l2-core/v4l2-ioctl.c:2045:9: error: bad integer constant expression
drivers/media/v4l2-core/v4l2-ioctl.c:2046:9: error: bad integer constant expression
etc.
The root cause of that turns out to be in include/asm-generic/ioctl.h:
#include <uapi/asm-generic/ioctl.h>
/* provoke compile error for invalid uses of size argument */
extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
((sizeof(t) == sizeof(t[1]) && \
sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
sizeof(t) : __invalid_size_argument_for_IOC)
If it is defined as this (as is already done if __KERNEL__ is not defined):
#define _IOC_TYPECHECK(t) (sizeof(t))
then all is well with the world.
This patch allows sparse to work correctly.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/asm-generic/ioctl.h | 5 +++++
1 file changed, 5 insertions(+)
diff -puN include/asm-generic/ioctl.h~fix-_ioc_typecheck-sparse-error include/asm-generic/ioctl.h
--- a/include/asm-generic/ioctl.h~fix-_ioc_typecheck-sparse-error
+++ a/include/asm-generic/ioctl.h
@@ -3,10 +3,15 @@
#include <uapi/asm-generic/ioctl.h>
+#ifdef __CHECKER__
+#define _IOC_TYPECHECK(t) (sizeof(t))
+#else
/* provoke compile error for invalid uses of size argument */
extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
((sizeof(t) == sizeof(t[1]) && \
sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
sizeof(t) : __invalid_size_argument_for_IOC)
+#endif
+
#endif /* _ASM_GENERIC_IOCTL_H */
_
Patches currently in -mm which might be from hverkuil@xs4all.nl are
fix-_ioc_typecheck-sparse-error.patch
linux-next.patch
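
The shape of the fix above can be tried in userspace. The following is a
sketch under stated assumptions, not the kernel header: the SKETCH_ names
are hypothetical, the size-field width of 14 bits is assumed for
illustration, and the "invalid size" symbol is given a definition here only
so that the example links at any optimization level (the kernel leaves it
undefined on purpose).

```c
#include <stddef.h>

/* Assumed width of the ioctl size field, for this sketch only. */
#define SKETCH_IOC_SIZEBITS 14

#ifdef __CHECKER__
/* A static checker only needs a plain integer constant expression. */
#define SKETCH_IOC_TYPECHECK(t) (sizeof(t))
#else
/*
 * In the kernel this symbol is declared but never defined, so a bad size
 * argument fails the build. It is defined here so the sketch always links.
 */
unsigned int sketch_invalid_size_argument;

/*
 * For a type t, t[1] is an array of one t, so the two sizes match.
 * The second test keeps the size within the ioctl size field.
 */
#define SKETCH_IOC_TYPECHECK(t) \
	((sizeof(t) == sizeof(t[1]) && \
	  sizeof(t) < (1 << SKETCH_IOC_SIZEBITS)) ? \
	  sizeof(t) : sketch_invalid_size_argument)
#endif
```

For any valid argument type both branches of the #ifdef evaluate to
sizeof(t), so generated ioctl numbers do not depend on whether the checker
or the compiler is looking at the code.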
* [PATCH] Push the file layout driver into a subdirectory
From: Thomas Haynes @ 2014-05-12 21:35 UTC
To: trond.myklebust; +Cc: linux-nfs, Tom Haynes
From: Tom Haynes <Thomas.Haynes@primarydata.com>
The object and block layouts already exist in their own
subdirectories. This patch completes the set!
Note that as a layout denotes nfs4 already, I stripped
that prefix out of the file names.
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
---
fs/nfs/Makefile | 4 +---
fs/nfs/filelayout/Makefile | 5 +++++
fs/nfs/{nfs4filelayout.c => filelayout/filelayout.c} | 10 +++++-----
fs/nfs/{nfs4filelayout.h => filelayout/filelayout.h} | 2 +-
fs/nfs/{nfs4filelayoutdev.c => filelayout/filelayoutdev.c} | 6 +++---
5 files changed, 15 insertions(+), 12 deletions(-)
create mode 100644 fs/nfs/filelayout/Makefile
rename fs/nfs/{nfs4filelayout.c => filelayout/filelayout.c} (99%)
rename fs/nfs/{nfs4filelayout.h => filelayout/filelayout.h} (99%)
rename fs/nfs/{nfs4filelayoutdev.c => filelayout/filelayoutdev.c} (99%)
diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 551d82a..43058f7 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -29,8 +29,6 @@ nfsv4-$(CONFIG_NFS_USE_LEGACY_DNS) += cache_lib.o
nfsv4-$(CONFIG_SYSCTL) += nfs4sysctl.o
nfsv4-$(CONFIG_NFS_V4_1) += pnfs.o pnfs_dev.o
-obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
-nfs_layout_nfsv41_files-y := nfs4filelayout.o nfs4filelayoutdev.o
-
+obj-$(CONFIG_PNFS_FILE_LAYOUT) += filelayout/
obj-$(CONFIG_PNFS_OBJLAYOUT) += objlayout/
obj-$(CONFIG_PNFS_BLOCK) += blocklayout/
diff --git a/fs/nfs/filelayout/Makefile b/fs/nfs/filelayout/Makefile
new file mode 100644
index 0000000..8516cdf
--- /dev/null
+++ b/fs/nfs/filelayout/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the pNFS Files Layout Driver kernel module
+#
+obj-$(CONFIG_PNFS_FILE_LAYOUT) += nfs_layout_nfsv41_files.o
+nfs_layout_nfsv41_files-y := filelayout.o filelayoutdev.o
diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/filelayout/filelayout.c
similarity index 99%
rename from fs/nfs/nfs4filelayout.c
rename to fs/nfs/filelayout/filelayout.c
index 7954e16..ffc2ed8 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/filelayout/filelayout.c
@@ -35,11 +35,11 @@
#include <linux/sunrpc/metrics.h>
-#include "nfs4session.h"
-#include "internal.h"
-#include "delegation.h"
-#include "nfs4filelayout.h"
-#include "nfs4trace.h"
+#include "../nfs4session.h"
+#include "../internal.h"
+#include "../delegation.h"
+#include "filelayout.h"
+#include "../nfs4trace.h"
#define NFSDBG_FACILITY NFSDBG_PNFS_LD
diff --git a/fs/nfs/nfs4filelayout.h b/fs/nfs/filelayout/filelayout.h
similarity index 99%
rename from fs/nfs/nfs4filelayout.h
rename to fs/nfs/filelayout/filelayout.h
index cebd20e..ffbddf2 100644
--- a/fs/nfs/nfs4filelayout.h
+++ b/fs/nfs/filelayout/filelayout.h
@@ -30,7 +30,7 @@
#ifndef FS_NFS_NFS4FILELAYOUT_H
#define FS_NFS_NFS4FILELAYOUT_H
-#include "pnfs.h"
+#include "../pnfs.h"
/*
* Default data server connection timeout and retrans vaules.
diff --git a/fs/nfs/nfs4filelayoutdev.c b/fs/nfs/filelayout/filelayoutdev.c
similarity index 99%
rename from fs/nfs/nfs4filelayoutdev.c
rename to fs/nfs/filelayout/filelayoutdev.c
index efac602..7c85390 100644
--- a/fs/nfs/nfs4filelayoutdev.c
+++ b/fs/nfs/filelayout/filelayoutdev.c
@@ -33,9 +33,9 @@
#include <linux/module.h>
#include <linux/sunrpc/addr.h>
-#include "internal.h"
-#include "nfs4session.h"
-#include "nfs4filelayout.h"
+#include "../internal.h"
+#include "../nfs4session.h"
+#include "filelayout.h"
#define NFSDBG_FACILITY NFSDBG_PNFS_LD
--
1.9.0
* Re: [PATCH 05/10 V2] workqueue: separate iteration role from worker_idr
From: Tejun Heo @ 2014-05-12 21:35 UTC
To: Lai Jiangshan; +Cc: linux-kernel, Peter Zijlstra, Viresh Kumar, Ingo Molnar
In-Reply-To: <1399877792-13046-6-git-send-email-laijs@cn.fujitsu.com>
On Mon, May 12, 2014 at 02:56:17PM +0800, Lai Jiangshan wrote:
> worker_idr has the iteration(iterating for attached workers) and worker ID
^
Please put a space before an opening parenthesis. It becomes a lot
easier on the eyes.
> duties. These two duties are not necessary tied together. We can separate
necessarily
> them and use a list for tracking attached workers and iteration
>
> After separation, we can add the rescuer workers to the list for iteration
> in future. worker_idr can't add rescuer workers due to rescuer workers
> can't allocate id from worker_idr.
Explaining how that'd be beneficial and justifies this change would be
nice.
...
> /**
> - * for_each_pool_worker - iterate through all workers of a worker_pool
> - * @worker: iteration cursor
> - * @wi: integer used for iteration
> - * @pool: worker_pool to iterate workers of
> - *
> - * This must be called with @pool->manager_mutex.
> - *
> - * The if/else clause exists only for the lockdep assertion and can be
> - * ignored.
> - */
> -#define for_each_pool_worker(worker, wi, pool) \
> - idr_for_each_entry(&(pool)->worker_idr, (worker), (wi)) \
> - if (({ lockdep_assert_held(&pool->manager_mutex); false; })) { } \
> - else
Hmmm.... don't we still want the lockdep protection?
> @@ -1772,6 +1759,8 @@ static struct worker *create_worker(struct worker_pool *pool)
>
> /* successful, commit the pointer to idr */
/* successful, commit it to idr and link on the workers list */
> idr_replace(&pool->worker_idr, worker, worker->id);
> + /* successful, attach the worker to the pool */
and lose the above line.
> diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
> index 7e2204d..c44abc3 100644
> --- a/kernel/workqueue_internal.h
> +++ b/kernel/workqueue_internal.h
> @@ -37,6 +37,7 @@ struct worker {
> struct task_struct *task; /* I: worker task */
> struct worker_pool *pool; /* I: the associated pool */
> /* L: for rescuers */
> + struct list_head node; /* M: attached to the pool */
/* M: anchored at pool->workers */
Thanks.
--
tejun
* [PATCH 5/5] net: tcp: add DCTCP congestion control algorithm
From: Florian Westphal @ 2014-05-12 20:59 UTC (permalink / raw)
To: netdev; +Cc: Daniel Borkmann, Glenn Judd, Florian Westphal
In-Reply-To: <1399928384-24143-1-git-send-email-fw@strlen.de>
From: Daniel Borkmann <dborkman@redhat.com>
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which was first published at SIGCOMM 2010 [2], with
a follow-up analysis at SIGMETRICS 2011 [3]. It is also available
as an informational IETF draft [5].
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are e.g.
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The design rests on two fundamentals: i) on the switch side, packets
are marked when the internal queue length exceeds K; ii) the sender
side maintains a moving average of the fraction of marked packets,
so once per RTT, F and alpha are updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha is then used to adaptively decrease the
congestion window W:
W := (1 - (alpha / 2)) * W
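As a rough illustration of the two update rules above, the following
user-space C sketch evaluates them in fixed point. It assumes alpha is
scaled so that 1024 represents 1.0 and that g = 1/2^shift_g, which
mirrors the arithmetic in tcp_dctcp.c from this patch. The names
alpha_update() and cwnd_reduce() are made up for this example and are
not kernel functions; the 64-bit intermediates are likewise a choice of
this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define ALPHA_MAX 1024U /* assumption: 1024 represents alpha == 1.0 */

/* Per-RTT update: alpha := (1 - g) * alpha + g * F,
 * with g = 1 / 2^shift_g and F = marked / total.
 */
static uint32_t alpha_update(uint32_t alpha, uint32_t marked,
			     uint32_t total, unsigned int shift_g)
{
	assert(shift_g <= 10U);

	/* Avoid a zero denominator when nothing was acked. */
	if (total == 0)
		total = 1;

	alpha = alpha - (alpha >> shift_g) +
		(uint32_t)(((uint64_t)marked << (10U - shift_g)) / total);

	return alpha > ALPHA_MAX ? ALPHA_MAX : alpha;
}

/* Window decrease: W := (1 - alpha / 2) * W, never below 2 segments. */
static uint32_t cwnd_reduce(uint32_t cwnd, uint32_t alpha)
{
	uint32_t w = cwnd - (uint32_t)(((uint64_t)cwnd * alpha) >> 11);

	return w < 2U ? 2U : w;
}
```

With shift_g = 4 (g = 1/16), a fully marked state (alpha = 1024) halves
W, while one RTT without any marks decays alpha from 1024 to 960.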
DCTCP relies on ECN both for marking packets on the switch side and
for receiving those marks. RFC3168 describes a mechanism for
using Explicit Congestion Notification from the switch for early
detection of congestion, rather than waiting for segment loss to
occur. However, this method only detects the presence of congestion,
not the extent. In the presence of mild congestion, it reduces the
TCP congestion window too aggressively and unnecessarily affects
the throughput of long flows [5]. DCTCP, as mentioned, enhances
Explicit Congestion Notification (ECN) processing to estimate the
fraction of bytes that encounter congestion, rather than simply
detecting that some congestion has occurred. DCTCP then scales the
TCP congestion window based on this estimate [5], thus it can derive
multibit feedback from the information present in the single-bit
sequence of marks in its control law and act in *proportion* to
the extent of congestion, not its *presence*. Switches therefore set
the Congestion Experienced (CE) codepoint in packets when internal
queue lengths exceed threshold K. As a result, DCTCP delivers the same
or better throughput than normal TCP, while using 90% less buffer
space. The Stanford paper reports that, for workloads derived from
operational measurements [2], DCTCP enables the applications to
handle 10x the current background traffic without impacting
foreground traffic. Moreover, a 10x increase in foreground traffic
did not cause any timeouts, so DCTCP largely eliminates incast
problems [2].
The algorithm itself has already seen deployments in large production
data centers. We implemented this patch set carefully, ran a
long-term stress test and analysis in a data center, and found
similar results, which we have written up in a documentation section.
Details can be found there; in short, a summary of our tests with
iperf, compared to CUBIC:
1) Timeouts (total over all flows, and per flow summaries):
        CUBIC         DCTCP
Total   3227          25
Mean    169.8421053   1.315789474
Median  183           1
Max     207           5
Min     123           0
Stddev  28.99092417   1.600438536
2) Throughput (per flow in Mbps):
          CUBIC         DCTCP
Mean      521.6842105   521.8947368
Median    464           523
Max       776           527
Min       403           519
Stddev    105.8909568   2.601169328
Fairness  0.962434227   0.999976467
3) Latency (in ms):
        CUBIC         DCTCP
Mean    4.0088        0.04219
Median  4.055         0.0395
Max     4.2           0.085
Min     3.32          0.028
Stddev  0.166692604   0.010640778
4) Convergence and stability test:
CUBIC                        DCTCP
Seconds  Flow 1  Flow 2      Seconds  Flow 1  Flow 2
0        9.93    0           0        9.92    0
0.5      9.87    0           0.5      9.86    0
1        8.73    2.25        1        6.46    4.88
1.5      7.29    2.8         1.5      4.9     4.99
2        6.96    3.1         2        4.92    4.94
2.5      6.67    3.34        2.5      4.93    5
3        6.39    3.57        3        4.92    4.99
3.5      6.24    3.75        3.5      4.94    4.74
4        6       3.94        4        5.34    4.71
4.5      5.88    4.09        4.5      4.99    4.97
5        5.27    4.98        5        4.83    5.01
5.5      4.93    5.04        5.5      4.89    4.99
6        4.9     4.99        6        4.92    5.04
6.5      4.93    5.1         6.5      4.91    4.97
7        4.28    5.8         7        4.97    4.97
7.5      4.62    4.91        7.5      4.99    4.82
8        5.05    4.45        8        5.16    4.76
8.5      5.93    4.09        8.5      4.94    4.98
9        5.73    4.2         9        4.92    5.02
9.5      5.62    4.32        9.5      4.87    5.03
10       6.12    3.2         10       4.91    5.01
10.5     6.91    3.11        10.5     4.87    5.04
11       8.48    0           11       8.49    4.94
11.5     9.87    0           11.5     9.9     0
Enabling DCTCP with this patch requires the following steps: DCTCP
must be running both on the sender and receiver side in your data
center, i.e.:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled at all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at 1Gbps,
and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
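To make the K > 1/7 * C * RTT heuristic concrete, here is a small
user-space C sketch that evaluates the lower bound for a given link
rate and RTT. The RTT values passed in are assumptions for
illustration only (this mail does not state the RTT behind the quoted
packet counts), and the names k_bound_bytes() and k_bound_packets()
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound on K in bytes: C * RTT / 7, truncated to whole bytes.
 * rate_bps is the link rate in bits per second, rtt_us the RTT in
 * microseconds (an assumed input, not taken from the patch).
 */
static uint64_t k_bound_bytes(uint64_t rate_bps, uint64_t rtt_us)
{
	/* bits * us -> bytes: divide by 8 (bits), 1e6 (us) and 7 (bound). */
	return (rate_bps * rtt_us) / (8ULL * 1000000ULL * 7ULL);
}

/* Smallest whole number of packets above the truncated byte bound. */
static uint64_t k_bound_packets(uint64_t rate_bps, uint64_t rtt_us,
				uint64_t pkt_bytes)
{
	assert(pkt_bytes > 0);

	return k_bound_bytes(rate_bps, rtt_us) / pkt_bytes + 1;
}
```

For an assumed RTT of 100 us and 1500-byte packets, the bound comes
to roughly 18 KB (12 packets) at 10Gbps, so the 65-packet figure
quoted above sits comfortably above it for such an RTT.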
There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.
Changes in the CA framework code are minimal, and the DCTCP algorithm
operates on mechanisms that are already available in silicon. The
gain g (set via dctcp_shift_g) defaults to the constant 1/16 from the
paper, but the user can carefully choose a different value.
If DCTCP is in use and ECN support on the peer side is off, DCTCP
falls back after the 3WHS to operate in normal TCP Reno mode.
The implementation itself is only around 300 loc of changes, the
rest is mainly documentation and our results in detail.
The implementation is heavily modified from an initial DCTCP patch
from [1] written by Abdul Kabbani, Masato Yasuda and Mohammad Alizadeh
from Stanford University. More information about DCTCP can be found
in [1-5].
[1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
[2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
[3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
[4] http://www.ietf.org/proceedings/80/slides/iccrg-3.pdf
[5] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
Documentation/networking/dctcp.txt | 232 +++++++++++++++++++++++++++
net/ipv4/Kconfig | 28 +++-
net/ipv4/Makefile | 1 +
net/ipv4/tcp_dctcp.c | 311 +++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_output.c | 1 +
5 files changed, 572 insertions(+), 1 deletion(-)
create mode 100644 Documentation/networking/dctcp.txt
create mode 100644 net/ipv4/tcp_dctcp.c
diff --git a/Documentation/networking/dctcp.txt b/Documentation/networking/dctcp.txt
new file mode 100644
index 0000000..b64d27f
--- /dev/null
+++ b/Documentation/networking/dctcp.txt
@@ -0,0 +1,232 @@
+DCTCP (DataCenter TCP)
+----------------------
+
+The description below provides a deployment example for people
+interested in running DCTCP in their data center network.
+
+1) Deployment scenario/example:
+
+The configuration for your data center is two-fold: it consists
+of a configuration of all switches and a configuration on the host
+side.
+
+1.1) Switch configuration:
+
+For each switch port, traffic was segregated into two queues.
+For any packet with a DSCP of 0x01 - or equivalently a TOS of
+0x04 - the packet was placed into the DCTCP queue. All other
+packets were placed into the default drop-tail queue. For the
+DCTCP queue, RED/ECN marking was enabled here with a marking
+threshold of 75 KB.
+
+1.2) Server configuration:
+
+The following configuration examples were used on the servers:
+
+1.2.1) DCTCP:
+
+# Set congestion control algorithm to DCTCP
+sysctl net.ipv4.tcp_congestion_control=dctcp
+# Set DSCP bits so that the switch can apply RED/ECN AQM to DCTCP traffic
+iptables -A OUTPUT -t mangle -p tcp -j TOS --or-tos 0x04
+
+1.2.2) CUBIC:
+
+# Set congestion control algorithm to CUBIC
+sysctl net.ipv4.tcp_congestion_control=cubic
+# Clear DSCP rule so that the switch applies drop-tail to CUBIC traffic
+iptables -D OUTPUT -t mangle -p tcp -j TOS --or-tos 0x04
+
+2) Example results:
+
+2.1) Incast test:
+
+This test measured DCTCP throughput and latency and compared
+it with CUBIC throughput and latency for an incast scenario.
+In this test, 19 senders sent at maximum rate to a single
+receiver. The receiver simply ran iperf -s.
+
+The senders ran iperf -c <receiver> -t 30. All senders started
+simultaneously (using local clocks synchronized by ntp).
+
+This test was repeated multiple times. The results below are
+from a single test. Other tests are similar. (DCTCP results were
+extremely consistent. CUBIC results show some variance induced
+by the TCP timeouts that CUBIC encountered.)
+
+For this test, we report statistics on the number of TCP timeouts,
+flow throughput, and traffic latency.
+
+2.1.1) Timeouts (total over all flows, and per flow summaries):
+
+        CUBIC         DCTCP
+Total   3227          25
+Mean    169.8421053   1.315789474
+Median  183           1
+Max     207           5
+Min     123           0
+Stddev  28.99092417   1.600438536
+
+Timeout data is taken by measuring the net change in netstat -s
+"other TCP timeouts" reported. As a result, the timeout
+measurements above are not restricted to the test traffic, and
+we believe that it is likely that all of the "DCTCP timeouts" are
+actually timeouts for non-test traffic. We report them
+nevertheless. CUBIC will also include some non-test timeouts, but
+they are dwarfed by bona fide test traffic timeouts for CUBIC.
+Clearly DCTCP does an excellent job of preventing TCP timeouts.
+DCTCP reduces timeouts by at least two orders of magnitude and
+may well have eliminated them in this scenario.
+
+2.1.2) Throughput (per flow in Mbps):
+
+          CUBIC         DCTCP
+Mean      521.6842105   521.8947368
+Median    464           523
+Max       776           527
+Min       403           519
+Stddev    105.8909568   2.601169328
+Fairness  0.962434227   0.999976467
+
+Throughput data was simply the average throughput for each flow
+reported by iperf. By avoiding TCP timeouts, DCTCP is able to
+achieve much better per-flow results. In CUBIC, many flows
+experience TCP timeouts which makes flow throughput
+unpredictable and unfair. DCTCP, on the other hand, provides
+very clean predictable throughput without incurring TCP timeouts.
+Thus, the standard deviation of CUBIC throughput is dramatically
+higher than the standard deviation of DCTCP throughput.
+
+Mean throughput is nearly identical because even though cubic
+flows suffer TCP timeouts, other flows will step in and fill
+the unused bandwidth. Note that this test is something of a
+best case scenario for incast under CUBIC: it allows other flows
+to fill in for flows experiencing a timeout. Under situations
+where the receiver is issuing requests and then waiting for all
+flows to complete, flows cannot fill in for timed out flows and
+throughput will drop dramatically.
+
+2.1.3) Latency (in ms):
+
+        CUBIC         DCTCP
+Mean    4.0088        0.04219
+Median  4.055         0.0395
+Max     4.2           0.085
+Min     3.32          0.028
+Stddev  0.166692604   0.010640778
+
+Latency for each protocol was computed by running "ping -i 0.2
+<receiver>" from a single sender to the receiver during the
+incast test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used
+to ensure that traffic traversed the DCTCP queue and was not
+dropped when the queue size was greater than the marking
+threshold. The summary statistics above are over all ping
+metrics measured between the single sender, receiver pair.
+
+The latency results for this test show a dramatic difference
+between CUBIC and DCTCP. CUBIC intentionally overflows the
+switch buffer which incurs the maximum queue latency (more
+buffer memory will lead to high latency.) DCTCP, on the other
+hand, deliberately attempts to keep queue occupancy low. The
+result is a two orders of magnitude reduction of latency with
+DCTCP - even with a switch with relatively little RAM. Switches
+with larger amounts of RAM will incur increasing amounts of
+latency for CUBIC, but not for DCTCP.
+
+2.2) Convergence and stability test:
+
+This test measured the time that DCTCP took to fairly
+redistribute bandwidth when a new flow commences. It also
+measured DCTCP's ability to remain stable at a fair
+bandwidth distribution. DCTCP is compared with CUBIC for
+this test.
+
+At the commencement of this test, a single flow is sending at
+maximum rate (near 10 Gbps) to a single receiver. One second
+after that first flow commences, a new flow from a distinct
+server begins sending to the same receiver as the first flow.
+After the second flow has sent data for 10 seconds, the second
+flow is terminated. The first flow sends for an additional
+second. Ideally, the bandwidth would be evenly shared as soon
+as the second flow starts, and recover as soon as it stops.
+
+The results of this test are shown below. Note that the flow
+bandwidth for the two flows was measured near the same time,
+but not simultaneously.
+
+DCTCP performs nearly perfectly within the measurement
+limitations of this test: bandwidth is quickly distributed
+fairly between the two flows, remains stable throughout the
+duration of the test, and recovers quickly. CUBIC, in contrast,
+is slow to divide the bandwidth fairly, and has trouble
+remaining stable.
+
+CUBIC                        DCTCP
+
+Seconds  Flow 1  Flow 2      Seconds  Flow 1  Flow 2
+0        9.93    0           0        9.92    0
+0.5      9.87    0           0.5      9.86    0
+1        8.73    2.25        1        6.46    4.88
+1.5      7.29    2.8         1.5      4.9     4.99
+2        6.96    3.1         2        4.92    4.94
+2.5      6.67    3.34        2.5      4.93    5
+3        6.39    3.57        3        4.92    4.99
+3.5      6.24    3.75        3.5      4.94    4.74
+4        6       3.94        4        5.34    4.71
+4.5      5.88    4.09        4.5      4.99    4.97
+5        5.27    4.98        5        4.83    5.01
+5.5      4.93    5.04        5.5      4.89    4.99
+6        4.9     4.99        6        4.92    5.04
+6.5      4.93    5.1         6.5      4.91    4.97
+7        4.28    5.8         7        4.97    4.97
+7.5      4.62    4.91        7.5      4.99    4.82
+8        5.05    4.45        8        5.16    4.76
+8.5      5.93    4.09        8.5      4.94    4.98
+9        5.73    4.2         9        4.92    5.02
+9.5      5.62    4.32        9.5      4.87    5.03
+10       6.12    3.2         10       4.91    5.01
+10.5     6.91    3.11        10.5     4.87    5.04
+11       8.48    0           11       8.49    4.94
+11.5     9.87    0           11.5     9.9     0
+
+2.3) SYN/ACK ECT test:
+
+This test demonstrates the importance of ECT on SYN and
+SYN-ACK packets by measuring the connection probability in
+the presence of competing flows for a DCTCP connection
+attempt *without* ECT in the SYN packet. The test was
+repeated five times for each number of competing flows.
+
+ Competing Flows               |  1   |  2   |  4   |  8   |  16
+ ------------------------------+------+------+------+------+------
+ Mean Connection Probability   |  1   | 0.67 | 0.45 | 0.28 |  0
+ Median Connection Probability |  1   | 0.65 | 0.45 | 0.25 |  0
+
+As the number of competing flows moves beyond 1, the
+connection probability drops rapidly.
+
+3) Further reading:
+
+3.1) DCTCP paper:
+
+The algorithm is further described in detail in the following two
+SIGCOMM/SIGMETRICS papers:
+
+ i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
+ Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan:
+ "Data Center TCP (DCTCP)", Data Center Networks session
+ Proc. ACM SIGCOMM, New Delhi, 2010.
+ http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
+
+ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar:
+ "Analysis of DCTCP: Stability, Convergence, and Fairness"
+ Proc. ACM SIGMETRICS, San Jose, 2011.
+ http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
+
+3.2) IETF informational draft:
+
+ http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
+
+3.3) DCTCP site:
+
+ http://simula.stanford.edu/~alizade/Site/DCTCP.html
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 05c57f0..02827bd 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -556,6 +556,29 @@ config TCP_CONG_ILLINOIS
For further details see:
http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
+config TCP_CONG_DCTCP
+ tristate "DataCenter TCP (DCTCP)"
+ default n
+ ---help---
+ DCTCP leverages Explicit Congestion Notification (ECN) in the network to
+ provide multi-bit feedback to the end hosts. It is designed to provide:
+
+ - High burst tolerance (incast due to partition/aggregate),
+ - Low latency (short flows, queries),
+ - High throughput (continuous data updates, large file transfers) with
+ commodity, shallow-buffered switches.
+
+ All switches in the data center network running DCTCP must support
+ ECN marking and be configured for marking when reaching defined switch
+ buffer thresholds. The default ECN marking threshold heuristic for
+ DCTCP on switches is 20 packets (30KB) at 1Gbps, and 65 packets
+ (~100KB) at 10Gbps.
+
+ If unsure what all that means, say N.
+
+ For further details see:
+ http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
+
choice
prompt "Default TCP congestion control"
default DEFAULT_CUBIC
@@ -584,9 +607,11 @@ choice
config DEFAULT_WESTWOOD
bool "Westwood" if TCP_CONG_WESTWOOD=y
+ config DEFAULT_DCTCP
+ bool "DCTCP" if TCP_CONG_DCTCP=y
+
config DEFAULT_RENO
bool "Reno"
-
endchoice
endif
@@ -606,6 +631,7 @@ config DEFAULT_TCP_CONG
default "westwood" if DEFAULT_WESTWOOD
default "veno" if DEFAULT_VENO
default "reno" if DEFAULT_RENO
+ default "dctcp" if DEFAULT_DCTCP
default "cubic"
config TCP_MD5SIG
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index f032688..4497d40 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
obj-$(CONFIG_TCP_CONG_CUBIC) += tcp_cubic.o
+obj-$(CONFIG_TCP_CONG_DCTCP) += tcp_dctcp.o
obj-$(CONFIG_TCP_CONG_WESTWOOD) += tcp_westwood.o
obj-$(CONFIG_TCP_CONG_HSTCP) += tcp_highspeed.o
obj-$(CONFIG_TCP_CONG_HYBLA) += tcp_hybla.o
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
new file mode 100644
index 0000000..bb9d9ef
--- /dev/null
+++ b/net/ipv4/tcp_dctcp.c
@@ -0,0 +1,311 @@
+/* DataCenter TCP (DCTCP) congestion control.
+ *
+ * http://simula.stanford.edu/~alizade/Site/DCTCP.html
+ *
+ * This is an implementation of DCTCP over Reno, an enhancement to the
+ * TCP congestion control algorithm designed for data centers. DCTCP
+ * leverages Explicit Congestion Notification (ECN) in the network to
+ * provide multi-bit feedback to the end hosts. DCTCP's goal is to meet
+ * the following three data center transport requirements:
+ *
+ * - High burst tolerance (incast due to partition/aggregate)
+ * - Low latency (short flows, queries)
+ * - High throughput (continuous data updates, large file transfers)
+ * with commodity shallow buffered switches
+ *
+ * The algorithm is described in detail in the following two papers:
+ *
+ * 1) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
+ * Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan:
+ * "Data Center TCP (DCTCP)", Data Center Networks session
+ * Proc. ACM SIGCOMM, New Delhi, 2010.
+ * http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
+ *
+ * 2) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar:
+ * "Analysis of DCTCP: Stability, Convergence, and Fairness"
+ * Proc. ACM SIGMETRICS, San Jose, 2011.
+ * http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
+ *
+ * Implemented from an initial implementation of DCTCP from Abdul Kabbani,
+ * Masato Yasuda, and Mohammad Alizadeh.
+ *
+ * Authors:
+ *
+ * Daniel Borkmann <dborkman@redhat.com>
+ * Florian Westphal <fw@strlen.de>
+ * Glenn Judd <glenn.judd@morganstanley.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or (at
+ * your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <net/tcp.h>
+
+#define DCTCP_MAX_ALPHA 1024U
+
+struct dctcp {
+ u32 acked_bytes_ecn;
+ u32 acked_bytes_total;
+ u32 prior_snd_una;
+ u32 prior_rcv_nxt;
+ u32 dctcp_alpha;
+ u32 next_seq;
+ /* false: last pkt was non-ce
+ * true: last pkt was ce
+ */
+ bool ce_state;
+ bool delayed_ack_reserved;
+};
+
+static unsigned int dctcp_shift_g __read_mostly = 4; /* g = 1/2^4 */
+module_param(dctcp_shift_g, uint, 0644);
+MODULE_PARM_DESC(dctcp_shift_g, "parameter g for updating dctcp_alpha");
+
+static unsigned int dctcp_alpha_on_init __read_mostly = DCTCP_MAX_ALPHA;
+module_param(dctcp_alpha_on_init, uint, 0644);
+MODULE_PARM_DESC(dctcp_alpha_on_init, "parameter for initial alpha value");
+
+static unsigned int dctcp_clamp_alpha_on_loss __read_mostly = 0;
+module_param(dctcp_clamp_alpha_on_loss, uint, 0644);
+MODULE_PARM_DESC(dctcp_clamp_alpha_on_loss,
+ "parameter for clamping alpha on loss");
+
+static struct tcp_congestion_ops dctcp_reno;
+
+static void dctcp_reset(const struct tcp_sock *tp, struct dctcp *ca)
+{
+ ca->next_seq = tp->snd_nxt;
+
+ ca->acked_bytes_ecn = 0;
+ ca->acked_bytes_total = 0;
+}
+
+static void dctcp_init(struct sock *sk)
+{
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+ if (tp->ecn_flags & TCP_ECN_OK) {
+ struct dctcp *ca = inet_csk_ca(sk);
+
+ ca->prior_snd_una = tp->snd_una;
+ ca->prior_rcv_nxt = tp->rcv_nxt;
+
+ ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
+
+ ca->delayed_ack_reserved = false;
+ ca->ce_state = false;
+
+ dctcp_reset(tp, ca);
+ return;
+ }
+
+ /* No ECN support? Fall back to Reno. */
+ inet_csk(sk)->icsk_ca_ops = &dctcp_reno;
+ pr_debug("sk:%p fallback to Reno due to missing ECN support\n", sk);
+}
+
+static u32 dctcp_ssthresh(struct sock *sk)
+{
+ const struct dctcp *ca = inet_csk_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U);
+}
+
+static void dctcp_ce_state_0_to_1(struct sock *sk)
+{
+ struct dctcp *ca = inet_csk_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ /* State has changed from CE=0 to CE=1 and delayed
+ * ACK has not been sent yet.
+ */
+ if (!ca->ce_state && ca->delayed_ack_reserved) {
+ u32 tmp_rcv_nxt;
+
+ /* Save current rcv_nxt. */
+ tmp_rcv_nxt = tp->rcv_nxt;
+
+ /* Generate previous ack with CE=0. */
+ tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+ tp->rcv_nxt = ca->prior_rcv_nxt;
+
+ tcp_send_ack(sk);
+
+ /* Recover current rcv_nxt. */
+ tp->rcv_nxt = tmp_rcv_nxt;
+ }
+
+ ca->prior_rcv_nxt = tp->rcv_nxt;
+ ca->ce_state = true;
+
+ tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+}
+
+static void dctcp_ce_state_1_to_0(struct sock *sk)
+{
+ struct dctcp *ca = inet_csk_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ /* State has changed from CE=1 to CE=0 and delayed
+ * ACK has not been sent yet.
+ */
+ if (ca->ce_state && ca->delayed_ack_reserved) {
+ u32 tmp_rcv_nxt;
+
+ /* Save current rcv_nxt. */
+ tmp_rcv_nxt = tp->rcv_nxt;
+
+ /* Generate previous ack with CE=1. */
+ tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+ tp->rcv_nxt = ca->prior_rcv_nxt;
+
+ tcp_send_ack(sk);
+
+ /* Recover current rcv_nxt. */
+ tp->rcv_nxt = tmp_rcv_nxt;
+ }
+
+ ca->prior_rcv_nxt = tp->rcv_nxt;
+ ca->ce_state = false;
+
+ tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+}
+
+static void dctcp_update_alpha(struct sock *sk, u32 flags)
+{
+ const struct tcp_sock *tp = tcp_sk(sk);
+ struct dctcp *ca = inet_csk_ca(sk);
+ u32 acked_bytes = tp->snd_una - ca->prior_snd_una;
+
+ /* If ack did not advance snd_una, count dupack as MSS size.
+ * If ack did update window, do not count it at all.
+ */
+ if (acked_bytes == 0 && !(flags & CA_ACK_WIN_UPDATE))
+ acked_bytes = inet_csk(sk)->icsk_ack.rcv_mss;
+ if (acked_bytes) {
+ ca->acked_bytes_total += acked_bytes;
+ ca->prior_snd_una = tp->snd_una;
+
+ if (flags & CA_ACK_ECE)
+ ca->acked_bytes_ecn += acked_bytes;
+ }
+
+ /* Expired RTT */
+ if (!before(tp->snd_una, ca->next_seq)) {
+ /* For avoiding denominator == 0. */
+ if (ca->acked_bytes_total == 0)
+ ca->acked_bytes_total = 1;
+
+ /* alpha = (1 - g) * alpha + g * F */
+ ca->dctcp_alpha = ca->dctcp_alpha -
+ (ca->dctcp_alpha >> dctcp_shift_g) +
+ (ca->acked_bytes_ecn << (10U - dctcp_shift_g)) /
+ ca->acked_bytes_total;
+
+ if (ca->dctcp_alpha > DCTCP_MAX_ALPHA)
+ /* Clamp dctcp_alpha to max. */
+ ca->dctcp_alpha = DCTCP_MAX_ALPHA;
+
+ dctcp_reset(tp, ca);
+ }
+}
+
+static void dctcp_state(struct sock *sk, u8 new_state)
+{
+ if (dctcp_clamp_alpha_on_loss && new_state == TCP_CA_Loss) {
+ struct dctcp *ca = inet_csk_ca(sk);
+
+ /* If this extension is enabled, we clamp dctcp_alpha to
+ * max on packet loss; the motivation is that dctcp_alpha
+ * is an indicator of the extent of congestion and packet
+ * loss is an indicator of extreme congestion; setting
+ * this in practice turned out to be beneficial, and
+ * effectively assumes total congestion which reduces the
+ * window by half.
+ */
+ ca->dctcp_alpha = DCTCP_MAX_ALPHA;
+ }
+}
+
+static void dctcp_update_ack_reserved(struct sock *sk, enum tcp_ca_event ev)
+{
+ struct dctcp *ca = inet_csk_ca(sk);
+
+ switch (ev) {
+ case CA_EVENT_DELAYED_ACK:
+ if (!ca->delayed_ack_reserved)
+ ca->delayed_ack_reserved = true;
+ break;
+ case CA_EVENT_NON_DELAYED_ACK:
+ if (ca->delayed_ack_reserved)
+ ca->delayed_ack_reserved = false;
+ break;
+ default:
+ /* Don't care for the rest. */
+ break;
+ }
+}
+
+static void dctcp_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
+{
+ switch (ev) {
+ case CA_EVENT_ECN_IS_CE:
+ dctcp_ce_state_0_to_1(sk);
+ break;
+ case CA_EVENT_ECN_NO_CE:
+ dctcp_ce_state_1_to_0(sk);
+ break;
+ case CA_EVENT_DELAYED_ACK:
+ case CA_EVENT_NON_DELAYED_ACK:
+ dctcp_update_ack_reserved(sk, ev);
+ break;
+ default:
+ /* Don't care for the rest. */
+ break;
+ }
+}
+
+static struct tcp_congestion_ops dctcp __read_mostly = {
+ .init = dctcp_init,
+ .in_ack_event = dctcp_update_alpha,
+ .cwnd_event = dctcp_cwnd_event,
+ .ssthresh = dctcp_ssthresh,
+ .cong_avoid = tcp_reno_cong_avoid,
+ .set_state = dctcp_state,
+ .flags = TCP_CONG_NEEDS_ECN,
+ .owner = THIS_MODULE,
+ .name = "dctcp",
+};
+
+static struct tcp_congestion_ops dctcp_reno __read_mostly = {
+ .ssthresh = tcp_reno_ssthresh,
+ .cong_avoid = tcp_reno_cong_avoid,
+ .owner = THIS_MODULE,
+ .name = "dctcp-reno",
+};
+
+static int __init dctcp_register(void)
+{
+ BUILD_BUG_ON(sizeof(struct dctcp) > ICSK_CA_PRIV_SIZE);
+ return tcp_register_congestion_control(&dctcp);
+}
+
+static void __exit dctcp_unregister(void)
+{
+ tcp_unregister_congestion_control(&dctcp);
+}
+
+module_init(dctcp_register);
+module_exit(dctcp_unregister);
+
+MODULE_AUTHOR("Daniel Borkmann <dborkman@redhat.com>");
+MODULE_AUTHOR("Florian Westphal <fw@strlen.de>");
+MODULE_AUTHOR("Glenn Judd <glenn.judd@morganstanley.com>");
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("DataCenter TCP (DCTCP)");
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1f5e04a..d398b88 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3200,6 +3200,7 @@ void tcp_send_ack(struct sock *sk)
TCP_SKB_CB(buff)->when = tcp_time_stamp;
tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
}
+EXPORT_SYMBOL_GPL(tcp_send_ack);
/* This routine sends a packet with an out of date sequence
* number. It assumes the other end will try to ack it.
--
1.8.1.5
* [PATCH 4/5] net: tcp: more detailed ACK events, and events for CE marked packets
From: Florian Westphal @ 2014-05-12 20:59 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal, Daniel Borkmann, Glenn Judd
In-Reply-To: <1399928384-24143-1-git-send-email-fw@strlen.de>
DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
and ACK properties, e.g. an ACK that updates the window is treated
differently than a DUPACK.
Also, DCTCP needs to know whether an ACK was a delayed ACK. Furthermore,
DCTCP implements a CE state machine that keeps track of CE markings
of incoming packets.
Therefore, extend the congestion control framework to provide these
event types, so that DCTCP can be properly implemented as a normal
congestion algorithm module outside the core stack.
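For illustration, a user-space C sketch of how a congestion control
module might consume the ACK flags when counting acked bytes. The
flag values mirror the tcp_ca_ack_event_flags enum in this patch,
renamed with a MOCK_ prefix; struct ack_acct and account_ack() are
hypothetical names for this example, and the accounting rule (a
dupack counts as one MSS, a pure window update counts as nothing) is
modeled on the DCTCP module rather than on anything the framework
itself enforces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit positions as tcp_ca_ack_event_flags in this patch. */
enum mock_ack_flags {
	MOCK_ACK_SLOWPATH   = (1 << 0), /* in slow path processing */
	MOCK_ACK_WIN_UPDATE = (1 << 1), /* ACK updated window */
	MOCK_ACK_ECE        = (1 << 2), /* ECE bit is set on ack */
};

/* Hypothetical per-connection byte counters. */
struct ack_acct {
	uint32_t bytes_total;
	uint32_t bytes_ecn;
};

/* newly_acked: bytes by which snd_una advanced; mss: receiver MSS. */
static void account_ack(struct ack_acct *a, uint32_t newly_acked,
			uint32_t mss, uint32_t flags)
{
	uint32_t acked = newly_acked;

	assert(a != NULL);

	/* A dupack counts as one MSS; a window update counts as nothing. */
	if (acked == 0 && !(flags & MOCK_ACK_WIN_UPDATE))
		acked = mss;

	if (acked) {
		a->bytes_total += acked;
		if (flags & MOCK_ACK_ECE)
			a->bytes_ecn += acked;
	}
}
```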
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/tcp.h | 9 ++++++++-
net/ipv4/tcp_input.c | 22 ++++++++++++++++++----
net/ipv4/tcp_output.c | 4 ++++
3 files changed, 30 insertions(+), 5 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0d767d2..56bf383 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -754,10 +754,17 @@ enum tcp_ca_event {
CA_EVENT_CWND_RESTART, /* congestion window restart */
CA_EVENT_COMPLETE_CWR, /* end of congestion recovery */
CA_EVENT_LOSS, /* loss timeout */
+ CA_EVENT_ECN_NO_CE, /* ECT set, but not CE marked */
+ CA_EVENT_ECN_IS_CE, /* received CE marked IP packet */
+ CA_EVENT_DELAYED_ACK, /* Delayed ack is sent */
+ CA_EVENT_NON_DELAYED_ACK,
};
+/* information about inbound ACK, passed to cong_ops->in_ack_event() */
enum tcp_ca_ack_event_flags {
- CA_ACK_SLOWPATH = (1 << 0),
+ CA_ACK_SLOWPATH = (1 << 0), /* in slow path processing */
+ CA_ACK_WIN_UPDATE = (1 << 1), /* ACK updated window */
+ CA_ACK_ECE = (1 << 2), /* ECE bit is set on ack */
};
/*
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7fab1da..bf0f734 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -232,14 +232,21 @@ static inline void TCP_ECN_check_ce(struct tcp_sock *tp, const struct sk_buff *s
tcp_enter_quickack_mode((struct sock *)tp);
break;
case INET_ECN_CE:
+ if (tcp_ca_needs_ecn((struct sock *)tp))
+ tcp_ca_event((struct sock *)tp, CA_EVENT_ECN_IS_CE);
+
if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
/* Better not delay acks, sender can have a very low cwnd */
tcp_enter_quickack_mode((struct sock *)tp);
tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}
- /* fallinto */
+ tp->ecn_flags |= TCP_ECN_SEEN;
+ break;
default:
+ if (tcp_ca_needs_ecn((struct sock *)tp))
+ tcp_ca_event((struct sock *)tp, CA_EVENT_ECN_NO_CE);
tp->ecn_flags |= TCP_ECN_SEEN;
+ break;
}
}
@@ -3421,10 +3428,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tp->snd_una = ack;
flag |= FLAG_WIN_UPDATE;
- tcp_in_ack_event(sk, 0);
+ tcp_in_ack_event(sk, CA_ACK_WIN_UPDATE);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPACKS);
} else {
+ u32 ack_ev_flags = CA_ACK_SLOWPATH;
+
if (ack_seq != TCP_SKB_CB(skb)->end_seq)
flag |= FLAG_DATA;
else
@@ -3436,10 +3445,15 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
&sack_rtt_us);
- if (TCP_ECN_rcv_ecn_echo(tp, tcp_hdr(skb)))
+ if (TCP_ECN_rcv_ecn_echo(tp, tcp_hdr(skb))) {
flag |= FLAG_ECE;
+ ack_ev_flags |= CA_ACK_ECE;
+ }
+
+ if (flag & FLAG_WIN_UPDATE)
+ ack_ev_flags |= CA_ACK_WIN_UPDATE;
- tcp_in_ack_event(sk, CA_ACK_SLOWPATH);
+ tcp_in_ack_event(sk, ack_ev_flags);
}
/* We passed data and got it acked, remove any soft error
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1f983dd..1f5e04a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3119,6 +3119,8 @@ void tcp_send_delayed_ack(struct sock *sk)
int ato = icsk->icsk_ack.ato;
unsigned long timeout;
+ tcp_ca_event(sk, CA_EVENT_DELAYED_ACK);
+
if (ato > TCP_DELACK_MIN) {
const struct tcp_sock *tp = tcp_sk(sk);
int max_ato = HZ / 2;
@@ -3175,6 +3177,8 @@ void tcp_send_ack(struct sock *sk)
if (sk->sk_state == TCP_CLOSE)
return;
+ tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
+
/* We are not putting this on the write queue, so
* tcp_transmit_skb() will set the ownership to this
* sock.
--
1.8.1.5
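The way the tcp_ack() hunks above compose the ACK event flags can be sketched in user space as follows. This is only an illustration, not the kernel code: build_ack_flags() is a hypothetical helper, and the value of CA_ACK_WIN_UPDATE is an assumption because its definition is not quoted in this excerpt (CA_ACK_SLOWPATH and CA_ACK_ECE use the values shown in the patches).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum ca_ack_event_flags {
    CA_ACK_SLOWPATH   = (1 << 0), /* value from patch 3/5 */
    CA_ACK_WIN_UPDATE = (1 << 1), /* assumed value, definition not quoted */
    CA_ACK_ECE        = (1 << 2), /* value from the tcp.h hunk above */
};

/* Hypothetical stand-in for the flag-building logic in tcp_ack(). */
static uint32_t build_ack_flags(bool fast_path, bool ece, bool win_update)
{
    uint32_t flags;

    /* Fast path: the hunk reports a window update and nothing else. */
    if (fast_path)
        return CA_ACK_WIN_UPDATE;

    /* Slow path: start from SLOWPATH and add what was observed. */
    flags = CA_ACK_SLOWPATH;
    if (ece)
        flags |= CA_ACK_ECE;
    if (win_update)
        flags |= CA_ACK_WIN_UPDATE;

    /* Sanity check: every slow-path result carries the SLOWPATH bit. */
    assert(flags & CA_ACK_SLOWPATH);
    return flags;
}
```

A congestion control module would then test individual bits of the value it receives in its in_ack_event callback, for example flags & CA_ACK_ECE.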
* [PATCH 3/5] net: tcp: split ack slow/fast events from cwnd_event
From: Florian Westphal @ 2014-05-12 20:59 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal, Daniel Borkmann, Glenn Judd
In-Reply-To: <1399928384-24143-1-git-send-email-fw@strlen.de>
The congestion control op "cwnd_event" currently supports the
CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others).
Both FAST_ACK and SLOW_ACK are only used by the Westwood CC algorithm.
This removes both events from cwnd_event and adds a new in_ack_event
callback for them.
The goal is to be able to provide more detailed information
about ACKs, such as whether the ECE flag was set, or whether the ACK
resulted in a window update.
This is required for the DataCenter TCP (DCTCP) congestion control
algorithm, as it makes a different choice depending on whether ECE
is set.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/tcp.h | 8 ++++++--
net/ipv4/tcp_input.c | 12 ++++++++++--
net/ipv4/tcp_westwood.c | 30 ++++++++++++++++--------------
3 files changed, 32 insertions(+), 18 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 92d1600..0d767d2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -754,8 +754,10 @@ enum tcp_ca_event {
CA_EVENT_CWND_RESTART, /* congestion window restart */
CA_EVENT_COMPLETE_CWR, /* end of congestion recovery */
CA_EVENT_LOSS, /* loss timeout */
- CA_EVENT_FAST_ACK, /* in sequence ack */
- CA_EVENT_SLOW_ACK, /* other ack */
+};
+
+enum tcp_ca_ack_event_flags {
+ CA_ACK_SLOWPATH = (1 << 0),
};
/*
@@ -785,6 +787,8 @@ struct tcp_congestion_ops {
void (*set_state)(struct sock *sk, u8 new_state);
/* call when cwnd event occurs (optional) */
void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
+ /* call when ack arrives (optional) */
+ void (*in_ack_event)(struct sock *sk, u32 flags);
/* new value of cwnd after loss (optional) */
u32 (*undo_cwnd)(struct sock *sk);
/* hook for packet ack accounting (optional) */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 350b207..7fab1da 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3356,6 +3356,14 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
}
}
+static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
+{
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+
+ if (icsk->icsk_ca_ops->in_ack_event)
+ icsk->icsk_ca_ops->in_ack_event(sk, flags);
+}
+
/* This routine deals with incoming acks, but not outgoing ones. */
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
@@ -3413,7 +3421,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tp->snd_una = ack;
flag |= FLAG_WIN_UPDATE;
- tcp_ca_event(sk, CA_EVENT_FAST_ACK);
+ tcp_in_ack_event(sk, 0);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPACKS);
} else {
@@ -3431,7 +3439,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (TCP_ECN_rcv_ecn_echo(tp, tcp_hdr(skb)))
flag |= FLAG_ECE;
- tcp_ca_event(sk, CA_EVENT_SLOW_ACK);
+ tcp_in_ack_event(sk, CA_ACK_SLOWPATH);
}
/* We passed data and got it acked, remove any soft error
diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c
index b94a04a..1c5b0df 100644
--- a/net/ipv4/tcp_westwood.c
+++ b/net/ipv4/tcp_westwood.c
@@ -222,39 +222,41 @@ static u32 tcp_westwood_bw_rttmin(const struct sock *sk)
return max_t(u32, (w->bw_est * w->rtt_min) / tp->mss_cache, 2);
}
+static void tcp_westwood_ack(struct sock *sk, u32 ack_flags)
+{
+ if (ack_flags & CA_ACK_SLOWPATH) {
+ struct westwood *w = inet_csk_ca(sk);
+
+ westwood_update_window(sk);
+ w->bk += westwood_acked_count(sk);
+
+ update_rtt_min(w);
+ return;
+ }
+
+ westwood_fast_bw(sk);
+}
+
static void tcp_westwood_event(struct sock *sk, enum tcp_ca_event event)
{
struct tcp_sock *tp = tcp_sk(sk);
struct westwood *w = inet_csk_ca(sk);
switch (event) {
- case CA_EVENT_FAST_ACK:
- westwood_fast_bw(sk);
- break;
-
case CA_EVENT_COMPLETE_CWR:
tp->snd_cwnd = tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk);
break;
-
case CA_EVENT_LOSS:
tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk);
/* Update RTT_min when next ack arrives */
w->reset_rtt_min = 1;
break;
-
- case CA_EVENT_SLOW_ACK:
- westwood_update_window(sk);
- w->bk += westwood_acked_count(sk);
- update_rtt_min(w);
- break;
-
default:
/* don't care */
break;
}
}
-
/* Extract info for Tcp socket info provided via netlink. */
static void tcp_westwood_info(struct sock *sk, u32 ext,
struct sk_buff *skb)
@@ -271,12 +273,12 @@ static void tcp_westwood_info(struct sock *sk, u32 ext,
}
}
-
static struct tcp_congestion_ops tcp_westwood __read_mostly = {
.init = tcp_westwood_init,
.ssthresh = tcp_reno_ssthresh,
.cong_avoid = tcp_reno_cong_avoid,
.cwnd_event = tcp_westwood_event,
+ .in_ack_event = tcp_westwood_ack,
.get_info = tcp_westwood_info,
.pkts_acked = tcp_westwood_pkts_acked,
--
1.8.1.5
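The optional-callback pattern this patch introduces can be sketched outside the kernel roughly as below. All names here (struct conn, struct cong_ops, westwood_like_ack) are hypothetical reductions for illustration, and the counters merely stand in for the Westwood bandwidth bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CA_ACK_SLOWPATH (1u << 0) /* same value as in the patch */

struct conn;

/* Hypothetical, heavily reduced version of tcp_congestion_ops. */
struct cong_ops {
    void (*in_ack_event)(struct conn *c, uint32_t flags); /* optional */
};

struct conn {
    const struct cong_ops *ops;
    unsigned int slow_acks; /* stands in for the slow-path bookkeeping */
    unsigned int fast_acks; /* stands in for westwood_fast_bw() */
};

/* Mirrors tcp_in_ack_event(): call the hook only if the module set one. */
static void in_ack_event(struct conn *c, uint32_t flags)
{
    assert(c && c->ops);
    if (c->ops->in_ack_event)
        c->ops->in_ack_event(c, flags);
}

/* Mirrors the shape of tcp_westwood_ack(): branch on the SLOWPATH bit. */
static void westwood_like_ack(struct conn *c, uint32_t flags)
{
    if (flags & CA_ACK_SLOWPATH) {
        c->slow_acks++;
        return;
    }
    c->fast_acks++;
}

static const struct cong_ops westwood_like = { .in_ack_event = westwood_like_ack };
static const struct cong_ops no_hook = { .in_ack_event = NULL };
```

Modules that do not care about per-ACK details simply leave the hook unset, as no_hook does, and pay only for the NULL check.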
* [PATCH 2/5] net: tcp: add flag for ca to indicate that ECN is required
From: Florian Westphal @ 2014-05-12 20:59 UTC (permalink / raw)
To: netdev; +Cc: Daniel Borkmann, Glenn Judd, Florian Westphal
In-Reply-To: <1399928384-24143-1-git-send-email-fw@strlen.de>
From: Daniel Borkmann <dborkman@redhat.com>
This patch adds a flag to TCP congestion algorithms that lets an
algorithm request that IPv4/IPv6 sockets be marked as ECN-capable
transport, that is, ECT(0), when the algorithm requires it.
It is currently used and needed in DataCenter TCP (DCTCP), as it
requires both peers to assert ECT on all IP packets sent - it
uses ECN feedback (i.e. CE, Congestion Encountered) from switches
inside the data center to derive feedback to the end hosts.
Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
algorithm/behaviour slightly diverges from RFC3168, therefore this
is only (!) enabled if the assigned congestion control ops module
has requested it. That way, this logic is tightly coupled to the
congestion control ops that ask for it.
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/tcp.h | 45 ++++++++++++++++++++++++++++-----------------
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_output.c | 9 ++++++++-
net/ipv6/tcp_ipv6.c | 2 +-
4 files changed, 38 insertions(+), 20 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8ae165f..92d1600 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -734,23 +734,6 @@ struct tcp_skb_cb {
#define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0]))
-/* RFC3168 : 6.1.1 SYN packets must not have ECT/ECN bits set
- *
- * If we receive a SYN packet with these bits set, it means a network is
- * playing bad games with TOS bits. In order to avoid possible false congestion
- * notifications, we disable TCP ECN negociation.
- */
-static inline void
-TCP_ECN_create_request(struct request_sock *req, const struct sk_buff *skb,
- struct net *net)
-{
- const struct tcphdr *th = tcp_hdr(skb);
-
- if (net->ipv4.sysctl_tcp_ecn && th->ece && th->cwr &&
- INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield))
- inet_rsk(req)->ecn_ok = 1;
-}
-
/* Due to TSO, an SKB can be composed of multiple actual
* packets. To keep these tracked properly, we use this.
*/
@@ -783,6 +766,7 @@ enum tcp_ca_event {
#define TCP_CA_BUF_MAX (TCP_CA_NAME_MAX*TCP_CA_MAX)
#define TCP_CONG_NON_RESTRICTED 0x1
+#define TCP_CONG_NEEDS_ECN 0x2
struct tcp_congestion_ops {
struct list_head list;
@@ -848,6 +832,13 @@ static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state)
icsk->icsk_ca_state = ca_state;
}
+static inline bool tcp_ca_needs_ecn(const struct sock *sk)
+{
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+
+ return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ECN;
+}
+
static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -856,6 +847,26 @@ static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
icsk->icsk_ca_ops->cwnd_event(sk, event);
}
+/* RFC3168 : 6.1.1 SYN packets must not have ECT/ECN bits set
+ *
+ * If we receive a SYN packet with these bits set, it means a network is
+ * playing bad games with TOS bits. In order to avoid possible false congestion
+ * notifications, we disable TCP ECN negociation.
+ */
+static inline void
+TCP_ECN_create_request(struct request_sock *req, const struct sk_buff *skb,
+ const struct sock *listen_sk)
+{
+ const struct tcphdr *th = tcp_hdr(skb);
+ const struct net *net = sock_net(listen_sk);
+
+ if (net->ipv4.sysctl_tcp_ecn && th->ece && th->cwr &&
+ (INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield) ||
+ tcp_ca_needs_ecn(listen_sk))) {
+ inet_rsk(req)->ecn_ok = 1;
+ }
+}
+
/* These functions determine how the current flow behaves in respect of SACK
* handling. SACK is negotiated with the peer, and therefore it can vary
* between different flows.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad166dc..d7d9c54 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1512,7 +1512,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
goto drop_and_free;
if (!want_cookie || tmp_opt.tstamp_ok)
- TCP_ECN_create_request(req, skb, sock_net(sk));
+ TCP_ECN_create_request(req, skb, sk);
if (want_cookie) {
isn = cookie_v4_init_sequence(sk, skb, &req->mss);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 694711a..1f983dd 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -331,7 +331,8 @@ static inline void TCP_ECN_send_syn(struct sock *sk, struct sk_buff *skb)
struct tcp_sock *tp = tcp_sk(sk);
tp->ecn_flags = 0;
- if (sock_net(sk)->ipv4.sysctl_tcp_ecn == 1) {
+ if (sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
+ tcp_ca_needs_ecn(sk)) {
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
tp->ecn_flags = TCP_ECN_OK;
}
@@ -953,6 +954,9 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
TCP_ECN_send(sk, skb, tcp_header_size);
+ if (tcp_ca_needs_ecn(sk))
+ INET_ECN_xmit(sk);
+
#ifdef CONFIG_TCP_MD5SIG
/* Calculate the MD5 hash, as we have all we need now */
if (md5) {
@@ -2860,6 +2864,9 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
th->doff = (tcp_header_size >> 2);
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
+ if (tcp_ca_needs_ecn(sk))
+ INET_ECN_xmit(sk);
+
#ifdef CONFIG_TCP_MD5SIG
/* Okay, we have all we need - do the md5 hash if needed */
if (md5) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 7fa6743..b1c8765 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1013,7 +1013,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
if (!want_cookie || tmp_opt.tstamp_ok)
- TCP_ECN_create_request(req, skb, sock_net(sk));
+ TCP_ECN_create_request(req, skb, sk);
ireq->ir_iif = sk->sk_bound_dev_if;
--
1.8.1.5
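The changed condition in TCP_ECN_create_request() can be restated as a small predicate. The sketch below is an assumption-laden simplification: struct syn_info and ecn_ok_for_request() are hypothetical names, and the sysctl is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the fields the kernel reads from a SYN. */
struct syn_info {
    bool ece;        /* TCP header ECE bit */
    bool cwr;        /* TCP header CWR bit */
    bool ip_not_ect; /* IP header carries Not-ECT */
};

/*
 * ECN is accepted for the request when the sysctl allows it, the SYN
 * carries ECE and CWR, and either the IP header is Not-ECT (the RFC 3168
 * rule) or the listener's congestion control asked for ECN.
 */
static bool ecn_ok_for_request(bool sysctl_tcp_ecn,
                               const struct syn_info *syn,
                               bool ca_needs_ecn)
{
    assert(syn);
    return sysctl_tcp_ecn && syn->ece && syn->cwr &&
           (syn->ip_not_ect || ca_needs_ecn);
}
```

The only behavioural difference from the old code is the last operand: a SYN that already carries ECT is no longer rejected when the listener's congestion control sets TCP_CONG_NEEDS_ECN.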
* [PATCH 1/5] net: tcp: assign tcp cong_ops when tcp sk is created
From: Florian Westphal @ 2014-05-12 20:59 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal, Daniel Borkmann, Glenn Judd
In-Reply-To: <1399928384-24143-1-git-send-email-fw@strlen.de>
Split assignment and initialization from one into two functions.
This is required by followup patches that add Datacenter TCP
(DCTCP) congestion control algorithm - we need to be able to
determine if the connection is moderated by DCTCP before the
3WHS has finished.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/tcp.h | 10 +++++++++-
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_cong.c | 24 ++++++++++--------------
net/ipv4/tcp_minisocks.c | 2 +-
4 files changed, 21 insertions(+), 17 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3c94184..8ae165f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -815,7 +815,7 @@ struct tcp_congestion_ops {
int tcp_register_congestion_control(struct tcp_congestion_ops *type);
void tcp_unregister_congestion_control(struct tcp_congestion_ops *type);
-void tcp_init_congestion_control(struct sock *sk);
+void tcp_assign_congestion_control(struct sock *sk);
void tcp_cleanup_congestion_control(struct sock *sk);
int tcp_set_default_congestion_control(const char *name);
void tcp_get_default_congestion_control(char *name);
@@ -831,6 +831,14 @@ u32 tcp_reno_ssthresh(struct sock *sk);
void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
extern struct tcp_congestion_ops tcp_reno;
+static inline void tcp_init_congestion_control(struct sock *sk)
+{
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+
+ if (icsk->icsk_ca_ops->init)
+ icsk->icsk_ca_ops->init(sk);
+}
+
static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state)
{
struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index eb1dde3..f3ff83d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -405,7 +405,7 @@ void tcp_init_sock(struct sock *sk)
tp->reordering = sysctl_tcp_reordering;
tcp_enable_early_retrans(tp);
- icsk->icsk_ca_ops = &tcp_init_congestion_ops;
+ tcp_assign_congestion_control(sk);
tp->tsoffset = 0;
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 7b09d8b..65609ac 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -74,27 +74,23 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
/* Assign choice of congestion control. */
-void tcp_init_congestion_control(struct sock *sk)
+void tcp_assign_congestion_control(struct sock *sk)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_congestion_ops *ca;
- /* if no choice made yet assign the current value set as default */
- if (icsk->icsk_ca_ops == &tcp_init_congestion_ops) {
- rcu_read_lock();
- list_for_each_entry_rcu(ca, &tcp_cong_list, list) {
- if (try_module_get(ca->owner)) {
- icsk->icsk_ca_ops = ca;
- break;
- }
-
- /* fallback to next available */
+ rcu_read_lock();
+ list_for_each_entry_rcu(ca, &tcp_cong_list, list) {
+ if (try_module_get(ca->owner)) {
+ icsk->icsk_ca_ops = ca;
+ goto out;
}
- rcu_read_unlock();
}
- if (icsk->icsk_ca_ops->init)
- icsk->icsk_ca_ops->init(sk);
+ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
+
+ out:
+ rcu_read_unlock();
}
/* Manage refcounts on socket close. */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 05c1b15..b810071 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -422,7 +422,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
if (newicsk->icsk_ca_ops != &tcp_init_congestion_ops &&
!try_module_get(newicsk->icsk_ca_ops->owner))
- newicsk->icsk_ca_ops = &tcp_init_congestion_ops;
+ tcp_assign_congestion_control(newsk);
tcp_set_ca_state(newsk, TCP_CA_Open);
tcp_init_xmit_timers(newsk);
--
1.8.1.5
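The split between assigning and initializing the congestion control ops can be illustrated with the user-space sketch below. It is not the kernel implementation: the array replaces the RCU-protected list, module_available stands in for try_module_get() succeeding, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced congestion control descriptor. */
struct cong_ops {
    const char *name;
    bool module_available;    /* stands in for try_module_get() */
    void (*init)(int *state); /* optional */
};

static void demo_init(int *state)
{
    *state = 1;
}

/* Plays the role of tcp_init_congestion_ops: always usable, no init hook. */
static const struct cong_ops fallback_ops = { "fallback", true, NULL };

/* Step 1, at socket creation: pick the first usable entry, else fall back. */
static const struct cong_ops *assign_cong_ops(const struct cong_ops *list,
                                              size_t n)
{
    assert(list || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (list[i].module_available)
            return &list[i];
    }
    return &fallback_ops;
}

/* Step 2, later: run the optional init hook of whatever was assigned. */
static void init_cong_ops(const struct cong_ops *ops, int *state)
{
    if (ops->init)
        ops->init(state);
}
```

Because the choice is made in step 1, code that runs before the handshake completes can already inspect the assigned ops, which is what the DCTCP patches later in the series rely on.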
* [next PATCH 0/5] net: tcp: DCTCP congestion control algorithm
From: Florian Westphal @ 2014-05-12 20:59 UTC (permalink / raw)
To: netdev
This patch series adds support for the Datacenter TCP (DCTCP) congestion
control algorithm.
Please see individual patches' changelog for the details. A summary
of DCTCP and test results can be found in patch 5.
Joint work with Daniel Borkmann and Glenn Judd.
Daniel Borkmann (2):
net: tcp: add flag for ca to indicate that ECN is required
net: tcp: add DCTCP congestion algorithm
Florian Westphal (3):
net: tcp: assign tcp cong_ops when tcp sk is created
net: tcp: split ack slow/fast events from cwnd_event
net: tcp: more detailed ACK events, and events for CE marked packets
Documentation/networking/dctcp.txt | 232 +++++++++++++++++++++++++++
include/net/tcp.h | 70 ++++++---
net/ipv4/Kconfig | 28 +++-
net/ipv4/Makefile | 1 +
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_cong.c | 24 ++-
net/ipv4/tcp_dctcp.c | 311 +++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_input.c | 30 +++-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv4/tcp_output.c | 14 +-
net/ipv4/tcp_westwood.c | 30 ++--
net/ipv6/tcp_ipv6.c | 2 +-
13 files changed, 690 insertions(+), 58 deletions(-)
create mode 100644 Documentation/networking/dctcp.txt
create mode 100644 net/ipv4/tcp_dctcp.c
* [PATCH 2/2] device/nvf0: enable video decoding engines on gk110/gk208
From: John Rowley @ 2014-05-12 21:34 UTC (permalink / raw)
To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
Cc: John Rowley, bskeggs-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <1399930497-6636-1-git-send-email-john.rowley08-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Only tested on nvf1, but I was advised to enable it on all of them.
Signed-off-by: John Rowley <john.rowley08-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
nvkm/engine/device/nve0.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/nvkm/engine/device/nve0.c b/nvkm/engine/device/nve0.c
index 6e72f9c..459099e 100644
--- a/nvkm/engine/device/nve0.c
+++ b/nvkm/engine/device/nve0.c
@@ -201,11 +201,9 @@ nve0_identify(struct nouveau_device *device)
device->oclass[NVDEV_ENGINE_COPY0 ] = &nve0_copy0_oclass;
device->oclass[NVDEV_ENGINE_COPY1 ] = &nve0_copy1_oclass;
device->oclass[NVDEV_ENGINE_COPY2 ] = &nve0_copy2_oclass;
-#if 0
device->oclass[NVDEV_ENGINE_BSP ] = &nve0_bsp_oclass;
device->oclass[NVDEV_ENGINE_VP ] = &nve0_vp_oclass;
device->oclass[NVDEV_ENGINE_PPP ] = &nvc0_ppp_oclass;
-#endif
device->oclass[NVDEV_ENGINE_PERFMON] = &nvf0_perfmon_oclass;
break;
case 0xf1:
@@ -236,11 +234,9 @@ nve0_identify(struct nouveau_device *device)
device->oclass[NVDEV_ENGINE_COPY0 ] = &nve0_copy0_oclass;
device->oclass[NVDEV_ENGINE_COPY1 ] = &nve0_copy1_oclass;
device->oclass[NVDEV_ENGINE_COPY2 ] = &nve0_copy2_oclass;
-#if 0
device->oclass[NVDEV_ENGINE_BSP ] = &nve0_bsp_oclass;
device->oclass[NVDEV_ENGINE_VP ] = &nve0_vp_oclass;
device->oclass[NVDEV_ENGINE_PPP ] = &nvc0_ppp_oclass;
-#endif
device->oclass[NVDEV_ENGINE_PERFMON] = &nvf0_perfmon_oclass;
break;
case 0x108:
@@ -271,11 +267,9 @@ nve0_identify(struct nouveau_device *device)
device->oclass[NVDEV_ENGINE_COPY0 ] = &nve0_copy0_oclass;
device->oclass[NVDEV_ENGINE_COPY1 ] = &nve0_copy1_oclass;
device->oclass[NVDEV_ENGINE_COPY2 ] = &nve0_copy2_oclass;
-#if 0
device->oclass[NVDEV_ENGINE_BSP ] = &nve0_bsp_oclass;
device->oclass[NVDEV_ENGINE_VP ] = &nve0_vp_oclass;
device->oclass[NVDEV_ENGINE_PPP ] = &nvc0_ppp_oclass;
-#endif
break;
default:
nv_fatal(device, "unknown Kepler chipset\n");
--
1.9.2
* [PATCH 1/2] device/nvf1: add support for 0xf1 (gk110b)
From: John Rowley @ 2014-05-12 21:34 UTC (permalink / raw)
To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
Cc: John Rowley, bskeggs-H+wXaHxf7aLQT0dZR+AlfA
Signed-off-by: John Rowley <john.rowley08-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
nvkm/engine/device/nve0.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/nvkm/engine/device/nve0.c b/nvkm/engine/device/nve0.c
index 964c183..6e72f9c 100644
--- a/nvkm/engine/device/nve0.c
+++ b/nvkm/engine/device/nve0.c
@@ -208,6 +208,41 @@ nve0_identify(struct nouveau_device *device)
#endif
device->oclass[NVDEV_ENGINE_PERFMON] = &nvf0_perfmon_oclass;
break;
+ case 0xf1:
+ device->cname = "GK110B";
+ device->oclass[NVDEV_SUBDEV_VBIOS ] = &nouveau_bios_oclass;
+ device->oclass[NVDEV_SUBDEV_GPIO ] = &nve0_gpio_oclass;
+ device->oclass[NVDEV_SUBDEV_I2C ] = &nvd0_i2c_oclass;
+ device->oclass[NVDEV_SUBDEV_CLOCK ] = &nve0_clock_oclass;
+ device->oclass[NVDEV_SUBDEV_THERM ] = &nvd0_therm_oclass;
+ device->oclass[NVDEV_SUBDEV_MXM ] = &nv50_mxm_oclass;
+ device->oclass[NVDEV_SUBDEV_DEVINIT] = nvc0_devinit_oclass;
+ device->oclass[NVDEV_SUBDEV_MC ] = nvc3_mc_oclass;
+ device->oclass[NVDEV_SUBDEV_BUS ] = nvc0_bus_oclass;
+ device->oclass[NVDEV_SUBDEV_TIMER ] = &nv04_timer_oclass;
+ device->oclass[NVDEV_SUBDEV_FB ] = nve0_fb_oclass;
+ device->oclass[NVDEV_SUBDEV_LTCG ] = gf100_ltcg_oclass;
+ device->oclass[NVDEV_SUBDEV_IBUS ] = &nve0_ibus_oclass;
+ device->oclass[NVDEV_SUBDEV_INSTMEM] = nv50_instmem_oclass;
+ device->oclass[NVDEV_SUBDEV_VM ] = &nvc0_vmmgr_oclass;
+ device->oclass[NVDEV_SUBDEV_BAR ] = &nvc0_bar_oclass;
+ device->oclass[NVDEV_SUBDEV_PWR ] = &nvd0_pwr_oclass;
+ device->oclass[NVDEV_SUBDEV_VOLT ] = &nv40_volt_oclass;
+ device->oclass[NVDEV_ENGINE_DMAOBJ ] = &nvd0_dmaeng_oclass;
+ device->oclass[NVDEV_ENGINE_FIFO ] = nve0_fifo_oclass;
+ device->oclass[NVDEV_ENGINE_SW ] = nvc0_software_oclass;
+ device->oclass[NVDEV_ENGINE_GR ] = nvf0_graph_oclass;
+ device->oclass[NVDEV_ENGINE_DISP ] = nvf0_disp_oclass;
+ device->oclass[NVDEV_ENGINE_COPY0 ] = &nve0_copy0_oclass;
+ device->oclass[NVDEV_ENGINE_COPY1 ] = &nve0_copy1_oclass;
+ device->oclass[NVDEV_ENGINE_COPY2 ] = &nve0_copy2_oclass;
+#if 0
+ device->oclass[NVDEV_ENGINE_BSP ] = &nve0_bsp_oclass;
+ device->oclass[NVDEV_ENGINE_VP ] = &nve0_vp_oclass;
+ device->oclass[NVDEV_ENGINE_PPP ] = &nvc0_ppp_oclass;
+#endif
+ device->oclass[NVDEV_ENGINE_PERFMON] = &nvf0_perfmon_oclass;
+ break;
case 0x108:
device->cname = "GK208";
device->oclass[NVDEV_SUBDEV_VBIOS ] = &nouveau_bios_oclass;
--
1.9.2
* [nightly] Core TISDK 2014.05 build 2014-05-12_15-32-09
From: Denys Dmytriyenko @ 2014-05-12 21:16 UTC (permalink / raw)
To: meta-arago
* [Buildroot] [git commit] mesa3d: Fix gbm related compile error
From: Peter Korsgaard @ 2014-05-12 21:31 UTC (permalink / raw)
To: buildroot
commit: http://git.buildroot.net/buildroot/commit/?id=c95a30471da61bd8b1396d85fd4401c0cf7901c8
branch: http://git.buildroot.net/buildroot/commit/?id=refs/heads/master
https://bugs.freedesktop.org/show_bug.cgi?id=78225#c1
"AFAICS both gbm backends require DRI"
Signed-off-by: Bernd Kuhls <bernd.kuhls@t-online.de>
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
---
package/mesa3d/mesa3d.mk | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/package/mesa3d/mesa3d.mk b/package/mesa3d/mesa3d.mk
index 8c0678c..63ba574 100644
--- a/package/mesa3d/mesa3d.mk
+++ b/package/mesa3d/mesa3d.mk
@@ -69,7 +69,6 @@ endif
ifeq ($(MESA3D_DRI_DRIVERS-y),)
MESA3D_CONF_OPT += \
- --disable-dri \
--without-dri-drivers
else
MESA3D_CONF_OPT += \
@@ -102,7 +101,6 @@ MESA3D_CONF_OPT += \
--with-egl-platforms=$(subst $(space),$(comma),$(MESA3D_EGL_PLATFORMS))
else
MESA3D_CONF_OPT += \
- --disable-gbm \
--disable-egl
endif
* v0.80.1 Firefly released
From: Sage Weil @ 2014-05-12 21:31 UTC (permalink / raw)
To: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ
This first Firefly point release fixes a few bugs, the most visible
being a problem that prevents scrub from completing in some cases.
Notable Changes
---------------
* osd: revert incomplete scrub fix (Samuel Just)
* rgw: fix stripe calculation for manifest objects (Yehuda Sadeh)
* rgw: improve handling, memory usage for abort reads (Yehuda Sadeh)
* rgw: send Swift user manifest HTTP header (Yehuda Sadeh)
* libcephfs, ceph-fuse: expose MDS session state via admin socket (Yan,
Zheng)
* osd: add simple throttle for snap trimming (Sage Weil)
* monclient: fix possible hang from ill-timed monitor connection failure
(Sage Weil)
* osd: fix trimming of past HitSets (Sage Weil)
* osd: fix whiteouts for non-writeback cache modes (Sage Weil)
* osd: prevent divide by zero in tiering agent (David Zafman)
* osd: prevent busy loop when tiering agent can do no work (David Zafman)
For more detailed information, see the complete changelog:
* http://ceph.com/docs/master/_downloads/v0.80.1.txt
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.80.1.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
* [Buildroot] [PATCH v6 03/32] mesa3d: use --enable-shared-glapi also for Gallium drivers
From: Peter Korsgaard @ 2014-05-12 21:30 UTC (permalink / raw)
To: buildroot
In-Reply-To: <1399716164-6452-4-git-send-email-bernd.kuhls@t-online.de>
>>>>> "Bernd" == Bernd Kuhls <bernd.kuhls@t-online.de> writes:
> Needed since this upstream commit:
> http://cgit.freedesktop.org/mesa/mesa/commit/configure.ac?h=10.2&id=0432aa064bf5d4d0ad8fc3c4d648b8feb238ddfa
> Remove --disable-shared-glapi from the non-DRI-block, this
> would break with enabled Gallium drivers.
> Signed-off-by: Bernd Kuhls <bernd.kuhls@t-online.de>
Committed, thanks.
--
Bye, Peter Korsgaard
* [Qemu-devel] [PATCH] target-i386: Preserve the Z bit for bt/bts/btr/btc
From: Richard Henderson @ 2014-05-12 21:28 UTC (permalink / raw)
To: qemu-devel; +Cc: peter.maydell, qemu-stable
In-Reply-To: <1399930102-16128-1-git-send-email-rth@twiddle.net>
Older Intel manuals (pre-2010) and current AMD manuals describe Z as
undefined, but newer Intel manuals describe Z as unchanged.
Cc: qemu-stable@nongnu.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
target-i386/translate.c | 40 +++++++++++++++++++++++++++++++---------
1 file changed, 31 insertions(+), 9 deletions(-)
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 02625e3..032b0fd 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -6708,41 +6708,63 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
}
bt_op:
tcg_gen_andi_tl(cpu_T[1], cpu_T[1], (1 << (3 + ot)) - 1);
+ tcg_gen_shr_tl(cpu_tmp4, cpu_T[0], cpu_T[1]);
switch(op) {
case 0:
- tcg_gen_shr_tl(cpu_cc_src, cpu_T[0], cpu_T[1]);
- tcg_gen_movi_tl(cpu_cc_dst, 0);
break;
case 1:
- tcg_gen_shr_tl(cpu_tmp4, cpu_T[0], cpu_T[1]);
tcg_gen_movi_tl(cpu_tmp0, 1);
tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T[1]);
tcg_gen_or_tl(cpu_T[0], cpu_T[0], cpu_tmp0);
break;
case 2:
- tcg_gen_shr_tl(cpu_tmp4, cpu_T[0], cpu_T[1]);
tcg_gen_movi_tl(cpu_tmp0, 1);
tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T[1]);
- tcg_gen_not_tl(cpu_tmp0, cpu_tmp0);
- tcg_gen_and_tl(cpu_T[0], cpu_T[0], cpu_tmp0);
+ tcg_gen_andc_tl(cpu_T[0], cpu_T[0], cpu_tmp0);
break;
default:
case 3:
- tcg_gen_shr_tl(cpu_tmp4, cpu_T[0], cpu_T[1]);
tcg_gen_movi_tl(cpu_tmp0, 1);
tcg_gen_shl_tl(cpu_tmp0, cpu_tmp0, cpu_T[1]);
tcg_gen_xor_tl(cpu_T[0], cpu_T[0], cpu_tmp0);
break;
}
- set_cc_op(s, CC_OP_SARB + ot);
if (op != 0) {
if (mod != 3) {
gen_op_st_v(s, ot, cpu_T[0], cpu_A0);
} else {
gen_op_mov_reg_v(ot, rm, cpu_T[0]);
}
+ }
+
+ /* Delay all CC updates until after the store above. Note that
+ C is the result of the test, Z is unchanged, and the others
+ are all undefined. */
+ switch (s->cc_op) {
+ case CC_OP_MULB ... CC_OP_MULQ:
+ case CC_OP_ADDB ... CC_OP_ADDQ:
+ case CC_OP_ADCB ... CC_OP_ADCQ:
+ case CC_OP_SUBB ... CC_OP_SUBQ:
+ case CC_OP_SBBB ... CC_OP_SBBQ:
+ case CC_OP_LOGICB ... CC_OP_LOGICQ:
+ case CC_OP_INCB ... CC_OP_INCQ:
+ case CC_OP_DECB ... CC_OP_DECQ:
+ case CC_OP_SHLB ... CC_OP_SHLQ:
+ case CC_OP_SARB ... CC_OP_SARQ:
+ case CC_OP_BMILGB ... CC_OP_BMILGQ:
+ /* Z was going to be computed from the non-zero status of CC_DST.
+ We can get that same Z value (and the new C value) by leaving
+ CC_DST alone, setting CC_SRC, and using a CC_OP_SAR of the
+ same width. */
tcg_gen_mov_tl(cpu_cc_src, cpu_tmp4);
- tcg_gen_movi_tl(cpu_cc_dst, 0);
+ set_cc_op(s, ((s->cc_op - CC_OP_MULB) & 3) + CC_OP_SARB);
+ break;
+ default:
+ /* Otherwise, generate EFLAGS and replace the C bit. */
+ gen_compute_eflags(s);
+ tcg_gen_deposit_tl(cpu_cc_src, cpu_cc_src, cpu_tmp4,
+ ctz32(CC_C), 1);
+ break;
}
break;
case 0x1bc: /* bsf / tzcnt */
--
1.9.0
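The architectural behaviour the patch aims for (C receives the tested bit, Z is left alone) can be modelled in plain C as below. This is a sketch of the instruction semantics for a 32-bit register operand, not of the TCG code; enum bt_op, struct bt_flags and bit_test() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum bt_op { BT, BTS, BTR, BTC };

/* Only the two flags discussed in the patch are modelled. */
struct bt_flags {
    bool cf;
    bool zf;
};

/* Returns the (possibly modified) operand; updates CF, never touches ZF. */
static uint32_t bit_test(enum bt_op op, uint32_t value, unsigned int index,
                         struct bt_flags *flags)
{
    uint32_t mask;

    assert(flags);
    index &= 31; /* register form: bit offset is taken modulo operand size */
    mask = UINT32_C(1) << index;

    flags->cf = (value & mask) != 0; /* C is the result of the test */
    /* flags->zf is deliberately left unchanged. */

    switch (op) {
    case BT:
        break;          /* test only */
    case BTS:
        value |= mask;  /* set the bit */
        break;
    case BTR:
        value &= ~mask; /* reset the bit */
        break;
    case BTC:
        value ^= mask;  /* complement the bit */
        break;
    }
    return value;
}
```

Note that btr can produce a zero result while Z stays set or clear from an earlier instruction, which is exactly the case the old translation got wrong by forcing CC_DST to 0.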
* [Buildroot] [git commit] mesa3d: use --enable-shared-glapi also for Gallium drivers
From: Peter Korsgaard @ 2014-05-12 21:29 UTC (permalink / raw)
To: buildroot
commit: http://git.buildroot.net/buildroot/commit/?id=b11289752e5f862afeced9f34bfab2e6a521b776
branch: http://git.buildroot.net/buildroot/commit/?id=refs/heads/master
Needed since this upstream commit:
http://cgit.freedesktop.org/mesa/mesa/commit/configure.ac?h=10.2&id=0432aa064bf5d4d0ad8fc3c4d648b8feb238ddfa
Remove --disable-shared-glapi from the non-DRI-block, this
would break with enabled Gallium drivers.
Signed-off-by: Bernd Kuhls <bernd.kuhls@t-online.de>
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
---
package/mesa3d/mesa3d.mk | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/package/mesa3d/mesa3d.mk b/package/mesa3d/mesa3d.mk
index a09e4c6..8c0678c 100644
--- a/package/mesa3d/mesa3d.mk
+++ b/package/mesa3d/mesa3d.mk
@@ -63,13 +63,13 @@ MESA3D_CONF_OPT += \
--without-gallium-drivers
else
MESA3D_CONF_OPT += \
+ --enable-shared-glapi \
--with-gallium-drivers=$(subst $(space),$(comma),$(MESA3D_GALLIUM_DRIVERS-y))
endif
ifeq ($(MESA3D_DRI_DRIVERS-y),)
MESA3D_CONF_OPT += \
--disable-dri \
- --disable-shared-glapi \
--without-dri-drivers
else
MESA3D_CONF_OPT += \