* [PATCH 3.10 01/13] udf: Avoid infinite loop when processing indirect ICBs
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel; +Cc: Greg Kroah-Hartman, stable, Jan Kara, Chuck Ebbert
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Jan Kara <jack@suse.cz>
commit c03aa9f6e1f938618e6db2e23afef0574efeeb65 upstream.
We did not implement any bound on the number of indirect ICBs we follow
when loading an inode. Thus a corrupted medium could cause the kernel to
go into an infinite loop, possibly causing a stack overflow.
Fix the possible stack overflow by removing the recursion from
__udf_read_inode() and limiting the number of indirect ICBs we follow to
avoid infinite loops.
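The shape of the fix is the usual conversion of unbounded (tail) recursion
into a bounded loop. A minimal, self-contained sketch of that pattern, with
hypothetical names rather than the actual UDF structures, could look like
this:

	#include <stdio.h>

	#define MAX_NESTING 1024

	struct icb { struct icb *indirect; };	/* stand-in for an indirect entry */

	/*
	 * Follow a chain of indirect entries iteratively, giving up once the
	 * nesting limit is exceeded instead of recursing without bound.
	 */
	static struct icb *resolve_icb(struct icb *icb)
	{
		unsigned int indirections = 0;

		while (icb && icb->indirect) {
			if (++indirections > MAX_NESTING) {
				fprintf(stderr, "too many indirect ICBs (max %d)\n",
					MAX_NESTING);
				return NULL;	/* corrupted medium: give up */
			}
			icb = icb->indirect;
		}
		return icb;
	}

	int main(void)
	{
		struct icb leaf = { NULL }, middle = { &leaf }, head = { &middle };

		return resolve_icb(&head) == &leaf ? 0 : 1;
	}

The real patch below does the same thing with a reread label and a goto,
because the whole inode read has to be retried from the new location.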
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
fs/udf/inode.c | 35 +++++++++++++++++++++--------------
1 file changed, 21 insertions(+), 14 deletions(-)
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1270,13 +1270,22 @@ update_time:
return 0;
}
+/*
+ * Maximum length of linked list formed by ICB hierarchy. The chosen number is
+ * arbitrary - just that we hopefully don't limit any real use of rewritten
+ * inode on write-once media but avoid looping for too long on corrupted media.
+ */
+#define UDF_MAX_ICB_NESTING 1024
+
static void __udf_read_inode(struct inode *inode)
{
struct buffer_head *bh = NULL;
struct fileEntry *fe;
uint16_t ident;
struct udf_inode_info *iinfo = UDF_I(inode);
+ unsigned int indirections = 0;
+reread:
/*
* Set defaults, but the inode is still incomplete!
* Note: get_new_inode() sets the following on a new inode:
@@ -1313,28 +1322,26 @@ static void __udf_read_inode(struct inod
ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1,
&ident);
if (ident == TAG_IDENT_IE && ibh) {
- struct buffer_head *nbh = NULL;
struct kernel_lb_addr loc;
struct indirectEntry *ie;
ie = (struct indirectEntry *)ibh->b_data;
loc = lelb_to_cpu(ie->indirectICB.extLocation);
- if (ie->indirectICB.extLength &&
- (nbh = udf_read_ptagged(inode->i_sb, &loc, 0,
- &ident))) {
- if (ident == TAG_IDENT_FE ||
- ident == TAG_IDENT_EFE) {
- memcpy(&iinfo->i_location,
- &loc,
- sizeof(struct kernel_lb_addr));
- brelse(bh);
- brelse(ibh);
- brelse(nbh);
- __udf_read_inode(inode);
+ if (ie->indirectICB.extLength) {
+ brelse(bh);
+ brelse(ibh);
+ memcpy(&iinfo->i_location, &loc,
+ sizeof(struct kernel_lb_addr));
+ if (++indirections > UDF_MAX_ICB_NESTING) {
+ udf_err(inode->i_sb,
+ "too many ICBs in ICB hierarchy"
+ " (max %d supported)\n",
+ UDF_MAX_ICB_NESTING);
+ make_bad_inode(inode);
return;
}
- brelse(nbh);
+ goto reread;
}
}
brelse(ibh);
* [PATCH 3.10 02/13] perf: fix perf bug in fork()
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Oleg Nesterov, Sylvain 'ythier' Hitier,
Peter Zijlstra (Intel), Ingo Molnar, Andrew Morton,
Linus Torvalds
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Peter Zijlstra <peterz@infradead.org>
commit 6c72e3501d0d62fc064d3680e5234f3463ec5a86 upstream.
Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
calling perf_event_free_task() when sched_fork() fails, we will not yet
have done the memset() on ->perf_event_ctxp[] and will therefore try to
'free' the inherited contexts, which are still in use by the parent
process. This is bad.
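The fork.c half of the fix is a matter of goto-based error unwinding: a
cleanup label must only undo work that has actually been done by the time
the goto is taken, which is why the new bad_fork_cleanup_perf label is placed
before the perf_event_free_task() call and bad_fork_cleanup_policy moves
below it. A rough, generic sketch of that idiom (not the actual
copy_process() code):

	#include <stdlib.h>

	struct task { void *perf_ctx; void *audit; };

	static int setup_task(struct task *t)
	{
		t->perf_ctx = malloc(16);	/* perf state is set up first */
		if (!t->perf_ctx)
			goto fail;

		t->audit = malloc(16);		/* audit state is set up second */
		if (!t->audit)
			goto fail_perf;		/* only perf exists, so free only it */

		return 0;

	fail_perf:
		free(t->perf_ctx);		/* undo exactly what was done */
	fail:
		return -1;
	}

	int main(void)
	{
		struct task t = { 0 };

		return setup_task(&t) ? 1 : 0;
	}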
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
kernel/events/core.c | 4 +++-
kernel/fork.c | 5 +++--
2 files changed, 6 insertions(+), 3 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7482,8 +7482,10 @@ int perf_event_init_task(struct task_str
for_each_task_context_nr(ctxn) {
ret = perf_event_init_context(child, ctxn);
- if (ret)
+ if (ret) {
+ perf_event_free_task(child);
return ret;
+ }
}
return 0;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1324,7 +1324,7 @@ static struct task_struct *copy_process(
goto bad_fork_cleanup_policy;
retval = audit_alloc(p);
if (retval)
- goto bad_fork_cleanup_policy;
+ goto bad_fork_cleanup_perf;
/* copy all the process information */
retval = copy_semundo(clone_flags, p);
if (retval)
@@ -1522,8 +1522,9 @@ bad_fork_cleanup_semundo:
exit_sem(p);
bad_fork_cleanup_audit:
audit_free(p);
-bad_fork_cleanup_policy:
+bad_fork_cleanup_perf:
perf_event_free_task(p);
+bad_fork_cleanup_policy:
#ifdef CONFIG_NUMA
mpol_put(p->mempolicy);
bad_fork_cleanup_cgroup:
* [PATCH 3.10 03/13] init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel; +Cc: Greg Kroah-Hartman, stable, Josh Triplett, Randy Dunlap
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Josh Triplett <josh@joshtriplett.org>
commit 62b4d2041117f35ab2409c9f5c4b8d3dc8e59d0f upstream.
commit 03b8c7b623c80af264c4c8d6111e5c6289933666 ("futex: Allow
architectures to skip futex_atomic_cmpxchg_inatomic() test") added the
HAVE_FUTEX_CMPXCHG symbol right below FUTEX. This placed it right in
the middle of the options for the EXPERT menu. However,
HAVE_FUTEX_CMPXCHG does not depend on EXPERT or FUTEX, so Kconfig stops
placing items in the EXPERT menu, and displays the remaining several
EXPERT items (starting with EPOLL) directly in the General Setup menu.
Since both users of HAVE_FUTEX_CMPXCHG only select it "if FUTEX", make
HAVE_FUTEX_CMPXCHG itself depend on FUTEX. With this change, the
subsequent items display as part of the EXPERT menu again; the EMBEDDED
menu now appears as the next top-level item in the General Setup menu,
which makes General Setup much shorter and more usable.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
init/Kconfig | 1 +
1 file changed, 1 insertion(+)
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1367,6 +1367,7 @@ config FUTEX
config HAVE_FUTEX_CMPXCHG
bool
+ depends on FUTEX
help
Architectures should select this if futex_atomic_cmpxchg_inatomic()
is implemented and always working. This removes a couple of runtime
* [PATCH 3.10 04/13] ring-buffer: Fix infinite spin in reading buffer
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel; +Cc: Greg Kroah-Hartman, stable, Steven Rostedt
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
commit 24607f114fd14f2f37e3e0cb3d47bce96e81e848 upstream.
Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page"
fixed one bug but in the process caused another one. The reset is to
update the header page, but that fix also changed the way the cached
reads were updated. The cache reads are used to test if an iterator
needs to be updated or not.
A ring buffer iterator, when created, disables writes to the ring buffer
but does not stop other readers or consuming reads from happening.
Although all readers are synchronized via a lock, they are only
synchronized when in the ring buffer functions. Those functions may
be called by any number of readers. The iterator continues down when
it's not interrupted by a consuming reader. If a consuming read
occurs, the iterator starts from the beginning of the buffer.
The way the iterator sees that a consuming read has happened since
its last read is by checking the reader "cache". The cache holds the
last counts of the read and the reader page itself.
Commit 651e22f2701b changed what was saved by the cache_read when
the rb_iter_reset() occurred, making the iterator never match the cache.
Then if the iterator calls rb_iter_reset(), it will go into an
infinite loop by checking if the cache doesn't match, doing the reset
and retrying, just to see that the cache still doesn't match! This
should never happen, as the reset is supposed to set the cache to the
current value and there are locks that keep a consuming reader from
having access to the data.
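Put differently, the iterator logic is of the form "if the cached state no
longer matches the buffer, reset and retry"; if the reset stores the wrong
value into the cache, the comparison can never succeed and the caller spins.
A simplified, purely illustrative sketch (names and fields are hypothetical):

	struct ring { unsigned long read, page; };
	struct iter { unsigned long cache_read, cache_page; };

	static void iter_reset(struct iter *it, const struct ring *rb)
	{
		it->cache_page = rb->page;
		/* The buggy variant stored something other than rb->read here,
		 * so the check below kept failing and the reset was redone
		 * forever. */
		it->cache_read = rb->read;
	}

	static void iter_peek(struct iter *it, const struct ring *rb)
	{
		while (it->cache_read != rb->read || it->cache_page != rb->page)
			iter_reset(it, rb);	/* terminates only if the reset is correct */
	}

	int main(void)
	{
		struct ring rb = { 3, 1 };
		struct iter it = { 0, 0 };

		iter_peek(&it, &rb);
		return 0;
	}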
Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
kernel/trace/ring_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3371,7 +3371,7 @@ static void rb_iter_reset(struct ring_bu
iter->head = cpu_buffer->reader_page->read;
iter->cache_reader_page = iter->head_page;
- iter->cache_read = iter->head;
+ iter->cache_read = cpu_buffer->read;
if (iter->head)
iter->read_stamp = cpu_buffer->read_stamp;
* [PATCH 3.10 05/13] mm, thp: move invariant bug check out of loop in __split_huge_page_map
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Waiman Long, Kirill A. Shutemov,
Andrea Arcangeli, Mel Gorman, Rik van Riel, Scott J Norton,
Andrew Morton, Linus Torvalds
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Waiman Long <Waiman.Long@hp.com>
commit f8303c2582b889351e261ff18c4d8eb197a77db2 upstream.
In __split_huge_page_map(), the check for page_mapcount(page) is
invariant within the for loop. Because the macro is implemented using
atomic_read(), the redundant check cannot be optimized away by the
compiler, leading to unnecessary reads of the page structure.
This patch moves the invariant bug check out of the loop so that it will
be done only once. On a 3.16-rc1 based kernel, a microbenchmark that
broke up 1000 transparent huge pages using munmap() had an execution
time of 38,245us with the patch and 38,548us without it. The
performance gain is about 1%.
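The change is ordinary loop-invariant hoisting, done by hand because the
compiler cannot prove that a volatile/atomic read returns the same value on
every iteration. A rough, generic illustration (not the kernel code) of the
before and after shapes:

	#include <stdio.h>

	#define N 512

	/* Stands in for atomic_read(&page->_mapcount); the compiler must
	 * assume it can change between reads. */
	static volatile int mapcount = 1;

	int main(void)
	{
		int sum = 0, i;

		/* Before: the invariant check is re-evaluated on every pass. */
		for (i = 0; i < N; i++) {
			if (mapcount != 1)
				return 1;
			sum += i;
		}

		/* After: the check is done once, outside the loop. */
		if (mapcount != 1)
			return 1;
		for (i = 0; i < N; i++)
			sum += i;

		printf("%d\n", sum);
		return 0;
	}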
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/huge_memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1733,6 +1733,8 @@ static int __split_huge_page_map(struct
if (pmd) {
pgtable = pgtable_trans_huge_withdraw(mm);
pmd_populate(mm, &_pmd, pgtable);
+ if (pmd_write(*pmd))
+ BUG_ON(page_mapcount(page) != 1);
haddr = address;
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
@@ -1742,8 +1744,6 @@ static int __split_huge_page_map(struct
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
- else
- BUG_ON(page_mapcount(page) != 1);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
if (pmd_numa(*pmd))
* [PATCH 3.10 06/13] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Mel Gorman, Rik van Riel,
Kirill A. Shutemov, Linus Torvalds
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit abc40bd2eeb77eb7c2effcaf63154aad929a1d5f upstream.
This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due
to a protection change and the tail will be in a PROT_NONE vma then NUMA
hinting PTEs are temporarily created in the protected VMA.
VM_RW|VM_PROTNONE
|-----------------|
^
split here
In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.
Instead of adding complexity trying to deal with the case, this patch
will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
will not be triggered which is marginal in comparison to the complexity
in dealing with the corner cases during THP split.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/huge_memory.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1740,14 +1740,17 @@ static int __split_huge_page_map(struct
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
pte_t *pte, entry;
BUG_ON(PageCompound(page+i));
+ /*
+ * Note that pmd_numa is not transferred deliberately
+ * to avoid any possibility that pte_numa leaks to
+ * a PROT_NONE VMA by accident.
+ */
entry = mk_pte(page + i, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
- if (pmd_numa(*pmd))
- entry = pte_mknuma(entry);
pte = pte_offset_map(&_pmd, haddr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, haddr, pte, entry);
* [PATCH 3.10 07/13] media: vb2: fix VBI/poll regression
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Hans Verkuil, Laurent Pinchart,
Mauro Carvalho Chehab
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Hans Verkuil <hans.verkuil@cisco.com>
commit 58d75f4b1ce26324b4d809b18f94819843a98731 upstream.
The recent conversion of saa7134 to vb2 uncovered a poll() bug that
broke the teletext applications alevt and mtt. These applications
expect that calling poll() without having called VIDIOC_STREAMON will
cause poll() to return POLLERR. That did not happen in vb2.
This patch fixes that behavior. It also fixes what should happen when
poll() is called after STREAMON has been called but no buffers have
been queued. In that case poll() will also return POLLERR, but only for
capture queues since output queues will always return POLLOUT
anyway in that situation.
This brings the vb2 behavior in line with the old videobuf behavior.
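From userspace, the behaviour after this patch can be pictured roughly as
follows; the device node and the assumption that it is served by a vb2-based
capture driver are only examples:

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>

	int main(void)
	{
		struct pollfd pfd = {
			.fd	= open("/dev/video0", O_RDWR),	/* example node */
			.events	= POLLIN,
		};

		if (pfd.fd < 0)
			return 1;

		/* Neither VIDIOC_STREAMON nor VIDIOC_QBUF has been issued, so
		 * poll() is expected to flag POLLERR instead of blocking. */
		if (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLERR))
			printf("POLLERR before STREAMON/QBUF, as expected\n");

		return 0;
	}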
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/media/v4l2-core/videobuf2-core.c | 15 +++++++++++++--
include/media/videobuf2-core.h | 4 ++++
2 files changed, 17 insertions(+), 2 deletions(-)
--- a/drivers/media/v4l2-core/videobuf2-core.c
+++ b/drivers/media/v4l2-core/videobuf2-core.c
@@ -666,6 +666,7 @@ static int __reqbufs(struct vb2_queue *q
* to the userspace.
*/
req->count = allocated_buffers;
+ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
return 0;
}
@@ -714,6 +715,7 @@ static int __create_bufs(struct vb2_queu
memset(q->plane_sizes, 0, sizeof(q->plane_sizes));
memset(q->alloc_ctx, 0, sizeof(q->alloc_ctx));
q->memory = create->memory;
+ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
}
num_buffers = min(create->count, VIDEO_MAX_FRAME - q->num_buffers);
@@ -1355,6 +1357,7 @@ int vb2_qbuf(struct vb2_queue *q, struct
* dequeued in dqbuf.
*/
list_add_tail(&vb->queued_entry, &q->queued_list);
+ q->waiting_for_buffers = false;
vb->state = VB2_BUF_STATE_QUEUED;
/*
@@ -1724,6 +1727,7 @@ int vb2_streamoff(struct vb2_queue *q, e
* and videobuf, effectively returning control over them to userspace.
*/
__vb2_queue_cancel(q);
+ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
dprintk(3, "Streamoff successful\n");
return 0;
@@ -2009,9 +2013,16 @@ unsigned int vb2_poll(struct vb2_queue *
}
/*
- * There is nothing to wait for if no buffers have already been queued.
+ * There is nothing to wait for if the queue isn't streaming.
*/
- if (list_empty(&q->queued_list))
+ if (!vb2_is_streaming(q))
+ return res | POLLERR;
+ /*
+ * For compatibility with vb1: if QBUF hasn't been called yet, then
+ * return POLLERR as well. This only affects capture queues, output
+ * queues will always initialize waiting_for_buffers to false.
+ */
+ if (q->waiting_for_buffers)
return res | POLLERR;
if (list_empty(&q->done_list))
--- a/include/media/videobuf2-core.h
+++ b/include/media/videobuf2-core.h
@@ -318,6 +318,9 @@ struct v4l2_fh;
* @done_wq: waitqueue for processes waiting for buffers ready to be dequeued
* @alloc_ctx: memory type/allocator-specific contexts for each plane
* @streaming: current streaming state
+ * @waiting_for_buffers: used in poll() to check if vb2 is still waiting for
+ * buffers. Only set for capture queues if qbuf has not yet been
+ * called since poll() needs to return POLLERR in that situation.
* @fileio: file io emulator internal data, used only if emulator is active
*/
struct vb2_queue {
@@ -350,6 +353,7 @@ struct vb2_queue {
unsigned int plane_sizes[VIDEO_MAX_PLANES];
unsigned int streaming:1;
+ unsigned int waiting_for_buffers:1;
struct vb2_fileio_data *fileio;
};
* [PATCH 3.10 08/13] md/raid5: disable DISCARD by default due to safety concerns.
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Shaohua Li, Martin K. Petersen,
Mike Snitzer, Heinz Mauelshagen, NeilBrown, Ben Hutchings
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: NeilBrown <neilb@suse.de>
commit 8e0e99ba64c7ba46133a7c8a3e3f7de01f23bd93 upstream.
It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint. Some devices in some cases don't do what it
says on the label.
The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero. If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks. If all were zero, this would
be the case. If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.
As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.
As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to override this caution and cause
DISCARD to work if discard_zeroes_data is set.
If a site wants to enable DISCARD on some arrays but not on others, it
should select DISCARD support at the filesystem level and set the
raid456 module parameter.
raid456.devices_handle_discard_safely=Y
As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7
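As a usage note: besides the boot-time form above, the parameter is declared
with mode 0644 in the hunk below, so on a running system it should also be
reachable as /sys/module/raid456/parameters/devices_handle_discard_safely,
assuming the standard module parameter sysfs layout.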
Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf8ff0c4969b79653b54d7bcc9d40637
Signed-off-by: NeilBrown <neilb@suse.de>
[bwh: Backported to 3.10: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/md/raid5.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -60,6 +60,10 @@
#include "raid0.h"
#include "bitmap.h"
+static bool devices_handle_discard_safely = false;
+module_param(devices_handle_discard_safely, bool, 0644);
+MODULE_PARM_DESC(devices_handle_discard_safely,
+ "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
/*
* Stripe cache
*/
@@ -5611,7 +5615,7 @@ static int run(struct mddev *mddev)
mddev->queue->limits.discard_granularity = stripe;
/*
* unaligned part of discard request will be ignored, so can't
- * guarantee discard_zerors_data
+ * guarantee discard_zeroes_data
*/
mddev->queue->limits.discard_zeroes_data = 0;
@@ -5636,6 +5640,18 @@ static int run(struct mddev *mddev)
!bdev_get_queue(rdev->bdev)->
limits.discard_zeroes_data)
discard_supported = false;
+ /* Unfortunately, discard_zeroes_data is not currently
+ * a guarantee - just a hint. So we only allow DISCARD
+ * if the sysadmin has confirmed that only safe devices
+ * are in use by setting a module parameter.
+ */
+ if (!devices_handle_discard_safely) {
+ if (discard_supported) {
+ pr_info("md/raid456: discard support disabled due to uncertainty.\n");
+ pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
+ }
+ discard_supported = false;
+ }
}
if (discard_supported &&
* [PATCH 3.10 09/13] jiffies: Fix timeval conversion to jiffies
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Thomas Gleixner, Ingo Molnar,
Paul Turner, Richard Cochran, Prarit Bhargava, Aaron Jacobs,
Andrew Hunter, John Stultz, Ben Hutchings
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Andrew Hunter <ahh@google.com>
commit d78c9300c51d6ceed9f6d078d4e9366f259de28c upstream.
timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly repeatedly stopping/starting an itimer:
setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);
would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.) Doing
this repeatedly would cause unbounded growth in val. So fix the math.
Here's what was wrong with the conversion: we essentially computed
(eliding seconds)
jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)
by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:
jiffies = (usec * x) >> USEC_JIFFIE_SC
and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)
In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.
We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies. This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.
Tested: the following program:
#include <stdio.h>
#include <sys/time.h>

int main() {
	struct itimerval zero = {{0, 0}, {0, 0}};
	/* Initially set to 10 ms. */
	struct itimerval initial = zero;
	initial.it_interval.tv_usec = 10000;
	setitimer(ITIMER_PROF, &initial, NULL);
	/* Save and restore several times. */
	for (size_t i = 0; i < 10; ++i) {
		struct itimerval prev;
		setitimer(ITIMER_PROF, &zero, &prev);
		/* on old kernels, this goes up by TICK_USEC every iteration */
		printf("previous value: %ld %ld %ld %ld\n",
		       prev.it_interval.tv_sec, prev.it_interval.tv_usec,
		       prev.it_value.tv_sec, prev.it_value.tv_usec);
		setitimer(ITIMER_PROF, &prev, NULL);
	}
	return 0;
}
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Reported-by: Aaron Jacobs <jacobsa@google.com>
Signed-off-by: Andrew Hunter <ahh@google.com>
[jstultz: Tweaked to apply to 3.17-rc]
Signed-off-by: John Stultz <john.stultz@linaro.org>
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
include/linux/jiffies.h | 12 ----------
kernel/time.c | 54 ++++++++++++++++++++++++++----------------------
2 files changed, 30 insertions(+), 36 deletions(-)
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -254,23 +254,11 @@ extern unsigned long preset_lpj;
#define SEC_JIFFIE_SC (32 - SHIFT_HZ)
#endif
#define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
#define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
TICK_NSEC -1) / (u64)TICK_NSEC))
#define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
TICK_NSEC -1) / (u64)TICK_NSEC))
-#define USEC_CONVERSION \
- ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
- TICK_NSEC -1) / (u64)TICK_NSEC))
-/*
- * USEC_ROUND is used in the timeval to jiffie conversion. See there
- * for more details. It is the scaled resolution rounding value. Note
- * that it is a 64-bit value. Since, when it is applied, we are already
- * in jiffies (albit scaled), it is nothing but the bits we will shift
- * off.
- */
-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
/*
* The maximum jiffie value is (MAX_INT >> 1). Here we translate that
* into seconds. The 64-bit case will overflow if we are not careful,
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
* that a remainder subtract here would not do the right thing as the
* resolution values don't fall on second boundries. I.e. the line:
* nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ * Note that due to the small error in the multiplier here, this
+ * rounding is incorrect for sufficiently large values of tv_nsec, but
+ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
+ * OK.
*
* Rather, we just shift the bits off the right.
*
* The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
* value to a scaled second value.
*/
-unsigned long
-timespec_to_jiffies(const struct timespec *value)
+static unsigned long
+__timespec_to_jiffies(unsigned long sec, long nsec)
{
- unsigned long sec = value->tv_sec;
- long nsec = value->tv_nsec + TICK_NSEC - 1;
+ nsec = nsec + TICK_NSEC - 1;
if (sec >= MAX_SEC_IN_JIFFIES){
sec = MAX_SEC_IN_JIFFIES;
@@ -517,6 +520,13 @@ timespec_to_jiffies(const struct timespe
(NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
}
+
+unsigned long
+timespec_to_jiffies(const struct timespec *value)
+{
+ return __timespec_to_jiffies(value->tv_sec, value->tv_nsec);
+}
+
EXPORT_SYMBOL(timespec_to_jiffies);
void
@@ -533,31 +543,27 @@ jiffies_to_timespec(const unsigned long
}
EXPORT_SYMBOL(jiffies_to_timespec);
-/* Same for "timeval"
+/*
+ * We could use a similar algorithm to timespec_to_jiffies (with a
+ * different multiplier for usec instead of nsec). But this has a
+ * problem with rounding: we can't exactly add TICK_NSEC - 1 to the
+ * usec value, since it's not necessarily integral.
+ *
+ * We could instead round in the intermediate scaled representation
+ * (i.e. in units of 1/2^(large scale) jiffies) but that's also
+ * perilous: the scaling introduces a small positive error, which
+ * combined with a division-rounding-upward (i.e. adding 2^(scale) - 1
+ * units to the intermediate before shifting) leads to accidental
+ * overflow and overestimates.
*
- * Well, almost. The problem here is that the real system resolution is
- * in nanoseconds and the value being converted is in micro seconds.
- * Also for some machines (those that use HZ = 1024, in-particular),
- * there is a LARGE error in the tick size in microseconds.
-
- * The solution we use is to do the rounding AFTER we convert the
- * microsecond part. Thus the USEC_ROUND, the bits to be shifted off.
- * Instruction wise, this should cost only an additional add with carry
- * instruction above the way it was done above.
+ * At the cost of one additional multiplication by a constant, just
+ * use the timespec implementation.
*/
unsigned long
timeval_to_jiffies(const struct timeval *value)
{
- unsigned long sec = value->tv_sec;
- long usec = value->tv_usec;
-
- if (sec >= MAX_SEC_IN_JIFFIES){
- sec = MAX_SEC_IN_JIFFIES;
- usec = 0;
- }
- return (((u64)sec * SEC_CONVERSION) +
- (((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
- (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
+ return __timespec_to_jiffies(value->tv_sec,
+ value->tv_usec * NSEC_PER_USEC);
}
EXPORT_SYMBOL(timeval_to_jiffies);
* [PATCH 3.10 10/13] drbd: fix regression out of mem, failed to invoke fence-peer helper
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Philipp Reisner, Lars Ellenberg,
Jens Axboe
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Lars Ellenberg <lars.ellenberg@linbit.com>
commit bbc1c5e8ad6dfebf9d13b8a4ccdf66c92913eac9 upstream.
Since linux kernel 3.13, kthread_run() internally uses
wait_for_completion_killable(). We sometimes may use kthread_run()
while we still have a signal pending, which we used to kick our threads
out of potentially blocking network functions, causing kthread_run() to
mistake that as a new fatal signal and fail.
Fix: flush_signals() before kthread_run().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/block/drbd/drbd_nl.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -514,6 +514,12 @@ void conn_try_outdate_peer_async(struct
struct task_struct *opa;
kref_get(&tconn->kref);
+ /* We may just have force_sig()'ed this thread
+ * to get it out of some blocking network function.
+ * Clear signals; otherwise kthread_run(), which internally uses
+ * wait_on_completion_killable(), will mistake our pending signal
+ * for a new fatal signal and fail. */
+ flush_signals(current);
opa = kthread_run(_try_outdate_peer_async, tconn, "drbd_async_h");
if (IS_ERR(opa)) {
conn_err(tconn, "out of mem, failed to invoke fence-peer helper\n");
* [PATCH 3.10 11/13] nl80211: clear skb cb before passing to netlink
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Assaf Azulay, David Spinadel,
Johannes Berg
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johannes Berg <johannes.berg@intel.com>
commit bd8c78e78d5011d8111bc2533ee73b13a3bd6c42 upstream.
In testmode and vendor command reply/event SKBs we use the
skb cb data to store nl80211 parameters between allocation
and sending. This causes the code for CONFIG_NETLINK_MMAP
to get confused, because it takes ownership of the skb cb
data when the SKB is handed off to netlink, and it doesn't
explicitly clear it.
Clear the skb cb explicitly when we're done and before it
gets passed to netlink to avoid this issue.
Reported-by: Assaf Azulay <assaf.azulay@intel.com>
Reported-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
net/wireless/nl80211.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -6568,6 +6568,9 @@ int cfg80211_testmode_reply(struct sk_bu
void *hdr = ((void **)skb->cb)[1];
struct nlattr *data = ((void **)skb->cb)[2];
+ /* clear CB data for netlink core to own from now on */
+ memset(skb->cb, 0, sizeof(skb->cb));
+
if (WARN_ON(!rdev->testmode_info)) {
kfree_skb(skb);
return -EINVAL;
@@ -6594,6 +6597,9 @@ void cfg80211_testmode_event(struct sk_b
void *hdr = ((void **)skb->cb)[1];
struct nlattr *data = ((void **)skb->cb)[2];
+ /* clear CB data for netlink core to own from now on */
+ memset(skb->cb, 0, sizeof(skb->cb));
+
nla_nest_end(skb, data);
genlmsg_end(skb, hdr);
genlmsg_multicast_netns(wiphy_net(&rdev->wiphy), skb, 0,
* [PATCH 3.10 12/13] cpufreq: Fix wrong time unit conversion
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Andreas Schwab, Frederic Weisbecker,
Paul E. McKenney, Rafael J. Wysocki, Mark Brown
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Andreas Schwab <schwab@linux-m68k.org>
commit a857c0b9e24e39fe5be82451b65377795f9538d8 upstream.
The time spent by a CPU under a given frequency is stored in jiffies unit
in the cpu var cpufreq_stats_table->time_in_state[i], i being the index of
the frequency.
This is what is displayed in the following file on the right column:
cat /sys/devices/system/cpu/cpuX/cpufreq/stats/time_in_state
2301000 19835820
2300000 3172
[...]
Now cpufreq converts this jiffies unit delta to clock_t before returning it
to the user as in the above file. And that conversion is achieved using the API
cputime64_to_clock_t().
Although it accidentally works on traditional tick based cputime accounting, where
cputime_t maps directly to jiffies, it doesn't work with other types of cputime
accounting such as CONFIG_VIRT_CPU_ACCOUNTING_* where cputime_t can map to nsecs
or any granularity preferred by the architecture.
For example we get a buggy zero delta on full dyntick configurations:
cat /sys/devices/system/cpu/cpuX/cpufreq/stats/time_in_state
2301000 0
2300000 0
[...]
Fix this by using the proper jiffies_64_to_clock_t() conversion.
Reported-and-tested-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/cpufreq/cpufreq_stats.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/cpufreq/cpufreq_stats.c
+++ b/drivers/cpufreq/cpufreq_stats.c
@@ -81,7 +81,7 @@ static ssize_t show_time_in_state(struct
for (i = 0; i < stat->state_num; i++) {
len += sprintf(buf + len, "%u %llu\n", stat->freq_table[i],
(unsigned long long)
- cputime64_to_clock_t(stat->time_in_state[i]));
+ jiffies_64_to_clock_t(stat->time_in_state[i]));
}
return len;
}
* [PATCH 3.10 13/13] cpufreq: ondemand: Change the calculation of target frequency
From: Greg Kroah-Hartman @ 2014-10-07 23:20 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Stratos Karafotis, Viresh Kumar,
Rafael J. Wysocki, Mark Brown
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Stratos Karafotis <stratosk@semaphore.gr>
commit dfa5bb622555d9da0df21b50f46ebdeef390041b upstream.
The ondemand governor calculates load in terms of frequency and
increases it only if load_freq is greater than up_threshold
multiplied by the current or average frequency. This appears to
produce oscillations of frequency between min and max because,
for example, a relatively small load can easily saturate minimum
frequency and lead the CPU to the max. Then, it will decrease
back to the min due to small load_freq.
Change the calculation method of load and target frequency on the
basis of the following two observations:
- Load computation should not depend on the current or average
measured frequency. For example, absolute load of 80% at 100MHz
is not necessarily equivalent to 8% at 1000MHz in the next
sampling interval.
- It should be possible to increase the target frequency to any
value present in the frequency table proportional to the absolute
load, rather than to the max only, so that:
Target frequency = C * load
where we take C = policy->cpuinfo.max_freq / 100.
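As a concrete illustration of the formula above (numbers are only an
example): with policy->cpuinfo.max_freq = 3,400,000 kHz and an absolute
load of 40%, the requested frequency becomes 40 * 3,400,000 / 100 =
1,360,000 kHz, which the cpufreq core then resolves to a frequency actually
present in the table.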
Tested on Intel i7-3770 CPU @ 3.40GHz and on Quad core 1500MHz Krait.
Phoronix benchmark of Linux Kernel Compilation 3.1 test shows a
~1.5% increase in performance. cpufreq_stats (time_in_state) shows
that middle frequencies are used more, with this patch. Highest
and lowest frequencies were used less by ~9%.
[rjw: We have run multiple other tests on kernels with this
change applied and in the vast majority of cases it turns out
that the resulting performance improvement also leads to reduced
consumption of energy. The change is additionally justified by
the overall simplification of the code in question.]
Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/cpufreq/cpufreq_governor.c | 10 ---------
drivers/cpufreq/cpufreq_governor.h | 1
drivers/cpufreq/cpufreq_ondemand.c | 39 ++++++-------------------------------
3 files changed, 8 insertions(+), 42 deletions(-)
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -97,7 +97,7 @@ void dbs_check_cpu(struct dbs_data *dbs_
policy = cdbs->cur_policy;
- /* Get Absolute Load (in terms of freq for ondemand gov) */
+ /* Get Absolute Load */
for_each_cpu(j, policy->cpus) {
struct cpu_dbs_common_info *j_cdbs;
u64 cur_wall_time, cur_idle_time;
@@ -148,14 +148,6 @@ void dbs_check_cpu(struct dbs_data *dbs_
load = 100 * (wall_time - idle_time) / wall_time;
- if (dbs_data->cdata->governor == GOV_ONDEMAND) {
- int freq_avg = __cpufreq_driver_getavg(policy, j);
- if (freq_avg <= 0)
- freq_avg = policy->cur;
-
- load *= freq_avg;
- }
-
if (load > max_load)
max_load = load;
}
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -169,7 +169,6 @@ struct od_dbs_tuners {
unsigned int sampling_rate;
unsigned int sampling_down_factor;
unsigned int up_threshold;
- unsigned int adj_up_threshold;
unsigned int powersave_bias;
unsigned int io_is_busy;
};
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -29,11 +29,9 @@
#include "cpufreq_governor.h"
/* On-demand governor macros */
-#define DEF_FREQUENCY_DOWN_DIFFERENTIAL (10)
#define DEF_FREQUENCY_UP_THRESHOLD (80)
#define DEF_SAMPLING_DOWN_FACTOR (1)
#define MAX_SAMPLING_DOWN_FACTOR (100000)
-#define MICRO_FREQUENCY_DOWN_DIFFERENTIAL (3)
#define MICRO_FREQUENCY_UP_THRESHOLD (95)
#define MICRO_FREQUENCY_MIN_SAMPLE_RATE (10000)
#define MIN_FREQUENCY_UP_THRESHOLD (11)
@@ -161,14 +159,10 @@ static void dbs_freq_increase(struct cpu
/*
* Every sampling_rate, we check, if current idle time is less than 20%
- * (default), then we try to increase frequency. Every sampling_rate, we look
- * for the lowest frequency which can sustain the load while keeping idle time
- * over 30%. If such a frequency exist, we try to decrease to this frequency.
- *
- * Any frequency increase takes it to the maximum frequency. Frequency reduction
- * happens at minimum steps of 5% (default) of current frequency
+ * (default), then we try to increase frequency. Else, we adjust the frequency
+ * proportional to load.
*/
-static void od_check_cpu(int cpu, unsigned int load_freq)
+static void od_check_cpu(int cpu, unsigned int load)
{
struct od_cpu_dbs_info_s *dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
struct cpufreq_policy *policy = dbs_info->cdbs.cur_policy;
@@ -178,29 +172,17 @@ static void od_check_cpu(int cpu, unsign
dbs_info->freq_lo = 0;
/* Check for frequency increase */
- if (load_freq > od_tuners->up_threshold * policy->cur) {
+ if (load > od_tuners->up_threshold) {
/* If switching to max speed, apply sampling_down_factor */
if (policy->cur < policy->max)
dbs_info->rate_mult =
od_tuners->sampling_down_factor;
dbs_freq_increase(policy, policy->max);
return;
- }
-
- /* Check for frequency decrease */
- /* if we cannot reduce the frequency anymore, break out early */
- if (policy->cur == policy->min)
- return;
-
- /*
- * The optimal frequency is the frequency that is the lowest that can
- * support the current CPU usage without triggering the up policy. To be
- * safe, we focus 10 points under the threshold.
- */
- if (load_freq < od_tuners->adj_up_threshold
- * policy->cur) {
+ } else {
+ /* Calculate the next frequency proportional to load */
unsigned int freq_next;
- freq_next = load_freq / od_tuners->adj_up_threshold;
+ freq_next = load * policy->cpuinfo.max_freq / 100;
/* No longer fully busy, reset rate_mult */
dbs_info->rate_mult = 1;
@@ -374,9 +356,6 @@ static ssize_t store_up_threshold(struct
input < MIN_FREQUENCY_UP_THRESHOLD) {
return -EINVAL;
}
- /* Calculate the new adj_up_threshold */
- od_tuners->adj_up_threshold += input;
- od_tuners->adj_up_threshold -= od_tuners->up_threshold;
od_tuners->up_threshold = input;
return count;
@@ -525,8 +504,6 @@ static int od_init(struct dbs_data *dbs_
if (idle_time != -1ULL) {
/* Idle micro accounting is supported. Use finer thresholds */
tuners->up_threshold = MICRO_FREQUENCY_UP_THRESHOLD;
- tuners->adj_up_threshold = MICRO_FREQUENCY_UP_THRESHOLD -
- MICRO_FREQUENCY_DOWN_DIFFERENTIAL;
/*
* In nohz/micro accounting case we set the minimum frequency
* not depending on HZ, but fixed (very low). The deferred
@@ -535,8 +512,6 @@ static int od_init(struct dbs_data *dbs_
dbs_data->min_sampling_rate = MICRO_FREQUENCY_MIN_SAMPLE_RATE;
} else {
tuners->up_threshold = DEF_FREQUENCY_UP_THRESHOLD;
- tuners->adj_up_threshold = DEF_FREQUENCY_UP_THRESHOLD -
- DEF_FREQUENCY_DOWN_DIFFERENTIAL;
/* For correct statistics, we need 10 ticks for each measure */
dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
* Re: [PATCH 3.10 00/13] 3.10.57-stable review
From: Guenter Roeck @ 2014-10-08 2:49 UTC (permalink / raw)
To: Greg Kroah-Hartman, linux-kernel
Cc: torvalds, akpm, satoru.takeuchi, shuah.kh, stable
On 10/07/2014 04:20 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.10.57 release.
> There are 13 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Oct 9 23:19:13 UTC 2014.
> Anything received after that time might be too late.
>
Build results:
total: 137 pass: 137 fail: 0
Qemu tests:
total: 27 pass: 27 fail: 0
Details are available at http://server.roeck-us.net:8010/builders.
Guenter
* Re: [PATCH 3.10 00/13] 3.10.57-stable review
From: Shuah Khan @ 2014-10-08 20:06 UTC (permalink / raw)
To: Greg Kroah-Hartman, linux-kernel
Cc: torvalds, akpm, linux, satoru.takeuchi, shuah.kh, stable
On 10/07/2014 05:20 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.10.57 release.
> There are 13 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Oct 9 23:19:13 UTC 2014.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.10.57-rc1.gz
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>
Compiled and booted on my test system. No dmesg regressions.
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978