All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 12/30] mm/zsmalloc: Add missing annotation for pin_tag()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Minchan Kim, Nitin Gupta,
	Sergey Senozhatsky, Andrew Morton,
	open list:ZSMALLOC COMPRESSED SLAB MEMORY ALLOCATOR
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at pin_tag()()

warning: context imbalance in pin_tag() - wrong count at exit

The root cause is the missing annotation at pin_tag()
Add the missing __acquires(bitlock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/zsmalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 2eab424c8c67..7bac76ae11b3 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -891,7 +891,7 @@ static inline int trypin_tag(unsigned long handle)
 	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
-static void pin_tag(unsigned long handle)
+static void pin_tag(unsigned long handle) __acquires(bitlock)
 {
 	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
-- 
2.24.1



^ permalink raw reply related

* [PATCH 11/30] mm/zsmalloc: Add missing annotation for migrate_read_unlock()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Minchan Kim, Nitin Gupta,
	Sergey Senozhatsky, Andrew Morton,
	open list:ZSMALLOC COMPRESSED SLAB MEMORY ALLOCATOR
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at migrate_read_unlock()()

 warning: context imbalance in migrate_read_unlock() - unexpected unlock

The root cause is the missing annotation at migrate_read_unlock()
Add the missing __releases(&zspage->lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/zsmalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index da70817b4ed8..2eab424c8c67 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1838,7 +1838,7 @@ static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock)
 	read_lock(&zspage->lock);
 }
 
-static void migrate_read_unlock(struct zspage *zspage)
+static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock)
 {
 	read_unlock(&zspage->lock);
 }
-- 
2.24.1



^ permalink raw reply related

* [PATCH 10/30] mm/zsmalloc: Add missing annotation for migrate_read_lock()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Minchan Kim, Nitin Gupta,
	Sergey Senozhatsky, Andrew Morton,
	open list:ZSMALLOC COMPRESSED SLAB MEMORY ALLOCATOR
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at migrate_read_lock()()

 warning: context imbalance in migrate_read_lock() - wrong count at exit

The root cause is the missing annotation at migrate_read_lock()
Add the missing __acquires(&zspage->lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/zsmalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 22d17ecfe7df..da70817b4ed8 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1833,7 +1833,7 @@ static void migrate_lock_init(struct zspage *zspage)
 	rwlock_init(&zspage->lock);
 }
 
-static void migrate_read_lock(struct zspage *zspage)
+static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock)
 {
 	read_lock(&zspage->lock);
 }
-- 
2.24.1



^ permalink raw reply related

* [PATCH 09/30] mm/slub: Add missing annotation for put_map()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton,
	open list:SLAB ALLOCATOR
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at put_map()()

 warning: context imbalance in put_map() - unexpected unlock

The root cause is the missing annotation at put_map()
Add the missing __releases(&object_map_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 3c6d348afcf9..c273d0f32b8f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -466,7 +466,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page)
 	return object_map;
 }
 
-static void put_map(unsigned long *map)
+static void put_map(unsigned long *map) __releases(&object_map_lock)
 {
 	VM_BUG_ON(map != object_map);
 	lockdep_assert_held(&object_map_lock);
-- 
2.24.1



^ permalink raw reply related

* [PATCH 08/30] mm/slub: Add missing annotation for get_map()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton,
	open list:SLAB ALLOCATOR
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at get_map()()

 warning: context imbalance in get_map() - wrong count at exit

The root cause is the missing annotation at get_map()
Add the missing __acquires(&object_map_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/slub.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..3c6d348afcf9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -449,6 +449,7 @@ static DEFINE_SPINLOCK(object_map_lock);
  * not vanish from under us.
  */
 static unsigned long *get_map(struct kmem_cache *s, struct page *page)
+	__acquires(&object_map_lock)
 {
 	void *p;
 	void *addr = page_address(page);
-- 
2.24.1



^ permalink raw reply related

* [PATCH 07/30] mm/mempolicy: Add missing annotation for queue_pages_pmd()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Andrew Morton,
	open list:MEMORY MANAGEMENT
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at queue_pages_pmd()

context imbalance in queue_pages_pmd() - unexpected unlock

The root cause is the missing annotation at queue_pages_pmd()
Add the missing __releases(ptl)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/mempolicy.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 977c641f78cf..bb0755228150 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -442,6 +442,7 @@ static inline bool queue_pages_required(struct page *page,
  */
 static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
+	__releases(ptl)
 {
 	int ret = 0;
 	struct page *page;
-- 
2.24.1



^ permalink raw reply related

* Re: [PATCH v1 3/3] virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM
From: Tyler Sanderson via Virtualization @ 2020-02-14 20:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S . Tsirkin, linux-kernel, virtualization, linux-mm,
	Nadav Amit, David Rientjes, Alexander Duyck, Michal Hocko
In-Reply-To: <802f93b1-1588-bd2c-8238-c12ec7f7ae9e@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 2706 bytes --]

Regarding Wei's patch that modifies the shrinker implementation, versus
this patch which reverts to OOM notifier:
I am in favor of both patches. But I do want to make sure a fix gets back
ported to 4.19 where the performance regression was first introduced.
My concern with reverting to the OOM notifier is, as mst@ put it (in the
other thread):
"when linux hits OOM all kind of error paths are being hit, latent bugs
start triggering, latency goes up drastically."
The guest could be in a lot of pain before the OOM notifier is invoked, and
it seems like the shrinker API might allow more fine grained control of
when we deflate.

On the other hand, I'm not totally convinced that Wei's patch is an
expected use of the shrinker/page-cache APIs, and maybe it is fragile.
Needs more testing and scrutiny.

It seems to me like the shrinker API is the right API in the long run,
perhaps with some fixes and modifications. But maybe reverting to OOM
notifier is the best patch to back port?

On Fri, Feb 14, 2020 at 6:19 AM David Hildenbrand <david@redhat.com> wrote:

> >> There was a report that this results in undesired side effects when
> >> inflating the balloon to shrink the page cache. [1]
> >>      "When inflating the balloon against page cache (i.e. no free memory
> >>       remains) vmscan.c will both shrink page cache, but also invoke the
> >>       shrinkers -- including the balloon's shrinker. So the balloon
> >>       driver allocates memory which requires reclaim, vmscan gets this
> >>       memory by shrinking the balloon, and then the driver adds the
> >>       memory back to the balloon. Basically a busy no-op."
> >>
> >> The name "deflate on OOM" makes it pretty clear when deflation should
> >> happen - after other approaches to reclaim memory failed, not while
> >> reclaiming. This allows to minimize the footprint of a guest - memory
> >> will only be taken out of the balloon when really needed.
> >>
> >> Especially, a drop_slab() will result in the whole balloon getting
> >> deflated - undesired.
> >
> > Could you explain why some more? drop_caches shouldn't be really used in
> > any production workloads and if somebody really wants all the cache to
> > be dropped then why is balloon any different?
> >
>
> Deflation should happen when the guest is out of memory, not when
> somebody thinks it's time to reclaim some memory. That's what the
> feature promised from the beginning: Only give the guest more memory in
> case it *really* needs more memory.
>
> Deflate on oom, not deflate on reclaim/memory pressure. (that's what the
> report was all about)
>
> A priority for shrinkers might be a step into the right direction.
>
> --
> Thanks,
>
> David / dhildenb
>
>

[-- Attachment #1.2: Type: text/html, Size: 3531 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH 06/30] mm/hugetlb: Add missing annotation for gather_surplus_pages()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Mike Kravetz, Andrew Morton,
	open list:HUGETLB FILESYSTEM
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at gather_surplus_pages()

warning: context imbalance in hugetlb_cow() - unexpected unlock

The root cause is the missing annotation at gather_surplus_pages()
Add the missing __must_hold(&hugetlb_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/hugetlb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd8737a94bec..ff7dda27b33f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1695,6 +1695,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
  * of size 'delta'.
  */
 static int gather_surplus_pages(struct hstate *h, int delta)
+	__must_hold(&hugetlb_lock)
 {
 	struct list_head surplus_list;
 	struct page *page, *tmp;
-- 
2.24.1



^ permalink raw reply related

* [PATCH 05/30] mm/compaction: Add missing annotation for compact_lock_irqsave
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Andrew Morton,
	open list:MEMORY MANAGEMENT
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at compact_lock_irqsave()

warning: context imbalance in compact_lock_irqsave() - wrong count at exit

The root cause is the missing annotation at compact_lock_irqsave()
Add the missing __acquires(lock) annotation.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/compaction.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 672d3c78c6ab..81190fe22200 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -481,6 +481,7 @@ static bool test_and_set_skip(struct compact_control *cc, struct page *page,
  */
 static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
 						struct compact_control *cc)
+	__acquires(lock)
 {
 	/* Track if the lock is contended in async mode */
 	if (cc->mode == MIGRATE_ASYNC && !cc->contended) {
-- 
2.24.1



^ permalink raw reply related

* [PATCH 04/30] mm/memcontrol: Add missing annotation for lock_page_lru()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports warning at lock_page_lry()

warning: context imbalance in lock_page_lru() - wrong count at exit

The root cause is the missing annotation at lock_page_lru()
Add the missing __acquires(&pgdat->lru_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 22ddd557a69b..67dc9f1af0bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2571,6 +2571,7 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 }
 
 static void lock_page_lru(struct page *page, int *isolated)
+	__acquires(&pgdat->lru_lock)
 {
 	pg_data_t *pgdat = page_pgdat(page);
 
-- 
2.24.1



^ permalink raw reply related

* [PATCH 03/30] mm/memcontrol: Add missing annotation for unlock_page_lru()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports warning at unlock_page_lry()

warning: context imbalance in unlock_page_lru() - unexpected unlock

The root cause is the missing annotation at unlock_page_lru()
Add the missing __releases(&pgdat->lru_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f6dc8712e39..22ddd557a69b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2587,6 +2587,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 }
 
 static void unlock_page_lru(struct page *page, int isolated)
+	__releases(&pgdat->lru_lock)
 {
 	pg_data_t *pgdat = page_pgdat(page);
 
-- 
2.24.1



^ permalink raw reply related

* [PATCH 02/30] x86/apic/vector: Add missing annotation to lock_vector_lock(void)
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT), Enrico Weigelt,
	Allison Randal, Greg Kroah-Hartman, Neil Horman
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at unlock_vector_lock()

warning: context imbalance in unlock_vector_lock() - unexpected unlock

The root cause is the missing annotation at unlock_vector_lock()
Add the missing  __releases(&vector_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 arch/x86/kernel/apic/vector.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index d7556939c6cf..8ee7848a355b 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -55,7 +55,7 @@ void lock_vector_lock(void) __acquires(&vector_lock)
 	raw_spin_lock(&vector_lock);
 }
 
-void unlock_vector_lock(void)
+void unlock_vector_lock(void) __releases(&vector_lock)
 {
 	raw_spin_unlock(&vector_lock);
 }
-- 
2.24.1


^ permalink raw reply related

* [PATCH 01/30] x86/apic/vector: Add missing annotation to lock_vector_lock(void)
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT), Neil Horman,
	Allison Randal, Kate Stewart
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports a warning at lock_vector_lock(void)

warning: context imbalance in lock_vector_lock() - wrong count at exit

The root cause is the missing annotation at lock_vector_lock(void)
Add the missing  __acquires(&vector_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 arch/x86/kernel/apic/vector.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 2c5676b0a6e7..d7556939c6cf 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -47,7 +47,7 @@ static struct irq_matrix *vector_matrix;
 static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
 #endif
 
-void lock_vector_lock(void)
+void lock_vector_lock(void) __acquires(&vector_lock)
 {
 	/* Used to the online set of cpus does not change
 	 * during assign_irq_vector.
-- 
2.24.1


^ permalink raw reply related

* [PATCH 00/30] Lock warning cleanup
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: boqun.feng, Jules Irenge
In-Reply-To: <0/30>

This patch series adds missing annotations to various functions  that register warnings of context imbalance when built with Sparse tool. The adds fix the warnings and give better insight or directive on what the function are actually doing. 

Jules Irenge (30):
  x86/apic/vector: Add missing annotation to lock_vector_lock(void)
  x86/apic/vector: Add missing annotation to lock_vector_lock(void)
  mm/memcontrol: Add missing annotation for unlock_page_lru()
  mm/memcontrol: Add missing annotation for lock_page_lru()
  mm/compaction: Add missing annotation for compact_lock_irqsave
  mm/hugetlb: Add missing annotation for gather_surplus_pages()
  mm/mempolicy: Add missing annotation for queue_pages_pmd()
  mm/slub: Add missing annotation for get_map()
  mm/slub: Add missing annotation for put_map()
  mm/zsmalloc: Add missing annotation for migrate_read_lock()
  mm/zsmalloc: Add missing annotation for migrate_read_unlock()
  mm/zsmalloc: Add missing annotation for pin_tag()
  mm/zsmalloc: Add missing annotation for unpin_tag()
  x86/xen: Add missing annotation for xen_pte_lock()
  x86/xen: Add missing annotation for xen_pte_unlock()
  drm/vkms: Add missing annotation for vkms_crtc_atomic_begin()
  drm/vkms: Add missing annotation for vkms_crtc_atomic_flush()
  driver core: Add missing annotation for device_links_write_lock()
  driver core: Add missing annotation for device_links_read_lock()
  pcnet32: Add missing annotation for pcnet32_suspend()
  sfc: Add missing annotation for efx_ef10_try_update_nic_stats_vf()
  xhci: Add missing annotation for xhci_set_port_power()
  xhci: Add missing annotation for xhci_enter_test_mode
  tipc: Add missing annotation for tipc_node_read_lock()
  tipc: Add missing annotation for tipc_node_read_unlock()
  tipc: Add missing annotation for tipc_node_write_lock()
  tipc: Add missing annotation for tipc_node_write_unlock_fast()
  tipc: Add missing annotation for tipc_node_write_unlock()
  net: Add missing annotation for netlink_walk_start()
  net: Add missing annotation for netlink_walk_stop()

 arch/x86/kernel/apic/vector.c      | 4 ++--
 arch/x86/xen/mmu_pv.c              | 3 ++-
 drivers/base/core.c                | 4 ++--
 drivers/gpu/drm/vkms/vkms_crtc.c   | 2 ++
 drivers/net/ethernet/amd/pcnet32.c | 2 +-
 drivers/net/ethernet/sfc/ef10.c    | 1 +
 drivers/usb/host/xhci-hub.c        | 2 ++
 mm/compaction.c                    | 1 +
 mm/hugetlb.c                       | 1 +
 mm/memcontrol.c                    | 2 ++
 mm/mempolicy.c                     | 1 +
 mm/slub.c                          | 3 ++-
 mm/zsmalloc.c                      | 8 ++++----
 net/netlink/af_netlink.c           | 4 ++--
 net/tipc/node.c                    | 9 +++++----
 15 files changed, 30 insertions(+), 17 deletions(-)

-- 
2.24.1


^ permalink raw reply

* [PATCH 04/30] mm/memcontrol: Add missing annotation for lock_page_lru()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: boqun.feng-Re5JQEeQqe8AvxtiuMwx3w, Jules Irenge, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER MEMCG,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER MEMCG
In-Reply-To: <20200214204741.94112-1-jbi.octave-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Sparse reports warning at lock_page_lry()

warning: context imbalance in lock_page_lru() - wrong count at exit

The root cause is the missing annotation at lock_page_lru()
Add the missing __acquires(&pgdat->lru_lock)

Signed-off-by: Jules Irenge <jbi.octave-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 22ddd557a69b..67dc9f1af0bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2571,6 +2571,7 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 }
 
 static void lock_page_lru(struct page *page, int *isolated)
+	__acquires(&pgdat->lru_lock)
 {
 	pg_data_t *pgdat = page_pgdat(page);
 
-- 
2.24.1


^ permalink raw reply related

* [PATCH 03/30] mm/memcontrol: Add missing annotation for unlock_page_lru()
From: Jules Irenge @ 2020-02-14 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: boqun.feng, Jules Irenge, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER MEMCG,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER MEMCG
In-Reply-To: <20200214204741.94112-1-jbi.octave@gmail.com>

Sparse reports warning at unlock_page_lry()

warning: context imbalance in unlock_page_lru() - unexpected unlock

The root cause is the missing annotation at unlock_page_lru()
Add the missing __releases(&pgdat->lru_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f6dc8712e39..22ddd557a69b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2587,6 +2587,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 }
 
 static void unlock_page_lru(struct page *page, int isolated)
+	__releases(&pgdat->lru_lock)
 {
 	pg_data_t *pgdat = page_pgdat(page);
 
-- 
2.24.1


^ permalink raw reply related

* Re: [PATCH] remoteproc: qcom: wcnss: Add iris completion barrier
From: Mathieu Poirier @ 2020-02-14 20:46 UTC (permalink / raw)
  To: Loic Poulain
  Cc: bjorn.andersson, linux-arm-msm, linux-remoteproc, anibal.limon
In-Reply-To: <1581530043-12112-1-git-send-email-loic.poulain@linaro.org>

On Wed, Feb 12, 2020 at 06:54:03PM +0100, Loic Poulain wrote:
> There is no guarantee that the iris pointer will be assigned before
> remoteproc subsystem starts the wcnss rproc, actually it depends how
> fast rproc subsystem is able to get the firmware to trigger the start.
> 
> This leads to sporadic wifi/bluetooth initialization issue on db410c
> with the following output:
>  remoteproc remoteproc1: powering up a204000.wcnss
>  remoteproc remoteproc1: Booting fw image qcom/msm8916/wcnss.mdt...
>  qcom-wcnss-pil a204000.wcnss: no iris registered
>  remoteproc remoteproc1: can't start rproc a204000.wcnss: -22
> 
> This patch introduces a 'iris_assigned' completion barrier to fix
> this issue. Maybe not the most elegant way, but it does the trick.
> 
> Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> ---
>  drivers/remoteproc/qcom_wcnss.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/remoteproc/qcom_wcnss.c b/drivers/remoteproc/qcom_wcnss.c
> index a0468b3..c888282 100644
> --- a/drivers/remoteproc/qcom_wcnss.c
> +++ b/drivers/remoteproc/qcom_wcnss.c
> @@ -84,6 +84,7 @@ struct qcom_wcnss {
>  
>  	struct completion start_done;
>  	struct completion stop_done;
> +	struct completion iris_assigned;
>  
>  	phys_addr_t mem_phys;
>  	phys_addr_t mem_reloc;
> @@ -138,6 +139,7 @@ void qcom_wcnss_assign_iris(struct qcom_wcnss *wcnss,
>  
>  	wcnss->iris = iris;
>  	wcnss->use_48mhz_xo = use_48mhz_xo;
> +	complete(&wcnss->iris_assigned);
>  
>  	mutex_unlock(&wcnss->iris_lock);
>  }
> @@ -213,6 +215,10 @@ static int wcnss_start(struct rproc *rproc)
>  	struct qcom_wcnss *wcnss = (struct qcom_wcnss *)rproc->priv;
>  	int ret;
>  
> +	/* Grant some time for iris registration */
> +	wait_for_completion_timeout(&wcnss->iris_assigned,
> +				    msecs_to_jiffies(5000));
> +
>  	mutex_lock(&wcnss->iris_lock);
>  	if (!wcnss->iris) {
>  		dev_err(wcnss->dev, "no iris registered\n");
> @@ -494,6 +500,7 @@ static int wcnss_probe(struct platform_device *pdev)
>  
>  	init_completion(&wcnss->start_done);
>  	init_completion(&wcnss->stop_done);
> +	init_completion(&wcnss->iris_assigned);
>  
>  	mutex_init(&wcnss->iris_lock);

If I understand the problem correctly, if loading the fw image takes long enough,
qcom_iris_probe() that is triggered by of_platform_populate() has time to
complete and call qcom_wcnss_assign_iris().  Otherwise the remoteproc core calls
wcnss_start() before qcom_wcnss_assign_iris() had the opportunity to run.

If I am correct, would it be possible to call of_platform_populate() before
calling rproc_add()?  There might be some refactoring to do but that's probably
better than introducing a delay...

Thanks,
Mathieu

>  
> -- 
> 2.7.4
> 

^ permalink raw reply

* Re: [PATCH] Btrfs: avoid unnecessary splits when setting bits on an extent io tree
From: Josef Bacik @ 2020-02-14 20:45 UTC (permalink / raw)
  To: fdmanana, linux-btrfs
In-Reply-To: <20200213102002.6176-1-fdmanana@kernel.org>

On 2/13/20 5:20 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When attempting to set bits on a range of an exent io tree that already
> has those bits set we can end up splitting an extent state record, use
> the preallocated extent state record, insert it into the red black tree,
> do another search on the red black tree, merge the preallocated extent
> state record with the previous extent state record, remove that previous
> record from the red black tree and then free it. This is all unnecessary
> work that consumes time.
> 
> This happens specifically at the following case at __set_extent_bit():
> 
>    $ cat -n fs/btrfs/extent_io.c
>     957  static int __must_check
>     958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>    (...)
>    1044          /*
>    1045           *     | ---- desired range ---- |
>    1046           * | state |
>    1047           *   or
>    1048           * | ------------- state -------------- |
>    1049           *
>    (...)
>    1060          if (state->start < start) {
>    1061                  if (state->state & exclusive_bits) {
>    1062                          *failed_start = start;
>    1063                          err = -EEXIST;
>    1064                          goto out;
>    1065                  }
>    1066
>    1067                  prealloc = alloc_extent_state_atomic(prealloc);
>    1068                  BUG_ON(!prealloc);
>    1069                  err = split_state(tree, state, prealloc, start);
>    1070                  if (err)
>    1071                          extent_io_tree_panic(tree, err);
>    1072
>    1073                  prealloc = NULL;
> 
> So if our extent state represents a range from 0 to 1Mb for example, and
> we want to set bits in the range 128Kb to 256Kb for example, and that
> extent state record already has all those bits set, we end up splitting
> that record, so we end up with extent state records in the tree which
> represent the ranges from 0 to 128Kb and from 128Kb to 1Mb. This is
> temporary because a subsequent iteration in that function will end up
> merging the records.
> 
> The splitting requires using the preallocated extent state record, so
> a future iteration that needs to do another split will need to allocate
> another extent state record in an atomic context, something not ideal
> that we try to avoid as much as possible. The splitting also requires
> an insertion in the red black tree, and a subsequent merge will require
> a deletion from the red black tree and freeing an extent state record.
> 
> This change just skips the splitting of an extent state record when it
> already has all the bits the we need to set.
> 
> Setting a bit that is already set for a range is very common in the
> inode's 'file_extent_tree' extent io tree for example, where we keep
> setting the EXTENT_DIRTY bit every time we replace an extent.
> 
> This change also fixes a bug that happens after the recent patchset from
> Josef that avoids having implicit holes after a power failure when not
> using the NO_HOLES feature, more specifically the patch with the subject:
> 
>    "btrfs: introduce the inode->file_extent_tree"
> 
> This patch introduced an extent io tree per inode to keep track of
> completed ordered extents and figure out at any time what is the safe
> value for the inode's disk_i_size. This assumes that for contiguous
> ranges in a file we always end up with a single extent state record in
> the io tree, but that is not the case, as there is a short time window
> where we can have two extent state records representing contiguous
> ranges. When this happens we end setting up an incorrect value for the
> inode's disk_i_size, resulting in data loss after a clean unmount
> of the filesystem. The following example explains how this can happen.
> 
> Suppose we have an inode with an i_size and a disk_i_size of 1Mb, so in
> the inode's file_extent_tree we have a single extent state record that
> represents the range [0, 1Mb[ with the EXTENT_DIRTY bit set. Then the
> following steps happen:
> 
> 1) A buffered write against file range [512Kb, 768Kb[ is made. At this
>     point delalloc was not flushed yet;
> 
> 2) Deduplication from some other inode into this inode's range
>     [128Kb, 256Kb[ is made. This causes btrfs_inode_set_file_extent_range()
>     to be called, from btrfs_insert_clone_extent(), to mark the range
>     [128Kb, 256Kb[ with EXTENT_DIRTY in the inode's file_extent_tree;
> 
> 3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
>     end up at __set_extent_bit(). In the first iteration of that function's
>     loop we end up in the following branch:
> 
>     $ cat -n fs/btrfs/extent_io.c
>      957  static int __must_check
>      958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>     (...)
>     1044          /*
>     1045           *     | ---- desired range ---- |
>     1046           * | state |
>     1047           *   or
>     1048           * | ------------- state -------------- |
>     1049           *
>     (...)
>     1060          if (state->start < start) {
>     1061                  if (state->state & exclusive_bits) {
>     1062                          *failed_start = start;
>     1063                          err = -EEXIST;
>     1064                          goto out;
>     1065                  }
>     1066
>     1067                  prealloc = alloc_extent_state_atomic(prealloc);
>     1068                  BUG_ON(!prealloc);
>     1069                  err = split_state(tree, state, prealloc, start);
>     1070                  if (err)
>     1071                          extent_io_tree_panic(tree, err);
>     1072
>     1073                  prealloc = NULL;
>     (...)
>     1089                  goto search_again;
> 
>     This splits the state record into two, one for range [0, 128Kb[ and
>     another for the range [128Kb, 1Mb[. Both already have the EXTENT_DIRTY
>     bit set. Then we jump to the 'search_again' label, where we unlock the
>     the spinlock protecting the extent io tree before jumping to the
>     'again' label to perform the next iteration;
> 
> 4) In the meanwhile, delalloc is flushed, the ordered extent for the range
>     [512Kb, 768Kb[ is created and when it completes, at
>     btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
>     with a value of 0 for its 'new_size' argument;
> 
> 5) Before the deduplication task currently at __set_extent_bit() moves to
>     the next iteration, the task finishing the ordered extent calls
>     find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
>     and gets 'start' set to 0 and 'end' set to 128Kb - because at this
>     moment the io tree has two extent state records, one representing the
>     range [0, 128Kb[ and another representing the range [128Kb, 1Mb[,
>     both with EXTENT_DIRTY set. Then we set 'isize' to:
> 
>     isize = min(isize, end + 1)
>           = min(1Mb, 128Kb - 1 + 1)
>           = 128Kb
> 
>     Then we set the inode's disk_i_size to 128Kb (isize).
> 
>     After a clean unmount of the filesystem and mounting it again, we have
>     the file with a size of 128Kb, and effectively lost all the data it
>     had before in the range from 128Kb to 1Mb.
> 
> This change fixes that issue too, as we never end up splitting extent
> state records when they already have all the bits we want set.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Sorry, forgot to say

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply

* pynfs python 3 flag day
From: J. Bruce Fields @ 2020-02-14 20:45 UTC (permalink / raw)
  To: linux-nfs

I'm hearing more noise about deprecating Python 2, so decided I can't
keep ignoring Python 3.

Getting pynfs working on Python 3 is a bigger project than I expected.
Keeping it working under Python 2 looks like another project.  So, I'm
planning a flag day after which pynfs will require Python 3.

That isn't the way I'd prefer to do it, but there's only so much time I
want to spend on this.

I've mostly got the 4.0 server tests working under python 3.  I hope a
few more days will be enough to get the 4.1 tests working as well.

When I switch over, I'm afraid a few things will be left broken: any
tests that I don't personally run may still have minor python 3 bugs,
and I haven't touched the python server code that's used for client
testing.

If you stumble across something broken, and you can give me a simple
reproducer, feel free to share it with me and I'll take a look.

But for anything complicated, I'll probably need patches.

Again, I apologize for any extra work that creates for anyone, but for
now this seems like the best compromise to keep things mostly working
without it becoming a bigger time sink for me.

Work so far is in the "python3" branch at

	git://linux-nfs.org/~bfields/pynfs.git

The history will probably be cleaned up an rewritten before it's done.
I'm hoping that'll be in the next week.

It's mostly just a matter of separating out unicode strings and byte
arrays.  Protocol data is all the latter (even if the protocol prefers
some field to be UTF8, pynfs still needs to be able to handle non-UTF8).
But some things have to be unicode strings.

--b.

^ permalink raw reply

* Re: [PATCH v2] ecryptfs: replace BUG_ON with error handling code
From: Tyler Hicks @ 2020-02-14 20:44 UTC (permalink / raw)
  To: Aditya Pakki
  Cc: kjlu, Tyler Hicks, Andrew Morton, Adrian Bunk, Randy Dunlap,
	Theodore Ts'o, ecryptfs, linux-kernel
In-Reply-To: <20200214182101.17165-1-pakki001@umn.edu>

On 2020-02-14 12:21:01, Aditya Pakki wrote:
> In crypt_scatterlist, if the crypt_stat argument is not set up
> correctly, the kernel crashes. Instead, by returning an error code
> upstream, the error is handled safely.
> 
> The issue is detected via a static analysis tool written by us.
> 
> Fixes: 237fead619984 (ecryptfs: fs/Makefile and fs/Kconfig)
> Signed-off-by: Aditya Pakki <pakki001@umn.edu>

Thanks! This looks good to me and passes the eCryptfs regression tests.
I've queued it up in my tree.

Tyler

> ---
> v1: Add missing fixes tag suggested by Markus and Tyler.
> ---
>  fs/ecryptfs/crypto.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
> index db1ef144c63a..2c449aed1b92 100644
> --- a/fs/ecryptfs/crypto.c
> +++ b/fs/ecryptfs/crypto.c
> @@ -311,8 +311,10 @@ static int crypt_scatterlist(struct ecryptfs_crypt_stat *crypt_stat,
>  	struct extent_crypt_result ecr;
>  	int rc = 0;
>  
> -	BUG_ON(!crypt_stat || !crypt_stat->tfm
> -	       || !(crypt_stat->flags & ECRYPTFS_STRUCT_INITIALIZED));
> +	if (!crypt_stat || !crypt_stat->tfm
> +	       || !(crypt_stat->flags & ECRYPTFS_STRUCT_INITIALIZED))
> +		return -EINVAL;
> +
>  	if (unlikely(ecryptfs_verbosity > 0)) {
>  		ecryptfs_printk(KERN_DEBUG, "Key size [%zd]; key:\n",
>  				crypt_stat->key_size);
> -- 
> 2.20.1
> 

^ permalink raw reply

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
From: Jens Axboe @ 2020-02-14 20:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring
In-Reply-To: <addcd44e-ed9b-5f82-517d-c1ed3ee2d85c@kernel.dk>

On 2/14/20 10:52 AM, Jens Axboe wrote:
> On 2/14/20 9:18 AM, Jens Axboe wrote:
>> On 2/14/20 8:47 AM, Jens Axboe wrote:
>>>> I suspect you meant to put that in finish_task_switch() which is the
>>>> tail end of every schedule(), schedule_tail() is the tail end of
>>>> clone().
>>>>
>>>> Or maybe you meant to put it in (and rename) sched_update_worker() which
>>>> is after every schedule() but in a preemptible context -- much saner
>>>> since you don't want to go add an unbounded amount of work in a
>>>> non-preemptible context.
>>>>
>>>> At which point you already have your callback: io_wq_worker_running(),
>>>> or is this for any random task?
>>>
>>> Let me try and clarify - this isn't for the worker tasks, this is for
>>> any task that is using io_uring. In fact, it's particularly not for the
>>> worker threads, just the task itself.
>>>
>>> I basically want the handler to be called when:
>>>
>>> 1) The task is scheduled in. The poll will complete and stuff some items
>>>    on that task list, and I want to task to process them as it wakes up.
>>>
>>> 2) The task is going to sleep, don't want to leave entries around while
>>>    the task is sleeping.
>>>
>>> 3) I need it to be called from "normal" context, with ints enabled,
>>>    preempt enabled, etc.
>>>
>>> sched_update_worker() (with a rename) looks ideal for #1, and the
>>> context is sane for me. Just need a good spot to put the hook call for
>>> schedule out. I think this:
>>>
>>> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
>>> 		preempt_disable();
>>> 		if (tsk->flags & PF_WQ_WORKER)
>>> 			wq_worker_sleeping(tsk);
>>> 		else
>>> 			io_wq_worker_sleeping(tsk);
>>> 		preempt_enable_no_resched();
>>> 	}
>>>
>>> just needs to go into another helper, and then I can call it there
>>> outside of the preempt.
>>>
>>> I'm sure there are daemons lurking here, but I'll test and see how it
>>> goes...
>>
>> Here's a stab at cleaning it up:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll
>>
>> top two patches. First one simply cleans up the sched_update_worker(),
>> so we now have sched_in_update() and sched_out_update(). No changes in
>> this patch, just moves the worker sched-out handling into a helper.
>>
>> 2nd patch then utilizes this to flush the per-task requests that may
>> have been queued up.
> 
> In fact, we can go even further. If we have this task handler, then we:
> 
> 1) Never need to go async for poll completion, and we can remove a bunch
>    of code that handles that
> 2) Don't need to worry about nested eventfd notification, that code goes
>    away too
> 3) Don't need the poll llist for batching flushes, that goes away
> 
> In terms of performance, for the single client case we did about 48K
> requests per second on my kvm on the laptop, now we're doing 148K.
> So it's definitely worthwhile... On top of that, diffstat:
> 
>  fs/io_uring.c | 166 +++++++-------------------------------------------
>  1 file changed, 22 insertions(+), 144 deletions(-)

It's now up to 3.5x the original performance for the single client case.
Here's the updated patch, folded with the original that only went half
the way there.


commit b9c04ea10d10cf80f2d2f3b96e1668e523602072
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Feb 14 09:15:29 2020 -0700

    io_uring: add per-task callback handler
    
    For poll requests, it's not uncommon to link a read (or write) after
    the poll to execute immediately after the file is marked as ready.
    Since the poll completion is called inside the waitqueue wake up handler,
    we have to punt that linked request to async context. This slows down
    the processing, and actually means it's faster to not use a link for this
    use case.
    
    We also run into problems if the completion_lock is contended, as we're
    doing a different lock ordering than the issue side is. Hence we have
    to do trylock for completion, and if that fails, go async. Poll removal
    needs to go async as well, for the same reason.
    
    eventfd notification needs special case as well, to avoid stack blowing
    recursion or deadlocks.
    
    These are all deficiencies that were inherited from the aio poll
    implementation, but I think we can do better. When a poll completes,
    simply queue it up in the task poll list. When the task completes the
    list, we can run dependent links inline as well. This means we never
    have to go async, and we can remove a bunch of code associated with
    that, and optimizations to try and make that run faster. The diffstat
    speaks for itself.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5a826017ebb8..2756654e2955 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -295,7 +295,6 @@ struct io_ring_ctx {
 
 	struct {
 		spinlock_t		completion_lock;
-		struct llist_head	poll_llist;
 
 		/*
 		 * ->poll_list is protected by the ctx->uring_lock for
@@ -552,19 +551,13 @@ struct io_kiocb {
 	};
 
 	struct io_async_ctx		*io;
-	/*
-	 * llist_node is only used for poll deferred completions
-	 */
-	struct llist_node		llist_node;
 	bool				in_async;
 	bool				needs_fixed_file;
 	u8				opcode;
 
 	struct io_ring_ctx	*ctx;
-	union {
-		struct list_head	list;
-		struct hlist_node	hash_node;
-	};
+	struct list_head	list;
+	struct hlist_node	hash_node;
 	struct list_head	link_list;
 	unsigned int		flags;
 	refcount_t		refs;
@@ -574,6 +567,7 @@ struct io_kiocb {
 
 	struct list_head	inflight_entry;
 
+	struct task_struct	*task;
 	struct io_wq_work	work;
 };
 
@@ -834,7 +828,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
-	init_llist_head(&ctx->poll_llist);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
@@ -1056,24 +1049,19 @@ static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
 		return false;
 	if (!ctx->eventfd_async)
 		return true;
-	return io_wq_current_is_worker() || in_interrupt();
+	return io_wq_current_is_worker();
 }
 
-static void __io_cqring_ev_posted(struct io_ring_ctx *ctx, bool trigger_ev)
+static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 {
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 	if (waitqueue_active(&ctx->sqo_wait))
 		wake_up(&ctx->sqo_wait);
-	if (trigger_ev)
+	if (io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
 }
 
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
-{
-	__io_cqring_ev_posted(ctx, io_should_trigger_evfd(ctx));
-}
-
 /* Returns true if there are no backlogged entries after the flush */
 static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
@@ -1238,6 +1226,8 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	/* one is dropped after submission, the other at completion */
 	refcount_set(&req->refs, 2);
 	req->result = 0;
+	/* task will wait for requests on exit, don't need a ref */
+	req->task = current;
 	INIT_IO_WORK(&req->work, io_wq_submit_work);
 	return req;
 fallback:
@@ -3448,15 +3438,22 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt,
 static void io_poll_remove_one(struct io_kiocb *req)
 {
 	struct io_poll_iocb *poll = &req->poll;
+	bool do_complete = false;
 
 	spin_lock(&poll->head->lock);
 	WRITE_ONCE(poll->canceled, true);
 	if (!list_empty(&poll->wait.entry)) {
 		list_del_init(&poll->wait.entry);
-		io_queue_async_work(req);
+		do_complete = true;
 	}
 	spin_unlock(&poll->head->lock);
 	hash_del(&req->hash_node);
+	if (do_complete) {
+		io_cqring_fill_event(req, -ECANCELED);
+		io_commit_cqring(req->ctx);
+		req->flags |= REQ_F_COMP_LOCKED;
+		io_put_req(req);
+	}
 }
 
 static void io_poll_remove_all(struct io_ring_ctx *ctx)
@@ -3474,6 +3471,8 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx)
 			io_poll_remove_one(req);
 	}
 	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_ev_posted(ctx);
 }
 
 static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)
@@ -3539,92 +3538,18 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
 	io_commit_cqring(ctx);
 }
 
-static void io_poll_complete_work(struct io_wq_work **workptr)
+static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
 {
-	struct io_wq_work *work = *workptr;
-	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct io_poll_iocb *poll = &req->poll;
-	struct poll_table_struct pt = { ._key = poll->events };
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_kiocb *nxt = NULL;
-	__poll_t mask = 0;
-	int ret = 0;
 
-	if (work->flags & IO_WQ_WORK_CANCEL) {
-		WRITE_ONCE(poll->canceled, true);
-		ret = -ECANCELED;
-	} else if (READ_ONCE(poll->canceled)) {
-		ret = -ECANCELED;
-	}
-
-	if (ret != -ECANCELED)
-		mask = vfs_poll(poll->file, &pt) & poll->events;
-
-	/*
-	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
-	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
-	 * synchronize with them.  In the cancellation case the list_del_init
-	 * itself is not actually needed, but harmless so we keep it in to
-	 * avoid further branches in the fast path.
-	 */
 	spin_lock_irq(&ctx->completion_lock);
-	if (!mask && ret != -ECANCELED) {
-		add_wait_queue(poll->head, &poll->wait);
-		spin_unlock_irq(&ctx->completion_lock);
-		return;
-	}
 	hash_del(&req->hash_node);
-	io_poll_complete(req, mask, ret);
-	spin_unlock_irq(&ctx->completion_lock);
-
-	io_cqring_ev_posted(ctx);
-
-	if (ret < 0)
-		req_set_fail_links(req);
-	io_put_req_find_next(req, &nxt);
-	if (nxt)
-		io_wq_assign_next(workptr, nxt);
-}
-
-static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
-{
-	struct io_kiocb *req, *tmp;
-	struct req_batch rb;
-
-	rb.to_free = rb.need_iter = 0;
-	spin_lock_irq(&ctx->completion_lock);
-	llist_for_each_entry_safe(req, tmp, nodes, llist_node) {
-		hash_del(&req->hash_node);
-		io_poll_complete(req, req->result, 0);
-
-		if (refcount_dec_and_test(&req->refs) &&
-		    !io_req_multi_free(&rb, req)) {
-			req->flags |= REQ_F_COMP_LOCKED;
-			io_free_req(req);
-		}
-	}
+	io_poll_complete(req, req->result, 0);
+	req->flags |= REQ_F_COMP_LOCKED;
+	io_put_req_find_next(req, nxt);
 	spin_unlock_irq(&ctx->completion_lock);
 
 	io_cqring_ev_posted(ctx);
-	io_free_req_many(ctx, &rb);
-}
-
-static void io_poll_flush(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-	struct llist_node *nodes;
-
-	nodes = llist_del_all(&req->ctx->poll_llist);
-	if (nodes)
-		__io_poll_flush(req->ctx, nodes);
-}
-
-static void io_poll_trigger_evfd(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-
-	eventfd_signal(req->ctx->cq_ev_fd, 1);
-	io_put_req(req);
 }
 
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -3632,8 +3557,9 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	struct io_ring_ctx *ctx = req->ctx;
 	__poll_t mask = key_to_poll(key);
+	struct task_struct *tsk;
+	unsigned long flags;
 
 	/* for instances that support it check for an event match first: */
 	if (mask && !(mask & poll->events))
@@ -3641,46 +3567,12 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 
 	list_del_init(&poll->wait.entry);
 
-	/*
-	 * Run completion inline if we can. We're using trylock here because
-	 * we are violating the completion_lock -> poll wq lock ordering.
-	 * If we have a link timeout we're going to need the completion_lock
-	 * for finalizing the request, mark us as having grabbed that already.
-	 */
-	if (mask) {
-		unsigned long flags;
-
-		if (llist_empty(&ctx->poll_llist) &&
-		    spin_trylock_irqsave(&ctx->completion_lock, flags)) {
-			bool trigger_ev;
-
-			hash_del(&req->hash_node);
-			io_poll_complete(req, mask, 0);
-
-			trigger_ev = io_should_trigger_evfd(ctx);
-			if (trigger_ev && eventfd_signal_count()) {
-				trigger_ev = false;
-				req->work.func = io_poll_trigger_evfd;
-			} else {
-				req->flags |= REQ_F_COMP_LOCKED;
-				io_put_req(req);
-				req = NULL;
-			}
-			spin_unlock_irqrestore(&ctx->completion_lock, flags);
-			__io_cqring_ev_posted(ctx, trigger_ev);
-		} else {
-			req->result = mask;
-			req->llist_node.next = NULL;
-			/* if the list wasn't empty, we're done */
-			if (!llist_add(&req->llist_node, &ctx->poll_llist))
-				req = NULL;
-			else
-				req->work.func = io_poll_flush;
-		}
-	}
-	if (req)
-		io_queue_async_work(req);
-
+	tsk = req->task;
+	req->result = mask;
+	raw_spin_lock_irqsave(&tsk->uring_lock, flags);
+	list_add_tail(&req->list, &tsk->uring_work);
+	raw_spin_unlock_irqrestore(&tsk->uring_lock, flags);
+	wake_up_process(tsk);
 	return 1;
 }
 
@@ -3739,7 +3631,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 	bool cancel = false;
 	__poll_t mask;
 
-	INIT_IO_WORK(&req->work, io_poll_complete_work);
 	INIT_HLIST_NODE(&req->hash_node);
 
 	poll->head = NULL;
@@ -5243,6 +5134,28 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 	return autoremove_wake_function(curr, mode, wake_flags, key);
 }
 
+void io_uring_task_handler(struct task_struct *tsk)
+{
+	LIST_HEAD(local_list);
+	struct io_kiocb *req;
+
+	raw_spin_lock_irq(&tsk->uring_lock);
+	if (!list_empty(&tsk->uring_work))
+		list_splice_init(&tsk->uring_work, &local_list);
+	raw_spin_unlock_irq(&tsk->uring_lock);
+
+	while (!list_empty(&local_list)) {
+		struct io_kiocb *nxt = NULL;
+
+		req = list_first_entry(&local_list, struct io_kiocb, list);
+		list_del(&req->list);
+
+		io_poll_task_handler(req, &nxt);
+		if (nxt)
+			__io_queue_sqe(nxt, NULL);
+	}
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..447b06c6bed0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -685,6 +685,11 @@ struct task_struct {
 #endif
 	struct sched_dl_entity		dl;
 
+#ifdef CONFIG_IO_URING
+	struct list_head		uring_work;
+	raw_spinlock_t			uring_lock;
+#endif
+
 #ifdef CONFIG_UCLAMP_TASK
 	/* Clamp values requested for a scheduling entity */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51ca491d99ed..170fefa1caf8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
 
+#ifdef CONFIG_IO_URING
+	INIT_LIST_HEAD(&p->uring_work);
+	raw_spin_lock_init(&p->uring_lock);
+#endif
+
 #ifdef CONFIG_COMPACTION
 	p->capture_control = NULL;
 #endif
@@ -4104,6 +4109,20 @@ void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
+#ifdef CONFIG_IO_URING
+extern void io_uring_task_handler(struct task_struct *tsk);
+
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+	if (!list_empty(&tsk->uring_work))
+		io_uring_task_handler(tsk);
+}
+#else /* !CONFIG_IO_URING */
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+}
+#endif
+
 static void sched_out_update(struct task_struct *tsk)
 {
 	/*
@@ -4121,6 +4140,7 @@ static void sched_out_update(struct task_struct *tsk)
 			io_wq_worker_sleeping(tsk);
 		preempt_enable_no_resched();
 	}
+	io_uring_handler(tsk);
 }
 
 static void sched_in_update(struct task_struct *tsk)
@@ -4131,6 +4151,7 @@ static void sched_in_update(struct task_struct *tsk)
 		else
 			io_wq_worker_running(tsk);
 	}
+	io_uring_handler(tsk);
 }
 
 static inline void sched_submit_work(struct task_struct *tsk)

-- 
Jens Axboe


^ permalink raw reply related

* Re: [PATCH v1] builtin/rebase: remove a call to get_oid() on `options.switch_to'
From: Alban Gruin @ 2020-02-14 20:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, phillip.wood, Johannes Schindelin
In-Reply-To: <xmqq1rrr6ww2.fsf@gitster-ct.c.googlers.com>

Hi Junio,

Le 22/01/2020 à 21:47, Junio C Hamano a écrit :
> Alban Gruin <alban.gruin@gmail.com> writes:
> 
>> When `options.switch_to' is set, `options.orig_head' is populated right
>> after.  Therefore, there is no need to parse `switch_to' again.
>>
>> Signed-off-by: Alban Gruin <alban.gruin@gmail.com>
>> ---
>>  builtin/rebase.c | 10 +---------
>>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> Sounds good.
> 

Did this patch fell through the cracks?

Alban



^ permalink raw reply

* Re: [PATCH v2 21/21] target/arm: Correctly implement ACTLR2, HACTLR2
From: Richard Henderson @ 2020-02-14 20:42 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel
  Cc: Eric Auger, Aaron Lindsay, Philippe Mathieu-Daudé
In-Reply-To: <20200214175116.9164-22-peter.maydell@linaro.org>

On 2/14/20 9:51 AM, Peter Maydell wrote:
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index c46bb5a5c09..9f618e120aa 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -2718,6 +2718,7 @@ static void arm_max_initfn(Object *obj)
>  
>              t = cpu->isar.id_mmfr4;
>              t = FIELD_DP32(t, ID_MMFR4, HPDS, 1); /* AA32HPD */
> +            t = FIELD_DP32(t, ID_MMFR4, AC2, 1); /* ACTLR2, HACTLR2 */
>              cpu->isar.id_mmfr4 = t;
>          }
>  #endif
> diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
> index 8430d432943..32cf8ee98b0 100644
> --- a/target/arm/cpu64.c
> +++ b/target/arm/cpu64.c
> @@ -703,6 +703,10 @@ static void aarch64_max_initfn(Object *obj)
>          u = FIELD_DP32(u, ID_MMFR3, PAN, 2); /* ATS1E1 */
>          cpu->isar.id_mmfr3 = u;
>  
> +        u = cpu->isar.id_mmfr4;
> +        u = FIELD_DP32(u, ID_MMFR4, AC2, 1); /* ACTLR2, HACTLR2 implemented */
> +        cpu->isar.id_mmfr4 = u;

This highlights a missing set of HPDS for cpu64 max.

Saying "implemented" is somewhat redundant.  Otherwise,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

^ permalink raw reply

* Re: Heads up: phylink changes for next merge window
From: Russell King - ARM Linux admin @ 2020-02-14 20:42 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Florian Fainelli, Sean Wang, John Crispin, Mark Lee,
	Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu, Vladimir Oltean,
	Radhey Shyam Pandey, Nicolas Ferre, Thomas Petazzoni,
	Ioana Ciornei
In-Reply-To: <20200214150826.GF15299@lunn.ch>

On Fri, Feb 14, 2020 at 04:08:26PM +0100, Andrew Lunn wrote:
> > The reasoning is, if the PHY detect bit is set, the PPU should be
> > polling the attached PHY and automatically updating the MAC to reflect
> > the PHY status.  This seems great...
> > 
> > On the ZII dev rev C, we have the following port status values:
> > - port 0 = 0xe04
> > - port 1 through 8 = 0x100f
> > - port 9 = 0x49
> > - port 10 = 0xcf4c
> > 
> > On the ZII dev rev B, port 4 (which is one of the optical ports) has a
> > value of 0xde09, despite being linked to the on-board serdes.  It seems
> > that the PPU on the 88e6352 automatically propagates the status from the
> > serdes there.
> > 
> > So, it looks to me like using the PHY detect bit is the right solution,
> > we just need access to it at this mid-layer...
> 
> Hi Russell
> 
> We need to be careful of the PPU. I'm assuming it uses MDIO to access
> the PHY registers. We have code which allows the PHY page to the
> changed, e.g. hwmon to get the temperature sensors, and i will soon
> have code for cable diagnostics. We don't want the PPU reading some
> temperature sensor registers and configuring the MAC based on that.
> 
> For cases not using a PHY, e.g. the SERDES on the 6352, it might be
> safe to use the PPU.

Bear in mind that the PPU has been used for some time.

However, what I read in some of the functional specs is the built-in
PHYs use a more "direct" method to communicate their negotiated results
to the MAC.  Whether that also applies to the 6352 Serdes or not, I
don't know, but as port 4 will automatically switch between the
built-in copper PHY and Serdes depending on which link comes up first.

It's interesting, however, reading some of the functional specs, where
some say that the PPU must be globally disabled before access is
allowed to the MDIO bus, others the bit for globally disabling the PPU
is "reserved".

That all said, using the PPU PHY_DETECT bit seems to me to be the right
solution - if the chip is polling the PHYs itself, we want to unforce
the speed, duplex and pause, otherwise we need to force them to the
link parameters.  If we need to disable the PPU for a port, then
disabling PHY_DETECT seems like the right answer.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply

* Re: [PATCH] objtool: ignore .L prefixed local symbols
From: Arvind Sankar @ 2020-02-14 20:42 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Arvind Sankar, Nick Desaulniers, jpoimboe, peterz,
	clang-built-linux, Nathan Chancellor, linux-kernel
In-Reply-To: <20200214180527.z44b4bmzn336mff2@google.com>

On Fri, Feb 14, 2020 at 10:05:27AM -0800, Fangrui Song wrote:
> I know little about objtool, but if it may be used by other
> architectures, hope the following explanations don't appear to be too
> off-topic:)
> 
> On 2020-02-14, Arvind Sankar wrote:
> >Can you describe what case the clang change is supposed to optimize?
> >AFAICT, it kicks in when the symbol is known by the compiler to be local
> >to the DSO and defined in the same translation unit.
> >
> >But then there are two cases:
> >(a) we have call foo, where foo is defined in the same section as the
> >call instruction. In this case the assembler should be able to fully
> >resolve foo and not generate any relocation, regardless of whether foo
> >is global or local.
> 
> If foo is STB_GLOBAL or STB_WEAK, the assembler cannot fully resolve a
> reference to foo in the same section, unless the assembler can assume
> (the codegen tells it) the call to foo cannot be interposed by another
> foo definition at runtime.

I was testing with hidden/protected visibility, I see you want this for
the no-semantic-interposition case. Actually a bit more testing shows
some peculiarities even with hidden visibility. With the below, the call
and lea create relocations in the object file, but the jmp doesn't. ld
does avoid creating a plt for this though.

	.text
	.globl foo, bar
	.hidden foo
	bar:
		call	foo
		leaq	foo(%rip), %rax
		jmp	foo

	foo:	ret

> 
> >(b) we have call foo, where foo is defined in a different section from
> >the call instruction. In this case the assembler must generate a
> >relocation regardless of whether foo is global or local, and the linker
> >should eliminate it.
> >In what case does does replacing call foo with call .Lfoo$local help?
> 
> For -fPIC -fno-semantic-interposition, the assembly emitter can perform
> the following optimization:
> 
>    void foo() {}
>    void bar() { foo(); }
> 
>    .globl foo, bar
>    foo:
>    .Lfoo$local:
>      ret
>    bar:
>      call foo  --> call .Lfoo$local
>      ret
> 
> call foo generates an R_X86_64_PLT32. In a -shared link, it creates an
> unneeded PLT entry for foo.
> 
> call .Lfoo$local generates an R_X86_64_PLT32. In a -shared link, .Lfoo$local is
> non-preemptible => no PLT entry is created.
> 
> For -fno-PIC and -fPIE, the final link is expected to be -no-pie or
> -pie. This optimization does not save anything, because PLT entries will
> not be generated. With clang's integrated assembler, it may increase the
> number of STT_SECTION symbols (because .Lfoo$local will be turned to a
> STT_SECTION relative relocation), but the size increase is very small.
> 
> 
> I want to teach clang -fPIC to use -fno-semantic-interposition by
> default. (It is currently an LLVM optimization, not realized in clang.)
> clang traditionally makes various -fno-semantic-interposition
> assumptions and can perform interprocedural optimizations even if the
> strict ELF rule disallows them.

FWIW, gcc with no-semantic-interposition also uses local aliases, but
rather than using .L labels, it creates a local alias by
	.set foo.localalias, foo
This makes the type of foo.localalias the same as foo, which I gather
should placate objtool as it'll still see an STT_FUNC no matter whether
it picks up foo.localalias or foo.

^ permalink raw reply


This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.