netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Fastabend <john.fastabend@gmail.com>
To: Martin KaFai Lau <kafai@fb.com>
Cc: ast@kernel.org, daniel@iogearbox.net, netdev@vger.kernel.org
Subject: Re: [bpf PATCH v2 3/6] bpf: sockhash fix omitted bucket lock in sock_close
Date: Fri, 15 Jun 2018 08:23:14 -0700	[thread overview]
Message-ID: <fe2127cd-7943-c68d-8129-49cf8d7e77e8@gmail.com> (raw)
In-Reply-To: <20180615054137.qfw3ad4jzvs74esy@kafai-mbp.dhcp.thefacebook.com>

On 06/14/2018 10:41 PM, Martin KaFai Lau wrote:
> On Thu, Jun 14, 2018 at 09:44:57AM -0700, John Fastabend wrote:
>> First in tcp_close, reduce scope of sk_callback_lock() the lock is
>> only needed for protecting smap_release_sock() the ingress and cork
>> lists are protected by sock lock. Having the lock in wider scope is
>> harmless but may confuse the reader who may infer it is in fact
>> needed.
>>
>> Next, in sock_hash_delete_elem() the pattern is as follows,
>>
>>   sock_hash_delete_elem()
>>      [...]
>>      spin_lock(bucket_lock)
>>      l = lookup_elem_raw()
>>      if (l)
>>         hlist_del_rcu()
>>         write_lock(sk_callback_lock)
>>          .... destroy psock ...
>>         write_unlock(sk_callback_lock)
>>      spin_unlock(bucket_lock)
>>
>> The ordering is necessary because we only know the {p}sock after
>> dereferencing the hash table which we can't do unless we have the
>> bucket lock held. Once we have the bucket lock and the psock element
>> it is deleted from the hashmap to ensure any other path doing a lookup
>> will fail. Finally, the refcnt is decremented and if zero the psock
>> is destroyed.
>>
>> In parallel with the above (or free'ing the map) a tcp close event
>> may trigger tcp_close(). Which at the moment omits the bucket lock
>> altogether (oops!) where the flow looks like this,
>>
>>   bpf_tcp_close()
>>      [...]
>>      write_lock(sk_callback_lock)
>>      for each psock->maps // list of maps this sock is part of
>>          hlist_del_rcu(ref_hash_node);
>>          .... destroy psock ...
>>      write_unlock(sk_callback_lock)
>>
>> Obviously, and demonstrated by syzbot, this is broken because
>> we can have multiple threads deleting entries via hlist_del_rcu().
>>
>> To fix this we might be tempted to wrap the hlist operation in a
>> bucket lock but that would create a lock inversion problem. In
>> summary to follow locking rules maps needs the sk_callback_lock but we
>> need the bucket lock to do the hlist_del_rcu. To resolve the lock
>> inversion problem note that when bpf_tcp_close is called no updates
>> can happen in parallel, due to ESTABLISH state check in update logic,
>> so pop the head of the list repeatedly and remove the reference until
>> no more are left. If a delete happens in parallel from the BPF API
>> that is OK as well because it will do a similar action, lookup the
>> sock in the map/hash, delete it from the map/hash, and dec the refcnt.
>> We check for this case before doing a destroy on the psock to ensure
>> we don't have two threads tearing down a psock. The new logic is
>> as follows,
>>
>>   bpf_tcp_close()
>>   e = psock_map_pop(psock->maps) // done with sk_callback_lock
>>   bucket_lock() // lock hash list bucket
>>   l = lookup_elem_raw(head, hash, key, key_size);
>>   if (l) {
>>      //only get here if elmnt was not already removed
>>      hlist_del_rcu()
>>      ... destroy psock...
>>   }
>>   bucket_unlock()
>>
>> And finally for all the above to work add missing sk_callback_lock
>> around smap_list_remove in sock_hash_ctx_update_elem(). Otherwise
>> delete and update may corrupt maps list.
>>
>> (As an aside the sk_callback_lock serves two purposes. The
>>  first, is to update the sock callbacks sk_data_ready, sk_write_space,
>>  etc. The second is to protect the psock 'maps' list. The 'maps' list
>>  is used to (as shown above) to delete all map/hash references to a
>>  sock when the sock is closed)
>>
>> (If we did not have the ESTABLISHED state guarantee from tcp_close
>>  then we could not ensure completion because updates could happen
>>  forever and pin thread in delete loop.)
>>
>> Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
>> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> ---
>>  0 files changed
>>

^^^^ Will fix this 0 files changes as well.

>>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>> @@ -2114,10 +2169,13 @@ static void sock_hash_free(struct bpf_map *map)
>>  	 */
>>  	rcu_read_lock();
>>  	for (i = 0; i < htab->n_buckets; i++) {
>> -		struct hlist_head *head = select_bucket(htab, i);
>> +		struct bucket *b = __select_bucket(htab, i);
>> +		struct hlist_head *head;
>>  		struct hlist_node *n;
>>  		struct htab_elem *l;
>>  
>> +		raw_spin_lock_bh(&b->lock);
> There is a synchronize_rcu() at the beginning of sock_hash_free().
> Taking the bucket's lock of the free-ing map at this point is a bit
> suspicious.  What may still access the map after synchronize_rcu()?
> 

tcp_close() may be called while the map is being free. The sync_rcu will
only sync the BPF side.

.John

  reply	other threads:[~2018-06-15 15:23 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-14 16:44 [bpf PATCH v2 0/6] BPF fixes for sockhash John Fastabend
2018-06-14 16:44 ` [bpf PATCH v2 1/6] bpf: sockmap, fix crash when ipv6 sock is added John Fastabend
2018-06-14 23:53   ` Martin KaFai Lau
2018-06-15  4:46     ` John Fastabend
2018-06-14 16:44 ` [bpf PATCH v2 2/6] bpf: sockmap only allow ESTABLISHED sock state John Fastabend
2018-06-15  0:18   ` Martin KaFai Lau
2018-06-18 14:50     ` John Fastabend
2018-06-18 21:17       ` Martin KaFai Lau
2018-06-20 22:15         ` John Fastabend
2018-06-14 16:44 ` [bpf PATCH v2 3/6] bpf: sockhash fix omitted bucket lock in sock_close John Fastabend
2018-06-15  5:41   ` Martin KaFai Lau
2018-06-15 15:23     ` John Fastabend [this message]
2018-06-15 15:45       ` Martin KaFai Lau
2018-06-14 16:45 ` [bpf PATCH v2 4/6] bpf: sockmap, tcp_disconnect to listen transition John Fastabend
2018-06-15  6:04   ` Martin KaFai Lau
2018-06-14 16:45 ` [bpf PATCH v2 5/6] bpf: sockhash, add release routine John Fastabend
2018-06-15  6:05   ` Martin KaFai Lau
2018-06-14 16:45 ` [bpf PATCH v2 6/6] bpf: selftest remove attempts to add LISTEN sockets to sockmap John Fastabend
2018-06-15  6:07   ` Martin KaFai Lau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fe2127cd-7943-c68d-8129-49cf8d7e77e8@gmail.com \
    --to=john.fastabend@gmail.com \
    --cc=ast@kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).