
Commit 22eab21

bpf: Call free_htab_elem() after htab_unlock_bucket()
JIRA: https://issues.redhat.com/browse/RHEL-85485
JIRA: https://issues.redhat.com/browse/RHEL-89790

commit b9e9ed9
Author: Hou Tao <[email protected]>
Date: Wed Nov 6 14:35:40 2024 +0800

bpf: Call free_htab_elem() after htab_unlock_bucket()

For htab of maps, when the map is removed from the htab, it may hold the last reference of the map. bpf_map_fd_put_ptr() will invoke bpf_map_free_id() to free the id of the removed map element. However, bpf_map_fd_put_ptr() is invoked while holding a bucket lock (raw_spinlock_t), and bpf_map_free_id() attempts to acquire map_idr_lock (spinlock_t), triggering the following lockdep warning:

=============================
[ BUG: Invalid wait context ]
6.11.0-rc4+ #49 Not tainted
-----------------------------
test_maps/4881 is trying to lock:
ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
other info that might help us debug this:
context-{5:5}
2 locks held by test_maps/4881:
 #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
 #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
stack backtrace:
CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
Call Trace:
 <TASK>
 dump_stack_lvl+0x6e/0xb0
 dump_stack+0x10/0x20
 __lock_acquire+0x73e/0x36c0
 lock_acquire+0x182/0x450
 _raw_spin_lock_irqsave+0x43/0x70
 bpf_map_free_id.part.0+0x21/0x70
 bpf_map_put+0xcf/0x110
 bpf_map_fd_put_ptr+0x9a/0xb0
 free_htab_elem+0x69/0xe0
 htab_map_update_elem+0x50f/0xa80
 bpf_fd_htab_map_update_elem+0x131/0x270
 htab_map_update_elem+0x50f/0xa80
 bpf_fd_htab_map_update_elem+0x131/0x270
 bpf_map_update_value+0x266/0x380
 __sys_bpf+0x21bb/0x36b0
 __x64_sys_bpf+0x45/0x60
 x64_sys_call+0x1b2a/0x20d0
 do_syscall_64+0x5d/0x100
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

One way to fix the lockdep warning is to use raw_spinlock_t for map_idr_lock as well. However, bpf_map_alloc_id() invokes idr_alloc_cyclic() after acquiring map_idr_lock, which would trigger a similar lockdep warning because the slab's lock (s->cpu_slab->lock) is still a spinlock.

Instead of changing map_idr_lock's type, fix the issue by invoking htab_put_fd_value() after htab_unlock_bucket(). However, only deferring the invocation of htab_put_fd_value() is not enough, because the old map pointers in an htab of maps cannot be saved during batched deletion. Therefore, also defer the invocation of free_htab_elem(), so these to-be-freed elements can be linked together, similarly to the LRU map.

There are four callers of ->map_fd_put_ptr:

(1) alloc_htab_elem() (through htab_put_fd_value())

It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of htab_put_fd_value() cannot simply be moved after htab_unlock_bucket(), because the old element has already been stashed in htab->extra_elems. It may be reused immediately after htab_unlock_bucket(), and an invocation of htab_put_fd_value() after htab_unlock_bucket() might then release the newly-added element incorrectly. Therefore, save the map pointer of the old element for the htab of maps before unlocking the bucket, and release the map_ptr after unlock. Besides the map pointer, do the same for the special fields in the old element.

(2) free_htab_elem() (through htab_put_fd_value())

Its callers include __htab_map_lookup_and_delete_elem(), htab_map_delete_elem() and __htab_map_lookup_and_delete_batch().
For htab_map_delete_elem(), simply invoke free_htab_elem() after htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just like the LRU map, link the to-be-freed elements into the node_to_free list and invoke free_htab_elem() for these elements after unlock. It is safe to reuse batch_flink as the link for node_to_free, because these elements have already been removed from the hash llist.

Because an htab of maps doesn't support the lookup_and_delete operation, __htab_map_lookup_and_delete_elem() doesn't have the problem, so it is kept as is.

(3) fd_htab_map_free()

It invokes ->map_fd_put_ptr() without holding a raw_spinlock_t.

(4) bpf_fd_htab_map_update_elem()

It invokes ->map_fd_put_ptr() without holding a raw_spinlock_t.

After moving free_htab_elem() outside the htab bucket lock scope, use pcpu_freelist_push() instead of __pcpu_freelist_push() so that irqs are disabled before freeing elements, and protect the invocations of bpf_mem_cache_free() with a migrate_{disable|enable} pair.

Signed-off-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Viktor Malik <[email protected]>
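To make the lock-nesting rule behind the splat concrete, here is a schematic sketch, not the hashtab code: all demo_* names are made up, and the point is only that a spinlock_t is a sleeping lock under PREEMPT_RT semantics, so lockdep's wait-context checking (CONFIG_PROVE_RAW_LOCK_NESTING) flags acquiring one while a raw_spinlock_t is held.

/* demo_nesting.c - schematic illustration, hypothetical demo_* names */
#include <linux/module.h>
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(demo_bucket_lock);	/* stands in for the htab bucket lock */
static DEFINE_SPINLOCK(demo_idr_lock);		/* stands in for map_idr_lock */

static int demo_shared;

/* Never called here; loading it and calling this would reproduce the
 * "BUG: Invalid wait context" report.
 */
static void __maybe_unused demo_bad_nesting(void)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&demo_bucket_lock, flags);
	/* spinlock_t acquired under a raw_spinlock_t: invalid wait context */
	spin_lock(&demo_idr_lock);
	demo_shared++;
	spin_unlock(&demo_idr_lock);
	raw_spin_unlock_irqrestore(&demo_bucket_lock, flags);
}

static void demo_fixed_ordering(void)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&demo_bucket_lock, flags);
	/* only unlink/bookkeeping work happens under the raw lock */
	raw_spin_unlock_irqrestore(&demo_bucket_lock, flags);

	/* the spinlock_t user (bpf_map_free_id() in the real code) runs
	 * only after the raw lock has been dropped
	 */
	spin_lock(&demo_idr_lock);
	demo_shared++;
	spin_unlock(&demo_idr_lock);
}

static int __init demo_init(void)
{
	demo_fixed_ordering();
	return 0;
}
module_init(demo_init);

static void __exit demo_exit(void) { }
module_exit(demo_exit);

MODULE_LICENSE("GPL");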
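The fix for the delete paths is the classic deferred-free pattern: unlink under the bucket lock, chain the victims on a local list through a field embedded in the element, and free only after the lock is released. Below is a minimal, self-contained userspace sketch of that pattern; every name is hypothetical and it is not the kernel code (build with: cc -pthread deferred_free_demo.c).

/* deferred_free_demo.c - userspace sketch of the deferred-free pattern */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct elem {
	int key;
	struct elem *hash_next;	  /* bucket chain */
	struct elem *batch_flink; /* reused as the to-free link once unlinked */
};

struct bucket {
	pthread_spinlock_t lock;
	struct elem *head;
};

/* Stand-in for free_htab_elem(): in the real code this may take other locks
 * (e.g. map_idr_lock via ->map_fd_put_ptr), so it must run without the
 * bucket lock held.
 */
static void free_elem(struct elem *e)
{
	printf("freeing key %d\n", e->key);
	free(e);
}

static void delete_all(struct bucket *b)
{
	struct elem *node_to_free = NULL;
	struct elem *e;

	pthread_spin_lock(&b->lock);
	while ((e = b->head) != NULL) {
		b->head = e->hash_next;		/* unlink under the lock */
		e->batch_flink = node_to_free;	/* defer the actual free */
		node_to_free = e;
	}
	pthread_spin_unlock(&b->lock);

	/* Freeing happens strictly outside the critical section. */
	while (node_to_free) {
		e = node_to_free;
		node_to_free = node_to_free->batch_flink;
		free_elem(e);
	}
}

int main(void)
{
	struct bucket b = { .head = NULL };

	pthread_spin_init(&b.lock, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < 3; i++) {
		struct elem *e = calloc(1, sizeof(*e));

		if (!e)
			return 1;
		e->key = i;
		e->hash_next = b.head;
		b.head = e;
	}
	delete_all(&b);
	pthread_spin_destroy(&b.lock);
	return 0;
}

In the patch, batch_flink plays the role of the embedded link and free_htab_elem() that of the deferred destructor, which is exactly what the hunks in __htab_map_lookup_and_delete_batch() below do.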
1 parent c32f17b commit 22eab21

File tree: 1 file changed, +39 -17 lines changed


kernel/bpf/hashtab.c

Lines changed: 39 additions & 17 deletions
@@ -895,9 +895,12 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	check_and_free_fields(htab, l);
+
+	migrate_disable();
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
 	bpf_mem_cache_free(&htab->ma, l);
+	migrate_enable();
 }
 
 static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
@@ -947,7 +950,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 	if (htab_is_prealloc(htab)) {
 		bpf_map_dec_elem_count(&htab->map);
 		check_and_free_fields(htab, l);
-		__pcpu_freelist_push(&htab->freelist, &l->fnode);
+		pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
 		htab_elem_free(htab, l);
@@ -1017,7 +1020,6 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			 */
 			pl_new = this_cpu_ptr(htab->extra_elems);
 			l_new = *pl_new;
-			htab_put_fd_value(htab, old_elem);
 			*pl_new = old_elem;
 		} else {
 			struct pcpu_freelist_node *l;
@@ -1104,6 +1106,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	struct htab_elem *l_new = NULL, *l_old;
 	struct hlist_nulls_head *head;
 	unsigned long flags;
+	void *old_map_ptr;
 	struct bucket *b;
 	u32 key_size, hash;
 	int ret;
@@ -1182,12 +1185,27 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
 	if (l_old) {
 		hlist_nulls_del_rcu(&l_old->hash_node);
+
+		/* l_old has already been stashed in htab->extra_elems, free
+		 * its special fields before it is available for reuse. Also
+		 * save the old map pointer in htab of maps before unlock
+		 * and release it after unlock.
+		 */
+		old_map_ptr = NULL;
+		if (htab_is_prealloc(htab)) {
+			if (map->ops->map_fd_put_ptr)
+				old_map_ptr = fd_htab_map_get_ptr(map, l_old);
+			check_and_free_fields(htab, l_old);
+		}
+	}
+	htab_unlock_bucket(htab, b, hash, flags);
+	if (l_old) {
+		if (old_map_ptr)
+			map->ops->map_fd_put_ptr(map, old_map_ptr, true);
 		if (!htab_is_prealloc(htab))
 			free_htab_elem(htab, l_old);
-		else
-			check_and_free_fields(htab, l_old);
 	}
-	ret = 0;
+	return 0;
 err:
 	htab_unlock_bucket(htab, b, hash, flags);
 	return ret;
@@ -1431,15 +1449,15 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
 		return ret;
 
 	l = lookup_elem_raw(head, hash, key, key_size);
-
-	if (l) {
+	if (l)
 		hlist_nulls_del_rcu(&l->hash_node);
-		free_htab_elem(htab, l);
-	} else {
+	else
 		ret = -ENOENT;
-	}
 
 	htab_unlock_bucket(htab, b, hash, flags);
+
+	if (l)
+		free_htab_elem(htab, l);
 	return ret;
 }
 
@@ -1852,13 +1870,14 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 			 * may cause deadlock. See comments in function
 			 * prealloc_lru_pop(). Let us do bpf_lru_push_free()
 			 * after releasing the bucket lock.
+			 *
+			 * For htab of maps, htab_put_fd_value() in
+			 * free_htab_elem() may acquire a spinlock with bucket
+			 * lock being held and it violates the lock rule, so
+			 * invoke free_htab_elem() after unlock as well.
 			 */
-			if (is_lru_map) {
-				l->batch_flink = node_to_free;
-				node_to_free = l;
-			} else {
-				free_htab_elem(htab, l);
-			}
+			l->batch_flink = node_to_free;
+			node_to_free = l;
 		}
 		dst_key += key_size;
 		dst_val += value_size;
@@ -1870,7 +1889,10 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	while (node_to_free) {
 		l = node_to_free;
 		node_to_free = node_to_free->batch_flink;
-		htab_lru_push_free(htab, l);
+		if (is_lru_map)
+			htab_lru_push_free(htab, l);
+		else
+			free_htab_elem(htab, l);
 	}
 
 next_batch:
