twx-linux/kernel/bpf
Chen Ridong 0d86cd70fc cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
[ Upstream commit 117932eea99b729ee5d12783601a4f7f5fd58a23 ]

A hung_task problem shown below was found:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
 <TASK>
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

This issue can be reproduced by the following pressuse test:
1. A large number of cpuset cgroups are deleted.
2. Set cpu on and off repeatly.
3. Set watchdog_thresh repeatly.
The scripts can be obtained at LINK mentioned above the signature.

The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock.
It can lead to a deadlock through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The max_active
   of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
   cgroup_bpf_release works, and many cgroup_bpf_release works will be put
   into inactive queue. As illustrated in the diagram, there are 256 (in
   the acvtive queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
   smp_call_on_cpu work into system_wq. However step 1 has already filled
   system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
   to wait until the works that were put into the inacvtive queue earlier
   have executed (n cgroup_bpf_release), so it will be blocked for a while.
3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
   cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
   When cgroup_metux is acqured by work at css_killed_work_fn, it will
   call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
   However, cpuset_css_offline will be blocked for step 3.
5. At this moment, there are 256 works in active queue that are
   cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
   a result, all of them are blocked. Consequently, sscs.work can not be
   executed. Ultimately, this situation leads to four processes being
   blocked, forming a deadlock.

system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
...
2000+ cgroups deleted asyn
256 actives + n inactives
				__lockup_detector_reconfigure
				P(cpu_hotplug_lock.read)
				put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
				warting sscs.work finish
								percpu_down_write
								P(cpu_hotplug_lock.write)
								...blocking...
											css_killed_work_fn
											P(cgroup_mutex)
											cpuset_css_offline
											P(cpu_hotplug_lock.read)
											...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
..blocking...

To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue which can break the loop and solve the problem. System wqs are
for misc things which shouldn't create a large number of concurrent work
items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
work items, it should use its own dedicated workqueue.

Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Cc: stable@vger.kernel.org # v5.3+
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08 16:28:24 +01:00
..
preload bpf: make preloaded map iterators to display map elements count 2023-07-06 12:42:25 -07:00
arraymap.c bpf: Check percpu map value size first 2024-10-17 15:24:15 +02:00
bloom_filter.c bpf: Check bloom filter map value size 2024-05-17 12:02:11 +02:00
bpf_cgrp_storage.c
bpf_inode_storage.c Networking changes for 6.4. 2023-04-26 16:07:23 -07:00
bpf_iter.c
bpf_local_storage.c bpf: fix order of args in call to bpf_map_kvcalloc 2024-07-18 13:21:13 +02:00
bpf_lru_list.c bpf: Address KCSAN report on bpf_lru_list 2023-05-12 12:01:03 -07:00
bpf_lru_list.h bpf: lru: Remove unused declaration bpf_lru_promote() 2023-08-08 17:21:42 -07:00
bpf_lsm.c
bpf_struct_ops_types.h
bpf_struct_ops.c bpf: Support default .validate() and .update() behavior for struct_ops links 2023-08-14 22:23:39 -07:00
bpf_task_storage.c
btf.c bpf: Fix memory leak in bpf_core_apply 2024-11-01 01:58:18 +01:00
cgroup_iter.c
cgroup.c cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction 2024-11-08 16:28:24 +01:00
core.c bpf: Prevent tail call between progs attached to different hooks 2024-10-17 15:24:16 +02:00
cpumap.c bpf: report RCU QS in cpumap kthread 2024-03-26 18:20:12 -04:00
cpumask.c bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu. 2023-07-12 23:45:23 +02:00
devmap.c bpf: devmap: provide rxq after redirect 2024-11-01 01:58:17 +01:00
disasm.c bpf: change bpf_alu_sign_string and bpf_movsx_string to static 2023-08-04 16:15:50 -07:00
disasm.h
dispatcher.c
hashtab.c bpf: Check percpu map value size first 2024-10-17 15:24:15 +02:00
helpers.c bpf: Add MEM_WRITE attribute 2024-11-01 01:58:30 +01:00
inode.c bpf: convert to ctime accessor functions 2023-07-24 10:30:07 +02:00
Kconfig bpf: Add fd-based tcx multi-prog infra with link support 2023-07-19 10:07:27 -07:00
link_iter.c
local_storage.c cgroup changes for v6.4-rc1 2023-04-29 10:05:22 -07:00
log.c bpf: drop unnecessary user-triggerable WARN_ONCE in verifierl log 2023-05-16 22:34:50 -07:00
lpm_trie.c bpf: Fix out-of-bounds write in trie_get_next_key() 2024-11-08 16:28:18 +01:00
Makefile bpf: Add fd-based tcx multi-prog infra with link support 2023-07-19 10:07:27 -07:00
map_in_map.c bpf: Optimize the free of inner map 2024-06-21 14:38:15 +02:00
map_in_map.h bpf: Add map and need_defer parameters to .map_fd_put_ptr() 2024-01-25 15:35:22 -08:00
map_iter.c bpf: allow any program to use the bpf_map_sum_elem_count kfunc 2023-07-19 09:48:53 -07:00
memalloc.c bpf: Use c->unit_size to select target cache during free 2024-01-25 15:35:28 -08:00
mmap_unlock_work.h
mprog.c bpf: Handle bpf_mprog_query with NULL entry 2023-10-06 17:11:20 -07:00
net_namespace.c
offload.c bpf: Avoid dummy bpf_offload_netdev in __bpf_prog_dev_bound_init 2023-09-11 22:06:06 -07:00
percpu_freelist.c
percpu_freelist.h
prog_iter.c
queue_stack_maps.c bpf: Avoid deadlock when using queue and stack maps from NMI 2023-09-11 19:04:49 -07:00
reuseport_array.c bpf: Centralize permissions checks for all BPF map types 2023-06-19 14:04:04 +02:00
ringbuf.c bpf: Add MEM_WRITE attribute 2024-11-01 01:58:30 +01:00
stackmap.c bpf: Fix stackmap overflow check on 32-bit arches 2024-03-26 18:19:39 -04:00
syscall.c bpf: Add MEM_WRITE attribute 2024-11-01 01:58:30 +01:00
sysfs_btf.c
task_iter.c bpf: Fix iter/task tid filtering 2024-11-01 01:58:25 +01:00
tcx.c bpf: Handle bpf_mprog_query with NULL entry 2023-10-06 17:11:20 -07:00
tnum.c
trampoline.c bpf, x64: Fix tailcall infinite loop 2023-11-20 11:58:55 +01:00
verifier.c bpf: Force checkpoint when jmp history is too long 2024-11-08 16:28:18 +01:00