The DataLoader Deadlock Problem

The deadlock problem

Recently my training code kept hanging whenever I ran it. The traceback looks like this:

Traceback (most recent call last):
  File "train.py", line 265, in <module>
    main()
  File "train.py", line 262, in main
    k_fold_train(batch_size, num_workers, learning_rate, weight_decay, num_epoch, model_path)
  File "train.py", line 250, in k_fold_train
    train(train_loader, valid_loader, learning_rate, weight_decay, num_epoch, model_path, fold)
  File "train.py", line 167, in train
    for batch in tqdm(train_loader):
  File "/home/anaconda3/envs/pytorch/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1182, in _next_data
    idx, data = self._get_data()
  File "/home/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1138, in _get_data
    success, data = self._try_get_data()
  File "/home/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/anaconda3/envs/pytorch/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/anaconda3/envs/pytorch/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)

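From this traceback we can see that a deadlock has occurred: the main process is stuck in self._data_queue.get(timeout=timeout), waiting on the DataLoader's data queue.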
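To make the last frames of the traceback easier to read, here is a minimal sketch of the same waiting pattern using only the standard library. It is not PyTorch's actual implementation; `get_batch` is a hypothetical helper that simply mirrors the shape of a "try to get data with a timeout" call on a `queue.Queue`.

```python
import queue
import threading


def get_batch(data_queue, timeout):
    """Hypothetical helper: poll the queue once and report (success, data)."""
    try:
        return True, data_queue.get(timeout=timeout)
    except queue.Empty:
        # Nothing arrived within `timeout` seconds.
        return False, None


# Case 1: a producer thread delivers a batch, so the consumer receives it.
q = queue.Queue()
producer = threading.Thread(target=lambda: q.put("batch-0"))
producer.start()
producer.join()
ok, data = get_batch(q, timeout=0.5)

# Case 2: no producer ever puts anything on the queue, so the poll times out.
# A consumer that keeps retrying this call forever looks "hung" from outside.
stalled = queue.Queue()
ok2, data2 = get_batch(stalled, timeout=0.2)

print(ok, data)
print(ok2, data2)
```

If the workers never deliver a batch, a loop around such a call never makes progress, which is what the frames in queue.py and threading.py above correspond to.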

Solutions

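  1. Check whether your code uses cv2 (OpenCV). cv2's own multithreading can conflict with PyTorch's. Use the following statements to turn off cv2's multithreading: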

    import cv2
    cv2.setNumThreads(0)         # disable OpenCV's internal thread pool
    cv2.ocl.setUseOpenCL(False)  # disable OpenCL acceleration
  2. The PyTorch DataLoader itself has locked up. Setting num_workers to 0 or 1 resolves the problem, but it also makes data loading slower.

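  3. If you use num_workers together with libraries that run their own threads, you can launch the script with the following environment variables to restrict each DataLoader worker process to a single thread: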

    OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py

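    You can also set them in code. Put this at the very top of the script, so that the variables are set before libraries such as NumPy or PyTorch are imported: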

    import os
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
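The fixes above can be combined in one place. The following is only a sketch under stated assumptions: `limit_env_threads` and `worker_init_fn` are hypothetical helper names, `dataset` is a placeholder, and the cv2 calls are skipped when OpenCV is not installed. The `worker_init_fn` argument of `DataLoader` is a real parameter that is called once in each worker process.

```python
import os

# Environment variables from fix 3. Assumption: they only take effect if they
# are set before the numerical libraries are imported for the first time.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")


def limit_env_threads(n=1):
    """Hypothetical helper: cap OpenMP/MKL threads via environment variables."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)


def worker_init_fn(worker_id):
    """Hypothetical helper: apply fix 1 inside each DataLoader worker."""
    try:
        import cv2
    except ImportError:
        return  # OpenCV is not installed, nothing to configure
    cv2.setNumThreads(0)
    cv2.ocl.setUseOpenCL(False)


limit_env_threads(1)

# Usage (requires PyTorch; `dataset` is a placeholder for your own Dataset):
# from torch.utils.data import DataLoader
# train_loader = DataLoader(dataset, batch_size=32, num_workers=4,
#                           worker_init_fn=worker_init_fn)
```

Calling the cv2 settings from `worker_init_fn` keeps them next to the DataLoader configuration; setting them once at module level, as in fix 1, is the simpler option when the whole script should run without OpenCV threads.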

References

https://blog.csdn.net/qq_32526087/article/details/106350530

https://github.com/pytorch/pytorch/issues/1355


Permalink: http://example.com/2021/07/05/dataloader的死锁问题/
Author: Neko kiku
Posted on: July 5, 2021