Parallel computation in PyTorch, successfully implemented!

Introduction

Models often contain operations that could run in parallel, for example in the following \(InceptionV3\) block:

The three branches can be computed at the same time. If that is achieved, the theoretical speedup is up to 3x.
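As a rough bound on that figure, assume the three branches take times \(t_1\), \(t_2\), \(t_3\) (symbols introduced here for illustration only) and that running them concurrently adds no overhead. The parallel run then takes as long as the slowest branch:

```latex
S = \frac{t_1 + t_2 + t_3}{\max(t_1, t_2, t_3)} \le 3
```

Equality holds only when all three branches cost the same. For instance, branches of 1 ms, 2 ms and 3 ms would give at most \(S = 6/3 = 2\).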

Approaches considered

torch.multiprocessing as mp

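This is PyTorch's multiprocessing library. It creates processes by forking: a process started at the Start call inherits the variables defined before that point. Unfortunately, the network is generally not shared. Each process trains its own copy, and neither returns the trained network. Getting the trained gradients back requires yet another library, so this does not meet the current requirement.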
Python's standard multiprocessing library cannot call CUDA from several processes at the same time.
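The approach finally chosen is multithreading. Threads do not need a context switch between one another, so the standard Thread is enough to run computations in parallel. We cannot keep spawning threads without limit, though, so a thread pool is needed to manage them, and a final wait blocks until the threads have finished. In theory, coroutines could do the job as well.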
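The thread-pool idea can be sketched without PyTorch. In this minimal example, `branch`, `run_sequential` and `run_threaded` are hypothetical names, and `time.sleep` stands in for a model branch on the assumption that the real work releases the GIL while it runs, as sleeping does.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def branch(name, seconds):
    """Stand-in for one branch of the model; sleeping releases the GIL."""
    time.sleep(seconds)
    return name


def run_sequential(jobs):
    start = time.perf_counter()
    results = [branch(name, seconds) for name, seconds in jobs]
    return results, time.perf_counter() - start


def run_threaded(jobs, max_workers=3):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(branch, name, seconds) for name, seconds in jobs]
        wait(futures)  # block until every submitted task has finished
        results = [future.result() for future in futures]
    return results, time.perf_counter() - start


jobs = [("branch_a", 0.2), ("branch_b", 0.2), ("branch_c", 0.2)]
seq_results, seq_time = run_sequential(jobs)
par_results, par_time = run_threaded(jobs)
print(f"sequential {seq_time:.2f}s, threaded {par_time:.2f}s")
```

Whether real PyTorch branches overlap this cleanly depends on how much of their time is spent outside the Python interpreter, so the timings here only illustrate the mechanism.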

Code examples

Federated learning

from concurrent.futures import ThreadPoolExecutor

thread_pool_executor = ThreadPoolExecutor(max_workers=8, thread_name_prefix="test_")


def update(image, label, model, criterion, optimizer):
    outputs = model(image)
    loss = criterion(outputs, label)
    loss.backward()  # compute gradients; they are stored on the model's parameters
    optimizer.step()
    optimizer.zero_grad()


for epoch in range(num_epochs):
    index = 0
    for i, data in enumerate(train_loader):
        images, labels = data
        images = images.to(device)
        if index in adverse_index:
            labels = RandomLabel(labels)  # swap in random labels for indices in adverse_index
        labels = labels.to(device)
        # queue one training step; submit returns immediately
        thread_pool_executor.submit(update, images, labels, net, criterion, optimizers)

# block until every queued step has finished
thread_pool_executor.shutdown(wait=True)

Create the thread pool.
submit adds a task to the pool.
shutdown waits for the threads to finish.
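One detail of these three steps is easy to miss: `submit` returns a `Future`, and an exception raised inside a task stays hidden unless `result()` is called on that future. The sketch below uses a hypothetical `update` stand-in that fails on one step to show how to surface such errors.

```python
from concurrent.futures import ThreadPoolExecutor


def update(step):
    """Hypothetical stand-in for one training step; step 2 fails on purpose."""
    if step == 2:
        raise ValueError(f"step {step} failed")
    return step * 10


pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="demo_")
futures = [pool.submit(update, step) for step in range(4)]
pool.shutdown(wait=True)  # returns normally even though one task raised

results, errors = [], []
for future in futures:
    try:
        results.append(future.result())  # re-raises the task's exception here
    except ValueError as exc:
        errors.append(str(exc))

print(results, errors)
```

Keeping the futures around costs little and makes a silently failed training step visible.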

Inception:

def run(path_code, path_index, out):
    # path_code[1] gives the loop count; path_code[2 + i] selects the op for step i
    for i in range(path_code[1]):
        layer_index = i  # - pool_num
        op = path_code[2 + i]
        index = (path_index * 3 + layer_index) * 6 + op
        feature_list.append(self.Nets[index](out))

thread_pool_executor.submit(run, path_code, path_index, out)
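In the snippet above, every thread appends to the same `feature_list`, so the order of its entries depends on thread timing. A minimal sketch of one alternative follows, in which each task builds and returns its own list. `ToyBlock`, `nets` and `paths` are hypothetical stand-ins, and plain callables replace the real layers so the example needs nothing beyond the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyBlock:
    """Hypothetical stand-in for the module that owns the layers."""

    def __init__(self):
        # Plain callables instead of real layers: op k adds k to its input.
        self.nets = [lambda x, k=k: x + k for k in range(6)]
        self.pool = ThreadPoolExecutor(max_workers=3)

    def forward(self, x, paths):
        def run(ops, out):
            # `features` is local to this call, so threads never share it.
            features = []
            for op in ops:
                features.append(self.nets[op](out))
            return features

        futures = [self.pool.submit(run, ops, x) for ops in paths]
        # Collect in submission order so the output order is deterministic.
        return [future.result() for future in futures]


block = ToyBlock()
outputs = block.forward(1, paths=[[0, 1], [2], [3, 4, 5]])
block.pool.shutdown(wait=True)
print(outputs)
```

Reading results back through the futures also doubles as the wait, so no separate synchronization step is needed before the features are used.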

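You need to define the variables each thread uses so that the threads do not interfere with one another. This is what lets the parallel computation actually run. Trust me, buddy, this is very important.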
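You can define the run function directly inside the calling function. It gets redefined on every call, which costs a little time, but in practice this is quite convenient.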
Yet the same code does not run on Linux, which left me speechless.
After some reflection and observation, I suspected AMP was the problem, and that did turn out to be the answer.
In practice, however, this did not yield a significant speedup: only a few steps were accelerated, perhaps around 10 batches.
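On the AMP remark: the source does not say what exactly went wrong, so the following is only one possible explanation. PyTorch's autocast state is thread-local, which means an autocast context entered in the main thread does not carry over to pool workers. The sketch below uses CPU autocast with bfloat16 so that it runs without a GPU. The helper names `matmul_dtype` and `matmul_dtype_autocast` are hypothetical.

```python
import torch
from concurrent.futures import ThreadPoolExecutor


def matmul_dtype(a, b):
    """Runs in a worker thread without entering autocast there."""
    return torch.mm(a, b).dtype


def matmul_dtype_autocast(a, b):
    """Re-enters autocast inside the worker thread."""
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        return torch.mm(a, b).dtype


a = torch.randn(4, 4)
b = torch.randn(4, 4)

with ThreadPoolExecutor(max_workers=1) as pool:
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        main_dtype = torch.mm(a, b).dtype  # autocast active in this thread
        inherited = pool.submit(matmul_dtype, a, b).result()
        reentered = pool.submit(matmul_dtype_autocast, a, b).result()

print(main_dtype, inherited, reentered)
```

If the same pattern applies to the code in this post, wrapping the body of the submitted function in its own autocast context would be the thing to try.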