Verifying FP32 and FP64 performance on CPU and GPU – ソフトウェア技術者(ときどき科学)のつぶやき

This post continues "PyTorch の fp32, fp64 性能の謎" (the mystery of PyTorch's fp32 and fp64 performance) – ソフトウェア技術者(ときどき科学)のつぶやき (minosys.com).

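I ran an experiment to test the hypothesis that CPU performance overtook GPU performance in the cavity problem because the number of lattice points was too small.

The test uses the operation below. The arrays a and b are initially filled with random values in [0.0, 1.0).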

$a_{2 i} = \frac{b_{2 i} + a_{2 i + 1}}{2}$

$a_{2 i + 1} = \frac{b_{2 i + 1} + a_{2 i}}{2}$

$b_{2 i} = \frac{a_{2 i} + b_{2 i + 1}}{2}$

$b_{2 i + 1} = \frac{a_{2 i + 1} + b_{2 i}}{2}$

Note that all right-hand sides are computed first, and only then assigned to the left-hand sides. (This is why the program has the x_rhs variables.)
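The update rule above can be sketched in one dimension as follows. This is only an illustration, assuming NumPy arrays of even length; `step` is a hypothetical helper name, and the program listed at the end of this post applies the same pattern to two-dimensional arrays.

```python
import numpy as np

def step(a, b):
    """One update of a and b; every right-hand side uses the old values."""
    a_rhs = np.empty_like(a)
    b_rhs = np.empty_like(b)
    a_rhs[0::2] = (b[0::2] + a[1::2]) / 2.0   # a_{2i}   = (b_{2i}   + a_{2i+1}) / 2
    a_rhs[1::2] = (b[1::2] + a[0::2]) / 2.0   # a_{2i+1} = (b_{2i+1} + a_{2i})   / 2
    b_rhs[0::2] = (a[0::2] + b[1::2]) / 2.0   # b_{2i}   = (a_{2i}   + b_{2i+1}) / 2
    b_rhs[1::2] = (a[1::2] + b[0::2]) / 2.0   # b_{2i+1} = (a_{2i+1} + b_{2i})   / 2
    # Assign only after all four right-hand sides have been evaluated.
    a[:] = a_rhs
    b[:] = b_rhs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random(8)   # random values in [0.0, 1.0)
    b = rng.random(8)
    step(a, b)
    print(a, b)
```

If a were overwritten before the b equations were evaluated, the b update would see the new a instead of the old one; the separate right-hand-side buffers avoid that.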

The results were as follows.

minoru@mino11:~$ python3 torch_test.py
lattice: 30
cpu: 0.000318
gpu(32): 0.001461
gpu(64): 0.000962

lattice: 500
cpu: 0.374067
gpu(32): 0.122395
gpu(64): 0.184232

lattice: 1200
cpu: 7.152719
gpu(32): 0.521847
gpu(64): 1.7935889999999999

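With a lattice size of 30 the CPU was faster, while at lattice sizes of 500 and 1200 the GPU was faster (comparing cpu with gpu(32), by about 3 times at 500 and about 14 times at 1200).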
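The ratios behind this comparison can be recomputed directly from the timings printed above. The snippet below is just that arithmetic; the dictionary `timings` and the function `ratios` are hypothetical names introduced for the illustration.

```python
# Timings in seconds, copied from the output above and rounded to microseconds:
# lattice -> (cpu, gpu(32), gpu(64)).
timings = {
    30: (0.000318, 0.001461, 0.000962),
    500: (0.374067, 0.122395, 0.184232),
    1200: (7.152719, 0.521847, 1.793589),
}

def ratios(cpu, gpu32, gpu64):
    """Return (cpu / gpu(32), gpu(64) / gpu(32)); values above 1 mean gpu(32) is faster."""
    return cpu / gpu32, gpu64 / gpu32

if __name__ == "__main__":
    for lattice, row in timings.items():
        cpu_vs_gpu32, fp64_vs_fp32 = ratios(*row)
        print(f"lattice {lattice}: cpu/gpu(32) = {cpu_vs_gpu32:.2f}, "
              f"gpu(64)/gpu(32) = {fp64_vs_fp32:.2f}")
```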

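In this example FP64 is up to about 3.4 times slower than FP32 (1.79 s against 0.52 s at lattice 1200). At lattice 500 the gap is about 1.5 times, and at lattice 30 FP64 is actually the faster of the two. I think the gap is governed by memory movement rather than by the difference in arithmetic performance.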
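To get a feel for how much data is involved, the sketch below counts the bytes occupied by the four lattice-sized arrays (a, b and their right-hand-side buffers) in each precision. It only does arithmetic on array sizes and assumes nothing about the actual hardware; `working_set_bytes` is a hypothetical helper name.

```python
import numpy as np

def working_set_bytes(lattice, dtype):
    """Bytes held by a, b, a_rhs and b_rhs for a lattice x lattice grid."""
    n_arrays = 4
    return n_arrays * lattice * lattice * np.dtype(dtype).itemsize

if __name__ == "__main__":
    for lattice in (30, 500, 1200):
        fp32 = working_set_bytes(lattice, np.float32)
        fp64 = working_set_bytes(lattice, np.float64)
        print(f"lattice {lattice}: FP32 {fp32 / 1e6:.3f} MB, FP64 {fp64 / 1e6:.3f} MB")
```

At lattice 30 the whole working set is only 14.4 kB in FP32, whereas at lattice 1200 it is about 23 MB in FP32 and 46 MB in FP64.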

Finally, here is the program used for the test.

# -*- coding: utf-8 -*-
import numpy as np
import torch
import datetime

def diff_time(st, ed):
    # Elapsed wall-clock time between two datetime objects, in seconds.
    return (ed - st).total_seconds()

def test_body(title, a, a_rhs, b, b_rhs):
    # Run as many sweeps as there are lattice points per side; taking the count
    # from the array shape avoids relying on the global loop variable `lattice`.
    n_sweeps = a.shape[0]
    st = datetime.datetime.now()
    for _ in range(n_sweeps):
        # Evaluate every right-hand side from the old a and b first ...
        a_rhs[0::2, 0::2] = (a[1::2, 1::2] + b[0::2, 0::2]) / 2.0
        a_rhs[1::2, 1::2] = (a[0::2, 0::2] + b[1::2, 1::2]) / 2.0
        b_rhs[0::2, 0::2] = (a[0::2, 0::2] + b[1::2, 1::2]) / 2.0
        b_rhs[1::2, 1::2] = (a[1::2, 1::2] + b[0::2, 0::2]) / 2.0
        # ... and only then overwrite a and b.
        a[:, :] = a_rhs[:, :]
        b[:, :] = b_rhs[:, :]
    if torch.is_tensor(a) and a.is_cuda:
        # CUDA kernels are launched asynchronously, so wait for them to finish
        # before reading the clock.
        torch.cuda.synchronize()
    ed = datetime.datetime.now()
    print(title, diff_time(st, ed))

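def test_round(lattice):
    print("lattice:", lattice)
    # Initial values are random numbers in [0.0, 1.0).
    # x, y: NumPy float64 arrays for the 'cpu' run; *_64 / *_32: torch tensors.
    # As listed, no device is given, so these tensors are created on the CPU;
    # a GPU run would also need something like .to('cuda') (not shown here).
    x = np.random.rand(lattice, lattice)
    x_64 = torch.from_numpy(x).to(dtype=torch.float64)
    x_32 = torch.from_numpy(x).to(dtype=torch.float32)
    y = np.random.rand(lattice, lattice)
    y_64 = torch.from_numpy(y).to(dtype=torch.float64)
    y_32 = torch.from_numpy(y).to(dtype=torch.float32)
    x_rhs = np.zeros((lattice, lattice), dtype=np.float64)
    y_rhs = np.zeros((lattice, lattice), dtype=np.float64)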

    test_body('cpu:', x, x_rhs, y, y_rhs)
    x_rhs = torch.zeros((lattice, lattice), dtype=torch.float32)
    y_rhs = torch.zeros((lattice, lattice), dtype=torch.float32)
    test_body('gpu(32):', x_32, x_rhs, y_32, y_rhs)
    x_rhs = torch.zeros((lattice, lattice), dtype=torch.float64)
    y_rhs = torch.zeros((lattice, lattice), dtype=torch.float64)
    test_body('gpu(64):', x_64, x_rhs, y_64, y_rhs)
    print()

if __name__ == '__main__':
    lattices = [30, 500, 1200]
    for lattice in lattices:
        test_round(lattice)

All of the x_rhs variables are defined on the CPU; moving them to the GPU makes the program roughly 2 to 5 times slower.

I plan to look into the reason in more detail tomorrow.
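For anyone who wants to try the all-on-one-device variant, here is a minimal sketch of how the buffers could be allocated and timed. It is not the program used for the measurements above and does not explain the slowdown: the names `pick_device`, `make_buffers` and `timed_sweeps` are hypothetical, the device falls back to the CPU when CUDA is unavailable, and `torch.cuda.synchronize()` is called so that asynchronous GPU work is included in the measured time.

```python
import time
import torch

def pick_device():
    # Assumption: use a CUDA device when one is present, otherwise the CPU.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def make_buffers(lattice, dtype, device):
    # a and b start as random values in [0.0, 1.0); the *_rhs buffers are
    # allocated on the same device and with the same dtype as the state.
    a = torch.rand((lattice, lattice), dtype=dtype, device=device)
    b = torch.rand((lattice, lattice), dtype=dtype, device=device)
    return a, torch.zeros_like(a), b, torch.zeros_like(b)

def timed_sweeps(a, a_rhs, b, b_rhs, n_sweeps):
    """Run n_sweeps updates and return the elapsed time in seconds."""
    if a.is_cuda:
        torch.cuda.synchronize()   # do not count work queued earlier
    start = time.perf_counter()
    for _ in range(n_sweeps):
        a_rhs[0::2, 0::2] = (a[1::2, 1::2] + b[0::2, 0::2]) / 2.0
        a_rhs[1::2, 1::2] = (a[0::2, 0::2] + b[1::2, 1::2]) / 2.0
        b_rhs[0::2, 0::2] = (a[0::2, 0::2] + b[1::2, 1::2]) / 2.0
        b_rhs[1::2, 1::2] = (a[1::2, 1::2] + b[0::2, 0::2]) / 2.0
        a.copy_(a_rhs)
        b.copy_(b_rhs)
    if a.is_cuda:
        torch.cuda.synchronize()   # wait for the queued kernels to finish
    return time.perf_counter() - start

if __name__ == "__main__":
    device = pick_device()
    for dtype in (torch.float32, torch.float64):
        bufs = make_buffers(30, dtype, device)
        print(device, dtype, timed_sweeps(*bufs, n_sweeps=30))
```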
