In the following, we compare the performance of the Tesla P100 to the previous Tesla K80 card using selected applications from the Xcelerit Quant Benchmarks.
Hardware Comparison
The table below shows the key hardware differences between the two cards.
Processor | Cores | CUDA Cores | Frequency | GFLOPs (double)¹ | Memory | Memory Bandwidth |
---|---|---|---|---|---|---|
NVIDIA Tesla K80 GPU (Kepler) | 2 x 13 (SMX) | 2 x 2,496 | 562 MHz | 2 x 1,455 | 2 x 12 GB | 2 x 240 GB/s |
NVIDIA Tesla P100 GPU (Pascal) | 56 (SM) | 3,584 | 1,126 MHz | 4,670 | 16 GB | 720 GB/s |
¹ Note that the FLOPs are calculated by assuming purely fused multiply-add (FMA) instructions and counting each as 2 operations (even though each maps to just a single processor instruction).
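The FMA counting rule in the footnote can be sketched as a short calculation. The double-precision unit counts (832 per K80 GPU, 1,792 for the P100) and the boost clocks (875 MHz and 1,303 MHz) used here are assumptions that do not appear in the table: the CUDA core column counts single-precision cores and the frequency column appears to list base clocks, whereas the GFLOPs column looks consistent with boost clocks.

```python
def peak_gflops(fp64_units: int, clock_mhz: float) -> float:
    """Peak double-precision GFLOPs, counting each FMA as 2 operations."""
    flops_per_unit_per_cycle = 2  # one FMA = one multiply + one add
    return fp64_units * flops_per_unit_per_cycle * clock_mhz / 1000.0


# Assumed FP64 unit counts and boost clocks (not taken from the table above).
k80_per_gpu = peak_gflops(fp64_units=832, clock_mhz=875)
p100 = peak_gflops(fp64_units=1792, clock_mhz=1303)

print(f"K80 (per GPU): {k80_per_gpu:.0f} GFLOPs")
print(f"P100:          {p100:.0f} GFLOPs")
```

Under these assumptions the results land within rounding of the 1,455 and 4,670 GFLOPs listed in the table.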
Xcelerit Quant Benchmarks
The peak GFLOPs given above are rarely reached in real-world applications. Beyond compute instructions, many other factors influence performance, such as memory and cache latencies, thread synchronisation, instruction-level parallelism, GPU occupancy, and branch divergence. To give an indication of real-world performance, we use selected applications from the Xcelerit Quant Benchmarks, a representative set of applications widely used in Quantitative Finance. These applications have been hand-tuned for maximum performance as native implementations by code optimisation experts, often in collaboration with the relevant processor maker.
Financial Instrument | Numerical Method | Description | Parameters |
---|---|---|---|
LIBOR Swaption Portfolio | Monte-Carlo | Prices a portfolio of LIBOR swaptions on a LIBOR Market Model and computes sensitivities. | |
American Options | Binomial Lattice | Prices a batch of American call options under the Black-Scholes model using a binomial lattice (Cox, Ross and Rubinstein method). | |
European Options | Closed form | Prices a batch of European call and put options using the Black-Scholes-Merton formula. We repeat the formula 100 times to increase the overall runtime for performance measurements. | |
Barrier Options | Monte-Carlo | Prices a portfolio of up-and-in barrier options under the Black-Scholes model using a Monte-Carlo simulation. | |
Selected Applications from the Xcelerit Quant Benchmarks
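To make the closed-form entry in the table concrete, here is a minimal sketch of the Black-Scholes-Merton formula for a single European call and put. It is plain Python for illustration, not the hand-tuned benchmark code; the function and parameter names are our own, and dividends are assumed to be zero.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, rate, vol, maturity):
    """Return (call, put) prices for a European option without dividends."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * maturity)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
print(f"call = {call:.4f}, put = {put:.4f}")
```

Each option is priced independently with a handful of transcendental function calls and very little data, which is typical of a compute-heavy workload.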
Benchmark Setup
We compare the performance of each application on the K80 and P100 cards. The system configuration is as follows:
- CPU: 2 sockets, Haswell (Intel Xeon E5-2698 v3)
- GPU: NVIDIA Tesla K80 and NVIDIA Tesla P100 (ECC on)
- OS: Red Hat Enterprise Linux 7.2 (64-bit)
- RAM: 128 GB (K80 system) and 256 GB (P100 system)
- CUDA Version: 8.0
- CPU Backend Compiler: GCC 4.8
- GPU clock: maximum boost
- Precision: double
Performance
To measure the performance, the application is executed repeatedly, recording the wall-clock time for each run, until the estimated timing error is below a specified value. The measurement includes the full algorithm execution time from inputs to outputs, including setup of the GPU and data transfers. The speedup versus a sequential implementation on a single CPU core is reported, averaged over varying numbers of paths or options.
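The measurement procedure can be sketched as a simple loop. The exact error estimate and threshold used for the benchmarks are not stated, so this sketch assumes the standard error of the mean relative to the mean run time; `measure`, `rel_error`, `min_runs` and `max_runs` are hypothetical names chosen for illustration.

```python
import math
import statistics
import time


def measure(run, rel_error=0.01, min_runs=5, max_runs=1000):
    """Time run() repeatedly until the relative standard error is small enough.

    Returns (mean wall-clock time in seconds, number of runs).
    """
    times = []
    while len(times) < max_runs:
        start = time.perf_counter()
        run()  # full execution: setup, transfers and computation
        times.append(time.perf_counter() - start)
        if len(times) >= min_runs:
            mean = statistics.fmean(times)
            std_error = statistics.stdev(times) / math.sqrt(len(times))
            if std_error <= rel_error * mean:
                break
    return statistics.fmean(times), len(times)


# Example with a stand-in workload; a speedup would be t_reference / t_candidate.
mean_time, runs = measure(lambda: sum(range(100_000)), max_runs=50)
print(f"{mean_time * 1e3:.3f} ms averaged over {runs} runs")
```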
We observe that the P100 gives a boost of between 1.3x and 2.3x over the K80 (1.7x on average). This high variation in speedup across applications can be explained by the different application characteristics, in particular the ratio of compute instructions to memory access operations. In peak performance, the P100 has 1.6x the double-precision FLOPs of the K80 card (both GPUs combined) and 3x the memory bandwidth of each of its two GPUs.
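The peak ratios follow directly from the hardware table. The short calculation below uses only the table values; the variable names are our own. The bandwidth ratio depends on whether the P100 is compared with one K80 GPU or with both combined.

```python
# Values taken from the hardware comparison table.
k80_gflops_per_gpu = 1455
k80_bandwidth_per_gpu = 240  # GB/s
k80_gpus = 2
p100_gflops = 4670
p100_bandwidth = 720  # GB/s

flops_ratio = p100_gflops / (k80_gpus * k80_gflops_per_gpu)
bandwidth_ratio_single_gpu = p100_bandwidth / k80_bandwidth_per_gpu
bandwidth_ratio_full_card = p100_bandwidth / (k80_gpus * k80_bandwidth_per_gpu)

print(f"FLOPs ratio (vs. full K80 card):     {flops_ratio:.1f}x")
print(f"Bandwidth ratio (vs. one K80 GPU):   {bandwidth_ratio_single_gpu:.1f}x")
print(f"Bandwidth ratio (vs. full K80 card): {bandwidth_ratio_full_card:.1f}x")
```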
Both the LIBOR swaption portfolio and the Black-Scholes option pricers are heavy in compute instructions and need fewer memory accesses. These applications therefore benefit mainly from the increased GFLOPs and less from the memory bandwidth improvement. This explains the speedup of around 1.3x compared to the K80.
The Binomial American option pricer is memory intensive, using global and shared memory as well as cache. It also uses thread synchronisation operations heavily. The performance of these operations has increased significantly on the P100, which explains the highest gain of 2.3x.
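A sequential sketch of the Cox-Ross-Rubinstein lattice illustrates where the memory traffic and the synchronisation points come from. This is plain Python written for illustration, not the benchmark implementation; the function and parameter names are our own, and dividends are assumed to be zero.

```python
import math


def american_call_crr(spot, strike, rate, vol, maturity, steps):
    """Price an American call on a Cox-Ross-Rubinstein binomial lattice."""
    dt = maturity / steps
    up = math.exp(vol * math.sqrt(dt))
    down = 1.0 / up
    prob_up = (math.exp(rate * dt) - down) / (up - down)
    discount = math.exp(-rate * dt)

    # Payoffs at maturity; node j has seen j up-moves and (steps - j) down-moves.
    values = [max(spot * up**j * down**(steps - j) - strike, 0.0)
              for j in range(steps + 1)]

    # Backward induction: every level reads the whole previous level.
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            continuation = discount * (prob_up * values[j + 1]
                                       + (1.0 - prob_up) * values[j])
            exercise = spot * up**j * down**(i - j) - strike
            values[j] = max(continuation, exercise)
    return values[0]


price = american_call_crr(spot=100.0, strike=100.0, rate=0.05, vol=0.2,
                          maturity=1.0, steps=500)
print(f"American call = {price:.4f}")
```

Each time step reads and rewrites the full array of node values and cannot start before the previous step has finished. In a parallel version, that dependency is where threads would need to synchronise.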
The Monte-Carlo Barrier options application benefits from both the compute and memory performance increases to some extent. This results in a speedup of around 1.8x.
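For reference, a minimal Monte-Carlo sketch for a single up-and-in barrier option under the Black-Scholes model is shown below. The benchmark description does not state the payoff type or monitoring scheme, so a call payoff with discrete monitoring at each time step is assumed; all names and parameter values are illustrative.

```python
import math
import random


def up_and_in_call_mc(spot, strike, barrier, rate, vol, maturity,
                      steps, paths, seed=42):
    """Monte-Carlo price of an up-and-in call, monitored at each time step."""
    rng = random.Random(seed)
    dt = maturity / steps
    drift = (rate - 0.5 * vol * vol) * dt
    diffusion = vol * math.sqrt(dt)
    log_barrier = math.log(barrier)

    payoff_sum = 0.0
    for _ in range(paths):
        log_s = math.log(spot)
        knocked_in = log_s >= log_barrier
        for _ in range(steps):
            # Exact log-normal step of geometric Brownian motion.
            log_s += drift + diffusion * rng.gauss(0.0, 1.0)
            if log_s >= log_barrier:
                knocked_in = True
        if knocked_in:
            payoff_sum += max(math.exp(log_s) - strike, 0.0)
    return math.exp(-rate * maturity) * payoff_sum / paths


price = up_and_in_call_mc(spot=100.0, strike=100.0, barrier=120.0, rate=0.05,
                          vol=0.2, maturity=1.0, steps=20, paths=2000)
print(f"up-and-in call = {price:.4f}")
```

Every path needs a stream of random numbers and a sequence of exponent and comparison operations, and the paths are independent of one another until the final average.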