Computing Benchmarks: Processors

Last updated: 30 January 2025

Select Processor

Select Application

Quantitative Finance

Libor Swaption Portfolio (Monte-Carlo)

Prices a portfolio of LIBOR swaptions on a LIBOR Market Model using a Monte-Carlo simulation and computes Greeks.

American Options (Binomial)

Prices a portfolio of American call options using a Binomial lattice (Cox, Ross and Rubenstein method).

European Options (Black-Scholes-Merton)

Prices a portfolio of European options using the Black-Scholes-Merton formula.

Barrier Options (Monte-Carlo)

Prices a portfolio of up-and-in barrier options using a Monte-Carlo simulation.

Implementation:

Native C++Xcelerit SDK 3.2 / C++

Deep Learning

RNN Training

Trains a Recurrent Neural Network (RNN) using Tensorflow

RNN Inference

Inference with a Recurrent Neural Network (RNN) using Tensorflow

LSTM Training

Trains a Long Short Term Memory model (LSTM) using Tensorflow

LSTM Inference

Inference with a Long Short Term Memory model (LSTM) using Tensorflow

(All deep learning applications have been implemented using Nvidia’s TensorFlow NGC Docker container.)

Benchmark Description

This application prices a portfolio of LIBOR swaptions on a LIBOR Market Model using a Monte-Carlo simulation. It also computes Greeks.

In each Monte-Carlo path, the LIBOR forward rates are generated randomly at all required maturities following the LIBOR Market Model, starting from the initial LIBOR rates. The swaption portfolio payoff is then computed and discounted to the pricing date. Averaging the per-path prices gives the final net present value of the portfolio.

The full algorithm is illustrated in the processing graph below:

LiborGreeksGraph

More details can be found in Prof. Mike Giles’ notes [1].

This benchmark uses a portfolio of 15 swaptions with maturities between 4 and 40 years and 80 forward rates (and hence 80 delta Greeks). To study the performance, the number of Monte-Carlo paths is varied between 128K-2,048K.

[1] M. Giles, “Monte Carlo evaluation of sensitivities in computational finance,” HERCMA Conference, Athens, Sep. 2007.

Application Class: Pricer
Model: Libor Market Model
Instrument Type: Swaption Portfolio
Numerical Method: Monte-Carlo
Portfolio Size: 15 swaptions
Maturities: 4 to 40 years
Number of Forward Rates: 80
Number of Sensitivities: 80
Monte-Carlo Paths:128K–1,024K

This benchmark application prices a portfolio of American call options using a Binomial lattice (Cox, Ross and Rubenstein method).

For a given size N of the binomial tree, the option payoff at the N leaf nodes is computed first (the value at maturity for different stock prices, using the Black-Scholes model). Then, the pricer works towards the root node backwards in time, multiplying the 2 child nodes by the pre-computed pseudo-probabilities that the price goes up or down, including discounting at the risk-free rate, and adding the results. After repeating this process for all time steps, the root node holds the present value.

The algorithm is illustrated in the graph below:

Binomial tree illustration

This binomial pricing method is applied for every option in the portfolio.

For this benchmark, we use 1,024 steps (the depth of the tree). We vary the number of options in the portfolio to study the performance.

Application Class: Batch Pricer
Model: Black-Scholes
Instrument Type: American Option
Numerical Method: Binomial Lattice
Portfolio Size: 128K–2,048K
Maturities: 1–5 years
Depth of Lattice: 1,024

This benchmark application prices a portfolio of European options using the Black-Scholes-Merton formula.

The pricer calculates both the call and put price for a batch of options, defined by their current stock price, strike price, and maturities. It applies the Black-Scholes-Merton forumla for each option in the portfolio.

For this benchmark, we repeat the application of the formula 100 times to increase the overall runtime for the performance measurements. The number of options in the portfolio is varied to study the performance.

Application Class: Batch Pricer
Model: Black-Scholes
Instrument Type: European Option
Numerical Method: Closed-form Formula
Portfolio Size: 32M-256M
Maturities: 1–5 years

This benchmark application prices a portfolio of up-and-in barrier options with European exercise, using a Monte-Carlo simulation.

The Black-Scholes model is used to generate the paths. That is, in every Monte-Carlo path, the stock value at a fixed grid of time steps is computed using a geometric Brownian motion with constant volatility and the risk-free rate. If the stock value crosses the barrier anywhere along the path, the option becomes valid and its payoff is the positive value of the difference between the stock at maturity and the option’s strike. Otherwise, the option’s payoff is zero.

Then the payoff of each option is discounted to the valuation date using the risk-free rate and averaged across the paths. This yields the net present values of all options.

The full algorithm is illustrated in the processing graph below:

MC Barrier Option Pricer Graph

This benchmark uses 50,000 Monte-Carlo paths and 50 time-steps. The number of options in the portfolio is varied to study the performance.

Application Class: Batch Pricer
Model: Black-Scholes
Instrument Type: Barrier Option
Numerical Method: Monte-Carlo
Paths: 50,000
Time Steps: 50
Number of Options: 2,000–32,000
Maturities: 1-5 years

This application benchmarks the training of a deep Recurrent Neural Network (RNN), as illustrated below.

Deep Recurrent Neural Network (RNN)

RNNs are at the core of many deep learning applications in finance, as they show excellent predition performance for time-series data.

For benchmark purposes, we focus on a single layer of such network, as this is the fundamental building block of more complex deep RNN models. We use Tensorflow, optimised by Nvidia in their NGC Docker container.

Application Class: Deep Learning
Model: Recurrent Neural Network (RNN)
Mode: Training
Hidden Layers: 1
Hidden Units: 1,024
Sequence Length: 32
Batch Size: 128
Optimizer: Stochastic gradient descent
Training Data: Random

This application benchmarks the inference performance of a deep Recurrent Neural Network (RNN), as illustrated below.

Deep Recurrent Neural Network (RNN)

RNNs are at the core of many deep learning applications in finance, as they show excellent predition performance for time-series data.

Application Class: Deep Learning
Model: Recurrent Neural Network (RNN)
Mode: Inference
Hidden Layers: 1
Hidden Units: 1,024
Sequence Length: 32

This application benchmarks the training of a deep Long-Short Term Memory Model Network (LSTM). This is a modified version of the vanialla RNN, to overcome problems with vanishing or exploding gradients during back-propagation. This allows LSTMs to learn complex long-term dependencies better than RNNs.

RNNs are at the core of many deep learning applications in finance, as they show excellent predition performance for time-series data. In fact, LSTMs are often the perferred form of RNN networks in practial applications.

For benchmark purposes, we focus on a single layer of such network, as this is the fundamental building block of more complex deep LSTM models. We use Tensorflow, optimised by Nvidia in their NGC Docker container.

Application Class: Deep Learning
Model: Long Short Term Memory (LSTM)
Mode: Training
Hidden Layers: 1
Hidden Units: 1,024
Sequence Length: 32
Batch Size: 128
Optimizer: Stochastic gradient descent
Training Data: Random

This application benchmarks the inference performance of a deep Long-Short Term Memory Model Network (LSTM). This is a modified version of the vanialla RNN, to overcome problems with vanishing or exploding gradients during back-propagation. This allows LSTMs to learn complex long-term dependencies better than RNNs.

RNNs are at the core of many deep learning applications in finance, as they show
excellent predition performance for time-series data. In fact, LSTMs are often the
perferred form of RNN networks in practial applications.

For benchmark purposes, we focus on a single layer of such network,
as this is the fundamental building block of more complex deep LSTM models.
We use Tensorflow, optimised by Nvidia in their NGC Docker container.

Application Class: Deep Learning
Model: Long Short Term Memory (LSTM)
Mode: Inference
Hidden Layers: 1
Hidden Units: 1,024
Sequence Length: 32
Precision: FP16

System	Operating System	Memory (RAM)	Compiler	ECC	Precision Mode	Other
Intel Haswell	RedHat EL 6.6 (64bit)	128GB	Intel Compiler 17.0	on	double	2x Hyperthreading
Nvidia Kepler	RedHat EL 7.2 (64bit)	128GB (host)	GCC 4.8	on	double	max. clock boost, CUDA 7.5
Nvidia Pascal	RedHat EL 7.2 (64bit)	256GB (host)	GCC 4.8	on	double	max. clock boost, CUDA 8.0
Intel Xeon Phi	RedHat EL 7.1 (64bit)	64GB (host)	Intel Compiler 15.0	on	double	MPSS 3.4.5, 4x Hyperthreading
IBM Power8	Ubuntu 14.10 (64bit)	256GB	IBM Compiler 13.1	on	double	8x SMT
Intel Broadwell	RedHat EL 7.2 (64bit)	128GB	Intel Compiler 17.0	on	double	2x Hyperthreading

System	Operating System	Memory (RAM)	Compiler	ECC	Precision Mode	Other
Intel Haswell	RedHat EL 6.6 (64bit)	128GB	Intel Compiler 17.0	on	double	no Hyperthreading
Nvidia Kepler	RedHat EL 7.2 (64bit)	128GB (host)	GCC 4.8	on	double	max. clock boost, CUDA 8.0
Nvidia Pascal	RedHat EL 7.2 (64bit)	256GB (host)	GCC 4.8	on	double	max. clock boost, CUDA 8.0
Intel Xeon Phi	RedHat EL 7.1 (64bit)	64GB (host)	Intel Compiler 15.0	on	double	MPSS 3.4.5, 4x Hyperthreading
Intel Broadwell	RedHat EL 7.2 (64bit)	128GB	Intel Compiler 17.0	on	double	no Hyperthreading

System	Operating System	Memory (RAM)	Compiler	ECC	Precision Mode	Other
Intel Haswell	RedHat EL 6.6 (64bit)	128GB	Intel Compiler 17.0	on	double	2xHyperthreading
Nvidia Kepler	RedHat EL 7.2 (64bit)	128GB (host)	GCC 4.8	on	double	max. clock boost, CUDA 8.0
Nvidia Pascal	RedHat EL 7.2 (64bit)	256GB (host)	GCC 4.8	on	double	max. clock boost, CUDA 8.0
Intel Xeon Phi	RedHat EL 7.1 (64bit)	64GB (host)	Intel Compiler 15.0	on	double	MPSS 3.4.5, 4x Hyperthreading
Intel Broadwell	RedHat EL 7.2 (64bit)	128GB	Intel Compiler 17.0	on	double	2xHyperthreading

System	Operating System	Memory (RAM)	Compiler	ECC	Precision Mode	Other
Intel Haswell	RedHat EL 6.6 (64bit)	128GB	Intel Compiler 17.0	on	double	2xHyperthreading
Nvidia Kepler	RedHat EL 7.2 (64bit)	128GB (host)	GCC 4.8	on	double	max. clock boost, CUDA 8.0
Nvidia Pascal	RedHat EL 7.2 (64bit)	256GB (host)	GCC 4.8	on	double	max. clock boost, CUDA 8.0
Intel Xeon Phi	RedHat EL 7.1 (64bit)	64GB (host)	Intel Compiler 15.0	on	double	MPSS 3.4.5, 4x Hyperthreading
Intel Broadwell	RedHat EL 7.2 (64bit)	128GB	Intel Compiler 17.0	on	double	2xHyperthreading

System	Operating System	Memory (RAM)	Compiler	ECC	Precision Mode	Other
Nvidia Pascal	RedHat EL 7.4 (64bit)	64GB (host)	GCC 4.8	on	FP16	max. clock boost, NGC TensorFlow 17.11
Nvidia Volta	RedHat EL 7.4 (64bit)	128GB (host)	GCC 4.8	on	FP16	max. clock boost, NGC TensorFlow 17.11

System	Operating System	Memory (RAM)	Compiler	ECC	Precision Mode	Other
Nvidia Pascal	RedHat EL 7.4 (64bit)	64GB (host)	GCC 4.8	on	FP16	max. clock boost, NGC TensorFlow 17.11
Nvidia Volta	RedHat EL 7.4 (64bit)	128GB (host)	GCC 4.8	on	FP16	max. clock boost, NGC TensorFlow 17.11

System	Operating System	Memory (RAM)	Compiler	ECC	Precision Mode	Other
Nvidia Pascal	RedHat EL 7.4 (64bit)	64GB (host)	GCC 4.8	on	FP16	max. clock boost, NGC TensorFlow 17.11
Nvidia Volta	RedHat EL 7.4 (64bit)	128GB (host)	GCC 4.8	on	FP16	max. clock boost, NGC TensorFlow 17.11

System	Operating System	Memory (RAM)	Compiler	ECC	Precision Mode	Other
Nvidia Pascal	RedHat EL 7.4 (64bit)	64GB (host)	GCC 4.8	on	FP16	max. clock boost, NGC TensorFlow 17.11
Nvidia Volta	RedHat EL 7.4 (64bit)	128GB (host)	GCC 4.8	on	FP16	max. clock boost, NGC TensorFlow 17.11

The application is executed repeatedly, recording the wall-clock time for each run, until the estimated timing error is below a specified value. The full algorithm execution time from inputs to outputs is measured. This includes setup of accelerators and data transfers if applicable. The speedup vs. a sequential implementation on a single core is reported.

Hardware Specification

Processor	Cores	Logical Cores	Frequency	GFLOPs (double)	Max. Memory	Max. Memory B/W
Dual Intel Xeon E5-2698 v3 CPU (Haswell)	2 x 16	2 x 32	2.30 GHz	2 x 663	768 GB	2 x 68 GB/s
Dual Intel Xeon E5-2697 v4 CPU (Broadwell)	2 x 18	2 x 36	2.30 GHz	2 x 663	1.54 TB	2 x 76.8 GB/s
NVIDIA Tesla K40 GPU (Kepler)	15 (SMX)	2,880 (CUDA cores)	745 MHz	1,430	12 GB	288 GB/s
NVIDIA Tesla K80 GPU (Kepler)	2 x 13 (SMX)	2 x 2,496 (CUDA cores)	562 MHz	2 x 1,455	2 x 12 GB	2 x 240 GB/s
Intel Xeon Phi 7120P (Knight's Corner)	61	244	1.238 GHz	1,208	16 GB	352 GB/s
NVIDIA Tesla P100 GPU (Pascal)	56 (SM)	3,584 (CUDA cores)	1,126 MHz	4,670	16 GB	720 GB/s
Dual IBM Power8 ISeries 8286-42A CPU	2 x 12	2 x 96 (SMT8)	3.52 GHz	2 x 338	1 TB	2 x 384 GB/s

Processor	Cores	Logical Cores	Frequency	GFLOPs (double)	Max. Memory	Max. Memory B/W
Dual Intel Xeon E5-2698 v3 CPU (Haswell)	2 x 16	2 x 32	2.30 GHz	2 x 663	768 GB	2 x 68 GB/s
Dual Intel Xeon E5-2697 v4 CPU (Broadwell)	2 x 18	2 x 36	2.30 GHz	2 x 663	1.54 TB	2 x 76.8 GB/s
NVIDIA Tesla K40 GPU (Kepler)	15 (SMX)	2,880 (CUDA cores)	745 MHz	1,430	12 GB	288 GB/s
NVIDIA Tesla K80 GPU (Kepler)	2 x 13 (SMX)	2 x 2,496 (CUDA cores)	562 MHz	2 x 1,455	2 x 12 GB	2 x 240 GB/s
Intel Xeon Phi 7120P (Knight's Corner)	61	244	1.238 GHz	1,208	16 GB	352 GB/s
NVIDIA Tesla P100 GPU (Pascal)	56 (SM)	3,584 (CUDA cores)	1,126 MHz	4,670	16 GB	720 GB/s

Processor	Cores	Logical Cores	Frequency	GFLOPs (double)	Max. Memory	Max. Memory B/W
Dual Intel Xeon E5-2698 v3 CPU (Haswell)	2 x 16	2 x 32	2.30 GHz	2 x 663	768 GB	2 x 68 GB/s
Dual Intel Xeon E5-2697 v4 CPU (Broadwell)	2 x 18	2 x 36	2.30 GHz	2 x 663	1.54 TB	2 x 76.8 GB/s
NVIDIA Tesla K40 GPU (Kepler)	15 (SMX)	2,880 (CUDA cores)	745 MHz	1,430	12 GB	288 GB/s
NVIDIA Tesla K80 GPU (Kepler)	2 x 13 (SMX)	2 x 2,496 (CUDA cores)	562 MHz	2 x 1,455	2 x 12 GB	2 x 240 GB/s
Intel Xeon Phi 7120P (Knight's Corner)	61	244	1.238 GHz	1,208	16 GB	352 GB/s
NVIDIA Tesla P100 GPU (Pascal)	56 (SM)	3,584 (CUDA cores)	1,126 MHz	4,670	16 GB	720 GB/s

Processor	Cores	Logical Cores	Frequency	GFLOPs (double)	Max. Memory	Max. Memory B/W
Dual Intel Xeon E5-2698 v3 CPU (Haswell)	2 x 16	2 x 32	2.30 GHz	2 x 663	768 GB	2 x 68 GB/s
Dual Intel Xeon E5-2697 v4 CPU (Broadwell)	2 x 18	2 x 36	2.30 GHz	2 x 663	1.54 TB	2 x 76.8 GB/s
NVIDIA Tesla K40 GPU (Kepler)	15 (SMX)	2,880 (CUDA cores)	745 MHz	1,430	12 GB	288 GB/s
NVIDIA Tesla K80 GPU (Kepler)	2 x 13 (SMX)	2 x 2,496 (CUDA cores)	562 MHz	2 x 1,455	2 x 12 GB	2 x 240 GB/s
Intel Xeon Phi 7120P (Knight's Corner)	61	244	1.238 GHz	1,208	16 GB	352 GB/s
NVIDIA Tesla P100 GPU (Pascal)	56 (SM)	3,584 (CUDA cores)	1,126 MHz	4,670	16 GB	720 GB/s

Processor	Cores	Logical Cores	Frequency	GFLOPs (FP64)	GFLOPs (FP32)	GFLOPs (FP16/Tensor)	Max. Memory	Max. Memory B/W
NVIDIA Tesla P100 PCIe (Pascal)	56 (SM)	3,584 (CUDA cores)	1,126 MHz	4,670	9,340	18,680	16 GB	720 GB/s
NVIDIA Tesla V100 PCIe (Volta)	80 (SM)	5,120 (CUDA cores)	1,530 MHz	7,014	14,028	112,000	16 GB	900 GB/s

Processor	Cores	Logical Cores	Frequency	GFLOPs (FP64)	GFLOPs (FP32)	GFLOPs (FP16/Tensor)	Max. Memory	Max. Memory B/W
NVIDIA Tesla P100 PCIe (Pascal)	56 (SM)	3,584 (CUDA cores)	1,126 MHz	4,670	9,340	18,680	16 GB	720 GB/s
NVIDIA Tesla V100 PCIe (Volta)	80 (SM)	5,120 (CUDA cores)	1,530 MHz	7,014	14,028	112,000	16 GB	900 GB/s

Processor	Cores	Logical Cores	Frequency	GFLOPs (FP64)	GFLOPs (FP32)	GFLOPs (FP16/Tensor)	Max. Memory	Max. Memory B/W
NVIDIA Tesla P100 PCIe (Pascal)	56 (SM)	3,584 (CUDA cores)	1,126 MHz	4,670	9,340	18,680	16 GB	720 GB/s
NVIDIA Tesla V100 PCIe (Volta)	80 (SM)	5,120 (CUDA cores)	1,530 MHz	7,014	14,028	112,000	16 GB	900 GB/s

Processor	Cores	Logical Cores	Frequency	GFLOPs (FP64)	GFLOPs (FP32)	GFLOPs (FP16/Tensor)	Max. Memory	Max. Memory B/W
NVIDIA Tesla P100 PCIe (Pascal)	56 (SM)	3,584 (CUDA cores)	1,126 MHz	4,670	9,340	18,680	16 GB	720 GB/s
NVIDIA Tesla V100 PCIe (Volta)	80 (SM)	5,120 (CUDA cores)	1,530 MHz	7,014	14,028	112,000	16 GB	900 GB/s

Speedup vs. Sequential*

(higher is better)

*the sequential version runs on a single core of an Intel Xeon E5-2698 v3 CPU

Speedup vs. Sequential*

(higher is better)

*the sequential version runs on a single core of an Intel Xeon E5-2698 v3 CPU

Speedup vs. Sequential*

(higher is better)

*the sequential version runs on a single core of an Intel Xeon E5-2698 v3 CPU

Speedup vs. Sequential*

(higher is better)

*the sequential version runs on a single core of an Intel Xeon E5-2698 v3 CPU

Speedup vs. P100*

(higher is better)

*the results are normalised to the P100 GPU performance

Speedup vs. P100*

(higher is better)

*the results are normalised to the P100 GPU performance

Speedup vs. P100*

(higher is better)

*the results are normalised to the P100 GPU performance

Speedup vs. P100*

(higher is better)

*the results are normalised to the P100 GPU performance

Computing Benchmarks: Processors

Select Processor

Select Application

Quantitative Finance

Deep Learning

Benchmark Description

Hardware Specification

Speedup vs. Sequential*

Speedup vs. Sequential*

Speedup vs. Sequential*

Speedup vs. Sequential*

Speedup vs. P100*

Speedup vs. P100*

Speedup vs. P100*

Speedup vs. P100*

Request the Source Code