Intel’s Knights Landing processor is the latest generation of the Xeon Phi many-core processor family. It is a host processor and x86 binary compatible, so it can run any standard x86 operating system. It comes with 64-68 cores, high-performance stacked memory, and 512-bit wide vector units, and is designed for massively parallel workloads. The Xeon Broadwell server processors, on the other hand, have fewer cores, but each core is more powerful and has a more sophisticated microarchitecture, making them suitable for general-purpose workloads even with a low level of parallelism.

In the following, we compare the performance of the Xeon Phi Knights Landing to a dual-socket Xeon Broadwell using selected applications from the Xcelerit Quant Benchmarks.

Intel Knights Landing Xeon Phi

Hardware Comparison

The table below shows the key hardware differences between the two processors.

| Processor | Cores | Logical Cores | Frequency | GFLOPs (double) [1] | Cache | MCDRAM | Max. Memory | Memory B/W |
|---|---|---|---|---|---|---|---|---|
| Intel Xeon E5-2697 v4 (Broadwell) | 18 | 36 | 2.3 GHz | 663 GFLOPs | 45 MB | N/A | 1.54 TB | 76.8 GB/s |
| Intel Xeon Phi 7210 (Knights Landing) | 64 | 256 | 1.3 GHz | 2,662 GFLOPs | 32 MB | 16 GB, 400+ GB/s | 385 GB | 102 GB/s |

[1] Note that the FLOPs are calculated by assuming purely fused multiply-add (FMA) instructions and counting those as 2 operations (even though they map to just a single processor instruction). Full SIMD instructions are assumed.
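The peak figures in the table can be reproduced with a short calculation. The sketch below follows the FMA counting rule from the note above; the SIMD widths (256-bit on Broadwell, 512-bit on Knights Landing) and the two FMA units per core are assumptions that the article does not state, and the function name `peak_gflops` is purely illustrative. Under these assumptions the results come out at roughly 662 and 2,662 GFLOPs, in line with the table.

```python
def peak_gflops(cores, ghz, simd_bits, fma_units=2, dtype_bits=64):
    """Theoretical peak in GFLOPs, counting each FMA as 2 operations."""
    lanes = simd_bits // dtype_bits            # doubles per SIMD register
    flops_per_cycle = lanes * fma_units * 2    # 2 operations per FMA
    return cores * ghz * flops_per_cycle


# Assumed: 256-bit SIMD and 2 FMA units per core on Broadwell (per socket)
broadwell = peak_gflops(cores=18, ghz=2.3, simd_bits=256)
# Assumed: 512-bit SIMD and 2 vector units per core on Knights Landing
knl = peak_gflops(cores=64, ghz=1.3, simd_bits=512)

print(f"Broadwell (one socket): {broadwell:.1f} GFLOPs")
print(f"Knights Landing:        {knl:.1f} GFLOPs")
```

Note that the Broadwell figure is per socket; the benchmark system described later uses two sockets.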

A key feature of the Knights Landing processor is its 16 GB of stacked MCDRAM. This memory is directly attached to the chip and has a much higher bandwidth (400+ GB/s) than the main memory. It can be used as a cache, as directly addressable memory, or in a hybrid configuration.

Xcelerit Quant Benchmark

The peak GFLOPs given above are rarely reached in real-world applications. Beyond compute instructions, many other factors influence performance, such as memory and cache latencies, thread synchronisation, instruction-level parallelism, and branch divergence. To give an indication of real-world performance, we use selected applications from the Xcelerit Quant Benchmarks, a representative set of applications widely used in Quantitative Finance. These applications have been hand-tuned for maximum performance using a native implementation by code optimisation experts, often in collaboration with the relevant processor maker.

| Financial Instrument | Numerical Method | Description | Parameters |
|---|---|---|---|
| LIBOR Swaption Portfolio | Monte-Carlo | Prices a portfolio of LIBOR swaptions on a LIBOR Market Model and computes sensitivities. | 15 swaptions; 80 rates & sensitivities; 128K–1,024K paths |
| American Options | Binomial Lattice | Prices a batch of American call options under the Black-Scholes model using a binomial lattice (Cox, Ross and Rubinstein method). | 1,024 steps lattice; 128K–2,048K options |
| European Options | Closed form | Prices a batch of European call and put options using the Black-Scholes-Merton formula. We repeat the formula 100 times to increase the overall runtime for performance measurements. | 32M–256M options |
| Barrier Options | Monte-Carlo | Prices a portfolio of up-and-in barrier options under the Black-Scholes model using a Monte-Carlo simulation. | 50 time-steps; 50,000 paths; 2,000–4,000 options |

Selected Applications from the Xcelerit Quant Benchmarks
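To make two of these methods concrete, here is a minimal scalar sketch of the Black-Scholes-Merton closed form for a European call and a Cox-Ross-Rubinstein binomial lattice for an American call. It is not the benchmark code, which is natively implemented and hand-tuned; the function names and the market parameters in the usage lines are illustrative assumptions only.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(s, k, r, sigma, t):
    """European call price from the Black-Scholes-Merton closed form."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def crr_american_call(s, k, r, sigma, t, steps=1024):
    """American call price on a Cox-Ross-Rubinstein binomial lattice."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))    # up factor
    d = 1.0 / u                            # down factor
    p = (math.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
    disc = math.exp(-r * dt)               # one-step discount factor
    # Payoffs at maturity; node j has j up-moves and (steps - j) down-moves
    values = [max(s * u ** (2 * j - steps) - k, 0.0) for j in range(steps + 1)]
    # Roll back through the lattice, checking early exercise at every node
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            continuation = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            exercise = s * u ** (2 * j - i) - k
            values[j] = max(continuation, exercise)
    return values[0]


# Illustrative parameters (assumed, not taken from the benchmark)
euro = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
amer = crr_american_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"European call (closed form): {euro:.4f}")
print(f"American call (1,024 steps): {amer:.4f}")
```

For a call on a non-dividend-paying asset, early exercise is never optimal, so the two prices should agree up to the lattice discretisation error. The closed form is a handful of arithmetic operations per option, whereas the lattice performs on the order of half a million node updates per option at 1,024 steps.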

Benchmark Setup

We compare the performance of each application on the Knights Landing and Broadwell processors. The system configuration is given below:

  • Broadwell CPU: 2 sockets, Intel Xeon E5-2697 v4
  • Knights Landing CPU: 1 socket, Intel Xeon Phi 7210
  • OS: Red Hat Enterprise Linux 7.2 (64-bit)
  • RAM: 128 GB (Broadwell system) and 96 GB (Knights Landing system)
  • Compiler: Intel 2017
  • Precision: double

Performance

To measure the performance, the application is executed repeatedly, recording the wall-clock time for each run, until the estimated timing error is below a specified value. The measurement includes the full algorithm execution time from inputs to outputs. On Knights Landing, only the MCDRAM has been used as storage for the application data due to its higher bandwidth. The reported figure is the speedup versus a sequential implementation on a single CPU core, averaged over varying numbers of paths or options.
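A measurement loop of this kind might look like the sketch below. The article does not say how the timing error is estimated, so the relative standard error of the mean is used here as an assumption; the names `measure` and `speedup`, the tolerance, and the run limits are likewise illustrative.

```python
import math
import statistics
import time


def measure(fn, rel_tol=0.02, min_runs=5, max_runs=1000):
    """Run fn repeatedly until the relative standard error of the mean
    wall-clock time drops to rel_tol or below, or max_runs is reached."""
    times = []
    while len(times) < max_runs:
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
        if len(times) >= min_runs:
            mean = statistics.mean(times)
            # Assumed error estimate: standard error of the mean
            sem = statistics.stdev(times) / math.sqrt(len(times))
            if sem <= rel_tol * mean:
                break
    return statistics.mean(times), len(times)


def speedup(sequential_time, parallel_time):
    """Speedup of a parallel run relative to the sequential baseline."""
    return sequential_time / parallel_time


# Stand-in workload; a real run would call the full pricing algorithm
mean_s, runs = measure(lambda: sum(range(10_000)))
print(f"mean wall-clock time: {mean_s * 1e3:.3f} ms over {runs} runs")
```

The `max_runs` cap is there so that a noisy system cannot keep the loop running indefinitely.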

We observe that the Knights Landing processor delivers between 0.6x and 2.3x the performance of the Broadwell system (1.7x on average). It is faster for most applications, but loses on the binomial pricer.

Most of the applications benefit greatly from Knights Landing’s MCDRAM. Being able to access data at more than 4x the speed of Broadwell makes a huge difference and helps keep the cores better utilised.

For the binomial pricer, however, the application is highly compute-bound on both processors, and memory bandwidth makes little difference. In fact, on both platforms, disabling hyperthreading increases performance significantly for this application: the arithmetic already fully utilises each physical core’s resources, so hyperthreading only creates contention. The 36 sophisticated cores of the Broadwell system deliver 1.6x higher performance than Knights Landing’s 64 cores.
