

# The First SVE Enabled Arm Processor: A64FX and Building up Arm HPC Ecosystem

Shinji Sumimoto, Ph.D. Next Generation Technical Computing Unit FUJITSU LIMITED Jan. 14<sup>th</sup>, 2019

### Outline of This Talk



#### A64FX: High Performance Arm CPU

#### Arm HPC Ecosystem Development

- Arm HPC Software Topics
  - Activities with Arm, Linaro and OSS Community
  - OSS Application Porting Updates



# A64FX: High Performance Arm CPU

- From presentation slides of Hot Chips 30 and Cluster 2018
- Inheriting Fujitsu HPC CPU technologies with commodity standard ISA



#### A64FX Chip Overview





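|                      | A64FX (Post-K) | SPARC64 XIfx (PRIMEHPC FX100) |
|----------------------|----------------|-------------------------------|
| **ISA** (Extension)  | SVE            | HPC-ACE2                      |
| Process Node         | 7nm            | 20nm                          |
| **Peak Performance** | >2.7TFLOPS     | 1.1TFLOPS                     |
| SIMD                 | 512-bit        | 256-bit                       |
| # of Cores           | 48 + 4         | 32 + 2                        |
| Memory               | HBM2           | HMC                           |
| **Memory Peak B/W**  | 1024GB/s       | 240GB/s x2 (in/out)           |

- 594 package signal pins

#### Peak Performance (Efficiency)

- Peak performance: >2.7TFLOPS (>90%@DGEMM)
- Memory B/W: 1024GB/s (>80%@Stream Triad)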
#### A64FX Many-Core Architecture



Consisting of 4 CMGs (Core Memory Groups), Tofu and PCIe controllers

- A CMG consists of 13 cores, an L2 cache and a memory controller
  - One out of 13 cores is an assistant core which handles daemon, I/O, etc.
- CMGs keep cache coherency by ccNUMA with on-chip directory
- The X-bar connection realizes high efficiency for the L2 cache throughput
- NUMA-aware software techniques enable linear scalability up to 48 cores
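The core counts implied by the CMG description can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are made up for this example, and the mapping of compute-core IDs to CMGs is a hypothetical numbering, not something the slides specify.

```python
# Layout figures taken from the slide text.
CMGS_PER_CHIP = 4
CORES_PER_CMG = 13            # includes the assistant core
ASSISTANT_CORES_PER_CMG = 1


def chip_core_counts():
    """Return (compute_cores, assistant_cores) for one chip."""
    total = CMGS_PER_CHIP * CORES_PER_CMG
    assistant = CMGS_PER_CHIP * ASSISTANT_CORES_PER_CMG
    return total - assistant, assistant


def cmg_of_core(compute_core_id):
    """Map a compute core ID to its CMG.

    Assumes (hypothetically) that compute cores are numbered 0..47,
    CMG by CMG. A NUMA-aware runtime would keep threads and their
    data inside the CMG returned here.
    """
    per_cmg = CORES_PER_CMG - ASSISTANT_CORES_PER_CMG
    if not 0 <= compute_core_id < CMGS_PER_CHIP * per_cmg:
        raise ValueError("compute core ID out of range")
    return compute_core_id // per_cmg


compute, assistant = chip_core_counts()
print(f"{compute} compute cores + {assistant} assistant cores")
```

This reproduces the "48 + 4" core count quoted in the chip overview.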

Providing High I/O Performance by Wide Ring On-chip-network



#### A64FX Memory System

#### Extremely high bandwidth

- Asynchronous Processing in cores, caches and memory controllers
- Maximizing the capability of each layer's bandwidth



#### A64FX Core Features



- Optimizing the SVE architecture with Arm for a wide range of applications, including AI, through FP16 and INT16/INT8 dot product support
- Developing A64FX core micro-architecture to increase application performance

|                             | A64FX<br>(Post-K)  | SPARC64 XIfx<br>(PRIMEHPC FX100) | SPARC64 VIIIfx<br>(K computer) |
|-----------------------------|--------------------|----------------------------------|--------------------------------|
| ISA                         | Armv8.2-A + SVE    | SPARC-V9 + HPC-ACE2              | SPARC-V9 + HPC-ACE             |
| SIMD Width                  | 512-bit            | 256-bit                          | 128-bit                        |
| Four-operand FMA            | ✓ Enhanced         | ✓                                | ✓                              |
| Gather/Scatter              | ✓ Enhanced         | ✓                                |                                |
| Predicated Operations       | ✓ Enhanced         | ✓                                | ✓                              |
| Math. Acceleration          | ✓ Further enhanced | ✓ Enhanced                       | ✓                              |
| Compress                    | ✓ Enhanced         | ✓                                |                                |
| First Fault Load            | ✓ New              |                                  |                                |
| FP16                        | ✓ New              |                                  |                                |
| INT16/INT8 Dot Product      | ✓ New              |                                  |                                |
| HW Barrier* / Sector Cache* | ✓ Further enhanced | ✓ Enhanced                       | ✓                              |

\* Utilizing AArch64 implementation-defined system registers

#### A64FX: Power monitor and analyzer



Energy monitor (per chip)

- Node power via Power API\*1 (~msec)
  - Averaged power of a node, CMG (cores, an L2 cache, a memory), etc.

Energy analyzer (per core)

- Power profiler via PAPI\*2 (~nsec)
  - Fine-grained power analysis of a core, an L2 cache and a memory

\*1: Sandia National Laboratories

\*2: Performance Application Programming Interface



#### A64FX: Power Knobs to reduce power consumption

"Power knob" limits units' activity via user APIs

Performance/W would be optimized by utilizing Power knobs, Energy monitor & analyzer



#### A64FX Chip Level Application Performance



- Application performance is boosted by micro-architectural enhancements, 512-bit wide SIMD, HBM2 and semiconductor process technologies
  - More than 2.5x faster than SPARC64 XIfx in HPC/AI benchmarks, with code tuned by the Fujitsu compiler for the A64FX micro-architecture and SVE



#### A64FX Kernel Benchmark Performance (Preliminary results)

#### A64FX: Tofu interconnect D

#### Integrated w/ rich resources

- Increased TNIs achieve higher injection BW & flexible comm. patterns
- Increased barrier resources allow flexible collective comm. algorithms
- Memory bypassing achieves low latency
  - Direct descriptor & cache injection

|                     | TofuD spec   |
|---------------------|--------------|
| Data rate           | 28.05 Gbps   |
| Link bandwidth      | 6.8 GB/s     |
| Injection bandwidth | 40.8 GB/s    |
|                     | **Measured** |
| Put throughput      | 6.35 GB/s    |
| PingPong latency    | 0.49~0.54 µs |



#### TofuD: 6D Mesh/Torus Network

Six coordinates: (X, Y, Z) × (A, B, C)

- X, Y and Z: sizes depend on the system size
- A, B and C: sizes are fixed to 2, 3, and 2 respectively
- Tofu stands for "torus fusion"
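As a rough illustration of the six-coordinate scheme, the sketch below enumerates node coordinates for a given (X, Y, Z) size with the fixed (A, B, C) = (2, 3, 2). The function names and the enumeration order are assumptions made for this example; the slides do not define a node numbering.

```python
from itertools import product

# Fixed sizes of the A, B and C axes, as stated on the slide.
ABC_SIZES = (2, 3, 2)


def node_count(x, y, z):
    """Number of nodes in a system whose X, Y, Z sizes are x, y, z."""
    a, b, c = ABC_SIZES
    return x * y * z * a * b * c


def coordinates(x, y, z):
    """Iterate over every (X, Y, Z, A, B, C) coordinate.

    The iteration order is illustrative only.
    """
    a, b, c = ABC_SIZES
    return product(range(x), range(y), range(z),
                   range(a), range(b), range(c))


# Each (X, Y, Z) position holds a group of 2 * 3 * 2 = 12 nodes.
print(node_count(1, 1, 1))
print(next(iter(coordinates(2, 2, 2))))
```

Because A, B and C are fixed, the system size grows only through X, Y and Z, in steps of 12 nodes.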



#### TofuD: Packaging – CPU Memory Unit

Two CPUs connected with C-axis

- $X \times Y \times Z \times A \times B \times C = 1 \times 1 \times 1 \times 1 \times 1 \times 2$
- Two or three active optical cable cages on board
  - Each cable is shared by two CPUs



## TofuD: Packaging – Rack Structure

#### Rack

- 8 shelves
- 192 CMUs or 384 CPUs

#### Shelf

- 24 CMUs or 48 CPUs
- $X \times Y \times Z \times A \times B \times C = 1 \times 1 \times 4 \times 2 \times 3 \times 2$

#### Top or bottom half of rack

- 4 shelves
- $X \times Y \times Z \times A \times B \times C = 2 \times 2 \times 4 \times 2 \times 3 \times 2$
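The packaging figures are mutually consistent, which a short calculation confirms. This is a minimal sketch using only the axis sizes and counts quoted on the packaging slides; the dictionary and function names are illustrative.

```python
from math import prod

# (X, Y, Z, A, B, C) sizes per packaging level, from the slides.
PACKAGING = {
    "cmu": (1, 1, 1, 1, 1, 2),
    "shelf": (1, 1, 4, 2, 3, 2),
    "half_rack": (2, 2, 4, 2, 3, 2),
}


def cpus(level):
    """CPUs at a packaging level: the product of its six axis sizes."""
    return prod(PACKAGING[level])


cpus_per_shelf = cpus("shelf")          # 24 CMUs x 2 CPUs
cpus_per_half_rack = cpus("half_rack")  # 4 shelves
cpus_per_rack = 2 * cpus_per_half_rack  # top half + bottom half

print(cpus("cmu"), cpus_per_shelf, cpus_per_half_rack, cpus_per_rack)
```

The products match the slide text: 2 CPUs per CMU, 48 per shelf, and 384 per rack of 8 shelves.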





#### TofuD: Put Latencies, Throughput & Injection Rate (from Cluster 2018)



- TofuD: evaluated by hardware emulators using the production RTL code
- Simulation model: system-level, including multiple nodes

|       | Communication settings    | Latency |
|-------|---------------------------|---------|
| Tofu  | Descriptor on main memory | 1.15 µs |
|       | Direct Descriptor         | 0.91 µs |
| Tofu2 | Cache injection OFF       | 0.87 µs |
|       | Cache injection ON        | 0.71 µs |
| TofuD | To/From far CMGs          | 0.54 µs |
|       | To/From near CMGs         | 0.49 µs |

|       | Put throughput   | Injection rate  |
|-------|------------------|-----------------|
| Tofu  | 4.76 GB/s (95%)  | 15.0 GB/s (77%) |
| Tofu2 | 11.46 GB/s (92%) | 45.8 GB/s (92%) |
| TofuD | 6.35 GB/s (93%)  | 38.1 GB/s (93%) |
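The efficiency percentages for TofuD follow from dividing the measured values by the link and injection bandwidths in the TofuD spec table. The sketch below redoes that arithmetic and also derives the latency reduction relative to Tofu2; the variable and function names are illustrative.

```python
# TofuD figures quoted in the spec and measurement tables (GB/s).
LINK_BW = 6.8
INJECTION_BW = 40.8
MEASURED_PUT = 6.35
MEASURED_INJECTION = 38.1


def efficiency(measured, peak):
    """Fraction of the peak that was actually achieved."""
    return measured / peak


def latency_reduction(old_us, new_us):
    """Relative latency reduction going from old_us to new_us."""
    return 1.0 - new_us / old_us


put_eff = efficiency(MEASURED_PUT, LINK_BW)
inj_eff = efficiency(MEASURED_INJECTION, INJECTION_BW)
# Best-case Tofu2 (cache injection ON) vs. best-case TofuD (near CMGs).
tofu2_to_tofud = latency_reduction(0.71, 0.49)

print(f"put: {put_eff:.1%}, injection: {inj_eff:.1%}")
print(f"latency reduction vs. Tofu2: {tofu2_to_tofud:.1%}")
```

Both efficiencies round to the 93% shown in the table.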

#### A64FX: Summary


Arm SVE, high performance and efficiency

- DP performance >2.7 TFLOPS, >90%@DGEMM
- Memory BW 1024 GB/s, >80%@STREAM Triad



|                         | A64FX            |
|-------------------------|------------------|
| ISA (Base, extension)   | Armv8.2-A, SVE   |
| Process technology      | 7 nm             |
| Peak DP performance     | >2.7 TFLOPS      |
| SIMD width              | 512-bit          |
| # of compute cores      | 48               |
| Memory capacity         | 32 GiB (HBM2 x4) |
| Memory peak bandwidth   | 1024 GB/s        |
| PCIe                    | Gen3 16 lanes    |
| High speed interconnect | TofuD integrated |
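The headline numbers in the table can be related to each other with some back-of-the-envelope arithmetic. The sketch below takes 48 compute cores, 512-bit SIMD, >2.7 TFLOPS and 1024 GB/s from the slides. The number of FMA pipelines per core (two) is an assumption not stated in this talk, so the derived clock frequency is only an implied lower bound under that assumption.

```python
# Figures from the summary table.
COMPUTE_CORES = 48
SIMD_BITS = 512
PEAK_DP_FLOPS = 2.7e12      # lower bound (">2.7 TFLOPS")
MEM_BW_BYTES = 1024e9       # 1024 GB/s
HBM2_STACKS = 4
MEM_CAPACITY_GIB = 32

# ASSUMPTION (not in the slides): two SIMD FMA pipelines per core.
FMA_PIPES_PER_CORE = 2
FLOPS_PER_FMA = 2           # one multiply + one add

dp_lanes = SIMD_BITS // 64
flops_per_cycle = (COMPUTE_CORES * FMA_PIPES_PER_CORE
                   * dp_lanes * FLOPS_PER_FMA)

# Clock needed to reach the quoted peak under the assumption above.
min_clock_ghz = PEAK_DP_FLOPS / flops_per_cycle / 1e9

# Machine balance and the efficiency floors quoted on the slide.
bytes_per_flop = MEM_BW_BYTES / PEAK_DP_FLOPS
dgemm_floor_tflops = 0.90 * PEAK_DP_FLOPS / 1e12
triad_floor_gbs = 0.80 * MEM_BW_BYTES / 1e9

# Per-stack share of capacity and bandwidth (HBM2 x4).
gib_per_stack = MEM_CAPACITY_GIB / HBM2_STACKS
gbs_per_stack = MEM_BW_BYTES / 1e9 / HBM2_STACKS

print(flops_per_cycle, round(min_clock_ghz, 2), round(bytes_per_flop, 2))
```

Under these assumptions the chip has a memory balance of roughly 0.38 bytes per DP flop. The >90% DGEMM and >80% STREAM Triad figures correspond to more than about 2.43 TFLOPS and 819 GB/s respectively.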



# Arm HPC Software Topics: Activities with Linaro and OSS Community

With Arm and Linaro

With OSS Community: Open MPI and Lustre

## Activities with Arm and Linaro



- LLVM SVE upstreaming and OSS porting with Arm
  - Variable Vector Length Support for LLVM Community in cooperation with Arm
- OpenHPC with Linaro:
  - Mr. Okamoto (Fujitsu) has been selected as a 2018-2019 TSC (Technical Steering Committee) member

#### Development Status with Linaro

- LLVM/Clang improvement for AArch64: ongoing
  - Register allocation, software pipelining support, vectorization/SIMDization
  - Pushing SVE support to the LLVM community in cooperation with Arm; variable vector length support is the critical issue for getting it into the LLVM tree
- QEMU/SVE development: for building an SVE software development environment
  - V3.1.0 released: https://www.qemu.org/2018/12/12/qemu-3-1-0/

#### QEMU/SVE Development with Linaro

https://www.qemu.org/2018/12/12/qemu-3-1-0/


Finally, Version 3.1.0 supports SVE in system emulation mode!



#### Post-K Software Stack

Under Development



#### Post-K system supports SBSA/SBBR

Keeping binary compatibility with other AArch64-based systems.

Software stack layers (top to bottom):

- Post-K Applications
- FUJITSU Technical Computing Suite / RIKEN Advanced System Software
- Linux OS / McKernel (Lightweight Kernel)
- Post-K System Hardware

#### Open MPI: from SC18 BoF Slides

https://www.open-mpi.org/papers/sc-2018






## MPI for the Post-K Computer

- Post-K MPI based on Open MPI
  - Support A64FX (Armv8.2-A+SVE) and TofuD
  - Plan to use Open MPI 4.0 and PMIx 2.1
- Contributions to Open MPI from Post-K MPI
  - Persistent collectives [see next page]
  - Datatype for half-precision floating point [early 2019]
  - Thread parallelization of pack/unpack [early 2019]

Half-precision (FP16) datatype development started in cooperation with ANL and Mellanox
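To give a feel for what a half-precision datatype means for message buffers, the sketch below packs values as IEEE 754 binary16 using Python's standard `struct` module. It does not use or represent the Open MPI API; the helper names are made up for this example. It only shows the 2-byte element size and the rounding that FP16 introduces.

```python
import struct


def pack_fp16(values):
    """Pack floats into a little-endian buffer of IEEE binary16 values."""
    return struct.pack(f"<{len(values)}e", *values)


def unpack_fp16(buf):
    """Unpack a little-endian binary16 buffer back into Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


values = [1.0, 0.5, 0.1]
fp16_buf = pack_fp16(values)
fp64_buf = struct.pack(f"<{len(values)}d", *values)

# FP16 uses 2 bytes per element, a quarter of the FP64 buffer size.
print(len(fp16_buf), len(fp64_buf))
# 1.0 and 0.5 are exact in binary16; 0.1 is rounded.
print(unpack_fp16(fp16_buf))
```

The smaller element size is what makes a native FP16 datatype attractive for communication-heavy AI workloads, at the cost of reduced precision.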




### Lustre Community: OpenSFS and EOFS

- OpenSFS: US-based non-profit organization
  - President: Sarp Oral (ORNL)
- EOFS: EU-based non-profit organization
  - President: Frank Baetke (HPE)

Lustre for Arm

- Fujitsu is a member of OpenSFS and will support Lustre-based products.
- Two Major Events
  - Lustre User Group(LUG)
    - LUG 19@Houston, 2019/5/15-17 http://opensfs.org/events/
  - Lustre Admins and Devs workshop(LAD)
    - LAD 18@Paris, 2018/9/24-25 https://www.eofs.eu/events/lad18

Slide archives are available on each site







# 2018/11: Whamcloud has started Lustre client support on Arm-based platforms



https://www.ddn.com/press-releases/ddn-unveils-professional-support-lustre-arm-based-rm-platforms/





# Arm HPC Software Topics: OSS Application Porting Updates

#### OSS apps porting at Arm HPC Users Group



(http://arm-hpc.gitlab.io/)

Twelve primary OSS applications are listed and being tested in the Users Group with each compiler, in collaboration with Arm

| Application      | Lang.   | GCC      | LLVM         | Arm          | Fujitsu  |
|------------------|---------|----------|--------------|--------------|----------|
| LAMMPS           | C++     | Modified | Modified     | Modified     | Modified |
| GROMACS          | C       | Modified | Modified     | Modified     | Modified |
| GAMESS*          | Fortran | Modified | Modified     | Modified     | Modified |
| OpenFOAM         | C++     | Modified | Modified     | Modified     | Modified |
| NAMD             | C++     | Modified | Modified     | Modified     | Modified |
| WRF              | Fortran | Modified | Modified     | Modified     | Modified |
| Quantum ESPRESSO | Fortran | OK as is | OK as is     | OK as is     | Modified |
| NWChem           | Fortran | OK as is | Modified     | Modified     | Ongoing  |
| ABINIT           | Fortran | Modified | Modified     | Modified     | Modified |
| CP2K             | Fortran | OK as is | Issues found | Issues found | Ongoing  |
| NEST*            | C++     | OK as is | Modified     | Modified     | Modified |
| BLAST*           | C++     | OK as is | Modified     | Modified     | Modified |

\* Registered by Fujitsu

#### Issue of CP2K (known issue)



flang rejects valid empty constructor

- https://github.com/flang-compiler/flang/issues/239 (Closed)
- https://github.com/flang-compiler/flang/issues/615 (New Ticket, Segfault)

dbcsr_data_types.F90

    module dbcsr_data_types
      TYPE dbcsr_mempool_type
      END TYPE dbcsr_mempool_type

      TYPE dbcsr_memtype_type
        LOGICAL :: mpi = .FALSE.
        TYPE(dbcsr_mempool_type), POINTER :: pool => Null()
      END TYPE dbcsr_memtype_type
    end module

dbcsr_data_types_user.F90

    module a
      use dbcsr_data_types, only: dbcsr_memtype_type

      type foo
        type(dbcsr_memtype_type) :: val = dbcsr_memtype_type()
      end type
    end module

Compiling with gfortran succeeds, while flang aborts:

    [eco@cn-r05-01 work]$ gfortran -c dbcsr_data_types.F90 && gfortran -c dbcsr_data_types_user.F90
    [eco@cn-r05-01 work]$
    [eco@cn-r05-01 work]$ flang -c dbcsr_data_types.F90 && flang -c dbcsr_data_types_user.F90
    F90-F-0155-Empty structure constructor() - type dbcsr_memtype_type (dbcsr_data_types_user.F90: 5)
    F90/x86-64 Linux Flang - 1.5 2017-05-01: compilation aborted
    [eco@cn-r05-01 work]$

#### Summary



A64FX: High Performance Arm CPU

- More than 2.5 TFLOPS single-chip DGEMM performance
- Arm is no longer only a mobile CPU architecture but also a high-end HPC one

Arm HPC Ecosystem Development

- Arm HPC Software Topics
  - Activities with Arm, Linaro and OSS Community
  - Porting and Evaluation of HPC Applications
- Will need continuous efforts
