- © L. Steffen / F. Simons
The Hydroinformatics Modeling System (hms) is a modeling framework developed by the Chair of Water Resources Management and Modeling of Hydrosystems, Technische Universität Berlin, since 2004. See the project page for details.
hms++ is a complete rewrite of hms with the goal of comprehensive performance optimisation.
hms++ aims to provide an efficient framework for the finite volume method on two-dimensional structured and unstructured grids, as well as an optimised implementation of first and second order HLLC Riemann solvers for the shallow water equations (SWE).
As the name indicates, hms++ is fully written in C++, which allows several avenues for optimisation when compared to the previous implementation of hms in Java:
- no Java Virtual Machine required on the target machine
- machine-specific optimisations by the compiler
- manual placement of compiler optimisation directives
- availability of mature high-performance computation libraries
- greater control of the memory layout, which is essential for achieving locality and alignment
- availability of certain advanced optimisation strategies, such as expression templates
- easier interfacing with graphics processing units (GPUs) using CUDA or HIP
Furthermore, the availability of data pointers allows for more efficient and intuitive usage of the MPI standard for distributed computing.
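The memory-layout and data-pointer points above can be sketched with a struct-of-arrays state vector: each field lives in its own contiguous buffer, so a raw pointer into it can be handed to an MPI call without any packing step. The type and field names below are illustrative, not the actual hms++ data structures:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cell state in struct-of-arrays layout: each physical field
// is one contiguous buffer, so a raw pointer into it can be passed to
// MPI_Send/MPI_Recv directly, without an intermediate packing step.
struct CellStateSoA {
    std::vector<double> h;   // water depth per cell
    std::vector<double> qx;  // x-discharge per cell
    std::vector<double> qy;  // y-discharge per cell

    explicit CellStateSoA(std::size_t n) : h(n), qx(n), qy(n) {}

    // Contiguous segment of the depth field starting at cell `first`,
    // e.g. MPI_Send(state.depth_ptr(first), count, MPI_DOUBLE, ...).
    const double* depth_ptr(std::size_t first) const { return h.data() + first; }
};
```

With an array-of-structs layout, the same halo exchange would require either a derived MPI datatype or an explicit copy into a send buffer.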
Multiple optimisation strategies are to be investigated. Code metrics such as cache utilisation (hits/misses), vectorisation, and branch predictions/mispredictions are to be examined for critical code sections. For flow simulations with the finite volume method (FVM), the most likely candidate for a performance-critical section is the flux calculation during each time step.
Both cache utilisation and vectorisation depend on data locality. Locality is achieved when, firstly, instructions operating on the same data are executed within a minimal time frame (temporal locality) and, secondly, the data operated on within a given time frame is located within a minimally long memory segment (spatial locality).
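A minimal illustration of spatial locality: the same sum over a row-major nx × ny array, once in the cache-friendly traversal order and once with a stride-nx access pattern that wastes most of each cache line. Names and layout are illustrative only:

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal: consecutive iterations read consecutive addresses,
// so each fetched cache line is fully used before being evicted.
double sum_row_major(const std::vector<double>& a, std::size_t nx, std::size_t ny) {
    double s = 0.0;
    for (std::size_t j = 0; j < ny; ++j)
        for (std::size_t i = 0; i < nx; ++i)
            s += a[j * nx + i];
    return s;
}

// Column-major traversal of the same row-major array: each access strides
// by nx doubles, giving poor spatial locality on large grids.
double sum_col_major(const std::vector<double>& a, std::size_t nx, std::size_t ny) {
    double s = 0.0;
    for (std::size_t i = 0; i < nx; ++i)
        for (std::size_t j = 0; j < ny; ++j)
            s += a[j * nx + i];
    return s;
}
```

Both functions compute the same result; only the memory access pattern, and hence the cache behaviour, differs.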
For flow simulations with the FVM, the relative memory layout of edge-related and cell-related data is the key to locality. As can be seen in the picture, all horizontal edges lie between cells which are neighbours within the mesh, but not in memory (as indicated by their indices). This is the core challenge for achieving locality in FVM simulations. Different indexing strategies for structured meshes, (possibly dynamic) element reordering for unstructured meshes, and edge traversal strategies are to be investigated in this project as candidates for improving locality and thus performance.
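For a structured mesh with row-major cell numbering, the situation described above can be made concrete: the two cells sharing a vertical edge are adjacent in memory, while the two cells sharing a horizontal edge lie nx entries apart. A minimal sketch, not the hms++ API:

```cpp
#include <cstddef>

// Row-major cell numbering on a structured nx*ny mesh. Cells left/right of
// a vertical edge differ by 1 in index (memory neighbours); cells below/above
// a horizontal edge differ by nx -- the locality gap discussed in the text.
struct StructuredMesh {
    std::size_t nx, ny;

    std::size_t cell(std::size_t i, std::size_t j) const { return j * nx + i; }

    // Cell west of the vertical edge between columns i-1 and i in row j.
    std::size_t west_of_vedge(std::size_t i, std::size_t j) const { return cell(i - 1, j); }

    // Cell south of the horizontal edge between rows j-1 and j in column i.
    std::size_t south_of_hedge(std::size_t i, std::size_t j) const { return cell(i, j - 1); }
};
```

Alternative indexing schemes (e.g. space-filling curves) change how far apart these horizontal-edge neighbours land in memory, which is why indexing and traversal strategies are candidates for optimisation.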
Very early in a flow simulation workflow, the mesh type is decided. This has a significant impact on the memory footprint of the calculation: uniformly rectangular meshes (type a in the picture) require only two integers (the number of cells in each direction) and four floating-point numbers (origin and spacing), and thus have constant memory requirements regardless of mesh size. The same mesh can, however, also be represented as an arbitrary unstructured mesh (type d), at much higher memory cost.
This is often the case when dealing with the output of external mesh generators. For example, the widely used mesh generator gmsh outputs all meshes as type d, even if their topology would allow representation as types a-c.
To ensure that the optimal mesh representation is used regardless of its source, hms++ can detect structured topology and convert any mesh to the most memory-efficient type that can represent it.
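One ingredient of such a detection can be sketched as a check for whether a set of coordinates lies on a uniformly spaced 1D lattice: a mesh whose cell-centre x and y coordinates both pass the check is a candidate for the compact type-a representation. This is a simplified stand-in, not the actual hms++ detection algorithm:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Returns true if the given coordinates, after sorting and deduplication,
// are uniformly spaced within the tolerance `tol`.
bool is_uniform(std::vector<double> coords, double tol = 1e-12) {
    std::sort(coords.begin(), coords.end());
    coords.erase(std::unique(coords.begin(), coords.end(),
                             [tol](double a, double b) { return std::fabs(a - b) < tol; }),
                 coords.end());
    if (coords.size() < 3) return true;  // 0-2 distinct points: trivially uniform
    const double dx = coords[1] - coords[0];
    for (std::size_t k = 2; k < coords.size(); ++k)
        if (std::fabs((coords[k] - coords[k - 1]) - dx) > tol) return false;
    return true;
}
```

A full detection would additionally verify that every lattice position is actually occupied by a cell, so that the structured index map is complete.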
During the runtime of a shallow water flow simulation, both wet and dry cells typically occur. Dry cells produce a lower computational load, which leads to a load imbalance between processing nodes. Intra-node load balancing can be handled by the operating system, but inter-node load balancing must be handled by the program.
In the picture, two strategies are shown. The upper one prescribes a fixed load factor to wet and dry cells, and then arranges node boundaries to achieve a similar load for each node. This heuristic method requires good calibration of the load factor, which may only be partially portable between machines. However, its high resolution down to cell level makes it a good candidate for a precise and efficient method.
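The load-factor heuristic can be sketched as a prefix sum over per-cell weights, with a node boundary placed wherever the cumulative load crosses the next multiple of the per-node target. The weights below (1.0 wet, 0.25 dry) are placeholder values, not calibrated factors:

```cpp
#include <cstddef>
#include <vector>

// Assigns a fixed load factor to each cell (wet or dry) and places node
// boundaries so each of the p nodes receives roughly total_load / p.
// Returns the first cell index owned by each node.
std::vector<std::size_t> partition_by_load(const std::vector<bool>& wet,
                                           std::size_t p,
                                           double wet_load = 1.0,
                                           double dry_load = 0.25) {
    double total = 0.0;
    for (bool w : wet) total += w ? wet_load : dry_load;

    std::vector<std::size_t> starts{0};
    const double target = total / static_cast<double>(p);
    double acc = 0.0;
    for (std::size_t c = 0; c < wet.size() && starts.size() < p; ++c) {
        acc += wet[c] ? wet_load : dry_load;
        // Boundary when cumulative load reaches the next per-node target.
        if (acc >= target * static_cast<double>(starts.size()))
            starts.push_back(c + 1);
    }
    return starts;  // starts[k] = first cell owned by node k
}
```

Because the boundary can move by single cells, the method has the cell-level resolution the text mentions, but its quality depends entirely on how well the load factors are calibrated.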
The lower method in the picture takes the actual load of the system and partitions the mesh accordingly. With this method, only one load factor per node is calculated, which may require more iterations to achieve an optimal inter-node load balance. However, this method does not require calibration and inherently accounts for unknown factors (such as services running in the background, I/O operations for writing out calculation results, etc.).
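The measured-load strategy can be sketched in a similar spirit: one speed estimate per node, derived from its current cell count and measured step time, determines how many cells each node receives after rebalancing. Again a simplified illustration, not the hms++ implementation:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Given the current cell counts and the measured wall-clock time per node,
// compute new cell counts proportional to each node's observed speed
// (cells processed per second). The total cell count is preserved.
std::vector<std::size_t> rebalance(const std::vector<std::size_t>& counts,
                                   const std::vector<double>& times) {
    const std::size_t total =
        std::accumulate(counts.begin(), counts.end(), std::size_t{0});

    std::vector<double> speed(counts.size());
    double speed_sum = 0.0;
    for (std::size_t k = 0; k < counts.size(); ++k) {
        speed[k] = static_cast<double>(counts[k]) / times[k];  // cells/second
        speed_sum += speed[k];
    }

    std::vector<std::size_t> next(counts.size());
    std::size_t assigned = 0;
    for (std::size_t k = 0; k + 1 < counts.size(); ++k) {
        next[k] = static_cast<std::size_t>(
            static_cast<double>(total) * speed[k] / speed_sum);
        assigned += next[k];
    }
    next.back() = total - assigned;  // rounding remainder goes to the last node
    return next;
}
```

Since the measured times already contain background services and I/O, no calibration is needed, but several rebalancing iterations may be required before the partition settles.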
Both strategies are to be implemented and compared, for both structured and unstructured meshes. The latter will require efficient graph partitioning, e.g. with the ParMETIS library.