doi 10.3847/1538-4357/acbea5

SCF-FDPS: A Fast $N$-body Code for Simulating Disk-halo Systems

Authors: Shunsuke Hozumi, Keigo Nitadori, Masaki Iwasawa

Abstract: A fast $N$-body code has been developed for simulating a stellar disk embedded in a live dark matter halo. In generating its Poisson solver, a self-consistent field (SCF) code which inherently possesses perfect scalability is incorporated into a tree code which is parallelized using a library termed Framework for Developing Particle Simulators (FDPS). Thus, the code developed here is called SCF-FDPS. This code has realized the speedup of a conventional tree code by applying an SCF method not only to the calculation of the self-gravity of the halo but also to that of the gravitational interactions between the disk and halo particles. Consequently, in the SCF-FDPS code, a tree algorithm is applied only to calculate the self-gravity of the disk. On a many-core parallel computer, the SCF-FDPS code has performed at least three times, in some cases nearly an order of magnitude, faster than an extremely-tuned tree code on it, if the numbers of disk and halo particles are, respectively, fixed for both codes. In addition, the SCF-FDPS code shows that the CPU cost scales almost linearly with the total number of particles and almost inversely with the number of cores. We find that the time evolution of a disk-halo system simulated with the SCF-FDPS code is, in large measure, similar to that obtained using the tree code. We suggest how the present code can be extended to cope with a wide variety of disk-galaxy simulations.

Submitted 23 February, 2023; originally announced February 2023.

Comments: 13 pages, 8 figures, accepted for publication in ApJ

arXiv:2007.15352 [pdf, other]

doi 10.1093/mnras/staa2295

Step-size effect in the time-transformed leapfrog integrator on elliptic and hyperbolic orbits

Authors: Long Wang, Keigo Nitadori

Abstract: A drift-kick-drift (DKD) type leapfrog symplectic integrator applied for a time-transformed separable Hamiltonian (or time-transformed symplectic integrator; TSI) has been known to conserve the Kepler orbit exactly. We find that for an elliptic orbit, this feature appears for an arbitrary step size. But it is not the case for a hyperbolic orbit: when the half step size is larger than the conjugate momenta of the mean anomaly, a phase transition happens and the new position jumps to the nonphysical counterpart of the hyperbolic trajectory. Once it happens, the energy conservation is broken. Instead, the kinetic energy minus the potential energy becomes a new conserved quantity. We provide a mathematical explanation for this phenomenon. Our result provides a deeper understanding of the TSI method, and a useful constraint on the step size when the TSI method is used to solve hyperbolic encounters. This is particularly important when a (Bulirsch-Stoer) extrapolation integrator is used together with it, which requires the convergence of integration errors.

Submitted 30 July, 2020; originally announced July 2020.

Comments: 6 pages, 3 figures, MNRAS, accepted
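
The exact-orbit property for elliptic orbits mentioned in the abstract above can be checked numerically with a short sketch. The abstract does not state which time transformation is used; the code below assumes the logarithmic-Hamiltonian form, with unit reduced mass, a planar orbit, and potential U = mu/r. The function names `logh_dkd_step` and `run_kepler` are hypothetical and are not taken from the paper.

```python
import math

def logh_dkd_step(r, v, h, e0):
    """One drift-kick-drift step in fictitious time h (assumed LogH form).

    r, v: 2D position and velocity lists; e0: initial orbital energy.
    Returns the new (r, v) and the physical time advanced.
    """
    # First half drift: physical time step is (h/2) / (T - e0)
    w = 0.5 * (v[0] ** 2 + v[1] ** 2) - e0
    dt1 = 0.5 * h / w
    r = [r[0] + v[0] * dt1, r[1] + v[1] * dt1]
    # Kick: dv/ds = (dU/dr) / U = -r_vec / |r|^2 (mu cancels out)
    r2 = r[0] ** 2 + r[1] ** 2
    v = [v[0] - h * r[0] / r2, v[1] - h * r[1] / r2]
    # Second half drift with the updated kinetic energy
    w = 0.5 * (v[0] ** 2 + v[1] ** 2) - e0
    dt2 = 0.5 * h / w
    r = [r[0] + v[0] * dt2, r[1] + v[1] * dt2]
    return r, v, dt1 + dt2

def run_kepler(n_steps, h, ecc=0.9, mu=1.0):
    """Integrate a bound orbit with semi-major axis 1, starting at pericenter.

    Returns (initial energy, final energy, elapsed physical time).
    """
    r = [1.0 - ecc, 0.0]
    v = [0.0, math.sqrt(mu * (1.0 + ecc) / (1.0 - ecc))]
    e0 = 0.5 * (v[0] ** 2 + v[1] ** 2) - mu / math.hypot(r[0], r[1])
    t = 0.0
    for _ in range(n_steps):
        r, v, dt = logh_dkd_step(r, v, h, e0)
        t += dt
    e1 = 0.5 * (v[0] ** 2 + v[1] ** 2) - mu / math.hypot(r[0], r[1])
    return e0, e1, t

if __name__ == "__main__":
    e0, e1, t = run_kepler(n_steps=500, h=0.2)
    print("energy error:", abs(e1 - e0), "elapsed time:", t)
```

For a bound orbit the energy error should stay at round-off level even for a coarse step, since only the orbital phase (the elapsed time) carries an integration error. The sketch does not reproduce the hyperbolic-orbit failure analysed in the paper.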

arXiv:2006.16560 [pdf, other]

doi 10.1093/mnras/staa1915

PeTar: a high-performance N-body code for modeling massive collisional stellar systems

Authors: Long Wang, Masaki Iwasawa, Keigo Nitadori, Junichiro Makino

Abstract: The numerical simulations of massive collisional stellar systems, such as globular clusters (GCs), are very time-consuming. Until now, only a few realistic million-body simulations of GCs with a small fraction of binaries (5%) have been performed by using the NBODY6++GPU code. Such models took half a year of computational time on a GPU based super-computer. In this work, we develop a new N-body code, PeTar, by combining the methods of Barnes-Hut tree, Hermite integrator and slow-down algorithmic regularization (SDAR). The code can accurately handle an arbitrary fraction of multiple systems (e.g. binaries, triples) while keeping a high performance by using the hybrid parallelization methods with MPI, OpenMP, SIMD instructions and GPU. A few benchmarks indicate that PeTar and NBODY6++GPU have a very good agreement on the long-term evolution of the global structure, binary orbits and escapers. On a highly configured GPU desktop computer, a million-body simulation with all stars in binaries runs 11 times faster with PeTar than with NBODY6++GPU. Moreover, on the Cray XC50 supercomputer, PeTar scales well as the number of cores increases. The ten million-body problem, which covers the region of ultra compact dwarfs and nuclear star clusters, becomes solvable.

Submitted 27 July, 2020; v1 submitted 30 June, 2020; originally announced June 2020.

Comments: 20 pages, 17 figures, accepted for MNRAS

Journal ref: MNRAS 497 (2020) 536-555

arXiv:2002.07938 [pdf, other]

doi 10.1093/mnras/staa480

A slow-down time-transformed symplectic integrator for solving the few-body problem

Authors: Long Wang, Keigo Nitadori, Junichiro Makino

Abstract: An accurate and efficient method dealing with the few-body dynamics is important for simulating collisional N-body systems like star clusters and for following the formation and evolution of compact binaries. We describe such a method which combines the time-transformed explicit symplectic integrator (Preto & Tremaine 1999; Mikkola & Tanikawa 1999) and the slow-down method (Mikkola & Aarseth 1996). The former conserves the Hamiltonian and the angular momentum for a long-term evolution, while the latter significantly reduces the computational cost for a weakly perturbed binary. In this work, the Hamilton equations of this algorithm are analyzed in detail. We mathematically and numerically show that it can correctly reproduce the secular evolution like the orbit averaged method and also well conserve the angular momentum. For a weakly perturbed binary, the method can provide performance a few orders of magnitude faster than the classical algorithm. A publicly available code written in the C++ language, SDAR, is available on GitHub (https://github.com/lwang-astro/SDAR). It can be used either as a standalone tool or a library to be plugged in other $N$-body codes. The high precision of the floating point to 62 digits is also supported.

Submitted 18 February, 2020; originally announced February 2020.

Comments: 14 pages, 13 figures, accepted to MNRAS

arXiv:1907.02290 [pdf, ps, other]

doi 10.1093/pasj/psz133

Accelerated FDPS --- Algorithms to Use Accelerators with FDPS

Authors: Masaki Iwasawa, Daisuke Namekata, Keigo Nitadori, Kentaro Nomura, Long Wang, Miyuki Tsubouchi, Junichiro Makino

Abstract: In this paper, we describe the algorithms we implemented in FDPS to make efficient use of accelerator hardware such as GPGPUs. We have developed FDPS to make it possible for many researchers to develop their own high-performance parallel particle-based simulation programs without spending a large amount of time on parallelization and performance tuning. The basic idea of FDPS is to provide a high-performance implementation of parallel algorithms for particle-based simulations in a "generic" form, so that researchers can define their own particle data structure and interparticle interaction functions and supply them to FDPS. FDPS compiled with user-supplied data type and interaction function provides all necessary functions for parallelization, and using those functions researchers can write their programs as though they are writing a simple non-parallel program. It has been possible to use accelerators with FDPS, by writing the interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator and also by the mismatch between the available degree of parallelism of the interaction function and that of the hardware parallelism. We have modified the interface of user-provided interaction function so that accelerators are more efficiently used. We also implemented new techniques which reduce the amount of work on the side of CPU and amount of communication between CPU and accelerators.
We have measured the performance of N-body simulations on a system with NVIDIA Volta GPGPU using FDPS and the achieved performance is around 27% of the theoretical peak limit. We have constructed a detailed performance model, and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth.

Submitted 4 July, 2019; originally announced July 2019.

arXiv:1907.02289 [pdf, ps, other]

Implementation and Performance of Barnes-Hut N-body algorithm on Extreme-scale Heterogeneous Many-core Architectures

Authors: Masaki Iwasawa, Daisuke Namekata, Ryo Sakamoto, Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori, Long Wang, Miyuki Tsubouchi, Jun Makino, Zhao Liu, Haohuan Fu, Guangwen Yang

Abstract: In this paper, we report the implementation and measured performance of our extreme-scale global simulation code on Sunway TaihuLight and two PEZY-SC2 systems: Shoubu System B and Gyoukou. The numerical algorithm is the parallel Barnes-Hut tree algorithm, which has been used in many large-scale astrophysical particle-based simulations. Our implementation is based on our FDPS framework. However, the extremely large numbers of cores of the systems used (10M on TaihuLight and 16M on Gyoukou) and their relatively poor memory and network bandwidth pose new challenges. We describe the new algorithms introduced to achieve high efficiency on machines with low memory bandwidth. The measured performance is 47.9 PF, 10.6 PF, and 1.01 PF on TaihuLight, Gyoukou and Shoubu System B (efficiency 40%, 23.5% and 35.5%). The current code is developed for the simulation of planetary rings, but most of the new algorithms are useful for other simulations, and are now available in the FDPS framework.

Submitted 4 July, 2019; originally announced July 2019.

arXiv:1903.03138 [pdf, other]

doi 10.3847/1538-4357/ab0d1f

A Mean-Field Approach to Simulating the Merging of Collisionless Stellar Systems Using a Particle-Based Method

Authors: Shunsuke Hozumi, Masaki Iwasawa, Keigo Nitadori

Abstract: We present a mean-field approach to simulating merging processes of two spherical collisionless stellar systems. This approach is realized with a self-consistent field (SCF) method in which the full spatial dependence of the density and potential of a system is expanded in a set of basis functions for solving Poisson's equation. In order to apply this SCF method to a merging situation where two systems are moving in space, we assign the expansion center to the center of mass of each system, the position of which is followed by a mass-less particle placed at that position initially. Merging simulations over a wide range of impact parameters are performed using both an SCF code developed here and a tree code. The results of each simulation produced by the two codes show excellent agreement in the evolving morphology of the merging systems and in the density and velocity dispersion profiles of the merged systems. However, comparing the results generated by the tree code to those obtained with the softening-free SCF code, we have found that in large impact parameter cases, a softening length of the Plummer type introduced in the tree code has an effect of advancing the orbital phase of the two systems in the merging process at late times. We demonstrate that the faster orbital phase originates from the larger convergence length to the pure Newtonian force. Other application problems suitable to the current SCF code are also discussed.

Submitted 7 March, 2019; originally announced March 2019.

Comments: 16 pages, 13 figures (14 figure files), accepted for publication in ApJ

Report number: SUE-SH-1901
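
The "convergence length to the pure Newtonian force" mentioned in the abstract above can be illustrated with a small sketch. The abstract does not write out the softened force law; the code assumes the standard Plummer-softened form a(r) = m r / (r^2 + eps^2)^(3/2) with G = 1. The function names are hypothetical, and the 1% tolerance is an arbitrary choice for illustration, not a value from the paper.

```python
import math

def newtonian_accel(m, r):
    """Magnitude of the pure Newtonian acceleration at separation r (G = 1)."""
    return m / (r * r)

def plummer_accel(m, r, eps):
    """Magnitude of the Plummer-softened acceleration with softening length eps."""
    return m * r / (r * r + eps * eps) ** 1.5

def softening_ratio(r, eps):
    """Softened-to-Newtonian force ratio, equal to (r^2 / (r^2 + eps^2))^(3/2)."""
    return plummer_accel(1.0, r, eps) / newtonian_accel(1.0, r)

def convergence_radius(eps, tol):
    """Separation at which the softened force is within a fraction tol of Newtonian."""
    q = (1.0 - tol) ** (2.0 / 3.0)  # required value of r^2 / (r^2 + eps^2)
    return eps * math.sqrt(q / (1.0 - q))

if __name__ == "__main__":
    eps = 1.0
    for r in (1.0, 3.0, 10.0):
        print(f"r/eps = {r:4.1f}  ratio = {softening_ratio(r, eps):.4f}")
    print("1% convergence radius in units of eps:", convergence_radius(eps, 0.01))
```

Under this assumed force law the softened force is always weaker than the Newtonian one and approaches it only at separations many times larger than eps, which is the sense in which Plummer softening has a long convergence length.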

arXiv:1804.08935 [pdf, ps, other]

doi 10.1093/pasj/psy062

Fortran interface layer of the framework for developing particle simulator FDPS

Authors: Daisuke Namekata, Masaki Iwasawa, Keigo Nitadori, Ataru Tanikawa, Takayuki Muranushi, Long Wang, Natsuki Hosono, Kentaro Nomura, Junichiro Makino

Abstract: Numerical simulations based on particle methods have been widely used in various fields including astrophysics. To date, simulation software has been developed by individual researchers or research groups in each field, with a huge amount of time and effort, even though numerical algorithms used are very similar. To improve the situation, we have developed a framework, called FDPS, which enables researchers to easily develop massively parallel particle simulation codes for arbitrary particle methods. Until version 3.0, FDPS provided an API only for the C++ programming language. This limitation comes from the fact that FDPS is developed using the template feature in C++, which is essential to support arbitrary data types of particle. However, there are many researchers who use Fortran to develop their codes. Thus, the previous versions of FDPS require such people to invest much time to learn C++. This is inefficient. To cope with this problem, we newly developed a Fortran interface layer in FDPS, which provides an API for Fortran. In order to support arbitrary data types of particle in Fortran, we design the Fortran interface layer as follows. Based on a given derived data type in Fortran representing particle, a Python script provided by us automatically generates a library that manipulates the C++ core part of FDPS. This library is seen as a Fortran module providing the API of FDPS from the Fortran side and uses C programs internally to interoperate Fortran with C++. In this way, we have overcome several technical issues when emulating `template' in Fortran.
By using the Fortran interface, users can develop all parts of their codes in Fortran. We show that the overhead of the Fortran interface part is sufficiently small and a code written in Fortran shows a performance practically identical to the one written in C++.

Submitted 25 April, 2018; v1 submitted 24 April, 2018; originally announced April 2018.

Comments: 10 pages, 10 figures; accepted for publication in PASJ; a typo in author name is corrected

arXiv:1612.06984 [pdf, ps, other]

doi 10.1093/pasj/psw131

Unconvergence of Very Large Scale GI Simulations

Authors: Natsuki Hosono, Masaki Iwasawa, Ataru Tanikawa, Keigo Nitadori, Takayuki Muranushi, Junichiro Makino

Abstract: The giant impact (GI) is one of the most important hypotheses both in planetary science and geoscience, since it is related to the origin of the Moon and also the initial condition of the Earth. A number of numerical simulations have been done using the smoothed particle hydrodynamics (SPH) method. However, the GI hypothesis is currently in a crisis. The "canonical" GI scenario failed to explain the identical isotope ratio between the Earth and the Moon. On the other hand, little has been known about the reliability of the result of GI simulations. In this paper, we discuss the effect of the resolution on the results of the GI simulations by varying the number of particles from $3 \times10^3$ to $10^8$. We found that the results do not converge, but show oscillatory behaviour. We discuss the origin of this oscillatory behaviour.

Submitted 21 December, 2016; originally announced December 2016.

Comments: Accepted to PASJ, an animation is available at https://vimeo.com/194156367

arXiv:1601.03138 [pdf, ps, other]

doi 10.1093/pasj/psw053

Implementation and performance of FDPS: A Framework Developing Parallel Particle Simulation Codes

Authors: Masaki Iwasawa, Ataru Tanikawa, Natsuki Hosono, Keigo Nitadori, Takayuki Muranushi, Junichiro Makino

Abstract: We present the basic idea, implementation, measured performance and performance model of FDPS (Framework for developing particle simulators). FDPS is an application-development framework which helps the researchers to develop particle-based simulation programs for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, redistribution of particles, and gathering of particle information for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as Barnes-Hut tree method should be used for long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are necessary. FDPS provides all of these necessary functions for efficient parallel execution of particle-based simulations as "templates", which are independent of the actual data structure of particles and the functional form of the interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N^2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model.
The weak scaling performance is very good, and almost linear speedup was obtained for up to the full system of K computer. The minimum calculation time per timestep is in the range of 30 ms (N=10^7) to 300 ms (N=10^9). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks.

Submitted 24 April, 2016; v1 submitted 13 January, 2016; originally announced January 2016.

Comments: 22 pages, 27 figures, accepted for publication in PASJ. The FDPS package is here https://github.com/fdps/fdps
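
For orientation, the "simple, sequential and unoptimized program of O(N^2) calculation cost" referred to in the abstract above corresponds to a direct-summation force loop like the sketch below. This is plain Python for illustration only and is not FDPS code or the FDPS API. The function name `direct_accelerations`, the optional Plummer softening `eps`, and the choice G = 1 are assumptions.

```python
import math

def direct_accelerations(masses, positions, eps=0.0):
    """Gravitational accelerations by direct pairwise summation (G = 1).

    masses: list of N floats; positions: list of N [x, y, z] lists.
    eps is an optional Plummer softening length (0 means pure Newtonian).
    The cost is O(N^2) because every particle visits every other particle.
    """
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip self-interaction
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += masses[j] * dx[k] * inv_r3
    return acc

if __name__ == "__main__":
    masses = [1.0, 2.0]
    positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    print(direct_accelerations(masses, positions))
```

A tree method such as Barnes-Hut replaces the inner loop over all particles with a walk over grouped, distant cells, which is the kind of substitution the abstract says the framework supplies on the user's behalf.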

arXiv:1504.03687 [pdf, ps, other]

doi 10.1093/mnras/stv817

NBODY6++GPU: Ready for the gravitational million-body problem

Authors: Long Wang, Rainer Spurzem, Sverre Aarseth, Keigo Nitadori, Peter Berczik, M. B. N. Kouwenhoven, Thorsten Naab

Abstract: Accurate direct $N$-body simulations help to obtain detailed information about the dynamical evolution of star clusters. They also enable comparisons with analytical models and Fokker-Planck or Monte-Carlo methods. NBODY6 is a well-known direct $N$-body code for star clusters, and NBODY6++ is the extended version designed for large particle number simulations by supercomputers. We present NBODY6++GPU, an optimized version of NBODY6++ with hybrid parallelization methods (MPI, GPU, OpenMP, and AVX/SSE) to accelerate large direct $N$-body simulations, and in particular to solve the million-body problem. We discuss the new features of the NBODY6++GPU code, benchmarks, as well as the first results from a simulation of a realistic globular cluster initially containing a million particles. For million-body simulations, NBODY6++GPU is $400-2000$ times faster than NBODY6 with 320 CPU cores and 32 NVIDIA K20X GPUs. With this computing cluster specification, the simulations of million-body globular clusters including $5\%$ primordial binaries require about an hour per half-mass crossing time.

Submitted 21 May, 2015; v1 submitted 14 April, 2015; originally announced April 2015.

Comments: 13 pages, 9 figures, 3 tables

Journal ref: MNRAS 450, 4070-4080 (2015)

arXiv:1412.0659 [pdf, other]

doi 10.1109/SC.2014.10

24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way Galaxy with 18600 GPUs

Authors: Jeroen Bédorf, Evghenii Gaburov, Michiko S. Fujii, Keigo Nitadori, Tomoaki Ishiyama, Simon Portegies Zwart

Abstract: We have simulated, for the first time, the long term evolution of the Milky Way Galaxy using 51 billion particles on the Swiss Piz Daint supercomputer with our $N$-body gravitational tree-code Bonsai. Herein, we describe the scientific motivation and numerical algorithms. The Milky Way model was simulated for 6 billion years, during which the bar structure and spiral arms were fully formed. This improves upon previous simulations by using 1000 times more particles, and provides a wealth of new data that can be directly compared with observations. We also report the scalability on both the Swiss Piz Daint and the US ORNL Titan. On Piz Daint the parallel efficiency of Bonsai was above 95%. The highest performance was achieved with a 242 billion particle Milky Way model using 18600 GPUs on Titan, thereby reaching a sustained GPU and application performance of 33.49 Pflops and 24.77 Pflops respectively.

Submitted 1 December, 2014; originally announced December 2014.

Comments: 12 pages, 4 figures, Published in: 'Proceeding SC '14 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis'. Gordon Bell Prize 2014 finalist

arXiv:1409.5981 [pdf, other]

Particle mesh multipole method: An efficient solver for gravitational/electrostatic forces based on multipole method and fast convolution over a uniform mesh

Authors: Keigo Nitadori

Abstract: We propose an efficient algorithm for the evaluation of the potential and its gradient of gravitational/electrostatic $N$-body systems, which we call particle mesh multipole method (PMMM or PM$^3$). PMMM can be understood both as an extension of the particle mesh (PM) method and as an optimization of the fast multipole method (FMM). In the former viewpoint, the scalar density and potential held by a grid point are extended to multipole moments and local expansions in $(p+1)^2$ real numbers, where $p$ is the order of expansion. In the latter viewpoint, a hierarchical octree structure which brings its $\mathcal O(N)$ nature, is replaced with a uniform mesh structure, and we exploit the convolution theorem with fast Fourier transform (FFT) to speed up the calculations. Hence, independent $(p+1)^2$ FFTs with the size equal to the number of grid points are performed. The fundamental idea is common to PPPM/MPE by Shimada et al. (1993) and FFTM by Ong et al. (2003). PMMM differs from them in supporting both the open and periodic boundary conditions, and employing an irreducible form where both the multipole moments and local expansions are expressed in $(p+1)^2$ real numbers and the transformation matrices in $(2p+1)^2$ real numbers. The computational complexity is the larger of $\mathcal O(p^2 N)$ and $\mathcal O(N \log (N/p^2))$, and the memory demand is $\mathcal O(N)$ when the number of grid points is $\propto N/p^2$.

Submitted 17 October, 2014; v1 submitted 21 September, 2014; originally announced September 2014.

Comments: Submitted to J. Comput. Phys. (editor reject 17 Oct. 2014), 30 pages, 7 figures

arXiv:1211.4406 [pdf, ps, other]

4.45 Pflops Astrophysical N-Body Simulation on K computer -- The Gravitational Trillion-Body Problem

Authors: Tomoaki Ishiyama, Keigo Nitadori, Junichiro Makino

Abstract: As an entry for the 2012 Gordon-Bell performance prize, we report performance results of astrophysical N-body simulations of one trillion particles performed on the full system of K computer. This is the first gravitational trillion-body simulation in the world. We describe the scientific motivation, the numerical algorithm, the parallelization strategy, and the performance analysis. Unlike many previous Gordon-Bell prize winners that used the tree algorithm for astrophysical N-body simulations, we used the hybrid TreePM method, for similar level of accuracy in which the short-range force is calculated by the tree algorithm, and the long-range force is solved by the particle-mesh algorithm. We developed a highly-tuned gravity kernel for short-range forces, and a novel communication algorithm for long-range forces. The average performance on 24576 and 82944 nodes of K computer are 1.53 and 4.45 Pflops, which correspond to 49% and 42% of the peak speed.

Submitted 13 April, 2015; v1 submitted 19 November, 2012; originally announced November 2012.

Comments: 10 pages, 6 figures, Proceedings of Supercomputing 2012 (http://sc12.supercomputing.org/), Gordon Bell Prize Winner. Additional information is http://www.ccs.tsukuba.ac.jp/CCS/eng/gbp2012

Journal ref: SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Article No. 5 (2012)

arXiv:1205.1222 [pdf, ps, other]

doi 10.1111/j.1365-2966.2012.21227.x

Accelerating NBODY6 with Graphics Processing Units

Authors: Keigo Nitadori, Sverre J. Aarseth

Abstract: We describe the use of Graphics Processing Units (GPUs) for speeding up the code NBODY6 which is widely used for direct $N$-body simulations. Over the years, the $N^2$ nature of the direct force calculation has proved a barrier for extending the particle number. Following an early introduction of force polynomials and individual time-steps, the calculation cost was first reduced by the introduction of a neighbour scheme. After a decade of GRAPE computers which speeded up the force calculation further, we are now in the era of GPUs where relatively small hardware systems are highly cost-effective. A significant gain in efficiency is achieved by employing the GPU to obtain the so-called regular force which typically involves some 99 percent of the particles, while the remaining local forces are evaluated on the host. However, the latter operation is performed up to 20 times more frequently and may still account for a significant cost. This effort is reduced by parallel SSE/AVX procedures where each interaction term is calculated using mainly single precision. We also discuss further strategies connected with coordinate and velocity prediction required by the integration scheme. This leaves hard binaries and multiple close encounters which are treated by several regularization methods. The present nbody6-GPU code is well balanced for simulations in the particle range $10^4-2 \times 10^5$ for a dual GPU system attached to a standard PC.

Submitted 6 May, 2012; originally announced May 2012.

Comments: 8 pages, 3 figures, 2 tables, MNRAS accepted

arXiv:1203.4037

doi 10.1016/j.newast.2012.08.009

Phantom-GRAPE: numerical software library to accelerate collisionless $N$-body simulation with SIMD instruction set on x86 architecture

Authors: Ataru Tanikawa, Kohji Yoshikawa, Keigo Nitadori, Takashi Okamoto

Abstract: (Abridged) We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE" which highly accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, AVX, an enhanced version of SSE. In our library, not only Newton's forces, but also central forces with an arbitrary shape f(r), which has a finite cutoff radius r_cut (i.e. f(r)=0 at r>r_cut), can be quickly computed. Using an Intel Core i7-2600 processor, we measure the performance of our library for both kinds of force. In the case of Newton's forces, we achieve 2 x 10^9 interactions per second with 1 processor core, which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times higher than that with the SSE instructions. With 4 processor cores, we obtain a performance of 8 x 10^9 interactions per second. In the case of the arbitrarily shaped forces, we can calculate 1 x 10^9 and 4 x 10^9 interactions per second with 1 and 4 processor cores, respectively. The performance with 1 processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions, respectively. These performances depend weakly on the number of particles. This is in clear contrast with the fact that the performance of force calculations accelerated by GPUs depends strongly on the number of particles.
Such a weak dependence of the performance on the number of particles is well suited to collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. Collisionless N-body simulations accelerated with our library have a significant advantage over those accelerated by GPUs, especially on massively parallel environments.

Submitted 9 October, 2012; v1 submitted 19 March, 2012; originally announced March 2012.

Comments: 19 pages, 11 figures, 4 tables, accepted for publication in New Astronomy

arXiv:1203.1623

doi 10.1088/0004-637X/756/1/30

Formation and Hardening of Supermassive Black Hole Binaries in Minor Mergers of Disk Galaxies

Authors: Fazeel Mahmood Khan, Ingo Berentzen, Peter Berczik, Andreas Just, Lucio Mayer, Keigo Nitadori, Simone Callegari

Abstract: We model for the first time the complete orbital evolution of a pair of Supermassive Black Holes (SMBHs) in a 1:10 galaxy merger of two disk-dominated gas-rich galaxies, from the stage prior to the formation of the binary up to the onset of gravitational wave emission when the binary separation has shrunk to 1 milliparsec. The high-resolution smoothed particle hydrodynamics (SPH) simulations used for the first phase of the evolution include star formation, accretion onto the SMBHs as well as feedback from supernova explosions and radiative heating from the SMBHs themselves. Using the direct N-body code φ-GPU we evolve the system further without including the effect of gas, which has been mostly consumed by star formation in the meantime. We start at the time when the separation between the two SMBHs is ~ 700 pc and the two black holes are still embedded in their galaxy cusps. We use 3 million particles to study the formation and evolution of the SMBH binary until it becomes hard. After a hard binary is formed, we reduce (reselect) the particles to 1.15 million and follow the subsequent shrinking of the SMBH binary due to 3-body encounters with the stars. We find approximately constant hardening rates and that the SMBH binary rapidly develops a high eccentricity. Similar hardening rates and eccentricity values are reported in earlier studies of SMBH binary evolution in the merging of dissipation-less spherical galaxy models. The estimated coalescence time is ~ 2.9 Gyr, significantly smaller than a Hubble time.
We discuss why this timescale should be regarded as an upper limit. Since 1:10 mergers are among the most common interaction events for galaxies at all cosmic epochs, we argue that several SMBH binaries should be detected with currently planned space-borne gravitational wave interferometers, whose sensitivity will be especially high for SMBHs in the mass range considered here.

Submitted 7 March, 2012; originally announced March 2012.

Comments: 9 pages, 11 figures, submitted to ApJ

arXiv:1201.1694

doi 10.1016/j.newast.2012.01.003

PSDF: Particle Stream Data Format for N-Body Simulations

Authors: Will M. Farr, Jeff Ames, Piet Hut, Junichiro Makino, Steve McMillan, Takayuki Muranushi, Koichi Nakamura, Keigo Nitadori, Simon Portegies Zwart

Abstract: We present a data format for the output of general N-body simulations, allowing the presence of individual time steps. By specifying a standard, different N-body integrators and different visualization and analysis programs can all share the simulation data, independent of the type of programs used to produce the data. Our Particle Stream Data Format, PSDF, is specified in YAML, based on the same approach as XML but with a simpler syntax. Together with a specification of PSDF, we provide background and motivation, as well as specific examples in a variety of computer languages. We also offer a web site from which these examples can be retrieved, to make it easy to augment existing codes with the option to produce PSDF output.

Submitted 9 January, 2012; originally announced January 2012.

Comments: 5 pages; submitted to New Astronomy

Journal ref: New Astronomy 17 (2012), pp. 520--523

arXiv:1104.2700

doi 10.1016/j.newast.2011.07.001

N-body simulation for self-gravitating collisional systems with a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions

Authors: Ataru Tanikawa, Kohji Yoshikawa, Takashi Okamoto, Keigo Nitadori

Abstract: We present a high-performance N-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of an Intel Core i7-2600 processor (8 MB cache and 3.40 GHz) based on the Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with an individual timestep scheme (Makino and Aarseth, 1992), and achieved a performance of 20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions (Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using the so-called NINJA scheme (Nitadori et al., 2006a), and achieved 90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ~ 10^5 on massively parallel systems with at most 800 cores with the Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster systems, such as the one with about 200 Tesla C1070 GPUs (Spurzem et al., 2010). This paper offers an alternative to collisional N-body simulations with GRAPEs and GPUs.

Submitted 5 September, 2011; v1 submitted 14 April, 2011; originally announced April 2011.

Comments: 14 pages, 9 figures, 3 tables, accepted for publication in New Astronomy. The code is publicly available at http://code.google.com/p/phantom-grape/

arXiv:1101.2020

doi 10.1088/0004-637X/767/2/146

The Cosmogrid Simulation: Statistical Properties of Small Dark Matter Halos

Authors: Tomoaki Ishiyama, Steven Rieder, Junichiro Makino, Simon Portegies Zwart, Derek Groen, Keigo Nitadori, Cees de Laat, Stephen McMillan, Kei Hiraki, Stefan Harfst

Abstract: We present the results of the "Cosmogrid" cosmological N-body simulation suites based on the concordance LCDM model. The Cosmogrid simulation was performed in a 30 Mpc box with 2048^3 particles. The mass of each particle is 1.28x10^5 Msun, which is sufficient to resolve ultra-faint dwarfs. We found that the halo mass function shows good agreement with the Sheth & Tormen fitting function down to ~10^7 Msun. We have analyzed the spherically averaged density profiles of the three most massive halos which are of galaxy group size and contain at least 170 million particles. The slopes of these density profiles become shallower than -1 at the innermost radius. We also find a clear correlation of halo concentration with mass. The mass dependence of the concentration parameter cannot be expressed by a single power law; however, a simple model based on the Press-Schechter theory proposed by Navarro et al. gives reasonable agreement with this dependence. The spin parameter does not show a correlation with the halo mass. The probability distribution functions for both concentration and spin are well fitted by the log-normal distribution for halos with masses larger than ~10^8 Msun. The subhalo abundance depends on the halo mass. Galaxy-sized halos have 50% more subhalos than ~10^{11} Msun halos have.

Submitted 8 April, 2013; v1 submitted 10 January, 2011; originally announced January 2011.

Comments: 15 pages, 18 figures, accepted by ApJ

Journal ref: 2013, ApJ, 767, 146

arXiv:1006.4159

doi 10.1111/j.1365-2966.2011.18313.x

Astrophysical Weighted Particle Magnetohydrodynamics

Authors: Evghenii Gaburov, Keigo Nitadori

Abstract: This paper presents applications of a weighted meshless scheme for conservation laws to the Euler equations and the equations of ideal magnetohydrodynamics. The divergence constraint of the latter is maintained to the truncation error by a new meshless divergence cleaning procedure. The physics of the interaction between the particles is described by a one-dimensional Riemann problem in a moving frame. As a result, the diffusion required to treat dissipative processes is added automatically, and our scheme has no free parameters that control the physics of the inter-particle interaction, with the exception of the number of interacting neighbours, which controls the resolution and accuracy. The resulting equations have a form similar to the SPH equations, and therefore existing SPH codes can be used to implement the weighted particle scheme. The scheme is validated in several hydrodynamic and MHD test cases. In particular, we demonstrate for the first time the ability of a meshless MHD scheme to model magneto-rotational instability in accretion disks.

Submitted 21 June, 2010; originally announced June 2010.

Comments: 27 pages, 24 figures, 1 column, submitted to MNRAS, hi-res version can be obtained at http://www.strw.leidenuniv.nl/~egaburov/wpmhd.pdf

arXiv:1001.0773

doi 10.1109/MC.2009.419

Simulating the universe on an intercontinental grid of supercomputers

Authors: Simon Portegies Zwart, Tomoaki Ishiyama, Derek Groen, Keigo Nitadori, Junichiro Makino, Cees de Laat, Stephen McMillan, Kei Hiraki, Stefan Harfst, Paola Grosso

Abstract: Understanding the universe is hampered by the elusiveness of its most common constituent, cold dark matter. Almost impossible to observe, dark matter can be studied effectively by means of simulation and there is probably no other research field where simulation has led to so much progress in the last decade. Cosmological N-body simulations are an essential tool for evolving density perturbations in the nonlinear regime. Simulating the formation of large-scale structures in the universe, however, is still a challenge due to the enormous dynamic range in spatial and temporal coordinates, and due to the enormous computer resources required. The dynamic range is generally dealt with by the hybridization of numerical techniques. We deal with the computational requirements by connecting two supercomputers via an optical network and making them operate as a single machine. This is challenging, if only for the fact that the supercomputers of our choice are separated by half the planet, as one is located in Amsterdam and the other is in Tokyo. The co-scheduling of the two computers and the 'gridification' of the code enable us to achieve a 90% efficiency for this distributed intercontinental supercomputer.

Submitted 5 January, 2010; originally announced January 2010.

Comments: Accepted for publication in IEEE Computer

arXiv:0708.0738

doi 10.1016/j.newast.2008.01.010

6th and 8th Order Hermite Integrator for N-body Simulations

Authors: Keigo Nitadori, Junichiro Makino

Abstract: We present sixth- and eighth-order Hermite integrators for astrophysical $N$-body simulations, which use the derivatives of accelerations up to second order ({\it snap}) and third order ({\it crackle}). These schemes do not require previous values for the corrector, and require only one previous value to construct the predictor. Thus, they are fairly easy to implement. The additional cost of the calculation of the higher order derivatives is not very high. Even for the eighth-order scheme, the number of floating-point operations for force calculation is only about two times larger than that for the traditional fourth-order Hermite scheme. The sixth-order scheme is better than the traditional fourth-order scheme for most cases. When the required accuracy is very high, the eighth-order one is the best. These high-order schemes have several practical advantages. For example, they allow a larger number of particles to be integrated in parallel than the fourth-order scheme does, resulting in higher execution efficiency in both general-purpose parallel computers and GRAPE systems.

Submitted 4 February, 2008; v1 submitted 6 August, 2007; originally announced August 2007.

Comments: 21 pages, 6 figures, New Astronomy accepted

Journal ref: NewAstron.13:498-507,2008

arXiv:astro-ph/0606105

High-Performance Small-Scale Simulation of Star Clusters Evolution on Cray XD1

Authors: Keigo Nitadori, Junichiro Makino, George Abe

Abstract: In this paper, we describe the performance of an $N$-body simulation of a star cluster with 64k stars on a Cray XD1 system with 400 dual-core Opteron processors. A number of astrophysical $N$-body simulations were reported in SCxy conferences. All previous entries for Gordon-Bell prizes used at least 700k particles. The reason for this preference for large numbers of particles is the parallel efficiency. It is very difficult to achieve high performance on large parallel machines if the number of particles is small. However, for many scientifically important problems the calculation cost scales as $O(N^{3.3})$, and it is very important to use large machines for a relatively small number of particles. We achieved 2.03 Tflops, or 57.7% of the theoretical peak performance, using a direct $O(N^2)$ calculation with the individual timestep algorithm, on 64k particles. The best efficiency previously reported for a similar calculation with 64k or fewer particles is 12% (9 Gflops) on a Cray T3E-600 with 128 processors. Our implementation is based on a highly scalable two-dimensional parallelization scheme, and the low-latency communication network of the Cray XD1 turned out to be essential to achieve this level of performance.

Submitted 7 June, 2006; v1 submitted 6 June, 2006; originally announced June 2006.

Comments: 12 pages, 4 figures, SC06 submitted

arXiv:astro-ph/0511062

doi 10.1016/j.newast.2006.07.007

Performance Tuning of N-Body Codes on Modern Microprocessors: I. Direct Integration with a Hermite Scheme on x86_64 Architecture

Authors: Keigo Nitadori, Junichiro Makino, Piet Hut

Abstract: The main performance bottleneck of gravitational N-body codes is the force calculation between two particles. We have succeeded in speeding up this pair-wise force calculation by factors between two and ten, depending on the code and the processor on which the code is run. These speedups were obtained by writing highly fine-tuned code for x86_64 microprocessors. Any existing N-body code, running on these chips, can easily incorporate our assembly code programs. In the current paper, we present an outline of our overall approach, which we illustrate with one specific example: the use of a Hermite scheme for a direct N^2 type integration on a single 2.0 GHz Athlon 64 processor, for which we obtain an effective performance of 4.05 Gflops, for double precision accuracy. In subsequent papers, we will discuss other variations, including the combinations of N log N codes, single precision implementations, and performance on other microprocessors.

Submitted 2 November, 2005; originally announced November 2005.

Comments: 32 pages, 2 figures

Journal ref: New Astron.12:169-181,2006

Showing 1–25 of 25 results for author: Nitadori, K