The time report printed at the end of a pw.x run contains a lot of useful
information that can be used to identify bottlenecks and improve performance.
The following applies to calculations taking a sizable amount of time
(at least minutes); for short calculations (a few seconds), the time spent in
the various initializations dominates. Any significant deviation from the
picture below signals some anomaly. (A command to extract the relevant lines
from the output is shown after the list.)
- For a typical job with norm-conserving PPs, the total (wall) time is mostly 
  spent in routine "electrons", calculating the self-consistent solution. 
- Most of the time spent in "electrons" is used by routine "c_bands", 
  calculating Kohn-Sham states. "sum_band" (calculating the charge density),
  "v_of_rho" (calculating the potential), "mix_rho" (charge density mixing)
  should take a small fraction of the time.
- Most of the time spent in "c_bands" is used by routines "cegterg" (k-points)
  or "regterg" (Gamma-point only), performing iterative diagonalization of
  the Kohn-Sham Hamiltonian in the PW basis set. 
- Most of the time spent in "*egterg" is used by routine "h_psi",
  calculating Hψ products. "cdiaghg" (k-points) or "rdiaghg" (Gamma-only), 
  performing subspace diagonalization, should take only a small fraction.
- Among the "general routines", most of the time is spent in FFTs on Kohn-Sham
  states: "fftw", and to a smaller extent in other FFTs, "fft" and "ffts",
  and in "calbec", calculating 〈β|ψ〉 products.
- Forces and stresses typically take of the order of 10 to 20%
  of the total time.
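A quick way to inspect these numbers is to pull the corresponding lines out of
the output file, for instance with a command along these lines. This is a
minimal sketch only: it assumes the usual "name : ...s CPU ...s WALL" layout of
the final report and a hypothetical output file name (pw.scf.out); adjust both
to your version and setup.

    grep -E ' (electrons|c_bands|sum_band|v_of_rho|newd|mix_rho|cegterg|regterg|h_psi|cdiaghg|rdiaghg|fftw|fft|ffts|fft_scatter|calbec) *:' pw.scf.out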
For PAW and ultrasoft pseudopotentials you will see a larger contribution from
"sum_band" and a non-negligible "newd" contribution to the time spent in
"electrons", but the overall picture is unchanged. You may drastically reduce
the overhead of ultrasoft pseudopotentials by using input option "tqr=.true."
(a sketch of the relevant input follows).
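For instance, the relevant namelist could look as follows. This is a minimal
sketch: in current versions "tqr" is read from the &ELECTRONS namelist (check
the INPUT_PW documentation for your version), and the other value shown is
just a placeholder.

    &ELECTRONS
       conv_thr = 1.0d-8
       tqr      = .true.
    /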
The various parallelization levels should be used wisely in order to
achieve good results. Let us summarize their effects on CPU time
(a command-line example follows the two lists):
- Parallelization on FFT speeds up (with varying efficiency) almost 
  all routines, with the notable exception of "cdiaghg" and "rdiaghg".
- Parallelization on k-points speeds up (almost linearly) "c_bands" and 
  called routines; speeds up partially "sum_band"; does not speed up
  at all "v_of_rho", "newd", "mix_rho".
- Linear-algebra parallelization speeds up (though not always) "cdiaghg" and
  "rdiaghg".
- "Task-group" parallelization speeds up "fftw".
- OpenMP parallelization speeds up "fftw", plus selected parts of the 
  calculation, plus (depending on the availability of OpenMP-aware
  libraries) some linear algebra operations
and on RAM:
- Parallelization on FFT distributes most arrays across processors
  (i.e. all G-space and R-space arrays) but not all of them (in
  particular, not the subspace Hamiltonian and overlap matrices).
- Linear-algebra parallelization also distributes subspace Hamiltonian
  and overlap matrices.
- All other parallelization levels do not distribute any memory.
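To fix the notation, here is how these levels are selected on the pw.x command
line. The numbers are purely illustrative (a hypothetical job on 64 MPI
processes, with hypothetical input/output file names); -nk selects the number
of k-point pools, -ntg the number of task groups, and -ndiag the number of
processes used for linear-algebra parallelization:

    mpirun -np 64 pw.x -nk 8 -ntg 2 -ndiag 4 -i pw.scf.in > pw.scf.out

Here each of the 8 pools uses 64/8 = 8 processes for FFT parallelization,
split into 2 task groups, while a 2x2 grid of processes handles the subspace
diagonalization.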
In an ideally parallelized run, you should observe the following:
- CPU and wall time do not differ by much
- Time usage is still dominated by the same routines as for the serial run
- Routine "fft_scatter" (called by parallel FFT) takes a sizable part of
  the time spent in FFTs but does not dominate it.
You need to know
- the number of k-points, Nk
- the third dimension of the (smooth) FFT grid, N3
- the number of Kohn-Sham states, M
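All three can be read from the header of the output file; for example
(assuming the usual header lines printed by pw.x and a hypothetical output
file name):

    grep -E 'number of k points|Kohn-Sham states|FFT dimensions' pw.scf.out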
These data allow you to set bounds on parallelization:
- k-point parallelization is limited to Nk processor pools: 
  -nk Nk
- FFT parallelization shouldn't exceed N3 processors, i.e. if you
  run with -nk Nk, use at most N = Nk×N3 MPI processes (mpirun -np N ...)
- Unless M is a few hundred or more, don't bother using linear-algebra
  parallelization
You will need to experiment a bit to find the best compromise. In order
to have good load balancing among MPI processes, the number of k-point
pools should be an integer divisor of Nk; the number of processors for
FFT parallelization should be an integer divisor of N3.
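As a purely illustrative example (the numbers are made up): suppose the output
reports Nk = 12 k-points and a smooth-grid third dimension N3 = 96. Then
-nk 2, 3, 4, 6 or 12 keeps the pools balanced, and with -nk 4 at most
4×96 = 384 MPI processes make sense. A run on 128 processes could look like

    mpirun -np 128 pw.x -nk 4 -i pw.scf.in > pw.scf.out

which leaves 128/4 = 32 processes per pool for FFT parallelization (and 32 is
an integer divisor of 96).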
Typical symptoms of bad or inadequate parallelization, with possible solutions:
- A large fraction of time is spent in "v_of_rho", "newd", "mix_rho", or
  the time doesn't scale well, or doesn't scale at all, when increasing the
  number of processors for k-point parallelization. Solution:
  - use (also) FFT parallelization if possible

- A disproportionate amount of time is spent in "cdiaghg"/"rdiaghg". Solutions:
  - use (also) k-point parallelization if possible
  - use linear-algebra parallelization, with ScaLAPACK if possible

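For instance (illustrative numbers only; this assumes pw.x was built and
linked with ScaLAPACK):

    mpirun -np 256 pw.x -nk 4 -ndiag 64 -i pw.scf.in > pw.scf.out

Here each pool has 256/4 = 64 processes, and an 8x8 ScaLAPACK grid handles
the subspace diagonalization.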
- A disproportionate amount of time is spent in "fft_scatter", or in
  "fft_scatter" the difference between CPU and wall time is large. Solutions:
  - if you do not have fast (better than Gigabit Ethernet) communication
    hardware, do not try FFT parallelization on more than 4 or 8 processors
  - use (also) k-point parallelization if possible

- The time doesn't scale well, or doesn't scale at all, when increasing the
  number of processors for FFT parallelization. Solution:
  - use "task groups": try command-line option -ntg 4 or -ntg 8; this may
    improve your scaling

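For example (again with made-up numbers), task groups combine with the other
levels as

    mpirun -np 128 pw.x -nk 2 -ntg 4 -i pw.scf.in > pw.scf.out

so that the 128/2 = 64 processes of each pool are split into 4 task groups
of 16.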