Basic schemes for planewave parallelization

----------

The G-space parallelization for each k-point

There are different ways to parallelize a planewave code. In PEtot, the parallelization is done in both G-space and real space (using MPI). Each processor holds all the wavefunctions, but only part of the G-space and real-space data. The data distribution is chosen for (1) load balance in both the amount of data and the computation, and (2) minimum communication in the FFT. In G-space, the Ecut1 G sphere is divided into columns along the n1 direction, each one grid point wide. The columns are then distributed among the processors so that each processor has roughly the same total number of G points and roughly the same number of columns (the latter is related to the FFT computation). This data distribution is illustrated in the following figure (note: the z, y, x directions correspond to n1, n2, n3 respectively, and each color corresponds to one processor: P0, P1, P2).
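The column distribution described above can be sketched in a few lines of Python. This is only an illustration, not the PEtot routine: the greedy "longest column first" rule, the helper names column_lengths and distribute_columns, and the use of a squared radius in grid units (instead of a radius derived from Ecut1) are all assumptions made for the example.

```python
import heapq
from math import isqrt


def column_lengths(radius_sq):
    """Count the G points in every n1-direction column of a sphere.

    Hypothetical helper: a column is labelled by integer (i2, i3) and
    radius_sq is the squared sphere radius in grid units.
    """
    r = isqrt(radius_sq)
    cols = {}
    for i2 in range(-r, r + 1):
        for i3 in range(-r, r + 1):
            rest = radius_sq - i2 * i2 - i3 * i3
            if rest >= 0:
                # The column runs from -half to +half along n1.
                cols[(i2, i3)] = 2 * isqrt(rest) + 1
    return cols


def distribute_columns(cols, nproc):
    """Greedy assignment (an assumed heuristic): take the columns from
    longest to shortest and give each one to the processor that currently
    has the fewest G points, breaking ties by the number of columns."""
    heap = [(0, 0, p) for p in range(nproc)]  # (G points, columns, processor)
    heapq.heapify(heap)
    owner = {}
    for col, length in sorted(cols.items(), key=lambda kv: -kv[1]):
        points, ncols, p = heapq.heappop(heap)
        owner[col] = p
        heapq.heappush(heap, (points + length, ncols + 1, p))
    loads = sorted(heap, key=lambda t: t[2])  # one entry per processor
    return owner, loads
```

With this rule, the G-point counts of any two processors differ by at most the length of the longest column. The column counts are only balanced approximately, through the tie-breaking.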

----------

In real space, the grid is divided into node layers along the n1 direction, where node is the number of processors (a partial layer is used if n1 is not divisible by node, as shown in the figure below). Each processor holds the data on one such layer. This is illustrated in the following figure (again, the z, y, x directions correspond to n1, n2, n3 respectively, and each color corresponds to one processor: P0, P1, P2).
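As a rough illustration of the layer decomposition, the sketch below splits the n1 grid planes into contiguous layers, one per processor. How PEtot actually handles the remainder (the "partial layer") is not specified here; giving the first few processors one extra plane is an assumption, and the function name layer_ranges is hypothetical.

```python
def layer_ranges(n1, node):
    """Split n1 grid planes into `node` contiguous layers.

    Returns one half-open (start, stop) range of n1 indices per processor.
    Assumption: when n1 is not divisible by node, the first n1 % node
    processors each receive one extra plane.
    """
    base, extra = divmod(n1, node)
    ranges = []
    start = 0
    for p in range(node):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, layer_ranges(10, 3) gives the three processors 4, 3 and 3 planes respectively.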

----------

The parallel fast Fourier transform (FFT) routine, which transforms the distributed G-space data to distributed real-space data (or vice versa), was developed by Andrew Canning.

----------

The k-space parallelization for all the k-points

In PEtot, we have two levels of parallelization. The main level is the G-space parallelization described above. However, calculations for small systems (e.g., bulk, metal, or surface) often involve many k-points (e.g., in the kpt.file file). Since these systems are usually not very large, the G-space parallelization does not scale well to a large number of processors. Here, we use another level of parallelization: the parallelization over the k-points. The basic diagram is given below. Each group of processors contains nnodes_k processors. These processors parallelize the G-space (and real space) as described above for one k-point calculation, while the processors in different groups calculate different k-points in parallel. For example, in the diagram below, num_group is 3; if there are 6 k-points in total, then each group works on only 2 k-points. This parallelization is used to solve (improve) the Kohn-Sham equation for the wavefunctions.

For the charge density update, and for the calculation of the charge density and potentials, all the groups do the same thing (wastefully repeating the work). As a result, the charge density and potential parallelization is also done in G-space by nnodes_k processors (not by nnodes_k*num_group processors). Fortunately, this part of the calculation is usually small. However, whenever there is a do loop over the atoms (e.g., to calculate the force), the calculation is distributed among the different groups. Since the communication requirement between different k-points is very small, the k-point parallelization scales well with num_group (as long as there are enough k-points).

The total number of processors for the whole job is nnodes_k*num_group. The two values "nnodes_k, num_group" are given on the first line of etot.input.
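The two-level layout can be sketched as follows. This is a minimal illustration under stated assumptions: that consecutive ranks form one group, and that k-points are dealt out to the groups in round-robin order. Neither is confirmed by the description above, and the function names group_layout and kpoints_for_group are hypothetical.

```python
def group_layout(rank, nnodes_k, num_group):
    """Map a global processor rank to (group, rank inside the group).

    Assumption: ranks 0 .. nnodes_k-1 form group 0, the next nnodes_k
    ranks form group 1, and so on.
    """
    if not 0 <= rank < nnodes_k * num_group:
        raise ValueError("rank outside the nnodes_k*num_group processors")
    return divmod(rank, nnodes_k)


def kpoints_for_group(group, num_group, nkpt):
    """Indices of the k-points handled by one group.

    Assumption: round-robin assignment (k-point i goes to group
    i % num_group); a block assignment would balance equally well.
    """
    return list(range(group, nkpt, num_group))
```

With num_group = 3 and 6 k-points, each group receives 2 k-points, matching the example in the text. In an MPI code, the (group, rank inside the group) pair would typically serve as the color and key when splitting the global communicator into one communicator per group.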

On some machines (for example, Seaborg at NERSC), several processors are placed within one node and share its memory. In that case, it is a good idea to place the processors of one group inside one node, so that the communication between them, which the G-space parallelization requires, is fast.

----------
