There are different ways to parallelize a planewave code. In PEtot, the parallelization is done in both G-space and real space (using MPI). Each processor holds all the wavefunctions, but only part of the G-space and real-space data. The data distribution is chosen for (1) load balance in both the amount of data and the computation, and (2) minimum communication in the FFT. In G-space, the Ecut1 G sphere is divided into columns, each one grid point wide, along the n1 direction. The columns are then distributed among the processors in such a way that each processor has roughly the same total number of G points and roughly the same number of columns (the number of columns is related to the FFT computation). This data distribution is illustrated in the following figure (note: the z, y, x directions correspond to n1, n2, n3 respectively; one color corresponds to one processor: P0, P1, P2).
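The column distribution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PEtot routine: it assumes a cubic cell with integer G vectors and a hypothetical radius gmax in place of Ecut1, and it uses a simple greedy rule (longest column first, given to the processor with the fewest G points so far) that the text does not specify. The function names are made up for this example.

```python
from collections import defaultdict

def columns_in_sphere(gmax):
    """Return {(i2, i3): number of G points} for integer G with |G| <= gmax.

    Assumption: cubic cell, so G = (i1, i2, i3) in grid units. Each key
    labels one column running along the n1 direction.
    """
    cols = defaultdict(int)
    r = int(gmax)
    for i1 in range(-r, r + 1):
        for i2 in range(-r, r + 1):
            for i3 in range(-r, r + 1):
                if i1 * i1 + i2 * i2 + i3 * i3 <= gmax * gmax:
                    cols[(i2, i3)] += 1
    return dict(cols)

def distribute_columns(cols, nproc):
    """Greedy assignment (an assumed rule, not taken from PEtot).

    Columns are visited from longest to shortest; each one goes to the
    processor that currently has the fewest G points, with ties broken
    by the fewest columns.
    """
    points = [0] * nproc   # G points per processor
    ncols = [0] * nproc    # columns per processor
    owner = {}
    for key, length in sorted(cols.items(), key=lambda kv: (-kv[1], kv[0])):
        p = min(range(nproc), key=lambda q: (points[q], ncols[q]))
        owner[key] = p
        points[p] += length
        ncols[p] += 1
    return owner, points, ncols

cols = columns_in_sphere(6.0)
owner, points, ncols = distribute_columns(cols, 3)
print("G points per processor:", points)
print("columns per processor: ", ncols)
```

With this rule the G-point counts of any two processors differ by at most the length of the longest column. The column counts are only approximately balanced.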
In real space, the grid is divided into node layers along the n1 direction, where node is the number of processors (a partial layer is used if n1 is not divisible by node, as shown in the figure below). Each processor holds the data on one such layer. This is illustrated in the following figure (again, the z, y, x directions correspond to n1, n2, n3 respectively; one color corresponds to one processor: P0, P1, P2).
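One possible reading of the real-space layout is sketched below in Python. It is a guess at the bookkeeping, not the PEtot code: it assumes that n1 is the slowest grid index, that the n1*n2 grid columns (each of n3 points) are numbered c = i1*n2 + i2, and that each processor receives one contiguous range of columns. Under these assumptions a layer boundary may fall inside an n1 plane, which is what "partial layer" is taken to mean here. The function names are hypothetical.

```python
def real_space_slab(n1, n2, rank, nproc):
    """Return the half-open column range [start, stop) owned by `rank`.

    Columns are numbered c = i1 * n2 + i2 (assumed ordering), so a
    contiguous range of columns is a layer along the n1 direction.
    The first (n1*n2) % nproc ranks get one extra column.
    """
    ncol = n1 * n2
    base, extra = divmod(ncol, nproc)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def layer_extent(n2, start, stop):
    """Layer boundaries in units of n1 planes (may be fractional)."""
    return start / n2, stop / n2

n1, n2, nproc = 10, 8, 3
slabs = [real_space_slab(n1, n2, r, nproc) for r in range(nproc)]
for r, (a, b) in enumerate(slabs):
    lo, hi = layer_extent(n2, a, b)
    print(f"P{r}: columns [{a}, {b}) -> n1 from {lo:.2f} to {hi:.2f}")
```

Here n1 = 10 is not divisible by 3 processors, so each layer is a little over three n1 planes thick and the boundaries fall inside a plane.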

The parallel Fast Fourier Transform (FFT) routine, which transforms the distributed G-space data to distributed real-space data (or vice versa), was developed by Andrew Canning.
In PEtot, we have two levels of parallelization.
The main level is the G-space parallelization as described above.
However, calculations on small systems
(e.g., bulk, metal, surface) often have many k-points (e.g., in
the kpt.file file). Since such systems are usually not very large,
the G-space parallelization does not scale well to a large
number of processors. Here, we use another level of parallelization:
the parallelization over the k-points. The basic diagram is given
below. Each group of processors contains nnodes_k
processors. Within a group, the processors parallelize the G-space (and real
space) data as described above for the calculation of one k-point, while
different groups calculate different
k-points in parallel. For example, in the diagram below, num_group
is 3; if there are 6 k-points in total, each group works
on only 2 k-points. This parallelization is used to solve (improve) the
Kohn-Sham equation for the wavefunctions. For the charge density
update, and for calculating the charge density and potentials, all the
groups do the same work (wastefully repeating it). As a result,
the charge density and potential parallelization is done only in
G-space by nnodes_k processors (not by nnodes_k*num_group processors).
Fortunately, this part of the calculation is usually small. However,
whenever there is a do loop over the atoms (e.g., to calculate the
forces), the calculation is distributed among the different groups. Since
the communication requirement between different k-points is very small,
the k-point parallelization scales well with num_group
(as long as there are enough k-points).
The total number of processors for the whole job
is nnodes_k*num_group. The values nnodes_k and num_group are given on the
first line of etot.input.
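The two-level layout can be made concrete with a short Python sketch. It is an illustration only: the text does not say how ranks are numbered within groups or how k-points are assigned to groups, so the sketch assumes that consecutive ranks form one group and that k-points are dealt out round-robin. The function names are hypothetical.

```python
def group_layout(nnodes_k, num_group):
    """Return (rank, group, rank_in_group) for every processor.

    Assumption: consecutive ranks belong to the same group, so
    group = rank // nnodes_k and rank_in_group = rank % nnodes_k.
    """
    total = nnodes_k * num_group
    return [(rank, rank // nnodes_k, rank % nnodes_k) for rank in range(total)]

def kpoints_for_group(nkpt, num_group, group):
    """K-point indices handled by `group` (assumed round-robin assignment)."""
    return [k for k in range(nkpt) if k % num_group == group]

nnodes_k, num_group, nkpt = 4, 3, 6
layout = group_layout(nnodes_k, num_group)
print("total processors:", len(layout))
for g in range(num_group):
    ranks = [rank for rank, grp, _ in layout if grp == g]
    print(f"group {g}: ranks {ranks}, k-points {kpoints_for_group(nkpt, num_group, g)}")
```

With nnodes_k = 4 and num_group = 3 the job uses 12 processors, and with 6 k-points each group handles 2 of them, matching the example in the text.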
On some machines (for example, Seaborg at
NERSC), where several processors are placed within one node and share
memory, it is a good idea to place the processors of one group
inside one node (so that the communication between them, which is
required in the G-space parallelization, is fast).