================================================================================
Efficiency using various settings for waccm3.5.7 runs on 64, 96, and 128 cpus
================================================================================

Here are some performance data, obtained by Brian Eaton, for runs on different
numbers of cpus.

The more cpus, the more costly the model per simulated year, because
performance scales less than linearly with the number of cpus. For a given
number of cpus, however, some configurations appear to be more efficient than
others. To get optimal performance, it looks important to use no more than
4 threads/task; it also appears that 2-d decomposition is better than 1-d
decomposition.

Notes:

Y-dom and Z-dom specify the division of the MPI tasks by y-domain (latitude)
and z-domain (vertical), so that Y-dom * Z-dom = #tasks. When Z-dom = 1,
Y-dom = #tasks and the decomposition is 1-d.

The configurations in the tables are constrained by the use of SMT, which
requires #threads/node to be 2 * #cpus/node (32 threads/node on blueice, since
the nodes have 16 cpus). Thus, in every table entry below, where "threads" is
the number of threads per task, tasks/node * threads = 32.

All entries in a given table use the same number of cpus, at 16 cpus/node. The
number of tasks/node is therefore always #tasks/#nodes, where #nodes =
#cpus/16 for that table. For example, all the configurations in the first
table use 8 nodes (8 nodes x 16 cpus/node = 128 cpus), so the first entry has
256 tasks / 8 nodes = 32 tasks/node, and so on.
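
The bookkeeping in the notes above can be checked mechanically. The following
is a minimal sketch in Python, not part of the original benchmark tooling; the
function name `layout` and its return format are hypothetical and chosen only
for illustration. It derives #nodes, #tasks and tasks/node from a
configuration and rejects any layout that violates the SMT constraint.

```python
# Assumptions taken from the notes: blueice nodes have 16 cpus, and SMT
# requires 2 threads per cpu, i.e. 32 threads per node.
CPUS_PER_NODE = 16
THREADS_PER_NODE = 2 * CPUS_PER_NODE


def layout(cpus, y_dom, z_dom, threads):
    """Return the derived layout for one configuration (hypothetical helper).

    cpus    -- total number of cpus for the run
    y_dom   -- number of MPI tasks across latitude
    z_dom   -- number of MPI tasks across the vertical
    threads -- threads per MPI task
    """
    if cpus % CPUS_PER_NODE != 0:
        raise ValueError("cpus must be a multiple of %d" % CPUS_PER_NODE)
    nodes = cpus // CPUS_PER_NODE

    tasks = y_dom * z_dom                      # Y-dom * Z-dom = #tasks
    if tasks % nodes != 0:
        raise ValueError("tasks do not divide evenly over the nodes")
    tasks_per_node = tasks // nodes            # #tasks / #nodes

    # SMT constraint: tasks/node * threads must equal 32 on blueice.
    if tasks_per_node * threads != THREADS_PER_NODE:
        raise ValueError("tasks/node * threads must be %d" % THREADS_PER_NODE)

    return {
        "nodes": nodes,
        "tasks": tasks,
        "tasks_per_node": tasks_per_node,
        "decomposition": "1-d" if z_dom == 1 else "2-d",
    }


if __name__ == "__main__":
    # First entry of the 128-cpu table: 32 x 8 tasks, 1 thread per task.
    print(layout(128, 32, 8, 1))
```

For instance, `layout(128, 8, 8, 4)` describes the 64-task, 4-thread
configuration on 8 nodes, while `layout(128, 8, 8, 8)` raises an error because
8 tasks/node * 8 threads exceeds 32 threads/node.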

Tables:

Performance/cost of WACCM on 8 blueice nodes (128 cpus, 2672 GAU/wallclock day):
===========================================================
tasks  Y-dom  Z-dom  threads  tasks/node  yrs/day  GAU/yr
===========================================================
  256     32      8        1          32     2.43    1100
  128     16      8        2          16     2.77     965
  128     32      4        2          16     2.65    1009
   64      8      8        4           8     2.83     945  <== optimal
   64     16      4        4           8     2.78     962
   64     32      2        4           8     2.69     994
   32      4      8        8           4     2.44    1095
   32      8      4        8           4     2.57    1040
   32     16      2        8           4     2.59    1032
   32     32      1        8           4     2.61    1024

Performance/cost of WACCM on 4 blueice nodes (64 cpus, 1336 GAU/wallclock day):
===========================================================
tasks  Y-dom  Z-dom  threads  tasks/node  yrs/day  GAU/yr
===========================================================
  128     32      4        1          32     1.51     885
  128     16      8        1          32     1.56     856
   64     32      2        2          16     1.55     862
   64     16      4        2          16     1.64     815
   64      8      8        2          16     1.65     810  <== optimal
   32     32      1        4           8     1.63     820
   32     16      2        4           8     1.60     835
   32      8      4        4           8     1.65     810  <== optimal
   32      4      8        4           8     1.59     840
   16     16      1        8           4     1.60     835
   16      8      2        8           4     1.50     891
   16      4      4        8           4     1.50     891
   16      2      8        8           4     1.38     968

Performance/cost of WACCM on 6 blueice nodes (96 cpus, 2004 GAU/wallclock day):
===========================================================
tasks  Y-dom  Z-dom  threads  tasks/node  yrs/day  GAU/yr
===========================================================
  192     32      6        1          32     1.97    1017
  192     16     12        1          32     2.01     997
   96     32      3        2          16     2.12     945
   96     16      6        2          16     2.22     903
   96      8     12        2          16     2.17     924
   48     24      2        4           8     2.22     903
   48     16      3        4           8     2.21     907
   48     12      4        4           8     2.29     875  <== optimal
   48      8      6        4           8     2.26     887
   48      4     12        4           8     2.08     963
   24     24      1        8           4     2.15     932
   24     12      2        8           4     2.10     954
   24      8      3        8           4     2.06     973
   24      6      4        8           4     2.07     968
   24      4      6        8           4     1.97    1017
   24      3      8        8           4     1.92    1044
   24      2     12        8           4     1.77    1132
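
The GAU/yr column in the tables above appears to be the charge rate (GAU per
wallclock day, given in each table heading) divided by the throughput (yrs/day,
taken here to mean simulated years per wallclock day). The sketch below, in
Python, works through that arithmetic. The function names are hypothetical, and
the per-cpu rate of 20.875 GAU/day is inferred from the three table headings
(2672/128 = 1336/64 = 2004/96), not stated in the source. Values computed this
way agree with the tabulated ones to within about 1 GAU, presumably because
the yrs/day figures are rounded.

```python
# Inferred from the table headings; an assumption, not a documented rate.
GAU_PER_CPU_DAY = 2672 / 128  # 20.875 GAU per cpu per wallclock day


def gau_per_wallclock_day(cpus):
    """Charge rate for a run on the given number of cpus."""
    return GAU_PER_CPU_DAY * cpus


def gau_per_sim_year(cpus, yrs_per_day):
    """Cost of one simulated year at the given throughput."""
    return gau_per_wallclock_day(cpus) / yrs_per_day


if __name__ == "__main__":
    # Best throughput (yrs/day) listed in each table, keyed by cpu count.
    best = {64: 1.65, 96: 2.29, 128: 2.83}
    for cpus in sorted(best):
        cost = gau_per_sim_year(cpus, best[cpus])
        print("%4d cpus: %.2f yrs/day, about %.0f GAU/yr"
              % (cpus, best[cpus], cost))
```

Comparing the best entry of each table this way makes the less-than-linear
scaling concrete: doubling from 64 to 128 cpus raises the best throughput from
1.65 to 2.83 yrs/day, while the cost per simulated year rises from about 810
to about 945 GAU.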