Optimisation
Load balancing
The following describes the workflow for generating a suite of simulations to use for load balancing. We use the following tools: access-experiment-generator and access-experiment-runner.
To access these Python modules on Gadi, run: module purge; module use /g/data/vk83/prerelease/modules; module load payu/dev; module list. Documentation for the experiment generator is available here.
Using the Experiment generator to create simulation suite
At the time of writing, some changes to the config are not supported by access-experiment-generator, so we need to make small manual changes to the following files to turn off CICE (we'll start by testing with MOM6 only):
nuopc.runconfig;nuopc.runseq
Changes are detailed here.
In this case, this is done on a branch of a fork. For example, starting from the dev-MC_100km_jra_ryf branch, one creates a new branch based on it called mom6_only_100kmprofiling. Specifically, in nuopc.runconfig:
we remove ICE from the line component_list: MED ATM ICE OCN ROF (example).
In nuopc.runseq, the following lines should be commented out (example):
MED med_phases_prep_ice
MED -> ICE :remapMethod=redist
ICE
MED med_phases_diag_ice_ice2med
ICE -> MED :remapMethod=redist
MED med_phases_post_ice
MED med_phases_diag_ice_med2ice
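The edits to nuopc.runseq can also be scripted. The sketch below is an illustration only and is not part of access-experiment-generator: it assumes that `#` starts a comment in the run sequence file, and the helper name `comment_out_ice` is hypothetical. It comments out exactly the seven lines listed above and leaves everything else untouched.

```python
# Lines of nuopc.runseq that involve the ICE component (as listed above).
ICE_LINES = {
    "MED med_phases_prep_ice",
    "MED -> ICE :remapMethod=redist",
    "ICE",
    "MED med_phases_diag_ice_ice2med",
    "ICE -> MED :remapMethod=redist",
    "MED med_phases_post_ice",
    "MED med_phases_diag_ice_med2ice",
}


def comment_out_ice(runseq_text: str) -> str:
    """Return runseq_text with the ICE-related lines commented out.

    Hypothetical helper; assumes '#' marks a comment in nuopc.runseq.
    """
    out = []
    for line in runseq_text.splitlines():
        # Compare on whitespace-normalised content so indentation is ignored.
        if " ".join(line.split()) in ICE_LINES:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}# {line.lstrip()}")
        else:
            out.append(line)
    return "\n".join(out)


# Made-up three-line fragment, used only to show the behaviour.
print(comment_out_ice("  ICE\n  OCN\n  MED -> ICE :remapMethod=redist"))
```

Matching whole lines, rather than searching for the substring "ICE", avoids touching unrelated entries that merely contain those letters.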
We also make small modifications to the config.yaml, namely
platform:
  nodesize: 104
  nodemem: 512
queue: normalsr
metadata:
  enable: true
We then create a .yaml file to set up the simulations. An example (called pr-mom6.yaml) is worked through below; other applications should be adaptable from it.
model_type: access-om3 # Specify the model ("access-om2", "access-om3", "access-esm1.5", or "access-esm1.6")
repository_url: git@github.com:chrisb13/access-om3-configs.git
start_point: "908df0da4970fd9128d79ec5b4c567948fe8123c" # Control commit hash for new branches
test_path: "." # All control and perturbation experiment repositories will be created here; can be relative, absolute or ~ (user-defined)
repository_directory: mom6_only # Local directory name for the central repository (user-defined)
control_branch_name: ctrl
Control_Experiment:
repository_url and start_point specify the fork and commit created earlier.
We then specify a suite of runs; here, 50 MOM6 standalone runs will be set up:
Perturbation_Experiment:
  mom6_only:
    branches: ['mom6_1', 'mom6_2', 'mom6_3', 'mom6_4', 'mom6_5', 'mom6_6', 'mom6_7', 'mom6_8', 'mom6_9', 'mom6_10', 'mom6_11', 'mom6_12', 'mom6_13', 'mom6_14', 'mom6_15', 'mom6_16', 'mom6_17', 'mom6_18', 'mom6_19', 'mom6_20', 'mom6_21', 'mom6_22', 'mom6_23', 'mom6_24', 'mom6_25', 'mom6_26', 'mom6_27', 'mom6_28', 'mom6_29', 'mom6_30', 'mom6_31', 'mom6_32', 'mom6_33', 'mom6_34', 'mom6_35', 'mom6_36', 'mom6_37', 'mom6_38', 'mom6_39', 'mom6_40', 'mom6_41', 'mom6_42', 'mom6_43', 'mom6_44', 'mom6_45', 'mom6_46', 'mom6_47', 'mom6_48', 'mom6_49', 'mom6_50']
    MOM_input:
      AUTO_IO_LAYOUT_FAC: REMOVE
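Typing 50 branch names by hand is error-prone, so the list can be generated instead. This is a minimal sketch; the variable names are illustrative, and the `mom6_` prefix simply follows the example above.

```python
n_runs = 50  # number of perturbation branches in this example

# Branch names mom6_1 ... mom6_50, in the order used by the other lists.
branches = [f"mom6_{k}" for k in range(1, n_runs + 1)]
print(branches)
```

The printed list can be pasted directly after `branches:` in pr-mom6.yaml.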
We also need to turn on various ESMF diagnostics to enable the relevant profiling. By default, ESMF tracing produces as many files as there are cores; ESMF_RUNTIME_TRACE_PETLIST restricts tracing to the listed PETs. Here 0 is output from NUOPC (related to the choice of atm_ntasks: &ntasks52 [52]; discussed below) and 52 is the first ocean core (the most important for profiling purposes). The value 52 would increase with a larger atm_ntasks.
    config.yaml:
      env:
        ESMF_RUNTIME_PROFILE: "on"
        ESMF_RUNTIME_TRACE: "on"
        ESMF_RUNTIME_TRACE_PETLIST: "0 52"
        ESMF_RUNTIME_PROFILE_OUTPUT: "SUMMARY"
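Because the second entry of the PET list has to track atm_ntasks, it can help to derive the string rather than hard-code it. The sketch below assumes, as in this example, that the ocean's first PET equals atm_ntasks (that is, ocn_rootpe is set to the same value); the variable names are illustrative.

```python
atm_ntasks = 52          # matches atm_ntasks in nuopc.runconfig
ocn_rootpe = atm_ntasks  # the ocean starts on the first PET after the atmosphere block

# PET 0 plus the first ocean PET, space separated as ESMF expects.
petlist = f"0 {ocn_rootpe}"
print(petlist)  # 0 52
```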
Here in ncpus we specify the total number of CPUs to be allocated. Because atm_ntasks is fixed, this effectively determines the number of cores available to MOM6, since atm_ntasks + ocn_ntasks = ncpus (see below for where ocn_ntasks is set). Details of how to calculate ncpus and mem are given below:
      ncpus: [65, 117, 169, 221, 273, 325, 377, 429, 481, 533, 585, 637, 689, 741, 793, 845, 897, 949, 1001, 1053, 1105, 1157, 1209, 1261, 1313, 1365, 1417, 1469, 1521, 1573, 1625, 1677, 1729, 1781, 1833, 1885, 1937, 1989, 2041, 2093, 2145, 2197, 2249, 2301, 2353, 2405, 2457, 2509, 2561, 2613]
      mem: ['500GB', '1000GB', '1000GB', '1500GB', '1500GB', '2000GB', '2000GB', '2500GB', '2500GB', '3000GB', '3000GB', '3500GB', '3500GB', '4000GB', '4000GB', '4500GB', '4500GB', '5000GB', '5000GB', '5500GB', '5500GB', '6000GB', '6000GB', '6500GB', '6500GB', '7000GB', '7000GB', '7500GB', '7500GB', '8000GB', '8000GB', '8500GB', '8500GB', '9000GB', '9000GB', '9500GB', '9500GB', '10000GB', '10000GB', '10500GB', '10500GB', '11000GB', '11000GB', '11500GB', '11500GB', '12000GB', '12000GB', '12500GB', '12500GB', '13000GB']
      repeat: True
    nuopc.runconfig:
      PELAYOUT_attributes:
        atm_ntasks: &ntasks52 [52]
        cpl_ntasks: *ntasks52
        ice_ntasks: REMOVE
        ice_nthreads: REMOVE
        ice_pestride: REMOVE
        ice_rootpe: REMOVE
        ocn_ntasks: [13, 65, 117, 169, 221, 273, 325, 377, 429, 481, 533, 585, 637, 689, 741, 793, 845, 897, 949, 1001, 1053, 1105, 1157, 1209, 1261, 1313, 1365, 1417, 1469, 1521, 1573, 1625, 1677, 1729, 1781, 1833, 1885, 1937, 1989, 2041, 2093, 2145, 2197, 2249, 2301, 2353, 2405, 2457, 2509, 2561]
        ocn_rootpe: *ntasks52
        rof_ntasks: *ntasks52
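We want three runs of each configuration (150 runs in total), so we set repeat to True in config.yaml, which makes payu repeat the initial run rather than continue from the previous run's restart.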
To define ocn_ntasks, we increase the number of ocean cores by atm_ntasks (52) in each run. This Python snippet generates the list: print([13+(k*52) for k in range(50)]). For higher-resolution configurations, one would normally start at a higher number (e.g. for global 25km we would start at 52). ncpus is then ocn_ntasks plus atm_ntasks, which this snippet generates: print([13+(k*52)+52 for k in range(50)]). mem is determined by how many nodes are in use; each node contains 104 cores and has 500GB of memory available. It can be computed with: import math;nodenumber=[math.ceil((13+(k*52)+52) /104) for k in range(50)];print([str(n*500)+'GB' for n in nodenumber]). When adapting these snippets to your configuration, change 13 to your preferred starting point.
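The three one-liners can be combined into a single script so that the lists stay consistent when the starting point or step changes. This is a sketch under the assumptions stated in the text (104 cores and 500GB per node, atm_ntasks of 52); the variable names are illustrative.

```python
import math

atm_ntasks = 52        # cores reserved for the non-ocean components
start = 13             # smallest ocn_ntasks; change this for your configuration
n_runs = 50            # number of branches in the suite
cores_per_node = 104   # cores per node, as stated above
mem_per_node_gb = 500  # memory available per node, as stated above

# Ocean cores grow by atm_ntasks from one run to the next.
ocn_ntasks = [start + k * atm_ntasks for k in range(n_runs)]
# Total cores requested: ocean cores plus the fixed atm_ntasks block.
ncpus = [n + atm_ntasks for n in ocn_ntasks]
# Memory is requested per whole node, so round the node count up.
nodes = [math.ceil(n / cores_per_node) for n in ncpus]
mem = [f"{n * mem_per_node_gb}GB" for n in nodes]

print("ocn_ntasks:", ocn_ntasks)
print("ncpus:", ncpus)
print("mem:", mem)
```

With the defaults shown, the output reproduces the ocn_ntasks, ncpus and mem lists used in pr-mom6.yaml.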
The following sets the run length to two days; the ICE entries turn off CICE.
      CLOCK_attributes:
        restart_n: 2
        restart_option: ndays
        stop_n: 2
        stop_option: ndays
      ALLCOMP_attributes:
        ICE_model: sice
      ICE_attributes: REMOVE
      ICE_modelio: REMOVE
We can then take the above yaml (pr-mom6.yaml) and pass it to access-experiment-generator:
module purge
module use /g/data/vk83/prerelease/modules
module load payu/dev
module list
mkdir -p /g/data/tm70/$USER/access-om3-ptests
cd /g/data/tm70/$USER/access-om3-ptests
experiment-generator -i pr-mom6.yaml
[cyb561@gadi-login-03 access-om3-ptests]$ experiment-generator -i pr-mom6.yaml
-- Test directory . already exists!
Cloned repository from git@github.com:chrisb13/access-om3-configs.git to directory: /g/data/tm70/cyb561/access-om3-ptests/mom6_only
Created and checked out new branch: ctrl
laboratory path: /scratch/tm70/cyb561/access-om3
binary path: /scratch/tm70/cyb561/access-om3/bin
input path: /scratch/tm70/cyb561/access-om3/input
work path: /scratch/tm70/cyb561/access-om3/work
archive path: /scratch/tm70/cyb561/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: mom6_only
To change directory to control directory run:
cd /g/data/tm70/cyb561/access-om3-ptests/mom6_only
-- Creating branch mom6_1 from ctrl!
Created and checked out new branch: mom6_1
laboratory path: /scratch/tm70/cyb561/access-om3
binary path: /scratch/tm70/cyb561/access-om3/bin
input path: /scratch/tm70/cyb561/access-om3/input
work path: /scratch/tm70/cyb561/access-om3/work
archive path: /scratch/tm70/cyb561/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: mom6_only
Updated perturbation files: ['config.yaml', 'nuopc.runconfig']
...
This should create a folder structure that looks like:
[cyb561@gadi-login-03 access-om3-ptests]$ tree mom6_only/
mom6_only/
├── config.yaml
├── datm_in
├── datm.streams.xml
├── diagnostic_profiles
│   ├── diag_table_standard
│   ├── README.md
│   └── source_yaml_files
│       └── diag_table_standard_source.yaml
├── diag_table -> diagnostic_profiles/diag_table_standard
├── docs
│   ├── available_diags.000000
│   ├── MOM_parameter_doc.all
│   ├── MOM_parameter_doc.debugging
│   ├── MOM_parameter_doc.layout
│   └── MOM_parameter_doc.short
├── drof_in
├── drof.streams.xml
├── drv_in
├── fd.yaml
├── ice_in
├── input.nml
├── LICENSE
├── manifests
│   ├── exe.yaml
│   ├── input.yaml
│   └── restart.yaml
├── MOM_input
├── MOM_override
├── nuopc.runconfig
├── nuopc.runseq
├── README.md
└── testing
    └── checksum
        ├── historical-6hr-checksum.json
        └── test_report.xml
The mom6_only folder is a Git repository in which each branch corresponds to a new simulation.
Using the Experiment runner to run the load balancing tests
We now have to write another .yaml file (example here called runner_pr-mom6.yaml) so we can use access-experiment-runner to run the experiments we generated in the previous section.
runner_pr-mom6.yaml looks like:
test_path: /g/data/tm70/cyb561/access-om3-ptests # All control and perturbation experiment repositories.
repository_directory: mom6_only # Local directory name for the central repository, where the running_branches are forked from.
keep_uuid: True
# Branches to run, with the number of runs and the restart setting for each branch.
running_branches: ['mom6_1', 'mom6_2', 'mom6_3', 'mom6_4', 'mom6_5', 'mom6_6', 'mom6_7', 'mom6_8', 'mom6_9', 'mom6_10', 'mom6_11', 'mom6_12', 'mom6_13', 'mom6_14', 'mom6_15', 'mom6_16', 'mom6_17', 'mom6_18', 'mom6_19', 'mom6_20', 'mom6_21', 'mom6_22', 'mom6_23', 'mom6_24', 'mom6_25', 'mom6_26', 'mom6_27', 'mom6_28', 'mom6_29', 'mom6_30', 'mom6_31', 'mom6_32', 'mom6_33', 'mom6_34', 'mom6_35', 'mom6_36', 'mom6_37', 'mom6_38', 'mom6_39', 'mom6_40', 'mom6_41', 'mom6_42', 'mom6_43', 'mom6_44', 'mom6_45', 'mom6_46', 'mom6_47', 'mom6_48', 'mom6_49', 'mom6_50']
nruns: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
startfrom_restart: ['cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold', 'cold']
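The three lists in runner_pr-mom6.yaml must have one entry per branch, so generating them together keeps their lengths in step. This is a minimal sketch with illustrative variable names; it only prints lines to paste into the file and does not call access-experiment-runner.

```python
n_branches = 50  # must match the number of branches created by the generator

running_branches = [f"mom6_{k}" for k in range(1, n_branches + 1)]
nruns = [3] * n_branches                      # three runs of every branch
startfrom_restart = ["cold"] * n_branches     # every branch starts cold

# Python list syntax is also valid YAML flow-sequence syntax.
print(f"running_branches: {running_branches}")
print(f"nruns: {nruns}")
print(f"startfrom_restart: {startfrom_restart}")
```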
We then run the following commands:
module purge
module use /g/data/vk83/prerelease/modules
module load payu/dev
module list
cd /g/data/tm70/$USER/access-om3-ptests
experiment-runner -i runner_pr-mom6.yaml