TPU Pod Commander is a package for setting up and launching jobs on Google Cloud TPU pods.
To install TPU Pod Commander, you need to install the gcloud CLI first. Follow the instructions here to install it. After installing the gcloud CLI, you can install TPU Pod Commander by running the following command:
pip install tpu_pod_commander
After installing TPU Pod Commander, the command tpc will be available in your shell. TPC commands are all organized in the following format:
tpc <action> [config_file.py] [--flags=value ...]
where <action> is the action to perform, and the parameters are specified jointly by the optional config file and the flags. There is a one-to-one correspondence between the flags and the parameters in the config file. When both are specified, the flags override the parameters in the config file.
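The precedence rule can be illustrated with a minimal Python sketch. It is not TPC's actual implementation: the helper name resolve_parameters, the sample values, and the convention that a flag left unset is represented by None are all assumptions made for illustration.

```python
def resolve_parameters(config_params, flag_params):
    """Merge parameters, letting command-line flags win over config values."""
    merged = dict(config_params)
    # Assumption: a flag that was not given on the command line is None,
    # so only flags that were actually set override the config file.
    merged.update({k: v for k, v in flag_params.items() if v is not None})
    return merged


# Hypothetical values, standing in for a config file and a set of flags.
config_params = {"zone": "us-central1-a", "project": "my-project", "name": "my-pod"}
flag_params = {"zone": "us-east1-d", "name": None}

resolved = resolve_parameters(config_params, flag_params)
print(resolved)
```

Here zone comes from the flag, while project and name fall back to the config file.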
The following is a list of available actions:
- list: List all TPU pods in a given zone of a project.
- create: Create a TPU pod.
- delete: Delete a TPU pod.
- queue: Create a TPU pod via the queued resources API.
- ls_queue: List all queued TPU pods.
- cancel_queue: Cancel a queued TPU pod.
- describe: Get the details of a TPU pod.
- ips: List the external IPs of all the hosts in a TPU pod.
- upload: Upload files to a TPU pod.
- run: Run a command on all the hosts of a TPU pod.
- launch: Launch a shell script job in a tmux session on all the hosts of a TPU pod.
- check: Check the status of a job running in the tmux session on a TPU pod.
- stop: Stop a job running in the tmux session on a TPU pod.
- reboot: Reboot all the hosts in a TPU pod.
- unlock: Remove the libtpu lock files on all hosts of a TPU pod.
- stop+unlock: Perform the stop and unlock actions in sequence.
- relaunch: Perform the stop and launch actions in sequence.
- upload+launch: Perform the upload and launch actions in sequence.
The optional config file is a Python file that contains the parameters for the action. It should be in the following format:
configure_tpc(
key=value,
...
)
Note that no import statement is needed in the config file to use the configure_tpc function. The parameters of configure_tpc have a one-to-one correspondence with the flags for the action, so, for example, specifying --zone=us-central1-a in the command line flags is equivalent to specifying zone='us-central1-a' in the config file. When both are specified for one parameter, the flag overrides the corresponding parameter in the config file.
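One way a tool can make a function such as configure_tpc available without an import is to execute the config file in a namespace that already contains it. The sketch below shows that general technique only. It is not taken from TPC's source; the helper load_config and the sample parameter values are hypothetical.

```python
def load_config(source):
    """Run config source with configure_tpc pre-defined and return its keyword arguments."""
    captured = {}

    def configure_tpc(**kwargs):
        # Record whatever parameters the config file passes in.
        captured.update(kwargs)

    # The config code sees configure_tpc as a global name, so it needs no import.
    exec(source, {"configure_tpc": configure_tpc})
    return captured


# Stand-in for the text of a config file; the values are hypothetical.
example_config = """
configure_tpc(
    zone='us-central1-a',
    project='my-project',
    name='my-pod',
)
"""

params = load_config(example_config)
print(params)
```

Because the config file is ordinary Python, it can also compute values (for example, build a pod name from a variable) before passing them to configure_tpc.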
The following is a list of available parameters:
- zone: The zone of the TPU pod.
- project: The GCP project of the TPU pod.
- name: The name of the TPU pod.
- accelerator_type: The type and size of the TPU pod, for example, v4-256.
- runtime_version: The runtime software version of the TPU pod.
- reserved: Whether the TPU pod should be created under reserved quota. Defaults to False.
- spot: Whether the TPU pod should be created as a preemptible instance. Defaults to False.
- upload_path: A comma-separated list of <local path>:<remote path> pairs to upload.
- upload_remove_remote: Whether to remove the remote files before uploading. Defaults to True.
- command: The command to run on the TPU pod.
- launch_script_path: The path from which to load the content of the launch script.
- launch_script: The content of the launch script. When both launch_script_path and launch_script are specified, the script content is loaded from launch_script_path and overrides the launch_script parameter.
- launch_script_remote_path: The remote path on the TPU pod where the launch script is saved. Defaults to ~/tpc_launch_script.sh.
- tpu_user: The username to use when connecting to the TPU pod. Defaults to the current user.
- tmux_session_name: The name of the tmux session to create when launching a job. Defaults to tpc.
- show_command: Whether to show the gcloud command when executing an action. Defaults to True.
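To make the upload_path format concrete, the following sketch splits such a value into (local, remote) pairs. It only illustrates the format described above and is not TPC's own parser. The function name parse_upload_path and the sample paths are hypothetical, and the sketch assumes that paths contain neither commas nor colons.

```python
def parse_upload_path(upload_path):
    """Split 'local:remote,local:remote' into a list of (local, remote) tuples."""
    pairs = []
    for item in upload_path.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or empty entry
        # Split on the first colon only; both sides must be non-empty.
        local, sep, remote = item.partition(":")
        if not sep or not local or not remote:
            raise ValueError(f"expected <local path>:<remote path>, got {item!r}")
        pairs.append((local, remote))
    return pairs


# Hypothetical paths: one directory and one file.
pairs = parse_upload_path("./src:~/src,./data/config.json:~/config.json")
for local, remote in pairs:
    print(f"{local} -> {remote}")
```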
Not all parameters are needed for all actions. The following is a list of required parameters for each action:
- list: zone, project.
- create: zone, project, name, accelerator_type, runtime_version.
- delete: zone, project, name.
- queue: zone, project, name, accelerator_type, runtime_version.
- ls_queue: zone, project.
- cancel_queue: zone, project, name.
- describe: zone, project, name.
- ips: zone, project, name.
- upload: zone, project, name, upload_path.
- run: zone, project, name, command.
- launch: zone, project, name, launch_script_path or launch_script.
- check: zone, project, name.
- stop: zone, project, name.
- reboot: zone, project, name.
- unlock: zone, project, name.
- relaunch: zone, project, name, launch_script_path or launch_script.
- upload+launch: zone, project, name, upload_path, launch_script_path or launch_script.
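The list above can be encoded as a small lookup table, which is handy for checking a set of parameters before running an action. This is a sketch that mirrors the list, not code from TPC. The names REQUIRED and missing_parameters are hypothetical, and the sketch assumes that a parameter counts as provided when it has a non-empty value.

```python
# Each requirement is a tuple of alternatives; any one of them satisfies it.
POD = [("zone",), ("project",), ("name",)]
SCRIPT = ("launch_script_path", "launch_script")

REQUIRED = {
    "list": POD[:2],
    "create": POD + [("accelerator_type",), ("runtime_version",)],
    "delete": POD,
    "queue": POD + [("accelerator_type",), ("runtime_version",)],
    "ls_queue": POD[:2],
    "cancel_queue": POD,
    "describe": POD,
    "ips": POD,
    "upload": POD + [("upload_path",)],
    "run": POD + [("command",)],
    "launch": POD + [SCRIPT],
    "check": POD,
    "stop": POD,
    "reboot": POD,
    "unlock": POD,
    "relaunch": POD + [SCRIPT],
    "upload+launch": POD + [("upload_path",), SCRIPT],
}


def missing_parameters(action, params):
    """Return the unmet requirements for an action as human-readable strings."""
    missing = []
    for alternatives in REQUIRED[action]:
        # Assumption: a parameter is provided when its value is non-empty.
        if not any(params.get(name) for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


# Hypothetical parameter values.
params = {"zone": "us-central1-a", "project": "my-project", "name": "my-pod"}
print(missing_parameters("delete", params))
print(missing_parameters("upload+launch", params))
```

With these values, delete has everything it needs, while upload+launch still lacks upload_path and one of launch_script_path or launch_script.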
See the examples directory for some example config files and the corresponding commands.