Multi-process Trainer
- class ablator.main.mp.ParallelTrainer(wrapper: ModelWrapper, run_config: ParallelConfig)
Bases: ProtoTrainer
A class for parallelizing the training and hyperparameter optimization of models with different configurations using Ray.
Examples
Below is a complete workflow for launching a parallel experiment with ParallelTrainer, from defining the configs and preparing the model wrapper to launching the experiment.
Define the training config:
>>> my_optimizer_config = OptimizerConfig("sgd", {"lr": 0.5, "weight_decay": 0.5})
>>> my_scheduler_config = SchedulerConfig("step", arguments={"step_size": 1, "gamma": 0.99})
>>> train_config = TrainConfig(
...     dataset="[Dataset Name]",
...     batch_size=32,
...     epochs=10,
...     optimizer_config=my_optimizer_config,
...     scheduler_config=my_scheduler_config,
...     rand_weights_init=True,
... )
Define the model config. Here we want to run HPO over the activation function and the model's hidden size:
>>> @configclass
... class CustomModelConfig(ModelConfig):
...     hidden_size: int
...     activation: str
>>> model_config = CustomModelConfig(hidden_size=32, activation="relu")
Define the search space:
>>> search_space = {
...     "train_config.optimizer_config.arguments.lr": SearchSpace(
...         value_range=[0.001, 0.01],
...         value_type="float",
...     ),
...     "model_config.hidden_size": SearchSpace(value_range=[32, 64], value_type="int"),
...     "model_config.activation": SearchSpace(categorical_values=["relu", "elu", "leakyRelu"]),
... }
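To make the dotted keys concrete, the sketch below shows how one trial's overrides could be drawn from a search space like the one above. It is not ablator's sampler: plain dictionaries stand in for SearchSpace objects, uniform random sampling stands in for whatever strategy the library uses, and the helpers `sample_value` and `set_by_path` are hypothetical names introduced only for illustration.

```python
import random


def sample_value(space: dict, rng: random.Random):
    """Draw one value from a dict that mimics a SearchSpace entry."""
    if "categorical_values" in space:
        return rng.choice(space["categorical_values"])
    low, high = space["value_range"]
    if space["value_type"] == "int":
        return rng.randint(low, high)  # both bounds inclusive
    return rng.uniform(low, high)


def set_by_path(config: dict, dotted_key: str, value) -> None:
    """Write `value` into a nested dict at the location named by a dotted key."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Plain-dict stand-in for the search space defined above (an assumption).
search_space = {
    "train_config.optimizer_config.arguments.lr": {
        "value_range": [0.001, 0.01],
        "value_type": "float",
    },
    "model_config.hidden_size": {"value_range": [32, 64], "value_type": "int"},
    "model_config.activation": {"categorical_values": ["relu", "elu", "leakyRelu"]},
}

rng = random.Random(42)
trial_overrides: dict = {}
for key, space in search_space.items():
    set_by_path(trial_overrides, key, sample_value(space, rng))

print(trial_overrides)
```

Each key names a path into the run config, so a sampled value replaces exactly one field of the base configuration for that trial.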
Define the run config (remember to subclass the parallel config so that its model config type is CustomModelConfig):
>>> @configclass
... class CustomParallelConfig(ParallelConfig):
...     model_config: CustomModelConfig
>>> parallel_config = CustomParallelConfig(
...     train_config=train_config,
...     model_config=model_config,
...     metrics_n_batches=800,
...     experiment_dir="/tmp/experiments/",
...     device="cuda",
...     amp=True,
...     random_seed=42,
...     total_trials=20,
...     concurrent_trials=20,
...     search_space=search_space,
...     optim_metrics={"val_loss": "min"},
...     gpu_mb_per_experiment=1024,
...     cpus_per_experiment=1,
... )
Create the model wrapper:
>>> class MyModelWrapper(ModelWrapper):
...     def __init__(self, *args, **kwargs):
...         super().__init__(*args, **kwargs)
...
...     def make_dataloader_train(self, run_config: CustomParallelConfig):
...         return torch.utils.data.DataLoader(<train_dataset>, batch_size=32, shuffle=True)
...
...     def make_dataloader_val(self, run_config: CustomParallelConfig):
...         return torch.utils.data.DataLoader(<val_dataset>, batch_size=32, shuffle=False)
With the configurations and the model wrapper in place, initialize and launch the parallel trainer:
>>> import os
>>> wrapper = MyModelWrapper(
...     model_class=<your_ModelModule_class>,
... )
>>> ablator = ParallelTrainer(
...     wrapper=wrapper,
...     run_config=parallel_config,
... )
>>> ablator.launch(working_directory=os.getcwd(), ray_head_address="auto")
- Attributes:
- run_config: ParallelConfig
The running configuration for parallel training.
- device: str
The device to use for training.
- experiment_dir: Path
The directory that stores experiment information (Optuna storage, experiment state database).
- logger: RemoteFileLogger
A centralized logger that writes messages to a file and prints them to the console.
- experiment_state: ExperimentState
Manages the Optuna trials.
- total_trials: int
The number of trials to run.
- gpu_mem_bottleneck: int
The smallest memory capacity among all available GPUs.
- cpu: float
The number of CPUs used per trial.
- gpu: float
The number of GPUs used per trial.
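Because gpu is a float, a trial can be assigned a fraction of a device. The sketch below shows one plausible way such a fraction could be derived from a per-trial memory budget (such as gpu_mb_per_experiment) and the smallest GPU in the cluster. The formula and the function name `gpu_fraction` are assumptions made for illustration and are not taken from ablator's source.

```python
def gpu_fraction(gpu_mb_per_experiment: int, gpu_memory_mb: list[int]) -> float:
    """Fraction of one GPU to request per trial (illustrative assumption).

    The smallest device is used as the reference so that the same fraction
    fits on every GPU in the cluster.
    """
    bottleneck = min(gpu_memory_mb)
    # Never request more than one whole device.
    return min(gpu_mb_per_experiment, bottleneck) / bottleneck


# Two hypothetical GPUs with 8 GiB and 16 GiB of memory, 1024 MB per trial.
fraction = gpu_fraction(1024, [8192, 16384])
print(fraction)  # 0.125
```

Under this reading, eight trials with a 1024 MB budget would fit on the smaller 8192 MB device at once.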
- launch(working_directory: str, auxilary_modules: list[module] | None = None, ray_head_address: str | None = None, resume: bool = False, excluding_files: list[str] | None = None)
Set up and launch the parallel ablation process. This sets up a Ray cluster; trials with different hyperparameters are initialized (or retrieved) and pushed to Ray nodes so that they execute in parallel.
- Parameters:
- working_directory: str
The working directory containing the code and modules that Ray will use.
- auxilary_modules: list[types.ModuleType], None
A list of modules to be used as the Ray cluster's working environment.
- ray_head_address: str, None
The Ray cluster address.
- resume: bool, default=False
Whether to resume training from existing checkpoints and the existing experiment state.
- excluding_files: list[str], None
A list of file patterns in .gitignore format that are excluded from upload to the Ray cluster. If unspecified, the .git/** folder is ignored.
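As a minimal sketch of how these parameters fit together when resuming an interrupted experiment, the keyword arguments can be collected in a dictionary and passed to launch. The exclusion patterns are illustrative only, `trainer` is a hypothetical name for a ParallelTrainer built as in the Examples section, and keeping ".git/**" in the list rests on the assumption that a custom list replaces the default.

```python
import os

# Keyword arguments for ParallelTrainer.launch, using the parameter names
# documented above. The patterns are examples, not recommended defaults.
launch_kwargs = {
    "working_directory": os.getcwd(),
    "ray_head_address": "auto",
    "resume": True,  # continue from existing checkpoints and experiment state
    "excluding_files": [
        ".git/**",  # kept explicitly (assumes a custom list replaces the default)
        "*.ckpt",   # hypothetical: skip local checkpoint files
        "data/**",  # hypothetical: skip a large local data folder
    ],
}

# With a trainer constructed as in the Examples section:
# trainer.launch(**launch_kwargs)
print(sorted(launch_kwargs))
```

Keeping large or irrelevant files out of the upload reduces what has to be shipped to the Ray cluster before trials start.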