
#870 Feature/sg 757 resume for spots

Merged
Ghost merged 1 commit into Deci-AI:master from deci-ai:feature/SG-757_resume_for_spots
1 changed file with 34 additions and 0 deletions
  documentation/source/Checkpoints.md
@@ -265,7 +265,41 @@ For this reason, SG also offers a safer option for resuming interrupted training
 Note that resuming training this way requires the interrupted training to be launched with configuration files (i.e., `Trainer.train_from_config`), which outputs the Hydra final config to the `.hydra` directory inside the checkpoints directory.
 See usage in our [resume_experiment_example](https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/examples/resume_experiment_example/resume_experiment.py).
 
 
+## Resuming Training from SG Logger's Remote Storage (WandB only)
 
 
+SG supports saving checkpoints throughout the training process to the remote storage defined by the `SG Logger` (more about this object and its role during training in SG at [Third-party experiment monitoring](experiment_monitoring.md)).
+Suppose we run an experiment with a `WandB` SG logger; then our `training_hyperparams` should hold:
+```yaml
+sg_logger: wandb_sg_logger    # Weights & Biases logger; see class super_gradients.common.sg_loggers.wandb_sg_logger.WandBSGLogger for details
+sg_logger_params:             # Params that will be passed to __init__ of the logger super_gradients.common.sg_loggers.wandb_sg_logger.WandBSGLogger
+  project_name: project_name  # W&B project name
+  save_checkpoints_remote: True
+  save_tensorboard_remote: True
+  save_logs_remote: True
+  entity: <YOUR-ENTITY-NAME>          # Username or team name where you're sending runs
+  api_server: <OPTIONAL-WANDB-URL>    # Optional: in case your experiment tracking is not hosted at wandb servers
+```
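The YAML above can equivalently be expressed as a plain Python dict. The sketch below is illustrative only (the dict keys follow the YAML; `remote_checkpointing_enabled` is a hypothetical helper, not part of SG) and shows the two settings that must hold for checkpoints to be uploaded to W&B:

```python
# Hypothetical sketch: the W&B logger settings from the YAML above,
# expressed as the equivalent Python dict merged into training_hyperparams.
training_hyperparams = {
    "sg_logger": "wandb_sg_logger",
    "sg_logger_params": {
        "project_name": "project_name",
        "save_checkpoints_remote": True,
        "save_tensorboard_remote": True,
        "save_logs_remote": True,
        "entity": "<YOUR-ENTITY-NAME>",
    },
}

def remote_checkpointing_enabled(hparams: dict) -> bool:
    """Return True when the config asks SG to upload checkpoints to W&B."""
    params = hparams.get("sg_logger_params", {})
    return (
        hparams.get("sg_logger") == "wandb_sg_logger"
        and params.get("save_checkpoints_remote", False)
    )

print(remote_checkpointing_enabled(training_hyperparams))  # True
```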
+
+With `save_checkpoints_remote` set, checkpoints are uploaded to the W&B run storage throughout training.
+If the training was interrupted, we can resume it from the checkpoint stored in the W&B run by setting two training hyperparameters:
+1. Set `resume_from_remote_sg_logger`:
+```yaml
+resume_from_remote_sg_logger: True
+```
+2. Pass the run's id through `wandb_id` in `sg_logger_params`:
+```yaml
+sg_logger: wandb_sg_logger    # Weights & Biases logger; see class super_gradients.common.sg_loggers.wandb_sg_logger.WandBSGLogger for details
+sg_logger_params:             # Params that will be passed to __init__ of the logger super_gradients.common.sg_loggers.wandb_sg_logger.WandBSGLogger
+  wandb_id: <YOUR_RUN_ID>
+  project_name: project_name  # W&B project name
+  save_checkpoints_remote: True
+  save_tensorboard_remote: True
+  save_logs_remote: True
+  entity: <YOUR-ENTITY-NAME>          # Username or team name where you're sending runs
+  api_server: <OPTIONAL-WANDB-URL>    # Optional: in case your experiment tracking is not hosted at wandb servers
+```
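If you don't have the run id at hand, W&B names each local run directory `run-<timestamp>-<run_id>` under the experiment's `wandb/` folder (with a `latest-run` link pointing at the most recent one). A small helper (not part of SG, shown here only as a hedged sketch) can recover the id from such a directory name:

```python
import re

def run_id_from_wandb_dir(dirname: str) -> str:
    """Extract the W&B run id from a local run directory name.

    W&B names local run directories "run-<YYYYMMDD_HHMMSS>-<run_id>".
    Hypothetical helper for illustration; not part of super_gradients.
    """
    match = re.fullmatch(r"run-\d{8}_\d{6}-(\w+)", dirname)
    if match is None:
        raise ValueError(f"not a wandb run directory name: {dirname!r}")
    return match.group(1)

print(run_id_from_wandb_dir("run-20230314_101500-3q1b2c4d"))  # 3q1b2c4d
```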
+
+And that's it! Once you re-launch your training, `ckpt_latest.pth` (by default) will be downloaded to the checkpoints directory, and training will resume from it just as if it were stored locally.
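The resume flow can be sketched as below. This is illustrative only, not SG's actual implementation: if the remote-resume flag is set and no local checkpoint exists, the latest checkpoint is fetched into the checkpoints directory first, after which resumption proceeds exactly as in the local case. The `download` callback stands in for the real W&B fetch.

```python
from pathlib import Path
import tempfile

def resolve_resume_checkpoint(ckpt_dir: Path, resume_from_remote: bool, download) -> Path:
    """Illustrative sketch (not SG's implementation) of the remote-resume flow.

    When resume_from_remote is set and no local checkpoint exists, fetch
    "ckpt_latest.pth" into the checkpoints directory; either way, resume
    from the same local path as in the purely local case.
    """
    local_ckpt = ckpt_dir / "ckpt_latest.pth"
    if resume_from_remote and not local_ckpt.exists():
        download(local_ckpt)  # stand-in for pulling the file from the W&B run
    return local_ckpt

with tempfile.TemporaryDirectory() as d:
    ckpt_dir = Path(d)
    fetched = []
    path = resolve_resume_checkpoint(
        ckpt_dir,
        resume_from_remote=True,
        download=lambda p: (p.write_bytes(b""), fetched.append(p.name)),
    )
```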
 
 
 ## Evaluating Checkpoints
 
 