#563 Feature/sg 356 ddp silent mode and multi process safe docs

Merged
Ghost merged 1 commit into Deci-AI:master from deci-ai:feature/SG-356_ddp_silent_mode_and_multi_process_safe_docs
1 changed file with 49 additions and 0 deletions
README.md: +49 -0
@@ -309,6 +309,7 @@ across all GPUs after the backward pass.
 #### How to use it?
 You can use SuperGradients to train your model with DDP in just a few lines.
 
+
 *main.py*
 ```python
 from super_gradients import init_trainer, Trainer
@@ -347,6 +348,54 @@ python -m torch.distributed.launch --nproc_per_node=4 main.py
 torchrun --nproc_per_node=4 main.py
 ```
 
+#### Calling functions on a single node
+
+It is often the case in DDP training that we want to execute code on the master rank only (i.e. rank 0).
+In SG, users usually execute their own code by triggering "Phase Callbacks" (see the "Using phase callbacks" section below).
+One can make sure the desired code will only be run on rank 0, using ddp_silent_mode or the multi_process_safe decorator.
+For example, consider the simple phase callback below, which uploads the first 3 images of every training batch to
+TensorBoard:
+
+```python
+from super_gradients.training.utils.callbacks import PhaseCallback, PhaseContext, Phase
+from super_gradients.common.environment.env_helpers import multi_process_safe
+
+class Upload3TrainImagesCallback(PhaseCallback):
+    def __init__(self):
+        super().__init__(phase=Phase.TRAIN_BATCH_END)
+    
+    @multi_process_safe
+    def __call__(self, context: PhaseContext):
+        batch_imgs = context.inputs.cpu().detach().numpy()
+        tag = f"batch_{context.batch_idx}_images"
+        context.sg_logger.add_images(tag=tag, images=batch_imgs[:3], global_step=context.epoch)
+
+```
+The @multi_process_safe decorator ensures that the callback will only be triggered by rank 0. Alternatively, the same
+can be achieved through the SG trainer's boolean attribute ddp_silent_mode (which the phase context has access to),
+which is set to False iff the current process rank is zero (even after the process group has been killed):
+```python
+from super_gradients.training.utils.callbacks import PhaseCallback, PhaseContext, Phase
+
+class Upload3TrainImagesCallback(PhaseCallback):
+    def __init__(self):
+        super().__init__(phase=Phase.TRAIN_BATCH_END)
+
+    def __call__(self, context: PhaseContext):
+        if not context.ddp_silent_mode:
+            batch_imgs = context.inputs.cpu().detach().numpy()
+            tag = f"batch_{context.batch_idx}_images"
+            context.sg_logger.add_images(tag=tag, images=batch_imgs[:3], global_step=context.epoch)
+
+```
+
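+Either variant is attached to training like any other phase callback, via the "phase_callbacks" entry of
+training_params (see the "Using phase callbacks" section below). A minimal, hedged sketch, with the remaining
+hyper-parameters and the Trainer.train arguments omitted:
+
+```python
+training_params = {
+    # ... your other training hyper-parameters ...
+    "phase_callbacks": [Upload3TrainImagesCallback()],
+}
+```
+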
+Note that ddp_silent_mode can be accessed through SgTrainer.ddp_silent_mode. Hence, it can be used in scripts after
+calling SgTrainer.train() when some part of the script should be run on rank 0 only.
+
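+For instance, a minimal sketch (assuming the script created a Trainer instance named trainer, as in the main.py
+example above):
+
+```python
+# ... after trainer.train(...) has returned, still inside main.py ...
+if not trainer.ddp_silent_mode:
+    # entered on rank 0 only, e.g. to upload or export final artifacts
+    print("post-training work running on the master process")
+```
+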
 #### Good to know
 Your total batch size will be (number of GPUs x batch size), so you might want to increase your learning rate.
 There is no clear rule, but a rule of thumb seems to be to [linearly increase the learning rate with the number of GPUs](https://arxiv.org/pdf/1706.02677.pdf)
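 
 For example, a minimal sketch of that linear scaling rule (the numbers are illustrative, not a guarantee):
 
 ```python
 num_gpus = 4
 base_lr = 0.01                   # learning rate tuned for a single GPU
 scaled_lr = base_lr * num_gpus   # 0.04: linearly scaled for 4 GPUs
 ```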