- Does `ClipPPOLoss` optimize for all agents at each step?
    - Yes -> Change the reduction to `"none"` and calculate the mean on your own (see the sketch below)
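A minimal sketch of the manual reduction, assuming TorchRL's `ClipPPOLoss` accepts `reduction="none"` and writes per-sample losses under the usual `loss_objective` / `loss_critic` / `loss_entropy` keys; `policy`, `value_net`, and `batch` are placeholders, not names from these notes:

```python
from torchrl.objectives import ClipPPOLoss

# reduction="none" keeps one loss value per sample/agent instead of a scalar
loss_module = ClipPPOLoss(
    actor_network=policy,       # placeholder actor module
    critic_network=value_net,   # placeholder critic module
    reduction="none",
)

loss_td = loss_module(batch)    # batch: a sampled TensorDict of transitions
# Reduce manually, e.g. a plain mean over all agents and samples
loss = (
    loss_td["loss_objective"].mean()
    + loss_td["loss_critic"].mean()
    + loss_td["loss_entropy"].mean()
)
loss.backward()
```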
- Write your own `MaskedDependentDiscreteTensorsSpec(TensorSpec)` to sample from UCI moves given `action_mask`
    - Use that as the action spec
    - Update the action mask in the `_reset()` and `_step()` methods (see the mask-building sketch below)
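The spec itself is still to be written, but a hedged sketch of how `_reset()` / `_step()` could rebuild the mask from python-chess legal moves; the `(64, 64)` from/to layout, the `legal_move_mask` helper, and the `("agents", "action_mask")` key are assumptions for illustration:

```python
import torch
import chess

def legal_move_mask(board: chess.Board) -> torch.Tensor:
    """Boolean (from_square, to_square) mask of currently legal moves."""
    mask = torch.zeros(64, 64, dtype=torch.bool)
    for move in board.legal_moves:
        mask[move.from_square, move.to_square] = True
    return mask

# In _reset() and _step(), write the refreshed mask into the output TensorDict, e.g.:
# td.set(("agents", "action_mask"), legal_move_mask(self.board))
```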
- Write your own `MaskedDependentCategorical(D.Categorical)` (see the sketch below)
    - This is the last (probabilistic) module in `ProbabilisticActor`
    - Takes the logits from the MLP:
        - `from_square_raw_logits` from `("agents", "from_logits")`
        - `to_square_raw_logits` from `("agents", "to_logits")`
    - Use `from_square_raw_logits` and `action_mask.any(dim=-1)` in `MaskedCategorical`
        - Draw the first action `a1`
    - Use `to_square_raw_logits` and `action_mask[..., a1, :]` in `MaskedCategorical`
        - Draw the second action `a2`
    - `dist.log_prob(action)` should return `a1_masked_categorical.log_prob(a1) + a2_masked_categorical.log_prob(a2)`
        - Because `P(a1, a2) = P(a1) * P(a2 | a1)`
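A hedged sketch of that distribution, subclassing `torch.distributions.Distribution` rather than `D.Categorical` for simplicity and leaning on TorchRL's `MaskedCategorical(logits=..., mask=...)`; the `(..., 64, 64)` `action_mask` layout, the long-typed `(..., 2)` action tensor, and skipping non-empty sample shapes are assumptions of this sketch:

```python
import torch
import torch.distributions as D
from torchrl.modules import MaskedCategorical

class MaskedDependentCategorical(D.Distribution):
    """Two-step (from_square, to_square) draw where the to-mask depends on a1."""

    arg_constraints = {}

    def __init__(self, from_logits, to_logits, action_mask, validate_args=False):
        # from_logits / to_logits: (..., 64); action_mask: (..., 64, 64) legal pairs
        self.from_logits, self.to_logits = from_logits, to_logits
        self.action_mask = action_mask
        super().__init__(batch_shape=from_logits.shape[:-1], validate_args=validate_args)

    def _from_dist(self):
        # A from-square is legal iff it has at least one legal destination
        return MaskedCategorical(logits=self.from_logits,
                                 mask=self.action_mask.any(dim=-1))

    def _to_mask(self, a1):
        # Batched equivalent of action_mask[..., a1, :]
        idx = a1[..., None, None].expand(*a1.shape, 1, self.action_mask.shape[-1])
        return self.action_mask.gather(-2, idx).squeeze(-2)

    def sample(self, sample_shape=torch.Size()):
        # Non-empty sample shapes are not handled in this sketch
        a1 = self._from_dist().sample()
        a2 = MaskedCategorical(logits=self.to_logits, mask=self._to_mask(a1)).sample()
        return torch.stack([a1, a2], dim=-1)

    def log_prob(self, action):
        # action: long tensor of shape (..., 2) holding (a1, a2)
        a1, a2 = action[..., 0], action[..., 1]
        to_dist = MaskedCategorical(logits=self.to_logits, mask=self._to_mask(a1))
        # log P(a1, a2) = log P(a1) + log P(a2 | a1)
        return self._from_dist().log_prob(a1) + to_dist.log_prob(a2)
```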
- Parallelize traj collection: `ParallelEnv` or `MultiSyncDataCollector`??? (see the sketch below)
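A hedged sketch of the two options; `make_chess_env`, `policy`, the worker count, and the frame budgets are placeholders, not names from these notes:

```python
from torchrl.envs import ParallelEnv
from torchrl.collectors import SyncDataCollector, MultiSyncDataCollector

# Option A: batch several envs behind one ParallelEnv, collect with a single collector
parallel_env = ParallelEnv(4, make_chess_env)
collector_a = SyncDataCollector(
    parallel_env, policy, frames_per_batch=1_000, total_frames=100_000
)

# Option B: one worker process per env inside the collector itself
collector_b = MultiSyncDataCollector(
    [make_chess_env] * 4, policy, frames_per_batch=1_000, total_frames=100_000
)
```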
- Maybe turn categorical vars into one-hot vectors (?)

    ```python
    import torch
    import torch.nn.functional as F

    x = F.one_hot(torch.arange(6).reshape(3, 2) % 4, num_classes=4).flatten(start_dim=-2, end_dim=-1)
    # tensor([[1, 0, 0, 0, 0, 1, 0, 0],
    #         [0, 0, 1, 0, 0, 0, 0, 1],
    #         [1, 0, 0, 0, 0, 1, 0, 0]])
    ```