[Question] Please state clearly in the documentation and dataset definition whether, in a time step, "r_0" is the consequence of "a_0" #74
Comments
I agree that's good to mention. It is implied in this code block, but I think further clarification wouldn't hurt, so I'll make a PR.
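For illustration, a generic Gymnasium-style collection loop (a sketch, not the Minari source linked above): the reward stored at step `t` is the one returned by `env.step()` for the action taken at step `t`.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

episode = {"observations": [obs], "actions": [], "rewards": []}

done = False
while not done:
    action = env.action_space.sample()  # placeholder policy
    next_obs, reward, terminated, truncated, info = env.step(action)

    # `reward` is the return value of stepping with `action`, so
    # episode["rewards"][t] is the consequence of episode["actions"][t].
    episode["actions"].append(action)
    episode["rewards"].append(reward)
    episode["observations"].append(next_obs)

    obs = next_obs
    done = terminated or truncated
```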
Thanks @balisujohn, now I am even more confused. For me, this is déjà vu from when I was working with the D3RLpy offline RL library.

Minari/minari/data_collector/data_collector.py Lines 182 to 193 in 7d16829
With this data collector: […] Also, in previous discussions on these kinds of datasets, we concluded that the "original" D4RL datasets were in the format that is actually used in the replay buffers implemented in almost all RL libraries, e.g., a "full iteration", not just an `env.step`. So in just one "row" we have the state, the action taken in that state, the corresponding reward for taking that action in the current state, the subsequent state (required by on-policy learning), terminal flags, info, etc.

This is basically the format of a replay buffer, the format that one expects from a dataset describing a control task for use with RL.

In the documentation: […] What is the value of […]?

As said, thanks again, but please take ALL THE CARE with this issue, since it is a brain-teaser for people and also a "hidden" source of bad training. People can make severe mistakes when using the data for training by assuming something that is not the case.
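To make that "row" concrete, a minimal sketch with made-up field names (not Minari's actual schema):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    """One replay-buffer "row": everything needed to learn from a single step."""
    obs: Any        # s_t
    action: Any     # a_t, taken in s_t
    reward: float   # r_t, received for taking a_t in s_t
    next_obs: Any   # s_{t+1}
    terminated: bool
```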
Hi @jamartinh. The datasets have the structure you are looking for: @balisujohn shared the code to convert an episode's data to that format.
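As a sketch of that conversion (a hypothetical helper, not the code @balisujohn posted), assuming episode arrays where `observations` has one more entry than `actions` and `rewards`:

```python
def episode_to_transitions(observations, actions, rewards, terminations):
    """Turn per-episode arrays into (s, a, r, s', terminal) rows.

    Assumes len(observations) == len(actions) + 1, so that
    rewards[t] is the reward for taking actions[t] in observations[t].
    """
    return [
        (observations[t], actions[t], rewards[t],
         observations[t + 1], terminations[t])
        for t in range(len(actions))
    ]
```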
Question
Hi, please state clearly in the documentation and dataset definition whether, in a time step, "r_0" is the consequence of "a_0".

With previous offline RL libraries, there has been some confusion in this respect.

With the standard in RL being (s, a, r, s'), one assumes that r is the consequence of applying action a in state s.

If it is not, please state so clearly, because then the reward r(s, a) should be r_1 and not r_0.
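To spell out the two indexing conventions at stake (illustrative only, with dummy values):

```python
# Array / replay-buffer convention (common in RL libraries):
#   rewards[t] == r(states[t], actions[t])  ->  "r_0 is the consequence of a_0"
# Sutton & Barto indexing writes the same quantity as R_{t+1} = r(S_t, A_t),
# which is where the "should it be r_1?" confusion comes from.

states  = ["s0", "s1", "s2"]   # states[t+1] is produced by stepping with actions[t]
actions = ["a0", "a1"]
rewards = ["r0", "r1"]         # rewards[0] is the reward for taking "a0" in "s0"

for t in range(len(actions)):
    print(f"({states[t]}, {actions[t]}, {rewards[t]}, {states[t + 1]})")
```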
Thanks!