This repository contains the source code for our paper: "PrefMMT: Modeling Human Preferences in Preference-based Reinforcement Learning with Multimodal Transformers", submitted to the 2025 IEEE International Conference on Robotics and Automation (ICRA 2025). For more information, visit our project website.
Preference-based Reinforcement Learning (PbRL) is a promising approach for aligning robot behaviors with human preferences, but its effectiveness relies on accurately modeling those preferences through reward models. Traditional methods often assume preferences are Markovian, neglecting the temporal dependencies within robot behavior trajectories that influence human evaluations. While recent approaches use sequence modeling to learn non-Markovian rewards, they overlook the multimodal nature of robot trajectories, consisting of both state and action elements. This oversight limits their ability to capture the intricate interplay between these modalities, which is critical in shaping human preferences.
In this work, we introduce PrefMMT, a multimodal transformer network designed to disentangle and model the state and action modalities separately. PrefMMT hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. Our experiments show that PrefMMT consistently outperforms state-of-the-art baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the Meta-World benchmark.
The diagram above highlights the key distinctions between PrefMMT and other existing preference modeling methods. While traditional approaches often make Markovian assumptions and fail to capture the multimodal interactions between state and action, PrefMMT addresses this gap by leveraging a multimodal transformer architecture. This allows for a more accurate and dynamic understanding of human preferences by modeling both intra-modal and inter-modal dependencies.
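To make the hierarchical idea concrete, below is a minimal Flax sketch of this kind of multimodal preference scorer. It is not the repository's actual implementation, and all module and variable names (e.g. `MultimodalPreferenceScorer`) are illustrative: states and actions are embedded separately, intra-modal temporal dependencies are modeled with per-modality self-attention, inter-modal state-action interactions with cross-attention, and the fused features are pooled into a scalar score per trajectory segment.

```python
# A conceptual sketch (not the repository's actual implementation) of a multimodal
# preference scorer: states and actions are embedded separately, intra-modal
# temporal dependencies are modeled with per-modality self-attention, inter-modal
# state-action interactions with cross-attention, and the fused features are
# pooled into a scalar preference score for a trajectory segment.
import jax
import jax.numpy as jnp
import flax.linen as nn


class MultimodalPreferenceScorer(nn.Module):
    embd_dim: int = 256  # mirrors --transformer.embd_dim
    n_head: int = 4      # mirrors --transformer.n_head

    @nn.compact
    def __call__(self, states, actions):
        # states:  (batch, seq_len, state_dim); actions: (batch, seq_len, action_dim)
        s = nn.Dense(self.embd_dim)(states)    # state-modality embedding
        a = nn.Dense(self.embd_dim)(actions)   # action-modality embedding

        # Intra-modal temporal dependencies: self-attention within each modality.
        s = s + nn.SelfAttention(num_heads=self.n_head)(nn.LayerNorm()(s))
        a = a + nn.SelfAttention(num_heads=self.n_head)(nn.LayerNorm()(a))

        # Inter-modal state-action interactions: cross-attention between modalities
        # (weights shared across both directions here, purely as a simplification).
        cross = nn.MultiHeadDotProductAttention(num_heads=self.n_head)
        s2a = cross(nn.LayerNorm()(s), nn.LayerNorm()(a))  # states attend to actions
        a2s = cross(nn.LayerNorm()(a), nn.LayerNorm()(s))  # actions attend to states

        fused = jnp.concatenate([s + s2a, a + a2s], axis=-1)
        fused = nn.Dense(self.embd_dim)(nn.gelu(fused))

        # Pool per-step scores into one non-Markovian score per segment.
        per_step = nn.Dense(1)(fused)              # (batch, seq_len, 1)
        return per_step.mean(axis=1).squeeze(-1)   # (batch,)


if __name__ == "__main__":
    model = MultimodalPreferenceScorer()
    states = jnp.zeros((2, 100, 17))   # e.g. Walker2d observations
    actions = jnp.zeros((2, 100, 6))   # e.g. Walker2d actions
    params = model.init(jax.random.PRNGKey(0), states, actions)
    print(model.apply(params, states, actions).shape)  # (2,)
```

In PbRL, such per-segment scores for a pair of trajectory segments are typically compared under a Bradley-Terry-style preference loss to fit the human labels.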
- Install dependencies:

```bash
pip install --upgrade pip
conda install -y -c conda-forge cudatoolkit=11.1 cudnn=8.2.1
pip install -r requirements.txt
cd d4rl
pip install -e .
cd ..
```
- Install JAX and JAXlib (use the CUDA builds from the JAX CUDA releases page):
  - jax == 0.4.9
  - jaxlib == 0.4.9
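A minimal install sketch for the pinned versions; the `+cuda...` wheel tag below is an assumption and should be replaced with the wheel from the JAX CUDA releases page that matches your local CUDA/cuDNN setup:

```bash
pip install jax==0.4.9
# The CUDA/cuDNN tag below is illustrative; pick the jaxlib wheel matching your toolkit.
pip install "jaxlib==0.4.9+cuda11.cudnn86" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```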
Train the PrefMMT reward model:

```bash
CUDA_VISIBLE_DEVICES=0 python -m JaxPref.main --use_human_label True --comment {experiment_name} --transformer.embd_dim 256 --transformer.n_layer 3 --transformer.n_head 4 --env {D4RL env name} --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query {number of queries} --query_len 100 --n_epochs 10000 --skip_flag 0 --seed {seed} --model_type PrefMMT
```
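For example, to train a reward model on a D4RL locomotion task (the environment name, query count, and seed below are illustrative values, not prescribed settings):

```bash
CUDA_VISIBLE_DEVICES=0 python -m JaxPref.main --use_human_label True --comment prefmmt_walker --transformer.embd_dim 256 --transformer.n_layer 3 --transformer.n_head 4 --env walker2d-medium-replay-v2 --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query 500 --query_len 100 --n_epochs 10000 --skip_flag 0 --seed 0 --model_type PrefMMT
```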
Train an offline RL policy (IQL) with the learned PrefMMT reward model:

```bash
CUDA_VISIBLE_DEVICES=0 python train_offline.py --seq_len {sequence length in reward prediction} --comment {experiment_name} --eval_interval {5000: mujoco / 100000: antmaze / 5000: metaworld} --env_name {D4RL env name} --config {configs/(mujoco|antmaze|metaworld)_config.py} --eval_episodes {100 for antmaze, 10 otherwise} --use_reward_model True --model_type PrefMMT --ckpt_dir {reward_model_path} --seed {seed}
```
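As a concrete instance of the command above (the values are illustrative; `--seq_len` is assumed to match the reward model's `--query_len`, and `--ckpt_dir` must point to the checkpoint directory produced by the reward-model training step, so the path shown is only a placeholder):

```bash
CUDA_VISIBLE_DEVICES=0 python train_offline.py --seq_len 100 --comment prefmmt_walker --eval_interval 5000 --env_name walker2d-medium-replay-v2 --config configs/mujoco_config.py --eval_episodes 10 --use_reward_model True --model_type PrefMMT --ckpt_dir ./logs/pref_reward/prefmmt_walker --seed 0
```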
The code was tested on Ubuntu 20.04 with Python 3.8.
Our code builds on the implementations of Preference Transformer (PT), Flaxmodels, and IQL.