Denoised MDPs: Learning World Models Better Than The World Itself

Tongzhou Wang
MIT CSAIL
Simon S. Du
University of Washington
Antonio Torralba
MIT CSAIL
Phillip Isola
MIT CSAIL
Amy Zhang
UC Berkeley, Meta AI
Yuandong Tian
Meta AI

ICML 2022

Paper: [arXiv] Code: [GitHub]

Abstract

The ability to separate signal from noise, and reason with clean abstractions, is critical to intelligence. With this ability, humans can efficiently perform real-world tasks without considering all possible nuisance factors. How can artificial agents do the same? What kind of information can agents safely discard as noise? In this work, we categorize information out in the wild into four types based on controllability and relation with reward, and formulate useful information as that which is both controllable and reward-relevant.

This framework clarifies the kinds of information removed by various prior work on representation learning in reinforcement learning (RL), and leads to our proposed approach of learning a Denoised MDP that explicitly factors out certain noise distractors. Extensive experiments on variants of DeepMind Control Suite and RoboDesk demonstrate superior performance of our denoised world model over using raw observations alone, and over prior work, across policy optimization control tasks as well as the non-control task of joint position regression.


Four Types of Information in the Wild

Information in the wild can be categorized by whether it is controllable and whether it is related to rewards.
Imagine waking up and wanting to embrace some sunlight. As you open the curtain, a nearby resting bird is scared away and you are pleasantly met with a beautiful sunny day. Far away, a jet plane is slowly flying across the sky.
This simple activity highlights four distinct types of information (see figure below). Our optimal actions towards the goal, however, only depend on information that is both controllable and reward-relevant, and the three other kinds of information are merely noise distractors. Indeed, no matter how much natural sunlight there is outside, or how the plane and the bird move, the best plan is always to open up the curtain.
(Illustration credit to Jiaxi Chen.)
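The curtain example can be written down as a small lookup. The sketch below is only our reading of the story: the `Factor` class and the labels assigned to each factor are illustrative assumptions, not code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    """One piece of information in the scene (hypothetical helper type)."""
    name: str
    controllable: bool     # can our actions affect it?
    reward_relevant: bool  # does it affect how much reward we get?

    @property
    def is_signal(self) -> bool:
        # Only information that is both controllable and reward-relevant
        # is needed to choose optimal actions.
        return self.controllable and self.reward_relevant


# Assumed labels, following the story above.
factors = [
    Factor("curtain", controllable=True, reward_relevant=True),
    Factor("bird", controllable=True, reward_relevant=False),
    Factor("sunlight", controllable=False, reward_relevant=True),
    Factor("jet plane", controllable=False, reward_relevant=False),
]

signal = [f.name for f in factors if f.is_signal]
noise = [f.name for f in factors if not f.is_signal]
print(signal)  # ['curtain']
print(noise)   # ['bird', 'sunlight', 'jet plane']
```

Note that sunlight is treated as noise even though it affects the reward: since no action changes it, it cannot change which action is best.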
Different factorized MDP transition structures ((b) or (c) below) naturally separate out unwanted information. Whenever we have such a factorized model of the real dynamics, we can ignore much of the latent space. The latent x contains all the signal sufficient for optimal decision making, and the remaining latent spaces are noise.
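To see why a factorized model lets us drop the noise latents, here is a toy tabular check. It is a minimal sketch under assumptions that are ours rather than the page's: the signal latent x transitions as p(x' | x, a), a noise latent y evolves on its own as p(y' | y), and the reward is the sum r_x(x) + r_y(y). All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, na, gamma = 4, 3, 2, 0.9  # toy sizes and discount (assumed)


def random_stochastic(*shape):
    """Random array whose last axis sums to 1 (a transition kernel)."""
    p = rng.random(shape)
    return p / p.sum(axis=-1, keepdims=True)


Px = random_stochastic(nx, na, nx)  # p(x' | x, a): controllable signal
Py = random_stochastic(ny, ny)      # p(y' | y): uncontrollable noise
rx, ry = rng.random(nx), rng.random(ny)  # additive reward pieces


def value_iteration(P, r, iters=500):
    """P has shape (S, A, S), r has shape (S,); returns Q of shape (S, A)."""
    Q = np.zeros(P.shape[:2])
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = r[:, None] + gamma * (P @ V)
    return Q


# Full MDP over the joint state (x, y), flattened as s = x * ny + y.
P_full = np.einsum("xau,yv->xyauv", Px, Py).reshape(nx * ny, na, nx * ny)
r_full = (rx[:, None] + ry[None, :]).reshape(-1)
Q_full = value_iteration(P_full, r_full).reshape(nx, ny, na)

# Signal-only MDP over x, and the value of the uncontrolled y chain.
Q_x = value_iteration(Px, rx)
V_y = np.linalg.solve(np.eye(ny) - gamma * Py, ry)

pi_full = Q_full.argmax(axis=2)  # greedy action for each (x, y)
pi_x = Q_x.argmax(axis=1)        # greedy action for each x

# Q decomposes, so the optimal action never depends on y.
assert np.allclose(Q_full, Q_x[:, None, :] + V_y[None, :, None])
assert (pi_full == pi_x[:, None]).all()
```

In this toy setting the noise latent only shifts the value by an action-independent offset, so planning in x alone recovers the same greedy policy as planning in the full state.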

Denoised MDP

Denoised MDP finds such factorized representations of the real dynamics while minimizing the amount of information kept in the signal latent x. The resulting algorithm is a modification of standard variational maximum likelihood model fitting, but is more effective than baselines such as Dreamer and TIA at identifying and removing a variety of noise types.
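One way to picture "variational model fitting that also limits information in x" is a variational loss whose KL term for the signal latent carries extra weight. The sketch below is an illustration under that assumption only. It is not the paper's exact objective, and `weighted_variational_loss` and `alpha` are hypothetical names; see the paper for the actual formulation.

```python
import numpy as np


def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    var_q, var_p = std_q ** 2, std_p ** 2
    kl = np.log(std_p / std_q) + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5
    return kl.sum(axis=-1)


def weighted_variational_loss(recon_nll, kl_signal, kl_noise, alpha=2.0):
    """Negative ELBO with the signal-latent KL weighted by alpha (assumed form).

    With alpha > 1, storing information in the signal latent costs more
    than storing it in the noise latents, which pushes anything not needed
    in the signal latent out of it.
    """
    return float(np.mean(recon_nll + alpha * kl_signal + kl_noise))


# Toy batch of 2 steps with 3-dimensional latents (made-up numbers).
ones, zeros = np.ones((2, 3)), np.zeros((2, 3))
kl_x = gaussian_kl(ones, ones, zeros, ones)   # posterior mean shifted by 1
kl_z = gaussian_kl(zeros, ones, zeros, ones)  # posterior equals prior
recon = np.array([4.0, 6.0])                  # per-step reconstruction NLL

loss_plain = weighted_variational_loss(recon, kl_x, kl_z, alpha=1.0)
loss_weighted = weighted_variational_loss(recon, kl_x, kl_z, alpha=2.0)
```

With `alpha=1.0` this reduces to the usual negative ELBO; a larger `alpha` penalizes the same signal-latent KL more heavily.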
The better denoised models from Denoised MDP also lead to better trained policies. See the paper for results on policy learning and transferring to a non-control task, with comparisons against many more model-free baselines, including PI-SAC, CURL, and Deep Bisimulation for Control.

Signal-Noise Factorization

Visualization of learned models by using decoders to reconstruct from encoded latents. For TIA and Denoised MDP, we visualize how they separate information as signal versus noise. In each row, what changes over frames is the information modeled by the corresponding latent component.
For example, in RoboDesk, the only elements that change over time in Denoised MDP's noise visualization are the TV content, camera pose, and lighting condition. Denoised MDP therefore treats only these factors as noise, while modelling the TV hue, joint position, and light on the desk as useful signal.
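One plausible way to produce such rows is to decode a sequence in which a single latent component varies while the other is held at its first-frame value. The sketch below assumes this procedure and uses a toy linear decoder; the decoder, shapes, and names are all hypothetical stand-ins for a learned model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dx, dz, dobs = 5, 2, 3, 8  # frames, signal dim, noise dim, observation dim

# Toy linear decoder standing in for a learned image decoder.
Wx = rng.normal(size=(dx, dobs))
Wz = rng.normal(size=(dz, dobs))


def decode(x, z):
    """Reconstruct observations from signal latent x and noise latent z."""
    return x @ Wx + z @ Wz


x_seq = rng.normal(size=(T, dx))  # encoded signal latents over time
z_seq = rng.normal(size=(T, dz))  # encoded noise latents over time

# "Signal" row: x varies over frames, noise frozen at frame 0.
signal_video = decode(x_seq, np.broadcast_to(z_seq[0], (T, dz)))
# "Noise" row: noise varies over frames, x frozen at frame 0.
noise_video = decode(np.broadcast_to(x_seq[0], (T, dx)), z_seq)
```

Any frame-to-frame change in `signal_video` is then attributable to x alone, and any change in `noise_video` to the noise latent alone.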
Task: Press green button to shift TV hue to green. True Signal: Robot joint position, TV green-ness, green light on desk. True Noise: Lighting, camera, TV content, imperfect sensor.
TIA does not remove any noise factors, while Denoised MDP correctly identifies all of them.
Information is categorized into four distinct types based on
  • Whether it is controllable (Ctrl) or not (¬Ctrl);
  • Whether it is related to rewards (Rew) or not (¬Rew).
Among them, only information that is both controllable (Ctrl) and reward-relevant (Rew) is signal necessary for control. An optimal denoised latent space should ignore the rest as noise.

(Figure: categorization of information in this RoboDesk environment.)
Task: Move half cheetah robot forward. True Signal: Robot joint position. True Noise: N/A.
The TIA noise latent still captures some robot movement (see the moving ground texture). Denoised MDP correctly learns a collapsed noise latent space for this noiseless environment.
Task: Move walker robot forward while standing up. True Signal: Robot joint position. True Noise: Background.
Both TIA and Denoised MDP correctly disentangle signal and noise, showing static background in Signal videos, and (mostly) static robot in Noise videos.
Task: Make reacher robot touch the target red object. True Signal: Robot joint position, target location. True Noise: Background.
TIA wrongly models the robot position as noise and the background as signal. The Denoised MDP signal latent space correctly contains only the robot and target positions.
Task: Move walker robot forward when sensor readings are noisily affected by background images. True Signal: Robot joint position. True Noise: Background, imperfect sensor.
TIA fails to identify any noise with imperfect sensor readings. Denoised MDP, however, still learns a good factorization of signal and noise.
Task: Move half cheetah robot forward when camera is shaky. True Signal: Robot joint position. True Noise: Background, camera.
TIA signal latent fails to ignore camera movements. Denoised MDP correctly finds a signal latent of only the robot position, and a noise latent of only the camera and background.


Paper

ICML 2022. arXiv 2206.15477.

Citation

Tongzhou Wang, Simon S. Du, Antonio Torralba, Phillip Isola, Amy Zhang, Yuandong Tian. "Denoised MDPs: Learning World Models Better Than The World Itself." International Conference on Machine Learning (ICML), 2022.

Code: [GitHub]



bibtex entry

@inproceedings{wang2022denoisedmdps,
  title={Denoised MDPs: Learning World Models Better Than The World Itself},
  author={Wang, Tongzhou and Du, Simon S. and Torralba, Antonio and Isola, Phillip and Zhang, Amy and Tian, Yuandong},
  booktitle={International Conference on Machine Learning},
  organization={PMLR},
  year={2022}
}

Acknowledgements

We thank Jiaxi Chen for the beautiful introduction example illustration. We thank Daniel Jiang and Yen-Chen Lin for their helpful comments and suggestions. We are grateful to the following organizations for providing computation resources to this project: IBM's MIT Satori cluster, MIT Supercloud cluster, and Google Cloud Computing with credits gifted by Google to MIT. We are very thankful to Alex Lamb for suggestions and catching our typo in the conditioning of Equation (1).