IRIS — Learning

Architecture

Three model variants.

Nine ablations across 3 architectures and 3 input modalities. CVAE and Transformer share a ResNet18 visual encoder; CNN-BC uses ResNet34.

Primary model

CVAE / ACT

CVAE Transformer ResNet18

A Conditional Variational Autoencoder with a Transformer decoder that predicts a chunk of F=15 future actions. The latent variable z encodes the multi-modal action distribution. Inspired by ACT (Action Chunking with Transformers).

ResNet18 + Spatial Softmax visual encoder
d_model=256 · 4 enc + 4 dec layers · 8 heads
Input: S=8 frame window + joint history + goal image
Output: F=15 joint angle targets
Loss: MSE + β·KL + λ·Smoothness
latent_dim=32, β=0.01
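
The loss line above can be read as a weighted sum of three terms. Below is a minimal sketch in plain Python. It assumes a diagonal-Gaussian posterior with a standard-normal prior for the KL term and a squared first-difference penalty for smoothness. The smoothness definition, the `lam` default, and all function names are illustrative assumptions, not taken from the IRIS training code.

```python
import math

def mse(pred, target):
    """Mean squared error over an F x D action chunk (lists of lists)."""
    total = sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return total / (len(pred) * len(pred[0]))

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def smoothness(pred):
    """Assumed smoothness term: mean squared change between consecutive steps."""
    diffs = [(b - a) ** 2 for prev, nxt in zip(pred, pred[1:]) for a, b in zip(prev, nxt)]
    return sum(diffs) / len(diffs)

def cvae_loss(pred, target, mu, logvar, beta=0.01, lam=0.1):
    # beta=0.01 matches the value listed above; lam=0.1 is a placeholder.
    return (mse(pred, target)
            + beta * kl_to_standard_normal(mu, logvar)
            + lam * smoothness(pred))
```

With β=0.01 the KL term acts as a light regularizer, so reconstruction dominates the objective.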

python train_cvae.py \
  --name MY_RUN \
  --model cvae_full \
  --loss loss_kl \
  --data_roots ~/Desktop/final_RGB_joint_goal \
  --checkpoint_dir ~/Desktop/checkpoints \
  --batch_size 64 \
  --epochs 100 \
  --latent_dim 32 \
  --beta 0.01

Auto-resumes from last checkpoint if interrupted. ~8 h on RTX 4090 for 100 epochs.

Baseline — no latent variable

Deterministic Transformer

Transformer ResNet18

Sequence-to-sequence Transformer without a latent variable. Ablates the stochastic component to measure the contribution of z in the CVAE.

Same ResNet18 encoder as CVAE
Input: S=8 frame window + joint state + goal image
Output: F=15 joint angle targets
Loss: MSE + smoothness
No latent variable (z removed)

python train_determinstic.py \
  --name MY_RUN \
  --model det_rgb \
  --loss mse_smooth \
  --data_roots ~/Desktop/final_RGB_only \
  --checkpoint_dir ~/Desktop/checkpoints

Also supports det_visual and det_full input modalities.

Baseline — behavior cloning

CNN-BC

ResNet34 MLP

ResNet34 + MLP behavior cloning baseline. Predicts a single step, with no action chunking or temporal modeling. Serves as a lower bound for the sequence models.

Backbone: ResNet34 (larger than primary models)
Input: single RGB frame + current joint state
Output: next-step joint angle targets
Loss: MSE only
No temporal modeling or latent variable

python train_cnn_bc.py \
  --name MY_RUN \
  --data_roots ~/Desktop/final_RGB_joint_goal \
  --checkpoint_dir ~/Desktop/checkpoints

Fastest to train. Use as a sanity-check lower bound before running the full CVAE pipeline.

Full system — ResNet18 → Spatial Softmax → CVAE Transformer
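
As a rough illustration of the Spatial Softmax stage in this pipeline, the sketch below turns one feature channel into an expected (x, y) keypoint. It is plain Python for readability, and the [-1, 1] coordinate range and the function name are assumptions. The actual encoder presumably runs this per channel on GPU tensors.

```python
import math

def spatial_softmax(feature_map):
    """Return the expected (x, y) location of one H x W feature channel.

    Coordinates are normalized to [-1, 1] (an assumed convention).
    Requires H > 1 and W > 1.
    """
    h, w = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)  # subtract max for stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    xs = [-1.0 + 2.0 * j / (w - 1) for j in range(w)]
    ys = [-1.0 + 2.0 * i / (h - 1) for i in range(h)]
    exp_x = sum(weights[i][j] * xs[j] for i in range(h) for j in range(w)) / total
    exp_y = sum(weights[i][j] * ys[i] for i in range(h) for j in range(w)) / total
    return exp_x, exp_y
```

A channel with a single sharp activation yields a keypoint at that pixel, while a flat channel yields the image center.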

Model variants: {arch}_{obs} where arch ∈ {cvae, det, vanilla_bc} and obs ∈ {rgb, visual, full}. Primary model: cvae_full.
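
The naming scheme can be enumerated and parsed in a few lines. This is a sketch; the helper names are hypothetical and not part of the IRIS codebase. Note that `vanilla_bc` itself contains an underscore, so names are split on the last one.

```python
from itertools import product

ARCHS = ("cvae", "det", "vanilla_bc")
OBS = ("rgb", "visual", "full")

def all_variants():
    """List every {arch}_{obs} combination (3 x 3 = 9 names)."""
    return [f"{arch}_{obs}" for arch, obs in product(ARCHS, OBS)]

def split_variant(name):
    """Split a variant name into (arch, obs) on the last underscore."""
    arch, _, obs = name.rpartition("_")
    if arch not in ARCHS or obs not in OBS:
        raise ValueError(f"unknown model variant: {name!r}")
    return arch, obs
```

For example, `split_variant("cvae_full")` returns `("cvae", "full")`.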

Dataset

Episode format.

Demonstrations are collected by kinesthetic teaching on real hardware, stored as ROS bags, and processed into episodes with synchronized RGB, depth, and joint states.

Window (S): 8 input frames
Horizon (F): 15 predicted steps
RGB: 640×480, cropped, normalized
Control dim: joint angles (rad)

Episode layout

processed_data/
└── demo_YYYYMMDD_episode_XXXX/
    ├── rgb/
    │   └── frame_0000.png    # 640×480
    ├── depth/
    │   └── frame_0000.png    # 16-bit
    ├── robot/
    │   └── joint_states.csv  # pos×6, vel×6, eff×6
    └── meta.json             # num_frames, t_start, t_end
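
A quick way to sanity-check a processed episode against this layout is sketched below. It assumes that `num_frames` in meta.json equals the number of `frame_*.png` files in both rgb/ and depth/. That assumption and the function name are not confirmed by the source.

```python
import json
from pathlib import Path

def check_episode(episode_dir):
    """Return a list of problems found in one episode directory (empty if none)."""
    root = Path(episode_dir)
    problems = []
    for rel in ("rgb", "depth", "robot/joint_states.csv", "meta.json"):
        if not (root / rel).exists():
            problems.append(f"missing {rel}")
    if problems:
        return problems
    meta = json.loads((root / "meta.json").read_text())
    expected = meta.get("num_frames")
    # Assumed invariant: one RGB and one depth frame per recorded step.
    for sub in ("rgb", "depth"):
        found = len(list((root / sub).glob("frame_*.png")))
        if found != expected:
            problems.append(f"{sub}: {found} frames, meta.json says {expected}")
    return problems
```

Running this over every directory in processed_data/ before training catches truncated or partially processed episodes early.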

Offline evaluation

python metrics_test.py \
  --test_data ~/Desktop/final_RGB_joint_goal/test \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --model_type cvae_full

EpisodeWindowDataset (datasets/iris_dataset.py) handles windowed sampling, normalization, and splits.
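
To make the windowed sampling concrete, here is a sketch of how sample indices could be derived for one episode with S=8 and F=15. It is not the actual EpisodeWindowDataset logic. It assumes the observation window ends at frame t and the action chunk covers steps t+1 to t+F, with no padding at episode boundaries.

```python
def window_indices(num_frames, window=8, horizon=15):
    """Return (obs_frames, action_steps) index lists for one episode.

    Assumes the observation window ends at frame t and the action chunk
    covers the next `horizon` steps, t+1 .. t+horizon (no padding).
    """
    samples = []
    for t in range(window - 1, num_frames - horizon):
        obs = list(range(t - window + 1, t + 1))
        actions = list(range(t + 1, t + 1 + horizon))
        samples.append((obs, actions))
    return samples
```

Under these assumptions an episode needs at least S + F = 23 frames to produce a single training sample.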

Deployment

Policy deployment.

Deploy a trained policy to real hardware. The policy reads from RealSense and publishes joint commands via the ROS driver.

Real robot

# Real hardware deployment
python policy.py \
  --model_type cvae_full \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --stats_path ~/final_RGB_joint_goal/dataset_stats.pkl \
  --device cuda \
  --real_robot \
  --vis    # live camera overlay

--stats_path is required at inference to un-normalize actions back to joint angles.
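
The un-normalization step might look like the sketch below, assuming the stats file holds a per-joint mean and standard deviation (z-score normalization). The actual contents of dataset_stats.pkl are not documented here, so both the scheme and the function names are assumptions.

```python
def normalize(joints, mean, std):
    """Map joint angles (rad) to normalized values, per joint."""
    return [(q - m) / s for q, m, s in zip(joints, mean, std)]

def unnormalize(action, mean, std):
    """Map a normalized policy output back to joint angles (rad)."""
    return [a * s + m for a, m, s in zip(action, mean, std)]
```

Using statistics from a different dataset than the one used in training would shift every commanded joint angle, which is why the same stats file has to accompany the checkpoint.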

Simulation

# MuJoCo simulation (omit --real_robot)
python policy.py \
  --model_type cvae_full \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --stats_path ~/final_RGB_joint_goal/dataset_stats.pkl \
  --device cpu

Same policy interface as real hardware — validate policies in sim before deploying.

CVAE policy deployed on real IRIS hardware

CVAE Full — 46.2% task success, 97% of expert visual alignment

Training loss curves

CVAE training & validation loss over 100 epochs

Method	N	Success ↑	Visual Align. ↑	Jerk (m/s³) ↓	SRR ↑
Expert (Human)	10	90.0%	0.874	3.64	67.1%
CVAE Full	13	46.2%	0.847	0.61	32.7%
Incremental	6	0.0%	0.636	0.83	35.2%
RGB Only	3	0.0%	0.584	1.65	7.2%
Visual (no joints)	4	0.0%	0.536	1.59	7.3%
RRT* (classical)	4	10.0%	0.636	0.22	10.5%
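
The "97% of expert visual alignment" headline can be reproduced from the Visual Align. column, and the Jerk column gives a similar relative comparison. A short check using only values from the table (the variable names are illustrative):

```python
# Values copied from the Expert (Human) and CVAE Full rows of the table.
expert_align, cvae_align = 0.874, 0.847
expert_jerk, cvae_jerk = 3.64, 0.61

align_ratio = cvae_align / expert_align  # fraction of expert alignment
jerk_ratio = expert_jerk / cvae_jerk     # how many times lower the CVAE jerk is

print(f"visual alignment: {align_ratio:.1%} of expert")
print(f"jerk: {jerk_ratio:.1f}x lower than expert")
```

The alignment ratio rounds to 97%, and the CVAE trajectories show roughly six times lower jerk than the human demonstrations.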
