PyTorch · CVAE · ACT

Robot Learning

Visuomotor imitation learning from kinesthetic demonstrations. Three architectures evaluated end-to-end on real IRIS hardware.

Architecture

Three model variants.

Nine ablations across 3 architectures and 3 input modalities. CVAE and Transformer share a ResNet18 visual encoder; CNN-BC uses ResNet34.

Primary model

CVAE / ACT

CVAE Transformer ResNet18

Conditional Variational Autoencoder with a Transformer decoder that predicts a chunk of F=15 future actions. The latent variable z captures the multi-modal distribution over actions. Inspired by ACT (Action Chunking with Transformers).

  • ResNet18 + Spatial Softmax visual encoder
  • d_model=256 · 4 enc + 4 dec layers · 8 heads
  • Input: S=8 frame window + joint history + goal image
  • Output: F=15 joint angle targets
  • Loss: MSE + β·KL + λ·Smoothness
  • latent_dim=32, β=0.01
python train_cvae.py \
  --name MY_RUN \
  --model cvae_full \
  --loss loss_kl \
  --data_roots ~/Desktop/final_RGB_joint_goal \
  --checkpoint_dir ~/Desktop/checkpoints \
  --batch_size 64 \
  --epochs 100 \
  --latent_dim 32 \
  --beta 0.01

Auto-resumes from last checkpoint if interrupted. ~8 h on RTX 4090 for 100 epochs.
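
The loss listed above (MSE + β·KL + λ·Smoothness) can be written out as below. This is a sketch under assumptions, not the project's confirmed definition: the KL term uses the standard closed form for a diagonal Gaussian posterior against a unit Gaussian prior (the prior is assumed), and the smoothness term is assumed to penalize squared differences between consecutive predicted actions, since the page does not define it. Here F=15, d_z=32 (latent_dim), and β=0.01; the value of λ is not stated.

```latex
\mathcal{L}
  = \underbrace{\frac{1}{F}\sum_{t=1}^{F} \lVert \hat{a}_t - a_t \rVert_2^2}_{\text{MSE}}
  + \beta \, \underbrace{\frac{1}{2}\sum_{i=1}^{d_z}
      \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right)}_{D_{\mathrm{KL}}\left( q(z \mid \cdot) \,\Vert\, \mathcal{N}(0, I) \right)}
  + \lambda \, \underbrace{\frac{1}{F-1}\sum_{t=1}^{F-1}
      \lVert \hat{a}_{t+1} - \hat{a}_t \rVert_2^2}_{\text{smoothness (assumed form)}}
```

With a small β such as 0.01, the reconstruction term dominates and the KL term acts as a light regularizer on z.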

Baseline — no latent variable

Deterministic Transformer

Transformer ResNet18

Sequence-to-sequence Transformer without a latent variable. Ablates the stochastic component to measure the contribution of z in the CVAE.

  • Same ResNet18 encoder as CVAE
  • Input: S=8 frame window + joint state + goal image
  • Output: F=15 joint angle targets
  • Loss: MSE + smoothness
  • No latent variable (z removed)
python train_determinstic.py \
  --name MY_RUN \
  --model det_rgb \
  --loss mse_smooth \
  --data_roots ~/Desktop/final_RGB_only \
  --checkpoint_dir ~/Desktop/checkpoints

Also supports det_visual and det_full input modalities.

Baseline — behavior cloning

CNN-BC

ResNet34 MLP

ResNet34 + MLP behavior cloning baseline. Predicts a single step, with no action chunking or temporal modeling. Serves as a lower bound on sequence modeling capability.

  • Backbone: ResNet34 (larger than primary models)
  • Input: single RGB frame + current joint state
  • Output: next-step joint angle targets
  • Loss: MSE only
  • No temporal modeling or latent variable
python train_cnn_bc.py \
  --name MY_RUN \
  --data_roots ~/Desktop/final_RGB_joint_goal \
  --checkpoint_dir ~/Desktop/checkpoints

Fastest to train. Use as a sanity-check lower bound before running the full CVAE pipeline.

Figure: IRIS model architecture. Full system: ResNet18 → Spatial Softmax → CVAE Transformer.
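
To illustrate the Spatial Softmax stage of the pipeline, here is a minimal pure-Python sketch for a single feature channel. It shows the general technique (softmax over all spatial locations, then the expected x/y coordinate), not the project's actual layer; the function name `spatial_softmax_keypoint` and the `temperature` parameter are hypothetical, and a real implementation would operate on batched tensors.

```python
import math


def spatial_softmax_keypoint(feature_map, temperature=1.0):
    """Return the expected (x, y) location of one feature channel, in [-1, 1].

    feature_map is a list of rows (H x W) of activations for a single channel.
    """
    height, width = len(feature_map), len(feature_map[0])

    # Flatten row-major and apply a numerically stable softmax over all pixels.
    logits = [v / temperature for row in feature_map for v in row]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Grid coordinates spaced evenly over [-1, 1] along each axis.
    xs = [2.0 * j / (width - 1) - 1.0 if width > 1 else 0.0 for j in range(width)]
    ys = [2.0 * i / (height - 1) - 1.0 if height > 1 else 0.0 for i in range(height)]

    # Expected coordinate = probability-weighted average of grid positions.
    expected_x = sum(probs[i * width + j] * xs[j]
                     for i in range(height) for j in range(width))
    expected_y = sum(probs[i * width + j] * ys[i]
                     for i in range(height) for j in range(width))
    return expected_x, expected_y
```

Applied per channel, this turns each feature map into one 2D keypoint, which gives the downstream Transformer a compact, spatially meaningful input.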

Model variants: {arch}_{obs} where arch ∈ {cvae, det, vanilla_bc} and obs ∈ {rgb, visual, full}. Primary model: cvae_full.
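
The naming scheme above can be enumerated and parsed with a few lines of Python. This is an illustrative sketch; the helper names `all_variants` and `parse_variant` are hypothetical and not part of the training code. Note that `vanilla_bc` itself contains an underscore, so the name has to be split from the right.

```python
from itertools import product

ARCHS = ("cvae", "det", "vanilla_bc")
OBS = ("rgb", "visual", "full")


def all_variants():
    """Return the nine {arch}_{obs} names, e.g. 'cvae_full'."""
    return [f"{arch}_{obs}" for arch, obs in product(ARCHS, OBS)]


def parse_variant(name):
    """Split a variant name into (arch, obs), validating both parts."""
    # Split at the LAST underscore so 'vanilla_bc_rgb' -> ('vanilla_bc', 'rgb').
    arch, _, obs = name.rpartition("_")
    if arch not in ARCHS or obs not in OBS:
        raise ValueError(f"unknown model variant: {name!r}")
    return arch, obs
```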

Dataset

Episode format.

Demonstrations are collected via kinesthetic teaching on real hardware, recorded as ROS bags, and processed into episodes with synchronized RGB, depth, and joint states.

  • Window (S): 8 input frames
  • Horizon (F): 15 predicted steps
  • RGB: 640×480, cropped, normalized
  • Control dim: 6 joint angles (rad)

Episode layout

processed_data/
└── demo_YYYYMMDD_episode_XXXX/
    ├── rgb/
    │   └── frame_0000.png    # 640×480
    ├── depth/
    │   └── frame_0000.png    # 16-bit
    ├── robot/
    │   └── joint_states.csv  # pos×6, vel×6, eff×6
    └── meta.json             # num_frames, t_start, t_end
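
A quick consistency check over this layout might look like the sketch below. It assumes that `num_frames` in meta.json equals the number of PNGs in both rgb/ and depth/, which the page does not state explicitly; the function name `check_episode` is hypothetical.

```python
import json
from pathlib import Path


def check_episode(episode_dir):
    """Compare frame counts on disk against meta.json; return a list of problems."""
    episode = Path(episode_dir)
    meta_path = episode / "meta.json"
    if not meta_path.is_file():
        return ["missing meta.json"]

    meta = json.loads(meta_path.read_text())
    expected = meta.get("num_frames")

    problems = []
    for stream in ("rgb", "depth"):
        # Assumption: one frame_XXXX.png per time step in each stream.
        frames = sorted((episode / stream).glob("frame_*.png"))
        if len(frames) != expected:
            problems.append(
                f"{stream}: found {len(frames)} frames, meta.json says {expected}"
            )
    if not (episode / "robot" / "joint_states.csv").is_file():
        problems.append("missing robot/joint_states.csv")
    return problems
```

An empty list means the episode looks structurally complete; anything else is worth inspecting before training.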

Offline evaluation

python metrics_test.py \
  --test_data ~/Desktop/final_RGB_joint_goal/test \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --model_type cvae_full

EpisodeWindowDataset (datasets/iris_dataset.py) handles windowed sampling, normalization, and splits.
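
As a rough illustration of windowed sampling with S=8 and F=15, the sketch below enumerates index pairs for one episode. It is not the `EpisodeWindowDataset` implementation: the function name `window_indices` is hypothetical, and the alignment (targets start on the step right after the last input frame) is an assumption.

```python
def window_indices(num_frames, window=8, horizon=15):
    """Yield (input_frames, target_steps) index lists for one episode.

    Inputs cover `window` consecutive frames; targets cover the `horizon`
    steps that immediately follow (assumed alignment).
    """
    last_start = num_frames - window - horizon
    # Episodes shorter than window + horizon produce no samples.
    for start in range(last_start + 1):
        inputs = list(range(start, start + window))
        targets = list(range(start + window, start + window + horizon))
        yield inputs, targets
```

Under these assumptions an episode with T frames yields T − 23 + 1 training samples, so very short demonstrations contribute nothing.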

Deployment

Policy deployment.

Deploy a trained policy to real hardware. The policy reads from RealSense and publishes joint commands via the ROS driver.

Real robot

# Real hardware deployment
python policy.py \
  --model_type cvae_full \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --stats_path ~/final_RGB_joint_goal/dataset_stats.pkl \
  --device cuda \
  --real_robot \
  --vis    # live camera overlay

--stats_path is required at inference to un-normalize actions back to joint angles.
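
The un-normalization step could look like the sketch below, assuming the dataset was z-score normalized per joint. The actual contents of dataset_stats.pkl are not documented on this page (it may store min/max instead of mean/std), so the `mean` and `std` arguments and the function name `unnormalize` are hypothetical.

```python
def unnormalize(normalized, mean, std):
    """Map a normalized action back to joint angles (rad), element-wise.

    Assumes z-score normalization: normalized = (angle - mean) / std.
    """
    if not (len(normalized) == len(mean) == len(std)):
        raise ValueError("normalized, mean and std must have the same length")
    return [n * s + m for n, m, s in zip(normalized, mean, std)]
```

Using statistics from a different dataset than the one the checkpoint was trained on would silently shift every commanded joint angle, which is why the stats file travels with the checkpoint.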

Simulation

# MuJoCo simulation (omit --real_robot)
python policy.py \
  --model_type cvae_full \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --stats_path ~/final_RGB_joint_goal/dataset_stats.pkl \
  --device cpu

Same policy interface as real hardware — validate policies in sim before deploying.

Figure: CVAE policy deployed on real IRIS hardware. CVAE Full: 46.2% task success, 97% of expert visual alignment.

Figure: Training loss curves. CVAE training and validation loss over 100 epochs.

Results

Performance summary.

CVAE Full achieves 46.2% task success and 97% of expert visual alignment (0.847 vs. 0.874), with roughly 6× lower jerk than human demonstrations (0.61 vs. 3.64 m/s³).

IRIS results metrics

Method               N    Success ↑   Visual Align. ↑   Jerk (m/s³) ↓   SRR ↑
Expert (Human)       10   90.0%       0.874             3.64            67.1%
CVAE Full            13   46.2%       0.847             0.61            32.7%
Incremental          6    0.0%        0.636             0.83            35.2%
RGB Only             3    0.0%        0.584             1.65            7.2%
Visual (no joints)   4    0.0%        0.536             1.59            7.3%
RRT* (classical)     4    10.0%       0.636             0.22            10.5%

Success = visual alignment > 0.85 at trajectory end. Visual Alignment = ResNet18 cosine similarity to goal image.
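
The success criterion can be sketched in a few lines. This assumes the final frame and the goal image have already been embedded into feature vectors (which ResNet18 layer is used is not stated here); `cosine_similarity`, `is_success`, and `success_rate` are hypothetical helper names.

```python
import math

SUCCESS_THRESHOLD = 0.85  # from the definition above


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no alignment"
    return dot / (norm_a * norm_b)


def is_success(final_embedding, goal_embedding, threshold=SUCCESS_THRESHOLD):
    """A trial succeeds if the final frame aligns with the goal above threshold."""
    return cosine_similarity(final_embedding, goal_embedding) > threshold


def success_rate(outcomes):
    """Percentage of successful trials in a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)
```

For instance, 6 successes in 13 trials gives 46.2%, which is consistent with the figure reported for CVAE Full, although per-trial outcomes are not listed on this page.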