Visuomotor imitation learning from kinesthetic demonstrations. Three architectures evaluated end-to-end on real IRIS hardware.
Nine ablations across 3 architectures and 3 input modalities. CVAE and Transformer share a ResNet18 visual encoder; CNN-BC uses ResNet34.
Conditional Variational Autoencoder (CVAE) with a Transformer decoder that predicts a chunk of F=15 future actions. The latent variable z captures the multi-modal distribution over actions. Inspired by ACT.
python train_cvae.py \
  --name MY_RUN \
  --model cvae_full \
  --loss loss_kl \
  --data_roots ~/Desktop/final_RGB_joint_goal \
  --checkpoint_dir ~/Desktop/checkpoints \
  --batch_size 64 \
  --epochs 100 \
  --latent_dim 32 \
  --beta 0.01
Auto-resumes from last checkpoint if interrupted. ~8 h on RTX 4090 for 100 epochs.
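The `--loss loss_kl` and `--beta 0.01` flags suggest an objective of the form reconstruction + β · KL. The following is a minimal sketch of such an objective in plain Python, not the code in `train_cvae.py`. It assumes an L1 reconstruction term (as in ACT), a diagonal Gaussian posterior with a standard normal prior, and 6-dimensional joint actions; the function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def cvae_loss(pred_actions, target_actions, mu, logvar, beta=0.01):
    """Mean L1 error over the action chunk plus beta times the KL term.

    The L1 reconstruction term is an assumption; the source only names
    the loss `loss_kl`.
    """
    errors = [
        abs(p - t)
        for pred_step, target_step in zip(pred_actions, target_actions)
        for p, t in zip(pred_step, target_step)
    ]
    recon = sum(errors) / len(errors)
    return recon + beta * kl_to_standard_normal(mu, logvar)

# Toy example: chunk of F=15 steps, 6 joints (assumed), latent_dim=32.
F, n_joints, latent_dim = 15, 6, 32
target = [[0.0] * n_joints for _ in range(F)]
pred = [[0.1] * n_joints for _ in range(F)]
loss = cvae_loss(pred, target, mu=[0.0] * latent_dim, logvar=[0.0] * latent_dim)
```

With a small β such as 0.01, the KL term only weakly regularizes z, so the reconstruction term dominates training.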
Sequence-to-sequence Transformer without a latent variable. Ablates the stochastic component to measure the contribution of z in the CVAE.
python train_determinstic.py \
  --name MY_RUN \
  --model det_rgb \
  --loss mse_smooth \
  --data_roots ~/Desktop/final_RGB_only \
  --checkpoint_dir ~/Desktop/checkpoints
Also supports det_visual and det_full input modalities.
ResNet34 + MLP behavior cloning baseline. Single-step prediction without action chunking or temporal modeling. Lower bound on sequence modeling capability.
python train_cnn_bc.py \
  --name MY_RUN \
  --data_roots ~/Desktop/final_RGB_joint_goal \
  --checkpoint_dir ~/Desktop/checkpoints
Fastest to train. Use as a sanity-check lower bound before running the full CVAE pipeline.
Full system — ResNet18 → Spatial Softmax → CVAE Transformer
Model variants: {arch}_{obs} where arch ∈ {cvae, det, vanilla_bc}
and obs ∈ {rgb, visual, full}. Primary model: cvae_full.
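The naming scheme above yields the nine ablations as the product of the two sets. A small sketch that enumerates them (the helper name `variant_names` is illustrative and not part of the repository):

```python
from itertools import product

ARCHS = ("cvae", "det", "vanilla_bc")
OBS = ("rgb", "visual", "full")

def variant_names():
    """Return every {arch}_{obs} combination, one per ablation."""
    return [f"{arch}_{obs}" for arch, obs in product(ARCHS, OBS)]

names = variant_names()
print(len(names))
print(names)
```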
Demonstrations are collected via kinesthetic teaching on real hardware, stored as ROS bags, and processed into episodes with synchronized RGB, depth, and joint states.
processed_data/
└── demo_YYYYMMDD_episode_XXXX/
    ├── rgb/
    │   └── frame_0000.png      # 640×480
    ├── depth/
    │   └── frame_0000.png      # 16-bit
    ├── robot/
    │   └── joint_states.csv    # pos×6, vel×6, eff×6
    └── meta.json               # num_frames, t_start, t_end

# Evaluate a trained checkpoint on the test split
python metrics_test.py \
  --test_data ~/Desktop/final_RGB_joint_goal/test \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --model_type cvae_full
EpisodeWindowDataset (datasets/iris_dataset.py) handles windowed sampling, normalization, and splits.
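To illustrate what windowed sampling means for action chunking, here is a rough sketch of how an episode could be cut into fixed-length windows. It is not the `EpisodeWindowDataset` implementation: the function names are hypothetical, and it assumes that only windows fitting entirely inside the episode are kept (the real dataset may pad the episode tail instead).

```python
def window_starts(num_frames, horizon=15):
    """Start indices t such that frames t .. t + horizon - 1 all exist."""
    return list(range(max(0, num_frames - horizon + 1)))

def make_windows(actions, horizon=15):
    """Pair each start index with its chunk of `horizon` consecutive actions."""
    return [(t, actions[t:t + horizon]) for t in window_starts(len(actions), horizon)]

# A 20-step episode with scalar stand-in actions 0..19.
episode_actions = list(range(20))
windows = make_windows(episode_actions, horizon=15)
print(len(windows))    # number of training samples from this episode
print(windows[-1][0])  # start index of the last full window
```

Under this assumption an episode with `num_frames` frames contributes `num_frames - horizon + 1` samples, and episodes shorter than the horizon contribute none.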
Deploy a trained policy to real hardware. The policy reads from RealSense and publishes joint commands via the ROS driver.
# Real hardware deployment
python policy.py \
  --model_type cvae_full \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --stats_path ~/final_RGB_joint_goal/dataset_stats.pkl \
  --device cuda \
  --real_robot \
  --vis  # live camera overlay
--stats_path is required at inference to un-normalize actions back to joint angles.
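As an illustration of why the statistics are needed, the sketch below shows per-joint standardization and its inverse. The layout of `dataset_stats.pkl` is not documented here, so the `action_mean` and `action_std` keys and both function names are assumptions made for the example.

```python
def normalize(values, mean, std, eps=1e-8):
    """Standardize each joint value: (v - mean) / (std + eps)."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]

def unnormalize(values, mean, std, eps=1e-8):
    """Invert normalize(): map network outputs back to joint angles."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]

# Hypothetical statistics for a 6-joint arm (radians); the key names are assumed.
stats = {
    "action_mean": [0.1, -0.5, 1.2, 0.0, 0.3, -0.2],
    "action_std": [0.4, 0.3, 0.5, 0.2, 0.6, 0.1],
}

joint_angles = [0.5, -0.2, 1.0, 0.1, 0.9, -0.3]
network_space = normalize(joint_angles, stats["action_mean"], stats["action_std"])
recovered = unnormalize(network_space, stats["action_mean"], stats["action_std"])
```

Using statistics from a different dataset than the one the checkpoint was trained on would shift and scale the commanded joint angles, so the stats file should match the checkpoint.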
# MuJoCo simulation (omit --real_robot)
python policy.py \
  --model_type cvae_full \
  --checkpoint ~/checkpoints/best_cvae_full_v1.pth \
  --stats_path ~/final_RGB_joint_goal/dataset_stats.pkl \
  --device cpu

Same policy interface as real hardware; validate policies in sim before deploying.
CVAE Full — 46.2% task success, 97% of expert visual alignment
CVAE training & validation loss over 100 epochs
CVAE Full achieves 46.2% task success and 97% of expert visual alignment (0.847 vs. 0.874), with about 6× lower jerk than the human demonstrations (0.61 vs. 3.64 m/s³).
| Method | N | Success ↑ | Visual Align. ↑ | Jerk (m/s³) ↓ | SRR ↑ |
|---|---|---|---|---|---|
| Expert (Human) | 10 | 90.0% | 0.874 | 3.64 | 67.1% |
| CVAE Full | 13 | 46.2% | 0.847 | 0.61 | 32.7% |
| Incremental | 6 | 0.0% | 0.636 | 0.83 | 35.2% |
| RGB Only | 3 | 0.0% | 0.584 | 1.65 | 7.2% |
| Visual (no joints) | 4 | 0.0% | 0.536 | 1.59 | 7.3% |
| RRT* (classical) | 4 | 10.0% | 0.636 | 0.22 | 10.5% |
Success = visual alignment > 0.85 at trajectory end. Visual Alignment = ResNet18 cosine similarity to goal image.
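The success criterion can be sketched as a thresholded cosine similarity between two feature vectors. The example below assumes the ResNet18 embeddings of the final frame and the goal image have already been extracted; the function names and the toy vectors are illustrative only.

```python
import math

SUCCESS_THRESHOLD = 0.85  # threshold stated above

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as no alignment
    return dot / (norm_a * norm_b)

def is_success(final_embedding, goal_embedding, threshold=SUCCESS_THRESHOLD):
    """A rollout succeeds if its final frame aligns with the goal image."""
    return cosine_similarity(final_embedding, goal_embedding) > threshold

# Toy 2-D vectors standing in for ResNet18 features.
print(is_success([1.0, 0.0], [1.0, 0.0]))
print(is_success([1.0, 0.0], [0.0, 1.0]))
```

Because the criterion is applied per rollout, a method's mean alignment can sit just below 0.85 while a fraction of its individual rollouts still clear the threshold.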