Introduction & Roadmap
Deep Learning (DL) has revolutionized modern robotics. Robots no longer merely execute deterministic programs: they now learn from sensor data, camera images, and environmental feedback to make complex autonomous decisions.
Neural Nets → Visual Perception → Temporal Modelling → RL → SLAM & Mapping → Deployment
Deep Learning Foundations
1.1 Artificial Neurons & the Perceptron
A single neuron computes a linear combination of its inputs plus a bias, then passes the result through an activation function:

y = f(W·x + b)

where W is the weight matrix, x the input vector, b the bias, and f a nonlinear activation function.
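In code, the computation above is a single line; a minimal NumPy sketch (the weights, input, and bias below are toy values):

```python
import numpy as np

def neuron(x, W, b, f):
    """A single artificial neuron: activation applied to W.x + b."""
    return f(W @ x + b)

relu = lambda z: np.maximum(0.0, z)

x = np.array([0.5, -1.0, 2.0])   # input vector
W = np.array([0.2, 0.4, 0.1])    # weights
b = 0.05                         # bias
# pre-activation: 0.1 - 0.4 + 0.2 + 0.05 = -0.05, so ReLU clips it to 0
print(neuron(x, W, b, relu))     # 0.0
```

Swapping `relu` for any function in the activation table below changes the neuron's behavior without touching the linear part.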
1.2 Activation Functions
| Function | Formula | Typical use | Strengths |
|---|---|---|---|
| ReLU | max(0, x) | Conventional hidden layers | Fast to compute, mitigates vanishing gradients |
| Leaky ReLU | max(αx, x) | When dying ReLU becomes a problem | Nonzero gradient for x < 0 |
| GELU | x·Φ(x) | Transformers, modern networks | Smooth, better performance in NLP/vision |
| Sigmoid | 1/(1+e⁻ˣ) | Binary outputs | Probabilities in [0, 1] |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | LSTM gates, RNNs | Output in [-1, 1], zero-centered |
| Softmax | eˣⁱ/Σⱼeˣʲ | Multi-class classification | Probability distribution |
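The table entries can be checked numerically; a small NumPy sketch of ReLU, sigmoid, and a numerically stable softmax:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(sigmoid(0.0))  # 0.5
print(softmax(z))    # sums to 1, most mass on the 3.0 logit
```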
1.3 Backpropagation & Chain Rule
The backpropagation algorithm computes the gradient of the loss with respect to every parameter using the chain rule. For a network with L layers, the error signal δ propagates backward layer by layer:

δ⁽ᴸ⁾ = ∇ₐℒ ⊙ f′(z⁽ᴸ⁾),  δ⁽ˡ⁾ = (W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾ ⊙ f′(z⁽ˡ⁾),  ∂ℒ/∂W⁽ˡ⁾ = δ⁽ˡ⁾ (a⁽ˡ⁻¹⁾)ᵀ
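A concrete instance of these backward equations, with the analytic gradient checked against a finite-difference estimate (a toy one-hidden-layer network with randomly seeded values):

```python
import numpy as np

# One hidden layer, squared-error loss; gradients via the chain rule,
# verified against a finite-difference estimate.
rng = np.random.default_rng(0)
x  = rng.normal(size=(3, 1))           # input
y  = rng.normal(size=(1, 1))           # target
W1 = rng.normal(size=(4, 3)); b1 = np.zeros((4, 1))
W2 = rng.normal(size=(1, 4)); b2 = np.zeros((1, 1))

def forward(W1):
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)                   # hidden activation
    yhat = W2 @ a1 + b2                # linear output layer
    loss = 0.5 * np.sum((yhat - y) ** 2)
    return loss, z1, a1, yhat

loss, z1, a1, yhat = forward(W1)

# Backward pass: chain rule, layer by layer
d_yhat = yhat - y                      # dL/dyhat
d_a1   = W2.T @ d_yhat                 # dL/da1
d_z1   = d_a1 * (1 - a1 ** 2)          # tanh'(z) = 1 - tanh(z)^2
dW1    = d_z1 @ x.T                    # dL/dW1

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (forward(W1p)[0] - loss) / eps
print(abs(num - dW1[0, 0]) < 1e-4)     # True
```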
1.4 Modern Optimizers
In robotics, optimizer choice is crucial because training often involves non-stationary data from physical environments:
| Optimizer | Characteristics | Good For |
|---|---|---|
| Adam | Adaptive LR + momentum. β₁=0.9, β₂=0.999 | Default for most architectures |
| AdamW | Adam with decoupled weight decay | Transformers, ViT, CLIP |
| SGD + Nesterov | More stable for fine-tuning | Final stage of transfer learning |
| RMSProp | Adaptive LR per parameter | RL (DQN, A3C) |
| Shampoo / SOAP | Second-order, efficient | Large-model pre-training |
1.5 A Basic MLP Implementation in PyTorch
import torch
import torch.nn as nn
import torch.optim as optim

# Example: predict robot joint positions from IMU sensor readings
class RobotStatePredictor(nn.Module):
    def __init__(self, input_dim=12, hidden=256, output_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.LayerNorm(hidden),   # more stable than BatchNorm for robot data
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.GELU(),
            nn.Linear(hidden, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# Training loop (assumes a `dataloader` yielding (x, y) batches)
model = RobotStatePredictor().cuda()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
loss_fn = nn.HuberLoss()   # robust to sensor outliers

for epoch in range(100):
    for x_batch, y_batch in dataloader:
        pred = model(x_batch.cuda())
        loss = loss_fn(pred, y_batch.cuda())
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
Convolutional Neural Networks & Robot Visual Perception
2.1 CNN Architecture for Robot Vision
CNNs are the backbone of robot visual perception. Their feature hierarchy, from edges to textures to objects, lets a robot understand its environment from raw pixels.
INPUT IMAGE (640×480×3)
     │
┌────▼────────────────────────────────────┐
│ Conv2D 3×3, 64ch → ReLU → BN            │ ← Edge detectors
│ Conv2D 3×3, 64ch → ReLU → MaxPool       │
└────┬────────────────────────────────────┘
     │ 320×240×64
┌────▼────────────────────────────────────┐
│ Conv2D 3×3, 128ch → ReLU → BN           │ ← Texture features
│ Conv2D 3×3, 128ch → ReLU → MaxPool      │
└────┬────────────────────────────────────┘
     │ 160×120×128
┌────▼────────────────────────────────────┐
│ Conv2D 3×3, 256ch → ReLU ×3             │ ← Object parts
│ MaxPool 2×2                             │
└────┬────────────────────────────────────┘
     │
┌────▼────────────────────────────────────┐
│ FPN (Feature Pyramid Network)           │ ← Multi-scale
└────┬────────────────────────────────────┘
     │
Detection Head / Segmentation Head / Depth Head
2.2 Object Detection: YOLOv10 on the Robot
from ultralytics import YOLO
import cv2
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

class RobotVisionNode(Node):
    def __init__(self):
        super().__init__('robot_vision')
        self.model = YOLO('yolov10n.pt')   # nano = fast enough for embedded
        self.model.to('cuda')
        self.bridge = CvBridge()
        self.sub = self.create_subscription(
            Image, '/camera/rgb/image_raw',
            self.image_callback, 10)

    def image_callback(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, 'bgr8')
        results = self.model.track(frame,
                                   persist=True,
                                   conf=0.45,
                                   iou=0.5,
                                   device='cuda:0')
        for r in results:
            boxes = r.boxes.xyxy.cpu().numpy()
            classes = r.boxes.cls.cpu().numpy()
            tracks = r.boxes.id   # ByteTrack IDs
            for i, box in enumerate(boxes):
                self.get_logger().info(
                    f"Obj [{classes[i]:.0f}] ID={tracks[i]} @ {box.astype(int)}")

rclpy.init()
rclpy.spin(RobotVisionNode())
2.3 Semantic Segmentation for Navigation
Semantic segmentation lets the robot understand every pixel in the image, distinguishing floors, walls, humans, and obstacles for safe navigation:
import cv2
import numpy as np
import torch
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

# Load a lightweight SegFormer-B2 for the robot
processor = SegformerImageProcessor.from_pretrained(
    "nvidia/segformer-b2-finetuned-ade-512-512")
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b2-finetuned-ade-512-512").cuda().eval()

def segment_scene(bgr_frame):
    rgb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB)
    inp = processor(images=rgb, return_tensors="pt")
    inp = {k: v.cuda() for k, v in inp.items()}
    with torch.no_grad():
        logits = model(**inp).logits   # (1, 150, H/4, W/4)
    seg = logits.argmax(dim=1).squeeze().cpu().numpy()
    # Resize back to the original frame size
    seg = cv2.resize(seg.astype(np.uint8),
                     (bgr_frame.shape[1], bgr_frame.shape[0]),
                     interpolation=cv2.INTER_NEAREST)
    return seg   # per-pixel class IDs

# Extract "free space" for path planning
FLOOR_IDS = {3, 28}   # floor, carpet in ADE20K
def get_traversable_mask(seg_map):
    return np.isin(seg_map, list(FLOOR_IDS)).astype(np.uint8) * 255
2.4 Depth Estimation & 3D Reconstruction
Monocular depth networks (MiDaS, Depth Anything) recover dense depth from a single RGB camera; combined with the camera intrinsics, a depth map can be lifted into a 3D point cloud for obstacle avoidance and scene reconstruction.
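Whatever network produces the depth map, lifting it into 3D is plain pinhole geometry (X = (u-cx)·Z/fx, Y = (v-cy)·Z/fy); a minimal NumPy sketch, where the intrinsics matrix K and the 2×2 depth map are toy values:

```python
import numpy as np

def depth_to_pointcloud(depth, K):
    """Back-project a depth map (meters) into 3D points in the camera
    frame using the pinhole camera model."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    v, u = np.indices(depth.shape)     # pixel coordinate grids
    Z = depth
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]          # drop invalid (zero-depth) pixels

# Toy 2x2 depth map of a flat plane 1 m away
K = np.array([[100.0, 0.0, 1.0],
              [0.0, 100.0, 1.0],
              [0.0,   0.0, 1.0]])
cloud = depth_to_pointcloud(np.full((2, 2), 1.0), K)
print(cloud.shape)   # (4, 3)
```

The same function works unchanged on a 640×480 depth frame from an RGB-D sensor, given its real intrinsics.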
RNN, LSTM & Temporal Modelling for Robots
3.1 Why Temporal Modelling Matters in Robotics
Robots operate in real time: sensors produce sequences of data, not independent samples. Human motion prediction, repetitive motor control, and localization all depend on temporal context.
3.2 LSTM for Trajectory Prediction
import torch
import torch.nn as nn

class TrajectoryLSTM(nn.Module):
    """
    Predicts pred_len future trajectory steps from a window of history.
    Input:  (batch, seq_len, 4)  with [x, y, vx, vy] per frame
    Output: (batch, pred_len, 2) with predicted [x, y]
    """
    def __init__(self, input_dim=4, hidden=128,
                 layers=2, pred_len=12):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden,
                               num_layers=layers,
                               batch_first=True,
                               dropout=0.2)
        self.decoder = nn.LSTM(hidden, hidden,
                               num_layers=layers,
                               batch_first=True,
                               dropout=0.2)
        self.fc_out = nn.Linear(hidden, 2)
        self.fc_init = nn.Linear(hidden, hidden)
        self.pred_len = pred_len

    def forward(self, x):
        _, (h, c) = self.encoder(x)
        dec_in = self.fc_init(h[-1]).unsqueeze(1)   # decoder seed
        preds = []
        for _ in range(self.pred_len):
            dec_out, (h, c) = self.decoder(dec_in, (h, c))
            step_pred = self.fc_out(dec_out)
            preds.append(step_pred)
            dec_in = dec_out
        return torch.cat(preds, dim=1)   # (B, pred_len, 2)
3.3 Temporal Fusion Transformer (TFT) for Sensor Fusion
For multi-sensor fusion (IMU + GPS + wheel encoders + camera), the Temporal Fusion Transformer excels because it can handle inputs with missing values and different time scales:
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet

# Dataset configuration: multi-modal robot sensors
# (df_sensor is assumed to be a DataFrame with the sensor columns below)
dataset = TimeSeriesDataSet(
    df_sensor,
    time_idx="timestep",
    target="joint_pos_6",              # position of joint 6
    group_ids=["robot_id"],
    max_encoder_length=100,            # 100 steps of history
    max_prediction_length=20,          # predict 20 steps ahead
    time_varying_known_reals=["imu_ax", "imu_ay", "imu_az"],
    time_varying_unknown_reals=["enc_vel", "torque"],
    static_categoricals=["robot_model"],
)

tft = TemporalFusionTransformer.from_dataset(
    dataset,
    learning_rate=1e-3,
    hidden_size=64,
    attention_head_size=4,
    dropout=0.1,
    hidden_continuous_size=32,
)
Reinforcement Learning for Robot Control
4.1 Formulating the RL Problem in Robotics
Robot control is modelled as a Markov Decision Process (MDP), the tuple (S, A, P, R, γ): state space S (sensor readings, joint positions), action space A (motor commands), transition dynamics P(s′|s, a), reward function R(s, a), and discount factor γ. The objective is a policy π(a|s) that maximizes expected discounted return.
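The Bellman backup behind this formulation can be run by hand on a toy problem; a sketch of value iteration on a 3-state corridor MDP (the states, rewards, and γ are illustrative, not a real robot task):

```python
import numpy as np

# Tiny deterministic MDP: 3 states in a row, actions left/right,
# reward 1 for landing in state 2 (the goal), discount gamma = 0.9.
n_states, gamma = 3, 0.9

def step(s, a):                        # a: 0 = left, 1 = right
    s2 = min(max(s + (1 if a else -1), 0), n_states - 1)
    return s2, (1.0 if s2 == 2 else 0.0)

V = np.zeros(n_states)
for _ in range(50):                    # value iteration: Bellman backups
    V = np.array([max(r + gamma * V[s2]
                      for s2, r in (step(s, a) for a in (0, 1)))
                  for s in range(n_states)])

# Greedy policy with respect to the converged values
policy = [max((0, 1), key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in range(n_states)]
print(policy)   # [1, 1, 1]: always move right, toward the goal
```

Real robot MDPs have continuous states and actions, which is why the sections below use policy-gradient methods (PPO, SAC) instead of tabular backups.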
4.2 PPO (Proximal Policy Optimization): the Industry Standard
PPO is an on-policy algorithm that keeps each policy update close to the previous policy, preventing the kind of divergence that is fatal on physical systems.
import gymnasium as gym
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.callbacks import EvalCallback

# Training robot locomotion in MuJoCo
def make_env(rank):
    def _init():
        env = gym.make('HalfCheetah-v4')
        env.reset(seed=rank)
        return env
    return _init

n_envs = 16
vec_env = SubprocVecEnv([make_env(i) for i in range(n_envs)])

model = PPO(
    "MlpPolicy", vec_env,
    learning_rate=3e-4,
    n_steps=2048,          # rollout buffer per env
    batch_size=512,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.005,
    vf_coef=0.5,
    policy_kwargs=dict(
        net_arch=[256, 256],
        activation_fn=nn.Tanh   # Tanh is smoother for control
    ),
    verbose=1,
    device='cuda'
)

eval_cb = EvalCallback(
    gym.make('HalfCheetah-v4'),
    best_model_save_path='./models/best/',
    eval_freq=50_000, n_eval_episodes=10)

model.learn(total_timesteps=5_000_000, callback=eval_cb)
model.save("cheetah_ppo")
4.3 SAC (Soft Actor-Critic) for Continuous Control
SAC optimizes policy entropy alongside reward, yielding policies that are more exploratory and more robust to physical disturbances:
| Algorithm | On/Off-Policy | Action Space | Strengths in Robotics |
|---|---|---|---|
| PPO | On-policy | Continuous / discrete | Stable, easy to tune, good for locomotion |
| SAC | Off-policy | Continuous | Sample-efficient, robust, good for manipulation |
| TD3 | Off-policy | Continuous | Deterministic, high-precision control |
| DDPG | Off-policy | Continuous | Basis of TD3/SAC, classic reference |
| DreamerV3 | Off-policy | Continuous | World model, trains from raw pixels |
4.4 Reward Shaping for Locomotion
import numpy as np

class HumanoidReward:
    """Reward function for a humanoid robot walking upright."""
    def compute(self, obs, action, next_obs, info):
        # Forward velocity (the primary objective)
        r_vel = info['x_velocity'] * 1.0
        # Energy penalty (efficiency)
        r_energy = -0.001 * np.sum(action**2)
        # Fall penalty (center-of-mass height)
        com_height = info['com_height']
        r_alive = 2.0 if com_height > 0.8 else -10.0
        # Jerk penalty (smooth motion)
        jerk = np.linalg.norm(info['joint_acc'])
        r_smooth = -0.01 * jerk
        # Lateral-balance penalty
        lean = abs(info['body_tilt_x'])
        r_upright = -0.5 * lean
        return r_vel + r_energy + r_alive + r_smooth + r_upright
Transfer Learning & Foundation Models in Robotics
5.1 Why Transfer Learning Is Crucial for Robots
Labeled data from physical robots is expensive and dangerous to collect. Transfer learning leverages knowledge from large datasets (ImageNet, CLIP, large language models) and adapts it to robot tasks with minimal data.
- Pre-trained vision backbones (ResNet, ViT) save 10-100x in labeled data
- CLIP embeddings as a universal representation for manipulation
- Foundation models (RT-2, Octo) generalize across robot embodiments
- LoRA / QLoRA fine-tuning of VLMs on limited GPUs
5.2 Fine-tuning ViT for Robot Object Classification
import torch
from transformers import ViTForImageClassification, ViTImageProcessor

# Load ViT-B/16 pre-trained on ImageNet-21k
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=20,                     # our 20 robot object categories
    ignore_mismatched_sizes=True)

# Freeze everything except the last 4 blocks + head
for name, param in model.vit.named_parameters():
    layer_num = int(name.split('.')[2]) if name.startswith('encoder.layer') else -1
    param.requires_grad = (layer_num >= 8)   # fine-tune layers 8-11

# Differential learning rates
optimizer = torch.optim.AdamW([
    {'params': model.vit.encoder.layer[8:12].parameters(), 'lr': 1e-5},
    {'params': model.classifier.parameters(), 'lr': 1e-4},
], weight_decay=0.01)
5.3 RT-2: Vision-Language-Action Models
RT-2 co-fine-tunes a large vision-language model on web data and robot trajectories, emitting robot actions as text tokens. This transfers web-scale semantic knowledge directly into closed-loop control, letting the robot generalize to objects and instructions never seen in its own training data.
5.4 CLIP as a Zero-Shot Grounding Module
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cuda")

def find_target_object(rgb_frame, text_query: str, candidate_crops: list):
    """
    Zero-shot: find the crop that best matches a text instruction.
    Example: text_query = "the red cup on the left"
    """
    images = [preprocess(Image.fromarray(c)).unsqueeze(0) for c in candidate_crops]
    images = torch.cat(images).cuda()
    text = clip.tokenize([text_query]).cuda()
    with torch.no_grad():
        img_feat = model.encode_image(images)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ txt_feat.T).squeeze()
    best_idx = sims.argmax().item()
    return candidate_crops[best_idx], sims[best_idx].item()
Advanced Robot Vision: 3D, Point Clouds & Pose Estimation
6.1 6-DoF Pose Estimation
6D pose estimation (position + orientation) is key for robot arms: the robot must know exactly where an object is in 3D before it can pick it up.
# Using FoundPose / GigaPose for zero-shot 6D pose
from gigapose import GigaPoseEstimator
import numpy as np

estimator = GigaPoseEstimator(device='cuda')

# Estimation from an RGB-D frame
def estimate_grasp_pose(rgb, depth, K_intrinsics, obj_mesh_path):
    result = estimator.estimate(
        rgb=rgb,
        depth=depth,
        K=K_intrinsics,
        mesh_path=obj_mesh_path,
        n_hypotheses=100
    )
    R = result['R']          # rotation matrix (3×3)
    t = result['t']          # translation vector (3,)
    score = result['score']
    return R, t, score

# Conversion into the robot base frame
def camera_to_robot(R_cam, t_cam, T_cam2base):
    """Transform a pose from the camera frame into the robot base frame."""
    T_obj_cam = np.eye(4)
    T_obj_cam[:3, :3] = R_cam
    T_obj_cam[:3, 3] = t_cam
    T_obj_robot = T_cam2base @ T_obj_cam
    return T_obj_robot
6.2 PointNet++ for 3D Classification
import torch
import torch.nn as nn
from pointnet2_ops import pointnet2_utils as pn2

class PointNet2Encoder(nn.Module):
    """Point cloud encoder for robot scene understanding."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.sa1 = pn2.PointnetSAModule(
            npoint=512, radius=0.2, nsample=32,
            mlp=[0, 64, 64, 128])
        self.sa2 = pn2.PointnetSAModule(
            npoint=128, radius=0.4, nsample=64,
            mlp=[128, 128, 128, 256])
        self.sa3 = pn2.PointnetSAModule(
            mlp=[256, 256, 512, out_dim])

    def forward(self, xyz):
        # xyz: (B, N, 3), N points from a lidar / depth camera
        xyz = xyz.contiguous()
        xyz, feat = self.sa1(xyz, None)
        xyz, feat = self.sa2(xyz, feat)
        _, feat = self.sa3(xyz, feat)
        return feat.squeeze(-1)   # (B, out_dim)
End-to-End Control & Neural Navigation
7.1 End-to-End Imitation Learning
Behavior Cloning (BC) trains a policy directly from human demonstrations: the robot imitates the expert without needing a reward function:
import torch
import torch.nn as nn
from torchvision import models

class BehaviorCloningPolicy(nn.Module):
    """
    End-to-end: (image, proprio_state) -> robot_action
    A simplified ACT (Action Chunking Transformer) architecture.
    """
    def __init__(self):
        super().__init__()
        # Visual encoder: ResNet18 backbone
        resnet = models.resnet18(weights='IMAGENET1K_V1')
        self.visual = nn.Sequential(*list(resnet.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        # Proprioception encoder (joint angles, velocities)
        self.proprio = nn.Sequential(
            nn.Linear(14, 64), nn.GELU(),
            nn.Linear(64, 64))
        # Fusion transformer
        d_model = 512
        self.img_proj = nn.Linear(512*4*4, d_model)
        self.prop_proj = nn.Linear(64, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8,
                                               dim_feedforward=2048, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Action head: output a chunk of 100 future actions (chunked policy)
        self.action_head = nn.Linear(d_model, 14 * 100)

    def forward(self, img, proprio):
        B = img.shape[0]
        v = self.pool(self.visual(img)).flatten(1)
        v = self.img_proj(v).unsqueeze(1)
        p = self.prop_proj(self.proprio(proprio)).unsqueeze(1)
        tokens = torch.cat([v, p], dim=1)
        out = self.transformer(tokens).mean(1)
        actions = self.action_head(out).reshape(B, 100, 14)
        return actions
7.2 Neural SLAM & Autonomous Navigation
NeRF-SLAM and Gaussian-splatting SLAM let a robot build an implicit 3D map from a monocular camera, without LiDAR:
┌──────────────┐     ┌───────────────┐     ┌──────────────┐
│ RGB-D Camera │────▶│ Pose Tracker  │────▶│ Map Builder  │
│   (30 fps)   │     │ (ORB-SLAM3 /  │     │  (NeRF /     │
└──────────────┘     │  DROID-SLAM)  │     │  3DGS scene) │
                     └───────┬───────┘     └──────┬───────┘
                             │ T_wc              │ Dense map
                             ▼                   ▼
                     ┌───────────────┐     ┌──────────────┐
                     │ Global Planner│◀────│  Occupancy   │
                     │ (A*/Dijkstra) │     │ Grid Extract │
                     └───────┬───────┘     └──────────────┘
                             │ Waypoints
                             ▼
                     ┌───────────────┐
                     │ Local Planner │
                     │ (DWA / TEB /  │
                     │   learned)    │
                     └───────┬───────┘
                             │ cmd_vel
                             ▼
                        Robot Motors
7.3 Learned Local Planner with RL
# Custom Nav2 local planner based on RL
import rclpy
import numpy as np
import torch
from nav2_simple_commander.robot_navigator import BasicNavigator
from geometry_msgs.msg import PoseStamped

# Policy trained with PPO in sim, now running on the robot
rl_policy = torch.jit.load('nav_policy_sim2real.pt').cuda().eval()

def get_rl_cmd_vel(laser_scan, goal_vec, current_vel):
    """
    Input:  360-ray laser scan + goal direction + current velocity
    Output: linear velocity + angular velocity
    """
    obs = torch.FloatTensor(
        np.concatenate([laser_scan, goal_vec, current_vel])
    ).unsqueeze(0).cuda()
    with torch.no_grad():
        action = rl_policy(obs).squeeze().cpu().numpy()
    linear_vel = np.clip(action[0], -0.5, 0.5)
    angular_vel = np.clip(action[1], -1.0, 1.0)
    return linear_vel, angular_vel
Object Manipulation with Deep Learning
8.1 Neural Grasp Planning
GraspNet-1Billion and Contact-GraspNet predict thousands of valid grasp configurations from a point cloud or RGB-D input, then select the best one based on grasp quality and robot kinematics:
from contact_graspnet_pytorch import ContactGraspNetModel
import numpy as np

grasp_model = ContactGraspNetModel().cuda().eval()

def plan_grasp(depth_image, rgb_image, K, segmask=None):
    """
    Return a list of grasp poses (R, t, quality) sorted by quality.
    """
    pred_grasps, scores, contact_pts = grasp_model.predict(
        depth_image=depth_image,
        rgb=rgb_image,
        K=K,
        segmask=segmask,
        z_range=[0.2, 1.8],
        local_regions=True,
        filter_grasps=True
    )
    # Sort by score, descending
    sorted_idx = np.argsort(scores)[::-1]
    return [(pred_grasps[i], scores[i]) for i in sorted_idx[:10]]
8.2 Diffusion Policy — State-of-the-Art Imitation Learning
from diffusion_policy import DiffusionUnetHybridImagePolicy

# Diffusion Policy configuration with a CNN backbone
config = {
    "obs_encoder": {
        "type": "MultiImageObsEncoder",
        "rgb_model": "resnet18",
        "obs_groups": {
            "camera_0": [3, 96, 96],     # wrist cam
            "camera_1": [3, 96, 96],     # overhead cam
            "robot_eef_pos": [3],
            "robot_eef_quat": [4],
            "robot_gripper": [1],
        }
    },
    "noise_scheduler": {
        "type": "DDIMScheduler",
        "num_train_timesteps": 100,
        "beta_schedule": "squaredcos_cap_v2",
    },
    "horizon": 16,           # predict 16 actions ahead
    "n_action_steps": 8,     # execute 8 before re-planning
    "n_obs_steps": 2,
    "action_dim": 10,        # 7-DoF arm + 3 gripper
}
policy = DiffusionUnetHybridImagePolicy(**config)
Sim-to-Real Transfer
9.1 Domain Randomization
Train with massive variation of physics and visual parameters in the simulator so the policy becomes robust to the mismatch between sim and reality:
from isaacgym import gymapi, gymtorch
import numpy as np

class DomainRandomizer:
    """Randomize the domain in Isaac Gym every N episodes."""
    DR_RANGES = {
        'friction': (0.1, 3.0),
        'restitution': (0.0, 0.9),
        'mass_scale': (0.5, 2.0),
        'stiffness': (200, 800),
        'damping': (10, 80),
        'gravity_z': (-9.8, -10.2),
    }
    VISUAL_DR = {
        'albedo': (0.3, 0.9),
        'roughness': (0.0, 1.0),
        'light_intensity': (500, 2000),
        'cam_noise': (0.0, 0.03),
    }

    def randomize_episode(self, env, actor_handle):
        for prop, (lo, hi) in self.DR_RANGES.items():
            val = np.random.uniform(lo, hi)
            self._set_property(env, actor_handle, prop, val)

    def add_observation_noise(self, obs, sigma=0.01):
        return obs + np.random.normal(0, sigma, obs.shape)
9.2 Adaptive Domain Randomization (ADR)
OpenAI's Dactyl used ADR: start from an easy environment, then automatically widen the randomization ranges as long as the agent keeps succeeding:
| Technique | Description | Strengths |
|---|---|---|
| UDR | Uniform Domain Randomization: fixed ranges, uniform distribution | Simple, strong baseline |
| ADR | Randomization ranges grown automatically based on performance | Scalable, used for Dactyl |
| RARL | Adversarial RL: an adversary learns worst-case perturbations | Very robust policies |
| Grounded Sim | Identify sim parameters from real data via Bayesian optimization | Minimizes the realism gap |
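The ADR row above can be sketched in a few lines: a randomization range that widens while the agent succeeds and contracts when it struggles. The success thresholds and step size below are illustrative, not values from the Dactyl paper:

```python
import random

class AdaptiveDR:
    """ADR sketch for one physics parameter (e.g. a friction coefficient):
    expand the sampling range on success, contract it on failure."""
    def __init__(self, lo=0.9, hi=1.1, step=0.05, bounds=(0.1, 3.0)):
        self.lo, self.hi, self.step, self.bounds = lo, hi, step, bounds

    def sample(self):
        return random.uniform(self.lo, self.hi)

    def update(self, success_rate):
        if success_rate > 0.8:        # agent copes: widen the range
            self.lo = max(self.bounds[0], self.lo - self.step)
            self.hi = min(self.bounds[1], self.hi + self.step)
        elif success_rate < 0.5:      # agent struggles: shrink it
            self.lo = min(self.lo + self.step, self.hi)
            self.hi = max(self.hi - self.step, self.lo)

dr = AdaptiveDR()
dr.update(0.95)                       # a successful evaluation round
print(round(dr.lo, 2), round(dr.hi, 2))   # 0.85 1.15
```

In a full training loop, `sample()` would feed the simulator's parameter setter each episode and `update()` would run after every evaluation batch.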
Edge Deployment: Model Optimization for Embedded Robots
10.1 The Optimization Pipeline
- Train the full model in PyTorch (FP32)
- Pruning: remove unimportant weights (magnitude / structured pruning)
- Knowledge distillation: train a student model from a large teacher
- Post-training quantization (PTQ): FP32 → INT8/FP16
- Export to ONNX → TensorRT conversion
- Benchmark latency + accuracy on the target hardware
- Deploy via a TensorRT engine or TFLite/CoreML
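The PTQ step of the pipeline reduces to a scale computation; a NumPy sketch of symmetric per-tensor INT8 quantization and the rounding error it introduces (real toolchains like TensorRT also calibrate per-channel scales and activation ranges):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor PTQ: map FP32 weights to INT8 with a single
    scale so that the largest |weight| lands on 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(0, 0.1, size=1000).astype(np.float32)
q, scale = quantize_int8(w)
# Round-to-nearest bounds the per-weight error by half a quantization step
err = np.abs(w - dequantize(q, scale)).max()
print(q.dtype, bool(err <= scale / 2))   # int8 True
```

This is why INT8 usually costs little accuracy: the worst-case weight error is half a quantization step, which for well-scaled layers is tiny relative to the weight magnitudes.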
10.2 TensorRT Deployment
import tensorrt as trt
import torch
import numpy as np

# Step 1: export to ONNX
model.eval()
dummy = torch.randn(1, 3, 640, 640).cuda()
torch.onnx.export(
    model, dummy, "robot_perception.onnx",
    opset_version=17,
    input_names=['image'],
    output_names=['detections', 'embeddings'],
    dynamic_axes={'image': {0: 'batch'}})

# Step 2: ONNX -> TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # or INT8 with a calibrator
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

with open('robot_perception.onnx', 'rb') as f:
    parser.parse(f.read())

engine = builder.build_serialized_network(network, config)
with open('robot_perception.trt', 'wb') as f:
    f.write(engine)
print("TensorRT engine saved: FP16 inference ready on the Jetson!")
10.3 Hardware Comparison for Robot AI
| Hardware | TOPS | Power | Best For | Price |
|---|---|---|---|---|
| NVIDIA Jetson Orin NX 16GB | 70 | 10-15W | Mobile robots, drones, arms | ~$500 |
| NVIDIA Jetson AGX Orin 64GB | 275 | 15-60W | Industrial robots, autonomous vehicles | ~$1400 |
| Intel RealSense D457 + NUC | ~20 | 28W | Research robots, indoor nav | ~$800 |
| Coral Edge TPU | 4 | 2W | Ultra-low power, inference only | ~$60 |
| RK3588 NPU (Orange Pi 5) | 6 | 5-8W | Budget robots, education | ~$80 |
DL Model Security in BlueDragonSec Robot Systems
11.1 Adversarial Attacks on Robot Vision
Adversarial examples are inputs deliberately modified with small perturbations to fool a DL model:
# PGD (Projected Gradient Descent) attack on robot perception
import torch
import torch.nn.functional as F

def pgd_attack(model, image, target_class, epsilon=8/255,
               alpha=2/255, iters=40):
    """
    PGD adversarial attack for testing a robot model's robustness.
    ONLY for defensive testing / red-teaming.
    """
    x_adv = image.clone().detach().requires_grad_(True)
    orig = image.clone().detach()
    for _ in range(iters):
        logits = model(x_adv)
        loss = F.cross_entropy(logits, target_class)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            x_adv += alpha * x_adv.grad.sign()
            x_adv = torch.clamp(orig + torch.clamp(x_adv - orig,
                                                   -epsilon, epsilon), 0, 1)
        x_adv = x_adv.detach().requires_grad_(True)
    return x_adv.detach()

# Adversarial training: the strongest known defense
def adversarial_train_step(model, optimizer, x, y, epsilon=4/255):
    # Generate adversarial examples in the inner loop
    x_adv = pgd_attack(model, x, y, epsilon=epsilon, iters=7)
    model.train()
    logits = model(x_adv)
    loss = F.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
11.2 Model Backdoor & Data Poisoning
| Attack Type | Mechanism | Impact on the Robot | Mitigation |
|---|---|---|---|
| Backdoor attack | Hidden trigger pattern in the training data causes misclassification | Robot misidentifies a "friend" as a threat, or vice versa | Neural Cleanse, STRIP, spectral signatures |
| Data poisoning | Tampered training dataset causes systematic performance degradation | Navigation model steers the robot into dangerous locations | Data provenance, anomaly detection |
| Model extraction | Repeated queries clone the model's functionality | Commercial model IP cloned | Rate limiting, output perturbation, watermarking |
| Adversarial patch | Physical sticker placed in the environment fools robot vision | Robot ignores a stop sign or a human | Ensembles, certified defenses, physical robustness testing |
11.3 Certified Robustness & Defenses
# Randomized Smoothing: certified L2 robustness
# (the Smooth wrapper follows Cohen et al.'s reference implementation)
from smoothing import Smooth

# Wrap the model with a randomized-smoothing certifier
smooth_model = Smooth(
    base_classifier=model,
    num_classes=10,
    sigma=0.25        # noise std for certification
)

# Prediction with a certified L2 radius
prediction, radius = smooth_model.certify(
    x=image,
    n0=100,           # initial samples
    n=100_000,        # certification samples
    alpha=0.001,      # 99.9% confidence level
    batch_size=1000
)
if prediction != -1:
    print(f"Predicted: {prediction}, certified radius: {radius:.3f}")
    # The robot can SAFELY act on this as long as the perturbation ≤ radius
Capstone Project: an Autonomous Pick-and-Place Robot
This capstone project integrates every module: vision, perception, planning, and control come together in a single autonomous robot arm that picks and places objects from natural-language instructions.
AUTONOMOUS PICK-AND-PLACE SYSTEM ARCHITECTURE

[Language instruction] "put the red cup into the blue box"
        │
        ▼
┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│  LLM Parser │────▶│ CLIP Visual │────▶│  Grasp Net   │
│  (GPT-4o /  │     │  Grounding  │     │  (6D Pose)   │
│  Qwen2-VL)  │     │             │     │              │
└─────────────┘     └──────┬──────┘     └──────┬───────┘
                           │ target bbox      │ T_grasp
                           ▼                  ▼
┌─────────────────────────────────────────────────────┐
│                   Motion Planner                    │
│      MoveIt2 + learned collision avoidance (RL)     │
└────────────────────┬────────────────────────────────┘
                     │ joint trajectory
                     ▼
┌─────────────────────────────────────────────────────┐
│  Low-level Controller: PD + residual RL policy      │
│  (100 Hz control loop on the robot arm hardware)    │
└─────────────────────────────────────────────────────┘
12.1 Full Integration Code
#!/usr/bin/env python3
# ─────────────────────────────────────────────────
# BlueDragonSec — Pick-and-Place Integration
# Robot: UR5e + Robotiq 2F-85 + RealSense D435i
# ROS2 Humble + MoveIt2
# ─────────────────────────────────────────────────
import rclpy
from rclpy.node import Node
import numpy as np
import torch
import clip, json
from moveit2 import MoveIt2
from openai import OpenAI
from ultralytics import YOLO

# Helpers assumed defined elsewhere in the project:
# load_graspnet(), crop_bbox(), find_target_object(),
# build_pose(), get_current_rgbd_frame()

class PickPlaceRobot(Node):
    def __init__(self):
        super().__init__('pick_place_node')
        self.moveit = MoveIt2(node=self, joint_names=[...])
        self.llm = OpenAI().chat
        self.clip_mdl, self.clip_pp = clip.load("ViT-L/14", device="cuda")
        self.grasp_net = load_graspnet()
        self.detector = YOLO('yolov10s.pt').cuda()

    def execute_instruction(self, instruction: str, rgbd_frame):
        # 1. Parse the language instruction into target + destination
        parsed = self._parse_instruction(instruction)
        target_desc = parsed['pick_object']
        dest_desc = parsed['place_location']
        # 2. Detect every object in the scene
        detections = self.detector(rgbd_frame.rgb)
        crops = [crop_bbox(rgbd_frame.rgb, d.box) for d in detections]
        # 3. CLIP grounding for the target
        target_crop, conf = find_target_object(
            rgbd_frame.rgb, target_desc, crops)
        self.get_logger().info(f"Target found (conf={conf:.2f}): {target_desc}")
        # 4. 6D pose estimation
        R, t, grasp_q = self.grasp_net.predict(
            rgbd_frame.depth, rgbd_frame.K)
        # 5. Plan & execute: pre-grasp, grasp, lift, place
        T_grasp = build_pose(R, t)
        for waypoint in self._plan_trajectory(T_grasp, dest_desc):
            self.moveit.move_to_pose(waypoint, velocity_scaling=0.4)

    def _parse_instruction(self, instruction):
        response = self.llm.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Parse robot instruction to JSON: {pick_object, place_location}. ONLY return JSON."
            }, {"role": "user", "content": instruction}])
        return json.loads(response.choices[0].message.content)

if __name__ == '__main__':
    rclpy.init()
    robot = PickPlaceRobot()
    robot.execute_instruction(
        "pick up the water bottle and place it in the green basket",
        get_current_rgbd_frame()
    )
    rclpy.spin(robot)
12.2 Robot AI Deployment Checklist
1. Validate model accuracy in a test environment before deploying to physical hardware: minimum 95% precision/recall for safety-critical perception.
2. Implement a safety watchdog: an independent process that monitors model outputs and stops the robot on anomalies (low confidence, out-of-distribution inputs).
3. Wire the hardware e-stop to a signal path separate from software; it must never be overridable by the DL model.
4. Log every inference (input images, model outputs, robot actions) for forensic analysis if an incident occurs.
5. Red-team testing: probe adversarial patches, distribution shift (different lighting, novel objects), and failure modes before operational deployment.
6. Document a model card: training data, performance metrics, known limitations, intended use, out-of-scope scenarios.
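Item 2's watchdog can be prototyped in a few lines. A hedged sketch with illustrative thresholds (a real watchdog would run as a separate process and drive the e-stop interface rather than return strings):

```python
from collections import deque

class SafetyWatchdog:
    """Independent output monitor: trip on low model confidence or on a
    sudden drop relative to recent history (a crude OOD signal).
    All thresholds here are illustrative."""
    def __init__(self, min_conf=0.6, window=50, max_drift=3.0):
        self.min_conf, self.max_drift = min_conf, max_drift
        self.confs = deque(maxlen=window)   # recent confidence history

    def check(self, confidence):
        self.confs.append(confidence)
        mean = sum(self.confs) / len(self.confs)
        if confidence < self.min_conf:
            return "STOP: low confidence"
        # Only judge drift once enough history has accumulated
        if len(self.confs) > 10 and confidence < mean / self.max_drift:
            return "STOP: distribution drift"
        return "OK"

wd = SafetyWatchdog()
print(wd.check(0.95))   # OK
print(wd.check(0.30))   # STOP: low confidence
```

In line with item 3 of the checklist, this software monitor complements, and never replaces, the hardware e-stop path.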
References, Datasets & Resources
Key Datasets for DL Robotics
| Dataset | Task | Size | Notes |
|---|---|---|---|
| Open X-Embodiment | Robot manipulation (22 robot) | 1M+ episodes | Google, cross-robot generalization |
| DROID | Diverse robot interaction | 76k demos | Stanford, in-the-wild manipulation |
| BridgeData V2 | Table-top manipulation | 60k demos | UC Berkeley, diverse tasks |
| nuScenes | Autonomous driving perception | 1000 scenes | LiDAR + camera fusion |
| Replica / ScanNet | Indoor scene reconstruction | 18/1500 scenes | SLAM, navigation research |
| YCB Video | 6D object pose | 92 videos | Standard benchmark pose estimation |
| LEROBOT (HuggingFace) | Multi-task robot learning | Growing | Community-driven, ACT/Diffusion Policy |
Must-Read Papers
- Attention Is All You Need (Vaswani et al., 2017): the Transformer foundation
- PPO: Proximal Policy Optimization (Schulman et al., 2017): the RL standard
- SAC: Soft Actor-Critic (Haarnoja et al., 2018): continuous-control RL
- RT-2: Vision-Language-Action Models (Brohan et al., 2023): foundation models for robots
- Diffusion Policy (Chi et al., 2023): state-of-the-art imitation learning
- Contact-GraspNet (Sundermeyer et al., 2021): 6D grasp generation
- NeRF: Representing Scenes as Neural Radiance Fields (Mildenhall et al., 2020)
- DreamerV3 (Hafner et al., 2023): world models for RL
- ACT: Action Chunking with Transformers (Zhao et al., 2023): low-cost robot learning
- Towards Safe & Adversarially Robust RL: safety in robot RL