TaskNPoint: How to Teach Your Humanoid to Hit a Backhand in Minutes

TL;DR: A simple, tuning-free recipe for teaching humanoids to perform dynamic skills.

Abstract

How do we learn to hit a tennis backhand? Not from a thousand hours of tennis tournaments on TV: we work with a coach and practice. We argue this is also the right recipe for teaching dynamic skills to humanoid robots. This follows from a structural property of dynamic skills: the outcome is decided by a short, crucial portion of the trajectory (for a backhand, the ~20 cm of racket travel around ball contact). Getting this interaction window right requires coordinating the whole motion, so that control, physics, and morphology act in concert. Learning thus reduces to mastering a handful of distinct actions and, for each, practicing until the window comes out right. To this end, we introduce TaskNPoint, a training protocol that makes the coach-learner division of labor explicit. The human coach contributes four inputs: a discrete set of skills (e.g. different tennis shots), one demonstration per skill, identification of the interaction window, and the goal (e.g. ball placement). Learning in a physically realistic simulation environment fills in the trajectory of each action and provides robustness to unmodeled events. Crucially, randomized target sampling during training lets a single demonstration generalize zero-shot to unseen goal locations. We test this approach on a Unitree G1 humanoid that hits forehands and backhands against balls thrown by a human, kicks incoming soccer balls, and picks and places boxes from novel locations. We find that learning succeeds from a short human video demonstration and under an hour of training on a single GPU, with no per-task reward tuning and no new demonstration per target.

Method

TaskNPoint Overview

TaskNPoint is a real-to-sim-to-real pipeline with four stages.

(1) Human Pose Reconstruction. We collect monocular (or multi-view) video of a human coach demonstrating each skill and recover 3D SMPL-X body pose estimates using PromptHMR. For in-the-wild multi-view footage, we fuse per-camera estimates into a maximum-likelihood consensus pose to suppress depth-axis drift.

(2) TaskNPoint Abstraction. We kinematically retarget the human motion to the robot morphology with GMR, then annotate the single contact frame (e.g., racket–ball impact) to extract the nominal 3D target point p*, target velocity ν*, and target orientation n*. This compact goal tuple, one point per skill, is the only task-specific information the policy receives.

(3) RL Policy. We train a single goal-conditioned PPO policy in MJlab simulation. Target points are randomized around the nominal during training so that a single demonstration generalizes zero-shot to novel goal locations.

(4) Deployment. An OptiTrack motion-capture system estimates the incoming object trajectory in real time; a Kalman filter and hybrid dynamics model forward-propagate the trajectory, and the nearest motion class is selected via a Voronoi partition over the workspace.
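Stage (4) forward-propagates the tracked object to find where the robot should meet it. The following is a minimal sketch of that step only, assuming drag-free flight with no bounce, which is a deliberate simplification of the hybrid dynamics model mentioned above. The function name `predict_crossing`, the coordinate convention (z up), and the idea of intersecting a vertical plane are illustrative assumptions, not details from the paper; the Kalman filter that would supply the state estimate is omitted.

```python
GRAVITY = 9.81  # gravitational acceleration in m/s^2


def predict_crossing(p0, v0, x_plane):
    """Predict when and where a drag-free ball crosses the plane x = x_plane.

    p0, v0: current position (m) and velocity (m/s) as (x, y, z) tuples, z up.
    Returns (t, (x, y, z)), or None if the ball is not moving towards the plane.
    Assumption: pure ballistic flight, no drag, spin, or bounce.
    """
    dx = x_plane - p0[0]
    if v0[0] == 0.0 or dx / v0[0] < 0.0:
        return None  # plane is never reached in forward time
    t = dx / v0[0]                                   # constant x-velocity
    y = p0[1] + v0[1] * t                            # constant y-velocity
    z = p0[2] + v0[2] * t - 0.5 * GRAVITY * t * t    # gravity acts on z only
    return t, (x_plane, y, z)


if __name__ == "__main__":
    # Ball 4 m in front of the robot, flying towards it at 8 m/s.
    print(predict_crossing((4.0, 0.0, 1.0), (-8.0, 1.0, 3.0), 0.0))
```

In this example the ball reaches the plane after 0.5 s at a height of roughly 1.27 m. A real deployment would re-run the prediction every time the filter updates its estimate.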

Policy Architecture

TaskNPoint Abstraction

Intuition. Consider what a tennis player must do when returning an incoming ball. First, they must decide which shot to play — forehand, backhand, smash, or volley. This is a discrete choice from a finite vocabulary of actions. Second, once the shot is chosen, they must shape it depending on the incoming ball’s speed, spin, and trajectory, and on the desired destination. Shaping includes the footwork, preparation, and follow-through that produce the desired impact quality — a choice within a continuum of possibilities. This gives a natural hierarchy: discrete action selection followed by continuous action shaping. TaskNPoint uses RL for shaping and a simple hand-designed policy for selection. The decomposition generalizes beyond tennis to ball kicking and box pick-and-place.
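The hand-designed selection step can be very small. The overview describes it as choosing the nearest motion class via a Voronoi partition of the workspace, which amounts to a nearest-neighbor lookup over the nominal target points. The sketch below assumes a Euclidean metric; the library contents, coordinates, and the name `select_motion` are hypothetical placeholders rather than values from the paper.

```python
import math

# Hypothetical motion library: skill name -> nominal target point p* in metres.
# The coordinates are illustrative placeholders only.
MOTION_LIBRARY = {
    "forehand": (0.6, -0.5, 0.9),
    "backhand": (0.6, 0.5, 0.9),
}


def select_motion(predicted_point, library=MOTION_LIBRARY):
    """Return the skill whose nominal target is closest to the predicted point.

    Choosing the nearest site is the same as asking which Voronoi cell
    the predicted contact point falls into.
    """
    return min(library, key=lambda name: math.dist(library[name], predicted_point))


if __name__ == "__main__":
    print(select_motion((0.5, -0.3, 1.0)))
    print(select_motion((0.7, 0.4, 0.8)))
```

The chosen skill then determines which nominal goal the continuous, RL-trained shaping policy is conditioned on.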

For a given action (e.g. a backhand), four interrelated entities are at play: the control, the trajectory A, the interaction J, and the goal G. Control drives the robot along a trajectory, a time sequence of joint configurations q_{0:T}. The end-effector’s position and velocity at the moment of interaction (e.g. racket–ball contact) determine the outcome, which ideally meets the goal (e.g. ball placement). Thus G is a function of J, which is a function of A, which is a function of the control. Learning must invert this chain. TaskNPoint simplifies the problem by adopting impact quality as the goal, identifying G := J, so only one map, from goal to trajectory, needs to be learned.
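One way to write this chain compactly, reading A as the trajectory and J as the interaction, is given below. The map names f, h, g, the control symbol u, and the learned map \pi are hypothetical notation introduced here for illustration and do not appear in the original.

```latex
A = f(u), \qquad J = h(A), \qquad G = g(J)
\quad\Longrightarrow\quad G = (g \circ h \circ f)(u).
% With the identification G := J, the map g is the identity, so it suffices
% to learn a single map \pi from goal to trajectory such that h(\pi(G)) = G.
```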

Formalization. The teacher identifies the crucial moment in the demonstration: racket–ball contact in tennis, foot–ball contact in soccer, hand–box contact in pick-and-place. For a tennis shot, this interaction is parameterized by the time t* ∈ ℝ, the 3D location p* ∈ ℝ³, the velocity direction ν* ∈ ℝ³, and the orientation n* ∈ S² of the racket at impact. The goal is the tuple G* = (p*, ν*, n*, t*).

One demonstration is enough to specify the shape of the motion, but not enough to handle the full range of incoming ball trajectories. To train a robust policy, target points are randomized during training: p ~ N(p*, Σ), and similarly for ν and n. Position, velocity, and orientation rewards are averaged over a small phase window around impact, Ω = (t* − δt, t* + δt), keeping the reward signal sharp. The full set of actions and nominal goals forms a motion library used to train the policy.
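The randomization and windowed averaging described above could be sketched as follows. This is only an illustration under stated assumptions: Σ is taken to be diagonal with made-up standard deviations, the orientation is re-normalized after perturbation so it stays on the unit sphere, and the per-step reward values are assumed to be given, since the reward terms themselves are not specified here. The names `Goal`, `sample_goal`, and `windowed_reward` are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    p: tuple  # contact position (m)
    v: tuple  # velocity at contact
    n: tuple  # unit orientation vector on the sphere S^2
    t: float  # contact time t* (s)


def _normalize(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return tuple(c / norm for c in vec)


def sample_goal(nominal, sigma_p=0.10, sigma_v=0.5, sigma_n=0.05, rng=random):
    """Perturb the nominal goal with independent Gaussian noise.

    Assumption: diagonal covariance; the sigma defaults are placeholders.
    The contact time is left at its nominal value.
    """
    p = tuple(c + rng.gauss(0.0, sigma_p) for c in nominal.p)
    v = tuple(c + rng.gauss(0.0, sigma_v) for c in nominal.v)
    n = _normalize(tuple(c + rng.gauss(0.0, sigma_n) for c in nominal.n))
    return Goal(p, v, n, nominal.t)


def windowed_reward(times, rewards, t_star, delta_t):
    """Average per-step rewards over the open window (t* - dt, t* + dt)."""
    in_window = [r for t, r in zip(times, rewards) if abs(t - t_star) < delta_t]
    return sum(in_window) / len(in_window) if in_window else 0.0


if __name__ == "__main__":
    nominal = Goal(p=(0.6, 0.5, 0.9), v=(3.0, 0.0, 1.0), n=(1.0, 0.0, 0.0), t=0.8)
    print(sample_goal(nominal, rng=random.Random(0)))
    print(windowed_reward([0.0, 0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 2.0, 3.0, 4.0], 0.2, 0.15))
```

Only the three steps inside the window contribute to the second printed value; steps outside Ω are ignored entirely, which is what keeps the signal concentrated around impact.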

Training rewards — motion tubules color-coded by reward value

TaskNPoint space coverage across motions

Qualitative Results

We train TaskNPoint on a small set of reference demonstrations of tennis shots, soccer kicks, and box pick-and-place. Because incoming target trajectories and positions are randomized during training, the policy generalizes to unseen target locations in deployment.

Tennis

Soccer

Box pickup

Simulation Results

We compare TaskNPoint against four prior methods (SkillMimic, OmniRetarget, HDMI, and HumanX) on two tasks: ballistic hitting (tennis forehand/backhand) and cargo pickup (box pick-and-place). We report success rate (SR), generalized success rate (GSR) on novel trajectories, and target position error e_b. TaskNPoint achieves the highest GSR on both tasks (93.0% and 98.0%) and the lowest target position error (0.02 m and 4.85 cm); its hitting SR of 99.5% is marginally below HumanX's 100%.

Method	Hitting SR↑	Hitting GSR↑	Hitting e_b (m)↓	Box SR↑	Box GSR↑	Box e_b (cm)↓
SkillMimic	68.2%	30.9%	0.48	0.4%	0.0%	28.3
OmniRetarget	75.8%	20.4%	0.15	0.0%	0.0%	13.3
HDMI	90.0%	25.3%	0.07	95.8%	1.8%	13.1
HumanX	100%	90.6%	0.09	99.3%	96.3%	8.67
TaskNPoint (Ours)	99.5%	93.0%	0.02	100.0%	98.0%	4.85

Hardware Results

We deploy our controller on a 27-DoF Unitree G1 humanoid, running the policy onboard at 50 Hz and receiving OptiTrack ball position estimates over Ethernet. The robot returns balls thrown at up to 8 m/s and placed up to 2 m laterally. Success rates drop with ball speed (tennis from 1.00 to 0.45 and soccer from 1.00 to 0.70 between the slow and fast conditions), but the robot never falls, even on missed hits, a robustness property we attribute to the abstraction. Box pickup failures were exclusively due to imperfect grasps, not policy failure.

Condition	Tennis SR	Soccer SR	Box SR
Slow (0–4 m/s)	1.00	1.00	0.60
Fast (4–8 m/s)	0.45	0.70	n/a

Success rates over 20 trials per condition.

BibTeX

@article{tasknpoint2026,
    author    = {Werner, Blake and Demler, Ilona and Perona, Pietro and Ames, Aaron D.},
    title     = {TaskNPoint: How to Teach Your Humanoid to Hit a Backhand in Minutes},
    journal   = {arXiv preprint arXiv:2606.26215},
    url       = {https://arxiv.org/pdf/2606.26215},
    year      = {2026},
}
