Open Sources

Curated repos, tools, and frameworks shaping the developer ecosystem.
Live data from GitHub.

🌈 Features

Fully Asynchronous 4-Component Architecture

OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.
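The decoupling described above can be pictured as four coroutines connected by queues. The following is a minimal sketch using Python's `asyncio`, not the project's actual implementation: the function names (`serve`, `collect`, `judge`, `train`, `run_pipeline`), the record fields, and the dummy scoring rule are all assumptions made for illustration.

```python
import asyncio

DONE = None  # sentinel that tells the next stage to stop


async def serve(interactions: asyncio.Queue, n_requests: int) -> None:
    """Agent serving: answer requests and log each interaction."""
    for i in range(n_requests):
        await asyncio.sleep(0)  # stand-in for model inference
        await interactions.put({"turn": i, "response": f"reply-{i}"})
    await interactions.put(DONE)


async def collect(interactions: asyncio.Queue, rollouts: asyncio.Queue) -> None:
    """Rollout collection: turn logged interactions into rollout records."""
    while (item := await interactions.get()) is not DONE:
        await rollouts.put({**item, "session": "demo"})
    await rollouts.put(DONE)


async def judge(rollouts: asyncio.Queue, samples: asyncio.Queue) -> None:
    """PRM/judge evaluation: attach a reward to each rollout as it arrives."""
    while (item := await rollouts.get()) is not DONE:
        await asyncio.sleep(0)  # stand-in for a judge model call
        reward = 1.0 if item["turn"] % 2 == 0 else -1.0  # dummy scoring rule
        await samples.put({**item, "reward": reward})
    await samples.put(DONE)


async def train(samples: asyncio.Queue, batch_size: int) -> list:
    """Policy training: consume scored samples in batches as they become ready."""
    batches, batch = [], []
    while (item := await samples.get()) is not DONE:
        batch.append(item["turn"])
        if len(batch) == batch_size:
            batches.append(batch)  # stand-in for one optimizer step
            batch = []
    if batch:
        batches.append(batch)
    return batches


async def run_pipeline(n_requests: int = 5, batch_size: int = 2) -> list:
    """Run all four loops concurrently; no stage waits for another to finish."""
    interactions, rollouts, samples = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    results = await asyncio.gather(
        serve(interactions, n_requests),
        collect(interactions, rollouts),
        judge(rollouts, samples),
        train(samples, batch_size),
    )
    return results[-1]  # batches produced by the trainer
```

Calling `asyncio.run(run_pipeline())` drives the four loops together. Each stage only awaits its own queue, which is the property the architecture relies on: serving keeps producing while judging and training consume in the background.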

Self-Hosted & Private by Design

The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.

From Feedback to Gradient — Automatically

You do not need to manually label data. The system automatically:

Organizes multi-turn interactions into session-aware training trajectories
Classifies API messages into main-line (trainable) vs. side (non-trainable) turns
Uses the next user, environment, or tool feedback as a natural "next-state" signal
Runs PRM/judge evaluation asynchronously, with majority voting when needed for more robust scoring
Submits ready samples to the trainer as they become available
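The steps above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the project's code: the `Message` type, its `main_line` flag, the `build_samples` and `majority_vote` helpers, and the tie-breaking rule are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Message:
    session_id: str
    role: str  # "assistant", "user", "tool", or "environment"
    text: str
    main_line: bool = True  # hypothetical flag: False marks a side (non-trainable) turn


def build_samples(messages):
    """Pair each main-line assistant turn with the feedback that follows it."""
    by_session = {}
    for m in messages:
        by_session.setdefault(m.session_id, []).append(m)

    samples = []
    for session_id, turns in by_session.items():
        for i, m in enumerate(turns):
            if m.role != "assistant" or not m.main_line:
                continue  # only main-line assistant turns are trainable
            # Next-state signal: the first later non-assistant message in the session.
            next_state = next((t for t in turns[i + 1:] if t.role != "assistant"), None)
            if next_state is not None:
                samples.append((session_id, m.text, next_state.text))
    return samples


def majority_vote(votes):
    """Return the most common judge verdict; a tie resolves to 0 (assumed neutral)."""
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return 0
    return counts[0][0]
```

A turn with no feedback yet is simply left out, which matches the idea that samples are submitted only once they are ready.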

Three Optimization Methods in One Framework

Binary RL (GRPO): A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.
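As a rough illustration of the two ingredients named above, the sketch below computes group-normalized advantages in the GRPO style and a PPO-style clipped surrogate loss for a single token. It assumes a "group" is just a list of scalar rewards and uses a clip range of 0.2; the function names and defaults are illustrative, not taken from the project.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped loss for one token (negated so that lower is better)."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return -min(ratio * advantage, clipped * advantage)
```

With binary rewards such as `[1, -1, 1, -1]`, the normalized advantages stay close to the raw rewards. The clipping matters when the new policy has drifted from the one that generated the sample: a positive advantage stops contributing once the ratio exceeds `1 + clip`.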

On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. Adding this hint to the original prompt yields an enhanced teacher. The token-level log-probability gap between this teacher and the student then serves as a directional advantage signal, which carries more information than a single scalar reward.
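A minimal sketch of the OPD signal, assuming the per-token log-probabilities of the sampled response have already been obtained from both the hint-augmented teacher and the student. The prompt template in `build_teacher_prompt` and both function names are hypothetical.

```python
def build_teacher_prompt(prompt: str, hint: str) -> str:
    """Augment the original prompt with the hindsight hint (template is assumed)."""
    return f"{prompt}\n\n[Hint]\n{hint}"


def opd_token_advantages(teacher_logprobs, student_logprobs):
    """Per-token advantage: teacher log-prob minus student log-prob.

    Positive values mark tokens the hinted teacher finds more likely than the
    student does; negative values mark tokens it finds less likely.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]
```

Unlike a scalar reward, the result has one entry per token, so it can push individual tokens up or down within the same response.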

Hybrid Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.
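The text does not say how the two signals are combined. One simple possibility, shown here purely as an assumption, is a weighted sum in which the turn-level Binary RL advantage is broadcast to every token and added to the token-level OPD advantage. The weights `w_binary` and `w_opd` are hypothetical parameters.

```python
def hybrid_token_advantages(scalar_advantage, opd_advantages, w_binary=1.0, w_opd=1.0):
    """Combine a turn-level scalar advantage with token-level OPD advantages.

    Assumed combination rule: w_binary * scalar + w_opd * token_advantage.
    """
    return [w_binary * scalar_advantage + w_opd * a for a in opd_advantages]
```

Under this rule the scalar term sets the overall direction for the turn, while the OPD term redistributes credit among its tokens.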

From Personal Agents to Real-World Agentic RL

The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.

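Start the RL Server

Launch the RL server from the slime directory:

cd slime
bash ../openclaw-combine/run_qwen3_4b_openclaw_topk_select.sh

Once running, the server is reachable through an OpenAI-style endpoint at:

http://<HOST_IP>:30000/v1

OpenClaw Setup

Then configure OpenClaw to route requests to your RL server.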

Open your openclaw.json (or the equivalent settings file) and add a provider entry under "models" → "providers":

Example for a Slime-based RL server:

{
  "models": {
    "providers": {
      "qwen": {
        "baseUrl": "http://<HOST_IP>:30000/v1",
        "apiKey": "apiKey",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b",
            "name": "Qwen3 4B",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

Replace <HOST_IP> with the IP address of your RL server machine. The apiKey should match the SGLANG_API_KEY you set when starting the server.
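If you prefer to generate the provider entry rather than edit the file by hand, a small script can build it. The sketch below reproduces the Slime example's fields; the helper names `provider_entry` and `add_provider`, and the placeholder host and key in the usage section, are assumptions for illustration and are not part of OpenClaw.

```python
import copy
import json


def provider_entry(host_ip: str, api_key: str, port: int = 30000) -> dict:
    """Build the provider entry shown above for a Slime-based RL server."""
    return {
        "baseUrl": f"http://{host_ip}:{port}/v1",
        "apiKey": api_key,  # should match SGLANG_API_KEY on the server
        "api": "openai-completions",
        "models": [
            {
                "id": "qwen3-4b",
                "name": "Qwen3 4B",
                "reasoning": True,
                "input": ["text"],
                "cost": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0},
                "contextWindow": 32768,
                "maxTokens": 8192,
            }
        ],
    }


def add_provider(config: dict, name: str, entry: dict) -> dict:
    """Return a copy of config with entry stored under models.providers[name]."""
    merged = copy.deepcopy(config)
    merged.setdefault("models", {}).setdefault("providers", {})[name] = entry
    return merged


if __name__ == "__main__":
    # "10.0.0.5" and "my-sglang-key" are placeholder values for illustration.
    config = add_provider({}, "qwen", provider_entry("10.0.0.5", "my-sglang-key"))
    print(json.dumps(config, indent=2))
```

Working on a deep copy keeps any existing providers in the loaded settings untouched until you decide to write the merged result back.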

Example for a Tinker-based RL server:

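{
  "models": {
    "providers": {
      "openclaw-rl": {
        "baseUrl": "http://localhost:30000/v1",
        "apiKey": "no-auth-needed",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b-lora",
            "name": "Qwen3 4B (OpenClaw-RL LoRA)",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

The fields after "output" are assumed to follow the Slime example above; adjust contextWindow and maxTokens to match your server.
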
That's it — start chatting with your OpenClaw agent. The RL server will automatically collect conversation trajectories, compute rewards, and train the model. Your agent gets better the more you use it.

🔧 Agentic RL in Real-world Settings

Setting	Environment	Next-state signal	Horizon
Terminal	Shell execution sandbox	stdout/stderr, exit code	Long
GUI	Screen state + accessibility tree	Visual state diff, task progress	Long
SWE	Code repository + test suite	Test verdicts, diff, lint output	Long
Tool-call	API/function execution	Return values, error traces	Medium
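To make the "next-state signal" column concrete for the terminal setting, the sketch below runs one command and captures stdout, stderr, and the exit code as a single record. It is an illustration only: `TerminalNextState` and `run_action` are hypothetical names, and the code uses a plain subprocess with a timeout rather than the sandbox the table refers to.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TerminalNextState:
    stdout: str
    stderr: str
    exit_code: int


def run_action(argv, timeout=10.0) -> TerminalNextState:
    """Execute one agent action and return what the environment reports back.

    NOTE: this runs the command directly, with no sandboxing.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return TerminalNextState(proc.stdout, proc.stderr, proc.returncode)


if __name__ == "__main__":
    # A harmless demo action: run the current Python interpreter.
    state = run_action([sys.executable, "-c", "print('hello')"])
    print(state)
```

A judge or PRM would then read this record to score the turn that issued the command.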

Terminal agent:

cd slime
bash ../terminal-rl/terminal_qwen3_8b_rl.sh

GUI agent:

cd slime
bash ../gui-rl/gui_qwen3vl_8b_rl.sh

SWE agent:

cd slime
bash ../swe-rl/run_swe_rl_32b_remote_8nodes.sh

Tool-call agent:

cd slime
bash ../toolcall-rl/retool_qwen3_4b_rl.sh

@article{wang2026openclawrl,
  title={OpenClaw-RL: Train Any Agent Simply by Talking},
  author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2603.10165},
  year={2026}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}

