CraneSched: Open‑Source Scheduler for HPC and AI Workloads

CraneSched offers a high‑performance, scalable, and secure open‑source system for distributed job scheduling of HPC and AI workloads. It cleanly separates front‑end and back‑end components and provides flexible resource and container management.

Key Takeaway

CraneSched is a job‑scheduling platform for scalable HPC clusters and AI workloads. Its front end (user interfaces and APIs) is kept separate from its back end (scheduling and resource management), and it isolates job resources with cgroups and containers.

Project Overview

  • Developer: High‑Performance Computing Public Platform, Peking University
  • Languages & Architecture:
    Backend – C++ (kernel‑level control, resource management)
    Front‑end – Go (user interfaces, APIs)
  • Version History:
    • 2025/11/11 v1.1.3 – Partition‑Default/Max‑Memory, multi‑GID support, optimized job script permissions, expanded documentation, multilingual support.
    • 2025/04/08 v1.1.2 – GCC 15 / Clang 20, Node‑Drain/Resume events, partition‑based account control, HashiCorp Vault integration.
    • 2025/01/24 v1.1.0 – X11 forwarding, user‑QoS limits, multiple GIDs, cgroup v2 compatibility, Ascend NPU support, scheduler optimizations.
    • 2024/10/24 v1.0.0 – Job monitoring, plugin modules, device support, IPv6, scheduler improvements.
  • Core Functions: Resource and job management, job submission, query, and monitoring; resource isolation via cgroups and containers.
  • Highlights: more than 100,000 scheduling decisions per second; scales to clusters with millions of cores; simple user and administrator commands; RBAC‑based security with encrypted communication; fault‑tolerant design with no single point of failure.
  • Architecture Components:
    Cranectld (control plane): manages node lifecycles, queues, resources, job submissions and queries.
    Craned (compute node service): monitors node resources and job status, receives user commands, and forwards them to Cranectld.
    Resource Manager:
    – Cgroup Manager – resource allocation and isolation for HPC jobs.
    – Container Manager – Kubernetes‑based resource management for AI jobs, container lifecycle handling.
    – CRI‑Support (experimental) – support for container runtimes such as containerd or CRI‑O.
  • Use Cases: running HPC and AI workloads together on shared distributed clusters; cloud‑connected clusters that aim to raise utilization and reduce wait times.
  • Community & Resources: Open‑source code on GitHub. Documentation at https://pkuhpc.github.io/CraneSched/. Demo cluster (test account: demo_admin / demo_admin) – https://hpc.pku.edu.cn/demo/cranesched.
  • Project Names: Front‑end repo: CraneSched-FrontEnd; Back‑end repo: CraneSched.
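
To make the Cgroup Manager's role a little more concrete, here is a minimal sketch of how a job's CPU and memory request could be turned into cgroup v2 limit values. It is not taken from the CraneSched source: the function name `cgroup_v2_limits` and the chosen period are assumptions made for illustration. Only the file formats come from the Linux cgroup v2 interface, where `cpu.max` holds "<quota> <period>" in microseconds and `memory.max` holds a byte count, with the literal `max` meaning unlimited. The sketch only computes the strings and does not touch the filesystem.

```python
from typing import Dict, Optional

# Assumption: a 100 ms scheduling period, which is the cgroup v2 default.
CPU_PERIOD_US = 100_000


def cgroup_v2_limits(cpu_cores: Optional[float],
                     memory_bytes: Optional[int]) -> Dict[str, str]:
    """Hypothetical helper: map a job request to cgroup v2 file contents.

    Returns a dict keyed by cgroup v2 file name. A value of None for either
    argument means "no limit", written as the literal string "max".
    """
    if cpu_cores is None:
        cpu_max = f"max {CPU_PERIOD_US}"
    else:
        if cpu_cores <= 0:
            raise ValueError("cpu_cores must be positive")
        # Quota is the CPU time (in microseconds) allowed per period.
        quota_us = round(cpu_cores * CPU_PERIOD_US)
        cpu_max = f"{quota_us} {CPU_PERIOD_US}"

    if memory_bytes is None:
        memory_max = "max"
    else:
        if memory_bytes <= 0:
            raise ValueError("memory_bytes must be positive")
        memory_max = str(memory_bytes)

    return {"cpu.max": cpu_max, "memory.max": memory_max}


if __name__ == "__main__":
    # A job asking for 2.5 cores and 4 GiB of memory.
    print(cgroup_v2_limits(2.5, 4 * 1024**3))
```

A real resource manager would write these values into the files of a per-job cgroup directory and place the job's processes in it; that part is omitted here because it depends on the cluster's cgroup layout and privileges.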

Source: https://github.com/PKUHPC/CraneSched/blob/master/README_EN.md