{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Your First Program with CuTe DSL\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Welcome! In this tutorial, we'll write a simple \"Hello World\" program that runs on your GPU using CuTe DSL. This will help you understand the basics of GPU programming with our framework.\n",
    "\n",
    "### What You'll Learn\n",
    "\n",
    "- How to write code that runs on both CPU (host) and GPU (device),\n",
    "- How to launch a GPU kernel (a function that runs on the GPU),\n",
    "- Basic CUDA concepts like threads and thread blocks,\n",
    "\n",
    "### Step 1: Import Required Libraries\n",
    "\n",
    "First, let's import the libraries we need:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cutlass               \n",
    "import cutlass.cute as cute  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Step 2: Write Our GPU Kernel\n",
    "A GPU kernel is a function that runs on the GPU. Here's a simple kernel that prints \"Hello World\".\n",
    "Key concepts:\n",
    "- `@cute.kernel`: This decorator tells CUTLASS that this function should run on the GPU\n",
    "- `cute.arch.thread_idx()`: Gets the ID of the current GPU thread (like a worker's ID number)\n",
    "- We only want one thread to print the message (thread 0) to avoid multiple prints"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cute.kernel\n",
    "def kernel():\n",
    "    # Get the x component of the thread index (y and z components are unused)\n",
    "    tidx, _, _ = cute.arch.thread_idx()\n",
    "    # Only the first thread (thread 0) prints the message\n",
    "    if tidx == 0:\n",
    "        cute.printf(\"Hello world\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Write Our Host Function\n",
    "\n",
    "Now we need a function that sets up the GPU and launches our kernel.\n",
    "Key concepts:\n",
    "- `@cute.jit`: This decorator is for functions that run on the CPU but can launch GPU code\n",
    "- We need to initialize CUDA before using the GPU\n",
    "- `.launch()` tells CUDA how many blocks, threads, shared memory, etc. to use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cute.jit\n",
    "def hello_world():\n",
    "\n",
    "    # Print hello world from host code\n",
    "    cute.printf(\"hello world\")\n",
    "\n",
    "    # Launch kernel\n",
    "    kernel().launch(\n",
    "        grid=(1, 1, 1),   # Single thread block\n",
    "        block=(32, 1, 1)  # One warp (32 threads) per thread block\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Run Our Program\n",
    "\n",
    "There are 2 ways we can run our program:\n",
    "\n",
    "1. compile and run immediately\n",
    "2. separate compilation which allows us to compile the code once and run multiple times\n",
    "   \n",
    "Please note the `Compiling...` for Method 2 prints before the \"Hello world\" of the first kernel. This shows the asynchronous behavior between CPU and GPU prints. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running hello_world()...\n",
      "hello world\n",
      "Compiling...\n",
      "Hello world\n",
      "Running compiled version...\n",
      "hello world\n"
     ]
    }
   ],
   "source": [
    "# Initialize CUDA context for launching a kernel with error checking\n",
    "# We make context initialization explicit to allow users to control the context creation \n",
    "# and avoid potential issues with multiple contexts\n",
    "cutlass.cuda.initialize_cuda_context()\n",
    "\n",
    "# Method 1: Just-In-Time (JIT) compilation - compiles and runs the code immediately\n",
    "print(\"Running hello_world()...\")\n",
    "hello_world()\n",
    "\n",
    "# Method 2: Compile first (useful if you want to run the same code multiple times)\n",
    "print(\"Compiling...\")\n",
    "hello_world_compiled = cute.compile(hello_world)\n",
    "\n",
    "# Run the pre-compiled version\n",
    "print(\"Running compiled version...\")\n",
    "hello_world_compiled()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
