A Screenshot and OCR Workflow for Wayland

February 18, 2026

Tags: Wayland, Sway, OCR, Tesseract, Screenshots, Shell Script

X11 screenshot tools such as scrot and maim cannot capture a Wayland session. The compositor owns the display, and clients have no equivalent of grabbing the X framebuffer. On Sway, the replacements are grim (screenshot capture) and slurp (region selection). Together with Tesseract for OCR, they form a lightweight screenshot suite that is triggered entirely from the keyboard.

Region Screenshot

The simplest case: select a region, capture it, copy to clipboard.

bash

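#!/usr/bin/env bash

tmpfile=$(mktemp /tmp/screenshot-XXXXXX.png)

# Stop if the selection was cancelled, so no empty file gets copied or moved.
grim -g "$(slurp)" "$tmpfile" || { rm -f "$tmpfile"; exit 1; }
echo "Screenshot saved to $tmpfile"

wl-copy --type image/png < "$tmpfile"

mv "$tmpfile" "$HOME/Pictures/"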

slurp gives you a crosshair to draw a rectangle. grim -g captures that exact region. wl-copy puts the image in the Wayland clipboard. The file gets moved to ~/Pictures/ as a backup.
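Because mktemp produces random names, the backups in ~/Pictures/ end up hard to sort. A minimal sketch of a timestamped destination follows; the helper name screenshot_path and the "screenshot-" prefix are assumptions for illustration, not part of the original script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a timestamped destination path for a screenshot.
# The default directory and the "screenshot-" prefix are illustrative choices.
screenshot_path() {
    local dir="${1:-$HOME/Pictures}"
    printf '%s/screenshot-%s.png\n' "$dir" "$(date +%Y%m%d-%H%M%S)"
}

# Possible use in the script above, in place of the plain mv:
#   mv "$tmpfile" "$(screenshot_path)"
```

Two screenshots taken within the same second would collide, so this is only a starting point.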

Focused Window Screenshot

Instead of manually selecting a region, this variant grabs the currently focused window by querying the Sway tree:

bash

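#!/usr/bin/env bash

tmpfile=$(mktemp /tmp/screenshot-XXXXXX.png)

# jq -r prints the raw string, so the quotes do not need stripping afterwards.
geometry=$(swaymsg -t get_tree | jq -r '.. | select(.focused? == true) | .rect | "\(.x),\(.y) \(.width)x\(.height)"' | head -n 1)

if [ -n "$geometry" ] && grim -g "$geometry" "$tmpfile"; then
    echo "Screenshot saved to $tmpfile"
    wl-copy --type image/png < "$tmpfile"
    mv "$tmpfile" "$HOME/Pictures/"
else
    rm -f "$tmpfile"
fi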

The jq query recursively walks the Sway window tree, finds the node with .focused == true, and extracts its position and size in the "x,y widthxheight" format that grim expects.
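To see what the query does without a running Sway session, it can be tried against a hand-written tree. The JSON below is a made-up, heavily trimmed stand-in for real swaymsg -t get_tree output, and focused_geometry is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
# Made-up, trimmed stand-in for `swaymsg -t get_tree` output.
sample_tree='{
  "focused": false,
  "rect": {"x": 0, "y": 0, "width": 1920, "height": 1080},
  "nodes": [
    {"focused": false, "rect": {"x": 0, "y": 30, "width": 960, "height": 1050}, "nodes": []},
    {"focused": true,  "rect": {"x": 960, "y": 30, "width": 960, "height": 1050}, "nodes": []}
  ]
}'

# Read a Sway tree as JSON on stdin and print the focused node's geometry.
focused_geometry() {
    jq -r '.. | select(.focused? == true) | .rect
           | "\(.x),\(.y) \(.width)x\(.height)"' | head -n 1
}
```

Running printf '%s' "$sample_tree" | focused_geometry prints 960,30 960x1050, which is the string grim -g accepts.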

Adding OCR

This is where it gets interesting. Select a region, capture it, run it through Tesseract, and get the extracted text in your clipboard:

bash

#!/usr/bin/env bash

tmpfile=$(mktemp /tmp/screenshot-ocr-XXXXXX.png)

# Stop if the selection was cancelled, so Tesseract never sees an empty file.
grim -g "$(slurp)" "$tmpfile" || { rm -f "$tmpfile"; exit 1; }
echo "Screenshot saved to $tmpfile"

tesseract "$tmpfile" - -l eng 2>/dev/null | wl-copy

copied_text=$(wl-paste)
# Show at most 50 characters, with "..." only when something was cut off.
truncated_text="${copied_text:0:50}"
[ "${#copied_text}" -gt 50 ] && truncated_text+="..."
notify-send "OCR Complete" "Text copied to clipboard: $truncated_text"

mv "$tmpfile" "$HOME/Pictures/"

tesseract "$tmpfile" - -l eng reads the image and writes the extracted text to stdout (the - tells Tesseract to output to stdout instead of a file). That gets piped straight into wl-copy. A desktop notification shows a 50-character preview of what was captured so you get instant feedback without switching windows.

The truncation matters for notifications: without it, a full page of OCR text would create an absurdly tall notification bubble.
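The truncation logic is easier to check when pulled out into a function. This is a small sketch; truncate_preview is a hypothetical name, and the default of 50 simply mirrors the script above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: shorten text to at most `max` characters and append
# "..." only when something was actually cut off.
truncate_preview() {
    local text="$1" max="${2:-50}"
    if (( ${#text} > max )); then
        printf '%s...\n' "${text:0:max}"
    else
        printf '%s\n' "$text"
    fi
}
```

With this in place, the notification line could become notify-send "OCR Complete" "$(truncate_preview "$copied_text")". Text of exactly 50 characters passes through unchanged.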

The Color Picker

A bonus one-liner that captures a single pixel and copies the hex color to clipboard:

bash

bindsym CTRL+Print exec grim -g "$(slurp -p)" -t ppm - | convert - -format '%[pixel:p{0,0}]' txt:- | tail -n 1 | cut -d ' ' -f 4 | wl-copy

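slurp -p selects a single pixel instead of a region. grim captures it as a PPM image piped to stdout. ImageMagick's convert prints that pixel as text, and tail and cut pull the hex color value out of the last line. The result lands in your clipboard. On ImageMagick 7 the same tool is invoked as magick; convert is the legacy name.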
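The cut -d ' ' -f 4 step depends on the exact spacing of ImageMagick's text output. A pattern match is a little less fragile. The sketch below runs against an illustrative sample of that output (the header wording varies between ImageMagick versions), and hex_from_txt is a hypothetical name.

```bash
#!/usr/bin/env bash
# Illustrative sample of ImageMagick's txt: output for a single pixel.
# The exact header wording differs between ImageMagick versions.
sample_txt='# ImageMagick pixel enumeration: 1,1,0,255,srgb
0,0: (40,44,52)  #282C34  srgb(40,44,52)'

# Read txt: output on stdin and print the first hex color on the last line.
hex_from_txt() {
    tail -n 1 | grep -oE '#[0-9A-Fa-f]+' | head -n 1
}
```

Running printf '%s\n' "$sample_txt" | hex_from_txt prints #282C34. In the pipeline, this function would stand in for the tail and cut stages.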

Keybindings

All five commands are bound to intuitive key combinations in the Sway config:

bash

bindsym Print exec $USER_BIN/screenshot.sh &
bindsym Shift+Print exec $USER_BIN/screenshot-highlighted-window.sh &
bindsym Mod1+Print exec $USER_BIN/screenshot-ocr.sh &
bindsym Mod1+Shift+Print exec $USER_BIN/screenshot-window-ocr.sh &
bindsym CTRL+Print exec grim -g "$(slurp -p)" -t ppm - | convert - -format '%[pixel:p{0,0}]' txt:- | tail -n 1 | cut -d ' ' -f 4 | wl-copy

Print for region, Shift+Print for focused window, Alt+Print for region OCR, Alt+Shift+Print for window OCR, and Ctrl+Print for the color picker. Easy to remember once you think of Alt as the "OCR modifier" and Shift as the "focused window modifier."

The Good and The Bad

The good: this entire setup is four short shell scripts, one inline one-liner, and five lines of keybindings. There is no screenshot application running in the background, no GUI to navigate, no settings to configure. Press a key, get a result.

The bad: Tesseract's accuracy depends heavily on the source. Clean rendered text from a terminal or code editor works well. Text over complex backgrounds, curved text, or handwriting will produce garbage. For those cases you still need to screenshot and read it yourself. Adding -l eng+nld for multiple languages helps if you work in more than one language, but it slows down the processing noticeably.

Dependencies: grim, slurp, wl-clipboard, tesseract, jq, libnotify (for notify-send), and optionally imagemagick for the color picker. On Arch, that is pacman -S grim slurp wl-clipboard tesseract tesseract-data-eng jq libnotify imagemagick.
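Since the scripts fail quietly when a tool is missing, a quick check before binding the keys can save some head-scratching. This is a minimal sketch; missing_deps is a hypothetical helper, and the command names in the usage line are the binaries the scripts above call.

```bash
#!/usr/bin/env bash
# Print every command from the argument list that is not found on PATH.
missing_deps() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

Running missing_deps grim slurp wl-copy wl-paste tesseract jq notify-send convert prints one line per missing command and nothing when everything is installed.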