Quickly Detect Cursor Position and Color

I've been absolutely thrilled moving my home dev world back to Fedora. I'm not fighting OS ads, virtualization just works, and my settings actually stay the same after updates. I am, however, missing AutoHotkey. It's been an integral part of my computing world since undergrad. I've spent the better part of three years looking for a POSIX AHK clone with no luck.

I've tossed around the idea of starting something similar for some time now. Obviously I don't have the expertise to make anything near as streamlined as AHK, but I do have the muleheadness to spend an entire weekend trying to bum a word or two from an awk chain. Tenacity is a useful trait when you have absolutely no idea what you're doing.

Background
Software
My First Cursor Position
pyautogui
Frankenstein
Xlib
- Caveats

Background

Several months ago I began drafting a project along these lines. I think RobotJS is pretty neat, and I've been looking for an excuse to use it. Bundling it via Electron would make the result consumable everywhere, sidestepping the AHK-proprietary concern.

However, Electron is a Chromium wrapper. It's got a ton of bloat and the whole Chromium thing to worry about. I typically hit my macros pretty hard, so an app that has to juggle many intensive GUI actions while being an intensive GUI might not be the best. (Note: I didn't actually pursue this, so I could be totally wrong and Electron might be faster.) I decided to mess around with a few other ideas this weekend.

I'd like to build something simple this weekend to figure out a few new components. I don't have any set goals, but these things are in the back of my head:

Lightweight: When Electron is your baseline everything's an improvement
Cross-platform: If nothing else, the RHEL and Debian ecosystems
Fast: I feel like 10+ ops a second isn't asking too much

Software

I mention some software during this post. I install everything here, but I remove some of it later. YMMV. If you're in the Debian ecosystem, this should work but the packages will probably have completely different names.

xdotool and its dependencies
$ sudo dnf install xdotool
ImageMagick
$ sudo dnf install ImageMagick

xwd: You might have to hunt for this executable's provider.

$ dnf provides xwd
Last metadata expiration check: 0:10:01 ago on Sun 14 Jan 2018 06:28:40 AM UTC.
xorg-x11-apps-7.7-18.fc27.x86_64 : X.Org X11 applications
Repo        : fedora
Matched from:
Provide    : xwd = 1.0.6
$ sudo dnf install xorg-x11-apps

Depending on your Python setup, you might already have these external dependencies.

$ sudo dnf install \
        python{2,3}-{tkinter,xlib} \
        scrot

Finally, for good measure, check the pip dependencies:

$ pip install --user \
        pyautogui \
        Xlib

My First Cursor Position

You can't automate the mouse if you don't know where it is. xdotool provides fast and easy access to the cursor (among other things!).

$ xdotool getmouselocation
x:2478 y:1603 screen:0 window:48234503
$ xdotool getmouselocation --shell
X=2468
Y=1265
SCREEN=0
WINDOW=48234503
$ eval $(xdotool getmouselocation --shell); echo $X
2468
$ eval $(xdotool getmouselocation --shell); echo $X
2873
$ eval $(xdotool getmouselocation --shell); echo $X
2456

The only downside is trying to consume this. It might be a better idea to try native options.

`pyautogui`

pyautogui is one of those native options. Al Sweigart wrote a great little intro that I highly recommend reading. The tool is simple and fast, which is what we're looking for.

pyautogui-geometry.py

"""This file illustrates a basic pyautogui setup"""

from __future__ import print_function
import time

import pyautogui

START = time.time()
SCREEN_WIDTH, SCREEN_HEIGHT = pyautogui.size()
MOUSE_X, MOUSE_Y = pyautogui.position()
END = time.time()

print("Screen: %dx%d" % (SCREEN_WIDTH, SCREEN_HEIGHT))
print("Mouse: (%d,%d)" % (MOUSE_X, MOUSE_Y))
print("Start: %s" % (START))
print("End: %s" % (END))
print("Difference: %s" % (END - START))

$ python scripts/pyautogui-geometry.py
Screen: 3000x1920
Mouse: (2570,1597)
Start: 1515917871.17
End: 1515917871.17
Difference: 0.000105142593384

pyautogui can also work with the screen's images and colors. Detecting color under the cursor is a great way to trigger actions, especially in routine applications.

pyautogui-pixel-color.py

#!/usr/bin/env python
"""This file illustrates using pyautogui to check a pixel's color"""

from __future__ import print_function
import time

import pyautogui

START = time.time()
SCREEN_WIDTH, SCREEN_HEIGHT = pyautogui.size()
MOUSE_X, MOUSE_Y = pyautogui.position()
PIXEL = pyautogui.screenshot(
    region=(
        MOUSE_X, MOUSE_Y, 1, 1
    )
)
COLOR = PIXEL.getcolors()
END = time.time()

print("Screen: %dx%d" % (SCREEN_WIDTH, SCREEN_HEIGHT))
print("Mouse: (%d,%d)" % (MOUSE_X, MOUSE_Y))
print("RGB: %s" % (COLOR[0][1].__str__()))
print("Start: %s" % (START))
print("End: %s" % (END))
print("Difference: %s" % (END - START))

$ python scripts/pyautogui-pixel-color.py
Screen: 3000x1920
Mouse: (2731,1766)
RGB: (20, 21, 22)
Start: 1515918345.91
End: 1515918346.19
Difference: 0.282289028168

Unfortunately, pyautogui is super slow. No matter what, it screenshots everything, then returns what you requested.

pyautogui-pixel-color.py

"""This file illustrates the similar run time between regions and full screenshots"""

from __future__ import print_function
import time

import pyautogui

print('Using a region')
START = time.time()
SCREEN_WIDTH, SCREEN_HEIGHT = pyautogui.size()
MOUSE_X, MOUSE_Y = pyautogui.position()
PIXEL = pyautogui.screenshot(
    region=(
        MOUSE_X, MOUSE_Y, 1, 1
    )
)
COLOR = PIXEL.getcolors()
END = time.time()

print("Screen: %dx%d" % (SCREEN_WIDTH, SCREEN_HEIGHT))
print("Mouse: (%d,%d)" % (MOUSE_X, MOUSE_Y))
print("RGB: %s" % (COLOR[0][1].__str__()))
print("Start: %s" % (START))
print("End: %s" % (END))
print("Difference: %s" % (END - START))

print("=" * 10)

print('Full screen')
START = time.time()
SCREEN_WIDTH, SCREEN_HEIGHT = pyautogui.size()
MOUSE_X, MOUSE_Y = pyautogui.position()
PIXEL = pyautogui.screenshot()
COLOR = PIXEL.getcolors()
END = time.time()

print("Screen: %dx%d" % (SCREEN_WIDTH, SCREEN_HEIGHT))
print("Mouse: (%d,%d)" % (MOUSE_X, MOUSE_Y))
print("Start: %s" % (START))
print("End: %s" % (END))
print("Difference: %s" % (END - START))

$ python scripts/pyautogui-full-vs-region.py
Using a region
Screen: 3000x1920
Mouse: (2416,1456)
RGB: (20, 21, 22)
Start: 1515919266.8
End: 1515919267.09
Difference: 0.286885023117
==========
Full screen
Screen: 3000x1920
Mouse: (2416,1456)
Start: 1515919267.09
End: 1515919267.37
Difference: 0.281718969345

Smaller screens will do better; the docs mention this. I was able to cut my time by disabling my second screen. That's neither fun nor practical, so I put pyautogui aside for now.

Frankenstein

I have to admit, I was kinda stumped at this point. pyautogui is well-written and seriously vetted. I too would have gone the screenshot route, which means I too would be much too slow. A different approach is necessary, but I don't know what.

Luckily I stumbled into xwd via a wonderfully succinct bash solution. It's the X Window Dumping utility. That's insanely useful here, since X is what runs the system. xwd, in theory, has all of the screen information I could want. The linked solution uses ImageMagick to convert the dump; not only did I not know it was possible to get an X dump, I also did not know ImageMagick would parse it beautifully (I would have guessed that part eventually).

`bash`

xwd-convert.sh

view source

#!/bin/bash
# Original:
# https://stackoverflow.com/a/25498342/2877698
# I've made a couple of tweaks; nothing substantial
start=$(date +%s%3N)
eval $(xdotool getmouselocation --shell)
xwd -root -screen -silent | convert xwd:- -crop "1x1+$X+$Y" txt:-
end=$(date +%s%3N)

echo "Start: $start"
echo "End: $end"
echo "Difference: $(($end - $start))"

$ scripts/xwd-convert.sh
ImageMagick pixel enumeration: 1,1,65535,srgb
0,0: (5140,5397,5654)  #141516  srgb(20,21,22)
Start: 1515922099742
End: 1515922100086
Difference: 344

That's even slower than pyautogui. To be fair, we should probably build it in Python.

Disconnected and Secure

In order to convert it to Python, we'll have to break the pipes. Shell injection is no joke.

xwd-convert-disconnected.py

"""This file runs several commands without allowing them to communicate directly"""
from __future__ import print_function

import re
import subprocess
import time

FULL_XDO_PATTERN = r"""
^\s*X=(?P<x>\d+)    # Name x for easy access
\s+
Y=(?P<y>\d+)        # Name y for easy access
\s+
[\s\S]+$            # Ditch everything else
"""

COMPILED_PATTERN = re.compile(FULL_XDO_PATTERN, re.VERBOSE | re.MULTILINE)

START = time.time()
COORD = subprocess.check_output([
    'xdotool',
    'getmouselocation',
    '--shell'
])
MATCHED = COMPILED_PATTERN.match(COORD)
DUMP = subprocess.check_output([
    'xwd',
    '-root',    # starts from the base up
    '-screen',  # snags the visible screen for things like menus
    '-silent',  # don't alert
    '-out',     # outfile
    'dump.xwd'
])
IMAGE = subprocess.check_output([
    'convert',
    'dump.xwd',  # infile
    '-crop',    # restrict the image to the cursor
    '1x1+%s+%s' % (MATCHED.group('x'), MATCHED.group('y')),
    'text:-'    # throw out on stdout
])
END = time.time()

print(COORD)
print(IMAGE)
print("Start: %s" % (START))
print("End: %s" % (END))
print("Difference: %s" % (END - START))

$ python scripts/xwd-convert-disconnected.py
X=2291
Y=1551
SCREEN=0
WINDOW=48234503

ImageMagick pixel enumeration: 1,1,65535,srgb
0,0: (9766,35723,53970)  #268BD2  srgb(38,139,210)

Start: 1515923808.71
End: 1515923809.05
Difference: 0.338629961014

However, that didn't seem to give us any extra time. Chaining the commands might work in our favor here.

Chained

Rather than totally isolate the commands, we can redirect their output. It's not quite a shell, as the output from one is finished and sanitized before being sent on, but it functions in a similar manner.

I spent at least an hour trying to figure out how to multiplex stdin on the file command. If you know of a clever way to do that without setting environment variables, I'd love to hear about it. I wasn't able to come up with anything that worked, so the coordinates aren't part of the pipe chain.

xwd-convert-chained.py

"""This file runs several commands by linking stdout and stdin in subprocesses"""
from __future__ import print_function

import re
import subprocess
import time

FULL_XDO_PATTERN = r"""
^\s*X=(?P<x>\d+)    # Name x for easy access
\s+
Y=(?P<y>\d+)        # Name y for easy access
\s+
[\s\S]+$            # Ditch everything else
"""

COMPILED_PATTERN = re.compile(FULL_XDO_PATTERN, re.VERBOSE | re.MULTILINE)

START = time.time()
COORD = subprocess.check_output([
    'xdotool',
    'getmouselocation',
    '--shell'
])
MATCHED = COMPILED_PATTERN.match(COORD)
IMAGE = subprocess.Popen(
    [
        'convert',
        'xwd:-',
        '-crop',    # restrict the image to the cursor
        '1x1+%s+%s' % (MATCHED.group('x'), MATCHED.group('y')),
        'text:-'    # throw out on stdout
    ],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
DUMP = subprocess.Popen(
    [
        'xwd',
        '-root',    # starts from the base up
        '-screen',  # snags the visible screen for things like menus
        '-silent',  # don't alert
    ],
    stdout=IMAGE.stdin
)
OUTPUT = IMAGE.communicate()[0]
DUMP.wait()
END = time.time()

print(OUTPUT)
print("Start: %s" % (START))
print("End: %s" % (END))
print("Difference: %s" % (END - START))

$ python scripts/xwd-convert-chained.py
ImageMagick pixel enumeration: 1,1,65535,srgb
0,0: (5140,5397,5654)  #141516  srgb(20,21,22)

Start: 1515929723.34
End: 1515929723.65
Difference: 0.312852144241

It's still absysmally slow. We haven't actually changed how we're getting the dump and parsing it, so that makes sense.

Shelled

DON'T DO THIS IN PRODUCTION. The docs will warn you too. I threw this script together quickly to make sure everything was working as intended.

xwd-convert-dangerous.py

"""This file contains code to monitor the desktop environment"""
from __future__ import print_function

import subprocess
import time

START = time.time()
OUTPUT = subprocess.check_output(
    'eval $(xdotool getmouselocation --shell); '
    'xwd -root -screen -silent '
    '| convert xwd:- -crop "1x1+$X+$Y" text:-',
    shell=True
)
END = time.time()
print(OUTPUT)
print("START: %s" % (START))
print("End: %s" % (END))
print("Difference: %s" % (END - START))

$ python scripts/xwd-convert-dangerous.py
ImageMagick pixel enumeration: 1,1,65535,srgb
0,0: (5140,5397,5654)  #141516  srgb(20,21,22)

START: 1515930641.2
End: 1515930641.52
Difference: 0.320123195648

Once again, we're not seeing a time boost because we're just shuffling code around. The important bits haven't changed.

Xlib

I spent most of Saturday trying to figure out how to parse a xwd result in Python. I got absolutely nowhere. There are many good resources and several example parsers, but nothing worked out of the box for me. The header that describes the dump, XWDFile.h, is apparently different enough across distros that, in combination with having absolutely no idea how properly parse binary files, I couldn't figure it out. I discovered this morning that I had been using a newer (or older?) version of the file with some major differences (e.g. 32bit vs 64bit) as a reference.

$ sudo find /usr -type f -name 'XWDFile.h'
/usr/include/X11/XWDFile.h

However, even knowing that in hindsight, I lost interest in parsing the dumps manually somewhere around the fourth hour of knowing the byte order was wrong but having no idea how to debug it. The image at the top of this post shows some of my failed attempts to test and retry. I learned quite a bit about X11 muddling my way through it, so I was able to refine my research and started getting useful hits. Eventually, I discovered this SO answer, which gave me the library I needed to track down.

The Python X Library, python-xlib, provides some of the items from Xlib and seems to be the best package at the moment. I neither know enough about X11 nor care to spend the time comparing interfaces to fully grok the differences; many elements covered in the X11 docs (e.g. AllPlanes) are either missing from python-xlib or do not appear to work as the external docs suggest. The library works, I was able to stumble my way through the hand-wavy docs, and, best of all, it blows everything else out of the water.

xlib-color.py

#!/usr/bin/env python
"""This file demonstrates color snooping with Xlib"""
from __future__ import print_function

from os import getenv
import time

from Xlib.display import Display
from Xlib.X import ZPixmap

# Pull display from environment
DISPLAY_NUM = getenv('DISPLAY')
# Activate discovered display (or default)
DISPLAY = Display(DISPLAY_NUM)
# Specify the display's screen for convenience
SCREEN = DISPLAY.screen()
# Specify the base element
ROOT = SCREEN.root
# Store width and height
ROOT_GEOMETRY = ROOT.get_geometry()

# Ensure we can run this
EXTENSION_INFO = DISPLAY.query_extension('XInputExtension')

START = time.time()
COORDS = ROOT.query_pointer()
# Create an X dump at the coordinate we want
DISPLAY_IMAGE = ROOT.get_image(
    x=COORDS.root_x,
    y=COORDS.root_y,
    width=1,
    height=1,
    format=ZPixmap,
    plane_mask=int("0xFFFFFF", 16)
)
# Strip its color info
PIXEL = getattr(DISPLAY_IMAGE, 'data')
# Look up the color
RESULTS = SCREEN.default_colormap.query_colors(PIXEL)
# If there are multiple, just return the last one
for raw_color in RESULTS:
    final = (
        raw_color.red,
        raw_color.green,
        raw_color.blue
    )
END = time.time()
print('#%04x%04x%04x' % (final[0], final[1], final[2]))
print("Start: %s" % (START))
print("End: %s" % (END))
print("Difference: %s" % (END - START))

$ python scripts/xlib-color.py
e3fb6e6d2e25
Start: 1515954353.75
End: 1515954353.75
Difference: 0.000375986099243

So far, it looks like this script can consistently poll cursor colors in fewer than five milliseconds. Everything else takes at least 250 milliseconds. Depending on how fast and loose you want to play with the numbers, this solution is at least a 97.5% reduction (200 to 5) and optimistically closer to a 99.6% reduction (300 to 1).

Caveats

I'm sort of waving my hands at a few things here because I still don't fully understand them yet.

I do not understand the bitmap side of things, i.e. XYPixmaps. I don't feel all that good about ZPixmaps either but they're at least composed of RGB triplets which is something I do get.
Speaking of ZPixmaps, I had quite a bit of trouble with plane_masks as well. They're used to hide unwanted bit planes, which I just realized writing this sentence isn't anything more special than masking bits. That being said, it's not documented well in X11 and at all in python-xlib.
The pixels in ZPixmaps return strange data structures. This is 100% my inexperience with Python and its advanced data types. I will eventually break down the code and figure out what it's doing.

Also, python-xlib is really awesome. I've made some comments that could be taken as passive aggressive stabs at the devs, and I want to make sure that misconception is cleared up. They've built some amazing software, they're working hard to make the world a better place, and I'm stoked about their contributions.

I'm going to follow this up with another post or two playing with this stuff, doing some analysis, and trying out some implementations.