Breaking personal highscores with a bot for the game Stack

2024-12-31

I created a bot to break my personal highscore in the game Stack for me.

Back when I was in school and learned about transistors, I somehow got curious if one could use transistors to control a smartphone bot playing the game Stack. This would avoid the need for mechanical tapping devices to press on the smartphone. After some testing though I gave up, because I lacked the tools and knowledge to get there. This article is unfortunately not about how I figured out how to use a transistor to control my phone. This article is about how to actually play the game using a bot.

It’s best if you just watch the trailer of the game.

In the game a block moves back and forth above the previous block. When you tap the screen, the block is placed and any part that doesn’t align perfectly with the block below it is cut off. The important thing here is timing, which is perfect for a robot.

General approach:

mirror the screen of my phone to my laptop monitor
have a Python application capture that screen
do some processing and logic
have the application perform a mouse click on the mirrored screen (which also registers mouse clicks as input)
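The four steps above can be sketched as a single loop. This is only a skeleton under assumptions: the post does not say which libraries do the screen capture or the mouse click, so `capture`, `decide` and `click` are hypothetical callables that get injected, and `run_bot` is a made-up name.

```python
import time
from typing import Any, Callable, Optional


def run_bot(
    capture: Callable[[], Any],
    decide: Callable[[Any, float], Optional[float]],
    click: Callable[[], None],
    max_frames: int,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run capture -> process -> click for max_frames frames.

    capture() returns one frame of the mirrored phone screen.
    decide(frame, now) returns the absolute time at which to click,
    or None to keep watching.
    click() performs the mouse click on the mirrored screen.
    Returns the number of clicks performed.
    """
    clicks = 0
    for _ in range(max_frames):
        now = clock()
        frame = capture()
        click_at = decide(frame, now)
        if click_at is None:
            continue  # nothing to do yet, grab the next frame
        delay = click_at - clock()
        if delay > 0:
            sleep(delay)  # wait until the planned click time
        click()
        clicks += 1
    return clicks
```

Passing the clock and sleep functions as parameters is a design choice for this sketch: it lets the loop be tested with fake frames and a fake clock, without a phone attached.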

What does the processing look like?

Some observations I made in the beginning:

the box to be placed has a constant speed
the corner of the highest box is always above the corner of the one below it. This is great for identifying where the current box is and when to place it, as I essentially only need two points to do the entire tracking. The tracking can be done using template matching. So for each frame the bot gathers all the detections matching a corner of a block.
I failed to stack the tower higher than around 60
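To illustrate what template matching does, here is a minimal pure-Python sketch using the sum of squared differences on tiny grayscale grids. It is not the code of the bot: a real implementation would most likely use an image library instead, and the function name `match_template`, the threshold `max_ssd` and the toy data are all assumptions made for this example.

```python
def match_template(image, template, max_ssd=0):
    """Return (x, y) positions where template matches image.

    image and template are 2D lists of grayscale values. A position
    matches when the sum of squared differences is at most max_ssd.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    hits = []
    for y in range(img_h - tpl_h + 1):
        for x in range(img_w - tpl_w + 1):
            ssd = 0
            for ty in range(tpl_h):
                for tx in range(tpl_w):
                    diff = image[y + ty][x + tx] - template[ty][tx]
                    ssd += diff * diff
            if ssd <= max_ssd:
                hits.append((x, y))
    return hits


# Toy 2x2 "corner" pattern and a small frame that contains it twice.
corner = [
    [9, 0],
    [9, 9],
]
frame = [
    [0, 0, 0, 0, 9, 0],
    [0, 9, 0, 0, 9, 9],
    [0, 9, 9, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
detections = match_template(frame, corner)  # list of (x, y) tuples
```

Every entry in `detections` is the top-left position of one match, so each frame yields a handful of points that can then be tracked over time.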

The corner template looks something like this.

The idea then is to have a state machine that decides when the bot does what.

acquiring the base point
- determine stationary (non moving) points
- select the highest point (along the vertical y axis) as the one to place the next block above. This is called the base point
determine moving points
- determine which of the detections are moving points (all detections that are non stationary).
- ❌ The naive approach is to wait until the moving point is right above the base point. In an ideal world with infinite framerate, infinite compute power and zero delay that would work. In practice you will only be able to stack up to 20 that way. There is a very large input delay and, as I found out, the speed also changes! Thus we need to predict the right time to stack before the block is actually right above the base point.
- store the time and x (horizontal) pixel values of the moving pixel detections
- to calculate when to stack we need to know the speed of the moving block. Since the block goes back and forth in a deterministic way, one can assume that there is a frequency, and by determining the frequency we would be able to determine the speed. Thus we could use a fast Fourier transform of all the times and previously detected moving points … STOP! Too complex. Just fit a line through the points and continue if the line is a “good” fit (the Pearson correlation coefficient is above 99.7%; to me it’s just an output of the library)
- the time at which the line crosses the x value of the base point is the time when the bot needs to press the phone screen (input lag will be described later)
- if the time to stack is less than half a second away, wait for the time to stack and then stack (by clicking on the phone screen)
await stabilization
- wait until the part of the block that was not placed correctly has disappeared and all special animations are gone. In the end it’s just a fixed duration
- clear all previously detected stationary points and repeat from acquiring the base point
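The core of the steps above (splitting detections, picking the base point, fitting a line and predicting the crossing time) can be sketched as follows. This is a sketch under assumptions, not the bot's actual code: all function names, the 2 pixel tolerance and the rule "a detection is stationary if the previous frame had one at almost the same position" are made up for illustration. The fit is written out by hand instead of calling a library, and the absolute value of the correlation is used because the block can move in either direction.

```python
import math


def split_detections(previous, current, tol=2.0):
    """Split current detections into (stationary, moving).

    Assumption: a detection is stationary when the previous frame has a
    detection within tol pixels of it in both x and y.
    """
    stationary, moving = [], []
    for x, y in current:
        near = any(abs(x - px) <= tol and abs(y - py) <= tol for px, py in previous)
        (stationary if near else moving).append((x, y))
    return stationary, moving


def base_point(stationary):
    """Highest stationary point. Image y grows downwards, so take the smallest y."""
    return min(stationary, key=lambda p: p[1])


def fit_line(times, xs):
    """Least-squares fit x = slope * t + intercept, plus the Pearson r."""
    n = len(times)
    mean_t = sum(times) / n
    mean_x = sum(xs) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_tx = sum((t - mean_t) * (x - mean_x) for t, x in zip(times, xs))
    if s_tt == 0 or s_xx == 0:
        return 0.0, mean_x, 0.0  # no spread in time or position: no usable fit
    slope = s_tx / s_tt
    intercept = mean_x - slope * mean_t
    r = s_tx / math.sqrt(s_tt * s_xx)
    return slope, intercept, r


def predict_click_time(times, xs, base_x, min_abs_r=0.997):
    """Time at which the fitted line reaches base_x, or None if the fit is poor."""
    if len(times) < 3:
        return None
    slope, intercept, r = fit_line(times, xs)
    if abs(r) < min_abs_r or slope == 0:
        return None
    return (base_x - intercept) / slope


def should_wait_and_click(click_time, now, horizon=0.5):
    """True when the predicted click time lies within the next horizon seconds."""
    return click_time is not None and 0 <= click_time - now < horizon
```

As a worked example: with x positions 100, 110, 120 and 130 px sampled 20 ms apart starting at t = 0, the fitted speed is 500 px/s, so a base point at x = 300 is reached at t = 0.4 s. The input lag is not accounted for here and still has to be subtracted from that time.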

This approach worked for up to 20 stacks, but there was noticeable lag. I wasn’t sure if the waiting time was mispredicted, if the detections were wrong, or if there was a bug in my code. There were various sources of lag. I wanted to know them all instead of aggregating them into one, as this was what I was struggling with the most.

type	amount	measurable*	description
render and screen capture	~25 ms	no	the game screen is rendered after inputs are processed. The rendered screen then needs to be transferred to my monitor and then into python
python frame processing	~18 ms	yes	doing the whole template matching, determining the moving points and determining the best fit line takes time too
sleep deviation	1 ms	yes**	when one puts a program to sleep using the system call, it will never wake up at perfectly the right time
execute mouse click	1 ms	yes**	making the mouse move requires syscalls. Making system calls incurs some overhead.
game loop lag	~25 ms	no	the game runs at 60fps and the game needs to know that the screen was tapped at least 1 frame before. This is a fixed and predictable delay
input forwarding to game	~55 ms	no	this is the biggest and most difficult issue. When the mouse click happens it takes a while to arrive in the game. This is unpredictable and causes the most error, which is exacerbated at higher speeds

*measurable in my setup

**I didn’t bother measuring the exact duration of the entries listed as 1 ms
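To get a feeling for these numbers, the following sketch converts each lag from the table into the distance the block travels in that time, and shows how a predicted click time could be moved forward by an expected lag. It is only arithmetic on the figures from this post. Whether all entries really add up to one total depends on where the timestamps are taken, so treat the sum as a rough upper bound. The function names are made up for this example.

```python
# Latency estimates in milliseconds, copied from the table above.
LAG_MS = {
    "render and screen capture": 25,
    "python frame processing": 18,
    "sleep deviation": 1,
    "execute mouse click": 1,
    "game loop lag": 25,
    "input forwarding to game": 55,
}


def total_lag_ms(lag_ms=LAG_MS):
    """Sum of all latency sources in milliseconds (rough upper bound)."""
    return sum(lag_ms.values())


def drift_px(speed_px_per_s, lag_ms):
    """Distance in pixels the block travels during lag_ms at the given speed."""
    return speed_px_per_s * lag_ms / 1000.0


def compensated_click_time(predicted_time_s, lag_ms):
    """Click earlier by the expected lag so the tap takes effect at predicted_time_s."""
    return predicted_time_s - lag_ms / 1000.0


# How far the block moves per lag source at the top speed of 680 px/s.
for name, ms in LAG_MS.items():
    print(f"{name}: {drift_px(680, ms):.1f} px")
```

The entries add up to 125 ms. At 680 pixels / second that corresponds to 85 pixels of block movement, and the unpredictable input forwarding alone accounts for about 37 pixels of it, which is why errors in that part hurt the most at high speeds.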

As I was struggling I even manually labeled the failures and created a correlation table, but I could not find any reliable way to determine what caused the lag. The only somewhat obvious correlation is that the speed increases with the stack height.

I also checked that the block’s movement is indeed linear, meaning that its position changes at a constant rate while it crosses the screen; your eyes can trick you.

In the end I had to settle for a heuristic. For all I know it was down to luck, but I had made sure that my chances were high.

All in all this allowed me to achieve my goal of exceeding a stack height of 100. I was able to get all the way to 175!

For this long run it was also interesting to look at the speed. Generally it increases with the stack height, and it seems to speed up further the more successful placements one gets. It reached a top speed of 680 pixels / second.

Sometimes it still fails either because there is suddenly more input lag or because sometimes the bot falsely predicts a stationary point.

Enough effort has gone into this and now I feel like I can lay this topic to rest 🪦.