That title isn’t a metaphor. This blog post is genuinely about detecting presses on buttons that don’t exist. Let’s jump in!
Imaginary Trigger Buttons
If you’ve ever tried to play a first-person shooter game on a smartphone, you’ve probably thought about the imaginary buttons I’m talking about. Smartphone FPS games are typically played with the phone held in “landscape” configuration with the thumbs on the screen and the rest of the fingers beneath or behind the phone.
Controls, for lack of other options, are implemented almost exclusively using the touchscreen. Personally, though, whenever I hold a phone in this way, I always instinctively want to press the imaginary trigger buttons on the back of the phone near the upraised corners of the device where my index fingers are resting. Unfortunately, those buttons simply do not exist.
So now we’re going to press them anyway.
More specifically, I’m going to describe a technique for using a smartphone’s gyroscope-based orientation information to detect the secondary motion caused by the act of pressing a button. Suppose there really were a trigger button on the back of your phone; what would happen if you (rather enthusiastically) pressed it? The button itself would move, of course; but most likely the whole device would also move, if only a little, in response to the force you applied to the button. What if you could detect that little motion? What if you could detect it even if the button-press force was applied to the back of the phone because the button was purely imaginary?
As it turns out, you can detect that. Enthusiastic button presses on phones produce gyroscope signals that are surprisingly clear and distinctive. Not only is it possible to detect imaginary trigger presses with pretty good reliability, it’s even possible to distinguish between left and right presses.
This Playground shows a working demo of this technology. Brace the phone between your hands as shown in the picture above, then tap the back of the device near the upper corners with one index finger at a time to press the imaginary trigger buttons. Note that this is just a proof of concept; it’s not expected to be flawless, and it’s currently tailored to my specific way of tapping, so it may take some experimenting to get the demo to work for you. Think of this as just the beginning; this entire demo, including the imaginary button press detector, was put together by a single dev in a single day. Imagine what more could be done with techniques like these!
How It Works
Speaking of techniques like these, let’s discuss the underlying mechanisms that enable this imaginary button press detection to work. As described above, the core idea is to use secondary sensors to detect side effects of the button-press action; specifically, in this case we use the phone’s gyroscope to look for small movements caused by the user tapping on specific parts of the device. There are many possible ways we could tackle this problem, such as machine learning or cross-correlation with known signals. (Note: those are essentially the same approach undertaken in two different ways.) However, for this proof of concept, I decided to create a simple hand-crafted signal recognizer.
The formula above is how my hand-crafted recognizer might be presented if it appeared in an academic paper. I imagine there are some people for whom a representation like this is helpful. Personally, though, I find that mathematical forms like this can take something very simple and render it nigh incomprehensible. I don’t even have a good grasp on this one: I wrote it, and then a friend of mine had to explain that I’d made a transcription error while writing out my own formula. So if you, like me, find the above mathematical expression intimidatingly confusing and complicated, I hope the following exploration will help show that the idea behind the math is actually very, very simple.
Whenever I’m trying to extract information from sensor data, I like to start by just looking at a sample of that data. In this case, the data is a time series of gyroscope measurements as reported by the browser. I created a simple Playground to collect data and gathered a sample during which I pressed the imaginary triggers in the same way I wanted to be able to detect. Then I simply opened the data in a spreadsheet and stared at it for a while.
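To make the shape of that recording concrete, here’s a small illustrative sketch (the names and structure are my own, not the demo’s): a pure function that turns a chronological series of absolute orientation samples into frame-over-frame deltas. In the browser, the samples themselves would be captured from the device orientation API once per render frame.

```javascript
// Hypothetical helper: convert a chronological array of orientation samples
// ({ pitch, roll, yaw } per frame) into frame-over-frame deltas.
function toDeltas(samples) {
  const deltas = [];
  for (let i = 1; i < samples.length; i++) {
    deltas.push({
      pitch: samples[i].pitch - samples[i - 1].pitch,
      roll: samples[i].roll - samples[i - 1].roll,
      yaw: samples[i].yaw - samples[i - 1].yaw,
    });
  }
  return deltas; // one entry per frame after the first
}
```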
Initially, I just wanted to see if I could find the events I was looking for in the data just by looking at it. Once I could identify those events myself, all I’d need to do would be to create an algorithm to look for the same things I was looking for. I had a general idea of the actions I had taken while recording my sample, so I knew roughly how many events I was looking for and where I might find them. I played around with different ways of looking at the data in Excel until the patterns began to emerge; then I got carried away and ended up creating the entire recognizer in my spreadsheet.
Columns A, B, and C are unchanged (except for coloring) from the original data recording: they are the frame-over-frame deltas of pitch, roll, and yaw, respectively, as reported by the phone’s gyroscope. The first breakthrough I had in understanding this data was when I added the conditional coloring to column A and noticed the feature you see from line 621 to 630: a very brief spike of red followed by a very brief spike of blue. This pattern recurs throughout the dataset at roughly the intervals I had tapped while collecting the data, so I deduced that this might be the feature I was looking for. The physical motion associated with this data is the “bounce” of the phone as it momentarily tilts toward, then immediately away from, the user in response to having been tapped on the back of the device. Importantly, every time this feature recurs in the data, it lasts almost exactly 10 frames (rendering frames at 60 per second). I now had a sufficient description of the feature I was looking for in the data: my imaginary button presses would be marked by a quick decrease and immediate rebound in pitch, all within a duration of ten frames.
With this insight in hand, I just kept adding columns to the spreadsheet to deduce more things about the data.
- Column E contains the sum of the squares of the gyroscope readings for that frame.
- Column F is a sliding window aggregate summing the values in column E for the current frame and the seven (number chosen arbitrarily) frames preceding it, which gives an indicator of how much the device is moving, so we don’t even try to detect imaginary button presses if the device is moving too much for us to tell what’s what.
- Columns H and I, like column F, contain sliding window aggregations, but they both aggregate directly from the pitch values in column A: column H aggregates frames 0 through 3 (i.e., the current frame and the three frames before it) while column I aggregates frames 6 through 9.
- Column J is just for convenience; it pulls directly from column F, but ten frames ago, meaning it effectively aggregates column E for frames 10 through 17.
- Column K subtracts column I from column H; thus, when column I is very negative (i.e., the device was pitching forward 6 through 9 frames ago) while column H is very positive (i.e., the device was pitching backward 0 through 3 frames ago), column K will be large. Column K, therefore, is the raw signal indicator: when this value is large, a press on an imaginary button has been detected.
- Column L translates this into a direct output by measuring the value in K against a sensitivity threshold (arbitrary, but three-ish is a good value); it also checks that column J is small (arbitrarily, less than one), which ensures that we detected this signal during a period where there is little other motion that might cause false positives.
- Column M was added last to help distinguish between left and right imaginary trigger presses; it leverages the observation that a similar, though weaker, feature to what we detect in the pitch usually also appears at the same time in the yaw. However, in the yaw, the directionality is variable depending on which corner of the phone was tapped: sometimes the column appears blue over red as shown, but other times it appears as red over blue. To detect left from right, then, all that’s necessary is to figure out in which direction the yaw changed first, so we just apply the same logic from column I to the yaw data in column C, then check the aggregate’s sign.
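Translated out of the spreadsheet, the whole recognizer fits in a few lines. The sketch below is my own rendering of the column logic described above, not the actual Playground code: the names are illustrative, the thresholds are the arbitrary “three-ish” and “less than one” values mentioned above, and which yaw sign maps to which corner is an arbitrary guess, since (as noted) the directionality varies.

```javascript
// Sketch of the spreadsheet recognizer. Each entry of `history` holds one
// frame's gyroscope deltas { pitch, roll, yaw } (spreadsheet columns A-C),
// ordered newest-first, so history[i] is the frame i frames ago.

const SENSITIVITY = 3; // column L threshold ("three-ish", arbitrary)
const QUIET = 1;       // column J threshold (device must be roughly still)

// Column E: sum of squared gyro deltas for one frame.
function energy(frame) {
  return frame.pitch ** 2 + frame.roll ** 2 + frame.yaw ** 2;
}

// Sliding-window sum over frames [from, to] ago, inclusive.
function windowSum(history, from, to, value) {
  let total = 0;
  for (let i = from; i <= to; i++) total += value(history[i]);
  return total;
}

// Returns null (no press), "left", or "right".
function detectPress(history) {
  if (history.length < 18) return null; // need frames 0..17

  const recent = windowSum(history, 0, 3, f => f.pitch);  // column H
  const earlier = windowSum(history, 6, 9, f => f.pitch); // column I
  const signal = recent - earlier;                        // column K
  const priorMotion = windowSum(history, 10, 17, energy); // column J

  // Column L: a strong pitch dip-and-rebound, preceded by a quiet period.
  if (signal <= SENSITIVITY || priorMotion >= QUIET) return null;

  // Column M: the sign of the early yaw wiggle distinguishes the corners.
  // Which sign means which corner depends on the device's axis conventions;
  // the assignment below is an arbitrary placeholder.
  const earlyYaw = windowSum(history, 6, 9, f => f.yaw);
  return earlyYaw < 0 ? "left" : "right";
}
```

In use, you’d push each frame’s deltas onto the front of a rolling 18-frame buffer and call `detectPress` once per render frame.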
And that’s it! That’s everything you need for a working proof-of-concept of an imaginary button press detector. The code in the Playground demo I linked to above is just a direct translation of the logic described for columns L and M. The impenetrable mathematical expression from earlier in this blog post, too, is just a translation of this simple spreadsheet: the leftmost summation is column J, the second and third summations are columns H and I (bundled together into a single expression equivalent to column K), and the bottom-rightmost expression is column M. Thus, these three representations — a mathematical formula, a spreadsheet, and a working code demo — despite how dissimilar they look and despite the fact that some appear simple and others complex, are all just different ways of expressing the exact same thing!
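For the curious, here is a best-effort reconstruction of that mathematical expression pieced together from the column descriptions above; the notation in the original figure may differ. Writing $p_i$ and $y_i$ for the pitch and yaw deltas $i$ frames ago, and $r_i$ for roll, a press is reported when

```latex
\underbrace{\sum_{i=10}^{17} \left( p_i^2 + r_i^2 + y_i^2 \right)}_{\text{column J: quiet period}} < 1
\quad \wedge \quad
\underbrace{\sum_{i=0}^{3} p_i \;-\; \sum_{i=6}^{9} p_i}_{\text{columns H, I, K: pitch bounce}} > 3,
\qquad
\text{side} = \operatorname{sgn}\!\left( \underbrace{\sum_{i=6}^{9} y_i}_{\text{column M: yaw wiggle}} \right)
```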
As I’ve mentioned before, I think this is only the tip of the iceberg. This proof-of-concept demonstrates an extremely naïve implementation for a single use case put together by a single dev in a single day from a grand total of 1024 frames of data samples. Honestly, I’m quite happy with this demo and excited to share it with you; but I’m way more excited by where you might take it next. With more effort invested, just how good could an imaginary button detector get? Can we teach it to learn inputs at runtime so that it self-tunes for particular users/devices? Could we improve it using cross-correlation or machine learning? What gains could we get by fusing in additional information from other sensors like the accelerometer? Or the microphone?
And what else could we put imaginary buttons on? Tablets? Smart watches? What if the imaginary buttons weren’t even on the device, but were on something nearby that made a characteristic sound, or vibrated in a certain way? Could techniques like this have uses for retrofitting old technology with new capabilities? Could we use this for accessibility?
I have no idea what the answers to any of those questions are. Let’s find out!
Justin Murray — Babylon.js team