Finding Patterns in Data Sequences

Some time ago I was presented question if there would be easy and quick way to recognize certain usage patterns in huge amount of event data. The events in question were common user activity like clicking buttons, waiting certain time between and so on. This data could then used to improve user experience or catch cheaters that write bots to play their game for them and so on.

There are multiple ways to approach this problem but I came up with relatively simple method that seems to have flexibility. We can think the events of interest as sequence of symbols and in case also time between events is of interest, quantize the time as symbols for short, long and medium time. Then, sequence of events can be presented as list of symbols


Here A could be button click to open window, B a tab click in window, C close button, 1 a second pause before moving mouse, 2 a few second pause before moving mouse and X is collect resources etc..

Now we can approach problem with pattern finding. First let’s define function that is supposed to find recurring patterns and output them and input as sequence of those patterns. (Following is written in Python)

def findpattern(s, trivial):
	patterns = {}
	symbolmap = {}

	p = patterns
	acc = []
	output = []
	for c in s:
		if not c in p:
			if not c in patterns or (not trivial and acc[-1] == c):
				p[c] = {}
				p = p[c]
				acc = ''.join(map(lambda x: str(x),acc))
				if not acc in symbolmap:
					symbolmap[acc] = true

				acc = [c]
				p = patterns[c]
			p = p[c]

	if acc:
		acc = ''.join(map(lambda x: str(x),acc))
		if not acc in symbolmap:
			symbolmap[acc] = true

	return (symbolmap.keys(), output)

Function accepts two arguments, the list of symbols and flag if trivial pattern should be accepted. This is needed to determine if you consider a input list “AAAAA” to be one instance of pattern “AAAAA” or 5 instances of pattern “A”.

When this function is applied on the example string above, we get list of core patterns and the input as pattern list.

> s = "ABC123ABC123ABC123ABC123XABC123ABC123ABC123ABC123XABC123ABC123ABC123ABC123X"
> findpattern(s, False)
['ABC123', 'ABC123X'] ['ABC123', 'ABC123', 'ABC123', 'ABC123X', 'ABC123', 'ABC123', 'ABC123', 'ABC123X', 'ABC123', 'ABC123', 'ABC123', 'ABC123X']

Here it’s easy to see that the string is composed of two main patterns, ABC123 and ABC123X. The second output list is the input string split as instances of pattern.
Real data is noisier than the example here, so proper data cleaning and event definition / filtering are vital for any kind of useful pattern recognition.

In this case one could guess that the player might be bot, as the same sequence occur regularly with too much precision and too little variability for human player. (Remember that the symbols 1, 2 and 3 presented also time in this example.)

Going forward we can also define function that recursively uses the output of the findpattern to find patterns of patterns up to the “prime” pattern of the input, i.e. the pattern that when repeated N times produces the original input string.

def findprimepattern(s, trivial=False):
	o = s
	while True:
		s, o = findpattern(o, trivial)
		print s,o
		if len(s) == 1:
			return s[0]

Example with more complicated pattern string

> s = (((("ABC"*2)+"DE")*3+"X")*2+"O")*4
> findprimepattern(s)

And indeed the input string is the prime pattern ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO concatenated 4 times.

Note effect of parameter trival on the function.

> findprimepattern('AAAA', False)
> findprimepattern('AAAA', True)
['A'] ['A', 'A', 'A', 'A', 'A']

Former is single instance of pattern “AAAA” and latter is 4 instances of pattern “A”.