Data science tools are now at the core of the most popular gaming engines on the market. Data science has arguably become indispensable for the gaming industry, given the value it brings to a variety of stakeholders.
Game designers can improve features based on usage data collected in almost real-time. Developers have access to unlimited algorithmic power, only made possible by the increased availability of gaming data. Players enjoy increasingly personalized experiences, while marketing teams and business analysts can understand their behavior better than ever.
But what is the common foundation that has turned data science into such a powerful tool? We know it, you know it, everybody knows it: it’s the datasets on which the algorithms are built. None of the challenges above could be solved without advanced algorithms. These algorithms, in turn, could not be implemented without access to high-quality datasets.
So what types of data does a gaming dataset contain, how are they relevant, and what kind of processing does that data need?
We roughly isolate the following categories: player data, activity data, and environment data.
Player data is the kind of data that describes a player’s behavior and usage patterns: daily logins, the number of sessions, the average playtime, and maybe even their absence (tracked as the last login date). While these are all common measures found across most games, many gaming datasets contain more fine-grained data, corresponding to specific game features: progress, role, current character level, and character race.
From such basic features, more complex ones are usually computed and added to the gaming dataset. They are typically built with common mathematical operations (sum, mean, max, min, and so on) that summarize the data over different intervals of interest.
Research on the topic reports computing means over several time windows of this basic data to derive additional features: “the mean is calculated over several different time periods, namely over the player’s first nine days, last nine days and full lifetime. Finally, to get the current state of the player, the total purchases, total playtime, total logins, and current level are also added.”
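To make the windowed means described in the quote concrete, here is a minimal sketch in Python. The function name `period_means`, the sample playtime values, and the feature names are all hypothetical; only the nine-day window and the first/last/lifetime split come from the quoted research.

```python
from statistics import mean


def period_means(daily_values, window=9):
    """Summarize one per-day metric over the first `window` days,
    the last `window` days, and the player's full lifetime."""
    if not daily_values:
        raise ValueError("daily_values must contain at least one day")
    return {
        "mean_first": mean(daily_values[:window]),
        "mean_last": mean(daily_values[-window:]),
        "mean_lifetime": mean(daily_values),
    }


# Hypothetical daily playtime in minutes for one player, oldest day first.
playtime = [30, 45, 0, 60, 20, 10, 0, 0, 50, 40, 35, 25]

features = period_means(playtime)
# A running total captures the player's current state, as in the quote.
features["total_playtime"] = sum(playtime)
```

The same function can be applied to any per-day metric (logins, purchases, sessions), which is one way a handful of basic measures quickly grows into dozens of features per player.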
A game data mining competition organized in 2018 reports up to 600 features describing an individual player. However, manually crafting features also comes with a caveat: having too many features characterizing each player (and not a big enough dataset) can lead to the well-known curse of dimensionality. An appropriately large dataset needs to be available, in order for machine learning and statistical models to perform well on such multidimensional data.
Combined with past spending on in-game purchases, this kind of data is clearly important: it can be used to model (and anticipate) players’ behaviors. But let’s not get ahead of ourselves here. Let us look at some more data types commonly found in gaming datasets.
Game Activity Data is known in the scientific literature as Game Telemetry Data. If you thought Player Data could get complex, wait until you see Game Activity Data in action. This kind of data refers to different game events, initiated either by the player or by the game engine. It can refer either to interactions among participants or to interactions with objects. To make things even more complex, this data is usually collected over time, which means it needs to be stored and handled as time-series data.
Basic activity features usually refer to all possible actions that players can take: whether the player ducks, jumps, or crawls; whether they shoot, and the exact timestamp of each shot; whether they are moving an object, making an in-game purchase, or using their superpowers. The list goes on and on.
Since these features are usually represented as timestamps, most preprocessing of activity data consists of transformations of time-series data points. For example, it is common to transform time-series data into the frequency domain and use the result as a feature for learning.
The potential for manual feature engineering is once again vast. But modeling and gaining insights from time-series data such as game activity data requires specific algorithms and neural models. LSTM networks, for example, are known to perform well on time-series data. Another popular approach to such complex data is ensemble learning, in which the outputs of multiple networks are stacked together before being fed to other algorithms.
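As a rough illustration of the frequency-domain idea, the sketch below bins hypothetical shot timestamps into per-second counts and computes a magnitude spectrum with a naive discrete Fourier transform. The helper names, the timestamps, and the one-second bin width are assumptions made for the example, not something taken from a specific game or study.

```python
import cmath
import math


def events_per_bin(timestamps, bin_seconds, n_bins):
    """Turn raw event timestamps (in seconds) into a fixed-length count series."""
    counts = [0] * n_bins
    for ts in timestamps:
        idx = int(ts // bin_seconds)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def magnitude_spectrum(samples):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]| for k = 0..n//2."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(coeff))
    return spectrum


# Hypothetical player who fires a two-second burst every four seconds.
shot_times = [0.2, 1.1, 4.3, 5.0, 8.4, 9.2, 12.1, 13.5]

counts = events_per_bin(shot_times, bin_seconds=1, n_bins=16)
spectrum = magnitude_spectrum(counts)

# Skip k = 0 (the overall event count) and find the strongest periodic component.
dominant_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)
dominant_period_seconds = len(counts) / dominant_bin
```

The spectrum values, or just the dominant period, can then be appended to the player’s feature vector. For series of realistic length, a library FFT such as `numpy.fft.rfft` would normally replace the naive loop used here.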
There are players, the interactions among them, and the actions they take. That must be all of it, right? Well, not really.
There are plenty of other data sources in gaming environments, referring to objects, the states of the game, and the overall system parameters. On a higher abstraction level, this is data describing game progress parameters, such as the current game level, the number of interactions in each stage, or available system resources. More fine-grained information comes from the different objects in the game: their position in 3D space, their state, their motions, and even their physical characteristics (such as texture, shape, and size).
The data about the players’ environment and the objects that they interact with can be thought of as contextual information. It complements the information provided by the Player Data and the Activity Data, since the analysis of a player’s behavior is only complete when we understand the context in which that behavior occurs.
The particularly interesting aspect of Environment Data is that it is already at the core of the gaming engine itself. Just as object information is required for rendering, game parameters are known, logged, and tracked by the gaming engine at every point in time. This turns gaming engines into a rich, continuous source of data ready for learning. Many gaming engines even provide APIs for accessing their environments, and these can be extremely valuable for building custom datasets.
While the gaming data categories above are meant to illustrate the variety and challenges of different types of gaming data, they also reveal a tendency that exists in the industry: that of collecting every single event and action that happens throughout a game.
When it all comes together, we feel compelled to raise a warning sign about the unhealthy habit of collecting unnecessary data. Not only can this result in unnecessary processing costs, but it may actually hinder any productive analysis. Instead, observations and data collection strategies should correspond to well-defined requirements that serve very specific goals. It is only when meaningful data is collected that all stakeholders get to truly benefit from it.
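One simple way to tie collection to requirements is to derive the set of events to keep from an explicit list of analysis goals. The sketch below is only an illustration of that idea: the goal names, event types, and the `filter_events` helper are hypothetical.

```python
# Hypothetical mapping from analysis goals to the event types each goal needs.
REQUIREMENTS = {
    "churn_prediction": {"login", "session_end"},
    "store_optimization": {"purchase"},
}


def required_events(requirements):
    """Union of all event types that at least one goal depends on."""
    return set().union(*requirements.values())


def filter_events(events, requirements):
    """Keep only events that serve a stated goal; drop everything else."""
    allowed = required_events(requirements)
    return [event for event in events if event["type"] in allowed]


# Hypothetical raw event stream from a game client.
raw_events = [
    {"type": "login", "player": "p1"},
    {"type": "jump", "player": "p1"},
    {"type": "purchase", "player": "p1"},
    {"type": "camera_rotate", "player": "p1"},
]

kept = filter_events(raw_events, REQUIREMENTS)
```

With this arrangement, adding a new event type to the pipeline requires naming the goal it serves, which keeps storage and processing costs tied to actual analysis needs.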
Josh Miramant is the CEO and founder of Blue Orange Digital, a data science and machine learning agency with offices in New York City and Washington DC.
Miramant is a popular speaker, futurist, and strategic business and technology advisor to enterprise companies and startups. He has been featured in IBM ThinkLeaders, Dell Technologies, Global Banking & Finance Review, and the IoT Council of Europe, among others. He can be reached at firstname.lastname@example.org.
Blue Orange Digital is recognized as a “Top AI Development and Consultant Agency,” by Clutch and YahooFinance, for innovations in predictive analytics, automation, and optimization with machine learning in NYC.
They help organizations optimize and automate their businesses, implement data-driven analytic techniques, and understand the implications of new technologies such as artificial intelligence, big data, and the Internet of Things.