Basic Data Analytics and Statistics Terms
Statistics plays a crucial role in many fields, and anyone who analyzes apps and games needs to understand it. To make sense of data and draw meaningful conclusions, it is important to understand some fundamental concepts and core indicators.
In this article, we will look at key terms such as mean, median, and mode values, statistical distribution, percentile, quartile and decile.
Mean Value
The mean, often referred to as the average, is a commonly used term in analytics. It is calculated by summing all the values in the data set and dividing the sum by the total number of values.
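As a minimal sketch, the calculation can be written in Python as follows. The session durations are made-up illustrative values, and `arithmetic_mean` is a hypothetical helper name; the standard library's `statistics.mean` is used only to cross-check the result.

```python
import statistics

# Hypothetical session durations in minutes (illustrative values only).
durations = [2, 3, 3, 4, 8]

def arithmetic_mean(values):
    """Sum all values and divide by the number of values."""
    return sum(values) / len(values)

mean_duration = arithmetic_mean(durations)
print(mean_duration)               # 20 / 5 = 4.0
print(statistics.mean(durations))  # same value from the standard library
```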
Median Value
The median is the middle value of a sorted data set. The values are first sorted in ascending order, and the median is the value at the center of that ordering. With an even number of values, it is usually taken as the mean of the two central values.
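The sort-then-pick-the-middle procedure can be sketched in Python like this. The helper name `middle_value` and the sample values are assumptions made for illustration; for an even number of values the sketch follows the common convention of averaging the two central values.

```python
import statistics

def middle_value(values):
    """Return the median: the central value of the sorted data.

    With an even number of values, a common convention (assumed here)
    is to average the two central values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [7, 1, 4, 3, 9]   # sorted: 1, 3, 4, 7, 9
even_sample = [7, 1, 4, 3]     # sorted: 1, 3, 4, 7

print(middle_value(odd_sample))    # 4
print(middle_value(even_sample))   # (3 + 4) / 2 = 3.5
```

The standard library's `statistics.median` returns the same values for these samples.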
Mode Value
The mode is the value that occurs most frequently in a data set and is especially useful for categorical or discrete data. If no value occurs more often than the others, the data set is considered to have no mode.
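One way to find the mode is to count how often each value occurs, as in the sketch below. The function name `modes` and the sample data are hypothetical. The sketch assumes that a data set in which all distinct values are tied has no mode, and it returns every tied winner when only some values share the top frequency.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    winners = [value for value, count in counts.items() if count == top]
    # Assumption: when several distinct values all occur equally often,
    # no value stands out, so the data set is treated as having no mode.
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners

# Hypothetical categorical data: the genre of each installed game.
genres = ["puzzle", "casual", "puzzle", "rpg", "puzzle", "casual"]

print(modes(genres))        # ['puzzle']
print(modes([1, 2, 3, 4]))  # [] (every value occurs once)
```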
Statistical Distribution
A statistical distribution gives insight into the spread of data and the probabilities associated with different outcomes. Simply put, it shows how common or rare different values are in a data set.
The graph shows the number of users for each session duration, that is, the distribution of session durations. It can also be converted to a density plot. To do so, the counts are normalized: each count is divided by the total number of users, so that all values in the graph sum to 1 (100%).
The graph mirrors the one in the preceding image, but with a modified y-axis that now shows the proportion of all users sharing the same session duration.
This information shows that 20% of users have an initial session length of 2 minutes. In addition, if we were to select a random user and calculate the duration of their session, then with a 20% probability it would be 2 minutes, with a 15% probability it would be 3 minutes, and so on.
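The normalization step can be sketched in a few lines of Python. The counts below are hypothetical and were chosen only to echo the percentages mentioned in the text; they are not the data behind the graph.

```python
import math

# Hypothetical number of users per first-session duration in minutes
# (illustrative counts only, not the data behind the article's graph).
users_by_duration = {0: 6, 1: 12, 2: 20, 3: 15, 4: 10, 5: 37}

total_users = sum(users_by_duration.values())

# Divide each count by the total so that the values sum to 1 (100%).
share_by_duration = {
    minutes: count / total_users
    for minutes, count in users_by_duration.items()
}

print(share_by_duration[2])   # 0.2
print(share_by_duration[3])   # 0.15
print(math.isclose(sum(share_by_duration.values()), 1.0))  # True
```

Each share can be read as the probability that a randomly selected user has that session duration.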
Common statistical distributions include the normal distribution, which is characterized by a bell-shaped curve, and the uniform distribution, in which all values have equal probability. The mode, mean, and median relate to each other differently depending on the distribution. For example, in a normal distribution they coincide, so in a sample drawn from one they are usually almost identical. In a log-normal distribution, the median always falls below the mean, and so on.
The key takeaway is that relying on mean values alone is not always sufficient. Looking at the distribution of an attribute provides valuable information about the variability and nature of the data.
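A small simulation can illustrate the difference between the two cases. The sketch below draws random samples with Python's standard `random` module; the sample size, seed, and distribution parameters are arbitrary assumptions, and the exact numbers printed depend on them.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
n = 10_000

# Samples from a normal and a log-normal distribution (parameters assumed).
normal_sample = [random.gauss(0, 1) for _ in range(n)]
lognormal_sample = [random.lognormvariate(0, 1) for _ in range(n)]

normal_mean = statistics.mean(normal_sample)
normal_median = statistics.median(normal_sample)
lognormal_mean = statistics.mean(lognormal_sample)
lognormal_median = statistics.median(lognormal_sample)

# Normal sample: mean and median are close to each other.
print(normal_mean, normal_median)
# Log-normal sample: the long right tail pulls the mean above the median.
print(lognormal_mean, lognormal_median)
```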
Percentile
A percentile is a metric that indicates the relative position of a particular value in a set of data, denoting the percentage of values equal to or below the given value.
Let’s apply this concept to our example. In the graph, approximately 6% of users had a session duration of about 0 minutes (rounding is applied in calculations). This value corresponds to the 6th percentile, meaning that 6% of the sample has a value less than or equal to 0 minutes.
An important condition for percentile calculation is having properly sorted data.
This might sound familiar — that’s exactly how we interpret the median. The median is essentially a special percentile, namely the 50th percentile.
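To make the definition concrete, the sketch below computes the share of values at or below a given value on sorted data and checks it against the median. The helper name `percentile_rank` and the durations are hypothetical.

```python
from bisect import bisect_right
import statistics

def percentile_rank(sorted_values, x):
    """Percentage of values that are less than or equal to x.

    The input must already be sorted in ascending order.
    """
    return 100 * bisect_right(sorted_values, x) / len(sorted_values)

# Hypothetical session durations in minutes, sorted before use.
durations = sorted([4, 0, 12, 2, 9, 1, 20, 3, 8, 6])

print(percentile_rank(durations, 0))   # 10.0
print(percentile_rank(durations, 8))   # 70.0

median = statistics.median(durations)      # (4 + 6) / 2 = 5.0
print(percentile_rank(durations, median))  # 50.0
```

In this sample, 0 minutes sits at the 10th percentile and the median sits at the 50th.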
Quartile
Quartiles are the three values that divide a sorted data set into four equal parts; they correspond to the 25th, 50th, and 75th percentiles and are labeled Q1, Q2, and Q3. Q1, the lower quartile, is the value below which the lowest 25% of the data lies. Q2 is the median and splits the data into two equal halves. Q3, the upper quartile, is the value below which 75% of the data lies, so only the highest 25% lies above it. Quartiles are a robust tool for understanding data spread and pinpointing outliers.
For example, a third quartile value of 6 minutes means that 75% of users had a first session that lasted 6 minutes or less, while only 25% had a longer session. This is remarkable when contrasted with the mean value for the entire sample, which is 5.7 minutes. It highlights how a single summary statistic such as the mean can mask the nuances of the actual data.
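As a rough illustration, Python's standard `statistics.quantiles` function returns the three quartile cut points when called with `n=4`. The durations below are hypothetical, and tools use slightly different interpolation conventions, so the exact cut points can vary between libraries.

```python
import statistics

# Hypothetical first-session durations in minutes (illustrative only).
durations = [1, 2, 3, 3, 4, 5, 6, 7, 50]

# n=4 returns the three cut points Q1, Q2, and Q3.
q1, q2, q3 = statistics.quantiles(durations, n=4)

# Share of users whose session lasted Q3 minutes or less.
share_at_or_below_q3 = sum(d <= q3 for d in durations) / len(durations)

print(q1, q2, q3)
print(q2 == statistics.median(durations))  # Q2 is the median
print(share_at_or_below_q3)
```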
Decile
Deciles are the nine values that divide a sorted data set into ten equal parts; they correspond to the 10th, 20th, ..., 90th percentiles. The first decile, D1, is the value below which the lowest 10% of the data lies, while the ninth decile, D9, is the value below which the lowest 90% of the data falls. Like quartiles, deciles offer a nuanced view of data spread, and they are particularly valuable for large data sets, where they help identify patterns and trends across segments.
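The same standard-library function can sketch deciles when called with `n=10`, which yields nine cut points. The 19 durations below are hypothetical, and other tools may interpolate slightly differently.

```python
import statistics

# Hypothetical first-session durations in minutes for 19 users.
durations = [0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 8, 10, 15, 30]

# n=10 returns the nine cut points D1 through D9.
deciles = statistics.quantiles(durations, n=10)
d1, d5, d9 = deciles[0], deciles[4], deciles[8]

print(deciles)
print(d1, d9)                              # lowest 10% and lowest 90% boundaries
print(d5 == statistics.median(durations))  # D5 is the median
```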
Let’s Practice
If these terms and indicators still seem similar, an example should help tell them apart.
Imagine you’re tasked with estimating the duration of app usage during the first session. You manually select nine users and calculate the first session duration in minutes for each. Armed with this small data set, you then calculate the mean, median, and mode.
The most frequently occurring session duration is three minutes, observed in two users.
The median time is four minutes. With nine values, the median, positioned as the 5th value in the sorted list, reveals that half of the users had a first session lasting four minutes or less, while the other half exceeded this duration.
The mean duration of the first session is calculated to be nine minutes.
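The nine individual durations are not listed in the text, so the data set below is a hypothetical one, constructed only so that it reproduces the three figures above (mode 3, median 4, mean 9). It is a sketch of how such a sample could look, not the actual data.

```python
import statistics

# Hypothetical data: chosen only to reproduce mode 3, median 4, mean 9.
first_session_minutes = [1, 2, 3, 3, 4, 5, 6, 7, 50]

mode_minutes = statistics.mode(first_session_minutes)
median_minutes = statistics.median(first_session_minutes)
mean_minutes = statistics.mean(first_session_minutes)

print(mode_minutes)    # 3
print(median_minutes)  # 4
print(mean_minutes)    # 9
```

In this made-up sample, a single 50-minute session is enough to pull the mean up to 9 minutes, even though eight of the nine users stayed for 7 minutes or less.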
But could it be that our results were influenced by the luck of the draw with the selected users?
To mitigate this, we expanded our analysis to encompass 75 thousand users in the latest app version, plotting the results on a graph. The x-axis represents session length, and the y-axis denotes the number of users with corresponding session lengths.
Here’s what we found: the most common session duration is two minutes, true for nearly 15 thousand users. The median time is 3 minutes, and the mean is 5.7 minutes. Despite the apparent similarity in these indicators, they yield significantly different results.
In real-world scenarios, such variations are commonplace because the distribution of indicators in research can differ markedly.
In conclusion, effective data analysis requires an understanding of fundamental concepts, statistical terms, and indicators. We hope this article will help you differentiate between these concepts and foster a healthy skepticism about mean or average data.