## Computing the Mode in a Multimodal dataset

*January 22, 2013 at 9:05 am* *
3 comments *

In a recent project in which I participated, one of the main metrics the user wanted to visualize involved mode values. I will write about some of the complexities we encountered while implementing it, but before diving into it let’s remember what a mode value is.

### Mode Basics

In simple terms, the mode value is the most commonly occurring value in a dataset. For example, suppose we have the following dataset:

`{7, 7, 5, 4, 3, 3, 3, 9, 0}`

In the above example, we can see that the value which appears the most is the number 3, with a total of three instances.

QlikView has a built-in function to calculate the mode value in a given dataset. For instance, if the field containing the values we want to evaluate is named Grades, then we would use the following expression to obtain its most occurring value:

`Mode(Grades)`

### More than one mode?

Now comes the tricky part. Given the following dataset, which value would the Mode() function return?

`{7, 7, 5, 5, 5, 3, 3, 3, 9}`

We now have two values which occur the most (numbers 5 and 3), both with equal number of instances. This dataset is called a bimodal dataset, since there are two mode values. The problem resides in the fact that the Mode() function will return null whenever there is more than one mode value. Our user required QlikView to handle these instances differently: returning the lowest of the mode values. So, in the above bimodal dataset, we would return the number 3.

To approach this, we discarded the Mode() function and built our own *modified mode* expression. Once our expression was created and properly tested, we stored it into a variable with parameters to make it more versatile and serve as if it were an actual function, with the ability to call it from different charts. I will now explain in more detail how we achieved this.

### Calculating the mode, step by step

To illustrate, suppose we are working with the dataset of a University, containing the different *Grades* obtained by the students across several courses. The steps required to calculate the mode value are: first, we obtain the frequency of each *Grade* value in the dataset. Then, we pick the one that occurred the most. If there is more than one highest-frequency value, we pick the lowest of those values.

We can visualize this in the following straight table:

As you can see, the two mode values are 100 and 90, each occurring 201 times. The second step would be to pick the lowest value. Since 90 is lower than 100, our modified mode function should return 90.

I’ve used a straight table with the *Grade* field as dimension and *Count(Grade)* as expression, and manually sorted the table based on the second column, in descending order. We will now calculate this same value, but without using the straight table. Enter the *Aggr()* function. We can use the following expression:

`Min(Aggr(If(Count(Grade) = Max(Total Aggr(Count(Grade), Grade)), Grade), Grade))`

What the above expression does is:

- Calculate the maximum number of occurrences that any Grade value has in the dataset. This is done using
*Max(Total Aggr(Count(Grade), Grade))*. The result of this part of the expression would be 201 in the above example.- Note this is not calculating the max Grade value (first column), but the maximum number of occurrences (second column) a single
*Grade*value has.

- Note this is not calculating the max Grade value (first column), but the maximum number of occurrences (second column) a single
- Then, we calculate the number of occurrences for each individual
*Grade*value and compare the result with the number we obtained in the previous step. If the number of occurrences for a given*Grade*value is equal to that of the max occurrences, that particular value is “preserved”, otherwise, it is discarded. We achieve this by using the*If*expression inside the*Aggr()*function:

`If(Count(Grade) = Max(Total Aggr(Count(Grade), Grade))`

- Once all
*Grade*values are “scanned” by the*Aggr()*function, we will have a list of all of the values that were preserved. Using the above example, we will have the following list of values:

`{100, 90}`

- To the list of values resulting from the
*Aggr()*function, we apply the*Min()*function and obtain the lowest of them, which in the above example is 90.

Now, we’ve covered the basics behind the modified mode expression, but that’s not the final solution. In order for us to accurately use the *Aggr()* function inside a Chart, we need to take into account the dimensions used in the chart.

### The Dimensions’ role

For example, we’ve seen that the most frequent Grades occurring in our entire dataset are 100 and 90. But maybe we want to visualize the most frequent Grade by Class, or by Semester, or by Professor. In that case we’d need to obtain a different “max occurrences” value for each subset of the data (for each Class, for instance), and then compare each individual Grade value within that subset to its corresponding subtotal.

So, now the above expression should be modified to the following:

`Min(Aggr(If(Count(Grade) = Max(Total <[Course Title]> Aggr(Count(Grade), Grade, [Course Title])), Grade), Grade, [Course Title]))`

We have added the *[Course Title]* field to the *Total* qualifier to produce a different “subtotal” for each course. Additionally, the *[Course Title]* field was added as a second dimension field in both *Aggr()* functions used in our expression.

We can compare the values resulting from the built-in function and the ones from the new modified expression with the following chart:

The advantage is clear: while the built-in function returns null for some courses (bimodal subsets), our new expression returns the lowest of the mode values.

Similarly, our chart could contain more than one dimension at the same time. We should add all dimensions used in the chart to the *Total* qualifier, as well as to the two *Aggr()* functions. Otherwise, our results would be inaccurate.

### Re-usability

Once our expression has been built, and after we’ve identified the different scenarios in which it could be used (i.e. No dimensions, 1 dimension, 2, etc), we can store it as a variable and add some parameters to make it re-usable regardless of the number of dimensions used, or the name of those dimensions. This approach will also help us manage our expressions centrally.

We can create a variable, name it **eModifiedMode**, and define it as follows:

`Pick($1,`

Min(Aggr(If(Count(Grade) = Max(Total <$2> Aggr(Count(Grade), Grade, $2)), Grade), Grade, $2)),

Min(Aggr(If(Count(Grade) = Max(Total <$2, $3> Aggr(Count(Grade), Grade, $2, $3)), Grade), Grade, $2, $3)),

Min(Aggr(If(Count(Grade) = Max(Total <$2, $3, $4> Aggr(Count(Grade), Grade, $2, $3, $4)), Grade), Grade, $2, $3, $4)))

In the above expression, the *Pick()* function houses three separate expressions: one to be used in one-dimension charts, another for two-dimension charts, and another for three-dimension charts (we could easily add more expressions for additional scenarios). This arrangement will help us dynamically select a different expression depending on where the variable is being called from, by using the first of the parameters ($1). If we have only one dimension, then we pass 1 as the first parameter and the first of the listed expressions will be used and so on.

Additionally, each of the listed expressions expects the corresponding fields across which the aggregation should be made. This is specified with the remaining parameters (*$2, $3, $4*).

To illustrate, suppose we have a straight table with two dimensions: *Semester* and *[Course Title]*. To calculate the *Grade* which occurs the most for each row, we can call our custom function as follows:

`$(eModifiedMode(2, Semester, [Course Title]))`

The three parameters we have used to call our variable are:

*2*, which is the number of dimensions used in the chart*Semester*, which is the first of our dimensions*[Course Title]*, which is the name of our second dimension.

If we had an additional dimension in the chart, Faculty for instance, then we’d call our variable as follows:

`$(eModifiedMode(3, Semester, [Course Title], Faculty))`

### Some considerations

Finally, there are some considerations we have to take into account when using this approach, let’s discuss some of them.

#### Response Times

When using this approach, we must take into account the fact that, since it is not a built-in function, it might not perform as optimal as we’d expect. There are several steps the computation has to take to arrive at the final result. However, when implementing this with our customer, we were able to see that response times were still acceptable in a ~100-million rows app.

In the end, we decided to implement a mechanism to allow the user to select if he wanted to consider the multimodal nature of the dataset in the calculation or only use the “natural”, built-in function. This made sense as there were only a few, rare instances in which the dataset became multi-modal.

#### Pre aggregations

Another way of approaching this could have been to pre-calculate everything at script run, and use simple aggregations in the front end. However, we discarded this option as it would mean to limit the functionality of the app. By calculating the result in the front end, we allowed the user to filter the data in whichever way he decided, with no restrictions, and the result will always consider those selections. At the end of the day, that’s one of the advantages of QlikView: ad-hoc querying and no need for pre-aggregates.

### Extending the modified mode expression

The example I have described uses the *Min()* function to retrieve the lowest of the mode values (when there is more than one). However, the expression could easily be extended to use different operations: Min, Max, Avg, etc. This could be implemented using an island table, for example, but since this has been a long post, I will leave that up to you to explore.

If you would like to explore this approach first-hand, you can download the example app by clicking here.

To know more about expression variables, the use of variable parameters and advanced expressions, be sure to check out our book: QlikView 11 for Developers.

#QlikOn!

Entry filed under: Advanced Expressions. Tags: advanced expressions, aggr, variables.

1.WL | December 8, 2013 at 9:49 amThanks! I also wonder if there is some way of doing multi mode for a data set with a average approach for fractions. For instance if you have the the data set of (3.1, 3.2, 3.3, 12.5, 12.6, 12.7) there are two clearly noticeable peaks around 3 and 12. The average of each peak is 3.2 and 12.6.

But how to automatically calculate for large dynamic data set?

It might have some relation to “linear” regression:

http://en.wikipedia.org/wiki/Linear_regression

And the method will probably need some kind of threshold for minimum and maximum fractional multi modes to find.

2.Mike | December 8, 2013 at 10:21 amHi WL,

There’s always a way :)

From what you mention, this would probably be solved by adding a new field to the data model from the script. For instance, if you have the field “Value (fraction)”, then you can create a new one called “Value (rounded)” which will associate several fractional values to a single “grouping” value. Then, in your expression, you can use Aggr(), as shown in the example of this post, using the new grouping value as dimension and applying the Avg() to the fractional values corresponding to each grouping value.

Probably a complex operation to describe, but hope you get the idea.

Good luck with it and thanks for leaving your comment.

Miguel.

3.WL | December 13, 2013 at 3:02 pmThanks for your reply and ideas. The closest thing I’ve come across is moving average, http://en.wikipedia.org/wiki/Moving_average. But ideally there is some other formula for large set of values (not in a x-y graph system) where the areas of interest are selected with a radius proportionally to the data spread.