Computing the Mode in a Multimodal dataset

In a recent project in which I participated, one of the main metrics the user wanted to visualize involved mode values. I will write about some of the complexities we encountered while implementing it, but before diving into it let’s remember what a mode value is.

Mode Basics

In simple terms, the mode value is the most commonly occurring value in a dataset. For example, suppose we have the following dataset:
{7, 7, 5, 4, 3, 3, 3, 9, 0}

In the above example, we can see that the value which appears the most is the number 3, with a total of three instances.

QlikView has a built-in function to calculate the mode value in a given dataset. For instance, if the field containing the values we want to evaluate is named Grades, then we would use the following expression to obtain its most occurring value:
Mode(Grades)

More than one mode?

Now comes the tricky part. Given the following dataset, which value would the Mode() function return?

{7, 7, 5, 5, 5, 3, 3, 3, 9}

We now have two values which occur the most (numbers 5 and 3), both with equal number of instances. This dataset is called a bimodal dataset, since there are two mode values. The problem resides in the fact that the Mode() function will return null whenever there is more than one mode value. Our user required QlikView to handle these instances differently: returning the lowest of the mode values. So, in the above bimodal dataset, we would return the number 3.

To approach this, we discarded the Mode() function and built our own modified mode expression. Once our expression was created and properly tested, we stored it into a variable with parameters to make it more versatile and serve as if it were an actual function, with the ability to call it from different charts. I will now explain in more detail how we achieved this.

Calculating the mode, step by step

To illustrate, suppose we are working with the dataset of a University, containing the different Grades obtained by the students across several courses. The steps required to calculate the mode value are: first, we obtain the frequency of each Grade value in the dataset. Then, we pick the one that occurred the most. If there is more than one highest-frequency value, we pick the lowest of those values.

We can visualize this in the following straight table:

As you can see, the two mode values are 100 and 90, each occurring 201 times. The second step would be to pick the lowest value. Since 90 is lower than 100, our modified mode function should return 90.

I’ve used a straight table with the Grade field as dimension and Count(Grade) as expression, and manually sorted the table based on the second column, in descending order. We will now calculate this same value, but without using the straight table. Enter the Aggr() function. We can use the following expression:

Min(Aggr(If(Count(Grade) = Max(Total Aggr(Count(Grade), Grade)), Grade), Grade))

What the above expression does is:

  1. Calculate the maximum number of occurrences that any Grade value has in the dataset. This is done using Max(Total Aggr(Count(Grade), Grade)). The result of this part of the expression would be 201 in the above example.
    • Note this is not calculating the max Grade value (first column), but the maximum number of occurrences (second column) a single Grade value has.
  2. Then, we calculate the number of occurrences for each individual Grade value and compare the result with the number we obtained in the previous step. If the number of occurrences for a given Grade value is equal to that of the max occurrences, that particular value is “preserved”, otherwise, it is discarded. We achieve this by using the If expression inside the Aggr() function:
    If(Count(Grade) = Max(Total Aggr(Count(Grade), Grade))
  3. Once all Grade values are “scanned” by the Aggr() function, we will have a list of all of the values that were preserved. Using the above example, we will have the following list of values:
    {100, 90}
  4. To the list of values resulting from the Aggr() function, we apply the Min() function and obtain the lowest of them, which in the above example is 90.

Now, we’ve covered the basics behind the modified mode expression, but that’s not the final solution. In order for us to accurately use the Aggr() function inside a Chart, we need to take into account the dimensions used in the chart.

The Dimensions’ role

For example, we’ve seen that the most frequent Grades occurring in our entire dataset are 100 and 90. But maybe we want to visualize the most frequent Grade by Class, or by Semester, or by Professor. In that case we’d need to obtain a different “max occurrences” value for each subset of the data (for each Class, for instance), and then compare each individual Grade value within that subset to its corresponding subtotal.

So, now the above expression should be modified to the following:
Min(Aggr(If(Count(Grade) = Max(Total <[Course Title]> Aggr(Count(Grade), Grade, [Course Title])), Grade), Grade, [Course Title]))

We have added the [Course Title] field to the Total qualifier to produce a different “subtotal” for each course. Additionally, the [Course Title] field was added as a second dimension field in both Aggr() functions used in our expression.

We can compare the values resulting from the built-in function and the ones from the new modified expression with the following chart:

The advantage is clear: while the built-in function returns null for some courses (bimodal subsets), our new expression returns the lowest of the mode values.

Similarly, our chart could contain more than one dimension at the same time. We should add all dimensions used in the chart to the Total qualifier, as well as to the two Aggr() functions. Otherwise, our results would be inaccurate.

Re-usability

Once our expression has been built, and after we’ve identified the different scenarios in which it could be used (i.e. No dimensions, 1 dimension, 2, etc), we can store it as a variable and add some parameters to make it re-usable regardless of the number of dimensions used, or the name of those dimensions. This approach will also help us manage our expressions centrally.

We can create a variable, name it eModifiedMode, and define it as follows:

Pick($1,
Min(Aggr(If(Count(Grade) = Max(Total <$2> Aggr(Count(Grade), Grade, $2)), Grade), Grade, $2)),
Min(Aggr(If(Count(Grade) = Max(Total <$2, $3> Aggr(Count(Grade), Grade, $2, $3)), Grade), Grade, $2, $3)),
Min(Aggr(If(Count(Grade) = Max(Total <$2, $3, $4> Aggr(Count(Grade), Grade, $2, $3, $4)), Grade), Grade, $2, $3, $4)))

In the above expression, the Pick() function houses three separate expressions: one to be used in one-dimension charts, another for two-dimension charts, and another for three-dimension charts (we could easily add more expressions for additional scenarios). This arrangement will help us dynamically select a different expression depending on where the variable is being called from, by using the first of the parameters ($1). If we have only one dimension, then we pass 1 as the first parameter and the first of the listed expressions will be used and so on.

Additionally, each of the listed expressions expects the corresponding fields across which the aggregation should be made. This is specified with the remaining parameters ($2, $3, $4).

To illustrate, suppose we have a straight table with two dimensions: Semester and [Course Title]. To calculate the Grade which occurs the most for each row, we can call our custom function as follows:
$(eModifiedMode(2, Semester, [Course Title]))

The three parameters we have used to call our variable are:

  1. 2, which is the number of dimensions used in the chart
  2. Semester, which is the first of our dimensions
  3. [Course Title], which is the name of our second dimension.

If we had an additional dimension in the chart, Faculty for instance, then we’d call our variable as follows:

$(eModifiedMode(3, Semester, [Course Title], Faculty))

Some considerations

Finally, there are some considerations we have to take into account when using this approach, let’s discuss some of them.

Response Times

When using this approach, we must take into account the fact that, since it is not a built-in function, it might not perform as optimal as we’d expect. There are several steps the computation has to take to arrive at the final result. However, when implementing this with our customer, we were able to see that response times were still acceptable in a ~100-million rows app.

In the end, we decided to implement a mechanism to allow the user to select if he wanted to consider the multimodal nature of the dataset in the calculation or only use the “natural”, built-in function. This made sense as there were only a few, rare instances in which the dataset became multi-modal.

Pre aggregations

Another way of approaching this could have been to pre-calculate everything at script run, and use simple aggregations in the front end. However, we discarded this option as it would mean to limit the functionality of the app. By calculating the result in the front end, we allowed the user to filter the data in whichever way he decided, with no restrictions, and the result will always consider those selections. At the end of the day, that’s one of the advantages of QlikView: ad-hoc querying and no need for pre-aggregates.

Extending the modified mode expression

The example I have described uses the Min() function to retrieve the lowest of the mode values (when there is more than one). However, the expression could easily be extended to use different operations: Min, Max, Avg, etc. This could be implemented using an island table, for example, but since this has been a long post, I will leave that up to you to explore.

If you would like to explore this approach first-hand, you can download the example app by clicking here.

To know more about expression variables, the use of variable parameters and advanced expressions, be sure to check out our book: QlikView 11 for Developers.

#QlikOn!

January 22, 2013 at 9:05 am Leave a comment

QlikView 11 for Developers: What’s Inside?

It is with great excitement that I post this announcement to let you know a bit more about the “QlikView 11 for Developers” book that my friend Barry and I have co-authored.

As you may already know from our initial announcement, the book will officially be released around November 15th (though you can submit your pre-order now). If you’ve heard the news, you may be wondering what you will find inside the book, who is it for, and what are the features that make it different from the various other resources available for learning QlikView. Let’s go through some of these features.

 

Practical and Hands-on

The book is filled with examples that will let you take the theory into practice right away. We support this hand-on experience by providing a full dataset used across the entire book, and around which we build a fully-functional QlikView document that contains a dashboard, various analyses (both basic and complex to build) and reports, using the well-known DAR principle.

Chapter by chapter, a piece of the final QlikView document is built, which allows you to follow its evolution from start to finish. It also enables us to cover different development challenges that you may encounter in a real-world QlikView project.

 

Backend and frontend development

We made sure to cover both backend and frontend development, so you will find that all 14 chapters cover different topics, from scripting and data extraction to data modeling, design, charts and expressions, as well as security and everything in between. We also talk about various best practices related to each of these topics.

 

Open code

All of the examples discussed in the book are complemented with solution files for the reader to follow the exercises and compare his/her work. The QVW files we provide are Personal Edition-enabled, which means that a purchased QlikView license will not be required to open them.

 

Finally, real-world

Although the case is built around a fictitious company, the data we use in our examples and final application is real. Thanks to the Open Government initiative and the Bureau of Transportation Statistics of the United States, which compiles and maintains a complete dataset about Airline Operations in the US, you will able to work with real data and build a QlikView application to analyze flights, enplaned passengers, cargo, etc, across multiple dimensions such as carriers, airports, cities, aircraft types, etc.

I invite both newcomers and seasoned QlikView developers to place your orders and get this book on your hands.

Qlik On!

 

November 1, 2012 at 9:00 am 15 comments

Business is a Jeopardy! game

Answers. Only those that have them keep on playing the game. Moreover, those that ask the right questions will always stay in advantage.

Everyone that has played any trivia game knows that having the right answers is not an easy feat, but if you’ve played Jeopardy! you’ll know that it is even harder to turn an answer into a winning question; that’s the key differentiator of the game.

Jeopardy! is an American TV quiz show that features trivia in a wide variety of topics; unlike many other trivia games, it has a unique answer-question format in which contestants are presented with clues in the form of answers, and must phrase their responses in question form. (Wikipedia: http://en.wikipedia.org/wiki/Jeopardy!)

If you’ve never heard of Jeopardy!, take a quick look at this video to see a Jeopardy game at play. One minute of the 19-minute video will suffice to give you an idea of the game, I recommend you to fast forward to minute 11:20.

In the trivia featured in the above video, for example, the participant is told that a certain kind of crime was up 50% in 2011, to which he answers “What is petnapping?”. Just as in Jeopardy!, business analysis has moved from a dynamic of simply asking a question and finding the answer, to turning that initial answer into a new question, planting the seed for further discoveries and deeper insights.

In the game, you don’t know what will come up next, what’s behind each window, how your knowledge will be tested; will it be math? science? arts? pop culture? It’s no different in the business world; you never know the circumstances under which your ability to provide solutions will be tested. One day a circumstance presents involving one of the many areas that affect an outcome: the supply chain, the sales process, the global market, the competitor, etc. The next day, it’s a new challenge: what products to invest on, where to reduce expenditures, how to boost performance. Business people have to be prepared for every new challenge, whatever that is and wherever it comes from.

While in the Jeopardy! game success is mostly determined by how fast the buzzer is hit by the participant, in the business realm, a very important part of success is determined by how the decision-makers arrive at their solutions. The “participants” must have the ability to quickly make sense of the ever-expanding world of information that supports their decisions, and to find answers in the mountains of data that pile up every day, from every source, about every business process. They also have to keep questioning facts, creating new hypotheses, so that new discoveries are unearthed. Discoveries only happen when we keep asking questions, when we analyze an answer, a principle, a paradigm from a different angle; and to keep asking questions, businesses must also provide a framework to easily keep answering them.

A man once said that:

By the time the people asking the questions are ready for the answers, the people doing the work have lost track of the questions.

- Norman Ralph Augustine

In the information era, with businesses rapidly evolving, the above statement is mostly true. However, new technologies have arisen to fill the gap and help business people hit the buzzer quickly and confidently. Meet QlikView.

We can narrow the similarities between Jeopardy! and the business game down to three characteristics:

  • You need to get answers fast
  • You need to keep asking winning questions
  • You never know what’s next

For decision makers to get answers, almost any BI tool does the trick. But to get them fast, only a few really make it possible.

To keep asking questions is what we’ve recently come to refer to as “business discovery”, a process which is enabled by a new generation of BI solutions, in which the analysis is user-driven and every new question can be easily answered on the spot and with just a few clicks. The technology is finally in sync with the user. No more waiting time.

The final characteristic is one that few BI tools have successfully accomplished: giving the users the ability to analyze unplanned scenarios. By empowering the user with tools that help them ask any question from any perspective, there are no more blind spots and IT doesn’t have to prepare or pre-aggregate the data for a specific set of questions or answers. On the contrary, without predefined queries or pre constructed paths, it’s the user’s thinking that guides the analysis and discovery process.

Imagine a Jeopardy! participant preparing for the contest, behind the cameras, by studying a few set of question – answer cards. You’ll agree that, if his game is solely based on previously studied cards, it’s very likely the player will open a lot of windows that will be left unanswered and, furthermore, unquestioned. It’s the same with business players.

With QlikView, the ones asking the questions and making the decisions are enabled to find the answers quickly and by themselves, so that they are ready for the new question behind the window and with the hand firm on the buzzer.

August 20, 2012 at 11:05 am Leave a comment

Idea: Distributed Resources for Development

This is not the common iQlik post, in which I try to share a particular technique, but it is something I would like to share with you and know what you think about it. I’ve been working with large Apps and I detected a possible feature that would be nice to have on QlikView desktop. Read on…

Continue Reading October 25, 2011 at 8:55 am 4 comments

Power Up your Big Apps with Binary Index Expansion

When dealing with Mega Apps (Qlikview applications in the order of 100+ Million rows), it is important to have a good user perception in calc times. Use this technique to improve performance using the server Cache.

Continue Reading September 11, 2011 at 12:16 am 11 comments

A New Choice of Armor: The Flag Matrix

As you know already, I’m a big fan of using Set Analysis in my QlikView applications, especially for Point In Time Reporting. However, there might be some specific situations in which Set Analysis just won’t cut it and you need an alternative… Continue Reading to discover it! (Demo Script included…)

Continue Reading May 22, 2011 at 2:51 pm 9 comments

QlikView mobile: AJAX Touch Client

It’s time for a new post, and today it is a very special one. QlikTech has released a beta version of the new “AJAX Touch Client” and I think it is major news and will give QlikView a big bump in the mobile filed.

Continue Reading April 2, 2011 at 3:46 pm 3 comments

Older Posts


Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 744 other followers

Recent Posts

Twitter


Follow

Get every new post delivered to your Inbox.

Join 744 other followers