White Paper
Last Revised: 03/06/04

SuperQuery;
Data Mining for Everyone

by
Ashraf Azmy

 

Table of Contents

  1. Introduction
  2. What is a pattern?
    2.1 Do we want to find all patterns and exceptions?
    2.2 What is an interesting pattern?
  3. The Fact Discovery Engine in SuperQuery
    3.1 Avoiding and filtering out uninteresting facts
    3.2 Configuring the Fact Discovery Engine
    3.3 Presenting and Studying the found facts
  4. Conclusions

1. Introduction

Data mining is the process of extracting potentially useful information from raw data. A software engine can scan large amounts of data and automatically report interesting patterns without requiring human intervention. Other knowledge discovery technologies are Statistical Analysis, OLAP, Data Visualization, and Ad hoc queries. Unlike these technologies, data mining does not require a human to ask specific questions. This makes data mining a fast and inexpensive way of summarizing, exploring, understanding, and analyzing data.

Given the massive amounts of data that has been stored during the past two decades, data mining is a logical tool to deploy. The more we analyze the data, the more we learn from it and improve our business systems. It is expected that more people will be involved in data analysis. The Internet will only increase the demand for data analysis tools because it is making more data available to more people.

In section two, we present definitions of terms commonly used in data mining like patterns, rules, facts, exceptions, confidence factors, and support levels. We also discuss the differences among Data Mining and other related technologies like Expert Systems. In section three, we explain how SuperQuery does Data Mining. A number of screen shots are presented and their function is explained. In the fourth section, we summarize and present our conclusions.

2. What is a pattern?

Data mining and knowledge discovery systems often talk about finding patterns, rules, and facts in row data. So, what is a pattern? In a data table, a pattern is defined as a set of rows that share the same values in two or more columns. Consider for example, the following table that contains data about objects; shape, color, and weight.

Row#

Shape

Color

Weight

1 ->

Box

Red

100

2 ->

Box

Red

200

3 ->

Box

Red

300

4

Box

Blue

400

5

Cone

Blue

400

Table 1: Objects Table

In this table, we have 3 rows (row 1, 2 and 3) that share the same values in two columns (Shape and Color). From this table, we can observe that:

Most Boxes are Red.

The Confidence with which we state this fact is 75%. The confidence is calculated by dividing the number of rows that have Box and Red (3) by the number of rows that have Box and any color (4).

Notice that, from a presentation point of view, we use "Most" when the confidence level is higher than 50% and less than 100%. If confidence is 100% we use "All". For example, "All Boxes are Red". On the other hand, if confidence is less than 50% we can still present the pattern using the percentage. For example, "40% of Boxes are Red".

Another parameter that we can use to differentiate between patterns is the number of rows that support the pattern. For example, the previous fact is supported by 3 rows. A pattern supported by a large number of rows is more compelling than a pattern supported by less rows.

We can distinguish powerful from weak patterns using a combination of confidence and support levels. The higher the confidence and support levels, the more powerful the pattern. This can be depicted in the following formula:

Power of Pattern Pp = ( c1 * Confidence ) + ( c2 * Support )
Where c1 and c2 are arbitrary constants.

Once we find patterns, we can also look for exceptions. For example, row 4 represents an exception or may be an error. This row has the only blue box, while all other boxes are red. This exception becomes more interesting if we have 1000 red boxes and just 1 blue box. An exception is defined as a weak pattern (or one row) that exists within the rows of more powerful patterns.

Three terms are used interchangeably; Pattern, Fact, and Rule. We can draw a distinction here between them as follows:

So what is the difference between Facts and Rules? Just the presentation. The following two statements represent the same pattern.

We find that using rules is more oriented to condition-action situations like in expert and prediction systems. While using facts is more suited to making observations and analysis which is typical of Knowledge Discovery and Data Mining applications.

2.1 Do we want to find all patterns and exceptions?

In an expert system, prediction system or database compression situations, the designer should be interested in all the patterns that represent the database. This allows the system to make more predictions. The general rule here is; the more patterns, the better.

In Data Mining, however, we are searching for useful information that is hidden in the database. And since this information is going to be used by human beings, we should be looking for interesting patterns.

2.2 What is an interesting pattern?

Consider the following three requirements:

  1. An interesting pattern is one that involves interesting columns.
  2. An interesting pattern is a powerful pattern that has high confidence and support levels.
  3. An interesting pattern depicts an unexpected or unknown information.

By allowing the analyst to select which columns to mine, we can fulfill the first requirement. The engine itself could also do a simple measurement of interest. By studying the clustering level of each column, the engine can eliminate columns with none or little clustering. The lower the clustering level, the less interesting the column is. At one extreme, a column with unique values, i.e. zero clustering is not interesting at all. On the other hand, columns with binary values tend to classify data and usually are more interesting.

The mining software can use the confidence and support levels to detect and report powerful patterns. A good mining engine should allow the user to try different levels of power and select the one appropriate for the analysis.

For an engine to fulfill the last definition, it has to know what patterns users know and expect in their data. The software can simply acquire this information from the user. A good engine would allow the user to input columns that they know are dependant and not interesting. For example, in a customer data table, we usually have Country, State, City, and ZIP columns. In this case, the analyst is not interested in any patterns like: All orders that have City = New York City also have State = New York. The engine should be able to accept such known dependencies and avoid rediscovering them. Moreover, the software may already have knowledge of dependant columns that are not interesting. For example, calculated columns. If the user created a column from calculations on other columns, then any pattern that exist between the calculated column and its source columns are -- most likely -- not interesting.

The advantage we have in Data Mining, is that finding interesting patterns is much less expensive than finding All patterns.

3. The Fact Discovery Engine in SuperQuery.

The Fact Discovery Engine is the part of SuperQuery that is responsible for data mining and discovering facts and exceptions.

The Fact Discovery Engine (FDE) is the part of SuperQuery that is responsible for the data mining. It searches for facts and exceptions in any table view. SuperQuery contains many other facilities to prepare, link, partition, classify, and summarize data tables. In addition, SuperQuery has other graphical and statistical engines that help analyze and explore data. The scope of this section is limited to the FDE.

A fact is an observation about a group of rows in the form "Most rows that have column c1 = v1, also have column c2 = v2". For example:

All orders that have Product = Jacket, also have Profit = High.
Most orders that have Product = Hat, also have Profit = Negative.
Only 1% of orders that have Product = Hat, have Profit = High.

The first two statements are facts and simply say that jackets are profitable and hats are not. The last statement is an exception. It points out that there are few orders of hats that are very profitable. These rows are interesting because they may give a clue as to when hat orders are profitable.

SuperQuery reports the confidence factor and support level of each fact or exception. For example, the second statement may have a confidence factor of 80% and may be supported by 1000 rows.

The FDE stores all discovered facts in a table called the Fact Table. This table is treated just like any data table. This means that the user can use all of SuperQuery analysis tools to explore and understand the found facts. This is especially important when the FDE discovers more than a few facts. The Fact Table View has an additional feature, which is the drill-down button. This button allows the user to inspect any fact and view its supporting rows. Drill down creates a new view containing those rows that support the current fact or exception. This gives the user instant graphs and statistics about a fact. The user can then choose to keep or remove the new view.

The FDE user interface has defaults to all controls, which means that the new user need only to provide a name for the output table and press "Run".


Figure 1: The Fact Discovery Engine Main Dialog.

In the following section we explain the Filters, Dependency, and Settings buttons and how to use them to modify or fine tune the performance of the engine.

3.1 Avoiding and filtering out uninteresting facts

It is important to avoid spending time discovering already known facts. The FDE's objective is to report only interesting facts. SuperQuery allows the user to enter groups of columns that are known to be dependent and their relation is not useful. The following dialog contains an example of such a situation.


Figure 2: Selecting Dependent Columns.

It is also important not to report any fact that can be implied from other more general facts. There are a number of other dependencies that SuperQuery is able to detect automatically and avoid. The FDE can automatically ignore:

The following screen shot shows all the patterns that the engine automatically ignore. This allows the user to cancel any or all filters and in the case of Most Common Values (MCV), the user can also adjust the filter.


Figure 3: Activating or Deactivating Automatic Filters.

Following is the explanation of each option:

3.2 Configuring the Fact Discovery Engine

SuperQuery provides the user with full control on the quality of facts found. This is done by allowing the user to change the support requirements at various confidence levels. Each fact is supported by a number of rows. For example, if the fact:

Most products where Product = Jacket have Profit = High

is supported by 100 rows. This means that there are 100 rows in the table where the Product = Jacket and at the same time, Profit = High. The user can increase or decrease the support level that a fact must achieve to be reported. Increasing the support level reduces the number of facts discovered. However, those facts are likely to be interesting and/or important. On the other hand, sometimes, it is required to find all the patterns or facts. In this case, the support level is decreased. The fact discovery process will take more time to process. Each slider has two numbers; the percentage at the left side is the confidence range that this slider controls. The number at the right side is the value of the slider itself.As the slider value increases, the number of facts increases and the support level requirements decreases.


Figure 4: The Fact Discovery Settings Dialog.

The blue sliders control the number of Facts found at different confidence levels. First Slider is for facts of type ALL i.e. confidence 100%. The next slider is for facts of type MOST and have confidence level in the range 80% to 99%. Each slider operates on one confidence range. Increasing the slider, increases the number of facts reported. This is done by relaxing the support requirements on facts at that confidence range. On the other hand, decreasing the slider decreases the number of facts by requiring each fact to be supported by more rows. This means that the engine could be configured in a variety of ways. For example, it could find only facts with 100% confidence, or facts with confidence in the range 80% to 99%, or any combination.

The red sliders are for Exceptions. Each slider controls the number of exceptions found. Here we use Clustering levels instead of Support levels. For an exception to be reported in a column, this column must have some dominant values or clusters. As the slider value increases, the number of exceptions requested increases and the level of clustering required is decreased.

The engine also allows the user to select the highest fact level to search for . A fact like

Most products where Product = Jacket have Profit = High

is considered Fact Level One, because it has 1 condition in its condition side. On the other hand a fact like

Most products where Product = Jacket and Color = Black have Profit = High

is a Fact Level Two, since it has 2 conditions in its condition side.

The default level is 1, and usually 2 or 3 are enough. The higher the fact level, the longer the discovery process takes.

3.3 Presenting and Studying the found facts

SuperQuery has three formats for presenting its discovered facts and exceptions:

While the engine is running, patterns are reported as rules in an If-Then statements as shown in the following dialog.


Figure 5: Reporting Rules

The fact table format is more suitable for manipulating large number of facts. Since facts are represented in a normal data table, it is possible to set various filters, drill-down, and even view graphs. From the fact table, one can:

One can even run the fact engine on the fact table itself to see if there are any patterns among the facts themselves! The user can always generate the other two text formats from the fact table.

As shown in the following figure, the current fact is also represented in a textual form in a separate pane to help the user conceive the fact.

 


Figure 6: The Fact Table View.
Illustration lines and text are added to the actual screen shot.

4. Conclusions

Data Mining is the natural evolution of query and reporting tools. Everyone who creates queries and reports, benefits from having data mining capabilities. We probably have more data than what we can handle, therefore, more data analysis tools are needed. We need to have computers sift through data and report the important patterns for us. Moreover, since the data mining process is systematic, we will be able to discover information that are otherwise completely hidden.

SuperQuery makes data mining available to everyone by integrating all the data mining functions in one simple framework. These functions include:

SuperQuery answers queries easily and intuitively but it really shines when the query is not defined at all, when the objective is to explore and discover.

 

For more information
For additional information, please access AZMY's Web site at
http://www.azmy.com.
Send e-mail messages to
info@azmy.com.

1997 AZMY Thinkware Inc. All rights reserved.

HOME