Useless Knowledge

Saturday, April 26, 2008

Ten Mistakes to Avoid for Data Warehouse Project Managers

By Larissa Moss - TDWI

FOREWORD

Most organizations treat project management as an administrative function. A project manager often “manages” multiple projects. However, a more accurate way to define a project manager would be to say that he or she “administers” multiple projects because he/she is rarely involved in any daily project activities. The project teams merely report to him/her.

Project managers, assuming that data warehouse projects are like any other project, are often surprised when their data warehouse project spins out of control. The requirements appear to be a “moving target;” the schedule keeps slipping; the source data is much dirtier than expected and is impacting the ETL team; the staff does not have the necessary skills and is not properly trained; communication between staff members takes too long; traditional roles and responsibilities, and how they are assigned, seem to result in too much rework; the traditional methodology does not seem to work; and so on.

Techniques that work on other projects do not work well on data warehouse projects. This booklet describes how to avoid 10 common mistakes made by data warehouse project managers.

Top

ABOUT THE AUTHOR

Larissa Moss, president of Method Focus Inc., specializes in data warehousing, business intelligence, information quality, data integration, project management, and spiral data warehouse methodologies. She presents and lectures at various conferences worldwide. She co-authored the books Data Warehouse Project Management, Impossible Data Warehouse Situations, Business Intelligence Roadmap, and Data Strategy. Her works have been published in trade journals, including DM Review, Teradata Magazine, and TDWI’s Business Intelligence Journal. Additionally, her white papers are available through the Cutter Consortium and NCR/Teradata. She can be reached at methodfocus@earthlink.net.

Top

1. Failing to Use a Methodology

Software development has become relatively lax over the past two decades, and the use of system development methodologies has become more of an exception rather than a rule. Project teams—as well as business users—seem to think that with all the new development tools available, system development is (or should be) trivial. They are often surprised to learn that project managers and project teams must consider approximately 920 tasks when developing a data warehouse. Who can remember 920 tasks? No one. But every one can look up 920 tasks in a methodology.

Having the right kind of methodology is important. It cannot be a traditional “waterfall” methodology because that type of methodology assumes you are building a stand-alone “final” product, which does not have to integrate with any other product and will not dramatically evolve or expand over time. Thus, a traditional methodology does not include cross-organizational business integration tasks. Since a data warehouse is an evolving environment with many databases and applications, it is important to design databases and processes for reuse whenever possible. This requires specific integration tasks that a data warehouse methodology must provide.

In addition, a data warehouse methodology must take into account that a data warehouse environment cannot be built all at once. In other words, the deliverable will not be a stand-alone “final” product, but will have to be expanded and enhanced over time. If a data warehouse is successful, then each release will most likely generate new requirements. Sometimes these requirements will be for a brand new data warehouse application, but many times they are simply an enhancement of an existing application. Periodically, these new requirements may even demand that new technology be evaluated and purchased. A data warehouse methodology provides appropriate tasks for all of these activities.

Another differentiating aspect of data warehouse projects is that you have to manage multiple sub-projects in parallel. One such sub-project is the development of the data warehouse application (e.g., reports, canned queries, or customized cubes for slicing and dicing). Then there is the ETL process, including data profiling, data transformations, and data cleansing in addition to source data extracting and target data loading. A third sub-project may be building and loading the metadata repository. And there may even be a data mining deliverable, requiring its own development track.

A data warehouse methodology includes tasks for all of these development activities, and recognizes that many of these activities can run simultaneously.

Since metadata is an important deliverable, it deserves special mention when discussing methodologies. Not only does a data warehouse methodology have to include tasks for gathering, storing, and delivering metadata to the business community, it must also provide tasks for either evaluating and installing a purchased metadata repository product, or designing and building one.

In an evolving and expanding data warehouse environment, where maximum reusability must be built into all deliverables, it is important to continuously review and improve the environment. That means reviewing old and new requirements against existing data warehouse databases and applications, and finding ways to reuse what has already been built. Such reviews may result in requirements for minor database design changes, or program changes to the ETL process, reports, queries, or other applications. The methodology must provide tasks for conducting such reviews and folding the resulting changes into the next data warehouse project.

Taking the many development steps into account (from business case assessment to post-implementation review) and considering that most data warehouse projects are composed of several sub-projects, it is easy to understand that there are hundreds of tasks to be considered. Naturally, not all tasks have to be executed on each project, but all tasks must be known to the project manager so that he/she can pick the right ones for each data warehouse release.

The role of a methodology is to provide a list of all possible tasks, their dependencies, the roles and responsibilities assigned to execute them, and the deliverables resulting from them. Not using a methodology almost guarantees that vital tasks will be dropped, requiring rework that could have been avoided.

Top

2. Ineffective Project Team Structure

Traditional project teams are not structured to cope effectively with the dynamic nature of data warehouse projects and cannot react fast enough to the constant changes and challenges. What’s a traditional project team structure? Typically, the project manager alone defines and plans the project, and assigns a discrete set of tasks to each project team member. When a team member completes a task, the deliverable gets “handed over the cubicle wall” to the next team member who performs his/her assigned tasks and hands over the deliverable to the next person, and so on. Then on Friday afternoon, all team members submit a status report of their individually assigned tasks to the project manager who uses these reports to monitor and control the project activities. Occasionally, or regularly, team meetings are called to exchange information, and when a problem arises, special meetings are arranged with the business people or other stakeholders who can help resolve the issue.

A data warehouse team must be much more flexible and dynamic than that. There should be a core team of four to five people who together define, plan, and co-lead the project. The core team should be thought of as a high-powered, self-organizing SWAT team. Core team members must be 100 percent available from the beginning to the end of the data warehouse project. They brainstorm together, assign work to each other, review each other’s deliverables (peer reviews), resolve issues, and make project-related decisions together. This team should be staffed with senior-level team members who are experts in:

  • Project management (a lead person, not an administrator)
  • Subject matter expertise (a business representative, not an
    IT person) (1)
  • Business analysis practices (data modeling and process modeling)
  • System analysis techniques (light programming)
  • Programming (ETL, OLAP, report writers, metadata repository, etc.)

Each person on the core team can be, and probably will be, assigned multiple roles. The core team roles and their main responsibilities are listed in the following table.

Core Team Role Major Responsibilities
Application Lead Developer Design and oversee the development
of the data warehouse access and analysis applications.
Business Representative (2) Make business decisions, resolve disputes between
business units, and improve
the source data quality.
Data Administrator Perform cross-organizational data analysis, establish naming standards, create the project-specific logical data models and merge those models into an enterprise logical data model.
Database Expert (Architect and Administrator) Design, load, monitor, and tune the data warehouse databases.
Data Quality Analyst Assess source data quality and prepare data cleansing specifications
for the ETL process.
ETL Lead Developer Design and oversee the development
of the ETL process.
Metadata Administrator Build or buy, enhance, load, and maintain the metadata repository.
Project Manager Define, plan, control, and review all project activities.
Subject Matter Expert Provide business knowledge about data, processes, business rules, metadata, and requirements.
Technical Architect Establish and maintain the technical
infrastructure (hardware, network, middleware, system software).
(1) One-hundred percent availability from a business representative is a critical success factor for a data warehouse project. If management resists releasing one businessperson full time, it’s an indication that they don’t support the data warehouse as a critical strategic business initiative.
(2) The business representative role on the core team is usually assigned to the primary business user who represents the business units for which the data warehouse application is being developed. This person must be authorized to make business decisions on behalf of the business community he/she represents.

Top

3. Failing to Involve the Business People

Data warehouse projects are notoriously dynamic. That can be good or bad. Usually, it is perceived as bad because requirements change constantly; the scope is hard to control; the timeframes for delivering applications are unreasonably short; the data is usually dirtier than expected; business people are hard to pin down to provide business rules for data cleansing; project team members are often unclear about their specific roles and responsibilities, etc.

On the other hand, dynamic projects can be good because business people have an opportunity to learn about any new technology or tools early on. They also have an opportunity to “play” with their requirements and adjust them as they learn more about the capabilities and limitations of the data warehouse. IT folks can experiment with different database and application designs, and they can negotiate the project scope and delivery time to be more realistic depending on the difficulties they encounter. They can also profile the source data early and show all the data defects to the business people for resolution or deferment.

It should be obvious that dynamic projects have to be set up differently from traditional projects if the dynamics are to have a positive impact on the project team or the project schedule. The differences include adoption of a rapid development approach similar to prototyping, acceptance of the software release concept, a self-organizing SWAT team, and full-time involvement of business people in project activities. This is a paradigm shift for how applications are developed, and only a few business and IT people are comfortable with it. In contrast, organizations that successfully practice “extreme programming” techniques understand the benefits of this new approach because the prerequisites for extreme programming are the same (rapid development, software releases, SWAT teams, and participation from business people).

Why is it so important for business people to participate and what would they be doing on the projects? The most important reason business people should participate is to speed up the development work. It is a common complaint among IT people that situations come up several times a week for which they need input from the business people. But business people don’t often make themselves available. And when they do, it can be too late, especially when weeks have passed and IT has already assumed how best to resolve the situation. Sometimes IT guesses incorrectly, leading to rework, which can impact the project schedule. When combining these situations with changing business requirements that are not subjected to rigorous impact analysis, the result is a frustrated IT team and unhappy business users who don’t understand why IT takes so long to deliver the application. Business people must “live” the projects alongside IT in order for the entire project team to be more productive. Hence, business people must make time to become participating members of data warehouse project teams. They must participate in project planning, perform impact analysis on their own requirement (scope) changes, remove business-related road blocks (like data disputes between business areas), and perform other business-related project activities such as:

  • Determining project deliverables for each software release
  • Participating in tool evaluation and selection
  • Negotiating data and functional requirements
  • Participating in data and process modeling sessions
  • Providing data definitions and business rules
  • Participating in testing activities, including writing the
    test cases
  • Profiling the source data and validating the quality of data
  • Identifying the cleansing rules for dirty source data
  • Validating/testing the accuracy of the ETL programs
  • Validating/testing the accuracy of reports and queries · Resolving disputes among business units
  • Monitoring/auditing the data warehouse data on an ongoing basis
  • Participating in post-implementation review discussions

Business people must be told that they are an invaluable and indispensable part of an effective data warehouse project team because they possess certain knowledge and authority their IT counterparts don’t have. In addition, business people understand the severity and monetary impact of their organization’s business problems, and they are the only ones with the position and authority to negotiate the priorities of data warehouse projects.

Top

4. Failing to Have Application Releases

The deliverable of a data warehouse project is usually a fully operable data warehouse application with a lot of functionality and a lot of data. The source data typically resides on multiple (and often heterogeneous) operational files or databases, which adds to the complexity of integrating the source data. The amount of data redundancy, data inconsistency, and data defects is habitually underestimated. Project teams who did not plan to spend the majority of their time on data cleansing are caught off guard—especially because they are expected to implement the fully functioning data warehouse application in an extremely short timeframe. In short, many project teams bite off much more than they can chew (i.e., their scope is much too large for their deadline).

It has been said for years that “you cannot build a whole data warehouse [environment] in one big bang.” Nobody challenges that anymore. But that does not go far enough in the attempt to reduce scope and complexity to a manageable chunk of work. Therefore, the new mantra should be: “you should not build a data warehouse application in one big bang.”

Following the principles of extreme programming, extreme project management, extreme methodologies, we should also adopt extreme scoping. Extreme scoping means reducing the complexity, and thus, the scope of each project in order to deliver something to the business users in a very short period of time. The first something would equate to only a fraction of the requested fully functioning application, but more functionality and more data would be added rapidly with each subsequent application release. Many business users still balk at this approach and insist on the “minimum required” deliverable, saying that anything less is of no use to them. They don’t realize that not only are they not losing anything, but they are gaining a lot with application releases.

Building an application in small iterations (application releases) will not take any longer than building the whole enchilada at one time. In addition, the business users can see their application grow and catch mistakes or adjust their requirements if needed (under strict change control procedures). This will greatly enhance the quality of the final fully functioning application. Another benefit of this approach is the opportunity for business people to slowly become familiar with their data warehouse application and any new technology or tools before the application is even completed.

Maybe the best way to illustrate the effectiveness of extreme scoping with application releases is to recall the concepts of prototyping. In prototyping we focus on a small (partial and incomplete) scope that is not too complex, and we produce a not yet fully functioning application. The next prototype release includes another small portion of the overall scope with a little more complexity and a little more functionality. This process can be repeated until the application is fully functioning. Although application releases are based on the same concepts as prototyping, they are not equal to prototyping. The difference is that traditional prototyping is pure ad hoc development, whereas application releases demand all necessary project activities to be performed with the rigor of a methodology.

Scoping data warehouse projects will remain a struggle as long as we are married to the idea that a project must produce a fully functioning application. But if we use application releases to build data warehouse applications, controlling the scope becomes much easier because the complexity of each release is reduced, the number of activities performed is decreased, and the project team can be smaller, rendering the methods for controlling the project more effective.

Top

5. Failing to Have an Active Project Charter

Most project managers create a short document that describes their project in terms of high-level requirements, users, schedule, resources, and budget. This document is often known as a document of understanding, project agreement, scope agreement, or project charter. Frequently, this document is created by copying an old document from a previous project and changing a few details here and there. Once the project is kicked off, this document disappears into a project manual—never to be seen again.

A well-thought-out project charter is a very useful instrument and should be used actively to monitor and control project activities during the entire development cycle of a data warehouse project. Therefore, the project manager and the business user, or business sponsor, should spend some time documenting the details of their agreement in this charter. While a detailed project charter can contain as many as 20 sections, the following four sections are the most useful to serve as a baseline for change control:

  • Scope
  • Risks
  • Assumptions
  • Constraints

Traditionally, scope has been measured by the number of functions the system will perform (function point analysis)—a sure way to underestimate effort, budget, and resources. Data warehouse applications are data-intensive, not function-intensive. Therefore, scope must be measured by the number of data elements that have to be extracted from the source systems, transformed and cleansed, and loaded into the data warehouse target databases.

Every project is subject to some risks—risks are unavoidable. Such risks could severely affect the project schedule as well as the project deliverables, depending on the likelihood the risks will materialize and the impact they would have on the data warehouse project. The project manager must identify triggers for each risk and incorporate a risk mitigation plan as well as a contingency plan into the project program.

An assumption is anything taken for granted; a supposition or a presumption. It’s important to document assumptions because a wrong assumption could very quickly turn into a risk. Important assumptions should always have counterpart risks, in case the assumptions either turn out to be false or do not materialize.

All projects are subject to the four constraints of scope, effort (time), budget, and resources (capable and available people). In reality, there is a fifth constraint: quality. Although quality is a measure of how well the deliverables meet the requirements, it can also be considered a constraint that must be balanced with the other constraints because higher quality requires more effort and therefore more time to deliver.

The project charter should record the agreed-upon scope negotiated under the stated risks, assumptions, and constraints. If any of these components change, the entire project has to be reevaluated and renegotiated, and the changes should be reflected in the revised project charter so it can be used as a baseline for the next change request.

Unfortunately, business managers and IT managers frequently put their project teams under unwarranted pressure to incorporate scope changes without ever referring back to the original agreement in the project charter, and also without performing the necessary impact analysis.

Top

6. Lack of a Readiness Assessment

Many project managers attend data warehouse conferences and learn best practices for planning, designing, and implementing data warehouses. But, when they return to their organizations and try to apply the best practices, they often encounter resistance from business users and “uninitiated” IT members. Regardless of how hard the project managers try to educate, convince, or force those who oppose them to follow best practices, they are often unsuccessful.

The mistake these project managers make is not realizing their organizations aren’t ready to suddenly stop building traditional stand-alone systems and begin building an integrated data warehouse environment. Some organizations are bent on trying every shortcut and silver-bullet solution before they admit those solutions only add to their data chaos.

Therefore, at the beginning of the initiative, an organization’s readiness should be assessed (understanding, ability, and willingness). A readiness assessment would include the following questions:

1. Have the goals and objectives been defined?
2. Do the goals and objectives for the data warehouse map to those
of the organization?
3. Has the source data been inventoried and modeled?
4. What is the quality of the source data?
5. Are the skills in place to build and support the data warehouse?
6. Is an adequate budget in place?
7. Has supporting software (ETL, cleansing, DBMS, etc.) been
selected and installed?
8. Is there a strong, well-placed, and reasonable business sponsor?
9. Does the business sponsor understand that a data warehouse is
not a stand-alone system?
10. Are the primary business users computer literate?
11. Are the business users’ expectations realistic?
12. Do the business users understand they have to participate in
project activities?
13. Does the business sponsor accept the approach of building
applications iteratively (using the software release concept)?

Based on the assessment, the project manager can determine which best practices to implement. Periodically, the questionnaire should be re-distributed to gauge the organization’s understanding of data warehousing. At that time, more best practices can be incorporated.

Top

7. Inadequate Testing

Data warehouse testing is often done poorly. It’s unacceptable—and so is the excuse that “it can be fixed in the next release.” If it takes too long to test the data warehouse application properly now, it will take even longer to test it later because the next release will be larger and more complicated.

The same types of testing activities that apply to operational systems also apply to data warehouse applications, including unit testing, integration (systems) testing, performance (stress) testing, quality assurance testing, and user acceptance testing.

Unit testing refers to the testing of discrete program modules and scripts. Every developer must test his or her program modules and scripts individually.

Integration testing tests the complete process. The interactions and flow of all programs must be observed and validated. Every time actual test results do not equal the expected test results, the program producing the error must be corrected, and all programs must be rerun.

The most complicated and time consuming type of testing is regression testing. The main goal of regression testing is to ensure any modifications to existing programs did not inadvertently produce errors.

Performance testing, also known as stress testing, is performed to predict system behavior and performance. Performance testing can be limited to only the most critical program modules with the highest volumes of data and the longest runtimes.

Most large organizations have strict procedures for moving applications into production. These procedures usually include QA testing, at which time the operations staff goes through a simulated production run before allowing the application to transfer to the production environment.

Acceptance testing is done by business users. They validate the functionality of the data warehouse application.

With the possible exceptions of unit and performance testing, all other testing activities are controlled by a test plan. The bulk of the plan will be a list of test cases. Each test case specifies the input criteria and the expected output results for each run. It also describes the program logic performed and the appearance of the resulting data.

Top

8. Underestimating Data Cleansing Efforts

Most organizations admit they are not paying sufficient attention to their data quality. As disheartening as the situation is with operational source systems, it is very discouraging to see that many IT and business managers are continuing to put pressure on project teams to build data warehouses quicker using the motto: “There’s never enough time to do it right, but always enough time to do it over.”

With project schedules shrinking and project scopes expanding, project managers are under the gun to deliver more in less time. Therefore, they habitually do not allocate enough time for source data analysis, business rule discovery, data cleansing, data reconciliation, and ETL testing. As a result, two things happen: (1) many data defects propagate into the data warehouse unnoticed, and (2) some dirty data is discovered too late when data exceptions are caught during ETL testing or while loading the data warehouse databases.

To avoid project delays, the project manager should build sufficient time into the project plan to profile each data element. Common data violations to look for include:

  • Missing data values
  • Default values that actually have a meaning, e.g., using
    “888-88-8888” as a social security number to indicate
    a non-resident alien
  • Logic embedded in a data value, such as an implied roll-up
    structure, e.g., a 10-digit account number where the first four
    digits are the branch number
  • Cryptic and overused data elements, e.g., using the values
    “A, B, C, D” to mean type of customer, while the values
    “E, F, G, H” mean type of location
  • Multipurpose data elements, e.g., data elements redefined using
    the old COBOL “redefines” clause
  • Contradicting data values among dependent data elements, e.g., “Boston, CA” · Reused primary key, e.g., two different employees (one retired,
    one active) with the same employee number
  • No unique primary key, e.g., one customer with multiple
    customer numbers
  • Objects without their dependent parent object, e.g., job
    assignments for employee 3321, but employee 3321 does not exist in the employee database

Top

9. Ignoring Metadata

Metadata is nothing new—it has always been a part of automated systems. It was found in system documentation, record layouts, database catalogs, and data declaration sections in programs. In fact, it used to be called the dirty “D” word—documentation. Since technicians detested the thought of doing documentation, they often simply didn’t do it.

However, in a data warehouse environment, metadata takes on a new level of importance. Since one data warehouse objective is to eliminate inconsistencies, data must be standardized. Standardization may result in renaming the data, splitting one source data element into multiple target columns, or populating one target column from multiple source data elements. It can also mean translating codes into mnemonics, standardizing (changing) data values, and filtering out inappropriate or invalid records. At the end of the day, business people will not be able to reconcile their operational source data to the data warehouse data unless they have the ability to trace these changes.

Therefore, metadata is now the nice “N” word—navigation. It helps the business people locate, manage, understand, and use the data in the data warehouse databases. It describes what data is available in which database, what it means, where it came from, how it was processed, how clean it is, and how it is used in reports and queries. Not delivering any metadata that could help business people navigate through their data warehouse environment is a mistake.

Granted, implementing a metadata repository has its challenges. Although many data warehouse experts consider metadata to be the “glue” holding the warehouse environment together, most organizations allocate little or no money for creating and maintaining a metadata repository. Also, metadata should be “living” documentation that is constantly updated, which means—at a minimum—one metadata administrator must be assigned full time to manage the metadata repository. But regardless of the challenges, metadata must be an integral part of a data warehouse environment.

And, since the importance of metadata is still not understood by most business executives, project managers need to do a much better job of promoting metadata and communicating its benefits to the business people.

Top

10. Being a Slave to Project Management Tools

Planning and controlling a project is not a trivial task. It takes a long time to create a work breakdown structure, estimate effort and duration time for all tasks, apply task dependencies and resource dependencies, and determine the critical path. Critical path refers to a string of dependent tasks that cannot be late without affecting the project schedule, as compared to other tasks executed at the same time with slack time built into them. For example, if it takes four days to evaluate a product but only three days to create a project charter (at the same time the product is being evaluated), then the task of evaluating a product is the critical path because it has no leeway in timing. If that task is a day late, the project schedule is impacted.

Knowing where the critical path tasks are at any point during a project is crucial to staying on track. Since estimates are only best guesses based on prior experience with similar tasks, the actual time it takes to complete a task usually differs from its estimate. These differences can easily change the critical path. Tracking the differences between estimated and actual time, and continuously adjusting the critical path has enslaved more than one project manager to his/her project management tool. Other project managers find this activity too laborious and too tedious, and they stop tracking the critical path altogether.

The best approach (and compromise) for tracking the critical path is to do it at the milestone level rather than at the task level because there are fewer milestones than there are tasks. Thus, it takes less time and effort to continuously adjust the project plan. The critical path among the tasks between the milestones can be tracked more informally using a whiteboard or a flipchart instead of the project management tool. This approach can be safely and effectively used when the scope of the data warehouse project is very small and the project team is managed by a self-organizing core team (SWAT team) rather than by
a project manager who is not involved in daily project activities.

Data Warehousing Battle of the Giants: Comparing the Basics of the Kimball and Inmon Models

Data Warehousing Battle of the Giants: Comparing the Basics of the Kimball and Inmon Models

By Mary Breslin

Many organizations today need to create data warehouses-- massive data stores of time-series data used for decision support. These organizations face a range of choices, both in terms software tools and development approaches. Making good choices requires an understanding of the two main data warehousing models-- Inmon's and Kimball's.

Bill Inmon advocates a top-down development approach that adapts traditional relational database tools to the development needs of an enterprisewide data warehouse. From this enterprisewide data store, individual departmental databases are developed to serve most decision support needs.

Ralph Kimball, on the other hand, suggests a bottom-up approach that uses dimensional modeling, a data modeling approach unique to data warehousing. Rather than building a single enterprisewide database, Kimball suggests creating one database (or data mart) per major business process. Enterprisewide cohesion is accomplished by using another Kimball innovation, a data bus standard.

Understanding how these two models are similar and how they differ gives the reader a foundational knowledge of the most basic data warehouse concepts. We will also explore which organizational characteristics are best suited to each approach.

Introduction and Context
We begin our discussion by defining the data warehouse. We will introduce the Inmon-Kimball debate, and provide a brief history of the evolution of the two models. We also provide a brief explanation of the nature of the data warehouse, and conclude with a discussion of the scope of the article.

Context of the Inmon-Kimball Debate
A data warehouse contains massive amounts of highly detailed, time-series data used for decision support. Data warehouses often contain terabytes of data that can be readily queried by end users. The sources of most of the data in a data warehouse are internal transaction processing systems (also known as operational systems). Specialized software extracts data from operational databases, then summarizes, reconciles, and manipulates it. Then the data is ready to be stored in carefully designed relational database tables in the data warehouse.

An organization must choose a set of data warehouse design and maintenance tools from among scores of software tools commercially available. Not all tools are compatible with each other, and not all tools are appropriate for all development methodologies. Despite the array of choices, the industry's tools and methodologies are generally based on only two models: Inmon's and Kimball's.

Choosing between Inmon's, Kimball's, and a hybrid model is, at the most basic level, a choice of both architecture and methodology (Wells, 2003a). Understanding the basics of the architecture and methodology of both models provides a good foundational knowledge of data warehousing. Upon this foundation, readers can build situation-specific knowledge that is appropriate to their organization's needs.

History of the Data Warehouse
How did Inmon and Kimball come to be giants in this field? Each is a creator of a unique school of thought and practice within data warehousing.

In 1990, Bill Inmon earned the moniker "Father of Data Warehousing" by coining the term in his seminal work Building the Data Warehouse. The industry soon began to implement Inmon's vision, with varying degrees of success. In the third edition of this work (2002), Inmon describes a logical architecture that extracts detailed, time-stamped data from disparate operational databases. The data is then transformed and stored in a single database (the data warehouse). Data extracts from this monolithic data warehouse create smaller, departmental databases. Decision support users query and create reports from the departmental databases. To create both the data warehouse and the departmental databases, Inmon proposes a top-down variation of the spiral system development methodology.

After the publication of Inmon's book, other database experts began creating data warehouses. The experience of one scholar-practitioner, Ralph Kimball, led to the development of a model that competes with Inmon's. In 1996, Kimball first published his model in his seminal work, The Data Warehouse Toolkit. After several years of experimentation, he published a second edition in 2002. In the latest version, he recommends an architecture of multiple databases, called data marts, organized by business process. The sum of the data marts comprises the data warehouse. He recommends a development methodology that is unique to data warehousing. It involves a bottom-up approach that must adhere to an enterprisewide standard "data bus." (See "The Data Bus and Conformed Dimensions" later in this article for a discussion of the data bus).

Nature of the Data Warehouse
The data warehouse exists to facilitate decision support in the organization. Decision support systems help users with ad hoc analyses and strategic decision making. Generally, decision support systems require historical data, both summarized and at a transaction level of detail. Users need to be able to query these massive amounts of data easily. Often, they may not really know what relationships between data elements they are searching for. One data warehousing anecdote tells how a retail chain learned that new fathers often shopped for diapers and beer in the same trip. Sales of both products soared when the diapers and beer were placed next to one another. Data warehousing technology is credited with the discovery (Albert, 2000).

This example neatly illustrates the nature of data warehousing What does it take to find a statistically significant purchasing relationship between two such unlikely products? One obvious requirement is that the data you are analyzing must be sufficiently detailed to contain the date of the transaction as well as descriptions of the products purchased. This illustrates why data warehouses tend to contain very large quantities of time-stamped data.

A less obvious requirement of finding the beer-diaper connection is being able to "browse" through the warehouse without really knowing what you are looking for. In data warehousing, you typically submit many queries before you get results worth analyzing. This means data warehouses must make it reasonably easy for end users to make queries. This, in turn, implies user-friendly access tools and reasonable response times. When you consider user-friendly access of massive amounts of detailed data with reasonable response times, you can appreciate the challenges of providing an effective data warehouse solution.

Scope of This Article
This article compares and contrasts the Inmon and Kimball approaches to meeting the challenges of creating a data warehouse. While it discusses the most basic aspects of both approaches, there are many topics it does not address. For example, the article does not address physical design considerations, such as distributed data warehouse processing. It does not discuss special applications of the data warehouse, such as support of executive information needs, or considerations in creating Webbased data warehouses.

This article does not address some concepts that scholarpractitioners in the industry consider fairly basic, such as metadata, snowflaking, or data mining. These topics have been excluded from the article in order to give more thorough attention to the most basic aspects of each model.

The Inmon Model
Inmon's architected environment consists of all information systems and their databases throughout a given organization. He calls this behemoth the Corporate Information Factory, or CIF (Inmon and Imhoff, 2002). Even a cursory discussion of the CIF is beyond the scope of this article, and therefore the following discussion is limited to those components of Inmon's architected environment most essential to the data warehouse.

Inmon divides the overall database environment of the organization into four levels:<

  • Operational
  • Atomic data warehouse
  • Departmental
  • Individual

The last three levels comprise the data warehouse. The first level contains data from legacy and other transaction processing systems. This level supports the day-to-day operation of the organization; in other words, the first level supports all transaction processing. From the operational systems, data is extensively manipulated and then moved to the atomic data warehouse (Inmon, 2002). (See "Extract, Transform, and Load" later in this article for an overview of the data manipulation performed between the operational and atomic data warehouse levels.)

Inmon uses an example to illustrate the difference between operational data and data stored in the atomic data warehouse. In the example, the entity is a customer, and the attribute of most interest is the customer's credit rating. The operational system's database contains the customer's current credit rating and related information of interest (such as loan balances, address, etc.) in a single record. The atomic data warehouse, by contrast, contains the credit history for this customer, summarized by year, with one record per year (Inmon, 2002).

Inmon does not thoroughly pursue the customer credit example in its transformation from the atomic to the department level. His example is extended here based on a synthesis of various discussions throughout his book. The data contained in the departmental level is lightly to heavily summarized, depending on a given department's information requirements. The credit department might lightly summarize the data by dropping customer address information as irrelevant, but keeping a "flag" to indicate a change of address. In contrast, the marketing department might more heavily summarize the data by dropping all customer-identifying data except zip code. Each department's database can hold data summarized according to its needs. At the same time, Inmon's architecture ensures that all data is consistent because all departmental data comes from the atomic data warehouse.

Individual users create the fourth and final level of the architected environment when they create heuristic, ad hoc data sets as part of decision support analyses. This fourth level tends to be temporary and housed on the individual user's personal computer (Inmon, 2002). For example, a user working in the credit department might ask to see records for all accounts that have been delinquent at least once in the last three years.

If the department's database has not retained the data at the level of detail needed, it is possible to query the atomic data warehouse. Queries against the atomic data warehouse generally go through the IT department. Inmon argues that the atomic data warehouse is worth the initial effort to construct because it allows the creation of any number of departmental databases without risking creating incompatible data between them (Inmon, 2002). This is done using a three-level data model.

The Three-Level Data Model
Inmon proposes three levels of data modeling. The first is ERD (entity relationship diagrams). Just as in the development of operational databases, ERDs are used to explore and refine entities, their attributes, and the relationships between entities. The development team creates one set of ERDs for each department that is expected to use the data warehouse. The corporate ERD is the sum of all department ERDs (Inmon, 2002).

The second (mid-level) data model, establishes the DISdata item set) for each department. Again, the sum of the departmental DISs comprise the corporate DIS. The midlevel data model includes four constructs:

  • A primary data grouping
  • A secondary data grouping
  • A connector, signifying the relationships of data
  • between major subject area
  • "Type of " data

A critical aspect of the mid-level data model is that the primary grouping exists only once for each major subject area. This means that an ERD created in the first-level data model is the basis for a DIS in the second-level data model. Figure 1, taken from Inmon's book, illustrates the ERD-DIS relationship for a given user view. It also shows

Page 9: Figure 1

Breslin_Fig1

Figure 1. Relationship between Levels One and Two of Inmon's Data model (Inmom,2002)

Figure 1. Relationship between Levels One and Two of Inmon's Data Model (Inmon, 2002)

how the various user views are combined into a corporate ERD and DIS. Within a DIS, each rectangle represents a logical table in either a departmental or the corporate DIS. The connections between these tables are the same as those that connect entities in the ERDs. Rectangles to the right in a given DIS represent the secondary grouping of data (Inmon, 2002). ("Type of " data does not appear in Figure 1. It would be represented by another column of rectangles branching to the right from the secondary grouping rectangles.)

Inmon's banking example helps make this clear. In banking, the entity "customer" generates a primary grouping of data such as account (primary grouping). "Account" may have several manifestations, such as loan, savings, or trust (secondary grouping). Connectors show that one customer may have several different accounts. Finally, each account may have data generated by similar activities, such as ATM deposits, ATM withdrawals, teller deposits, or teller withdrawals; these are examples of "type of" data (Inmon, 2002).

Creating the departmental and corporate ERDs and DISs shown in Figure 1 requires very high levels of data modeling expertise. It also requires breadth and depth of knowledge of the organization's business processes. Inmon suggests using enterprisewide data models if possible to save development time; they already exist for many industries (Inmon, 2000).

The final level of Inmon's data model is the physical. In his words: "The physical model is created from the mid-level data model merely by extending the mid-level data model to include keys and physical characteristics of the model" (Inmon, 2002). Inmon explains various techniques for optimizing the performance of the data warehouse at both the atomic and departmental levels. Although the techniques may not be familiar (creating arrays of data, preformatting, rejoining tables), the purpose--optimizing I/O performance --is the same as for operational database systems. Most of these techniques involve denormalization of tables.

There are several reasons to denormalize tables at the physical level. For example, records in the atomic data warehouse are rarely updated because the data is historical. This makes it possible to physically place data in ways that would not work for operational data because it is frequently updated (Inmon, 2002).

Breslin_Fig2

Figure 2. Inmon's Meth2 (Inmon, 2002)

Figure 2. Inmon's Meth2 (Inmon, 2002)

Once the three-level data model is complete, the data warehouse development has begun.

A Spiral Development Approach (Meth2)
A completed three-level data model is the only prerequisite to using Inmon's special adaptation of the spiral development methodology, which he calls Meth2. (Meth1 is for developing operational systems; Meth3 is for tuning an existing data warehouse). Inmon calls the modeling step DSS1 (for Decision Support 1). He outlines nine more steps, shown in Figure 2.

Using the completed three-level data model as the first input to the process, the next step is to conduct a size/granularity analysis (DSS2 in Figure 2). Granularity is a measure of the detail of the data. For example, transactional data has the lowest level of granularity because it has the most detail. Inmon calls the size/granularity analysis a breadbox analysis, presumably an allusion to the saying: Is it bigger than a breadbox? If the volume of data is massive, then the team needs to consider multiple levels of granularity for the data (Inmon, 2002). This might involve storing some data at a transaction level and other data in summarized forms (such as a daily total).

Once granularity issues are resolved, the first subject area is selected (DSS5). This will become the first departmental database. The team analyzes the source systems of the first subject (DSS7), writes specs (DSS8), code programs (DSS9) and populates the database (DSS10). The atomic data warehouse database design begins concurrently (DSS6). When there is enough information to do so, the team conducts a technical assessment (DSS3). This assessment ensures that the data in the warehouse will be accessible and well managed (Inmon, 2002).

As the team develops each successive departmental database, they impact the atomic data warehouse. Figure 2 shows this iterative aspect of the model by showing lines connecting various steps. Lines connect both the source systems analysis step (DSS7) and the specs step (DSS8) with the atomic data warehouse design (DSS6). This means that the atomic data warehouse design will be revisited each time a new departmental database is developed. The line connecting the population of a departmental database (DSS10) with the preparation of the technical environment (DSS4) also shows the iterative nature of Meth2. By preparation of the technical environment (DSS4), Inmon means making sure that the data warehouses's network, storage hardware, OS, and all interface and access software are ready to receive data (Inmon, 2002).

Being data driven is an essential aspect of Inmon's spiral development methodology. "One of the salient aspects of a data-driven methodology is that it builds on previous efforts--utilizing both code and processes that have already been developed." (Inmon, 2002) His three-level data model helps support a spiral methodology, in that all user views are consistent with the corporate model. The team derives subsequent departmental databases using the code and processes they created when they developed earlier departmental databases. This means the time it takes to produce the second departmental database should be considerably less than the time it took to go through DSS1 through DSS10 for the first departmental database.

Inmon's Philosophy: Evolutionary, Not Revolutionary
Inmon sees the data warehouse as an integral part of the Corporate Information Factory (CIF). This means, among other things, that the data warehouse and operational databases are all part of a larger whole. This perception helps explain why Inmon's data warehouse must adhere to most of the same standards as operational systems. From this premise, it is easy to see how Inmon's evolutionary approach grows out of operational relational database technology and development methods. Each aspect discussed in this article--the architected environment, the three-level data model, and the spiral approach--is consistent with established practices in operational DBMS design and deployment. It is built upon principles and practices that have been in use in the operational database world at least a decade longer than even the earliest data warehouse efforts. Viewed in this context, Inmon's model is much more evolutionary than revolutionary.

A by-product of this evolutionary approach is that Inmon's primary audience is IT professionals, as it takes an IT professional's level of understanding to actively use his tools or development methodology. Inmon's tools and methodology ensure that end users will have mostly passive roles in the development of the data warehouse, reviewing the IT professionals' output.

Kimball's Model
Kimball's model differs in several important respects from a traditional relational database approach. One significant difference is that data warehouses built with the Kimball model use a data modeling method unique to the data warehouse. This is discussed in the next section: "Dimensional Data Modeling."

Another significant difference is that the overall architecture features multiple databases that are expected to be highly interoperable. The data bus is the main design feature that makes this possible (further discussion of the data bus is included in the section "The Data Bus and Conformed Dimensions").

Dimensional Data Modeling
Dimensional modeling may seem strange to IT professionals familiar with traditional relational modeling. Dimensional modeling begins with tables rather than entity-attribute data models such as ERDs. The tables are either fact tables or dimension tables. Fact tables contain metrics, while dimension tables contain attributes of the metrics in the fact tables. Dimension tables routinely contain repeating groups; this violates normalization rules. However, dimensional modeling violates normalization rules in order to achieve a high level of performance in the data warehouse, while keeping it end-user accessible.

An example best illustrates how dimensional modeling meets the dual objectives of ease of use and performance. The first example in Kimball's book is a retailing data warehouse (Kimball, 2002). One fact table from this example is the Daily Product Sales table. This table contains five columns: the product key, store key, date key, quantity sold, and dollar sales amount. The dimension tables in this example include the Date Dimension, Store Dimension, and Product Dimension tables.

Fact tables contain many rows and relatively few columns; this is essential to ease of use and query performance. The number of rows in Daily Product Sales table can be estimated using formulae. While explaining the formulae is beyond the scope of this article, they basically involve assumptions regarding the number of different products sold in each store each day. Kimball estimates that the Daily Product Sales fact table is likely to contain millions of rows, and be about 10 GB or more (Kimball, 2002). Although the table has only five columns, adding just one additional column would increase the file size by 2 GB! This example makes it easy to grasp the importance of keeping the number of columns in fact tables as small as possible.

In contrast, the dimension tables are likely to have only hundreds or thousands of rows (rather than millions), and be only megabytes in total size (Kimball, 2002). Unlike fact tables, dimension tables may have a hundred columns or more. This is because they contain all the attributes of the data in the fact table in highly denormalized forms. Following along with the retailing example, the primary key of the Product Dimension table is the product key. The rest of the dimension table's columns are attributes of product. These include product description, brand description, package type description, department description, package size, weight, shelf life, shelf width, shelf height, and many more. The Date Dimension and Store Dimension tables also have large numbers of columns, but relatively few rows.

It is easy for end users to query the database because virtually all the ways of summarizing the data is already in the dimension tables. This goes a long way toward meeting the ease of use goal. In terms of meeting the performance goal, Kimball says:

A database engine can make very strong assumptions about first constraining the heavily indexed dimension tables, and then attacking the fact table all at once with the Cartesian product of the dimension table keys satisfying the user's constraints.

Dimensional modeling is a data modeling approach that capitalizes on the unique requirements of the data warehouse. Keeping fact tables to a small number of rows and allowing dimension tables to be highly denormalized are both essential. The resulting data mart is highly accessible to the end user and provides reasonable query response times.

The Data Bus and Conformed Dimensions
In Kimball's architecture, data is copied from operational source systems to a staging area. In the staging area, the data is scrubbed, that is, made consistent and suitable for end-user queries. (The scrubbing process is discussed in "Extract, Transform, and Load" later in this article.) From the staging area, data is loaded into data marts. The data marts are the source of data for user queries.

Each data mart is based on a single business process. Some examples of business processes are point of sale (retail sales), inventory (from receiving dock to point of sale), procurement, and order management. More than one department may be interested in a given business process; therefore, no one department is perceived as the sole owner of a given data mart (Kimball, 2002).

The data warehouse bus is the part of Kimball's architecture that allows the sum of the data marts to truly be an integrated whole--a data warehouse. The bus architecture is another way of saying that all data marts must use standardized conformed dimensions. The basic requirements of conformed dimensions are that keys, column names, attribute definitions, and attribute values are consistent across business processes. Put another way, two dimensions are conformed "when they are exactly the same, or one is a perfect subset of the other. Most important, the row headers produced in answer sets from two different conformed dimensions must be able to be matched perfectly" (Kimball, 2002). This may seem an impossible set of requirements, but a knowledge of dimensional data modeling and adherence to the four-step dimensional design process help keep the requirements manageable.

An example of using conformed dimensions across business processes will help make clear how these requirements can be met without superhuman efforts. One data item that spans multiple business processes is the product dimension. The primary key for the product is an artificial key assigned during the ETL process. The first data mart development defines the product key, and all subsequently developed data marts must use the key. This ensures that queries can be made across data marts without conflicting results. For example, product 18874002 is the same to a user interested in patterns of the product's movement through the warehouse as it is to the user interested in the relative success of a promotion for the product. In other words, conformed dimensions help ensure that product data refers to the same product in the retail sales data mart as in the inventory data mart.

The Four-Step Dimensional Design Process
Kimball recommends a development methodology that is unique to data warehousing. It involves a bottom-up approach, which in the case of data warehouses means to build one data mart at a time. The four steps of the dimensional design process are:

  • Select the business process
  • Declare the grain
  • Choose the dimensions
  • Identify the facts

Kimball defines business processes quite broadly. Examples include point of sale (POS) retail sales, inventory, ordering, and shipments, all of which cross department lines in most organizations. For example, the ordering process is of interest to sales, marketing, finance, and inventory control personnel. To choose the first business process for the data warehouse project, select the process that has "the most impact--it should answer the most pressing business questions and be readily accessible for data extraction" (Kimball, 2002).

Declaring the grain is the process of deciding what level of detail the data warehouse will contain. The lowest level of granularity is called atomic, meaning that it cannot be further subdivided. Choosing a grain at the atomic level is highly desirable, since users can always aggregate the data as desired. Choosing a more summarized level means queries below that level cannot be fulfilled by the data warehouse. In Kimball's words:

Preferably you should develop dimensional models for the most atomic information captured by a business process... A data warehouse almost always demands data expressed at the lowest possible grain of each dimension. (Kimball, 2002)

In the retail example, the grain declared is an individual line item on a POS transaction. The possibilities for analyzing data with this level of granularity are virtually unlimited. It allows for the discovery of non-obvious relationships in retail sales, such as the beer-and-diaper relationship (as discussed previously). Atomic data granularity provides decision support for virtually every aspect of retail sales. Examples include evaluation of promotions, expansion or contraction of product lines, and cannibalization of the sales of one product due to the promotion of another.

With the grain declared, the next step is to choose dimensions. In the retail example, the dimensions include date, store, product, and promotion. Each of the dimension tables has a large number of attributes. The date dimension table includes many attributes that would make a relational data modeler shudder, including Day Number in Epoch, Week Number in Epoch, Month Number in Epoch, Day Number in Calendar Month, and so on. Kimball justifies this highly denormalized table by pointing out that ten years' worth of such data generates only approximately 3,650 rows and a file measured in kilobytes (Kimball, 2002).

The fourth and final step is to determine which facts to include in the fact tables. In the retail example, Kimball chooses to include some computed values as well as truly atomic values, making queries easy for the end user and providing acceptable data warehouse performance. The values in retail sales fact table are: Date, Product, Store, Promotion, POS transaction number, Sales quantity, Sales dollar amount, Cost dollar amount, and Gross profit dollar amount (Kimball, 2002). Including the gross profit dollar amount is an example of improving performance while violating traditional relational database rules. Users frequently query the data warehouse for gross profit. Therefore including this computed value in the fact table improves query performance.

The result of the four-step process is shown with minimal detail in Figure 3. Sample Kimball Fact and Dimension Tables. The fact table is shown with all the facts, but the dimension tables are shown only with their primary keys. Each of the dimension tables shown in the figure has dozens of dimensions. The wealth of dimensions allow end users to compose virtually unlimited queries.

Breslin_Fig3

Figure 3. Sample Kimball fact and dimension Tables (Kimball, 2002)

Figure 3. Sample Kimball Fact and Dimension Tables (Kimball, 2002)

Basics of Kimball's Data Warehouse Philosophy
Kimball's philosophy shines through every chapter of his book. The business requirements drive both the process and the nature of the data warehouse. In the first chapter, he defines the goals of a data warehouse (Kimball, 2002):

  • Make information easily accessible
  • Present the organization's information consistently
  • Be adaptive and resilient to change
  • Protect information
  • Serve as the foundation for improved decision making

He ends his list with a warning, masquerading as a goal: "The business community must accept the data warehouse if it is to be deemed successful" (Kimball, 2002). To Kimball, acceptance is measured by how much the data warehouse is used, which is directly related to its userfriendliness. This proactive stance in "designing in" userfriendliness is essential to Kimball's philosophy. Kimball's four-step development methodology is easy enough for the end user to actively participate. The example retail sales dimensional model (Figure 3) shows the user-friendly nature of the final form of the data mart. Both the attribute names and the relationships between the fact table and dimension tables are very familiar to users who need to query retail sales data.

Similarities and Differences: Inmon versus Kimball
Similarities
The most prominent similarities between Inmon's and Kmball's models are the use of time-stamped data, and the extract, transform, and load (ETL) process. Although the execution of these two elements differs between the two models, the data attributes and query results are very similar.

Similar Time-Stamped Data
Operational systems' databases generally carry detailed data for "anywhere from one week to two years" (Inmon, 2002). In contrast, the data warehouse stores data for five or even ten years. The time attribute is arguably the most important defining characteristic of data warehouse data. This is so because it is the time attribute that allows decision support analyses to compare this year's sales of Product X with last year's, or to determine whether more of Product X is sold on weekend than on holidays. So, how the time attribute is captured is critical, because it controls which analyses are possible and which aren't.

Breslin_fig4

Figure 4. Kimball's Date Dimension (Kimball,2002)

Kimball calls the time attribute the "date dimension;" Inmon calls it the "time element." Figure 4. Kimball's ate Dimension shows a range of possibilities for time attributes for a retail sales data mart. In Kimball's example, the date key is an artificial key that defines a conformed dimension. In an Inmon example, the same attributes would either be contained in several different, more normalized tables or simply be calculated at the time of the user query. The choice of storing versus calculating in Inmon's model would be guided by performance considerations. Whether using an Inmon-or Kimball-based approach, however, end users are able to query the data by day, month, quarter, year, holiday, weekday, weekend, etc.

Similar Extract, Transform, and Load (ETL)
The data warehouse environment begins with (ETL). Data is extracted from operational databases, transformed to meet the data warehouse's standards, and loaded. The data is loaded into either the monolithic data warehouse Figure 4. Kimball's Date Dimension (Kimball, 2002)(· la Inmon) or into a series of smaller databases called data marts (· la Kimball).

The ETL process has spawned a market niche, and there are a number of ETL tools available today. Some are addons to DBMSs, such as the ETL tools Oracle provides. Others are DBMS-independent. Even a cursory comparison of ETL tools is beyond this scope of this article. Instead, we will discuss the general process and scope of the ETL activity and its importance to the data warehouse.

Extract: The first part of ETL--extract--involves moving data from operational systems to a persistent staging area. Issues of timing of the extraction can be important in the extract process, in that different systems may make a given data item available at different times. It is also important to know how the operational systems handle exceptions and updates, since once data enters the data warehouse, it is rarely subject to updating.

Transform: Once data is extracted into the staging area, it is ready for the transformation portion of ETL. A simple transform example is renaming a data item, such as when two different operational systems call a single data item by a different name. A more complex example is adjusting the value of a data item, such as when two different operational systems measure a product or process differently and therefore assign a different value to the same data item. In general, the purpose of the transform processes is to ensure data integrity within the data warehouse. There are several methods used to transform data, including field mapping and algorithmic comparisons.

Load: The final step in ETL is loading the data into either the atomic data warehouse (in Inmon's model) or into data marts (in Kimball's model). The load process in either case involves placing the data physically. The main concern in this process is appending the newly extracted and transformed data onto the data already in the data warehouse. Various ETL routines run at this point help ensure data integrity and guard against data redundancy.

ETL is essential to the viability of the data warehouse in that it attempts to ensure data integrity within the data
warehouse. Obviously, if two user queries that are essentially the same return two different results, the credibility
of the data warehouse is damaged in the eyes of the users. Because operational systems are seldom (if ever) designed
to produce results compatible with one another, making the output of these systems consistent is generally aHerculean effort. Not surprisingly, ETL is frequently considered the most labor-intensive data warehouse activity,
surpassing even decision support analysis activities!

Differences
The differences between Inmon's and Kimball's approaches are many and deep. It is interesting to note that the two features that create similarities between the two models--timestamped data and ETL--are required to make decision support systems viable. In other words, the two models are similar only in the areas in which, arguably, they have to be similar. In all other areas, their differences are profound. The most essential differences between the two models are in the areas of development methodologies, data modeling, and data warehouse architectures. Table 1 summarizes these differences. Following the table, each major area of difference is discussed in detail.

Inmon Kimball
Methodology and architecture
Overall approach Top-down Bottom-up
Architectural structure Enterprisewide (atomic) data warehouse "feeds" departmental databases Data marts model a single business process; enterprise consistency achieved through data bus and conformed dimensions
Complexity of the method Quite complex Fairly simple
Comparison with established development methodologies Derived from the spiral methodology Four-step process; a departure from RDBMS methods
Discussion of physical design Fairly thorough Fairly light
Data modeling
Data orientation Subject- or data-driven Process oriented
Tools Traditional (ERDs, DISs) Dimensional modeling; a departure from relational modeling
End-user accessibility Low High
Philosophy
Primary audience IT professionals End users
Place in the organization Integral part of the Corporate Information Factory (CIF) Transformer and retainer of operational data
Objective Deliver a sound technical solution based on proven database methods and technologies Deliver a solution that makes it easy for end users to directly query the data and still get reasonable response times

Table 1.Comparison of Essential Features of Inmon's and Kimbal's models


Differences in Development Methodologies and Architectures
In order to have an atomic data warehouse, as in Inmon's model, some degree of top-down development must be present. The atomic data warehouse must serve the entire enterprise, and all departmental databases obtain their data through the atomic data warehouse. Top-down development efforts have a certain unavoidable degree of complexity, and Inmon's methodology is no exception, although his clear presentation helps it seem less complex.

Overall, Inmon's methodology and architectural orientation is a technical one. His primary interest is ensuring that the technical solution works. Oversimplified, the objective of this technical solution is to optimize I/Os. Inmon's audience is clearly comprised of IT professionals. Few business readers have the background to understand Inmon's development approach because of its emphasis on technical aspects and a lack of understanding of the spiral development approach on which it is based. His emphasis on the technical aspects of the development implies that the IT department members of the data warehousing team will feel the greatest degree of ownership of the data warehouse as they, not the end users, will understand the development methodology. In contrast, Kimball's four-step development methodology is very accessible to the end user. A user can even understand moderately technical concepts of the data bus and conformed dimensions without extensive study, in contrast to learning to interpret ERDs. By definition, a bottom-up approach involves fewer data elements than a top-down development. Even if users are unfamiliar with the concept of a business process, the smaller scope of the data mart is more accessible to end users. Inmon's Meth2 helps make the enterprisewide scope less daunting, but the data mart scope is still considerably easier for users to grasp.

Differences in Data Modeling
Two obvious ways in which Inmon's and Kimball's data modeling differ are (1) orientation toward the data and (2) modeling rules and techniques.

In his own terms, Inmon takes a subject-oriented or datadriven approach to data modeling. This means that the nature of the data directs the data modeling process. This fits well with Inmon's traditional data modeling tools, such as ERDs and DISs. It also means that the IT members of the data warehouse team will have primary responsibility for data modeling, because the modeling tools and the thought processes they involve require a technical background to use effectively. End users can attend review presentations, but few could review ERDs or DISs unassisted unless they received fairly extensive special training.

In contrast, Kimball takes a process orientation, meaning that data modeling becomes an attempt to define the interaction of data across a business process (such as retail sales or inventory). By their natures, such business processes usually cross departmental lines. This fits well with the new data modeling approach of dimensional data modeling, in which the process determines which metrics (facts) and attributes (dimensions) are important enough to claim a place in the data warehouse. Dimensional modeling tools allow end users to take an active role in the data modeling process.

Philosophical Differences
By now it is clear that Inmon views IT as the primary developer and provider of the data warehouse. Inmon believes that the performance of the completed data ware-house will be maximized by ensuring a technically oriented development process. Meanwhile, Kimball sees end users and IT professionals sharing duties roughly equally. By ensuring the active participation of end users throughout the development process, the likelihood of user acceptance of the completed data warehouse is greatly enhanced.

Of course, both of these experts are well aware that a data warehouse that doesn't involve the users at all points in its lifecycle is just as likely to fail as one that performs poorly for the users. What the two do not agree upon is which of these considerations should be considered the most important.

Choosing the Best Approach
Following are guidelines for determining whether Inmon's or Kimball's approach is best suited to organization's data warehousing needs. Dave Wells addressed this problem in a TDWI FlashPoint article in early 2003 (Wells, 2003). He proposes 12 evaluation criteria that focus on the needs, environment, culture, and technical expertise of an organization planning to create a data warehouse. Of the 12, eight can be relatively easily categorized as favoring either Kimball's or Inmon's approach. Whether the remaining four elements (cost to operate, consistency of metadata and business rules, sustainability, and technology requirements) favor Kimball's or Inmon's approach would depend on the implementation of a given data warehouse project. To at least partially summarize the data in Table 1, an organization is more likely to succeed using Inmon's approach if it has a large team of data warehouse specialists, plans a large project with enterprisewide access needs, stores data that is not primarily business metrics, and can wait to see results over a longer timeframe--from four to nine months (Inmon, 2000). These characteristics and data requirements fit well with Inmon's recommendation to first build a considerable infrastructure on a solid enterprisewide data model.

On the other hand, an organization with different characteristics may be better off with a Kimball-based approach. According to one expert, "A typical requirement is to develop an operational data mart for a specific business area in 90 days, and develop subsequent data marts in 60 to 90 days each" (Mimno, 2002). Kimball's approach is generally recognized as faster than Inmon's, at least for the delivery of the first data mart (versus the first departmental database using Inmon's approach). Kimball's approach is also indicated if the organization is better able to field smaller teams of generalists for data warehouse project development, and expects to store mostly business metrics. An organization with these characteristics and requirements is more likely to succeed with a data mart architecture developed using the dimensional modeling approach.

Characteristic Favors Kimball Favors Inmon
Nature of the organization's decision support requirements Tactical Strategic
Data integration requirements Individual business areas Enterprisewide integration
Structure of data Business metrics, performance measures, and scorecards Non-metric data and for data that will be applied to meet multiple and varied information needs
Scalability Need to adapt to highly volatile needs within a limited scope Growing scope and changing requirements are critical
Persistency of data Source systems are relatively stable
High rate of change from source systems
Staffing and skills requirements Small teams of generalists Larger team(s) of specialists
Time to delivery Need for the first data warehouse
application is urgent
Organization's requirements allow for longer start-up time
Cost to deploy Lower start-up costs, with each
subsequent project costing about the same
Higher start-up costs, with lower subsequent project development costs

Table 2. Specific Characteristics Favoring Inmon's or Kimball's Model

Table 2. Specific Characteristics Favoring Inmon's or Kimball's Model

It is important to realize that choosing an approach to data warehousing is not as simple as the two preceding paragraphs imply. However, as long as the reader understands that these guidelines represent a gross oversimplification of the process, they may be useful as a starting point for discussing the data warehousing needs and characteristics unique to a given organization.

Finally, research shows that having the right set of soft skills is just as important, if not more important, than technical skills and knowledge.

Interestingly, the keys to success are not technical in nature. Projects don't succeed because they use an innovative design or radical new technology. They succeed because of the "soft" stuff--leadership, communication, planning, and interpersonal relationships (Eckerson, 2003).

When building a data warehouse, whether using Inmon's or Kimball's approach, it is critical that the data warehouse team employ soft skills liberally and effectively. This involves ensuring that the organization has a well-articulated vision of the data warehouse's role and usage, and allocates sufficient resources to create and maintain the data warehouse (Eckerson, 2003). These are not typically responsibilities that an IT project development team must shoulder, yet they are critical to the success of a data warehouse project.

Summary
Data warehouses require storage and access of massive amounts of time-stamped data for decision support. Since the building of data warehouses was first attempted in the early 1990s, two models have emerged as dominant: Inmon's and Kimball's.

Inmon's approach stresses top-down development using proven database development methodologies and tools, such as ERDs, DISs, and a modification of the spiral development approach. Inmon's tools and methods are adaptations of traditional tools and methods for operational database development. Inmon sees the data warehouse as a part of a much larger information environment, which he calls the Corporate Information Factory (CIF). To ensure that the data warehouse fits well in this larger environment, he advocates the construction of both an atomic data warehouse and departmental databases.

Inmon's approach is evolutionary rather than revolutionary. His tools and methods can be actively used only by IT professionals. End users have a more passive role in the development process, mostly reviewing the results generated by IT professionals. Inmon's attention to the technical aspects of the data warehouse development process increases the chances of a sound technical solution. For end users, this is likely to mean very good query response times.

Kimball's approach is a departure from traditional database development. His bottom-up approach recommends building one data mart per business process. The sum of all data marts is the organization's data warehouse. The data bus is the aspect of Kimball's architecture that ensures interoperability between various data marts. The data bus requires that all data marts are modeled within consistent data standards called conformed dimensions.

Kimball recommends a four-step development process for each data mart, in which dimensional data modeling plays a central role. Dimensional data modeling involves fact tables that contain metric data, and dimension tables that modify that data. Dimensional modeling tools can be actively used by end users with some special training. This helps ensure that end users are actively involved in the development of the data warehouse. Ease of use and reasonable query response times in the final product (the data mart) are the dual goals of dimensional data modeling.

Inmon's and Kimball's models are similar in some ways, such as the treatment of time-stamped data. Although there are some differences in the ways in which each model handles this challenge, the two models are more similar than not in modeling the time attribute. Likewise, both models address the challenges of massaging operational data similarly. This process, called ETL, is one of the most labor-intensive aspects of the data warehouse.

Noted data warehouse expert Dave Wells suggests characteristics of organizations that favor the adoption of either Inmon's or Kimball's models. Some of these characteristics include the organization's decision support requirements, staffing and skills requirements, time to delivery and cost to deploy. His advice can help organizations begin the process of choosing an approach to developing their data warehouse.

Other research suggests that success in developing a data warehouse relies as much on the soft skills of the data warehouse team as on its technical expertise or business acumen. It is not surprising that a large IT-related project needs "...leadership, communication, planning, and interpersonal relationships" in order to succeed (Eckerson, 2003). What makes the data warehouse more of a challenge than a comparable operational development project is that the data warehouse technology is relatively new. A development team with a sound understanding of Inmon's and Kimball's models is in a much better position to articulate a vision of the data warehouse that matches the organization's characteristics and decision support goals.