Monday, August 11, 2014

Microsoft Parallel Data Warehouse : Building a data warehouse in SQL Server: Tips to get started

When building a data warehouse using SQL Server technologies, there are some challenges you might face. I'll offer you some suggestions for overcoming them. Naturally, each environment is likely to add its own twist to the challenges of building a data warehouse, but most of these issues are generic enough to apply to many companies embarking on architecting business intelligence (BI) applications.

1.  Management support. This is the first and foremost challenge you must overcome. It is annoying to technical people (including yours truly) because it has to do with politics and not software. Let's keep in mind who signs your paycheck, however. If management isn't convinced they need a data warehouse all your technical skills will be in vain.

Unfortunately upper management often thinks of a data warehouse as yet another system they have to invest in without any immediate return. Your job is to convince the powers that be they're not paying for bells and whistles – data warehouse will help them make better, more informed decisions by correlating data scattered across the enterprise. It is not uncommon to totally change the way an organization operates after a data warehouse application is implemented. Sometimes it helps to build a proof-of-a-concept (POC) project to demonstrate the power of BI. POC data warehouse will only contain a small portion of all data

and will only be exposed to a few users for testing. Even if you don't have an approval for a POC you should demonstrate parts of the warehouse as they become available to your users; doing so keeps them interested and assured that the system will help them be more effective in their work.

2.  Data availability. Data warehouse usually combines data from multiple data sources, which aren't necessarily obvious or easily accessible. Depending on the company size, you might have to travel to multiple offices and speak to numerous employees to figure out where you can get the data that is necessary to build analytical views desired by your users. Again, depending on the political climate of your enterprise, you might find it difficult to get your hands on each necessary data element. Data you work with might contain confidential and highly sensitive details. Furthermore, some data might only be accessible through reports provided by an outside entity who will not grant you access to their data source.

How do you overcome this challenge? First, be sure to convince everyone that you're not taking their job. Data warehouse is supposed to supplement and not replace any existing systems. If you have management on your side you shouldn't have too many issues with internal employees. But if you cannot get access to the data source – for unforeseen reasons – then you may have to get creative. If data is only available on report screens or on paper you might have to find a way to capture this data perhaps through scraping report screens or scanning documents.

3.  Complexity of data sources. Sometimes you get lucky and all data elements that you need reside in the database management system (DBMS) of your choice. More often, data is dispersed across multiple DBMS, spreadsheets, email systems, electronic documents and even on paper. Yes, we're in 21st century, but take my word - there are still companies out there who keep certain information only on paper. It's your job to figure out how to get data into your warehouse from all disparate sources and give it a common shape.

4.  Data quality. Many transaction processing applications are thrown together using rapid development techniques, often by people who have very limited knowledge or experience with the tools they're using. This is not a derogatory statement; one must start somewhere and often junior programmers are most affordable resources that a company can use. The trouble is that if the application does not validate data you're likely to encounter string values that are abbreviated, misspelled or completely omitted. For transaction level reports, this might not be a huge deal, but when you're trying to group data and empower your users with the ability to make decisions, data quality is essential.

For example, take a look at the following customer names:

a. ACME Boot Outlet
b. ACME bt outlt
c. A.C.M.E
d. Boots (ACME)
e. A c m e boot store (outlet)

Human eyes can easily spot that each of the above refers to the same business entity. But to the computer program, each of these values represents a separate customer. Unfortunately there is no easy way to correct all bad data using Integration Services (or Data Transformation Services) packages. SQL Server 2005 Integration Services offers the Fuzzy Lookup Transformation which may greatly simplify your data cleansing efforts. Even so, you might also have to enlist the help of clerical workers to correct the data.

5.  Multitude and technical savvy of your users. Helping a handful of users get to the data they need in order to make decisions is not a small fish to fry. Historically, data warehouse was reserved for upper-level management's use. However, data warehousing has slowly made its way to the masses. It's not uncommon to see data warehousing and BI applications used by users that don't necessarily make strategic decisions.

Once your users see the power of data warehousing, they'll want to use it for everything from data entry to retrieving transactional level reports. The larger your user community, the tougher it becomes to keep everyone happy and to educate them on appropriate uses of your application. BI applications perform great when used appropriately, but they are not suitable for every business requirement. Often, you must be the bearer of bad news that data warehouse is not the tool of choice for the functionality your users desire.

6.  Slowly changing dimensions. Your application must reflect data changes that occur in data sources populating the warehouse. Data within dimension tables is particularly prone to change. But how do you maintain the history of such changes?

    a:    The first and easiest method is to overwrite the existing record without tracking changes. Fortunately              this method is acceptable for many dimensions. For example, if a department name changes from                  "finance" to "finance and accounting" you probably won't be required to maintain the history of such              change. However, in a customer or student dimension it is often necessary to keep track of changing              names, marital status, education level, and other attributes - your application must be able to retrieve              current as well as historical values.

  b:      The second method for managing slowly changing dimensions is to create a new record when the                   value changes and tag the older record as obsolete.

  c:        The third and final method is to maintain historical value(s) of the changing domain in separate                        columns of the same row within the dimension table.

So far we've covered the handling of changing values in the data warehouse dimension. But what about Analysis Services dimensions? To reflect changes in attribute values you must re-process dimensions. But if you process dimensions fully you'd have to also reprocess cubes, which might be impractical if your cube contains large volumes of data. Fortunately you can often use the incremental update (process update) option to avoid reprocessing cubes.

7.  Changing facts. Normally one would think that records in the fact tables are static – once the record gets to the warehouse your job is finished, right? Unfortunately the answer is "it depends." In some cases like in a data warehouse tracking patients' hospital stays, all records are generally static. If your hospital stay lasted from January 1 to January 5, that record isn't likely to change.

But consider the retail industry; all sales aren't final – I'm sure you know people who often take the goods they purchased back to the store for various reasons. Some companies would manage such transactions as a series of credits and debits that balance each other out. But in other cases you must either update or delete fact table records, even after they're added to the data warehouse. For example, if a stock trade is recorded incorrectly, balancing it with a corresponding negative trade might not be acceptable. Here's another thing to consider: you might not want your customers to know about problems in your transactional systems. Rather, you want them to see the data only after it is corrected.

One method for dealing with changing facts is to keep the records in the staging area until the data is examined for quality and then migrate it to the warehouse. Even the most thorough testing won't catch all errors in the data sources, however. Occasionally, you might have to refresh cubes by processing those partitions that contain incorrect data. This is why it's essential to keep your Analysis Services partitions small so that processing will complete relatively quickly.

Another way to deal with this challenge is through a write-back partition. With cube write-back, you're not really changing data in the underlying relational warehouse; instead you're adding a record to a separate partition. When a user queries a particular measure group Analysis Services combines data from read-only partition with the data in write-back partition and exposes the result. Not surprisingly, performing such calculations at query execution time imposes additional overhead on Analysis Server and can cause performance degradation.

8.  Implementing security. Much like with any other business application, it's imperative to secure access to your data warehouse and cubes. Each user might only be permitted to see a small portion of the warehouse, depending on their job title or area of responsibility. Analysis Services implements security through Windows accounts which you can associate with Analysis Services roles. You can secure data at a very granular level, all the way down to individual cube cells.

Unfortunately creating a separate role for each user is not only tedious and cumbersome but could also instigate performance problems – this is particularly true of Analysis Services versions prior to 2005. The advice is to group your users into a handful of roles, if possible. If every single user must have her own set of permissions, you'll have to implement a security dimension. Once you have such dimension every MDX query will have to filter the output based on the identity of the current user.

Naturally I can't cover every data warehousing challenge in a single tip. Nor can any one person claim to be familiar with every difficulty involved in building data warehousing solutions. However, if you're contemplating an attempt to implement such solutions, you should carefully consider what you're up against. Do your homework and be prepared for a battle.

For your Reference

No comments:

Post a Comment