Potential Problems in a Data Warehouse

This is a reposted article describing the problems that can lurk in a data warehouse's daily ETL process, along with some theoretical ways of dealing with them. In keeping with Murphy's Law, anything that could go wrong should be anticipated at design time and the possibility of it happening designed away. The original article follows.

Murphy’s Laws of Data Warehousing: What Goes Bump in the Night?
InfoManagement Direct, August 15, 2008

Fon Silvers

Murphy’s Law tells us that whatever can go wrong will go wrong. This axiom applies to all aspects of life, including data warehousing. The following corollaries to Murphy’s Law relate this inevitability of something going “bump” in the night to data warehousing.

Your data warehouse has been extracting data from a source system. The extract, transform and load (ETL) from the source system to the data warehouse is stable. Everything seems to be working correctly, and then:

1. A rose by any other name - The source system decides to change the meaning of a data element that you are extracting on a daily basis. The structure and form of the data are the same, but the meaning has changed. If the extract, transform and load application enforces relational integrity, you may know about this source data change before a data warehouse customer tells you about it.
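
A minimal sketch of that kind of relational-integrity check inside an ETL step is shown below; the staging and reference tables (stage_orders, ref_order_status) and the status_code column are illustrative assumptions, not objects from any real system.

```python
# Hypothetical sketch: validate incoming codes against reference data before loading.
# Table and column names (stage_orders, ref_order_status, status_code) are illustrative.
import sqlite3


def find_integrity_violations(conn: sqlite3.Connection) -> list:
    """Return staged rows whose status code has no match in the reference table."""
    return conn.execute(
        """
        SELECT s.order_id, s.status_code
        FROM   stage_orders s
        LEFT JOIN ref_order_status r ON r.status_code = s.status_code
        WHERE  r.status_code IS NULL
        """
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ref_order_status (status_code TEXT PRIMARY KEY);
        INSERT INTO ref_order_status VALUES ('OPEN'), ('CLOSED');
        CREATE TABLE stage_orders (order_id INTEGER, status_code TEXT);
        -- 'ARCHIVED' simulates a source-side change the ETL has never seen.
        INSERT INTO stage_orders VALUES (1, 'OPEN'), (2, 'ARCHIVED');
        """
    )
    bad_rows = find_integrity_violations(conn)
    if bad_rows:
        # Raising the alarm here means the ETL team hears about the change
        # before a data warehouse customer does.
        print(f"Relational integrity violations detected: {bad_rows}")
```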

2. Decoy - The source system decides to switch to a new operational application with its new sets of data. For back-out and backward compatibility purposes, the old operational application and its data are allowed to continue their existence. Your only clue is that data volumes from the source system decrease. If the ETL application monitors its data volumes, you may be aware of the change before a data warehouse customer tells you about it.

3. New caboose - The source system adds new processes to the end of the job stream. The new processes mean the data you are extracting, which had previously been in its final state, is now in an intermediate state. The caboose of the train is no longer the caboose, or at least, not the last car. If the ETL application monitors the quality of its data, you might observe the change before a data warehouse customer tells you about it.

4. Data quality - The source system blows a gasket. The quality of the data goes through the floor. If the ETL application is monitoring the quality of its data, you may have been able to see that the source system has a problem. In this situation, the data warehouse is the least of the system’s problems. Even though you’ve exercised the due diligence to identify the presence of poor quality data, the source system is not able to provide high-quality data. In this conundrum, the data warehouse may opt to exclude erroneous data from a source system until that source system can clean up its own data.
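
A minimal sketch of that kind of per-batch quality gate is shown below; the field names and the error-rate tolerance are illustrative assumptions, not values from the article.

```python
# Hypothetical sketch: profile each extracted batch and quarantine it when the
# error rate breaches an agreed tolerance. Field names and thresholds are illustrative.
from dataclasses import dataclass


@dataclass
class QualityReport:
    total: int
    missing_customer_id: int
    negative_amounts: int

    @property
    def error_rate(self) -> float:
        if self.total == 0:
            return 0.0
        return (self.missing_customer_id + self.negative_amounts) / self.total


def assess_batch(rows: list) -> QualityReport:
    """Profile one extracted batch before it is transformed and loaded."""
    return QualityReport(
        total=len(rows),
        missing_customer_id=sum(1 for r in rows if not r.get("customer_id")),
        negative_amounts=sum(1 for r in rows if r.get("amount", 0) < 0),
    )


MAX_ERROR_RATE = 0.05  # illustrative tolerance agreed with a data steward

if __name__ == "__main__":
    batch = [
        {"customer_id": "C1", "amount": 10.0},
        {"customer_id": None, "amount": -3.0},  # simulates the blown gasket
    ]
    report = assess_batch(batch)
    if report.error_rate > MAX_ERROR_RATE:
        # Quarantine rather than load: exclude erroneous data until the
        # source system can clean up its own data.
        print(f"Batch quarantined, error rate {report.error_rate:.0%}: {report}")
```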

5. Data volume - The source system dramatically increases its data volume. The cause of the sudden increase in data volume can be anything from true expansion to corporate reorganization. Regardless of the cause, the effect is a significantly higher number of records passing through the ETL application, causing data files to exceed their allocations and the data warehouse to miss its service level agreement (SLA). If the ETL application monitors its data volumes, you may notice the change before a data warehouse customer asks you why the data warehouse failed to meet its SLA.
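
A minimal sketch of such volume monitoring is shown below; the row counts and the 30 percent tolerance are illustrative assumptions. Because it flags swings in either direction, the same check would also surface the silent decrease described in the Decoy corollary above.

```python
# Hypothetical sketch: compare today's extract volume against a rolling baseline
# and alert on a large swing in either direction. All numbers are illustrative.
from statistics import mean
from typing import Optional


def volume_alert(history: list, today: int, tolerance: float = 0.30) -> Optional[str]:
    """Return an alert message when today's row count deviates from the recent
    average by more than the tolerance, otherwise None."""
    if not history:
        return None
    baseline = mean(history)
    deviation = (today - baseline) / baseline
    if abs(deviation) > tolerance:
        direction = "above" if deviation > 0 else "below"
        return (f"Extract volume {today} is {abs(deviation):.0%} {direction} "
                f"the recent average of {baseline:.0f} rows")
    return None


if __name__ == "__main__":
    recent_daily_counts = [98_000, 101_500, 99_700, 100_200, 102_100]
    print(volume_alert(recent_daily_counts, today=163_000))  # sudden growth (this corollary)
    print(volume_alert(recent_daily_counts, today=41_000))   # silent decrease (the Decoy)
```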

6. Higher power - Upper management decides the data warehouse should include one or more subject areas specifically in support of a new high-profile, high-ROI and politically charged application. The data will be extracted, stored and reported to meet the specific requirements of the new operational application. This is when a data warehouse is a victim of its own success. The reason upper management makes such a decision is that the data in the data warehouse is expected to increase the value and ROI of the new operational application. The best defenses against such attempts to operationalize a data warehouse are:
   1. A complete and comprehensive enterprise data model, so that all subject areas of the enterprise have a home;
   2. An understanding of and commitment to the data warehousing principles established by Ralph Kimball and Bill Inmon; and
   3. Knowledge of available data warehousing architectures that simultaneously follow data warehousing principles and provide data to operational applications in order to meet specific requirements.

7. Wrong question - Data warehouse customers try to make business decisions by asking a question of the data warehouse and then finding the answer using the data in the data warehouse. But if they don’t understand how to formulate their question or how to find the answer, they often blame the data warehouse. The best remedy for such misunderstanding is metadata. Metadata provides descriptions and definitions of the data in a data warehouse. Delivered side-by-side with the data, metadata gives data warehouse customers the information they need to correctly formulate questions and answers.

8. Attractive nuisance - You find out that a table or view, which was descoped by a project a year ago, has been erroneously allowed to continue its existence in the data warehouse. To make matters worse, data warehouse customers and operational applications found this table or view and are actively using it. As a result, you have stale and obsolete data that is being used actively. The best defense against such mishaps is dynamic metadata, which records the activity (i.e., the number of rows loaded, the load date and the row count) of that table. Metadata providing statistics about individual tables would lead you to wonder why there is a table that is queried but never loaded.
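
A minimal sketch of that metadata audit is shown below; the in-memory load-audit and query-usage records stand in for a real metadata repository, and the object names and dates are illustrative.

```python
# Hypothetical sketch: cross-reference load-audit metadata with query-usage
# metadata to surface objects that are still queried but no longer loaded.
# The dictionaries stand in for a real metadata repository.
from datetime import date, timedelta

last_load = {                 # object -> most recent ETL load date
    "sales_fact":      date(2008, 8, 14),
    "old_margin_view": date(2007, 6, 1),  # descoped a year ago but never dropped
}
queries_last_30_days = {      # object -> query count from the usage log
    "sales_fact": 4210,
    "old_margin_view": 57,
}

STALE_AFTER = timedelta(days=90)
today = date(2008, 8, 15)

for obj, hits in queries_last_30_days.items():
    loaded = last_load.get(obj)
    if hits > 0 and (loaded is None or today - loaded > STALE_AFTER):
        # An actively queried but never-loaded object is the attractive nuisance.
        print(f"{obj}: queried {hits} times but last loaded {loaded}")
```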

9. Locksmith - The key of the source data changes. Usually when this occurs, a new data field has been added to the key. Suddenly, the changed data capture process of the ETL application misinterprets source data. If the ETL application monitors its data volumes, you may have noticed the change in the number of records the ETL application applies to the data warehouse.
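
A minimal sketch of a guard for this corollary is shown below; it assumes the ETL application keeps its own record of the key columns it was built against, and the column names are illustrative.

```python
# Hypothetical sketch: verify the source key definition before changed data
# capture runs, so a redefined key fails loudly instead of being misread.
EXPECTED_KEY = ("store_id", "transaction_id")  # illustrative key columns


def check_key_definition(source_key_columns) -> None:
    """Fail fast if the source system has redefined its key, rather than letting
    changed data capture silently treat every row as new."""
    if tuple(source_key_columns) != EXPECTED_KEY:
        raise RuntimeError(
            f"Source key changed: expected {EXPECTED_KEY}, got {tuple(source_key_columns)}"
        )


if __name__ == "__main__":
    # The source system has quietly added a new field to its key.
    try:
        check_key_definition(("store_id", "transaction_id", "register_id"))
    except RuntimeError as err:
        print(err)
```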

10. Didn’t I Tell You? - You find out that a logic gap exists in an ETL application. This logic gap is causing the ETL application to misstate the data it is loading into the data warehouse. Such a logic gap can be the result of new logic recently added to an operational source system, or it may have existed from the beginning of the ETL application and only recently been revealed by a new set of conditions. The number of rows has not changed. The keys of the data have not changed. It could be as simple as the application of a new sales tax or profit margin. The best defense against such logic gaps is a good working relationship with a data steward who knows the subject area and can recognize such errors.

It is rather easy to create an ETL application that trusts the world to be a perfect place wherein nothing goes wrong. A robust ETL application, however, includes metadata, monitors its own data volumes and assesses data quality. A robust data warehouse is built on a broad foundation provided by a comprehensive enterprise data model and data warehousing principles. These methods that enhance the quality and viability of a data warehouse can be a tough sell when allocating personnel, funds and other resources to the creation of a data warehouse. That is, until you understand the world is not a perfect place and anything that can go wrong, will go wrong - even in a data warehouse.

Fon Silvers graduated from the University of South Florida with an MBA concentrating in Information Systems. He is currently an ETL analyst for a Fortune 500 retail corporation, developing ETL applications for a data warehouse program. In March 2008 Silvers published his first book, Building and Maintaining a Data Warehouse. He may be reached at fon.silvers@verizon.net.