Thursday, August 13, 2009

Enterprise Metadata

1 What is Metadata?

"Meta" is a Greek word meaning transcending, or going above and beyond. Metadata by definition is data that describes other data. Technology describes the term as referring to files or databases with information about another's attributes, structure, processing or changes. It can describe any characteristics of the data such as the occurrence or its quality.


But metadata is not just about the database and how data is placed in it. There can be other type of metadata as well. Let us try to understand the other possible type of metadata with the help of following example.

1 How to classify metadata?
Widely metadata is classified in two types:
· Technical or Administrative meta-data
· Business meta-data
2.1 Technical or Administrative Metadata
Administrative meta-data includes information about such things as data source, update times and any extraction rules and cleansing routines performed on the data.
It includes:
· Definition of source and target
· Schemas, dimensions, hierarchies
· Rules for extraction, cleaning
· Refresh, purging policies
· User profiles, access control
· Data lineage
· Data currency (e.g. active, archive, purged)
· Use Stats, error reports, audit trails
· ETL Tools
2.2 Business Metadata
Business meta-data, on the other hand, allows users to get a more clear understanding of the data on which their decisions are based. Information about calculations performed on the data, date and time stamps as well as meta-data about the graphic elements of data analysis generated by front end query tools.
It Includes:
· Business terms and definitions
· Object definitions and object help
· Data ownership, charging
· Data modeling tools
All types of metadata are critical to a successful data mart or warehouse solution. How well the data warehouse replenishment solution you choose manages and integrates metadata may affect the performance of presentation tools and the overall effectiveness of the data warehouse.
2 Why is Metadata Required?

Following are some of the main reasons for why metadata is required:

Metadata is required to understand the data.
Metadata provides information about where the data came from, when it was delivered, what happened to it during transport, and other descriptions can all be tracked.
Metadata gives context to the information.
Business rules/ Calculations applied to get the data can be analyzed or studied.

Metadata is required to manage the data.
Metadata explains how data is stored in a database.
The operational metadata gives us the statistics of data, e.g. error reports, use stats etc.
Metadata helps in Impact analysis for change in database schema etc.
Metadata helps in co-relating/ combining data from heterogeneous applications.
Metadata helps to ensure data accuracy, integrity and consistency of data.
Metadata minimizes the risk of system down time by eliminating the dependencies upon specialized system experts

Metadata is required in taking the business decisions.
Metadata helps the OLAP tools to exploit the correct information from warehouse.
As stated earlier, as the metadata gives the context to the data, it helps the business user to understand the data and relate various data.
Good Metadata makes it easier to use the data warehouse, so that the turn-around for information requests is faster.
Metadata gives us the information about how the historic data is saved. This data can be used for analyzing the business processes.
Metadata improves operational efficiency and customer certainty about the data.

3 How does Metadata help?

Metadata can help various types of users in following manner.

4.1 To Business Users

· Examining meta-data enhances the end user's understanding of the data they are using.
· It can also facilitate valuable “what if” analysis on the impact of changing data schemas and other elements.
· Identify and locate the information that they need in warehouse.
· Analyze the aggregation, summarization rules applied to data.

4.2 To Technical Users

· Metadata helps them ensure data accuracy, integrity and consistency.
· Metadata helps them in co-relating/ combining heterogeneous applications.
During data replenishment, solutions should store meta-data in tables located in the publisher and/or subscriber database, enabling companies to share metadata among heterogeneous applications and databases.
· Metadata helps them in defining the Transformation rules, access patterns and entity relationships.



4 Where is metadata in Data Warehousing?
5.1 Meta data storage
For business related decision-making user has to be given easy access to relevant business data. A data warehouse serves as a central repository for recording everything about business information and analysis. The warehouse can be accessed using different methods which include application, query and analysis tools etc. Metadata for theses warehouse applications should be readily available.

Metadata can be stored in two ways:

· Metadata Repository:
Metadata Repository is an old style of managing metadata, going back to the days of the mainframe. In a repository everything is centralized, i.e. all the metadata is integrated and stored at one location, which is the metadata repository of a warehouse.

· Distributed Metadata:
Here, the metadata is present at different locations, but this metadata can be exchanged as per requirements and there is integrity of ownership of metadata.
In a typical warehouse scenario, metadata is present in its raw form in various layers like ETL tool repository, OLAP tool repository and source system metadata.
5.2 Metadata in Data Warehouse

Let us analyze what to look for in every stage of Data warehousing:

Staging:
Staging is a phase between loading the data from source system to ETL database. A staging file may be a flat file that contains extracted and transformed data. The file contains all the data that is to be transported to the warehouse. Following information is considered as metadata for staging area.
1. The location of the staging file
2. Duration, volatility, and ownership of staging file
3. The fields in the staging file
4. The format of fields in staging file
5. Security settings for extract files
6. The statistics about the data i.e. duplicate data, erroneous data in staging file
7. The mapping of fields from staging area to database
8. Data staging area archive logs and recovery procedures

Data Mart:
Accessing data from a warehouse is time consuming, because there are a large number of users and large volumes of data. The use of data mart can solve this problem. It is also called as Business area warehouse or departmental warehouse. Following information about data mart is considered as metadata.
1. Loaded from: operational source or data mart
2. Expected network traffic
3. The users of data mart
4. The department/ business of which the data mart contains the information

ETL (Extraction, Transformation and Loading):
The first data warehouse process extracts data from many data sources. The extraction process extracts the selected data fields from the source. Extraction can take place with the help of routines containing business rules. Next to extraction is transformation. The transformation process integrates and transforms data into a consistent and uniform format for the target database. The aggregation and/or summarization can be applied at this stage. The final stage of ETL processing is loading. The loading process involves integrating, cleaning the data in warehouse and loading the data into target tables. Following data about ETL Processes can be considered as metadata.
1. Each column and its format of a warehouse table
2. The table, view, macro definition
3. Fields of interest from the source
4. Transformation rules from source to target table
5. Rules for aggregation, summarization, calculation over a period of time
6. Rules for stripping out fields, and looking up attributes
7. Slowly changing dimension policies
8. Current surrogate key assignments for each production key
9. Data cleaning specifications
10. Refresh frequency for tables
11. Refresh, purging policies
12. Modification logs
13. Business requirements
14. The relationship between the objects being used during ETL processing
15. Data transform run-time logs, success summaries, and time stamps

Storage (Database, RDBMS):
The metadata for storage Allows proactive assessment of impact of changes by providing enterprise-wide view and relationships of data gathered from many disparate sources. It also minimizes the risk of system down time by eliminating the dependencies upon specialized system experts. Following information regarding storage can be considered as metadata.
1. Schema for the database
2. The number of tables, views, stored procedures, macros in DB and their definitions
3. DBMS-level security privileges and grants
4. The relationship between various objects
5. DB monitoring Statistics
6. Database catalogs
7. DBMS load scripts
8. Partition settings
9. Indexes
10. DBMS backup status, procedures, and security

Application (Source System and Target system):
The source system here refers to the system to which the user actually feeds the data, where as the target system refers to the application which will be used by the end-user. The points mentioned below are considered as metadata for source and target systems.
1. The source of the data
2. The formats and definitions of all the fields from source and target system
3. The mapping information from the source system to data warehouse
4. Business rules, processes, data structures, programs and model definition to eliminate duplicate development efforts.
5. Platform for knowledge sharing across the organization.
6. Data management over time
7. Data dictionaries
8. Data lineage and audit records
9. Source system and target system job schedules
10. Access methods, access rights, privileges, and passwords for source and target access

OLAP and Reporting application:
On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. That is OLAP is a category of applications and technologies for collecting, managing, processing and presenting multidimensional data for analysis and management purposes. The function of reporting application is to display the required fields, in the layout specified by the business user. Following information about OLAP and reporting application can be considered as metadata.
1. Information related to the objects defined in OLAP/reporting tool e.g. their definition and attributes
2. Transformations applied in reporting tool
3. Business interests i.e., the measures, dimensions, hierarchy…present in the report
4. Slowly changing dimension policies
5. Rules for aggregation, summarization, calculation over a period of time for the fields in the report
6. Business terminologies
7. Report Layout
8. Detailed information of all the reports developed
9. The drill through and drill down specifications
10. The report schedules and the distribution list
11. Transformations required for data mining (for example, interpreting nulls and scaling numerics)
12. Network security user privilege profiles, authentication certificates, and usage statistics, including logon attempts, access attempts, and user ID by location reports
13. Usage and access maps for data elements, tables, views, and reports
5 What to consider in Metadata Design?

For business to analyze the decision process, the business must convert the data into reliable, reusable information asset to improve the operational efficiency and Customer certainty. Metadata management can provide solution to this. Metadata management is a key to eliminating information disparity, rapidly deploying information solutions, integrating disparate data sources, finding and sharing information assets and making the information coherent.

While designing metadata following points should be taken into consideration:

· Business-
The source and the target system should be analyzed and then business rules to migrate the data should be created. These rules should not violate the business considerations/ requirements.
· End User-
The areas which are of the interest of end user should be covered. E.g. all the required dimensions and measures should be present in the report specifications.
· Performance-
The turn around time for the queries fired should be as low as possible. The data should be made available to the ETL as well as reporting tools in minimum possible time.
· Accessibility-
The metadata should be easily accessible to the business as well as technical users.
· Standardization-
All the terminologies used across the metadata should be standardized, so that there should be no problem while integrating metadata from various locations. The metadata should be consistent over all the locations where it is stored.
· Historical Data-
Data warehouses also store the historical information. So the transformation rules applied over a period of time also need to be stored as metadata. Meta data design should make a provision to include the historical information as well.
· Up to date and accurate-
As the metadata is accessed for information it must be accurate and up-to-date.
· Completeness-
The metadata should include data about all the objects representing a data warehouse. Incomplete metadata may lead to improper analysis or erroneous reporting.