|
A
Ad-Hoc Query
An Ad-Hoc Query is a database
query
(or request for information) that
is dynamically entered by a database user. Static queries that
exist in production reports or provide data results for
user-interface screens are not considered Ad-Hoc Queries.
Agent
An Agent is an application or automated
routine that searches data sets for pre-defined thresholds or
patterns. Upon encountering a targeted condition, the Agent
performs a set action to report or record the event (also called
an
Alert).
Examples of resulting actions include report generation, e-mail
notification, and trigger execution.
Aggregate
An Aggregate is summary information stored in
a
Data Warehouse.
Instead of repeatedly summarizing the same information to
perform multiple analyses, information in a warehouse may be
pre-summarized for performance. The two primary uses for
Aggregates include:
Improving Performance – Less I/O means less processing time.
Warehouses must contain the lowest level of
detail that is required to support specific analyses.
Query Redirection,
Materialized Views, or manually User Interface coding provides
access to the level of detail requested by
end-users.
Having both the low-level and summarized information available
provides the fastest query response time for the information
There are five aggregate functions defined in standard
SQL
(Structured Query Language): SUM, COUNT, MIN (the lowest value),
MAX (the highest value), and AVG (the average value).
Aggregate Navigation
Also Known As :
Materialized View
Alert
A message sent automatically by a computer
system when a certain situation occurs. One of the greatest
benefits of data warehousing is the ability to set alerts.
Practical applications of Alerts include:
-
A store
manager can be automatically informed when a certain product's
sales fall below or rise above a specified range.
-
A factory
manager can be automatically informed when the failure rate of a
product exceeds a specified level.
-
A sales
manager can be automatically informed when a member of his staff
achieves a personal high level of sales for a time period.
Alerts allow a company to receive critical
business information in the quickest possible time.
Atomic Data
Atomic Data are the data elements that
represent the lowest level of detail. For example, in a daily
sales report, the individual items sold would be atomic data,
while rollups such as invoice and summary totals from invoices
are aggregate data.
Attribute
See Data Model
B
Business Intelligence (BI) Tools
Business Intelligence Tools are software
programs enabling business users to see and use large amounts of
complex information. Classes of BI tools include:
-
Multi-Dimensional Analysis Software -
Also Known As:
OLAP (On-Line Analytical
Processing) - Software that gives the user the opportunity to
look at the data from a variety of different dimensions.
-
Query Tools-
Software that allows the user to ask questions about patterns or
details in the data.
-
Data Mining
Tools - Software that automatically searches for significant
patterns or correlation in the data
Business Rule
A Business Rule is a calculation or
documented procedure applied to business
attributes.
Defining of Business Rules occurs during Data Warehouse design.
Procedures or Functions within the ETL process enforce these
business rules.
back to top
C
Calculated
Measure
Also Known As:
Derived Data
Cardinality
Cardinality is the degree of even
distribution of values across a specific table column. Low
cardinality distributions contain few unique values over a set
or subset of rows. High cardinality distributions contain many
unique values. A unique primary key represents the highest level
of cardinality.
Cell
A Cell is a single data value stored at the
intersection of a set of dimensional values. An example of a
cell would be the value for a
measure
Revenue in a
Cube
for a specific Product, Time, and
Customer intersection. Cells may be blank if there is no measure
for the intersection.
Client/Server
Client/Server is a distributed technology
architecture where process load is distributed between two sides
of a transaction. Software on a client (desktop PC) is
responsible for much of the user-interface rendering activities.
The server component is only responsible for providing the data
sets needed by the client component.
Column
A Column is a single
attribute
of a
table,
view, or query.
Cube
Also Known As:
HyperCube or Multi-Dimensional
Cube. A Cube is a data structure and access method for
multi-dimensional information. Cubes represent an efficient
storage algorithm for routinely
aggregated
or
drilled
into analytical data. Desktop
OLAP
tools package
multi-dimensional information into Cubes distributable across
the network or to individual computers.
back to top
D
Dashboard
Also Known As:
Digital Dashboard or
EIS
A Dashboard provides an executive view of
specific Key Performance Indicators within an organization.
Dashboards typically display high-level measurements and do not
provide drill-down capability.
Data-Based
Knowledge
Knowledge derived from data using
Business Intelligence
Tools and the process
of
Data Warehousing.
In the past, business users based decisions on a combination of
experience, perception, and intuition. Decision Support Systems
consolidated business information into a single system that
mirrors the decision maker’s thought processes. Data based
knowledge has the following advantages over experience/intuition
based knowledge:
-
Accuracy because it is a consolidation from several sources,
-
Currency because the data gathering and publishing process is
automatic,
-
Flexibility because different perspectives of the same
information are available depending on the business view needed,
-
Insightful because complex data patterns emerge from
data mining
that otherwise go
unnoticed,
-
Less
subjective because conclusions are tied directly to the physical
data.
Data Cleansing
Also Known As:
Data Quality Assurance
Data Dictionary
The Data Dictionary is the catalog of data
elements within a database
system. Data Dictionaries
maintain the
metadata
needed to manage and define data elements.
Data Extraction
Data Extraction is the process of extracting
specific
source data
to satisfy a data loading or reporting needs. Typically,
Data Warehouse
data extraction occurs on
one or more source
OLTP
systems. Business rules and
filtering applied during the data extraction process minimize
the volume of data transferred to the data warehouse or data
warehouse stage area.
Data, Information, and Knowledge
Data is the reality that a computer records,
stores, and processes. The use of computers to manipulate,
report, and analyze this reality is referred to as data
processing. Data only has meaning when combined with information
that describes the data’s relevance or application ( also
known as:
metadata).
For example, the raw numbers within a spreadsheet have no
meaning without the column and row headers defining a
relationship.
Information is what a person is able to
understand or deduce about data based on the available metadata.
Knowledge is the result of combining
information with experience and is what a business uses to make
decisions. The process of organizing information in such a way
as to create data-based knowledge is called Data Warehousing.
Business Intelligence Tools are software products that present
knowledge to business users.
Every year the amount of raw data in the
world is approximately doubling. The goal of business
intelligence and
data warehousing
is changing data into
knowledge and incorporating the knowledge back into a business
decision making process.
Data Mart
A Data Mart is a database having the same
characteristics as a data warehouse, but focused on the data for
one division or one business unit within an enterprise. The Data
Mart may precede or succeed the
Data Warehouse
depending on the architecture and implementation plan.
Typically, the data mart is the prototype or
the first step of a data warehousing process. An enterprise
picks the division or group that would most benefit from
data-based knowledge
and constructs a data mart to satisfy specific business needs.
Proper planning performed during the design incorporates
attributes needed by other business requirements, thus
facilitating future integration. As business needs expand,
additional subject areas are added to the design and eventually
the data mart becomes a data warehouse.
Data Marts may also be created after a Data
Warehouse for a number of accessibility and performance reasons.
Data Migration
Data Migration is the act of moving data from
one environment to another. In
Data Warehousing,
migration occurs when information transfers from a legacy system
into a data warehouse.
Data Mining
Data Mining is the process of finding hidden
patterns and relationships in the data. Analyzing data involves
the recognition of significant patterns and relationships. At a
simple level, Business Analysts recognize relationships between
common business objects such as: customer, product, and time
relating to a specific sales transaction. Business Analysts can
effectively see patterns in small data sets where viewing all
factors simultaneously is possible. As information volumes grow,
however, there is a need for complex analyses to identify
patterns and relationships that are not apparent. Data Mining
tools interrogate information and identify relationships that
may become very strategic when planning campaigns or profiling
behavior.
One of the primary advantages of Data Mining
for
Data Warehousing
is that users do not need to have a preexisting hypothesis about
the information to analyze it. Without Data Mining, Analysts
must “look” for trends in information. With Data Mining, the
system “finds” patterns using techniques like “classification”,
“association”, or “clustering”.
Data Model
A Data Model is a logical representation of
data element ( Also
Known As: Attribute)
groupings, relationships, and
meta data
independent of software or
hardware. Data Models may exist in a Logical or Physical form. A
Logical Data Model defines a design in terms of business
processes and business relationships. A Physical Data Model
represents the physical implementation of the Logical design in
the host RDBMS system.
Modeling tools such as ER Win or System
Architect allow designers to develop a data model that is
compatible with multiple
RDBMSs.
An Entity within a data model consists of
logical groupings of attributes. Entities of a Physical Data
Model directly corresponds to a table or view within a database
system. Meta data for each attribute and entity for a data
warehouse project should be contained within the data model.
Data Quality Assurance
Also Known As:
Data Cleansing or Data Scrubbing
The process of checking the quality of the
data being imported into the data warehouse.
Data quality assurance is one of the greatest
challenges in the process of data warehousing. If the data-based
knowledge generated by the data warehouse is to be trusted, the
data entered into the warehouse must be complete and accurate -
"garbage in, garbage out".
Data quality can be a challenge for several
reasons:
-
The data is
being consolidated from a variety of legacy sources that may
have differing definitions of key concepts such as "customer" or
"profit".
-
The legacy
data was not originally collected for the purpose of decision
support so some of the key data might be missing, incomplete, or
not as accurate as desired.
-
There might be
times when all the data is not received from one of the legacy
systems. This could make comparisons between time periods
invalid.
A significant portion of time in the
development process should be set aside for setting up the data
quality assurance process and implementing whatever data
cleansing is needed.
In a production environment, there should be
a data quality report generated after each data warehouse
import. There should be provision for rolling back an import if
data quality testing indicates that the data is unacceptable.
Data Scrubbing
Also Known As:
Data Quality Assurance
Data Transformation
Transformation is the modification of data
during the
data extraction
and loading of a
data warehouse.
This modification can include:
Data Warehouse
Also Known As:
Information Warehouse or
Warehouse
A Data Warehouse is a physical staging area
for decision-support information. It collects data from various
applications in an organization's
operational systems,
integrates the data into a logical
model
of business subject areas, stores
the information in a manner that is accessible and
understandable to non-technical decision makers, and delivers
information to decision makers across the organization through
various report writing and query tools.
The term “Data Warehouse” was coined by Bill
Inmon in 1990 as a database that is a subject-oriented,
integrated, time-variant and non-volatile collection of data to
support a decision making process. Specific definition of these
terms is:
-
Subject-oriented - Data that gives information about a
particular subject instead of about a company's on-going
operations,
-
Integrated - Data that is gathered into the data warehouse from
a variety of sources and merged into a coherent whole,
-
Time-variant - All data in the data warehouse is identified with
a particular time period,
-
Non-volatile - Data is stable in a data warehouse. Information
is added, but data is never removed.
This enables management to gain a consistent
picture of the business.
Data Warehousing
Data Warehousing is the process of visioning,
planning, building, using, managing, maintaining, and enhancing
data warehouses
and/or data marts. There
are many steps in the data warehousing process:
-
Visioning - Having an idea about what could be accomplished.
-
Learning - Studying the potential of data warehousing.
-
Justifying - Developing a business purpose for the process.
-
Budgeting - Counting the cost.
-
Deciding - Making a commitment to develop and use data-based
knowledge.
-
Gathering Information - Examining legacy systems.
-
Interviewing Users - Finding what information is needed.
-
Choosing Tools - Choosing the hardware, the database management
system, the data extraction tools, and the Business Intelligence
tools..
-
Building, Using, Testing, and Evaluating the Prototype - Repeat
this step and the above steps as necessary.
-
Deploying - Putting the system into operation.
-
Training - Helping users make full use of the Business
Intelligence tools.
-
Managing - Keeping track of scheduled data replication, system
usage, and query performance.
-
Adding,
Modifying, On-Going Development - As the system is used, new
possibilities will be discovered. Consider also all the actions
that take place as a part of the data warehousing process.
-
Data
Replication - Periodic copying of legacy data.
-
Data
Transformation - Transforming the legacy data into the form in
which it will be stored in the data warehouse.
-
Data Quality
Assurance - Testing the data for inconsistencies and errors.
-
Data Storage -
Storing the data in a
DBMS
(Database Management System).
-
Metadata Storage - Storing the description of the data - the
data about the data.
-
Data Mart
Population - Populating all the data marts that receive their
data from the warehouse.
-
Setting Up
Business Intelligence Tools - Giving users access to the data
through
multi-dimensional
analysis, querying, and data mining.
-
Setting Alerts
- Establishing conditions that result in an automatic message
being sent.
-
Data
Warehousing Management - Keeping track of how well all the other
actions are being carried out.
Database
Management System (DBMS)
The DBMS is the software used to store,
access, and manage data. Examples of these software include
Oracle, SQL Server, and Sybase. There are two main types of
Database Management Systems used for business intelligence and
data warehousing - specialized Multi-Dimensional Database
Management Systems (MDBMS) and the more widely used general
purpose Relational Database Management Systems (RDBMS).
Data Webhouse
A Data Webhouse is a data warehouse built
specifically for collecting and analyzing web page user
information. The primary source of information in a Webhouse is
the web server logs. Within these logs is a plethora of
information that defines every action performed during a user
session. When integrated with other subject areas of traditional
data warehouses,
Webhouses relate customer information or history to web site
behavior.
DBMS
Also Known As:
Database Management System
Decision Support System (DSS)
A DSS is a computer system designed to assist
an organization in making decisions. The data store behind a DSS
is typically a
Data Warehouse
or
Data Mart,
however, some DSS do not require the capabilities found in a
warehouse design.
De-Normalization
De-Normalization is reversing the application
of
Normalization
techniques to information. Normalization enforces data integrity
and minimizes storage space usage. The downside to a normalized
design is additional processing overhead incurred during query
processing to re-assemble the information before display or
analysis. De-Normalizing a design eliminates this overhead by
joining frequently accessed information into the same table. In
a
Data Warehouse,
dimensions
are very often
De-Normalized for query performance improvements.
Density
Opposite of:
Sparsity
Derived Data
Also Known As :
Calculated Measure
Data not directly supplied by source systems,
but calculated according to business rules. Derived data
includes
aggregated data
and other data
transformed
from its original state.
Desktop OLAP (DOLAP)
Desktop OLAP is a form of
OLAP
tool where a Cube containing a
summary of multi-dimensional data exists on an individual user's
computer.
Dimension
A Dimension is a component of a star schema
containing the criteria by which analyses are constrained.
During data warehouse design, we group
attributes
according to relationship and
level of detail. These groupings form
hierarchies
and classifications by
which business measure are compared and summarized. Each
physical grouping collectively is a dimension.
Examples of a dimension found in most all
data warehouses is Calendar. A Calendar dimension contains a
single record for every day within the domain of a
data warehouse.
Each day has hierarchies of month, year, quarter, season, day of
week, fiscal period, and day of month. Business measure, or
fact,
occurring on a single day group or summarize to any level
hierarchy found in the Calendar Dimension.
Most dimension tables in a
star schema
are intentionally
de-normalized.
This typically does not add a significant amount of storage
space to the database because the overall size of the dimension
tables is very small when compared to the size of the
fact table.
Dimensional Modeling
Dimensional Modeling is the process of taking
business requirements and transforming them into a
data model
specifically designed for analytical reporting. The term
Dimensional Modeling comes from one of the two primary
components of a dimensional model, data warehouse
dimensions.
The other primary component are
facts.
Drill Down, Drill Up, and
Drill Across
Drilling though data is the ability to move
between levels of data
hierarchies
when viewing data within a
data warehouse.
-
Drill Down -
Changing the view of the data to a greater level of detail.
-
Drill Up -
Changing the view of the data to a higher level of aggregation.
-
Drill Across –
Changing the view to another dimensional value at the same level
as the current view.
Drilling down and drilling up allow an
analyst to move down and up levels of summary to see how the
information at the various levels is related. A powerful use of
this capability is to view information at its highest level of
summarization to identify obvious trends in the information. To
further investigate the cause of a trend a user simply drills
down one or more levels to uncover the detail behind the
summarized information
Dynamic Queries
Dynamic Queries are dynamically constructed
SQL
statements (Also
Known As:
Queries)
based on conditions derived from results of other queries.
Desktop query tools often use this approach to simplify the
interface between the tool and the database. Dynamic Queries
parse
at run time.
back to top
E
End Users
Also Known As:
Business Users
End users are the business users that
interface to information through a front-end program or set of
reports.
Enterprise Data Warehouse
Enterprise Data Warehouse is a term used to
describe a relational
data warehouse
integrating strategic
knowledge
from several departments within an enterprise. Often,
Enterprise-level Data Warehouse planning identifies
subject areas
and clearly defines
monetary priority for implementation. The Enterprise Data
Warehouse implementation occurs as development of integrated
Data Mart
components.
Enterprise Information System / Executive
Information System (EIS)
Also Known As:
Digital Dashboard
An EIS is a computer system or user-interface
that presents a summary of a company's important data on a
routine basis.
Entity or Entity Relationship Diagram (ERD)
See
Data Model
Extract Transform and Load (ETL)
ETL is the complete data movement process
from source systems to the production
data warehouse.
Components of ETL are Data
Extraction,
information consolidation or cleansing, and loading.
Extract Specification
The Extract Specification is the standard
defined interface of information between
source
systems and the
data warehouse.
This specification lists the individual attributes and sources
of data elements needed to maintain production loads within the
data warehouse system. Attributes are grouped by their business
characteristics and required data loading strategy.
back to top
F
Fact
Facts are data elements associated with
business activity or business performance measurements ( Also
Known As: Key
Performance Indicators or KPI). Typically, facts are
quantitative values such as gross sales dollars or costs
dollars. Cube
cells
contain fact values.
Fact Group
A fact group is a group of
fact tables
that collectively make up
a
subject area.
Fact Table
In a star schema, the central table
containing
Facts
has foreign-key
relationships to all
dimension
that are relevant to the fact
measurement. There are two types of fields in a fact table:
-
Fields storing
the foreign keys that connect each particular fact to the
appropriate dimension value,
-
Fields storing
the individual
facts.
back to top
G
Granularity
Granularity is the level of detail of
facts
stored in a
data warehouse.
The granularity of the fact table is one of the most significant
design decisions in creating a data warehouse. Facts should be
at an
aggregation
level low enough to
support the lowest level of analysis. For example, if data
warehouse requirements state it is necessary to summarize
product SKU
sales, the fact table granularity
must be at an invoice line-item level, not the invoice ticket
level.
back to top
H
Hash
Hashing is a mathematical method of random
distribution. In databases, hashing mathematically assigns one
of a fixed number of codes to an attribute value in an attempt
to evenly distribute values across multiple data
partitions. Hashing is not effective
when a data set has a large volume with low
Cardinality.
Hierarchy
A hierarchy is a relationship between
attributes in a table consisting of sets and sub-sets.
Hierarchies allow traversing of
aggregation
levels within data by
querying
with
SQL
using a GROUP BY clause. Moving
between the levels of a hierarchy is called
drilling up and drilling down.
Hybrid OLAP
(HOLAP)
HOLAP is a hybrid architecture that combines
the advantages of Relational OLAP ( ROLAP)
and Multi-Dimensional OLAP (MOLAP)
architectures. In HOLAP architecture, aggregated levels of
hierarchies
are stored in MOLAP
cubes
for
query
efficiency. Detail information
(data at the lowest level of
aggregation)
that would be too cumbersome to store in cubes resides in ROLAP
architecture in the production
data warehouse.
HOLAP provides connectivity to the large data sets in the
relational database while taking advantage of the faster
performance of the multi-dimensional aggregation storage.
HyperCube
Also Know As:
Cube
back to top
I
Index
An Index is a database structure enabling
rapid table searching by one or more column criteria. Indexing
subsets of columns in a table enabling queries to access only
the necessary database blocks, thus reducing disk access (and
time). A query processor utilizes an index only if doing so
reduces the number of disk reads needed to return the query
results.
Two common types of indexing strategies used
by
RDBMSs
are Btree and Bitmap.
Btree indexes are tree-like data structures sorted by the index
key field(s). Theoretically, the tree structure has many
branches to minimize the number of comparisons necessary to
reach a leaf node. Bitmap indexes are effective for columns with
low
cardinality
distribution of values. A
bitmapped index contains a record for each record in the table.
The bit count in a bitmap index record matches the number of
distinct values in the column(s) indexed (plus overhead). Binary
comparisons of bitmaps make searches extremely fast.
back to top
L
Legacy System
See Also:
Source Database
A Legacy System is an established computer
system that is an integral part of a company’s daily business
operations. One of the biggest challenges of the
data warehousing
process is to extract and consolidate data from a variety of
legacy systems through out an enterprise.
back to top
M
Materialized
Views
Also Known As:
Aggregate Navigation
A Materialized view is an Oracle
RDBMS
feature that provides
summary-level information using a database snapshot. Once
created, Materialized Views use
Query Redirection
to transparently process a
query
request in the most efficient
manner.
Measure
Also Known As:
Fact
Meta Data
Meta Data is information describing the
application, constraints, source, transformation, meaning,
units, relationships, and relevance of data in the
warehouse. A common
definition of Meta Data is that it is data about data. In Data
Warehousing, Meta Data is any information about a database
attribute
needed to support and manage the
operation of the data warehouse.
Middleware
Middleware is a communications layer allowing
applications to interact across hardware and network
environments.
Multi-Dimensional Analysis
Also Known As:
OLAP (On-Line
Analytical Processing)
Multi-Dimensional Cube
Also Known As:
Cube
Multi-Dimensional
Database Management System (MDBMS)
An MDBMS is a database management system
designed specifically to support efficient storage and retrieval
of
OLAP
data. The MDBMS interfaces with
or contains a user-interface that supports
drilling through hierarchies
in the data. The structure where
an MDBMS stores information is a
cube.
Multi-Dimensional On-Line Analytical
Programming (MOLAP)
Also Known As:
OLAP
back to top
N
Natural Key
A Natural Key is a primary key within a
source
system. Natural keys cannot
always be primary keys in a data warehouse dimension because
multiple versions of a record may need to be stored.
Non-Volatile
Non-Volatile data is data that does not
change over time. Most all data in a
data warehouse
is non-volatile. Since
data warehouses record and reconstruct history, changes to
existing data are rare. Non-volatility is one of the original
defining characteristics of a data warehouse.
Normalization
Normalization is the process of organizing
data to eliminate redundancy. In
OLTP
systems, normalization enforces
referential integrity,
minimizes storage requirements, and optimizes the data insertion
performance.
The downside of normalization is the
increased overhead required to assemble data sets at
query
time. An example of a normalized
design is a contact database where addresses are stored in a
separate table from contact name. Two contacts with the same
address would join to a single address record rather than the
address information duplicated in the same table as the contact
name.
Normalization exists in
Data Warehouse
Fact Tables.
Dimension
Tables in a data warehouse
de-normalized information to increase query performance.
back to top
O
OLAP (On-Line Analytical
Processing)
The term OLAP distinguishes
data warehousing
activities from On-Line
Transaction Processing (OLTP)
business activities. In its broadest usage, OLAP is a synonymous
to data warehousing. In a narrow usage, OLAP refers to the tools
used for Multi-Dimensional Analysis.
The OLAP process organizes and
aggregates
data into levels of business
information (facts)
and reporting these levels by business classifications (dimensions).
A spreadsheet is a two-dimensional analysis tool. OLAP provides
spreadsheet-like analysis simultaneously spanning as many
dimensions as the business needs to support. Several software
packages facilitate manipulating and displaying this information
that is too complex to represent in two-dimensions.
OLTP (On-Line Transaction
Processing)
OLTP is a collective term encompassing the
operations and systems that process transactions as part of a
company’s daily business activities. OLTP systems ( Also
Known As:
Legacy)
are the information sources for data warehouses.
Operational Data Store (ODS)
An ODS is an integrated database of
operational data. ODS sources may include
OLTP
systems or any system containing
current or near-term data. One qualifier of an ODS is data
volatility (frequent updates) and the absence of historical
information.
An ODS is a resource for operational
reporting. Often, an ODS is created to consolidate information
from multiple sources into a repository that is used only for
short-term reporting. Once consolidated, the information in an
ODS can be used as a source for a
data warehouse
system.
Operational
System
Also Known As:
OLTP
system.
back to top
P
Parse
Parsing is the act of translating commands
into a language native to an interfacing system. In
SQL,
English commands are parsed into the host
RDBMS
by an interpreter. The
query
is translated, appropriate
information retrieved, and results are returned to the user.
Partition
A partition is a physical segmentation of
information. Partitions organize information into smaller,
manageable segments primarily for performance reasons.
Pivot Table
A
table
used to summarize data
and rotate rows and column titles
to obtain different views of the data. Rows can be “layered” to
reflect the different levels of data that can be viewed.
Primary Key
A primary key is a
column
or columns of a data set
containing as many distinct values as there are
rows.
Most
tables
in a relational database contain
a single primary key.
back to top
Q
Query
A Query is a request for information asked of
a
RDBMS.
The result of a query is typically a single
row
or set of rows from a
table.
Query Redirection
Query Redirection is an Oracle feature that
automatically changes a user’s
SQL
statement to increase efficiency.
This operation is done when the query optimizer determines that
a
materialized view
or other
table
can process a query request with less overhead than the
requested table. A practical example of query redirection is a
user query that
aggregates
a base table and groups by a
particular
column.
If a materialized view exists for the same table that has the
same query summarized already, the query optimizer accesses the
materialized view instead of the base table.
back to top
R
RAID Disk Management
Redundant Array of Inexpensive Disks (RAID)
is a storage technology that allows virtual disk management of
commodity disk drives. RAID technology provides performance
and/or reliability with the flexibility to utilize disks in a
manner that supports business and computational needs. The
various configurations of RAID each have specific benefits and
liabilities.
-
RAID 0 –
Provides non-redundant disk striping. Each disk is divided into
smaller units (stripes) and stripes from multiple disks are
concatenated into logical volumes. The benefit of RAID 0 is
Input/Output is dispersed across multiple disks, thus removing
the chance of performance problems due to disk saturation.
Sequential read and write operations are performed well, but
random read operations may be slower. The liability of RAID 0 is
no fault tolerance.
-
RAID 1 –
Provides exact mirroring of one disk to another for fault
tolerance. This approach allows one disk to fail while the data
integrity remains intact. The benefits of RAID 1 are read
performance (since smart controllers can retrieve the data
sectors on the disk closest to the present head location) and
complete redundancy. The liability of RAID 1 is the cost to
implement because twice the physical disks are needed.
-
RAID 01 or 10
– RAID 10 is a combination of RAID 0 and RAID 1 (striping with
mirroring). This approach does not use any more disks than RAID
1, but provides performance gains with fault tolerance.
-
RAID 2 – No
longer used, technology was improved and replaced by RAID 3.
-
RAID 3 –
Provides RAID 0 striping with redundancy via parity checking
instead of mirroring. The striping is performed at the
byte-level. In RAID 3, a single disk is dedicated to storing
parity information. If any single disk fails, the data can be
reconstructed from the remaining disk using an XOR (exclusive
OR) operation. Besides the redundancy, benefits of RAID 3
include superior performance doing random read operations. The
liability of this approach is the additional write overhead to
maintain the parity disk.
-
RAID 4 –
Similar to RAID 3 except that the striping is performed at the
block-level.
-
RAID 5 –
Combines RAID 0 and RAID 4 techniques to provide block striping
with redundancy via parity checking but instead of having a
dedicated parity disk, the parity may exist on any of the disks
in the stripe. The benefit of this approach is disk economy (as
opposed to full mirroring) with security. The liability of this
approach is the additional write overhead (especially for
sequential write operations) to maintain the parity, but this
liability is spread across all the disks in the stripe.
-
RAID 6 –
Extends RAID 5 to contain two distributed parity disks. Having
two parity disks allows for dual disk failure with no loss of
data integrity. The liability of this approach is that it takes
three writes to save one block of information.
-
RAID 7 or RAID
S – Hardware implementations of RAID 5 with improved
performance.
Referential
Integrity (RI)
Referential Integrity is the process of
ensuring that relationships between database objects are valid
and remain valid. Enforcement of RI rules occurs automatically
by the
RDBMS
during key field insertion,
update, or delete.
Relational Database Management System (RDBMS)
A Database Management System based on
relational theory. Most modern Database Management Systems
(Oracle, Sybase, Microsoft SQL Server) support a standard
language,
SQL,
however, each database vendor has unique extensions to this
standard.
Relational On-Line Analytical Processing (ROLAP)
ROLAP is an OLAP architecture based on a
relational database. A ROLAP design typically uses a
star schema
architecture, which optimizes information for
query
response. ROLAP is ideal for
large databases or
legacy
data for which analyses are infrequent or do not require rapid
query response times.
Replication
Replication is the physical copying of data
from one database to another. In
data warehousing,
replication takes place when data moves from the
OLTP
system into the
data warehouse.
Replication also occurs when a data warehouse feeds one or more
data marts.
Several software tools handle data
replication into a data warehouse. These tools transfer data out
of and into a variety of database management systems, often
performing data transformation and data cleansing operations in
the process.
Heterogeneous replication occurs when the
source and the target database are not the same database
management system. Bi-directional replication is the ability to
copy data in both directions between two databases.
Repository
A database designed for storage of business
Meta Data
and access by
end users
and administrators.
Roll Up
See Also:
Drill Up, Drill Down, Drill Across
Roll Up is the process of
aggregating
data
to a level higher than the
present level of detail. An example of a rollup of ticket
line-items would be to summarize line-item revenues to a
ticket-level total.
Row
A row is a single record of a database
table,
view,
or
query
result.
back to top
S
Scale, Scalable, and Scalability
Scalability is the ability to grow or enhance
performance to meet an increased demand. In
Data Warehouse
architecture, it pertains to a design’s ability to support
current and future data volumes or concurrent user access.
A system scales (or is scalable) if doubling
the number of processors available nearly doubles the volume of
transactions the system performs.
Schema
A Schema is the logical organization of data
in a database. Schemas consist of
tables,
views,
indexes,
and any other database object needed to define and manipulate a
set of data.
SKU
Also Known As:
Stock Keeping Unit
SKU is a retail term for the distinct code
that identifies a single product offered for sale.
Slice and Dice
Also Known As:
Drill Up, Drill Down, Drill Across
Slowly Changing Dimension (SCD)
Typically, operational systems are only
interested in the present states of business objects. As
business objects change, previous states may be lost because
they are no longer relevant. The inability to accurately
reconstruct the state of business at specific points in time may
easily invalidate the integrity of a Data Warehouse. Business
rules ultimately determine how to implement these changes in
business object states.
An SCD is a warehouse
dimension
whose non-key
attributes
change over time. There are three
implementation strategies to implement SCDs, each based on a
business rule. The three available implementations of SCDs are:
-
Type 1 –
Changes in non-key attributes result in updates to the database
record. Previous values are lost. This approach only supports
dimensions where history is irrelevant.
-
Type 2 –
Changes in non-key attributes result in the creation of new
version of the database record. Time or sequencing values allow
reconstruction of history. This approach is the most commonly
used SCD implementation.
-
Type 3 – The
database record contains multiple columns for specific
attributes to store previous or initial value along with current
value. Changes in specific non-key attributes result in updates
to the database record and updates to the tracking attributes.
This approach only supports a fixed number of versions.
Snapshot
A snapshot is a point-in-time representation
of data
in a system.
Databases
offer this feature to facilitate
replication.
Data Warehouses may represent history by
either transactions or snapshots. Transactional representation
formulates current business views by
aggregating
all past history. Snapshot
representation contains the entire state at various points in
time.
Each approach has advantages and
disadvantages. A transactional inventory data model records each
inventory adjustment as it occurs. A snapshot-inventory data
model records inventory levels at various points in time.
Snowflake
A snowflake is a
Normalization
of a
dimension
table within a
star schema.
In snowflaking, some of the fields of the dimension tables are
split off into separate tables. This achieves a higher level of
normalization, but makes the database design more complex and
can reduce the performance of
Business Intelligence
Tools.
One application of snowflaking is to enforce
referential integrity
during data loads.
Source Database
An operational, production database or a
centralized
warehouse
that feeds into another database.
Sparsity and Density, Sparse and Dense
Sparsity is a measure of
cardinality
within an entire data set,
not just a subset of
attributes.
Space is logically reserved in
cubes
to represent business
relationships between all analysis criteria. If business
relationships between analysis criteria do not exist, there is
an empty location within the cube. In OLAP Cube processing,
sparsity is the degree to which the cells of a Cube are
populated.
SQL (Structured Query
Language)
SQL is the standard language for accessing
relational databases. SQL uses English-Language verbs and nouns
to interface with a DBMS. Verb keywords such as “SELECT”,
“INSERT”, “UPDATE”, “DELETE” indicate the data manipulation
operation being requested. A sample query statement that would
return all employee names from Department 10 might be:
|
SELECT full_name |
|
FROM employee_table |
|
WHERE department_number = 10 |
Staging
Staging is the process of preparing
source data
for
warehouse
loading. The warehouse stage area
is a temporary database structure to hold consolidated
information before a production data warehouse load process.
Star Schema
Also Known As:
Hub and Spoke
Star Schema is an architectural method of
organizing information in a data warehouse that supports complex
analytical processing. Star Schemas consists of a centralized
fact table and one or more dimension tables.
Each of the dimension tables has a single
field primary key field with a one-to-many relationship to a
foreign key in the fact table. The printed appearance of this
architecture favors a star diagram. The fact table forms the
nucleus of the star with points of the star represented by
dimension tables.
The star schema is an intentional
simplification of the database design that would be achieved by
following the standard rules of normalization. The dimension
tables are often de-normalized, to allow for more efficient
querying. Another derivation of a star schema is selectively
normalizing attributes within a dimension ( See
snowflaking).
Static Queries
Static Queries are pre-defined
SQL
statements routinely run against
a database. Static Queries may be embedded within automated
report or hard-coded SQL statements within a user-interface.
Subject Area
A Subject area is an area of interest to the
business. For
data warehousing,
a Subject Area is a grouping of one or more
fact tables
and all related
dimension
tables.
Summary Tables
Summary tables store summarized or aggregated
data.
Synchronization
Synchronization is the process of updating
one or more sides of a distributed database to accurately mirror
each other.
Replication
business rules define the
extent and completeness of synchronization.
back to top
T
Table
A table is a physical grouping of attributes
( columns)
into a common data set. The data set often contains multiple
rows
of information that define the
state or properties of business objects.
Target Database
A Target Database is one in which data will
be loaded or inserted.
Time-variant
Time-variance is classification of events or
object states by time.
Data Warehouses
typically associate events
to the particular version of a business object at the time of
the event. For example, a time-variant view of a person keeps
all changes undergone by that person and can provide an accurate
description of the person at any given point in time.
Time-variant is one of the original defining
characteristics of a data warehouse.
back to top
V
View
A view is a logical representation of a
table
or tables. Views are
SQL
statements that appear like a
table to an end-user. SQL queries referencing a view
transparently execute the underlying SQL statement of the view.
Views are used to hide complex table join
logic, change
column
names, or restrict access to columns.
Virtual Cube
A Virtual Cube is a logical combination of
Multi-Dimensional
Cubes.
Virtual cubes are created to provide analyses across multiple
cubes without requiring physical creation of new cubes.
back to top
W
Warehouse Key
A warehouse key is a primary key unique to a
data warehouse
that has no meaning to an
operational system.
Warehouse keys are integer numbers used for efficiency within
warehouse dimensions to represent concatenated keys (often a
source system primary key plus an effective date).
back to top
|