Data Warehouse Surrogate Key Generation
As those of you who watched my recent webinar Data Modeling Fundamentals With Sisense ElastiCube might recall, a primary key is a unique identifier given to a record in our database, which we can use when querying the database or in order to join multiple sources. This article will discuss the concept of surrogate keys and show some examples of when and how to apply them using simple SQL.
General Guidelines for Selecting Primary Keys
Before we dive into natural vs. surrogate keys, let’s recall four important rules to follow when selecting a primary key for your data model:
- The primary key must be unique for each record. A primary key with duplicates will lead to inaccurate queries with duplicated counts and totals. If two customers are assigned the same primary key, their sales activity will be unintentionally blended together. If the customer is accidentally duplicated, their sales activity will also be duplicated. Database architects refer to this as a loss of entity integrity.
- The primary key must apply uniform rules for all records. Whether your key is strictly numeric, alphanumeric, or a random system-generated value, each record must be programmed in a consistent format. This format must exist despite whatever complexities there are in the business requirements. An inconsistent format can lead to difficult data analysis, especially in parent/child data relationships.
- The primary key must stand the test of time. A key based on contextual data at the present time may not have the same contextual meaning later. For example, if a customer ID key is based on customer name, what happens when a customer is acquired or reorganized? Changing key formats should be avoided at all costs. Changing keys requires changing every stored procedure that references the old key in a JOIN or WHERE clause, as well as UPDATEs to all existing references to the old key in all of your database tables.
- The primary key must be read-only. In order to stand the test of time, primary keys should never be edited. Edited primary keys can have typos (123123 vs 132123), varying formats based on the user’s preference (1 vs 000001), and allow for overwriting a previously deleted record. Never allow anyone to edit the value of primary keys.
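To make the uniqueness rule concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (customer_unkeyed, customer, sales) and the sample rows are made up for illustration; they do not come from any real schema. It shows a duplicated customer doubling a sales total, and a declared primary key rejecting the same duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No PRIMARY KEY constraint, so nothing stops a duplicated customer row.
    CREATE TABLE customer_unkeyed (customer_id INTEGER, name TEXT);
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    INSERT INTO customer_unkeyed VALUES (1, 'Acme'), (1, 'Acme');
    INSERT INTO sales VALUES (1, 100.0), (1, 50.0);
""")

# Each sale joins to both copies of the customer, so the total is doubled.
inflated_total = conn.execute("""
    SELECT SUM(s.amount)
    FROM sales s
    JOIN customer_unkeyed c ON c.customer_id = s.customer_id
""").fetchone()[0]

# Declaring the key lets the database reject the duplicate outright.
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(inflated_total, duplicate_rejected)
```

The true sales total in this sample is 150, but the join over the duplicated customer reports 300.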
A surrogate key is unique to one row: it serves as a common handle for the relationships between all the cells in that row. Because of how data is stored, a surrogate key is not strictly necessary to infer the relationship between cells in a row. Even so, surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key is typically a sequentially generated unique number attached to every record in a dimension table.
Selecting a Primary Key: Surrogate vs. Natural Keys
Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is an artificial or synthetic key used as a substitute for a natural key. It is generated internally by the system rather than by the user, it is not derived from application data, and it is usually invisible to the user; a sequential number is a common example. In some databases, the distinction between a primary key and a surrogate key is that the primary key uniquely identifies a record while the surrogate key uniquely identifies an entity; where several records in the database correspond to one surrogate, the surrogate alone cannot serve as the primary key. In a data warehouse, a surrogate key is more than just a substitute for a natural key: it is a necessary generalization of the natural production key and one of the basic elements of data warehouse design.
First, let’s go over the difference between these two forms of primary keys:
A natural key is a key that has contextual or business meaning (for example, in a table containing STORE, SALES, and DATE, we might use the DATE field as a natural key when joining with another table detailing inventory).
A natural key can be system-generated, but natural keys are at least partially determined by a manual process. Some natural keys are totally manually generated. One of the most widely recognized uses of a natural key is a stock ticker symbol – e.g. MSFT, AAPL, and GOOGL. Natural keys serve as a great primary key when contextual meaning is important.
A surrogate key is a key which does not have any contextual or business meaning. It is manufactured “artificially” and only for the purposes of data analysis. The most frequently used version of a surrogate key is an increasing sequential integer or “counter” value (e.g. 1, 2, 3). Surrogate keys can also include the current system date/time stamp, or a random alphanumeric string.
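The three flavors of surrogate key mentioned above can be sketched in a few lines of Python. The function names are hypothetical and the formats are just one reasonable choice, not a prescribed standard.

```python
import itertools
import uuid
from datetime import datetime, timezone

# In-process counter; a real warehouse would normally let the database
# hand out the numbers (identity column, sequence, and so on).
_counter = itertools.count(start=1)


def sequential_key() -> int:
    """Increasing integer: 1, 2, 3, ..."""
    return next(_counter)


def timestamp_key() -> str:
    """UTC timestamp down to the microsecond, as a 20-digit string.

    Two rows created within the same microsecond would collide, so this
    flavor is only safe for low insert rates or combined with another part.
    """
    return datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")


def random_key() -> str:
    """Random 32-character hexadecimal string taken from a version 4 UUID."""
    return uuid.uuid4().hex


keys = [sequential_key() for _ in range(3)]
print(keys, timestamp_key(), random_key())
```

None of these values says anything about the record it identifies, which is exactly the point of a surrogate key.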
When should you stick to natural keys in your data model?
The main advantage of natural keys is in their simplicity and in the fact that the data maintains its original context. They will often be (relatively) easy to recognize to people viewing the data, and relying on natural keys reduces the need to enrich the data using custom SQL. Additionally:
- Natural keys are great for multiple data types in the database. Natural keys allow the user to easily identify the data type from the key, even when multiple data types use similar key formats. Financial databases frequently format their keys using a natural and sequential key together.
Even though all three records contain a sequential ID of 123, the natural key prefix allows the user to immediately identify different data types.
- Natural keys work well when connecting two systems with two different primary key formats. Thus for example, we can use
To create
- Natural keys make for a more easy-to-understand GUI. A customer ID such as GOOGL is easy for a user to recognize (for instance, you likely knew this stock ticker symbol is for Google). Easier recognition also allows for easier search.
Drawbacks of using natural keys
While it might be tempting and initially easier to rely on existing natural keys, this could prove problematic when scaling the data model, or in a more complex environment, which we will demonstrate using an example of stock tickers:
- Natural keys do not apply uniform rules for each record. Designators or variables in the natural key make the key difficult to query and understand after the fact. For example, stock ticker symbols of preferred shares have a multitude of designators, including P, PR, and /PR. Trying to query for the designator P (SELECT * FROM stock_quotes WHERE stock_ticker_symbol LIKE '%P') would return all results where the stock ticker symbol ends in P, regardless of whether the symbol is actually a preferred stock or not.
- Natural keys do not stand the test of time. Symbols which once had business meaning could become meaningless, or bear a different meaning in the future. Thus, for example, the symbols GOOG and GOOGL do not accurately represent the reorganization of the company from Google to Alphabet.
- Natural keys can be easily confused with each other. Sticking with the previous example – when Twitter was ready to launch their IPO under the ticker TWTR, many investors bought from a defunct electronics company named Tweeter, trading under the ticker TWTRQ. Because TWTR and TWTRQ contain the same first four letters, many investors unintentionally invested in the wrong stock. Tweeter later changed their ticker symbol to THEGQ, which could also be misconstrued with GQ Magazine (a privately held company under Conde Nast).
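The pattern-matching problem from the first drawback is easy to reproduce. The sketch below uses Python's sqlite3 module with the stock_quotes table and stock_ticker_symbol column named in the text; the ticker values and the is_preferred column are invented for the demonstration and are not real market symbols.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_quotes (stock_ticker_symbol TEXT, is_preferred INTEGER)"
)
# Hypothetical tickers: two preferred issues and two ordinary ones.
conn.executemany(
    "INSERT INTO stock_quotes VALUES (?, ?)",
    [("ABCP", 1), ("CHIP", 0), ("XYZ/PR", 1), ("XYZ", 0)],
)

# Filtering on the designator embedded in the natural key.
by_suffix = [row[0] for row in conn.execute(
    "SELECT stock_ticker_symbol FROM stock_quotes "
    "WHERE stock_ticker_symbol LIKE '%P' ORDER BY stock_ticker_symbol"
)]

# Filtering on an explicit attribute instead of parsing the key.
by_attribute = [row[0] for row in conn.execute(
    "SELECT stock_ticker_symbol FROM stock_quotes "
    "WHERE is_preferred = 1 ORDER BY stock_ticker_symbol"
)]

print(by_suffix)     # picks up CHIP (not preferred) and misses XYZ/PR
print(by_attribute)  # the two preferred issues
```

The suffix search returns a false positive and a false negative at the same time, because the key does not follow one uniform rule.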
Advantages of using surrogate keys
As mentioned, a surrogate key sacrifices some of the original context of the data. However, it can be extremely useful for analytical purposes for the following reasons:
- Surrogate keys are unique. Because surrogate keys are system-generated, it is impossible for the system to create and store a duplicate value.
- Surrogate keys apply uniform rules to all records. The surrogate key value is the result of a program, which creates the system-generated value. Any key created as a result of a program will apply uniform rules for each record.
- Surrogate keys stand the test of time. Because surrogate keys lack any context or business meaning, there will be no need to change the key in the future.
- Surrogate keys allow for unlimited values. Sequential, timestamp, and random keys have no practical limits to unique combinations.
Combining Natural and Surrogate Keys
Certain business scenarios might require keeping the natural key intact as a means for users to interact with the database. In these cases …
- If a natural key is recommended, use a surrogate key field as the primary key, and a natural key as a foreign key. While users may interact with the natural key, the database can still have surrogate keys outside of the users’ view, with no interruption to user experience.
- If a natural key must be used without an additional surrogate key, be sure to combine it with a surrogate key element. In our financial database example, Expense Reports (ER-123) use a natural key prefix in conjunction with a surrogate sequential key. This format prevents many of the natural key side effects listed above.
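As a rough illustration of the first option, the sketch below keeps a surrogate key as the primary key and stores the natural key as an ordinary unique column. The company and quote tables and the tickers OLDT and NEWT are hypothetical. When the natural key changes, only one row is updated and the related rows are untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (
        company_key INTEGER PRIMARY KEY,   -- surrogate key, hidden from users
        ticker      TEXT NOT NULL UNIQUE   -- natural key that users search by
    );
    CREATE TABLE quote (
        company_key INTEGER NOT NULL REFERENCES company (company_key),
        price       REAL NOT NULL
    );
    INSERT INTO company (ticker) VALUES ('OLDT');
    INSERT INTO quote VALUES (1, 10.0), (1, 12.5);
""")

# The company is renamed: one UPDATE on one row, no change to quote.
conn.execute("UPDATE company SET ticker = 'NEWT' WHERE ticker = 'OLDT'")

# Users still look things up by the natural key; the join runs on the surrogate.
rows = conn.execute("""
    SELECT c.ticker, q.price
    FROM quote q
    JOIN company c ON c.company_key = q.company_key
    WHERE c.ticker = 'NEWT'
    ORDER BY q.price
""").fetchall()

print(rows)
```

Had the ticker itself been the join column, every quote row would have needed the same update.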
An Example of Adding a Surrogate Key Using Custom SQL
In the following example, we will look at a table containing historical data about product prices. By using a custom SQL expression in the Sisense ElastiCube Manager, we create the surrogate key Prod_Date_Key, which in this case is created by combining the other fields into a single, unique identifier that can easily be queried later.
Original:
SQL used to add surrogate key:
SELECT DISTINCT
tostring(ProductID)+'_'+tostring(getyear(Date))+'-'+tostring(getmonth(Date))+'-'+tostring(Getday(Date)) AS Prod_Date_Key,
Date,
PH.ProductID,
PH.ListPrice
FROM [ProductListPriceHistory] PH JOIN [AllDates] ON Date between PH.StartDate AND PH.EndDate
Result:
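For readers without an ElastiCube at hand, the same key construction can be approximated in SQLite from Python. This is a sketch, not the Sisense result: the sample rows are invented, SQLite's CAST and || operator stand in for tostring() and +, and the date part of the key is the zero-padded ISO date rather than the unpadded year-month-day produced by the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductListPriceHistory (
        ProductID INTEGER, StartDate TEXT, EndDate TEXT, ListPrice REAL
    );
    CREATE TABLE AllDates (Date TEXT);
    -- Invented sample data: one product priced for two days.
    INSERT INTO ProductListPriceHistory VALUES (707, '2013-05-01', '2013-05-02', 34.99);
    INSERT INTO AllDates VALUES ('2013-05-01'), ('2013-05-02'), ('2013-05-03');
""")

rows = conn.execute("""
    SELECT DISTINCT
        CAST(PH.ProductID AS TEXT) || '_' || D.Date AS Prod_Date_Key,
        D.Date,
        PH.ProductID,
        PH.ListPrice
    FROM ProductListPriceHistory PH
    JOIN AllDates D ON D.Date BETWEEN PH.StartDate AND PH.EndDate
    ORDER BY D.Date
""").fetchall()

keys = [row[0] for row in rows]
print(keys)
```

Each product and day combination gets exactly one key, which can then be used to join this table to others at the same grain.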
A surrogate key (or synthetic key, entity identifier, system-generated key, database sequence number, factless key, technical key, or arbitrary unique identifier) in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.[1]
Definition
There are at least two definitions of a surrogate:
- Surrogate (1) – Hall, Owlett and Todd (1976)
- A surrogate represents an entity in the outside world. The surrogate is internally generated by the system but is nevertheless visible to the user or application.[2]
- Surrogate (2) – Wieringa and De Jonge (1991)
- A surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or application.
The Surrogate (1) definition relates to a data model rather than a storage model and is used throughout this article. See Date (1998).
An important distinction between a surrogate and a primary key depends on whether the database is a current database or a temporal database. Since a current database stores only currently valid data, there is a one-to-one correspondence between a surrogate in the modeled world and the primary key of the database. In this case the surrogate may be used as a primary key, resulting in the term surrogate key. In a temporal database, however, there is a many-to-one relationship between primary keys and the surrogate. Since there may be several objects in the database corresponding to a single surrogate, we cannot use the surrogate as a primary key; another attribute is required, in addition to the surrogate, to uniquely identify each object.
Although Hall et al. (1976) say nothing about this, others have argued that a surrogate should have the following characteristics:
- the value is unique system-wide, hence never reused
- the value is system generated
- the value is not manipulable by the user or application
- the value contains no semantic meaning
- the value is not visible to the user or application
- the value is not composed of several values from different domains.
Surrogates in practice
In a current database, the surrogate key can be the primary key, generated by the database management system and not derived from any application data in the database. The only significance of the surrogate key is to act as the primary key. It is also possible that the surrogate key exists in addition to the database-generated UUID (for example, an HR number for each employee other than the UUID of each employee).
A surrogate key is frequently a sequential number (e.g. a Sybase or SQL Server 'identity column', a PostgreSQL or Informix serial, an Oracle or SQL Server SEQUENCE, or a column defined with AUTO_INCREMENT in MySQL). Some databases provide UUID/GUID as a possible data type for surrogate keys (e.g. PostgreSQL UUID or SQL Server UNIQUEIDENTIFIER).
Having the key independent of all other columns insulates the database relationships from changes in data values or database design (making the database more agile) and guarantees uniqueness.
In a temporal database, it is necessary to distinguish between the surrogate key and the business key. Every row would have both a business key and a surrogate key. The surrogate key identifies one unique row in the database, while the business key identifies one unique entity of the modeled world. One table row represents a slice of time holding all the entity's attributes for a defined timespan. Those slices depict the whole lifespan of one business entity. For example, a table EmployeeContracts may hold temporal information to keep track of contracted working hours. The business key for one contract will be identical (non-unique) in both rows; however, the surrogate key for each row is unique.
SurrogateKey | BusinessKey | EmployeeName | WorkingHoursPerWeek | RowValidFrom | RowValidTo |
---|---|---|---|---|---|
1 | BOS0120 | John Smith | 40 | 2000-01-01 | 2000-12-31 |
56 | P0000123 | Bob Brown | 25 | 1999-01-01 | 2011-12-31 |
234 | BOS0120 | John Smith | 35 | 2001-01-01 | 2009-12-31 |
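The table above can be loaded into SQLite to show how the two keys are used together. This is a sketch with Python's sqlite3 module; the helper name contract_on is hypothetical, and the dates are stored as ISO strings so that BETWEEN compares them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EmployeeContracts (
        SurrogateKey        INTEGER PRIMARY KEY,  -- identifies one row (time slice)
        BusinessKey         TEXT NOT NULL,        -- identifies one real-world contract
        EmployeeName        TEXT NOT NULL,
        WorkingHoursPerWeek INTEGER NOT NULL,
        RowValidFrom        TEXT NOT NULL,
        RowValidTo          TEXT NOT NULL
    );
    INSERT INTO EmployeeContracts VALUES
        (1,   'BOS0120',  'John Smith', 40, '2000-01-01', '2000-12-31'),
        (56,  'P0000123', 'Bob Brown',  25, '1999-01-01', '2011-12-31'),
        (234, 'BOS0120',  'John Smith', 35, '2001-01-01', '2009-12-31');
""")


def contract_on(business_key, day):
    """Return (SurrogateKey, WorkingHoursPerWeek) for the slice valid on `day`."""
    return conn.execute(
        "SELECT SurrogateKey, WorkingHoursPerWeek FROM EmployeeContracts "
        "WHERE BusinessKey = ? AND ? BETWEEN RowValidFrom AND RowValidTo",
        (business_key, day),
    ).fetchone()


# The business key returns the whole history; the surrogate key names one slice.
history = [row[0] for row in conn.execute(
    "SELECT SurrogateKey FROM EmployeeContracts "
    "WHERE BusinessKey = 'BOS0120' ORDER BY RowValidFrom"
)]

print(history)
print(contract_on("BOS0120", "2005-06-15"))
```

Querying by BusinessKey alone returns two rows for John Smith, which is why it cannot be the primary key of this table.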
Some database designers use surrogate keys systematically regardless of the suitability of other candidate keys, while others will use a key already present in the data, if there is one.
Some of the alternate names ('system-generated key') describe the way of generating new surrogate values rather than the nature of the surrogate concept.
Approaches to generating surrogates include:
- Universally Unique Identifiers (UUIDs)
- Globally Unique Identifiers (GUIDs)
- Object Identifiers (OIDs)
- Sybase or SQL Server identity column IDENTITY or IDENTITY(n,n)
- Oracle SEQUENCE, or GENERATED AS IDENTITY (starting from version 12.1)[3]
- SQL Server SEQUENCE (starting from SQL Server 2012)[4]
- PostgreSQL or IBM Informix serial
- MySQL AUTO_INCREMENT
- SQLite AUTOINCREMENT
- AutoNumber data type in Microsoft Access
- AS IDENTITY GENERATED BY DEFAULT in IBM DB2
- Identity column (implemented in DDL) in Teradata
- Table Sequence when the sequence is calculated by a procedure and a sequence table with fields: id, sequenceName, sequenceValue and incrementValue
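The last approach in the list, a table sequence, is the only one that has to be built by hand, so a small sketch may help. It uses Python's sqlite3 module and the four fields named above; the table name sequence_table, the function next_value and the sequence name customer_key are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sequence_table (
        id             INTEGER PRIMARY KEY,
        sequenceName   TEXT NOT NULL UNIQUE,
        sequenceValue  INTEGER NOT NULL,
        incrementValue INTEGER NOT NULL
    )
""")
conn.execute(
    "INSERT INTO sequence_table (sequenceName, sequenceValue, incrementValue) "
    "VALUES ('customer_key', 0, 1)"
)
conn.commit()


def next_value(name):
    """Advance the named sequence and return its new value."""
    # The `with` block runs the UPDATE and SELECT in one transaction,
    # committing on success and rolling back on error.
    with conn:
        conn.execute(
            "UPDATE sequence_table "
            "SET sequenceValue = sequenceValue + incrementValue "
            "WHERE sequenceName = ?",
            (name,),
        )
        row = conn.execute(
            "SELECT sequenceValue FROM sequence_table WHERE sequenceName = ?",
            (name,),
        ).fetchone()
    if row is None:
        raise KeyError(name)
    return row[0]


first_three = [next_value("customer_key") for _ in range(3)]
print(first_three)
```

On a multi-user database server the same procedure would also need row locking or an equivalent guarantee so that two sessions cannot read the same value; the built-in mechanisms in the list handle that for you.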
Advantages
Immutability
Surrogate keys do not change while the row exists. This has the following advantages:
- Applications cannot lose their reference to a row in the database (since the identifier never changes).
- The primary or natural key data can always be modified, even with databases that do not support cascading updates across related foreign keys.
Requirement changes
Attributes that uniquely identify an entity might change, which might invalidate the suitability of natural keys. Consider the following example:
- An employee's network user name is chosen as a natural key. Upon merging with another company, new employees must be inserted. Some of the new network user names create conflicts because their user names were generated independently (when the companies were separate).
In these cases, generally a new attribute must be added to the natural key (for example, an original_company column). With a surrogate key, only the table that defines the surrogate key must be changed. With natural keys, all tables (and possibly other, related software) that use the natural key will have to change.
Some problem domains do not clearly identify a suitable natural key. Surrogate keys avoid choosing a natural key that might be incorrect.
Performance
Surrogate keys tend to be a compact data type, such as a four-byte integer. This allows the database to query the single key column faster than it could multiple columns. Furthermore, a non-redundant distribution of keys causes the resulting b-tree index to be completely balanced. Surrogate keys are also less expensive to join (fewer columns to compare) than compound keys.
Compatibility
While using several database application development systems, drivers, and object-relational mapping systems, such as Ruby on Rails or Hibernate, it is much easier to use integer or GUID surrogate keys for every table instead of natural keys in order to support database-system-agnostic operations and object-to-row mapping.
Uniformity
When every table has a uniform surrogate key, some tasks can be easily automated by writing the code in a table-independent way.
Validation
It is possible to design key-values that follow a well-known pattern or structure which can be automatically verified. For instance, the keys that are intended to be used in some column of some table might be designed to 'look differently from' those that are intended to be used in another column or table, thereby simplifying the detection of application errors in which the keys have been misplaced. However, this characteristic of the surrogate keys should never be used to drive any of the logic of the applications themselves, as this would violate the principles of Database normalization.
Disadvantages
Disassociation
The values of generated surrogate keys have no relationship to the real-world meaning of the data held in a row. When inspecting a row holding a foreign key reference to another table using a surrogate key, the meaning of the surrogate key's row cannot be discerned from the key itself. Every foreign key must be joined to see the related data item. If appropriate database constraints have not been set, or data has been imported from a legacy system where referential integrity was not employed, it is possible to have a foreign-key value that does not correspond to a primary-key value and is therefore invalid. (In this regard, C.J. Date regards the meaninglessness of surrogate keys as an advantage.[5])
To discover such errors, one must perform a query that uses a left outer join between the table with the foreign key and the table with the primary key, showing both key fields in addition to any fields required to distinguish the record; all invalid foreign-key values will have the primary-key column as NULL. The need to perform such a check is so common that Microsoft Access actually provides a 'Find Unmatched Query' wizard that generates the appropriate SQL after walking the user through a dialog. (It is, however, not too difficult to compose such queries manually.) 'Find Unmatched' queries are typically employed as part of a data cleansing process when inheriting legacy data.
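A 'Find Unmatched' query of the kind described here might look like the following sketch, run through Python's sqlite3 module. The customer and orders tables and their rows are invented; the orders table deliberately has no foreign-key constraint, imitating data inherited from a legacy system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    -- No REFERENCES clause, so orphaned customer_key values can slip in.
    CREATE TABLE orders (order_key INTEGER PRIMARY KEY, customer_key INTEGER);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);
""")

# Left outer join from the foreign-key table to the primary-key table;
# rows whose primary-key column comes back NULL have no matching parent.
orphans = conn.execute("""
    SELECT o.order_key,
           o.customer_key AS foreign_key,
           c.customer_key AS primary_key
    FROM orders o
    LEFT OUTER JOIN customer c ON c.customer_key = o.customer_key
    WHERE c.customer_key IS NULL
""").fetchall()

print(orphans)
```

Order 12 points at customer 99, which does not exist, so it is the only row reported.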
Surrogate keys are unnatural for data that is exported and shared. A particular difficulty is that tables from two otherwise identical schemas (for example, a test schema and a development schema) can hold records that are equivalent in a business sense, but have different keys. This can be mitigated by NOT exporting surrogate keys, except as transient data (most obviously, in executing applications that have a 'live' connection to the database).
When surrogate keys supplant natural keys, then domain specific referential integrity will be compromised. For example, in a customer master table, the same customer may have multiple records under separate customer IDs, even though the natural key (a combination of customer name, date of birth, and E-mail address) would be unique. To prevent compromise, the natural key of the table must NOT be supplanted: it must be preserved as a unique constraint, which is implemented as a unique index on the combination of natural-key fields.
Query optimization
Relational databases assume a unique index is applied to a table's primary key. The unique index serves two purposes: (i) to enforce entity integrity, since primary key data must be unique across rows and (ii) to quickly search for rows when queried. Since surrogate keys replace a table's identifying attributes—the natural key—and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply indexes on the identifying attributes, or sets of them. Where such sets are themselves a candidate key, the index can be a unique index.
These additional indexes, however, will take up disk space and slow down inserts and deletes.
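The effect of the extra index can be observed with SQLite's EXPLAIN QUERY PLAN, as in this sketch. The customer table, the email column and the index name idx_customer_email are hypothetical, and the exact wording of the plan text varies between SQLite versions, so the example only looks for the keywords SCAN and SEARCH.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key
        email        TEXT NOT NULL,        -- identifying attribute users query by
        name         TEXT
    )
""")

query = "SELECT name FROM customer WHERE email = ?"


def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("a@example.com",)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


plan_before = plan(query)   # only the surrogate key is indexed: full scan
conn.execute("CREATE UNIQUE INDEX idx_customer_email ON customer (email)")
plan_after = plan(query)    # the new index allows a direct search

print(plan_before)
print(plan_after)
```

The index is declared UNIQUE because, in this example, email is assumed to be a candidate key; a plain index would speed up the lookup equally well.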
Normalization
Surrogate keys can result in duplicate values in any natural keys. To prevent duplication, one must preserve the role of the natural keys as unique constraints when defining the table using either SQL's CREATE TABLE statement or ALTER TABLE ... ADD CONSTRAINT statement, if the constraints are added as an afterthought.
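A minimal sketch of preserving the natural key as a unique constraint, using Python's sqlite3 module. The customer table and the constraint name uq_customer_natural_key are made up, and the natural key is the name, date of birth and e-mail combination used as the example earlier in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_key  INTEGER PRIMARY KEY,   -- surrogate key
        name          TEXT NOT NULL,
        date_of_birth TEXT NOT NULL,
        email         TEXT NOT NULL,
        -- The natural key stays enforced even though it is not the primary key.
        CONSTRAINT uq_customer_natural_key UNIQUE (name, date_of_birth, email)
    )
""")

person = ("Jane Doe", "1980-02-29", "jane@example.com")
insert_sql = "INSERT INTO customer (name, date_of_birth, email) VALUES (?, ?, ?)"
conn.execute(insert_sql, person)

# A second insert would receive a fresh surrogate key, so only the
# unique constraint can stop the same person being stored twice.
try:
    conn.execute(insert_sql, person)
    second_insert_rejected = False
except sqlite3.IntegrityError:
    second_insert_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(second_insert_rejected, row_count)
```

Without the constraint, both inserts would succeed and the same customer would exist under two customer_key values.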
Business process modeling
Because surrogate keys are unnatural, flaws can appear when modeling the business requirements. Business requirements, relying on the natural key, then need to be translated to the surrogate key. A strategy is to draw a clear distinction between the logical model (in which surrogate keys do not appear) and the physical implementation of that model, to ensure that the logical model is correct and reasonably well normalised, and to ensure that the physical model is a correct implementation of the logical model.
Inadvertent disclosure
Proprietary information can be leaked if sequential key generators are used. By subtracting a previously generated sequential key from a recently generated sequential key, one could learn the number of rows inserted during that time period. This could expose, for example, the number of transactions or new accounts per period. There are a few ways to overcome this problem:
- Increase the sequential number by a random amount.
- Generate a random key such as a UUID
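The leak and the two mitigations can be sketched in a few lines of Python. The key values, the function name next_key_with_random_step and the max_step default are all hypothetical.

```python
import secrets
import uuid

# Two sequential keys an outsider might observe, e.g. on invoices
# received a month apart (made-up numbers).
key_last_month = 10_250
key_today = 13_975

# Simple subtraction reveals how many rows were created in between.
rows_inserted = key_today - key_last_month


def next_key_with_random_step(current, max_step=1000):
    """Mitigation 1: advance the sequence by a random amount (1..max_step)."""
    return current + secrets.randbelow(max_step) + 1


# Mitigation 2: a random key such as a version 4 UUID carries no count at all.
opaque_key = uuid.uuid4()

print(rows_inserted, next_key_with_random_step(key_today), opaque_key)
```

Random steps blur the count at the cost of using up the key space faster, while a UUID hides it entirely but is larger than an integer key.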
Inadvertent assumptions
Sequentially generated surrogate keys can imply that events with a higher key value occurred after events with a lower value. This is not necessarily true, because such values do not guarantee time sequence as it is possible for inserts to fail and leave gaps which may be filled at a later time. If chronology is important then date and time must be separately recorded.
References
Citations
- [1] 'What is a Surrogate Key? - Definition from Techopedia'. Techopedia.com. Retrieved 2020-02-21.
- [2] P. A. V. Hall, J. Owlett, S. J. P. Todd, 'Relations and Entities', Modelling in Data Base Management Systems (ed. G. M. Nijssen), North Holland, 1976.
- [3] http://docs.oracle.com/database/121/SQLRF/statements_7002.htm#SQLRF01402
- [4] https://msdn.microsoft.com/en-us/library/ff878091.aspx
- [5] C. J. Date. 'The primacy of primary keys'. From Relational Database Writings, 1991-1994. Addison-Wesley, Reading, MA.
Sources
- This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the 'relicensing' terms of the GFDL, version 1.3 or later.
- Nijssen, G.M. (1976). Modelling in Data Base Management Systems. North-Holland Pub. Co. ISBN 0-7204-0459-2.
- Engles, R.W. (1972). A Tutorial on. CiteSeerX 10.1.1.16.3195.
- Date, C. J. (1998). 'Chapters 11 and 12'. Relational Database Writings 1994–1997. ISBN 0201398141.
- Carter, Breck. 'Intelligent Versus Surrogate Keys'. Retrieved 2006-12-03.
- Richardson, Lee. 'Create Data Disaster: Avoid Unique Indexes – (Mistake 3 of 10)'. Archived from the original on 2008-01-30. Retrieved 2008-01-19.
- Berkus, Josh. 'Database Soup: Primary Keyvil, Part I'. Retrieved 2006-12-03.