To generate surrogate keys, add a Surrogate Key Generator stage to a job with a single output link to another stage. About this task If you want to pass input columns to the next stage in the job, the Surrogate Key Generator stage can also have an input link.
This article demonstrates how to “roll your own” surrogate keys and sequences in a platform-independent way, using standard SQL.
Relational theory talks about something called a “candidate key.” In SQL terms, a candidate key is any combination of columns that uniquely identifies a row (SQL and the relational model aren’t the same thing, but I’ll put that aside for this article). The data’s primary key is the minimal candidate key. Many people think a primary key is something the DBA defines, but that’s not true. The primary key is a property of the data, not the table that holds the data.
Unfortunately, the minimal candidate key is sometimes not a good primary key in the real world. For example, if the primary key is 6 columns wide and I need to refer to a row from another table, it’s impractical to make a 6-column wide foreign key. For this reason, database designers sometimes introduce a surrogate key, which uniquely identifies every row in the table and is “more minimal” than the inherently unique aspect of the data. The usual choice is a monotonically increasing integer, which is small and easy to use in foreign keys.
Every RDBMS of which I’m aware offers a feature to make surrogate keys easier by automatically generating the next larger value upon insert. In SQL Server, it’s called an IDENTITY
column. In MySQL, it’s called AUTO_INCREMENT
. It’s possible to generate the value in SQL, but it’s easier and generally safer to let the RDBMS do it instead. This does lead to some issues itself, such as the need to find out the value that was generated by the last insertion, but those are usually not hard to solve (LAST_INSERT_ID()
and similar functions, for example).
It’s sometimes desirable not to use the provided feature. For instance, I might want to be sure I always use the next available number. In that case, I can’t use the built-in features, because they don’t generate the next available number under some circumstances. For example, SQL Server doesn’t decrement the internal counter when transactions are rolled back, leaving holes in the data (see my article on finding missing numbers in a sequence). Neither MySQL nor SQL Server decrements the counter when rows are deleted.
In these cases, it’s possible to generate the next value in the insert statement. Suppose my table looks like this:
The next value for c1
is simply the maximum value + 1. If there is no maximum value, it is 1, which is the same as 0 + 1.
There are platform-dependent ways to write that statement as well, such as using SQL Server’s ISNULL
function or MySQL’s IFNULL
. This code can be combined into an INSERT
statement, such as the following statement to insert 3 into the second column:
The code above is a single atomic statement and will prevent any two concurrent inserts from getting the same value for c1
. It is not safe to find the next value in one statement and use it in another, unless both statements are in a transaction. I would consider that a bad idea, though. There’s no need for a transaction in the statement above.
Downsides to this approach are inability to find the value of c1
immediately after inserting, and inability to insert multiple rows at once. The first problem is inherently caused by inserting meaningless data, and is always a problem, even with the built-in surrogate keys where the RDBMS provides a mechanism to retrieve the value.
Surrogate keys are often considered very bad practice, for a variety of good reasons I won’t discuss here. Sometimes, though, there is just nothing for it but to artificially unique-ify the data. In these cases, a sequence number can often be a less evil approach. A sequence is just a surrogate key that restarts at 1 for each group of related records. For example, consider a table of log entries related to records in my t1
table:
At this point I might want to enter some more records (0, 11) into t1
:
Now suppose I want the following three log entries for the first row in t1
:
There’s no good primary key in this data. I will have to add a surrogate key. It might seem I could add a date-time column instead, but that’s a dangerous design. It breaks as soon as two records are inserted within a timespan less than the maximum resolution of the data type. It also breaks if two records are inserted in a single transaction where the time is consistent from the first to the last statement. I’m much happier with a sequence column. The following statement will insert the log records as desired:
If I want to enter a log record on another record in t1
, the sequence will start at 1 for it:
MySQL actually allows an AUTO_INCREMENT
value to serve as a sequence for certain table types (MyISAM and BDB). To do tihs, just make the column the last column in a multi-column primary key. I’m not aware of any other RDBMS that does this.
Apache Hive is a SQL-like software used with Hadoop to give users the capability of performing SQL-like queries on its own language, HiveQL, quickly and efficiently. It also gives users additional query and analytical abilities not available on traditional SQL structures.
With Apache Hive, users can use HiveQL or traditional MapReduce systems, depending on individual needs and preferences. Hive is particularly ideal for analyzing large datasets (petabytes) and includes a variety of storage options.
A data lake is essentially a collection of all the data your company collects about its customers, operations, transactions and more. Think of all your company’s data sources as individual pools of data. However, rather than bringing all the data at one time and let users figure out how to use it, we chose to do a focused implementation of data lake where specific datasets targeted for a prioritized business needs are being brought into data lake regulated by data governance.
The data governance registers the data catalogue, defines data quality metrics and rules for data security and access. This creates an environment where everyone has visibility into what data is available, what’s the data quality and where it can be used.
Normalization is a standard process used to model data tables with rules to deal with redundancy of data and anomalies. Simply stated, normalized data, results in creation of multiple relational tables which needs to be joined at the run time to produce the desired results. Joins are expensive and difficult operations to perform specially with large data sets.
We used a hybrid approach for data modeling.
The Logical Data Model was normalized to show entities and their relationships to business users. This helped business users understand the data and its relationships and getting their approval. Business friendly full names were used for attributes to help key stakeholders understand the data better.
Entities were merged and de-normalized to take advantage of HIVE’s ability to process large tables efficiently. This also reduced the number of joins needed for querying the data. Columns names were standardized and abbreviated using enterprise naming conventions.
A surrogate key is any column or set of columns that can be declared as the primary key instead of a 'real' or natural key. It is usually populated using a sequence generator or identity column in a RDBMS, however, Hive doesn’t have a sequence generator or identity column.
To generate the surrogate key value in HIVE, one must use “ROW_NUMBER () OVER ()” function. When the query is run using “ROW_NUMBER () OVER ()” function, the complete data set is loaded into the memory. This requires lot of resources to generate the key which creates major overhead and performance issues, taking hours to run for few hundreds of million rows.
To mitigate the problem a hashing function was used to generate surrogate key values. The MD5 algorithm is a widely used hash function producing a 128-bit hash value. We used HASH “md5” function to generate an alphanumeric surrogate key. Generally, joining using alphanumeric key impacts performance. We measured the performance impact of using alphanumeric key and observed that there was about 10% degradation in performance while querying the data. However, load performance was significantly better and load time reduced from about 16 hours to 90 mins.
By default, HIVE does not allow updates or deletes. To enable this functionality, one need to make changes to ACID setting in Ambari. There is significant query performance degradation when the ACID setting is enabled, and It is recommended not to enable this setting.
However, business required current set of data to be available for analysis. Inserting all the data daily was not a viable option due to large amount of data. This required daily changed data to be processed to provide a current view of the data.
We used an innovative approach to perform updates and deletes. We created a snapshot of source data by applying today's changes using following approach:
Type of Change Yesterday’s Snapshot Today’s Snapshot
Insert Not Existing Record Added
Update Existing Updated row from today’s change
Delete Existing Not Existing
Using this approach, the update and delete functionality is provided in HIVE in a very efficient manner with no impact to query or load performance.
Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. In order to get optimal performance from Hive, it’s important for users to use best practices for Hive implementation. We used following ways to improve Hive performance.
Joins are expensive and difficult operations to perform and are one of the common reasons for performance issues. Because of that, it’s a good idea to avoid highly normalized table structures because they require join queries to derive the desired metrics. We created large de-normalized tables that minimized number of joins needed to improve performance.
In non-partitioned tables, Hive would have to read all the files in a table’s data directory and subsequently apply filters on it. This is slow and expensive - especially in cases of large tables.
Partitioning improves the performance of large tables as only the subset of data required for query needs to be read. Hive supports multi column partitions. The partitioned columns are not included in the table definition but can still be used in query projections.
Partitions were created with extract id for different data sets. The partitions were added to external tables pointing to extract folder for each data set table.
Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically use this parallelism. However, single, complex Hive queries commonly are translated to a number of MapReduce jobs that are executed by default sequencing.
Frequently some of a query’s MapReduce stages are not interdependent and could be executed in parallel. These can take advantage of spare capacity on a cluster and improve cluster utilization while at the same time reducing the overall query executions time.
To enable parallel execution, change the configuration parameter SET hive.exec.parallel = true.
Vectorization allows Hive to process a batch of rows together instead of processing one row at a time. Each batch consists of a column vector which is usually an array of primitive types. Operations are performed on the entire column vector, which improves the instruction pipelines and cache usage. By vectorized query execution, one can improve performance of operations like scans, aggregations, filters and joins, by performing them in batches at once instead of single row each time.
To enable vectorization, change this configuration parameter SET hive.vectorized.execution.enabled = true.
To get optimal performance from HIVE, it’s very important to use the right file format. Data profiling of the files should be used to define the file format for tables. We use following formats extensively to create the data lake.
Optimized Row Columnar (ORC) format should be used for:
· Tables with large volume of data
· Query/data access is typically for a subset of columns in the table
Advantages of ORC format are:
· A single file as the output of each task, which reduces the NameNode's load
· Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
· Light-weight indexes stored within the file
o Skip row groups that don't pass predicate filtering
o Seek to a given row
· Block-mode compression based on data type
o Run-length encoding for integer columns
o Dictionary encoding for string columns
· Concurrent reads of the same file using separate RecordReaders
· Ability to split files without scanning for markers
· Bound the amount of memory needed for reading or writing
· Metadata stored using protocol buffers, which allows addition and removal of fields
Data is ingested from the source systems into flat files in the data lake. Though, it is possible to link the flat files to HIVE external table, however, there will be no error handling if there is a problem with data. The rows will be dropped on the floor without any logging.
To mitigate the problem, we converted the extract files into AVRO format. AVRO will handle error rows and it will create a reject file which is sent to source team to fix the data and send the rejected rows in the next extract.
AVRO doesn’t support some datatype of sources like datetime. These should be converted to a different data type in AVRO file and then converted back e.g. convert datetime bigint in AVRO file and then convert back to datetime in the target.
Small tables should be created in default file format. Creating ORC file format for small tables hampers the performance.