Data Process Engine: Related Computing in esProc - Primary Keys and Index Function

In an esProc table sequence, a single or multiple fields can be used as a primary key. We can make query based on the primary key using some special functions, which can both simplify the code and increase computational performance effectively.

1.find and pfind

Primary keys are common used in database tables. The value of a primary key field is used to uniquely identify a record, so primary keys must not contain identical values, which is a mandatory rule in many databases.

In esProc, it is assumed that the primary key has unique value among the records. But no mandatory check will be executed and there won't be an error reporting even if there are identical primary key values. Both T.find(v) function and T.find(v) function can be used to query records in table sequence T according to v, the primary key value. find returns the first record found and pfind returns the sequence number of this record.

It is not a must to define a primary key. When no primary key is defined in esProc, the first field will be used as the primary key. If specified fields are needed to be defined as the primary key, T.primary(F_i,…) function, which means defining F_i,… as the primary key of table sequence T, will be used.

Usually, conventional position functions T.select() and T.pselect() are used to query records in a table sequence. Now we’ll compare usages of this pair of functions with pfind and find.

Here table EMPLOYEE in demo database is the table sequence on which query will be executed. Add FullName field to it so that the employees’ full names can be used to make query:

In order to show the performance advantages of using the primary key to query data, 10,000 full names will be generated randomly and pselect and pfind will be used respectively to make query according to these full names. Compute the time these two methods will take:

Based on the same data, A5 and A9 use pselect and pfind respectively to query positions of the records in the table sequence. In A9, primary function is used to set the primary key for the table sequence before pfind starts to work. A6 and A10 compute respectively the milliseconds the queries will take:

But the query results in A5 and A9 are same:

Using similar method, we can compare select function and find function. To keep in line with find function, @1 option is used in select function to get the first result and return it:

A6 and A10 compute respectively the milliseconds the queries will take:

Still, the query results in A5 and A9 are same:

It can be seen easily from the comparison that the query functions based on the primary key are much more efficient than conventional position functions.

2.Primary key and the index table

Why, in esProc, will the efficiency increase significantly when the primary key is used to make query? The reason is that the index table of the primary key has been used for computing.

During calling a primary function, or making query based on the primary key in a table sequence without an index table for the first time, an index table will be generated according to the primary key. While the index table is being generated, a hash table will be created according to all values of the primary key, which will divide primary key values into many groups by their hash values. These hash values are the corresponding group numbers.

Normally, when we query a certain record in a table sequence according to the field value, we need to examine the records one by one until the target is found. For a table sequence containing n records, an average of n/2 examinations are needed.

Thanks to the index table, it would be different to query a certain record in a table sequence according to the value of the primary key field. The hash values will be computed first according to the primary key values, which enable us to find out directly the corresponding groups in the index table. Then we just need to examine records of the same group. In the same way, for a table sequence containing n records, if its primary key values are distributed in k groups according to hash values, only an average of n/2k comparisons are needed. Using this method, despite hash values must be computed before an index table is generated and the query is executed, the number of comparisons is reduced significantly, and in particular, the index table needs to be generated only once. Therefore, the more the data in a table sequence and the times needed for a query, the higher the efficiency.

During computing, T.index(n)function can be used to create an index table for T’s primary key in advance. n represents the length of the index table. Default value will be used if there is not a defined length.

What we should know is that the reason why the functions for query based on primary key values, such as find and pfind, can enhance computational performance effectively is that an index table has been created for the primary key in a table sequence. Therefore, if the primary key itself can be used as the index to locate records, it is unnecessary to create an index table. EID field itself in the above-mentioned table EMPLOYEE, for example, represents the positions of records in the table sequence, thus it is more efficient to use it to query the corresponding records:

This time 10,000 sequence numbers of employees are generated randomly. A4 finds the corresponding records directly according to these sequence numbers; A7 still uses find function to find records. A5 and A8 compute respectively the time the two methods will take:

We can see that it is much faster to locate records using sequence numbers directly. Because this method doesn’t compare field values, nor does it compute the hash values and create an index table. The query results in A4 and A7 are same:

Thus it can be seen that suitability should be taken into consideration if the index function of the primary key in a table sequence is to be used to increase efficiency in esProc.

3.switch function

Besides find function and pfind function, switch function query records according to the primary values too. It will also use the index table of the primary key automatically in dealing with the operation. For example:

Both A2 and A3 contain personnel information imported from binary text file PersonnelInfo, as shown below:

A4 contains states information:

In both A6 and A9, State field of PersonnelInfo is switched into corresponding states information. Their difference is that A6 uses select@1 function while A9 uses switch function. A7 and A9 compute respectively the time the two methods will take:

After the code in the cellset is executed, values of A2 and A3 are same, as shown below：

State field has been switched into corresponding records in states information table.

Before switch is executed, an index table will also be created for corresponding fields in the table sequence. In this example, an index table is created for ABBR field of states information table in A4 to increase the matching efficiency. So switch function should be properly used when foreign key fields for referencing are to be generated.

Data Process Engine

menu

12/18/2014

Related Computing in esProc - Primary Keys and Index Function

1.find and pfind

2.Primary key and the index table

3.switch function

No comments:

Post a Comment