Select (SQL): Difference between revisions

Revision as of 04:50, 30 December 2018

The SQL SELECT statement returns a result set of records from one or more tables.^[1]^[2]

A SELECT statement retrieves zero or more rows from one or more database tables or database views. In most applications, SELECT is the most commonly used data query language (DQL) command. As SQL is a declarative programming language, SELECT queries specify a result set, but do not specify how to calculate it. The database translates the query into a "query plan" which may vary between executions, database versions and database software. This functionality is called the "query optimizer" as it is responsible for finding the best possible execution plan for the query, within applicable constraints.

The SELECT statement has many optional clauses:

WHERE specifies which rows to retrieve.
GROUP BY groups rows sharing a property so that an aggregate function can be applied to each group.
HAVING selects among the groups defined by the GROUP BY clause.
ORDER BY specifies an order in which to return the rows.
AS provides an alias which can be used to temporarily rename tables or columns.

Examples

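Each example below shows the contents of table "T", a query, and the result of the query.

Table "T":
C1	C2
1	a
2	b

Query: SELECT * FROM T;

Result:
C1	C2
1	a
2	b

Table "T":
C1	C2
1	a
2	b

Query: SELECT C1 FROM T;

Result:
C1
1
2

Table "T":
C1	C2
1	a
2	b

Query: SELECT * FROM T WHERE C1 = 1;

Result:
C1	C2
1	a

Table "T":
C1	C2
1	a
2	b

Query: SELECT * FROM T ORDER BY C1 DESC;

Result:
C1	C2
2	b
1	a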

Given a table T, the query SELECT * FROM T will result in all the elements of all the rows of the table being shown.

With the same table, the query SELECT C1 FROM T will result in the elements from the column C1 of all the rows of the table being shown. This is similar to a projection in relational algebra, except that in the general case, the result may contain duplicate rows. This is also known as a vertical partition in some database terms, restricting query output to view only specified fields or columns.

With the same table, the query SELECT * FROM T WHERE C1 = 1 will result in all the elements of all the rows where the value of column C1 is '1' being shown. In relational algebra terms, a selection is performed because of the WHERE clause. This is also known as a horizontal partition, restricting the rows output by a query according to specified conditions.
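The queries discussed above can be tried directly. The following is a minimal sketch that assumes SQLite, driven through Python's standard `sqlite3` module; the column types are an assumption, since the article does not give them.

```python
import sqlite3

# In-memory database holding the example table T (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (C1 INTEGER, C2 TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?)", [(1, "a"), (2, "b")])

# All columns of all rows (ORDER BY added only to make the order predictable).
all_rows = conn.execute("SELECT * FROM T ORDER BY C1").fetchall()

# Projection: only column C1.
projection = conn.execute("SELECT C1 FROM T ORDER BY C1").fetchall()

# Selection: only the rows where C1 = 1.
selection = conn.execute("SELECT * FROM T WHERE C1 = 1").fetchall()

# Same rows as all_rows, in descending order of C1.
descending = conn.execute("SELECT * FROM T ORDER BY C1 DESC").fetchall()

for name, rows in [("all", all_rows), ("projection", projection),
                   ("selection", selection), ("descending", descending)]:
    print(name, rows)
```

Each result is a list of tuples, one tuple per returned row.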

With more than one table, the result set will be every combination of rows. So if two tables are T1 and T2, SELECT * FROM T1, T2 will result in every combination of a T1 row with a T2 row. E.g., if T1 has 3 rows and T2 has 5 rows, then 15 rows will result.
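The row count of such a combination can be checked with a small sketch. It assumes SQLite through Python's `sqlite3` module, and the single-column tables `T1(a)` and `T2(b)` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (a INTEGER)")  # hypothetical column name
conn.execute("CREATE TABLE T2 (b INTEGER)")  # hypothetical column name
conn.executemany("INSERT INTO T1 VALUES (?)", [(i,) for i in range(3)])
conn.executemany("INSERT INTO T2 VALUES (?)", [(j,) for j in range(5)])

# Listing two tables without a join condition pairs every T1 row
# with every T2 row.
pairs = conn.execute("SELECT * FROM T1, T2").fetchall()
print(len(pairs))
```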

The SELECT clause specifies a list of properties (columns) by name, or the wildcard character (“*”) to mean “all properties”.

Limiting result rows

Often it is convenient to indicate a maximum number of rows that are returned. This can be used for testing or to prevent consuming excessive resources if the query returns more information than expected. The approach to do this often varies per vendor.

In ISO SQL:2003, result sets may be limited by using:

cursors, or
SQL window functions added to the SELECT statement

ISO SQL:2008 introduced the FETCH FIRST clause.

According to the PostgreSQL 9 documentation, an SQL window function performs a calculation across a set of table rows that are somehow related to the current row, in a way similar to aggregate functions.^[3] The name recalls signal processing window functions. A window function call always contains an OVER clause.

ROW_NUMBER() window function

ROW_NUMBER() OVER may be used to number the returned rows and then filter on that number, e.g. to return no more than ten rows:

SELECT * FROM
( SELECT
    ROW_NUMBER() OVER (ORDER BY sort_key ASC) AS row_number,
    columns
  FROM tablename
) AS foo
WHERE row_number <= 10

ROW_NUMBER can be non-deterministic: if sort_key is not unique, each time you run the query it is possible to get different row numbers assigned to any rows where sort_key is the same. When sort_key is unique, each row will always get a unique row number.
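As a runnable illustration of limiting rows with ROW_NUMBER, here is a sketch that assumes SQLite 3.25 or later (the first release with window functions), driven through Python's `sqlite3` module. The `payload` column and the alias `rn` are hypothetical; `sort_key` is unique here, so the numbering is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (sort_key INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [(k, f"row {k}") for k in range(1, 26)],  # 25 rows, unique sort_key
)

# Number the rows by sort_key, then keep only the first ten.
limited = conn.execute(
    """
    SELECT sort_key, payload FROM
    ( SELECT
        ROW_NUMBER() OVER (ORDER BY sort_key ASC) AS rn,
        sort_key,
        payload
      FROM tablename
    ) AS foo
    WHERE rn <= 10
    ORDER BY rn
    """
).fetchall()
print(len(limited))
```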

RANK() window function

The RANK() OVER window function acts like ROW_NUMBER, but may return more or less than n rows in case of tie conditions, e.g. to return the top-10 youngest persons:

SELECT * FROM (
  SELECT
    RANK() OVER (ORDER BY age ASC) AS ranking,
    person_id,
    person_name,
    age
  FROM person
) AS foo
WHERE ranking <= 10

The above code could return more than ten rows, e.g. if there are two people of the same age, it could return eleven rows.
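The tie behaviour can be reproduced with a small sketch, again assuming SQLite 3.25 or later through Python's `sqlite3` module. The sample ages are invented so that two people tie for tenth place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (person_id INTEGER, person_name TEXT, age INTEGER)"
)
# Ages 20..28 for the first nine people, two people aged 29, one aged 30.
ages = list(range(20, 29)) + [29, 29, 30]
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(i + 1, f"person {i + 1}", age) for i, age in enumerate(ages)],
)

QUERY = """
SELECT person_id, age, ranking FROM (
  SELECT
    {func}() OVER (ORDER BY age ASC) AS ranking,
    person_id,
    age
  FROM person
) AS foo
WHERE ranking <= 10
"""

# RANK gives both 29-year-olds rank 10, so eleven rows pass the filter.
ranked = conn.execute(QUERY.format(func="RANK")).fetchall()
# ROW_NUMBER breaks the tie arbitrarily, so exactly ten rows pass.
numbered = conn.execute(QUERY.format(func="ROW_NUMBER")).fetchall()
print(len(ranked), len(numbered))
```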

FETCH FIRST clause

Since ISO SQL:2008, result limits can be specified using the FETCH FIRST clause, as in the following example.

SELECT * FROM T FETCH FIRST 10 ROWS ONLY

This clause currently is supported by CA DATACOM/DB 11, IBM DB2, SAP SQL Anywhere, PostgreSQL, EffiProz, H2, HSQLDB version 2.0, Oracle 12c and Mimer SQL.

Microsoft SQL Server 2012 and higher supports FETCH FIRST, but it is considered part of the ORDER BY clause. The ORDER BY, OFFSET, and FETCH FIRST clauses are all required for this usage.

SELECT * FROM T ORDER BY acolumn DESC OFFSET 0 ROWS FETCH FIRST 10 ROWS ONLY

Non-standard syntax

Some DBMSs offer non-standard syntax either instead of or in addition to SQL standard syntax. Below, variants of the simple limit query for different DBMSs are listed:

`SET ROWCOUNT 10 SELECT * FROM T`	MS SQL Server (this also works on Microsoft SQL Server 6.5, while `SELECT TOP 10 * FROM T` does not)
`SELECT * FROM T LIMIT 10 OFFSET 20`	Netezza, MySQL, SAP SQL Anywhere, PostgreSQL (also supports the standard, since version 8.4), SQLite, HSQLDB, H2, Vertica, Polyhedra, Couchbase Server
`SELECT * from T WHERE ROWNUM <= 10`	Oracle
`SELECT FIRST 10 * from T`	Ingres
`SELECT FIRST 10 * FROM T order by a`	Informix
`SELECT SKIP 20 FIRST 10 * FROM T order by c, d`	Informix (row numbers are filtered after order by is evaluated. SKIP clause was introduced in a v10.00.xC4 fixpack)
`SELECT TOP 10 * FROM T`	MS SQL Server, SAP ASE, MS Access, SAP IQ, Teradata
`SELECT * FROM T SAMPLE 10`	Teradata
`SELECT TOP 10 START AT 20 * FROM T`	SAP SQL Anywhere (also supports the standard, since version 9.0.1)
`SELECT FIRST 10 SKIP 20 * FROM T`	Firebird
`SELECT * FROM T ROWS 20 TO 30`	Firebird (since version 2.1)
`SELECT * FROM T WHERE ID_T > 10 FETCH FIRST 10 ROWS ONLY`	DB2
`SELECT * FROM T WHERE ID_T > 20 FETCH FIRST 10 ROWS ONLY`	DB2 (new rows are filtered after comparing with key column of table T)

Rows Pagination

Rows pagination^[4] is an approach used to limit and display only part of the total data returned by a query. Instead of showing hundreds or thousands of rows at the same time, the client requests only one page from the server (a limited set of rows, for example only 10 rows), and the user navigates by requesting the next page, and then the next one, and so on. It is especially useful in web systems, where there is no dedicated connection between the client and the server, so the client does not have to wait to read and display all the rows from the server.

Data in the pagination approach:

{rows} = number of rows in a page
{page_number} = number of the current page
{begin_base_0} = zero-based number of the row where the page starts = (page_number-1) * rows
{sorting_cols} = the columns used to sort the rows. Their combined values must be unique, so that the rows appear in the same order each time the same query is executed. This is achieved by placing any column or columns in the "order by" and adding the field or fields of the primary key or of any other unique index at the end of the list

Different methods

Simplest method (but very inefficient):
1) Select all rows from the database. Remember that {sorting_cols} must have unique values.
2) Read all rows but send to display only when the row_number of the rows read is between {begin_base_0 + 1} and {begin_base_0 + rows}

Select * 
from {table} 
order by {sorting_cols}

Another simple method (a little more efficient than reading all rows):
1) Select all the rows from the beginning of the table to the last row to display ({begin_base_0 + rows}). Remember that {sorting_cols} must have unique values.
2) Read the {begin_base_0 + rows} rows but send to display only when the row_number of the rows read is greater than {begin_base_0}

SQL	Dialect
select * from {table} order by {sorting_cols} FETCH FIRST {begin_base_0 + rows} ROWS ONLY	SQL ANSI 2008, PostgreSQL, SQL Server 2012, Derby, Oracle 12c, DB2 12
Select * from {table} order by {sorting_cols} LIMIT {begin_base_0 + rows}	MySQL, SQLite
Select TOP {begin_base_0 + rows} * from {table} order by {sorting_cols}	Sybase 12.5.3, SQL Server 2005
SET ROWCOUNT {begin_base_0 + rows} Select * from {table} order by {sorting_cols} SET ROWCOUNT 0	Sybase 4, SQL Server 4
Select * FROM ( SELECT * FROM {table} ORDER BY {sorting_cols} ) a where rownum <= {begin_base_0 + rows}	Oracle 11

Method with positioning:
1) Select only {rows} rows starting from the next row to display ({begin_base_0 + 1}). Remember that {sorting_cols} must have unique values.
2) Read and send to display all the rows read from the database

SQL	Dialect
Select * from {table} order by {sorting_cols} OFFSET {begin_base_0} ROWS FETCH NEXT {rows} ROWS ONLY	SQL ANSI 2008, PostgreSQL, SQL Server 2012, Derby, Oracle 12c, DB2 12
Select * from {table} order by {sorting_cols} LIMIT {rows} OFFSET {begin_base_0}	MySQL, MariaDB, PostgreSQL, SQLite
Select * from {table} order by {sorting_cols} LIMIT {begin_base_0}, {rows}	MySQL, MariaDB, SQLite
select TOP {begin_base_0 + rows} *, _offset=identity(10) into #temp from {table} ORDER BY {sorting_cols} select * from #temp where _offset > {begin_base_0} DROP TABLE #temp	Sybase 12.5.3
SET ROWCOUNT {begin_base_0 + rows} select *, _offset=identity(10) into #temp from {table} ORDER BY {sorting_cols} select * from #temp where _offset > {begin_base_0} DROP TABLE #temp SET ROWCOUNT 0	Sybase (older versions that support the identity function)
select TOP {rows} * from ( select *, ROW_NUMBER() over (order by {sorting_cols}) as _offset from {table} ) a where _offset > {begin_base_0}	SQL Server 2005
SET ROWCOUNT {begin_base_0 + rows} select *, _offset=identity(int,1,1) into #temp from {table} ORDER BY {sorting_cols} select * from #temp where _offset > {begin_base_0} DROP TABLE #temp SET ROWCOUNT 0	SQL Server 2000
SELECT * FROM ( SELECT rownum-1 as _offset, a.* FROM( SELECT * FROM {table} ORDER BY {sorting_cols} ) a WHERE rownum <= {begin_base_0 + rows} ) WHERE _offset >= {begin_base_0}	Oracle 11
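The positioning method can be sketched with the LIMIT/OFFSET variant. This example assumes SQLite through Python's `sqlite3` module; the `items` table and the `fetch_page` helper are hypothetical, and the unique `id` column plays the role of {sorting_cols}.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"item {i}") for i in range(1, 26)],  # 25 rows
)

def fetch_page(conn, page_number, rows):
    """Return the ids on the given 1-based page of `rows` rows each."""
    begin_base_0 = (page_number - 1) * rows  # zero-based first row of the page
    sql = "SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?"
    return [r[0] for r in conn.execute(sql, (rows, begin_base_0))]

page_1 = fetch_page(conn, 1, 10)
page_3 = fetch_page(conn, 3, 10)  # last, partially filled page
page_4 = fetch_page(conn, 4, 10)  # past the end of the table
print(page_1, page_3, page_4)
```

The database still has to walk past the skipped rows, which is why the filter method described next is preferred for very large tables.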

Method with filter (more sophisticated, but necessary for very big data sets):
1) Select only {rows} rows with a filter:
1.1) First Page: select only the first {rows} rows, depending on the type of database. Remember that {sorting_cols} must have unique values; for a very big data set there are further considerations
1.2) Next Page: select only the first {rows} rows, depending on the type of database, where the {sorting_cols} is greater than {last_val} (the value of the {sorting_cols} of the last row in the current page)
1.3) Previous Page: sort the data in the reverse order, select only the first {rows} rows, where the {sorting_cols} is less than {first_val} (the value of the {sorting_cols} of the first row in the current page), and sort the result in the correct order
2) Read and send to display all the rows read from the database

First Page	Next Page	Previous Page	Dialect
select * from {table} order by {sorting_cols} FETCH FIRST {rows} ROWS ONLY	select * from {table} where {sorting_cols} > {last_val} order by {sorting_cols} FETCH FIRST {rows} ROWS ONLY	select * from ( Select * from {table} where {sorting_cols} < {first_val} order by {sorting_cols} DESC FETCH FIRST {rows} ROWS ONLY ) a order by {sorting_cols}	SQL ANSI 2008, PostgreSQL, SQL Server 2012, Derby, Oracle 12c, DB2 12
select * from {table} order by {sorting_cols} LIMIT {rows}	select * from {table} where {sorting_cols} > {last_val} order by {sorting_cols} LIMIT {rows}	select * from ( select * from {table} where {sorting_cols} < {first_val} order by {sorting_cols} DESC LIMIT {rows} ) a order by {sorting_cols}	MySQL, SQLite
select TOP {rows} * from {table} order by {sorting_cols}	select TOP {rows} * from {table} where {sorting_cols} > {last_val} order by {sorting_cols}	select * from ( select TOP {rows} * from {table} where {sorting_cols} < {first_val} order by {sorting_cols} DESC ) a order by {sorting_cols}	SQL Server 2005
SET ROWCOUNT {rows} select * from {table} order by {sorting_cols} SET ROWCOUNT 0	SET ROWCOUNT {rows} select * from {table} where {sorting_cols} > {last_val} order by {sorting_cols} SET ROWCOUNT 0	SET ROWCOUNT {rows} select * from ( select * from {table} where {sorting_cols} < {first_val} order by {sorting_cols} DESC ) a order by {sorting_cols} SET ROWCOUNT 0	Sybase, SQL Server 2000
select * from ( select * from {table} order by {sorting_cols} ) a where rownum <= {rows}	select * from ( select * from {table} where {sorting_cols} > {last_val} order by {sorting_cols} ) a where rownum <= {rows}	select * from ( select * from ( select * from {table} where {sorting_cols} < {first_val} order by {sorting_cols} DESC ) a1 where rownum <= {rows} ) a2 order by {sorting_cols}	Oracle 11
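The three page queries of the filter method can be sketched with the LIMIT variant. This example assumes SQLite through Python's `sqlite3` module; the `items` table and the helper functions are hypothetical, and the unique `id` column plays the role of {sorting_cols}.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"item {i}") for i in range(1, 26)],
)

def first_page(rows):
    sql = "SELECT id FROM items ORDER BY id LIMIT ?"
    return [r[0] for r in conn.execute(sql, (rows,))]

def next_page(last_val, rows):
    # Only rows after the last row of the current page.
    sql = "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?"
    return [r[0] for r in conn.execute(sql, (last_val, rows))]

def previous_page(first_val, rows):
    # Read backwards from the first row of the current page,
    # then restore ascending order in the outer query.
    sql = (
        "SELECT id FROM ("
        "  SELECT id FROM items WHERE id < ? ORDER BY id DESC LIMIT ?"
        ") AS a ORDER BY id"
    )
    return [r[0] for r in conn.execute(sql, (first_val, rows))]

page_1 = first_page(10)
page_2 = next_page(page_1[-1], 10)
back_to_1 = previous_page(page_2[0], 10)
print(page_1, page_2, back_to_1)
```

Each query touches at most {rows} rows beyond the filter boundary, regardless of how deep into the table the page is.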

Considerations with very big data sets

A "very big data set" here means a table with hundreds of thousands or millions of rows to be paged.

Unique Values

If {sorting_cols} did not have unique values, rows could be lost. For example, suppose the page size is 10 and 5 rows share the same values in the sorting columns, of which 2 fall at the end of one page. The other 3 should appear as the first 3 rows of the next page, but because the next page is filtered with "{sorting_cols} > {last_val}", those 3 rows are skipped. If the values of the sorting columns are unique, the first row that fulfils the condition "{sorting_cols} > {last_val}" is necessarily the next row, and no rows are ever lost between pages.
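The lost-row scenario can be reproduced with a small sketch, assuming SQLite through Python's `sqlite3` module. The `people` table and its data are invented: eight distinct names, then five rows named "Smith", then two named "Young", paged ten rows at a time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, last_name TEXT)")
names = list("ABCDEFGH") + ["Smith"] * 5 + ["Young"] * 2
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(i + 1, n) for i, n in enumerate(names)],
)
total = len(names)

# Non-unique sorting column: the page boundary falls inside the Smiths.
page_1 = conn.execute(
    "SELECT last_name FROM people ORDER BY last_name LIMIT 10"
).fetchall()
last_val = page_1[-1][0]
page_2 = conn.execute(
    "SELECT last_name FROM people WHERE last_name > ? "
    "ORDER BY last_name LIMIT 10",
    (last_val,),
).fetchall()
lost = total - len(page_1) - len(page_2)

# Unique sorting columns: append the primary key to the sort and the filter.
fixed_1 = conn.execute(
    "SELECT last_name, id FROM people ORDER BY last_name, id LIMIT 10"
).fetchall()
last_name, last_id = fixed_1[-1]
fixed_2 = conn.execute(
    "SELECT last_name, id FROM people "
    "WHERE last_name > ? OR (last_name = ? AND id > ?) "
    "ORDER BY last_name, id LIMIT 10",
    (last_name, last_name, last_id),
).fetchall()
lost_fixed = total - len(fixed_1) - len(fixed_2)
print(lost, lost_fixed)
```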

Related Index

It might seem that {sorting_cols} should always be the primary key, an alternate key or a unique index of the table. That choice is correct, but it does not suit every real-world case.

What is mandatory, because of the large amount of data, is that an index always exists for {sorting_cols}. Without one, the database has to sort millions of records by these columns each time the query runs; with one, it reads the rows in index order and spends no time sorting.

Depending on the DBMS optimizer, the columns of the primary key may also have to be added to the columns of the index:

For example, in Oracle 10 there was a rule that the optimizer only uses an index when all the columns of the index are used; in that case it is mandatory to have an index containing all the columns of {sorting_cols}.
Most current optimizers are smart enough to use an index when the sort starts with the same columns as the index. In this case {sorting_cols} can start with the columns of an existing index, followed by the column or columns of the unique index. However, this only works well when the indexed columns are selective enough that each set of rows sharing the same indexed values is small, so that ordering the unique values within those sets takes little time. Otherwise, an index on all the columns of {sorting_cols} is again necessary.

Complex "Greater than (>)" condition

For example, suppose an "Employees" table has the primary key "Num_Employee" and an index on "Last_name, First_name". The data can be displayed sorted by "Num_Employee" or by "Last_name, First_name".
In the first case, the condition "where {sorting_cols} > {last_val}" and the ordering "order by {sorting_cols}" are easily implemented as:

where Num_Employee > {Num_Employee_of_last_row}     -- it is the value of the Num_Employee of the last row of the current page
...
order by Num_Employee

The second case is more complex. The data should be shown sorted by "Last_name, First_name", but since this index does not give unique values, the primary key column is added, so {sorting_cols} becomes "Last_name, First_name, Num_Employee". The "order by" follows directly: "order by Last_name, First_name, Num_Employee". The condition, however, is more complex.
The easiest approach is to concatenate the values, placing a separator between each value:

-- Do not use this method: it causes a table scan
where Last_name || ',' || First_name  || ',' || Num_Employee >
              {Last_name_of_last_row} || ',' || {First_name_of_last_row}  || ',' || {Num_Employee_of_last_row}
...
order by Last_name, First_name, Num_Employee

However, concatenating the values prevents the index from being used, so the filter is evaluated against every record (a table scan), which takes a great amount of time. To let the optimizer use the index, each column must be compared individually, as follows:

where 
   (
       Col1 > {Col1_of_last_row}                               -- Col_i_of_last_row is the value of col_i in the last row 
                                                               -- of the current page
    or Col1 = {Col1_of_last_row} and Col2 > {Col2_of_last_row}
    or Col1 = {Col1_of_last_row} and Col2 = {Col2_of_last_row} and Col3 > {Col3_of_last_row}
	...
    or Col1 = {Col1_of_last_row} and Col2 = {Col2_of_last_row} and Col3 = {Col3_of_last_row}...and Col_n > {Col_n_of_last_row}
   )
...
order by Col1, Col2, Col3 ..., Col_n

Then, the optimal method to perform the complex filter of the second case is:

-- Use this method so that the index can be used
where (
       Last_name > {Last_name_of_last_row}                     -- XXX_of_last_row is the value of column XXX in the last row
                                                               -- of the current page
    or Last_name = {Last_name_of_last_row} and First_name > {First_name_of_last_row}
    or Last_name = {Last_name_of_last_row} and First_name = {First_name_of_last_row} and Num_Employee > {Num_Employee_of_last_row}
   )
...
order by Last_name, First_name, Num_Employee

In one reported test on SQLite, this method ran the query in 0.34 seconds on a table with 4 million rows.
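The column-by-column condition can be exercised end to end with a sketch that pages through a small "Employees" table. It assumes SQLite through Python's `sqlite3` module; the sample data, the index name `idx_name` and the helper `all_pages` are hypothetical, and the sketch checks correctness only, not timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    " Num_Employee INTEGER PRIMARY KEY,"
    " Last_name TEXT NOT NULL,"
    " First_name TEXT NOT NULL)"
)
# Index covering all sorting columns (hypothetical name).
conn.execute(
    "CREATE INDEX idx_name ON Employees (Last_name, First_name, Num_Employee)"
)
employees = [
    (1, "Smith", "John"), (2, "Adams", "Zoe"), (3, "Smith", "John"),
    (4, "Smith", "Anna"), (5, "Adams", "Zoe"), (6, "Young", "Bill"),
    (7, "Smith", "John"),
]
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?)", employees)

FIRST_PAGE = (
    "SELECT Last_name, First_name, Num_Employee FROM Employees "
    "ORDER BY Last_name, First_name, Num_Employee LIMIT :rows"
)
NEXT_PAGE = (
    "SELECT Last_name, First_name, Num_Employee FROM Employees "
    "WHERE (Last_name > :ln "
    "   OR (Last_name = :ln AND First_name > :fn) "
    "   OR (Last_name = :ln AND First_name = :fn AND Num_Employee > :num)) "
    "ORDER BY Last_name, First_name, Num_Employee LIMIT :rows"
)

def all_pages(rows_per_page):
    """Walk the table page by page, filtering on the last row seen."""
    pages = []
    page = conn.execute(FIRST_PAGE, {"rows": rows_per_page}).fetchall()
    while page:
        pages.append(page)
        ln, fn, num = page[-1]
        params = {"ln": ln, "fn": fn, "num": num, "rows": rows_per_page}
        page = conn.execute(NEXT_PAGE, params).fetchall()
    return pages

pages = all_pages(3)
for page in pages:
    print(page)
```

Even though several employees share the same last and first name, every row appears on exactly one page, because the primary key breaks the ties.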

Hierarchical query

Some databases provide specialised syntax for hierarchical data.

A window function in SQL:2003 is an aggregate function applied to a partition of the result set.

For example,

sum(population) OVER( PARTITION BY city )

calculates the sum of the populations of all rows having the same city value as the current row.

Partitions are specified using the OVER clause which modifies the aggregate. Syntax:

<OVER_CLAUSE> ::=
   OVER ( [ PARTITION BY <expr>, ... ]
          [ ORDER BY <expression> ] )

The OVER clause can partition and order the result set. Ordering is used for order-relative functions such as row_number.
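A partitioned aggregate such as the sum above can be tried with a short sketch. It assumes SQLite 3.25 or later through Python's `sqlite3` module; the `districts` table and its data are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE districts (district TEXT, city TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO districts VALUES (?, ?, ?)",
    [
        ("North", "Springfield", 100),
        ("South", "Springfield", 150),
        ("Centre", "Shelbyville", 200),
    ],
)

# Every row keeps its own population and also carries the total for its city.
totals = conn.execute(
    """
    SELECT district, city, population,
           SUM(population) OVER (PARTITION BY city) AS city_total
    FROM districts
    ORDER BY city, district
    """
).fetchall()
for row in totals:
    print(row)
```

Unlike GROUP BY, the window function does not collapse the rows: the result has as many rows as the table.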

Query evaluation ANSI

The processing of a SELECT statement according to ANSI SQL would be the following:^[5]

select g.*
from users u inner join groups g on g.Userid = u.Userid
where u.LastName = 'Smith'
and u.FirstName = 'John'

1) The FROM clause is evaluated: a cross join (Cartesian product) is produced for the first two tables in the FROM clause, resulting in a virtual table, VTable1.
2) The ON clause is evaluated for VTable1; only records which meet the join condition g.Userid = u.Userid are inserted into VTable2.
3) If an outer join is specified, records which were dropped from VTable2 are added into VTable3. For instance, if the above query were:
```
select u.*
from users u left join groups g on g.Userid = u.Userid
where u.LastName = 'Smith'
and u.FirstName = 'John'
```
all users who did not belong to any groups would be added back into VTable3.
4) The WHERE clause is evaluated; in this case only group information for user John Smith would be added to VTable4.
5) The GROUP BY clause is evaluated; if the above query were:
```
select g.GroupName, count(g.*) as NumberOfMembers
from users u inner join groups g on g.Userid = u.Userid
group by GroupName
```
VTable5 would consist of members returned from VTable4 arranged by the grouping, in this case the GroupName.

6) The HAVING clause is evaluated; groups for which the HAVING condition is true are inserted into VTable6. For example:

select g.GroupName, count(g.*) as NumberOfMembers
from users u inner join groups g on g.Userid = u.Userid
group by GroupName
having count(g.*) > 5

7) The SELECT list is evaluated and returned as VTable7.
8) The DISTINCT clause is evaluated; duplicate rows are removed and the result is returned as VTable8.
9) The ORDER BY clause is evaluated, ordering the rows and returning VCursor9. This is a cursor and not a table because ANSI defines a cursor as an ordered set of rows (not relational).
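The effect of several of these logical steps can be observed with a sketch, assuming SQLite through Python's `sqlite3` module. The table is named `user_groups` here (a hypothetical name chosen to avoid a possible keyword clash with `groups`), the sample data are invented, and `count(*)` stands in for `count(g.*)`, which SQLite does not accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (Userid INTEGER, FirstName TEXT, LastName TEXT)"
)
conn.execute("CREATE TABLE user_groups (Userid INTEGER, GroupName TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "John", "Smith"), (2, "Jane", "Doe"), (3, "Ann", "Lee")],
)
conn.executemany(
    "INSERT INTO user_groups VALUES (?, ?)",
    [(1, "admins"), (1, "devs"), (2, "devs")],
)

# FROM/ON, then WHERE: only the groups of John Smith remain.
inner = conn.execute(
    "SELECT g.GroupName FROM users u "
    "INNER JOIN user_groups g ON g.Userid = u.Userid "
    "WHERE u.LastName = 'Smith' AND u.FirstName = 'John' "
    "ORDER BY g.GroupName"
).fetchall()

# Outer join: user 3, who belongs to no group, is added back with NULL.
outer = conn.execute(
    "SELECT u.Userid, g.GroupName FROM users u "
    "LEFT JOIN user_groups g ON g.Userid = u.Userid "
    "ORDER BY u.Userid, g.GroupName"
).fetchall()

# GROUP BY, then HAVING: only groups with more than one member survive.
having = conn.execute(
    "SELECT g.GroupName, count(*) AS NumberOfMembers FROM users u "
    "INNER JOIN user_groups g ON g.Userid = u.Userid "
    "GROUP BY g.GroupName HAVING count(*) > 1 "
    "ORDER BY g.GroupName"
).fetchall()
print(inner, outer, having)
```

This shows the logical order only; an actual database is free to execute the steps in any order that yields the same result.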

Window function support by RDBMS vendors

The implementation of window function features by vendors of relational databases and SQL engines differs widely. Apart from MySQL, most databases support at least some flavour of window functions. However, on closer inspection most vendors implement only a subset of the standard. Take the powerful RANGE clause as an example: only Oracle, DB2, Spark/Hive, and Google BigQuery fully implement this feature. More recently, vendors have added new extensions to the standard, e.g. array aggregation functions. These are particularly useful when running SQL against a distributed file system (Hadoop, Spark, Google BigQuery), where data co-locality guarantees are weaker than on a distributed relational database (MPP). Rather than evenly distributing the data across all nodes, SQL engines running queries against a distributed filesystem can achieve data co-locality guarantees by nesting data, thus avoiding potentially expensive joins involving heavy shuffling across the network. User-defined aggregate functions that can be used in window functions are another extremely powerful feature.

Generating data in T-SQL

A method to generate data based on UNION ALL:

select 1 a, 1 b union all
select 1, 2 union all
select 1, 3 union all
select 2, 1 union all
select 5, 1

SQL Server 2008 supports the "row constructor" specified in the SQL3 ("SQL:1999") standard:

select *
from (values (1, 1), (1, 2), (1, 3), (2, 1), (5, 1)) as x(a, b)
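Both ways of generating rows can be compared in a sketch that assumes SQLite through Python's `sqlite3` module. SQLite does not accept the `as x(a, b)` column alias list on a derived table, so the row constructor is wrapped in a common table expression here; that rewrite is an adaptation for SQLite, not T-SQL syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rows generated with UNION ALL, as in the first T-SQL example.
union_rows = conn.execute(
    """
    select 1 a, 1 b union all
    select 1, 2 union all
    select 1, 3 union all
    select 2, 1 union all
    select 5, 1
    """
).fetchall()

# Rows generated with a VALUES row constructor, named through a CTE.
values_rows = conn.execute(
    """
    with x(a, b) as (values (1, 1), (1, 2), (1, 3), (2, 1), (5, 1))
    select a, b from x
    """
).fetchall()
print(sorted(union_rows) == sorted(values_rows))
```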

References

[1] Microsoft. "Transact-SQL Syntax Conventions".
[2] MySQL. "SQL SELECT Syntax".
[3] PostgreSQL 9.1.24 Documentation - Chapter 3. Advanced Features
[4] Ing. Óscar Bonilla, MBA
[5] Inside Microsoft SQL Server 2005: T-SQL Querying by Itzik Ben-Gan, Lubor Kollar, and Dejan Sarka

Sources

Horizontal & Vertical Partitioning, Microsoft SQL Server 2000 Books Online.

External links

Windowed Tables and Window function in SQL, Stefan Deßloch
Oracle SELECT Syntax.
Firebird SELECT Syntax.
MySQL SELECT Syntax.
PostgreSQL SELECT Syntax.
SQLite SELECT Syntax.


