QUERY OPTIMIZATION AND PROCESSING- THEORY

QUERY PROCESSING AND OPTIMIZATION

Query processing involves extracting data from a database through multiple steps. 

This includes translating high-level queries into low-level expressions at the file system's physical level, optimizing queries, and executing them for actual results.

Query Processing

One of the types of DBMS is Relational Database Management System (RDBMS) where data is stored in the form of rows and columns (in other words, stored in tables) which have  associations with each other. 

The users (both applications and humans) have the liberty to select, insert, update and delete these rows and columns without violating the constraints provided at the time of defining these relational tables. 

Let’s say you want the list of all the employees who have a salary of more than 10,000.

SQL Query 

 

SELECT

    emp_name

FROM

    employee

WHERE

    salary>10000;

 

The problem here is that DBMS won't understand this statement. 

SQL being a High-Level Language makes it easier not just for the users to query data based on their needs but also bridges the communication gap between the DBMS which does not really understand human language. 

In fact, the underlying system of DBMS won't even understand these SQL queries. 

For them to understand and execute a query: 

i) they first need to be converted to a Low-Level Language. 

ii) then SQL queries go through a processing unit that converts them into low-level Language via Relational Algebra in DBMS

Since relational algebra queries are a bit more complex than SQL queries, DBMS expects the user to write only the SQL queries. 

It then processes the query before evaluating it.

 

Phases of Query Processing in DBMS

 

As mentioned in the above image, query processing can be divided into compile-time and run-time phases. 

Compile-time phase includes:

  1. Parsing and Translation (Query Compilation)
  2. Query Optimization
  3. Evaluation (code generation)

In the Runtime phase, the database engine is primarily responsible for interpreting and executing the hence generated query with physical operators and delivering the query output.

Parsing and Translation

The first step in query processing is Parsing and Translation. 

In the first step, queries undergo lexical, syntactic, and semantic analysis. Essentially, the query gets broken down into different tokens and white spaces are removed along with the comments (Lexical Analysis). 

In the next step, the query gets checked for the correctness, both syntax and semantic wise. The query processor first checks the query if the rules of SQL have been correctly followed or not (Syntactic Analysis).

Finally, the query processor checks if the meaning of the query is right or not. Things like if the table(s) mentioned in the query are present in the DB or not? if the column(s) referred from all the table(s) are actually present in them or not? (Semantic Analysis)

Once the above mentioned checks pass, the flow moves to convert all the tokens into relational expressions, graphs, and trees. This makes the processing of the query easier for the other parsers.

Let's consider a query :

SQL Query:

SELECT

    emp_name

FROM

    employee

WHERE

    salary>10000;

 

The above query would be divided into the following tokens: SELECT, emp_name, FROM, employee, WHERE, salary, >, 10000.

The tokens (and hence the query) get validated for

  • The name of the queried table is looked into the data dictionary table.
  • The name of the columns mentioned (emp_name and salary) in the tokens are validated for existence.
  • The type of column(s) being compared have to be of the same type (salary and the value 10000 should have the same data type).

The next step is to translate the generated set of tokens into a relational algebra query. These are easy to handle for the optimizer in further processes.

∏ emp_name (σ salary > 10000)

 

Query Tree

A Query Tree is a tree data structure that represents the hierarchical structure of a database query. It is primarily used in query processing and optimization in databases.

  • Leaf nodes represent relations (tables).
     
  • Internal nodes represent relational algebra operations like selection (σ), projection (π), join (⨝), etc.
     

It helps in visualizing how a query will be executed step-by-step and is used by database systems to optimize queries.

Relations (Tables) → at the bottom (leaves)
 

Operators → in internal nodes:
 

  • σ: Selection (WHERE clause)
     
  • π: Projection (SELECT clause)
     
  • : Join (JOIN clause)
     
  • ×: Cartesian product
     
  • ∪, −, ∩: Set operations (UNION, DIFFERENCE, INTERSECTION)

Example 1: 

 

SQL Query:

 

SELECT name

FROM Student

WHERE age > 20;

Relational Algebra Equivalent:

π_name (σ_age > 20 (Student))

 

Query Tree:

      π_name

           |

      σ_age > 20

           |

        Student

 

Example 2: 

 

SQL Query 

 

SELECT S.name, D.dept_name

FROM Student S, Department D

WHERE S.dept_id = D.id AND S.age > 20;


 

Relational Algebra Equivalent:

 

π_S.name, D.dept_name (σ_S.age > 20 (Student ⨝ Student.dept_id = Department.id Department))



 

Query Tree

 

π_S.name, D.dept_name

                        |

                  σ_S.age > 20

                        |

          ⨝_S.dept_id = D.id

             /               \

        Student          Department


 

Example 3: 

 

SQL Query

 

SELECT e.name, d.name

FROM Employee e, Department d

WHERE e.dept_id = d.dept_id;

 

Relational Algebra Equivalent

 

π_e.name, d.name (Employee ⨝_e.dept_id = d.dept_id Department)


 

Query Tree

   π_e.name, d.name

                |

       ⨝_e.dept_id = d.dept_id

           /               \

     Employee         Department

 

Example 4: 

 

SQL Query

 

SELECT e.name

FROM Employee e, Department d

WHERE e.dept_id = d.dept_id AND d.location = 'Kathmandu';

 

Relational Algebra Equivalent

 

π_e.name (Employee ⨝_e.dept_id = d.dept_id (σ_location = 'Kathmandu' (Department)))

 

Query Tree

 

         π_e.name

                |

         ⨝_e.dept_id = d.dept_id

             /                \

      Employee         σ_location = 'Kathmandu'

                             |

                         Department

 

Query Evaluation

Once the query processor has the above-mentioned relational forms with it, the next step is to apply certain rules and algorithms to generate a few other powerful and efficient data structures. 

These data structures help in constructing the query evaluation plans. 

For example, if the relational graph was constructed, there could be multiple paths from source to destination. 

A query execution plan will be generated for each of the paths.

 

i) One way could be first projecting followed by selection (on the right). 

 

ii) Another way would be to do selection followed by projection (on the left). 


 

Query Optimization

In the next step, DMBS picks up the most efficient evaluation plan based on the cost each plan has. 

The aim here is to minimize the query evaluation time. The optimizer also evaluates the usage of index present in the table and the columns being used. It also finds out the best order of subqueries to be executed so as to ensure only the best of the plans gets executed.

Simply put, for any query, there are multiple evaluation plans to execute it. 

Choosing the one which costs the least is called Query Optimization in DBMS

Some of the factors weighed in by the optimizer to calculate the cost of a query evaluation plan is:

  • CPU time
  • Number of tuples to be scanned
  • Disk access time
  • number of operations

Summary : 

Compile time

  • Parsing and Translation: break the query into tokens and check for the correctness of the query
  • Query Optimisation: Evaluate multiple query execution plans and pick the best out of them.
  • Query Generation: Generate a low level, DB executable code

Runtime time (execute/evaluate the hence generated query)