Take a Dip into PostgreSQL Arrays

Datetime:2017-04-20 05:21:30         Topic: PostgreSQL          Share        Original >>
Here to See The Original Article!!!

There's a number of datatypes available in PostgreSQL. In this article, we're going to take a look at the array datatype.

There are times when you might want to store multiple values in one database column instead of across multiple tables. PostgreSQL gives you this capability with the array datatype.

Arrays are some of the most useful data types for storing lists of information. Whether you have a list of baseball scores, blog tags, or favorite book titles, arrays can be found everywhere. The bigger question, however, is why you'd even consider using arrays. The answer comes down to performance and ease of use. Aggregating and joining data across tables and rows can be difficult, and depending on the type of data you're storing, it might get expensive too.

In this article, we're going to create a table containing student information. We'll create two fields that demonstrate two ways to create arrays. We'll insert some data into the table, index the array columns, show how to query data from one and multidimensional arrays, and we'll look at some of the performance benefits and drawbacks along the way.

So, let's get started ...

Creating a Table with Arrays

We'll start by creating a table of students. It comprises their name , exam scores as a multidimensional array, and contact information that's a one-dimensional array of phone numbers. We'll create the table using both the bracket notation [] for creating arrays and the ARRAY keyword which are the two ways you can designate a field to be an array. So, let's show you how to do that:

CREATE TABLE students (  
    name text,
    scores int[][],
    contacts varchar ARRAY -- or varchar[]
);

We are using the text datatype for the name. The student scores is an array of integers, but since we will insert four exam scores taken at four different times, we'll create a multidimensional array of integers to hold all the scores. This is indicated using the double brackets. Student contacts is also an array using the varchar datatype, but instead of using brackets, we'll use the the ARRAY constructor. The brackets and the ARRAY constructor are synonymous.

Once the table has been created, we can run \d students to see how PostgreSQL has stored the datatypes for each table row.

Table "public.students"
  Column  |        Type         | Modifiers 
----------+---------------------+-----------
 name     | text                | 
 scores   | integer[]           | 
 contacts | character varying[] |

Notice that the table doesn't indicate explicitly that scores is a multidimensional array. That's because PostgreSQL allows arrays to be either multidimensional and one-dimensional. So, even though we created the table with double brackets [][] , PostgreSQL would have allowed us to insert a one-dimensional array as well. At the same time, we could have simply created the scores field as a one-dimensional array, with single brackets [] , and inserted a multidimensional array.

Now, let's move on and insert some data into our student table ...

Inserting Data with Arrays

Once our table has been created, let's insert some data. For this article, we've entered 400 students into our database like the following:

INSERT INTO students VALUES (  
     'student_name', '{{44,93,82,42},{59,74,73,67},{43,54,59,77},{46,45,68,98}}', '{"(555)480-9941","(555)738-3707"}'
);

student_ with a number is the unique student name for each record. The exam scores is a multidimensional array. Each of these arrays contains four exam arrays that represent a single exam enclosed using braces {} . Each of the values within an array represents a portion of the exam. The key when inserting a multidimensional array is that the enclosed arrays must have matching dimensions. So, we couldn't have a student who has only three scores for a single exam, for example, otherwise we'll get an error like this:

SELECT ARRAY[[1,2,3],[4,5]];  
ERROR:  multidimensional arrays must have array expressions with matching dimensions

The last values that we'll insert are enclosed in a one-dimensional array containing two phone numbers for each student. If a student has one number or two numbers, it won't matter. The only constraints we have are for multidimensional arrays where the array dimensions must match.

After inserting the data, we should see something like:

name    |                          scores                           |           contacts            
-----------+-----------------------------------------------------------+-------------------------------
 student_1 | {{91,54,54,68},{64,68,59,89},{95,51,42,63},{89,51,70,97}} | {(555)941-8927,(555)984-8816}
 ...

Now, let's set up some indexes on these array columns.

Indexing Arrays

To index arrays in PostgreSQL, it's best to use a GIN or GiST index . Using either index has its benefits and drawbacks; however, GiST indexes were primarily developed for geometric datatypes, while GIN indexes were designed for arrays. As a rule of thumb, PostgreSQL recommends using a GIN index for static data, especially for arrays, and a GiST index for data that is frequently updated.

Both types of indexes provide us with special operator classes that we can use when searching for data in arrays. GIN indexed queries support <@, @>, =, and && operators, while GiST indexed queries support many more operators like <<, &<, &>, >>, <<|, &<|, |&>, |>>, @>, <@, ~=, and &&.

Indexing arrays don't require a special syntax to use. Using PostgreSQL's CREATE INDEX operation, we'll create a Btree index for student names and GIN indexes on the other columns like:

CREATE INDEX idx_name ON students (name); -- BTREE Index (no array)  
CREATE INDEX idx_scores ON students USING GIN(scores); -- GIN Index (array)  
CREATE INDEX idx_contacts ON students USING GIN(contacts); -- GIN Index (array)

Once the indexed have been created, let's get into querying the data ...

Accessing Array Elements

To take advantage of our indexes when accessing student scores and contacts, we can use GIN operators to find whether students got a certain exam score, or whether a contact number matches with a student record. For a simple example, we could find out which student has the contact number (555)941-8927. For that, we'd write something like:

SELECT name, contacts FROM students WHERE contacts @> '{(555)941-8927}';

Since we used the @> operator which looks for numbers contained in the array, the GIN index is automatically used. Using EXPLAIN on the query, we can view its performance while searching for the number in the contacts column.

QUERY PLAN                                 
---------------------------------------------------------------------------
 Bitmap Heap Scan on students  (cost=8.01..12.02 rows=1 width=72)
   Recheck Cond: (contacts @> '{(555)941-8927}'::character varying[])
   ->  Bitmap Index Scan on idx_contacts  (cost=0.00..8.01 rows=1 width=0)
         Index Cond: (contacts @> '{(555)941-8927}'::character varying[])

Without the GIN index, PostgreSQL would perform a sequential scan that would have to look at all of our contacts in the table before finding returning the right student. Using a data source of more than 400 students, one can see that the costs of the query would be much higher without an index:

QUERY PLAN                           
----------------------------------------------------------------
 Seq Scan on students  (cost=0.00..19.00 rows=1 width=72)
   Filter: (contacts @> '{(555)941-8927}'::character varying[])

Finding a specific exam score within a multidimensional array is a little tougher. Let's try to find the number of students who got a 97 on their exam. One way to write this query is like this:

SELECT name, scores FROM students WHERE scores @> '{97}';

The problem with this query is that we will get 87 students back who got a 97 in either four of their exams. At the same time, PostgreSQL didn't use our index because it had to look at each student to find out who got a 97.

QUERY PLAN                         
-----------------------------------------------------------
 Seq Scan on students  (cost=0.00..19.00 rows=1 width=234)
   Filter: (scores @> '{97}'::integer[])

This way of querying is quite costly, so let's see if there is a better approach. One of the more interesting ways to search through arrays containing integers is using the intarray extension , which contains special operator classes for arrays. To install the extension, we'd write:

CREATE EXTENSION intarray;

Once the extension is installed, we can drop our scores index DROP INDEX idx_scores; and create a new index using the intarray extension we installed and the gin__int_ops operator class that comes with intarray .

CREATE INDEX idx_scores_with_intarray ON students USING GIN(scores gin__int_ops);

Now that we've created the new index when we run EXPLAIN on the query, we can see that the index is used and it's much faster to search for the score we want.

QUERY PLAN                                    
----------------------------------------------------------------------------------
 Bitmap Heap Scan on students  (cost=8.00..12.02 rows=1 width=234)
   Recheck Cond: (scores @> '{97}'::integer[])
   ->  Bitmap Index Scan on idx_scores_intarray  (cost=0.00..8.00 rows=1 width=0)
         Index Cond: (scores @> '{97}'::integer[])

Sometimes we want to search for particular scores within a single exam and don't want to include all the exams a student has taken. Since we have four arrays representing four exams, we could select one of the arrays/exams and see how many of the 87 students got a 97 on the first exam. To do that, we'd write a query like:

SELECT * FROM students WHERE scores[1:1] @> '{97}';

From this query, we'll get 20 students who got a 97 as one of their scores on their first exam. Notice that the only addition to the query is scores[1:1] . This is an array slice, which selects only the first array of our scores multidimensional array.

The only problem with the query is that, even though we set up an index on the scores column, the index won't be used because we haven't indexed both dimensions of the array.

QUERY PLAN                         
-----------------------------------------------------------
 Seq Scan on students  (cost=0.00..19.00 rows=1 width=234)
   Filter: (scores[1:1] @> '{97}'::integer[])

Since the scores array is two dimensional, we'd have to set up two indexes instead of one. To get around this, we can create a materialized view that will contain the student names in one column and their first exam scores as a one-dimensional array in another column. Then we can index the score's column and see what the performance differences are there. Let's set that up now:

CREATE MATERIALIZED VIEW student_scores AS  
    SELECT 
        name, 
        ARRAY(SELECT UNNEST(s.scores[1:1])) AS scores
    FROM
        students AS s;

With the new materialized view, our data looks like the following with scores as a one-dimensional array:

name    |    scores     
------------+---------------
 student_1  | {91,54,54,68}
 student_2  | {82,53,85,59}
 ...

Now, we can set up a GIN index on the scores column:

CREATE INDEX idx_materialized_student_score_view ON student_scores USING GIN(scores);

Let's run the query using the student_scores materialized view:

SELECT * FROM student_scores WHERE scores @> '{97}';

We'll still get the 20 students who got a 97 on their first exam, but let's run EXPLAIN on the query to see its performance.

QUERY PLAN                           
----------------------------------------------------------------
 Seq Scan on student_scores  (cost=0.00..10.00 rows=1 width=48)
   Filter: (scores @> '{97}'::integer[])

As we can see, PostgreSQL chose to do a sequential scan on our data without using the index. PostgreSQL intuitively selects the fastest way to query our data, and in this case it was via the sequential scan and not the index. But, if we look at the performance of querying our data using the slice of the scores array ( scores[1:1] ) or creating the materialized view, we can see that creating the materialized view and searching for our scores that way outperformed looking for values through the slice.

Selecting ANY of the Array Elements

Another way that PostgreSQL allows us to search for values is using the array subexpression ANY . This expression lets us check if any of the values in an array meet the expression requirements. The only drawback with this is that it doesn't use an index, so on larger datasets, we might see significant performance loss if we use it. Nonetheless, let's try it out.

When constructing a query using ANY , we place the expression in the WHERE clause on the right side of an operator. So, if we are looking again for students who scored a 97 on an exam we'd write:

SELECT * FROM students WHERE 97 = ANY(scores);

Like the query above where we used the GIN and GiST indexes with the @> operator, the ANY expression provides us with the same result of 87 students. Using EXPLAIN , however, shows us the real performance problems using the expression:

QUERY PLAN                         
------------------------------------------------------------
 Seq Scan on students  (cost=0.00..23.00 rows=87 width=234)
   Filter: (97 = ANY (scores))

Therefore, it's not that advantageous to use ANY and it's favorable to put an index on the columns that we're going to be accessing a lot.

Summing it Up

So we've discussed quite a bit about how to create tables with arrays, how to access them, how to index them and their performance, as well as a trick to put values from a multidimensional array into a materialized view for better indexing. By shedding some light into how PostgreSQL arrays work and how to query them, your performance and understanding of how arrays work will enhance your effectiveness when coming across these datatypes in your database.

If you have any feedback about this or any other Compose article, drop the Compose Articles team a line atarticles@compose.com. We're happy to hear from you.

attribution







New