- Greeplum series (three) basic usage”
- > three, create and manage table
- four, partition table
- five, sequence
- six, index
- seven, view
- eight, data management
Greeplum series (three) basic usage”
《PostgreSQL Tutorials: https://www.yiibai.com/postgresql
psql -d test -h 127.0.0.1 -p 5432 -U gpadmin
psql -d test -h 127.0.0.1 -p 5432 -U gpadmin
Note: the default login table name is gpadmin
1.2 to create a database
create database test; # Need to log on to PSQLCREATEDB test; command line mode
> three, create and manage table
Through the CREATE TABLE command implementation, when creating tables, we need to define the following points:
- Which columns and data types
- What kind of data can be stored by constraint constraints
- Table distribution strategy
- Storage mode. Such as compression, column storage, etc.
- Partition strategy for large tables
3.1 data type”
- For character types, most choose TEXT or VARCHAR.
- For numeric types, try to choose smaller data types.
- For columns that are intended for connection, select the same data type.
create table tb_products( -- 1. A combination of primary key constraints, unique constraints and non empty constraints. The default becomes DK (Distributed Key) id integer integer PRIMARY KEY, -- 2. The only constraint is to ensure that the data of the field is unique in the table. product_no integer UNIQUE, -- 3. Non empty constraints, no empty values can exist name text NOT NULL, -- 4. Checking constraints and making data must satisfy a Boolean expression constraint. price numeric CHECK(price>0) -- 5. Foreign key constraints, GPDB is not currently supported. );
Be careful： There is only one of the primary key constraint and the unique constraint.
3.2 distribution strategy”
When create table and alter table are used, DISTRIBUTED BY (HASH distribution) or DISTRIBUTED RANDOMLY (random distribution) are used to determine how data are distributed. Consider the main points:
- Uniform data distribution: try to ensure that each segment instance stores the same amount of data; use a unique DK as much as possible, such as the primary key, the unique key, and so on.
- Local operations and distributed operations: ensure that the processing of queries (associations, sorties, aggregation, etc.) is done locally as far as possible in each instance, avoiding data redistribution; different tables try to use the same DK as far as possible to avoid random distribution.
- Balanced query processing: to ensure that every segment instance can handle the same amount of workload as much as possible.
Declarations distribution keys:
- Specify at the time of creating or modifying the table definition;
- If not specified, the system will consider the use of the primary key or the first field as the DK of the HASH distribution.
- Columns of geometric type or custom type are not suitable for GP’s DK.
- If there is no suitable column to ensure the average distribution of data, random distribution is used.
create table tb_dk_01(a int, b int) distributed by(b); create table tb_dk_02(a int, b int); create table tb_dk_03(a int primary key, b int); create table tb_dk_04(a int, b int) distributed randomly; # View the table distribution key\d tb_dk_01
3.3 storage mode”
Select heap storage (Heap) or add only (Append-Only) store.
- Heap stores small tables that are suitable for changing data, such as dimension tables.
- Only store extra large tables that are suitable for the warehouse, usually loading data in batches and only read only query operations. They do not support UPADTE and DELETE operations.
# Create a heap table, a default storage modeCREATE TABLE tb_heap_01(id int) DISTRIBUTED BY (id) # Create an additional tableCREATE TABLE tb_ao_01(id int) WITH(appendonly=true)
3.4 row storage and column storage
Select row store (Row-Orientation) or column store (Column-Orientation).
- Update of table data: only row storage is selected.
- If data is often INSERT, consider selecting row storage.
Number of columns for query design:
If SELECT or WHERE is involved in all or most columns of the table, consider row storage. Column storage is suitable for single operations in WHERE or HAVING squadrons for aggregation operations:
SELECT AVG(salary)…WHERE salary>10000
Or use a single column condition in the WHERE condition and return a small number of rows using compressed storage.
SELECT salary, dept…WHERE state=‘CA’
- The number of columns in a table: row storage is more efficient for columns with multiple rows or relatively small row sizes; columns are more neutral in storing only a few columns that only access a wide table.
Compression: the column storage table has the compression advantage.
-- When you create column stores, you can only store additional (Append-Only). create table tb_col_01(a int, b text) with (appendonly=true, orientation=column) distributed by (a);
3.5 compressed storage (supporting only Append-only tables).
Two compression methods: table compression and column level compression.
(1) Select the considerations of compression methods and levels:
- Compression ratio
- Compression speed
- Decompression speed or query efficiency
It is guaranteed that the most effective compression and data size will not be significantly improved under the compression time and query efficiency. The compression rate of ZLIB is higher than that of QUICKLZ, but the rate is low.
-- Note: QUICKLZ has only one compression level, that is, there is no compresslevel parameter, and ZLIB has 1-9 optional. create table tb_zlib_01(a int, b text) with (appendonly=true, compresstype=zlib, compresslevel=5);
(2) Check the compression and distribution of the AO table:
# Look at the distribution of tables, which can only be used for Append-Only tables.Select get_ao_distribution ('table name');The compression rate of the view table can only be used in the Append-Only table.Select get_ao_compression_ratio ('table name');Select gp_segment_id, count (1) from name group by 1;
- COMPRESSTYPE：ZLIB(Higher compression ratio, QUICKLZ (faster compression), RLE_TYPE (run length encoding), none (no compression, default)
- COMPRESSLEVEL：ZLIBIt is optional for class 1-9, lowest for 1 level, highest for 9 level, QUICKLZ only for 1 level compression, RLE_TYPE for 1-4 level, 1 level fast, but low compression rate, 4 level slower, but compression rate is high.
(4) Prioritization of the compressed settings
The lower priority setting has higher priority.
- The column compression settings for sub partitions will override partitions, columns, and table level settings.
- The column compression settings for partitions will override columns and table level settings.
- The compression setting of the column will cover the entire table level settings
Be careful： Storage settings can not be inherited
create table tb_t3 ( c1 int encoding(compresstype=zlib), c2 text, c3 text encoding(compresstype=rle_type), c4 smallint encoding(compresstype=none), defalut column encoding(compresstype=quicklz, blocksize=65536) ) with(appendonly=true, orientation=column) partition by range(c3) ( start('2010-01-01'::date) end('2010-12-31'::date), column c3 encoding(compresstype=zlib) );
3.6 change table”
alter table The command is used to change the definition of an existing table. For example:
alter table tb_01 alter column a set not null;
(1) Modify distribution strategy
-- Data will be redistributed and recursively applied to all sub partitions.Alter table tb_01 set distributed by (b);- note: when the distribution strategy is changed to random distribution, data will not be redistributed.Alter tableTb_01 set distributed randomly;
(2) Redistribution table data
# A random distribution strategy or a table without changing the distribution strategy, forcing redistribution.Alter table tb_01 set with (reorganize=true);
(3) Modify the storage mode of the table
The storage mode can only be specified at CREATE TABLE. If you want to modify, you must use the correct storage option to rebuild the table.
CREATE TABLE tb_zlib_02 (like tb_zlip_01) WITH (appendonly=true, compresstype=quicklz, compresslevel=1, orientation=column); INSERT INTO tb_zlib_02 SELECT * FROM tb_zlip_01; DROP TABLE tb_zlip_01; ALTER TABLE tb_zlib_02 RENAME TO tb_zlib_01; GRANT ALL PRIVILEGES ON tb_zlib_01 TO admin; GRANT SELECT ON tb_zlib_01 TO dylan;
(4) Add a compressed column
ALTER TABLE tb_zlib_01 ADD COLUMN c int DEFAULT 0 ENCODING (COMPRESSTYPE=zlib);
(5) Delete table
-- To delete a table - related view, you must use CASCADEDROP TABLE tb01;DROP TABLE tb03 CASCADE;- - record with DELETE or TRUNCATE emptying tableDELETE FROM tb02;TRUNCATE tb02;
four, partition table
A large table is logically divided into several parts, such as query according to partition conditions, which will reduce the scanning scope and improve the performance of the system. Improving query speed and performance for specific types of data is more convenient for database maintenance and updating.
The partition strategy for the decision table:
- Is the watch large enough? The large fact table is suitable for table partitions.
- Dissatisfied with current performance? Partition is reconsidered when query performance is lower than expected.
- Can the query condition match the partition condition? Is the WHERE condition of the query consistent with the column considered partitioned?
- Do data warehouses need to roll historical data? Rolling demand for historical data is also a consideration for partition design.
- Can the data be evenly partitions according to a certain rule? The rules that try to split the data evenly
4.1 partition table type”
- RangeZoning: (Date range or number range) / as date, price, etc
- List Zoning: For example, a region, a product, etc
4.2 create partitions
TABLE It can only be partitioned at CREATE TABLE. The first step is to select partition type (scope partition, list partition) and partition field.
(1) Defining the date range partition table (range partition)
Use a single date or timestamp field as partitioning key. If necessary, you can use the same field to make sub partitions. The partition increment is defined by using START, END and EVERY clauses to enable GP to automatically generate partitions.
CREATE TABLE tb_cp_01 (id int, date date, amt decimal(10, 2)) DISTRIBUTED BY (id) PARTITION BY RANGE (date) ( -- (Default behavior) INCLUSIVE: contains 2013-01-01; EXCLUSIVE: does not contain 2014-01-01. START (date '2013-01-01') INCLUSIVE END (date '2014-01-01') EXCLUSIVE EVERY (INTERVAL '1 month') );
You can also make a separate name for each partition.
CREATE TABLE tb_cp_02 (id int, date date, amt decimal(10, 2)) DISTRIBUTED BY (id) PARTITION BY RANGE (date) ( PARTITION Jan13 START (date '2013-01-01') INCLUSIVE, PARTITION Feb13 START (date '2013-02-01') INCLUSIVE, PARTITION Mar13 START (date '2013-03-01') INCLUSIVE, PARTITION Apr13 START (date '2013-04-01') INCLUSIVE, PARTITION May13 START (date '2013-05-01') INCLUSIVE, PARTITION Jun13 START (date '2013-06-01') INCLUSIVE, PARTITION Jul13 START (date '2013-07-01') INCLUSIVE, PARTITION Aug13 START (date '2013-08-01') INCLUSIVE, PARTITION Sep13 START (date '2013-09-01') INCLUSIVE, PARTITION Oct13 START (date '2013-10-01') INCLUSIVE, PARTITION Nov13 START (date '2013-11-01') INCLUSIVE, PARTITION Dec13 START (date '2013-12-01') INCLUSIVE END (date '2014-01-01') EXCLUSIVE );
(2) Defining a digital range partition table
CREATE TABLE tb_cp_03 (id int, rank int, year int, gender char(1), count int) DISTRIBUTED BY (id) PARTITION BY RANGE (year) ( START (2010) END (2014) EVERY (1), DEFAULT PARTITION extra );
(3) Create a list partition table (list partition)
You can use any column of data type as partitioning key; you can use multiple column combinations as partitioning keys.
CREATE TABLE tb_cp_04 (id int, rank int, year int, gender char(1), count int ) DISTRIBUTED BY (id) PARTITION BY LIST (gender) ( PARTITION girls VALUES ('F'), PARTITION boys VALUES ('M'), DEFAULT PARTITION other );
(4) Defining a multilevel partition table
When sub partition is needed, multilevel partition design can be used.
CREATE TABLE tb_cp_05 (trans_id int, date date, amount decimal(9, 2), region text) DISTRIBUTED BY (trans_id) PARTITION BY RANGE (date) SUBPARTITION BY LIST (region) SUBPARTITION TEMPLATE ( -- Subdivision SUBPARTITION usa VALUES ('usa'), SUBPARTITION europe VALUES ('europe'), DEFAULT SUBPARTITION other_regions ) ( -- Main partition START (date '2013-09-01') INCLUSIVE END (date '2014-01-01') EXCLUSIVE EVERY (INTERVAL '1 month'), DEFAULT PARTITION outlying_dates );
Create a 3 level sub partition table, which is divided into three levels: year, month and area.
CREATE TABLE tb_cp_06 (id int, year int, month int, day int, region text) DISTRIBUTED BY (id) PARTITION BY RANGE (year) SUBPARTITION BY RANGE (month) SUBPARTITION TEMPLATE ( -- Define two level partitions (2 + default) START (1) END (3) EVERY (1), DEFAULT SUBPARTITION other_months) SUBPARTITION BY LIST (region) SUBPARTITION TEMPLATE ( -- Define three level partitions (2 + default) SUBPARTITION usa VALUES ('usa'), SUBPARTITION europe VALUES ('europe'), DEFAULT SUBPARTITION other_regions) ( -- Defining first level partitions (2 + default) START (2012) END (2014) EVERY (1), DEFAULT PARTITION outlying_years );
4.3 view partition design
Look at the partition table design through the pg_partitions view.
SELECT partitionboundary, partitiontablename, partitionname, partitionlevel, partitionrank FROM pg_partitions WHERE tablename='tb_cp_05';
The following view can also look at the information of the partition table:
- View pg_partition_templates to create SUBPARTITION
- Look at the partition key pg_partition_columns of the partition table
4.4 maintenance partition table”
You must use the ALTER TABLE command to maintain partitions from the top-level table.
(1) Adding new partitions
The original partition table contains the subpartition template design:
ALTER TABLE tb_cp_05 DROP DEFAULT PARTITION; ALTER TABLE tb_cp_05 ADD PARTITION START (date '2014-01-01') INCLUSIVE END (date '2014-02-01') EXCLUSIVE;
The original partition does not contain subpartition template design:
ALTER TABLE tb_cp_05 ADD PARTITION START (date '2014-02-01') INCLUSIVE END (date '2014-03-01') EXCLUSIVE ( SUBPARTITION usa VALUES ('usa'), SUBPARTITION asia VALUES ('asia'), SUBPARTITION europe VALUES ('europe') );
Note: if there is a default partition, you can only split the new partition from the default partition.
(2) rename partition
GP The length of the object is limited to 63 characters and is bound by uniqueness. The name format of the subtable:
<The parent table name is > _< partition level > _prt_< partition name >
Modifying the name of the parent table will affect all partition tables
# The corresponding partition table will be changed to: tbcp05_1_prt_5ALTER TABLE tb_cp_05 rename to tbcp05;
Only modify the partition name:
# The corresponding partition table will be changed to: tbcp05_1_prt_jun13ALTER TABLE tbcp05 RENAME PARTITION FOR ('2013-06-01') TO Jun13;
(3) delete a partition
# Delete the specified partitionALTER TABLE tb_cp_04 DROP PARTITION other; # Delete the default partition:ALTER TABLE tb_cp_04 DROP DEFAULT PARTITION; # For multilevel partitioned tables, delete the default partition for each partition of the same layer:ALTER TABLE tb_cp_06 ALTER PARTITION FOR (RANK(1)) DROP DEFAULT PARTITION; ALTER TABLE tb_cp_06 ALTER PARTITION FOR (RANK(2)) DROP DEFAULT PARTITION;
(4) Add the default partition
# Add the default partition with the ALTER TABLE command:ALTER TABLE tbcp05 ADD DEFAULT PARTITION other; # If it is a multilevel partition table, the default partition is required on the same level for each partition.ALTER TABLE tb_cp_06 ALTER PARTITION FOR (RANK(1)) ADD DEFAULT PARTITION other; ALTER TABLE tb_cp_06 ALTER PARTITION FOR (RANK(2)) ADD DEFAULT PARTITION other;
(5) Emptying partition data
# Use the ALTER TABLE command to empty the partition.ALTER TABLE tbcp05 TRUNCATE PARTITION FOR (RANK (1));
(6) Exchange partition
The swap partition uses an ordinary TABLE to exchange identity with existing partitions. Use the ALTER TABLE command to exchange partitions. Only the lowest level partition table can be exchanged.
CREATE TABLE jan13(LIKE tb_cp_02) WITH(appendonly=true); INSERT INTO jan13 VALUES(1,'2013-01-15',123.45); ALTER TABLE tb_cp_02 EXCHANGE PARTITION for(date '2013-01-01') WITH TABLE jan13;
(7) Split partition
Use the ALTER TABLE command to split an existing partition into two. For example, divide the partition data of one month into a 1-15 day partition and another 16-31 day partition.
ALTER TABLE tb_cp_02 SPLIT PARTITION FOR('2013-01-01') AT ('2013-01-16') INTO (PARTITION jan131to15, PARTITION jan0816to31);
If the partition table has a default partition, the new partition can only be split from the default partition.
ALTER TABLE tb_cp_03 SPLIT DEFAULT PARTITION START (2014) INCLUSIVE END (2015) EXCLUSIVE INTO (PARTITION y2014, DEFAULT PARTITION);
(8) Modify subpartition template
Use the ALTER TABLE SET SUBPARTITION TEMPLATE command to modify the sub partition template of the existing partition table. For example:
ALTER TABLE tb_cp_05 SET SUBPARTITION TEMPLATE ( SUBPARTITION usa VALUES('usa'), SUBPARTITION africa VALUES('africa'), DEFAULT SUBPARTITION other );
Add a partition to table tb_cp_05 after using the new template.
ALTER TABLE tb_cp_05 ADD PARTITION Feb14 START ('2014-02-01') INCLUSIVE END('2014-03-01') EXCLUSIVE;
(1) Load partition table
- The top-level table in the partition table is empty, and the data is stored in the bottom table.
- To avoid data loading failure, you can choose to define the default partition.
- When the partition table is querying, the default partition will always be scanned. If the default partition contains data, the query efficiency will be affected.
- When data is loaded to the parent table using COPY or INSERT, data is automatically routed to the correct partition.
- The method of swap partition can be considered to directly reload data to sub tables to improve performance.
(2) Validation partition strategy
# EXPLAINSee whether the query plan has scanned the related partitionsEXPLAIN SELECT * FROM tb_cp_05 WHERE date= '2013-12-01' AND region= 'USA';
(3) Restriction of selective scanning of partitions
If the query plan shows that the partition table is not selectively scanned, it may be related to the following restrictions:
- The query plan can only provide stable comparison operators, such as:
=, <, <=, >, >=, <>
- The query plan does not identify unstable functions to perform selective scanning. For example, the use of date> clauses in the WHERE clause will enable the query plan to perform zoning scans while time> TIMEOFDAY will not.
CREATE SEQUENCE myserial START 101; -- Create an index SELECT * FROM myserial; -- Query index SELECT setval(‘myserial’,201); -- catalog index INSERT INTO tb02 VALUES(nextval(‘myserial’), ‘abc’); -- Use index ALTER SEQUENCE myserial RESTART WITH 105; -- Modify the index DROP SEQUENCE myserial; -- Delete index
Note: if the mirror function is enabled, nextval is not allowed to be used in UPDATE and DELETE statements.
In distributed databases such as GP, indexes should be used conservatively. GP automatically creates the primary key index for the primary key and ensures that the index is actually used in query workload (such as table Association, query).
GPThe two commonly used types: B-tree and Bitmap indexes.
CREATE INDEX idx_01 ON tb_cp_02(id); -- Create an B-tree index CREATE INDEX bmidx_01 ON tb_cp_02 USING BITMAP(date); -- create bitmap index REINDEX TABLE tb_cp_02; -- Reconstructing all indexes REINDEX INDEX bmidx_01; -- Reconstructing a specific index DROP INDEX bmidx_01; -- Delete index, when loading data, usually delete index first, then load data.
CREATE VIEW vw_01 AS SELECT * FROM tb_cp_03 WHERE gender=‘F’; -- Create a view DROP VIEW vw_01; -- Delete view
eight, data management
8.1 GP transaction management”
- GP Using Mutltiversion Concurrency Control/MVCC to maintain data consistency
- MVCC In order to avoid explicit locking of database transactions, minimize lock contention to ensure performance in a multi-user environment.
- GP Various locking mechanisms are provided to control concurrent access to table data.
- GP Transaction isolation for each transaction
(1) Table 1: GP concurrency control lock mode
|Lock mode||Related SQL command||Conflict lock|
|ACCESS SHARE||SELECT||ACCESS EXCLUSIVE|
|ROW SHARE||SELECT FOR UPDATE, SELECT FOR SHARE||EXCLUSIVE, ACCESS EXCLUSIVE|
|ROW EXCLUSIVE||INSERT, COPY||SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE|
|SHARE UPDATE EXCLUSIVE||VACUUM (without FULL), ANALYZE||SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE|
|SHARE||CREATE INDEX||ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE|
|SHARE ROW EXCLUSIVE||ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE|
|EXCLUSIVE||DELETE, UPDATE||ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE|
(2) Transaction use
Transaction allows multiple SQL statements to be put together as a whole operation, and all SQL together succeed or fail.
The SQL command to execute the transaction in GP:
- Use BEGIN or START TRANSACTION to start a transaction block.
- Commit transaction blocks using END or COMMIT
- Use ROLLBACK to rollback transactions without committing any changes.
- Use SAVEPOINT to selectively save transaction points, and then use ROLLBACK TO SAVEPOINT to rollback to previously saved transactions.
begin insert into tb_cp_01 values(); savepoint p1; insert into tb_cp_01 values(); roleback to p1; end
(3) Transaction isolation level
SQL The standard defines 4 transaction isolation levels:
- Uncommitted reading: equivalent to submission read in GP
- Submitted read (default): when the transaction uses the isolation level, the SELECT query can only see the data before the query starts, and it will never read the changes that other concurrent transactions have not submitted or submitted during the SELECT query.
- Repeatable read: in GP with the equating
- Serializable: This is a strict transaction isolation level. This level requires transactions to be executed sequentially, that is, transactions must be executed one after another instead of parallel.
8.2 recycling space and analysis
Transaction ID management is necessary to execute VACUUM for each table at every 2 million transactions per database.
- System directory maintenance: a large number of CREATE and DROP commands can cause the system table to expand rapidly, affecting the performance of the system.
- Because of the MVCC transaction concurrency model, records that have been deleted or updated still occupy disk space.
- If the database has a large number of update and delete operations, a large number of expired records will be generated.
- Regular operation of VACUUM command can delete expiration records and recycle space.
- Setting parameters of free mapping space: max_fsm_pages, max_fsm_relations
VACUUM tb_cp_02; -- Recycling space ANALYZE tb_cp_02; -- Collecting statistical information required by the query optimizer
8.3 daily reconstruction index”
- For the B-tree index, the newly rebuilt index is faster than the index with more updates.
- The rebuilt index can reclaim overdue space
- In GP, deleting an index and creating it is usually faster than REINDEX.
- When updating index columns, the Bitmap index will not be updated.
8.4 GPDB log file
The server log files are stored in the pg_log directory of each instance data directory. Format: gpdb-YYYY-MM-DD_TIME.csv
Record a little bit by heart every day. The content may not be important, but the habit is very important!