Oracle 12c – peak detection with MATCH_RECOGNIZE


By Franck Pachot

This post is part of a series of small examples of recent features. I’m running this in the Oracle 20c preview in the Oracle Cloud. I’ll show a very basic example of “Row Pattern Recognition” (the MATCH_RECOGNIZE clause in a SELECT, documented by Oracle as the “row pattern matching in native SQL” feature). You may be afraid of those names. Of course, because SQL is a declarative language, there is a small learning curve to get beyond this abstraction. Understanding procedurally how it works may help. But once you understand the declarative nature, it is really powerful. This post starts simple, on a simple table with a time series, where I just want to detect peaks (the points where the value goes up and then down).

Historically, a SELECT statement operated on single rows (JOIN, WHERE, SELECT) within a set, or on an aggregation of rows (GROUP BY, HAVING) to provide a summary. Analytic functions can operate on windows of rows (PARTITION BY, ORDER BY, ROWS BETWEEN,…) where you keep the detailed level of rows and compare it to the aggregated values of the group. A row can then look at its neighbours and, when needing to go further, the SQL MODEL clause can build the equivalent of spreadsheet cells to reference other rows and columns. As in a spreadsheet, you can also PIVOT to move row detail to columns or vice versa. All that can be done in SQL, which means that you don’t code how to do it but just define the result you want. However, there’s something that is easy to do in a spreadsheet application like Excel but not easy to code with analytic functions: looking at a chart, as a line graph, to detect some behaviour. That’s something we can code in SQL with MATCH_RECOGNIZE.

For example, from the “COVID” table I have imported in the previous post I want to see each peak of covid-19 cases in Switzerland:

I did this manually in Excel: showing all labels but keeping only those that are at a peak, whether it is a small peak or a high one. There’s one value per day in this time series but I’m not interested in the intermediate values. Only peaks. So, this was done from the .csv downloaded from http://opendata.ecdc.europa.eu/covid19/casedistribution/csv/, read through an external table and imported into an Oracle table in the previous post (Oracle 18c – select from a flat file).

Ok, let’s show the result directly. Here is a small SQL statement that shows me exactly those peaks, each match being numbered:


SQL> select countriesandterritories "Country","Peak date","Peak cases","match#"
  2  from covid
  3  match_recognize (
  4   partition by continentexp, countriesandterritories order by daterep
  5   measures
  6    match_number() as "match#",
  7    last(GoingUp.dateRep) as "Peak date",
  8    last(GoingUp.cases) as "Peak cases"
  9   one row per match
 10   pattern (GoingUp+ GoingDown+)
 11   define
 12    GoingUp as ( GoingUp.cases > prev(GoingUp.cases) ),
 13    GoingDown as ( GoingDown.cases < prev(GoingDown.cases))
 14  )
 15  where countriesandterritories='Switzerland';

       Country    Peak date    Peak cases    match#
______________ ____________ _____________ _________
Switzerland    26-FEB-20                1         1
Switzerland    28-FEB-20                7         2
Switzerland    07-MAR-20              122         3
Switzerland    09-MAR-20               68         4
Switzerland    14-MAR-20              267         5
Switzerland    16-MAR-20              841         6
Switzerland    18-MAR-20              450         7
Switzerland    22-MAR-20             1237         8
Switzerland    24-MAR-20             1044         9
Switzerland    28-MAR-20             1390        10
Switzerland    31-MAR-20             1138        11
Switzerland    03-APR-20             1124        12
Switzerland    08-APR-20              590        13
Switzerland    10-APR-20              785        14
Switzerland    16-APR-20              583        15
Switzerland    18-APR-20              346        16
Switzerland    20-APR-20              336        17
Switzerland    24-APR-20              228        18
Switzerland    26-APR-20              216        19
Switzerland    01-MAY-20              179        20
Switzerland    09-MAY-20               81        21
Switzerland    11-MAY-20               54        22
Switzerland    17-MAY-20               58        23
Switzerland    21-MAY-20               40        24
Switzerland    24-MAY-20               18        25
Switzerland    27-MAY-20               15        26
Switzerland    29-MAY-20               35        27
Switzerland    06-JUN-20               23        28


28 rows selected.

Doing that with analytic functions or MODEL clause is possible, but not easy.
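To give an idea of the alternative, here is a hedged sketch (my own, not from the original post) of a rough equivalent with analytic functions, comparing each row to its neighbours with LAG and LEAD. It finds the local maxima, but it quickly gets harder to reason about when several consecutive “up” or “down” days have to be walked through, which is exactly what the declarative GoingUp+ GoingDown+ pattern expresses:


-- hedged sketch: a peak is a value higher than both its neighbours
select daterep "Peak date", cases "Peak cases"
from (
  select daterep, cases,
         lag(cases)  over (order by daterep) as prev_cases,
         lead(cases) over (order by daterep) as next_cases
  from covid
  where countriesandterritories='Switzerland'
)
where cases > prev_cases  -- it was going up to this point
  and cases > next_cases  -- and goes down just after
order by daterep;
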

So let’s explain the clauses in this simple example.

Define

I need to define what a peak is. For that, I define two primary patterns. The value I’m looking for, which is the one you see on the graph, is the column “CASES”: the number of covid-19 cases for the day and country. How do you detect peaks visually? Like when hiking in mountains: it goes up and, when you continue, it goes down. Here are those two primary patterns:


 11   define
 12    GoingUp as ( GoingUp.cases > prev(GoingUp.cases) ),
 13    GoingDown as ( GoingDown.cases < prev(GoingDown.cases))

“GoingUp” matches a row where the “cases” value is higher than in the preceding row, and “GoingDown” matches a row where “cases” is lower than in the preceding one. The sense of “preceding”, of course, depends on an order, as with analytic functions. We will see it below.

Pattern

A peak is when a row matches GoingDown just after matching GoingUp. That’s simple, but you can imagine crazy things that a data scientist would want to recognize. MATCH_RECOGNIZE defines patterns in a similar way to regular expressions: mentioning the primary patterns in a sequence, with some modifiers. Mine is very simple:


 10   pattern (GoingUp+ GoingDown+)

This means: one or more GoingUp followed by one or more GoingDown. This is exactly what I did in the graph above: ignore intermediate points. Each primary pattern compares a row only with the preceding one, and the engine walks through consecutive rows, matching this sequence of comparisons against the pattern.
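Because the pattern is a regular expression, the quantifiers can be adjusted without touching the rest of the query. For example, a hypothetical variation (not in the original post) that keeps only the peaks followed by at least three consecutive days of decline:


 pattern (GoingUp+ GoingDown{3,})
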

Partition by

As mentioned, I follow the rows in order. For a time series, this is simple: the key here is the country, I partition by continent and country, and the order (x-axis) is the date. I’m looking at the peaks per country when the value (“cases”) is ordered by date (“daterep”):


  2  from covid
...
  4   partition by continentexp, countriesandterritories order by daterep
...
 15* where countriesandterritories='Switzerland';

I selected only my country here with a standard WHERE clause, to keep things simple.

Measures

Each time a pattern is recognized, I want to display only one row (“ONE ROW PER MATCH”) with some measures for it. Of course, I must access the points I’m interested in: the x-axis date and the y-axis value. I can reference points within the matching window, using the pattern variables to reference them. The peak is the last row matched by the “GoingUp” primary pattern, so last(GoingUp.dateRep) and last(GoingUp.cases) are my points:


  5   measures
  6    match_number() as "match#",
  7    last(GoingUp.dateRep) as "Peak date",
  8    last(GoingUp.cases) as "Peak cases"
  9   one row per match

Those measures are accessible in the SELECT clause of my SQL statement. I added the match_number() to identify the points.

Here is the final query, with the partition, measures, pattern and define clauses within the MATCH_RECOGNIZE():


select countriesandterritories "Country","Peak date","Peak cases","match#"
from covid
match_recognize (
 partition by continentexp, countriesandterritories order by daterep
 measures
  match_number() as "match#",
  last(GoingUp.dateRep) as "Peak date",
  last(GoingUp.cases) as "Peak cases"
 one row per match
 pattern (GoingUp+ GoingDown+)
 define
  GoingUp as ( GoingUp.cases > prev(GoingUp.cases) ),
  GoingDown as ( GoingDown.cases < prev(GoingDown.cases))
)
where countriesandterritories='Switzerland';

The full syntax offers much more and, of course, it is all documented: https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8982
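To illustrate that richer syntax, here is a hedged variation of the same query (my own, not from the post), with FIRST() and COUNT() measures and the AFTER MATCH SKIP clause written explicitly (SKIP PAST LAST ROW is the default behaviour used above):


select countriesandterritories "Country","Climb start","Peak date","Peak cases","Days up","Days down"
from covid
match_recognize (
 partition by continentexp, countriesandterritories order by daterep
 measures
  first(GoingUp.dateRep) as "Climb start",
  last(GoingUp.dateRep)  as "Peak date",
  last(GoingUp.cases)    as "Peak cases",
  count(GoingUp.*)       as "Days up",
  count(GoingDown.*)     as "Days down"
 one row per match
 after match skip past last row
 pattern (GoingUp+ GoingDown+)
 define
  GoingUp as ( GoingUp.cases > prev(GoingUp.cases) ),
  GoingDown as ( GoingDown.cases < prev(GoingDown.cases))
)
where countriesandterritories='Switzerland';
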

Debug mode

In order to understand how it works (and debug), we can display “all rows” (ALL ROWS PER MATCH instead of ONE ROW PER MATCH in line 9) and add the row columns (DATEREP and CASES in line 1). In addition to match_number(), I have added the classifier() measure:


  1  select countriesandterritories "Country","Peak date","Peak cases","match#",daterep,cases,"classifier"
  2  from covid
  3  match_recognize (
  4   partition by continentexp, countriesandterritories order by daterep
  5   measures
  6    match_number() as "match#", classifier() as "classifier",
  7    last(GoingUp.dateRep) as "Peak date",
  8    last(GoingUp.cases) as "Peak cases"
  9   all rows per match
 10   pattern (GoingUp+ GoingDown+)
 11   define
 12    GoingUp as ( GoingUp.cases > prev(GoingUp.cases) ),
 13    GoingDown as ( GoingDown.cases < prev(GoingDown.cases))
 14  )
 15* where countriesandterritories='Switzerland';

“all rows per match” shows all rows where pattern matching is tested, and classifier() shows which primary pattern is matched.

Here are the rows around the 10th match. Keep in mind that rows are processed in order and, for each row, the engine looks ahead to recognize the pattern.


       Country    Peak date    Peak cases    match#      DATEREP    CASES    classifier
______________ ____________ _____________ _________ ____________ ________ _____________
...
Switzerland    24-MAR-20             1044         9 24-MAR-20        1044 GOINGUP
Switzerland    24-MAR-20             1044         9 25-MAR-20         774 GOINGDOWN
Switzerland    26-MAR-20              925        10 26-MAR-20         925 GOINGUP
Switzerland    27-MAR-20             1000        10 27-MAR-20        1000 GOINGUP
Switzerland    28-MAR-20             1390        10 28-MAR-20        1390 GOINGUP
Switzerland    28-MAR-20             1390        10 29-MAR-20        1048 GOINGDOWN
Switzerland    30-MAR-20             1122        11 30-MAR-20        1122 GOINGUP
Switzerland    31-MAR-20             1138        11 31-MAR-20        1138 GOINGUP              
Switzerland    31-MAR-20             1138        11 01-APR-20         696 GOINGDOWN  
Switzerland    02-APR-20              962        12 02-APR-20         962 GOINGUP
Switzerland    03-APR-20             1124        12 03-APR-20        1124 GOINGUP
Switzerland    03-APR-20             1124        12 04-APR-20        1033 GOINGDOWN

You see here how we came to output the 10th match (28-MAR-20, 1390 cases). After the peak of 24-MAR-20 we were going down the next day, 25-MAR-20 (look at the graph). This was included in the 9th match because of the regular expression “GoingDown+”. Then it goes up from 26-MAR-20 to 28-MAR-20, which matches GoingUp+, followed by a “GoingDown” on 29-MAR-20, which means that the 10th match is recognized. It would continue for all “GoingDown+” rows, but there’s only one here as the next value is higher: 1122 > 1048, so the 10th match is closed on 29-MAR-20. This is where the ONE ROW PER MATCH row is returned, when processing the row from 29-MAR-20, with the values from the last row classified as GOINGUP, as defined in the measures: 28-MAR-20 and 1390. And then pattern matching continues after this row: a GoingUp is detected on 30-MAR-20, starting the 11th match…

If you want to go further, there are good examples from Lucas Jellema: https://technology.amis.nl/?s=match_recognize
And about its implementation in SQL engines, read Markus Winand: https://modern-sql.com/feature/match_recognize

And I’ll probably have more blog posts here in this series about recent features interesting for BI and DWH…

The article Oracle 12c – peak detection with MATCH_RECOGNIZE appeared first on the dbi services Blog.


Oracle 12c – reorg and split table with clustering


By Franck Pachot

In this series of small examples on recent features, I imported, in a previous post, the covid-19 statistics per day and per country. This is typical of data that arrives as a time series ordered by date, because this is how it is generated day after day, but that you will probably want to query along another dimension, like per country.

If you want to ingest data faster, you keep it in the order of arrival and insert it into heap table blocks. If you want to optimize future queries on the other dimension, you may load it into a table with a specialized organization where each row has its place: an Index Organized Table, a Hash Cluster, a partitioned table, or a combination of those. With Oracle we are used to storing data without the need to reorganize it: it is a multi-purpose database. But since 12c we have many features that make this reorganization easier, like partitioning, online move and online split. We can then think about a two-phase lifecycle for some operational tables that are used later for analytics:

  • Fast ingest and query on a short time window: we insert data on the flow, with conventional inserts, into a conventional heap table. Queries on recent data are fast as the rows are colocated as they arrived.
  • Optimal query on history: regularly, we physically reorganize the latest ingested rows to be clustered on another dimension, because we will query a large time range on this other dimension.

Partitioning is the way to do those operations. We can have a weekly partition for the current week. When the week is over, new rows will go to a new partition (11g PARTITION BY RANGE … INTERVAL) and we can optionally merge the old partition with the one containing older data, per month or year for example, to get larger time ranges for the past data. This merge is easy (18c MERGE PARTITIONS … ONLINE). And while doing that, we can reorganize rows to be clustered together. This is what I’m doing in this post.
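Here is a hedged sketch of that idea (hypothetical table and partition names, not from the post): interval partitioning per week for the ingestion phase, and an online merge of completed weeks into a larger range later:


-- weekly interval partitioning for the ingestion phase
create table cases_ingest (
  daterep date,
  geoid   varchar2(10),
  cases   number
)
partition by range (daterep) interval (numtodsinterval(7,'day'))
( partition p_old values less than (date '2020-01-01') );

-- later, the weekly partitions containing these dates could be merged online (18c),
-- for example something like:
-- alter table cases_ingest
--   merge partitions for (date '2020-03-02'), for (date '2020-03-09')
--   into partition p_2020_w10_w11 online;
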

Partitioning

From the table I created in the previous post, I create an index on GEOID (as the goal is to query by country) and I partition it by range on DATEREP:


SQL> create index covid_geoid on covid(geoid);

Index created.

SQL> alter table covid modify partition by range(daterep) interval (numToYMinterval(1,'year')) ( partition old values less than (date '2020-01-01') , partition new values less than (date '2021-01-01') ) online;

Table altered.

This is an online operation in 12cR2. I now have two partitions, one for “old” data and one for “new” data.

I query all dates for one specific country:


SQL> select trunc(daterep,'mon'), max(cases) from covid where geoid='US' group by trunc(daterep,'mon') order by 1
  2  /
   TRUNC(DATEREP,'MON')    MAX(CASES)
_______________________ _____________
01-DEC-19                           0
01-JAN-20                           3
01-FEB-20                          19
01-MAR-20                       21595
01-APR-20                       48529
01-MAY-20                       33955
01-JUN-20                       25178

This reads rows scattered throughout the whole table because they were inserted day after day.

This is visible in the execution plan: the optimizer does not use the index but a full table scan:


SQL> select * from dbms_xplan.display_cursor(format=>'+cost iostats last')
  2  /
                                                                                       PLAN_TABLE_OUTPUT
________________________________________________________________________________________________________
SQL_ID  2nyu7m59d7spv, child number 0
-------------------------------------
select trunc(daterep,'mon'), max(cases) from covid where geoid='US'
group by trunc(daterep,'mon') order by 1

Plan hash value: 4091160977

-----------------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |    55 (100)|      7 |00:00:00.01 |     180 |
|   1 |  SORT ORDER BY       |       |      1 |     77 |    55   (4)|      7 |00:00:00.01 |     180 |
|   2 |   PARTITION RANGE ALL|       |      1 |     77 |    55   (4)|      7 |00:00:00.01 |     180 |
|   3 |    HASH GROUP BY     |       |      2 |     77 |    55   (4)|      7 |00:00:00.01 |     180 |
|*  4 |     TABLE ACCESS FULL| COVID |      2 |    105 |    53   (0)|    160 |00:00:00.01 |     180 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("GEOID"='US')

This has read 180 blocks, with multiblock reads.

I force the access by index in order to compare the cost:


SQL> select /*+ index(covid) */ trunc(daterep,'mon'), max(cases) from covid where geoid='US' group by trunc(daterep,'mon') order by 1
  2  /

   TRUNC(DATEREP,'MON')    MAX(CASES)
_______________________ _____________
01-DEC-19                           0
01-JAN-20                           3
01-FEB-20                          19
01-MAR-20                       21595
01-APR-20                       48529
01-MAY-20                       33955
01-JUN-20                       25178

SQL> select * from dbms_xplan.display_cursor(format=>'+cost iostats last')
  2  /
                                                                                                                     PLAN_TABLE_OUTPUT
______________________________________________________________________________________________________________________________________
SQL_ID  2whykac7cnjks, child number 0
-------------------------------------
select /*+ index(covid) */ trunc(daterep,'mon'), max(cases) from covid
where geoid='US' group by trunc(daterep,'mon') order by 1

Plan hash value: 2816502185

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |             |      1 |        |    95 (100)|      7 |00:00:00.01 |     125 |
|   1 |  SORT ORDER BY                               |             |      1 |     77 |    95   (3)|      7 |00:00:00.01 |     125 |
|   2 |   HASH GROUP BY                              |             |      1 |     77 |    95   (3)|      7 |00:00:00.01 |     125 |
|   3 |    TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| COVID       |      1 |    105 |    93   (0)|    160 |00:00:00.01 |     125 |
|*  4 |     INDEX RANGE SCAN                         | COVID_GEOID |      1 |    105 |     1   (0)|    160 |00:00:00.01 |       2 |
-----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("GEOID"='US')

Even if the number of blocks is a bit smaller, 125 blocks, they are single block reads and the cost is therefore higher: 95 for the index access versus 55 for the full table scan. Using hints and comparing the costs is how I often try to understand the optimizer’s choice, and here the reason is clear: because the rows are scattered, the clustering factor of the index access is really bad.
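That clustering factor is visible in the dictionary, and it is what I will watch after the reorganization (a quick look on the standard USER_INDEXES view):


select index_name, clustering_factor, leaf_blocks, num_rows
from user_indexes where table_name='COVID';
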

I said that I want to merge the partitions, and maybe reorganize with an online table move. But now, for this second phase of the lifecycle, I want to cluster rows on the country dimension rather than on the arrival date.

Attribute clustering

This preference can be declared on the table with 12c Attribute Clustering:


SQL> alter table covid add clustering by linear order (continentexp, countriesandterritories);

Table altered.

You see that I can mention multiple columns and I don’t need to use the GEOID column that I will use to query. This is not an index. This is just a preference to cluster rows and, as they are clustered on the country name, they will also be clustered on continent, country code, geoid,… I have chosen those columns for clarity when reading the DDL:


SQL> exec dbms_metadata.set_transform_param(DBMS_METADATA.SESSION_TRANSFORM,'SEGMENT_ATTRIBUTES',false);

PL/SQL procedure successfully completed.

SQL> ddl covid

  CREATE TABLE "COVID"
   (    "DATEREP" DATE,
        "N_DAY" NUMBER,
        "N_MONTH" NUMBER,
        "N_YEAR" NUMBER,
        "CASES" NUMBER,
        "DEATHS" NUMBER,
        "COUNTRIESANDTERRITORIES" VARCHAR2(50),
        "GEOID" VARCHAR2(10),
        "COUNTRYTERRITORYCODE" VARCHAR2(3),
        "POPDATA2018" NUMBER,
        "CONTINENTEXP" VARCHAR2(10)
   )
 CLUSTERING
 BY LINEAR ORDER ("COVID"."CONTINENTEXP",
  "COVID"."COUNTRIESANDTERRITORIES")
   YES ON LOAD  YES ON DATA MOVEMENT
 WITHOUT MATERIALIZED ZONEMAP
  PARTITION BY RANGE ("DATEREP") INTERVAL (NUMTOYMINTERVAL(1,'YEAR'))
 (PARTITION "OLD"  VALUES LESS THAN (TO_DATE(' 2020-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) ,
 PARTITION "NEW"  VALUES LESS THAN (TO_DATE(' 2021-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) ) ;

  CREATE INDEX "COVID_GEOID" ON "COVID" ("GEOID")
  ;

As you can see, the default is YES ON LOAD, which means that direct-path inserts will cluster rows, and YES ON DATA MOVEMENT, which is why merging partitions will also cluster rows.

I’ve done that afterwards here, but this is something you can do at table creation. You mention on which attributes you want to cluster and when: direct-path inserts (YES ON LOAD) and/or table reorganization (YES ON DATA MOVEMENT). This is defined at the table level. Beyond those defaults, table reorganizations (ALTER TABLE … MOVE, ALTER TABLE … MERGE PARTITIONS) can explicitly DISALLOW CLUSTERING or ALLOW CLUSTERING.
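For example, here is a hedged sketch of declaring the same preference directly at table creation (hypothetical table name, same clustering clause as in the DDL above):


create table covid_history (
  daterep                  date,
  geoid                    varchar2(10),
  countriesandterritories  varchar2(50),
  continentexp             varchar2(10),
  cases                    number
)
clustering by linear order (continentexp, countriesandterritories)
yes on load yes on data movement;
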

Move Partition

When I have ingested some data and think that it would be better to cluster it, maybe once this partition is complete and new inserts go to a higher interval, I can reorganize it with a simple ALTER TABLE … MOVE:


SQL> alter table covid move partition new online allow clustering;

Table altered.

This will cluster rows together on the clustering attributes. I mentioned ALLOW CLUSTERING to show the syntax but it is the default (YES ON DATA MOVEMENT) anyway here.

At that point, you may also want to compress the old partitions with basic compression (the compression that does not require an additional option but is possible only with bulk load or data movement). However, be careful: the combination of an online operation and basic compression requires the Advanced Compression Option. More info in a previous post on the “Segment Maintenance Online Compress” feature usage.
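As a hedged illustration only (check the licensing in your context): basic compression is fine with an offline move, and it is the online variant that requires the Advanced Compression Option:


-- offline move with basic compression (no extra licensed option needed)
alter table covid move partition old compress;
-- the online variant requires the Advanced Compression Option:
-- alter table covid move partition old compress online;
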

Merge Partition

As my goal is to cluster data on a different dimension than the time one, I may want larger partitions for the past: something like the current partition holding a week of data at most, but the past partitions covering quarterly or yearly ranges. That can be done with partition merging, which is an online operation in 18c (and note that I have a global index here, and an online operation does not invalidate indexes):


SQL> alter table covid merge partitions old,new into partition oldmerged online allow clustering;

Table altered.

This is a row movement and clustering on data movement is enabled. Again I mentioned ALLOW CLUSTERING just to show the syntax.

Let’s see the number of buffers read now with index accesss. The statistics of the index (clustering factor) has not been updated, so the optimizer may not choose the index access yet (until dbms_stats runs on stale tables). I’m forcing with an hint:


SQL> select /*+ index(covid) */ trunc(daterep,'mon'), max(cases) from covid where geoid='US' group by trunc(daterep,'mon') order by 1;

   TRUNC(DATEREP,'MON')    MAX(CASES)
_______________________ _____________
01-DEC-19                           0
01-JAN-20                           3
01-FEB-20                          19
01-MAR-20                       21595
01-APR-20                       48529
01-MAY-20                       33955
01-JUN-20                       25178

SQL> select * from dbms_xplan.display_cursor(format=>'+cost iostats last')
  2  /
                                                                                                                     PLAN_TABLE_OUTPUT
______________________________________________________________________________________________________________________________________
SQL_ID  2whykac7cnjks, child number 0
-------------------------------------
select /*+ index(covid) */ trunc(daterep,'mon'), max(cases) from covid
where geoid='US' group by trunc(daterep,'mon') order by 1

Plan hash value: 2816502185

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |             |      1 |        |    95 (100)|      7 |00:00:00.01 |       8 |
|   1 |  SORT ORDER BY                               |             |      1 |     77 |    95   (3)|      7 |00:00:00.01 |       8 |
|   2 |   HASH GROUP BY                              |             |      1 |     77 |    95   (3)|      7 |00:00:00.01 |       8 |
|   3 |    TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| COVID       |      1 |    105 |    93   (0)|    160 |00:00:00.01 |       8 |
|*  4 |     INDEX RANGE SCAN                         | COVID_GEOID |      1 |    105 |     1   (0)|    160 |00:00:00.01 |       5 |
-----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("GEOID"='US')
       filter(TBL$OR$IDX$PART$NUM(,0,8,0,"COVID".ROWID)=1)

The cost has not changed (because of the statistics) but the number of buffers read is minimal: only the 8 buffers where all my rows for this country are clustered. Remember that I clustered on the country name but use GEOID in my predicate here. That doesn’t matter as long as the rows are together.

Asynchronous global index maintenance

Note the strange predicate TBL$OR$IDX$PART$NUM(,0,8,0,"COVID".ROWID)=1 that results from another 12c feature where global indexes are kept usable during partition maintenance (which is required for an online operation) but are optimized to be cleaned out asynchronously later. This is visible from DBA_INDEXES:


SQL> select index_name,to_char(last_analyzed,'hh24:mi:ss') last_analyzed,clustering_factor,orphaned_entries from user_indexes where table_name='COVID';

    INDEX_NAME    LAST_ANALYZED    CLUSTERING_FACTOR    ORPHANED_ENTRIES
______________ ________________ ____________________ ___________________
COVID_GEOID    08:33:34                        19206 YES

Orphaned entries mean that some entries in the global index may reference the dropped segment after my MOVE or MERGE, and the query has to ignore them.

Those ranges of ROWIDs are determined from the segments concerned, stored in the dictionary:


SQL> select * from sys.index_orphaned_entry$;
   INDEXOBJ#    TABPARTDOBJ#    HIDDEN
____________ _______________ _________
       79972           79970 O
       79972           79971 O
       79972           79980 O
       79972           79973 O

HIDDEN='O' means Orphaned, and the ROWIDs addressing these partitions are filtered out from the dirty index entries by the predicate filter(TBL$OR$IDX$PART$NUM(,0,8,0,"COVID".ROWID)=1) above.

This maintenance of the dirty index will be done during the maintenance window, but I can do it immediately to finish my reorganization properly:


SQL> alter index COVID_GEOID coalesce cleanup;

Index altered.

SQL> select index_name,to_char(last_analyzed,'hh24:mi:ss') last_analyzed,clustering_factor,orphaned_entries from user_indexes where table_name='COVID';

    INDEX_NAME    LAST_ANALYZED    CLUSTERING_FACTOR    ORPHANED_ENTRIES
______________ ________________ ____________________ ___________________
COVID_GEOID    08:33:34                        19206 NO

No orphaned index entries anymore. Note that I could also have called the DBMS_PART.CLEANUP_GIDX procedure to do the same.
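A hedged example of that call (passing the owner and table name positionally, as the exact parameter names may differ between versions):


exec dbms_part.cleanup_gidx(user,'COVID');
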

This is fine for the query but, as the statistics were not updated, the optimizer doesn’t know yet how clustered my table is. In order to complete my reorganization and have queries benefit from it immediately, I gather the statistics:


SQL> exec dbms_stats.gather_table_stats(user,'COVID',options=>'gather auto');

PL/SQL procedure successfully completed.

SQL> select index_name,to_char(last_analyzed,'hh24:mi:ss') last_analyzed,clustering_factor,orphaned_entries from user_indexes where table_name='COVID';

    INDEX_NAME    LAST_ANALYZED    CLUSTERING_FACTOR    ORPHANED_ENTRIES
______________ ________________ ____________________ ___________________
COVID_GEOID    08:38:40                          369 NO

GATHER AUTO gathers only the stale statistics and, as soon as I did my MOVE or MERGE, the index was marked as stale (note that the ALTER INDEX COALESCE does not mark it as stale by itself).

And now my query will use this optimal index without the need for any hint:


SQL_ID  2nyu7m59d7spv, child number 0
-------------------------------------
select trunc(daterep,'mon'), max(cases) from covid where geoid='US'
group by trunc(daterep,'mon') order by 1

Plan hash value: 2816502185

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |             |      1 |        |     7 (100)|      7 |00:00:00.01 |       5 |
|   1 |  SORT ORDER BY                               |             |      1 |    101 |     7  (29)|      7 |00:00:00.01 |       5 |
|   2 |   HASH GROUP BY                              |             |      1 |    101 |     7  (29)|      7 |00:00:00.01 |       5 |
|   3 |    TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| COVID       |      1 |    160 |     5   (0)|    160 |00:00:00.01 |       5 |
|*  4 |     INDEX RANGE SCAN                         | COVID_GEOID |      1 |    160 |     2   (0)|    160 |00:00:00.01 |       2 |
-----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("GEOID"='US')

and, thanks to the coalesce cleanup, there’s no predicate on orphaned ROWIDs anymore.

With this pattern, you may realize that my global index on countries is useful only for past data, not for the recent data that has not been clustered yet. Then, we can even avoid maintaining the index for this partition. We will see that in the next post: it is called partial indexing.

With this pattern, we can even question the need to maintain an index for the old partitions. As all my rows for GEOID='US' are packed into a few contiguous blocks, why not just store the range of ROWIDs rather than the full list? This is called Zone Maps. But it is only available on Exadata, and I like to think about Oracle as a multiplatform database.

Those many features came in recent releases thanks to the development of the Autonomous Database. When the DBA is a cloud provider, whether automated or not, all maintenance must be done online without stopping the application. Those features are the bricks to build automatic lifecycle management and performance optimization.

The article Oracle 12c – reorg and split table with clustering appeared first on the dbi services Blog.

Oracle 12c – global partial index


By Franck Pachot

We have an incredible number of possibilities with Oracle. Yes, an index can be global (indexing many partitions without having to be partitioned itself on the same key) and partial (skipping some of the table partitions where we don’t need indexing). In the previous post of this series of small examples on recent features, I partitioned a table holding covid-19 cases per day and per country, partitioned by range of date by interval. The index on the country code (GEOID) was not very efficient for data ingested per day, because countries are scattered through the whole table. And then I reorganized the old partitions to cluster them on countries.

My global index on country code is defined as:


SQL> create index covid_geoid on covid(geoid);

Index created.

This is efficient, thanks to clustering, except for the new rows coming, again, in time order. Those go to a new partition that is small (the idea in the previous post was to have a short time range for the current partition, and larger ones for the old ones, using ALTER TABLE … MERGE ONLINE to merge the newly aged partition into the others). For the current partition only, it is preferable to full scan it. And we can even avoid maintaining the index entries for this partition, as this will accelerate data ingestion.

I think that partial indexing is well known for local indexes, as it is like marking some index partitions as unusable. But here I’m showing it on a global index.
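For comparison, a hedged sketch of the local flavour (hypothetical index name, not part of this demo): once a table partition is set to INDEXING OFF, as done below, the corresponding partition of a local partial index is simply not maintained, which is easy to check in the dictionary:


create index covid_daterep_local on covid(daterep) local indexing partial;

select partition_name, status
from user_ind_partitions
where index_name='COVID_DATEREP_LOCAL';
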

Splitting partitions

In order to continue from the previous post where I merged all partitions, I’ll split them again, which can be an online operation in 12cR2:


SQL> alter table covid split partition oldmerged at (date '2020-04-01') into (partition old, partition new) online;

Table altered.

SQL> alter index COVID_GEOID coalesce cleanup;

Index altered.

I have two partitions, “old” and “new”, and a global index. I also cleaned up the orphaned index entries to get clean execution plans; it has to be done anyway.

Here is my query, using the index:


SQL> explain plan for select trunc(daterep,'mon'), max(cases) from covid where geoid='US' group by trunc(daterep,'mon') order by 1;

Explained.

SQL> select * from dbms_xplan.display();
                                                                                                              PLAN_TABLE_OUTPUT
_______________________________________________________________________________________________________________________________
Plan hash value: 2816502185

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |             |   101 |  1515 |     6  (34)| 00:00:01 |       |       |
|   1 |  SORT ORDER BY                               |             |   101 |  1515 |     6  (34)| 00:00:01 |       |       |
|   2 |   HASH GROUP BY                              |             |   101 |  1515 |     6  (34)| 00:00:01 |       |       |
|   3 |    TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| COVID       |   160 |  2400 |     4   (0)| 00:00:01 | ROWID | ROWID |
|*  4 |     INDEX RANGE SCAN                         | COVID_GEOID |   160 |       |     1   (0)| 00:00:01 |       |       |
----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("GEOID"='US')

This goes to all partitions, as the ROWID in a global index carries the partition information through the data object id. We see that with Pstart/Pstop=ROWID.

Partial indexing

Now I want to set my global index on countries to be a partial index:


SQL> alter index covid_geoid indexing partial;

Index altered.

This doesn’t change anything for the moment. The indexing of each partition depends on the partition attribute, which is INDEXING ON by default.

I set the “new” partition to not maintain indexes (INDEXING OFF), for this partition only:


SQL> alter table covid modify partition new indexing off;

Table altered.

This means that partial indexes will not reference the “new” partition, whether they are local (which then means no index partition) or global (which then means no index entries for this partition).

And that’s all. Now there will be no overhead in maintaining this index when ingesting new data into this partition.
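The declarations can be checked in the dictionary (a quick look; the INDEXING property is exposed for both the index and the table partitions):


select index_name, indexing from user_indexes where table_name='COVID';

select partition_name, indexing from user_tab_partitions where table_name='COVID';
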

Table Expansion

And then, the optimizer has a transformation to split the execution plan into two branches: one for the index access and one without. This transformation was introduced in 11g for unusable local partitions and is now used even with global indexes:


SQL> explain plan for /*+ index(covid) */ select trunc(daterep,'mon'), max(cases) from covid where geoid='US' group by trunc(daterep,'mon') order by 1;

Explained.

SQL> select * from dbms_xplan.display();
                                                                                                                PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________________________________________
Plan hash value: 1031592504

------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name        | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |             |   321 |  7062 |    37   (6)| 00:00:01 |       |       |
|   1 |  SORT ORDER BY                                 |             |   321 |  7062 |    37   (6)| 00:00:01 |       |       |
|   2 |   HASH GROUP BY                                |             |   321 |  7062 |    37   (6)| 00:00:01 |       |       |
|   3 |    VIEW                                        | VW_TE_2     |   321 |  7062 |    35   (0)| 00:00:01 |       |       |
|   4 |     UNION-ALL                                  |             |       |       |            |          |       |       |
|*  5 |      TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| COVID       |    93 |  1395 |     4   (0)| 00:00:01 |     1 |     1 |
|*  6 |       INDEX RANGE SCAN                         | COVID_GEOID |   160 |       |     1   (0)| 00:00:01 |       |       |
|   7 |      PARTITION RANGE SINGLE                    |             |    68 |  1020 |    27   (0)| 00:00:01 |     2 |     2 |
|*  8 |       TABLE ACCESS FULL                        | COVID       |    68 |  1020 |    27   (0)| 00:00:01 |     2 |     2 |
|   9 |      TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| COVID       |   160 |  4320 |     4   (0)| 00:00:01 | ROWID | ROWID |
|* 10 |       INDEX RANGE SCAN                         | COVID_GEOID |   160 |       |     1   (0)| 00:00:01 |       |       |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("COVID"."DATEREP">=TO_DATE(' 2020-04-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "COVID"."DATEREP"<TO_DATE(' 2021-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
  10 - access("GEOID"='US')
       filter(TBL$OR$IDX$PART$NUM("COVID",0,8,0,ROWID)=1 AND TBL$OR$IDX$PART$NUM("COVID",0,0,65535,ROWID)<>1 AND
              TBL$OR$IDX$PART$NUM("COVID",0,0,65535,ROWID)<>2)

The TABLE ACCESS BY GLOBAL INDEX ROWID is for partition 1, as shown by Pstart/Pstop, which is the “old” one with INDEXING ON. The TABLE ACCESS FULL is for partition 2, the “new” one, which has INDEXING OFF. The optimizer uses predicates on the partition key to select the branch safely.

But this plan also has an additional branch, with this TBL$OR$IDX$PART$NUM again, because I have interval partitioning. With interval partitioning, there is no known Pstop, so the plan has to handle the case where a new partition has been created (with indexing on). The third branch can then access by index ROWID the partitions that are not hardcoded in this plan.

Let’s remove interval partitioning just to get the plan easier to read:


SQL> alter table covid set interval();

Table altered.


SQL> explain plan for select trunc(daterep,'mon'), max(cases) from covid where geoid='US' group by trunc(daterep,'mon') order by 1;

Explained.

SQL> select * from dbms_xplan.display();
                                                                                                                PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________________________________________
Plan hash value: 3529087922

------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name        | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |             |   161 |  3542 |    35   (6)| 00:00:01 |       |       |
|   1 |  SORT ORDER BY                                 |             |   161 |  3542 |    35   (6)| 00:00:01 |       |       |
|   2 |   HASH GROUP BY                                |             |   161 |  3542 |    35   (6)| 00:00:01 |       |       |
|   3 |    VIEW                                        | VW_TE_2     |   161 |  3542 |    33   (0)| 00:00:01 |       |       |
|   4 |     UNION-ALL                                  |             |       |       |            |          |       |       |
|*  5 |      TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| COVID       |    93 |  1395 |     6   (0)| 00:00:01 |     1 |     1 |
|*  6 |       INDEX RANGE SCAN                         | COVID_GEOID |   160 |       |     1   (0)| 00:00:01 |       |       |
|   7 |      PARTITION RANGE SINGLE                    |             |    68 |  1020 |    27   (0)| 00:00:01 |     2 |     2 |
|*  8 |       TABLE ACCESS FULL                        | COVID       |    68 |  1020 |    27   (0)| 00:00:01 |     2 |     2 |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("COVID"."DATEREP"<TO_DATE(' 2020-04-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   6 - access("GEOID"='US')
   8 - filter("GEOID"='US')

Here it is clear: access by index for partition 1 and full table scan for partition 2. This is exactly what I wanted, because I know the clustering factor on the new partition is not very good until I reorganize it (move or merge, as I did in the previous post).

All these features help to manage the lifecycle of data. That’s a completely different approach from purpose-built databases, where you have one database service for fast ingest with simple queries on recent data (NoSQL folks may think about DynamoDB for that), then stream data to a relational database for more OLTP queries (RDS, to continue with the AWS analogy), and move old data into a database dedicated to analytics (that could be Redshift then). With Oracle, which has always been a multi-purpose database, the goal is to avoid duplication and replication and to manage data in place for all usages. Through the 40 years of this database engine, many approaches have been implemented to cluster data: CLUSTER and IOT can sort (or hash) data as soon as it is inserted, in order to put it at its optimal place for future queries. But the agility of heap tables finally wins. Now, with the ease of in-database data movement (partitioning and online operations) and the improvements of full scans (multiblock reads, direct-path reads, storage indexes), we can get the best of both: heap tables with few indexes for fast ingest of current data, reorganized regularly to be clustered, with additional indexes.

I mentioned NoSQL and I mentioned fast ingest. Actually, there’s a feature called Fast Ingest for IoT (lowercase ‘o’ there) that goes further with this idea. Instead of inserting into a persistent segment and reorganizing later, rows are buffered in a ‘memoptimized rowstore’ before going to the heap segment in bulk. But that’s an Exadata feature and I like to think about Oracle as a multiplatform database.

The article Oracle 12c – global partial index appeared first on the dbi services Blog.

No{Join,GroupBy}SQL – Analytic Views for BI


By Franck Pachot

Advocates of NoSQL can query their structures without having to read a data model first, and without writing long table join clauses. They store and query a hierarchical structure without the need to follow relationships, and without the need to join tables on a foreign key in order to get a caption or description from a lookup table. The structure, like an XML or JSON document, provides metadata to understand the structure and map it to business objects. The API is a simple ‘put’ and ‘get’, where you can retrieve a whole hierarchy, with aggregates at all levels, ready to drill down from summary to details, without the need to write sum() functions and group by clauses. For analytics, SQL has improved a lot with window functions and grouping sets but, despite being powerful, this makes the API more complex. And, at a time where the acceptable learning curve should reach its highest point after 42 seconds (like watching the first bits of a video, or getting to the top-voted stackoverflow answer), this complexity cannot be adopted easily.

Is SQL too complex? If it is, then something is wrong. SQL was invented for end-users: to query data like in plain English, without the need to know the internal implementation and the procedural algorithms that can make sense of it. If developers are moving to NoSQL because of the complexity of SQL, then SQL has missed something of its initial goal. If they go to NoSQL because “joins are expensive”, it just means that joins should not be exposed to them, because optimizing access paths and expensive operations is the job of the database optimizer, with the help of the database designer, not of the front-end developer. However, this complexity is unfortunately there. Today, without a good understanding of the data model (entities, relationships, cardinalities), writing SQL queries is difficult. Joining over many-to-many relationships, or missing a group by clause, can give wrong results. When I see a SELECT with a DISTINCT keyword, I immediately think that there’s an error in the query and that the developer, not being certain of the aggregation level he is working on, has masked it with a DISTINCT because understanding the data model was too time-consuming.

In data warehouses, where the database is queried by the end-user, we try to avoid this risk by building simple star schemas, with only one fact table and many-to-one relationships to dimensions. And on top of that, we provide a reporting tool that will generate the queries correctly, so that the end-user does not need to define the joins and aggregations. This requires a layer of metadata on top of the database to describe the possible joins, the aggregation levels, the functions to aggregate measures,… When I was a junior on databases, I was fascinated by those tools. On my first data warehouse, I built a BusinessObjects (v3) universe. It was so simple: define the “business objects”, which are the attributes mapped to the dimension columns. Define the fact measures, with the aggregation functions that can apply. And for the joins, it was like the aliases in the FROM clause, a dimension having multiple roles: think about an airport that can be the destination or the origin of a flight. We then defined multiple objects: all the airport attributes in the destination role, and all the airport attributes as an origin, were different objects for the end-user. Like “origin airport latitude”, rather than “airport latitude” which makes sense only after a join on “origin airport ID”. That simplifies the end-user view on our data a lot: tables are still stored as relational tables, to be joined at query time in order to avoid redundancy, but the view on top of that shows the multiple hierarchies, like in a NoSQL structure, for the ease of simple queries.

But, as I mentioned, this is the main reason for SQL and this should be done in SQL. All these descriptions I put in the BusinessObjects universe should belong to the database dictionary. And that’s finally possible with Analytic Views. Here is an example on the tables I created in a previous post. I am running on the 20c cloud preview, but this also runs on 18c or 19c. After importing the .csv of covid-19 cases per day and country, I built one fact table and snowflake-dimension tables:


create table continents as select rownum continent_id, continentexp continent_name from (select distinct continentexp from covid where continentexp!='Other');
create table countries as select country_id,country_code,country_name,continent_id,popdata2018 from (select distinct geoid country_id,countryterritorycode country_code,countriesandterritories country_name,continentexp continent_name,popdata2018 from covid where continentexp!='Other') left join continents using(continent_name);
create table cases as select daterep, geoid country_id,cases from covid where continentexp!='Other';
alter table continents add primary key (continent_id);
alter table countries add foreign key (continent_id) references continents;
alter table countries add primary key (country_id);
alter table cases add foreign key (country_id) references countries;
alter table cases add primary key (country_id,daterep);

The dimension hierarchy is on country/continent. I should also have created one for time (day/month/quarter/year) but the goal is to keep it simple, to show the concept.

When looking at the syntax, it may seem complex. But please understand that the goal is to put more in the static definition so that runtime usage is easier.

Attribute Dimension

I’ll describe the Country/Continent dimension. It can be in one table (star schema) or multiple tables (snowflake schema). I opted for a snowflake to show how it is supported since 18c. In 12c, we have to create a view on it, as the USING clause can only be a table or view identifier.
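For reference, a hedged sketch of that 12c workaround (hypothetical view name): pre-join the snowflake into a single view and use it in the USING clause:


create or replace view countries_dim_v as
select c.country_id, c.country_code, c.country_name,
       c.continent_id, t.continent_name
from countries c
join continents t on t.continent_id = c.continent_id;
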


create or replace attribute dimension COUNTRIES_DIM_ATT
using COUNTRIES a ,CONTINENTS b join path country_continent on a.CONTINENT_ID=b.CONTINENT_ID
attributes ( a.COUNTRY_ID "Country ID", a.COUNTRY_CODE "Country", a.COUNTRY_NAME "Country name", a.CONTINENT_ID "Continent ID", b.CONTINENT_NAME "Continent")
level "Continent"
  key "Continent ID"
  member name         '#'||to_char("Continent ID")
  member caption      upper(substr("Continent",1,3))
  member description  "Continent"
  determines ("Continent")
level "Country"
  key "Country ID"
  member name         "Country ID"
  member caption      "Country"
  member description  "Country name"
  determines ("Country ID","Country", "Country name", "Continent ID", "Continent")
 all member name 'WORLD'
/

Let’s take it simply: I have an internal name for my dimension, COUNTRIES_DIM_ATT, and a USING clause which declares the dimension table, with an optional JOIN PATH for snowflake schemas. Then I’ve declared the attributes, which are the projection of those columns. For this example, I decided to use quoted identifiers for the ones I add in this layer, to distinguish them from the table columns. But do as you want.

The most important part here is about levels and dependencies. In a star schema, we denormalize the dimension tables for simplification (and because it is not a problem: there are no updates, and their size is not as large as the fact tables). The metadata we declare here describes the relationships. I have two levels, country and continent, and a many-to-one relationship from country to continent. This is what I declare with the LEVEL and DETERMINES keywords: from all the attributes declared, which ones are functional dependencies of others.

The second important description here is standard naming. In the analytic view, I can query the attributes as columns from the USING clause. But for the ease of querying by simple tools, they will also have standard column names. Each attribute has a MEMBER NAME (I used the 2-letter country code here, which is the COUNTRY_ID primary key in my COUNTRIES dimension table), a MEMBER CAPTION as a short name, and a MEMBER DESCRIPTION for a longer one. Those are standardized names for each object. The idea is to provide a view that can be used without reading the data model: for each level, the end-user can query the name, caption or description.

The idea is that those hierarchy levels will be selected in the WHERE clause by a LEVEL_NAME, instead of mentioning all columns in a GROUP BY clause or in the PARTITION BY windowing clause of an analytic function. Note that there’s also an ALL level for the top-most aggregation, and we can keep the ‘ALL’ name or use a specific one like the ‘WORLD’ I’ve defined here for all countries.

The most important metadata is defined by the attribute dimension, but we don’t query dimensions directly. We can only look at their definitions in the dictionary:


SQL> select * FROM user_attribute_dimensions;

      DIMENSION_NAME    DIMENSION_TYPE    CACHE_STAR    MAT_TABLE_OWNER    MAT_TABLE_NAME    ALL_MEMBER_NAME    ALL_MEMBER_CAPTION    ALL_MEMBER_DESCRIPTION    COMPILE_STATE    ORIGIN_CON_ID
____________________ _________________ _____________ __________________ _________________ __________________ _____________________ _________________________ ________________ ________________
COUNTRIES_DIM_ATT    STANDARD          NONE                                               'WORLD'                                                            VALID                           3
CALENDAR_DIM_ATT     STANDARD          NONE                                               'ALL'                                                              VALID                           3
DAYS_DIM_ATT         TIME              NONE                                               'ALL'                                                              VALID                           3

SQL> select * FROM user_attribute_dim_attrs;

      DIMENSION_NAME    ATTRIBUTE_NAME    TABLE_ALIAS       COLUMN_NAME    ORDER_NUM    ORIGIN_CON_ID
____________________ _________________ ______________ _________________ ____________ ________________
DAYS_DIM_ATT         Date              CASES          DATEREP                      0                3
COUNTRIES_DIM_ATT    Country ID        A              COUNTRY_ID                   0                3
COUNTRIES_DIM_ATT    Country           A              COUNTRY_CODE                 1                3
COUNTRIES_DIM_ATT    Country name      A              COUNTRY_NAME                 2                3
COUNTRIES_DIM_ATT    Continent ID      A              CONTINENT_ID                 3                3
COUNTRIES_DIM_ATT    Continent         B              CONTINENT_NAME               4                3
CALENDAR_DIM_ATT     Date              CASES          DATEREP                      0                3

SQL> select * FROM user_attribute_dim_levels;

      DIMENSION_NAME    LEVEL_NAME    SKIP_WHEN_NULL    LEVEL_TYPE                MEMBER_NAME_EXPR               MEMBER_CAPTION_EXPR    MEMBER_DESCRIPTION_EXPR    ORDER_NUM    ORIGIN_CON_ID
____________________ _____________ _________________ _____________ _______________________________ _________________________________ __________________________ ____________ ________________
COUNTRIES_DIM_ATT    Continent     N                 STANDARD      '#'||to_char("Continent ID")    upper(substr("Continent",1,3))    "Continent"                           0                3
DAYS_DIM_ATT         Day           N                 DAYS          TO_CHAR("Date")                                                                                         0                3
COUNTRIES_DIM_ATT    Country       N                 STANDARD      "Country ID"                    "Country"                         "Country name"                        1                3
CALENDAR_DIM_ATT     Day           N                 STANDARD      TO_CHAR("Date")                                                                                         0                3

There is more that we can define here. In the same way that levels simplify the PARTITION BY clause of analytic functions, we can also avoid the ORDER BY clause by declaring an ordering in each level, as sketched below. I keep it simple here.
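
For illustration, a level ordering would go in the level clause of the attribute dimension. This is a hypothetical fragment (not the exact DDL I used), where the ORDER BY line is the addition:


 level "Country"
  key "Country ID"
  member name "Country ID"
  member caption "Country"
  member description "Country name"
  order by "Country name"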

For drill-down analytics, we query on hierarchies.

Hierarchy

This is a simple declaration of the parent-child relationship between levels:


SQL> 
create or replace hierarchy "Countries"
    using COUNTRIES_DIM_ATT
    ( "Country" child of "Continent")
 /

Hierarchy created.

This is actually a view that we can query, and the best way to understand it is to look at it.

The definition from the dictionary just reflects what we have created:


SQL> select * FROM user_hierarchies;

   HIER_NAME    DIMENSION_OWNER       DIMENSION_NAME    PARENT_ATTR    COMPILE_STATE    ORIGIN_CON_ID
____________ __________________ ____________________ ______________ ________________ ________________
Countries    DEMO               COUNTRIES_DIM_ATT                   VALID                           3

SQL> select * FROM user_hier_levels;

   HIER_NAME    LEVEL_NAME    ORDER_NUM    ORIGIN_CON_ID
____________ _____________ ____________ ________________
Countries    Continent                0                3
Countries    Country                  1                3

We can also query USER_HIER_COLUMNS to see which columns are exposed by the view, but a simple DESC will show them:


SQL> desc "Countries"

                 Name    Role            Type
_____________________ _______ _______________
Country ID            KEY     VARCHAR2(10)
Country               PROP    VARCHAR2(3)
Country name          PROP    VARCHAR2(50)
Continent ID          KEY     NUMBER
Continent             PROP    VARCHAR2(10)
MEMBER_NAME           HIER    VARCHAR2(41)
MEMBER_UNIQUE_NAME    HIER    VARCHAR2(95)
MEMBER_CAPTION        HIER    VARCHAR2(12)
MEMBER_DESCRIPTION    HIER    VARCHAR2(50)
LEVEL_NAME            HIER    VARCHAR2(9)
HIER_ORDER            HIER    NUMBER
DEPTH                 HIER    NUMBER(10)
IS_LEAF               HIER    NUMBER
PARENT_LEVEL_NAME     HIER    VARCHAR2(9)
PARENT_UNIQUE_NAME    HIER    VARCHAR2(95)

This is like a join between COUNTRIES and CONTINENTS (defined in the USING clause of the attribute dimension) with the attributes exposed. But there are also additional columns, with standard names in all hierarchies: member name/caption/description and level information. That's because all levels are here, as if we did a UNION ALL over GROUP BY queries.

Additional columns and additional rows for each level. Let’s query it:


SQL> select * from "Countries";

   Country ID    Country                         Country name    Continent ID    Continent    MEMBER_NAME    MEMBER_UNIQUE_NAME    MEMBER_CAPTION                   MEMBER_DESCRIPTION    LEVEL_NAME    HIER_ORDER    DEPTH    IS_LEAF    PARENT_LEVEL_NAME    PARENT_UNIQUE_NAME
_____________ __________ ____________________________________ _______________ ____________ ______________ _____________________ _________________ ____________________________________ _____________ _____________ ________ __________ ____________________ _____________________
                                                                                           WORLD          [ALL].[WORLD]                                                                ALL                       0        0          0
                                                                            1 Asia         #1             [Continent].&[1]      ASI               Asia                                 Continent                 1        1          0 ALL                  [ALL].[WORLD]
AE            ARE        United_Arab_Emirates                               1 Asia         AE             [Country].&[AE]       ARE               United_Arab_Emirates                 Country                   2        2          1 Continent            [Continent].&[1]
AF            AFG        Afghanistan                                        1 Asia         AF             [Country].&[AF]       AFG               Afghanistan                          Country                   3        2          1 Continent            [Continent].&[1]
BD            BGD        Bangladesh                                         1 Asia         BD             [Country].&[BD]       BGD               Bangladesh                           Country                   4        2          1 Continent            [Continent].&[1]
...
VN            VNM        Vietnam                                            1 Asia         VN             [Country].&[VN]       VNM               Vietnam                              Country                  43        2          1 Continent            [Continent].&[1]
YE            YEM        Yemen                                              1 Asia         YE             [Country].&[YE]       YEM               Yemen                                Country                  44        2          1 Continent            [Continent].&[1]
                                                                            2 Africa       #2             [Continent].&[2]      AFR               Africa                               Continent                45        1          0 ALL                  [ALL].[WORLD]
AO            AGO        Angola                                             2 Africa       AO             [Country].&[AO]       AGO               Angola                               Country                  46        2          1 Continent            [Continent].&[2]
BF            BFA        Burkina_Faso                                       2 Africa       BF             [Country].&[BF]       BFA               Burkina_Faso                         Country                  47        2          1 Continent            [Continent].&[2]
...

I've removed many rows for clarity, but there is one row per country (the deepest level), plus one row for each continent, plus one row for the top summary ('WORLD'). This is how we avoid GROUP BY in the end-user query: we just mention the level: LEVEL_NAME='ALL', LEVEL_NAME='Continent', LEVEL_NAME='Country'. Or we query the DEPTH: 0 for the global summary, 1 for continents, 2 for countries. The countries, being the most detailed level, can also be queried with IS_LEAF=1. The attributes may be NULL for non-leaf levels, like "Country name" at the 'Continent' level, or "Continent" at the 'ALL' level.
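
For example, a minimal query that keeps only the continent rows, with no GROUP BY at all, could look like this:


SQL> select MEMBER_DESCRIPTION, "Continent ID"
  2   from "Countries"
  3   where LEVEL_NAME='Continent'
  4   order by HIER_ORDER;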

In addition to the attributes, we have the standardized names, so that a user GUI can see the same column names for all dimensions. I don't show all countries, and I don't query MEMBER_CAPTION and MEMBER_DESCRIPTION, to keep it short here:


SQL>
select MEMBER_NAME,MEMBER_UNIQUE_NAME,LEVEL_NAME,PARENT_LEVEL_NAME,PARENT_UNIQUE_NAME,HIER_ORDER,DEPTH,IS_LEAF
 from "Countries" order by DEPTH,HIER_ORDER fetch first 10 rows only;

   MEMBER_NAME    MEMBER_UNIQUE_NAME    LEVEL_NAME    PARENT_LEVEL_NAME    PARENT_UNIQUE_NAME    HIER_ORDER    DEPTH    IS_LEAF
______________ _____________________ _____________ ____________________ _____________________ _____________ ________ __________
WORLD          [ALL].[WORLD]         ALL                                                                  0        0          0
#1             [Continent].&[1]      Continent     ALL                  [ALL].[WORLD]                     1        1          0
#2             [Continent].&[2]      Continent     ALL                  [ALL].[WORLD]                    45        1          0
#3             [Continent].&[3]      Continent     ALL                  [ALL].[WORLD]                   101        1          0
#4             [Continent].&[4]      Continent     ALL                  [ALL].[WORLD]                   156        1          0
#5             [Continent].&[5]      Continent     ALL                  [ALL].[WORLD]                   165        1          0
AE             [Country].&[AE]       Country       Continent            [Continent].&[1]                  2        2          1
AF             [Country].&[AF]       Country       Continent            [Continent].&[1]                  3        2          1
BD             [Country].&[BD]       Country       Continent            [Continent].&[1]                  4        2          1
BH             [Country].&[BH]       Country       Continent            [Continent].&[1]                  5        2          1

A row can be identified by the level (LEVEL_NAME or DEPTH) and its name, but a unique name is also generated with the full path (in MDX style). This is MEMBER_UNIQUE_NAME, and we also have the PARENT_UNIQUE_NAME if we want to follow the hierarchy.
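
For example, drilling down from Africa ([Continent].&[2] above) to its children is just a filter on PARENT_UNIQUE_NAME:


SQL> select MEMBER_NAME, MEMBER_DESCRIPTION, LEVEL_NAME
  2   from "Countries"
  3   where PARENT_UNIQUE_NAME='[Continent].&[2]'
  4   order by HIER_ORDER;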

Analytic View

Now that I have a view on the hierarchy, I want to join it to the fact table, in order to display the measures at different levels of aggregation. Again, I don’t want the user to think about joins and aggregation functions, and this must be encapsulated in a view, an ANALYTIC VIEW:


create or replace analytic view "COVID cases"
using CASES
dimension by (
  COUNTRIES_DIM_ATT key COUNTRY_ID references "Country ID"
  hierarchies ( "Countries")
 )
measures (
  "Cases"          fact CASES aggregate by sum,
  "Highest cases"  fact CASES aggregate by max
)
/

The USING clause just mentions the fact table. The DIMENSION BY clause lists all the dimensions (I have only one here for the simplicity of the example, but you would list all of them) and how the fact table joins to each dimension (the foreign key REFERENCES the lowest-level key of the dimension). The MEASURES clause defines the fact columns and the aggregation function to apply to them. This can become complex, in order to be sure it always makes sense: what is stored in one fact column can be exposed as multiple business measures depending on the aggregation.

There are many functions for calculated measures. For example, in the screenshot you will see at the end, I added the following to show each country's covid cases as a ratio of its continent's total.


 "cases/continent" as 
  ( share_of("Cases" hierarchy COUNTRIES_DIM_ATT."Countries"  level "Continent") )
  caption 'Cases Share of Continent' description 'Cases Share of Continent'

But for the moment I keep it simple with only “Cases” and “Highest cases”.

Here is the description:


SQL> desc "COVID cases"

            Dim Name    Hier Name                  Name    Role            Type
____________________ ____________ _____________________ _______ _______________
COUNTRIES_DIM_ATT    Countries    Country ID            KEY     VARCHAR2(10)
COUNTRIES_DIM_ATT    Countries    Country               PROP    VARCHAR2(3)
COUNTRIES_DIM_ATT    Countries    Country name          PROP    VARCHAR2(50)
COUNTRIES_DIM_ATT    Countries    Continent ID          KEY     NUMBER
COUNTRIES_DIM_ATT    Countries    Continent             PROP    VARCHAR2(10)
COUNTRIES_DIM_ATT    Countries    MEMBER_NAME           HIER    VARCHAR2(41)
COUNTRIES_DIM_ATT    Countries    MEMBER_UNIQUE_NAME    HIER    VARCHAR2(95)
COUNTRIES_DIM_ATT    Countries    MEMBER_CAPTION        HIER    VARCHAR2(12)
COUNTRIES_DIM_ATT    Countries    MEMBER_DESCRIPTION    HIER    VARCHAR2(50)
COUNTRIES_DIM_ATT    Countries    LEVEL_NAME            HIER    VARCHAR2(9)
COUNTRIES_DIM_ATT    Countries    HIER_ORDER            HIER    NUMBER
COUNTRIES_DIM_ATT    Countries    DEPTH                 HIER    NUMBER(10)
COUNTRIES_DIM_ATT    Countries    IS_LEAF               HIER    NUMBER
COUNTRIES_DIM_ATT    Countries    PARENT_LEVEL_NAME     HIER    VARCHAR2(9)
COUNTRIES_DIM_ATT    Countries    PARENT_UNIQUE_NAME    HIER    VARCHAR2(95)
                     MEASURES     Cases                 BASE    NUMBER
                     MEASURES     Highest cases         BASE    NUMBER

I have columns from all hierarchies, with KEY and PROPERTY attributes, and standardized names from the HIERARCHY, and the measures. You must remember that it is a virtual view: you will never query all columns and all rows. You SELECT the columns, filter (WHERE) the rows and levels, and you get the result you want without GROUP BY and JOIN. If you look at the execution plan you will see the UNION ALL, JOIN and GROUP BY on the star or snowflake tables. But this is not the end-user's concern. As a DBA, you can create some materialized views to pre-build some summaries, and query rewrite will use them.
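
For example, a materialized view like this minimal sketch (the exact summaries to pre-build depend on your queries) could be picked up by query rewrite for the 'Continent' level:


SQL> create materialized view CASES_BY_CONTINENT
  2   enable query rewrite as
  3   select CONTINENT_ID, sum(CASES) CASES
  4   from CASES join COUNTRIES using(COUNTRY_ID)
  5   group by CONTINENT_ID;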

We are fully within the initial SQL philosophy: a logical view provides an API that is independent of the physical design and easy to query, on a simple row/column table easy to visualize.

Analytic query

A query on the analytic view is then very simple. In the FROM clause, instead of tables with joins, I mention the analytic view, and instead of mentioning table aliases, I mention the hierarchy. I reference only the standard column names. Only the hierarchy names and the measures are specific. In the where clause, I can also reference the LEVEL_NAME:


SQL> 
select MEMBER_DESCRIPTION, "Cases"
 from "COVID cases" hierarchies ("Countries")
 where ( "Countries".level_name='Country' and "Countries".MEMBER_CAPTION in ('USA','CHN') )
    or ( "Countries".level_name in ('Continent','ALL') )
 order by "Cases";

         MEMBER_DESCRIPTION      Cases
___________________________ __________
Oceania                           8738
China                            84198
Africa                          203142
Asia                           1408945
United_States_of_America       1979850
Europe                         2100711
America                        3488230
                               7209766

Here I wanted to see the total covid-19 cases for all countries ('ALL'), for each continent, and for only two countries at the country level: USA and China. And this was a simple SELECT … FROM … WHERE … ORDER BY, without joins and GROUP BY. Like a query on an OLAP cube.

If I had no analytic views, here is how I would have queried the tables:


SQL>
select coalesce(CONTINENT_NAME, COUNTRY_NAME,'ALL'), CASES from (
select CONTINENT_NAME, COUNTRY_NAME, sum(CASES) cases, COUNTRY_CODE, grouping(COUNTRY_CODE) g_country
from CASES join COUNTRIES using(COUNTRY_ID) join CONTINENTS using(CONTINENT_ID)
group by grouping sets ( () , (CONTINENT_NAME) , (COUNTRY_CODE,COUNTRY_NAME) )
)
where COUNTRY_CODE in ('USA','CHN') or g_country >0
order by cases
/

   COALESCE(CONTINENT_NAME,COUNTRY_NAME,'ALL')      CASES
______________________________________________ __________
Oceania                                              8738
China                                               84198
Africa                                             203142
Asia                                              1408945
United_States_of_America                          1979850
Europe                                            2100711
America                                           3488230
ALL                                               7209766

This was with GROUPING SETS to add the multiple levels, and the GROUPING() function to detect the level. Without GROUPING SETS, I would have done it with many UNION ALL between GROUP BY subqueries, as sketched below.
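
Just to illustrate the difference, the UNION ALL equivalent would look something like this sketch on the same tables:


SQL> select 'ALL' as name, sum(CASES) as cases from CASES
  2   union all
  3   select CONTINENT_NAME, sum(CASES) from CASES
  4    join COUNTRIES using(COUNTRY_ID) join CONTINENTS using(CONTINENT_ID)
  5    group by CONTINENT_NAME
  6   union all
  7   select COUNTRY_NAME, sum(CASES) from CASES
  8    join COUNTRIES using(COUNTRY_ID)
  9    where COUNTRY_CODE in ('USA','CHN')
 10    group by COUNTRY_NAME
 11   order by 2;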

Back to roots of SQL

You may think that you don't need Analytic Views because the same can be done by some BI reporting tools. But this should belong to the database. SQL was invented to provide a simple API to users. If you need an additional layer with a large repository of metadata and complex transformations between the user-defined query and the SQL to execute, then something is missing from the initial goal. One consequence is people going to NoSQL hierarchical databases with the idea that they are easier to visualize: a simple API (a key-value get) and embedded metadata (as JSON for example). While SQL became more and more powerful at processing data in the database, the complexity went too far and developers preferred to go back to their procedural code rather than learning something new. And the first step of many current developments is to move the data out of the database, to NoSQL, or to an OLAP cube in our case.

Analytic views bring back the power of SQL: the view exposes a Data Mart as one simple table with columns and rows, containing all dimensions and levels of aggregation. The metadata that describes the data model is back where it belongs: the data dictionary. My example here is a very simple one but it can go further, with classifications to add more metadata for self-documentation, with more hierarchies (and a special one for the time dimension), and many calculated measures.
SQL on it is simplified, and there are also some GUIs over analytic views, like APEX or SQL Developer:

And if SQL is still too complex, it seems that we can query Analytic Views with MDX (MultiDimensional eXpressions). The MEMBER_UNIQUE_NAME follows the MDX syntax, and we can find this in the ?/mesg/oraus.msg list of error messages:


/============================================================================
/
/    18200 - 18699 Reserved for Analytic View Sql (HCS) error messages
/
/============================================================================
/
/// 18200 - 18219 reserved for MDX Parser
/

HCS is the initial name of this feature (Hierarchical Cubes). I’ve not seen other mentions of MDX in the Oracle Database documentation, so I’ve no idea if it is already implemented.

The article No{Join,GroupBy}SQL – Analytic Views for BI appeared first on the dbi services Blog.

Oracle non-linguistic varchar2 columns to order by without sorting


By Franck Pachot

.
Sorting data is an expensive operation and many queries declare an ORDER BY. To avoid the sort operation you can build an index, as it maintains a sorted structure. This helps with Top-N queries as you don't have to read all rows but only those from a range of index entries. However, indexes are sorted by binary values. For NUMBER or DATE datatypes, the internal storage ensures that the order is preserved in the binary format. For character strings, the binary format is ASCII, which follows the English alphabet. That's fine when your session language, NLS_LANGUAGE, defines an NLS_SORT that follows this BINARY order. But as soon as you set a language that has a specific alphabetical order, having an index on a VARCHAR2 or CHAR column does not help to avoid a SORT operation. However, since Oracle 12.2 we can define the sort order at column level with the SQL standard COLLATE clause. One use case is alpha-numeric columns that have nothing to do with any language, like natural keys combining letters and numbers: the user expects them to be listed in alphabetical order but, as they store only 7-bit ASCII characters, you don't care about linguistic collation.

I am running this on the Oracle 20c preview in the Oracle Cloud.

VARCHAR2

It can happen that a primary key is not a NUMBER but a CHAR or VARCHAR2, like this:


SQL> create table demo (ID constraint demp_pk primary key) as
  2  select cast(dbms_random.string('U',1)||to_char(rownum,'FM0999') as varchar2(5)) ID
  3  from xmltable('1 to 10');

Table created.

SQL> select * from demo order by ID;

      ID
________
K0003
K0009
L0007
L0010
M0008
O0002
S0001
W0005
Y0006
Z0004

10 rows selected.

I query with ORDER BY because sorting can make sense on a natural key.

Index

I have an index on this column, which is sorted, and then the execution plan is optimized:


SQL> select * from dbms_xplan.display_cursor(format=>'basic');

                      PLAN_TABLE_OUTPUT
_______________________________________
EXPLAINED SQL STATEMENT:
------------------------
select * from demo order by ID

Plan hash value: 1955576728

------------------------------------
| Id  | Operation        | Name    |
------------------------------------
|   0 | SELECT STATEMENT |         |
|   1 |  INDEX FULL SCAN | DEMP_PK |
------------------------------------

13 rows selected.

There’s no SORT operation because the INDEX FULL SCAN follows the index entries in order.

NLS_LANGUAGE

However, there are many countries where we don’t speak English:


SQL> alter session set nls_language='French';

Session altered.

In French, like in many languages, we have accented characters and other specificities, so the language's alphabetical order does not always follow the ASCII order.

I’m running exactly the same query:


SQL> select * from demo order by ID;

      ID
________
K0003
K0009
L0007
L0010
M0008
O0002
S0001
W0005
Y0006
Z0004

10 rows selected.

SQL> select * from dbms_xplan.display_cursor(format=>'basic');

                      PLAN_TABLE_OUTPUT
_______________________________________
EXPLAINED SQL STATEMENT:
------------------------
select * from demo order by ID

Plan hash value: 2698718808

------------------------------------
| Id  | Operation        | Name    |
------------------------------------
|   0 | SELECT STATEMENT |         |
|   1 |  SORT ORDER BY   |         |
|   2 |   INDEX FULL SCAN| DEMP_PK |
------------------------------------

14 rows selected.

This time, there's a SORT operation, even though I'm still reading with an INDEX FULL SCAN.

NLS_SORT

The reason is that, by setting the ‘French’ language, I’ve also set the French sort collating sequence.


SQL> select * from nls_session_parameters;
                 PARAMETER                           VALUE
__________________________ _______________________________
NLS_LANGUAGE               FRENCH
NLS_SORT                   FRENCH

And this is different from the BINARY one that I had when my language was ‘American’.

Actually, only a few languages follow the BINARY order of the ASCII table:


SQL>
  declare
   val varchar2(64);
  begin
    for i in (select VALUE from V$NLS_VALID_VALUES where PARAMETER='LANGUAGE') loop
    execute immediate 'alter session set nls_language='''||i.value||'''';
    select value into val from NLS_SESSION_PARAMETERS where PARAMETER='NLS_SORT';
    if val='BINARY' then dbms_output.put(i.value||' '); end if;
    end loop;
    dbms_output.put_line('');
  end;
/

AMERICAN JAPANESE KOREAN SIMPLIFIED CHINESE TRADITIONAL CHINESE ENGLISH HINDI TAMIL KANNADA TELUGU ORIYA MALAYALAM ASSAMESE GUJARATI MARATHI PUNJABI BANGLA MACEDONIAN LATIN SERBIAN IRISH

PL/SQL procedure successfully completed.

This is ok for real text, but not for my primary key where the ASCII order is fine. I could set NLS_SORT=BINARY for my session, but that's too wide a scope, as my problem is only with one column.
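
For reference, the session-wide workaround is just this, but it changes the sorting behaviour for every character column in the session:


SQL> alter session set nls_sort=binary;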

Or I can create an index for the French collation. Actually, this is what is used internally:


SQL> explain plan for select * from demo order by ID;
Explained.

SQL> select * from dbms_xplan.display(format=>'basic +projection');
                                                      PLAN_TABLE_OUTPUT
_______________________________________________________________________
Plan hash value: 2698718808

------------------------------------
| Id  | Operation        | Name    |
------------------------------------
|   0 | SELECT STATEMENT |         |
|   1 |  SORT ORDER BY   |         |
|   2 |   INDEX FULL SCAN| DEMP_PK |
------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1) NLSSORT("DEMO"."ID",'nls_sort=''GENERIC_M''')[50],
       "DEMO"."ID"[VARCHAR2,5]
   2 - "DEMO"."ID"[VARCHAR2,5]

GENERIC_M is the sort collation for many European languages.

But again, that does not fit the scope of my problem, as I don't want to create an index for every possible NLS_SORT setting.
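
For reference, such a linguistic index is simply a function-based index on NLSSORT(), a sketch of what I do not want to maintain for each collation:


SQL> create index DEMO_ID_FRENCH on DEMO ( nlssort(ID,'nls_sort=''FRENCH''') );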

COLLATE

The good solution is to define the collation for my table column: this ID is a character string, but it is an ASCII character string which has nothing to do with my language. In 18c I can do that:


SQL> alter table demo modify ID collate binary;

Table altered.

COLLATE is a SQL standard syntax that exists in other databases, and it came to Oracle in 12cR2.

And that’s all:


SQL> explain plan for select * from demo order by ID;

Explained.

SQL> select * from dbms_xplan.display(format=>'basic +projection');

                                             PLAN_TABLE_OUTPUT
______________________________________________________________
Plan hash value: 1955576728

------------------------------------
| Id  | Operation        | Name    |
------------------------------------
|   0 | SELECT STATEMENT |         |
|   1 |  INDEX FULL SCAN | DEMP_PK |
------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "DEMO"."ID"[VARCHAR2,5]

No SORT operation needed, whatever the language I set for my session.

Here is the DDL for my table:


SQL> ddl demo

  CREATE TABLE "SYS"."DEMO"
   (    "ID" VARCHAR2(5) COLLATE "BINARY",
         CONSTRAINT "DEMP_PK" PRIMARY KEY ("ID")
  USING INDEX  ENABLE
   )  DEFAULT COLLATION "USING_NLS_COMP" ;

My column explicitly follows the BINARY collation.

Extended Data Types

Now, all seems easy, but there’s a prerequisite:


SQL> show parameter max_string_size

NAME            TYPE   VALUE
--------------- ------ --------
max_string_size string EXTENDED

I have set my PDB to EXTENDED string size.

If I try the same in a PDB with the ‘old’ limit of 4000 bytes:


SQL> alter session set container=PDB1;

Session altered.

SQL> show parameter max_string_size

NAME            TYPE   VALUE
--------------- ------ --------
max_string_size string STANDARD

SQL> drop table demo;

Table dropped.

SQL> create table demo (ID varchar2(5) collate binary constraint demp_pk primary key);

create table demo (ID varchar2(5) collate binary constraint demp_pk primary key)
 *
ERROR at line 1:
ORA-43929: Collation cannot be specified if parameter MAX_STRING_SIZE=STANDARD is set.

This new feature is allowed only with the Extended Data Types introduced in 12c.
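
For completeness, switching a PDB to extended data types roughly follows the documented procedure below. This is a sketch from memory and a one-way change, so check the documentation for your version before running it:


SQL> alter session set container=PDB1;
SQL> shutdown immediate
SQL> startup upgrade
SQL> alter system set max_string_size=extended scope=spfile;
SQL> @?/rdbms/admin/utl32k.sql
SQL> shutdown immediate
SQL> startup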

ORDER BY COLLATE

Ok, let’s create the table with the default collation:


SQL> create table demo (ID constraint demp_pk primary key) as
  2  select cast(dbms_random.string('U',1)||to_char(rownum,'FM0999') as varchar2(5)) ID
  3  from xmltable('1 to 10');

Table created.

SQL> select * from dbms_xplan.display_cursor(format=>'basic +projection');

                                                   PLAN_TABLE_OUTPUT
____________________________________________________________________
EXPLAINED SQL STATEMENT:
------------------------
select * from demo order by ID

Plan hash value: 2698718808

------------------------------------
| Id  | Operation        | Name    |
------------------------------------
|   0 | SELECT STATEMENT |         |
|   1 |  SORT ORDER BY   |         |
|   2 |   INDEX FULL SCAN| DEMP_PK |
------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1) NLSSORT("DEMO"."ID",'nls_sort=''FRENCH''')[50],
       "DEMO"."ID"[VARCHAR2,5]
   2 - "DEMO"."ID"[VARCHAR2,5]

As my NLS_SORT is ‘French’ there is a SORT operation.

But I can explicitly request a BINARY sort for this:


SQL> select * from demo order by ID collate binary;

      ID
________
D0003
H0002
L0009
N0008
P0010
Q0005
R0004
W0007
Y0001
Z0006

10 rows selected.

SQL> select * from dbms_xplan.display_cursor(format=>'basic +projection');

                                             PLAN_TABLE_OUTPUT
______________________________________________________________
EXPLAINED SQL STATEMENT:
------------------------
select * from demo order by ID collate binary

Plan hash value: 2698718808

------------------------------------
| Id  | Operation        | Name    |
------------------------------------
|   0 | SELECT STATEMENT |         |
|   1 |  SORT ORDER BY   |         |
|   2 |   INDEX FULL SCAN| DEMP_PK |
------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1) "DEMO"."ID" COLLATE "BINARY"[5],
       "DEMO"."ID"[VARCHAR2,5]
   2 - "DEMO"."ID"[VARCHAR2,5]

I have no idea why there is still a SORT operation. I think that the INDEX FULL SCAN already returns the rows in binary order, and that should not require any additional sorting for the ORDER BY … COLLATE BINARY.

The article Oracle non-linguistic varchar2 columns to order by without sorting appeared first on the dbi services Blog.

Some myths about PostgreSQL vs. Oracle


By Franck Pachot

.
I originally wrote this as a comment on the following post that you may find on the internet:
https://www.2ndquadrant.com/en/blog/oracle-to-postgresql-reasons-to-migrate/
but my comment was not published (many links in it… I suppose it was flagged as spam?) so I put it here.

You should never make any decision based on what you read on the internet without verifying it. It is totally valid to consider a move to open source databases, but doing it without a good understanding is a risk for the success of your migration project.

In italics are the quotes from the article.

Kirk,
As you do a comparison and link to a list of PostgreSQL features, let me refine the name and description of the Oracle features you compare to, so that people can find them and do a fair comparison. I’m afraid they may not recognize the names and descriptions you provide, at least in current versions. As an example, nobody will get search hits for “Federation”, or “plSQL”, or “HTML DB”… in the Oracle documentation but they will find “Oracle Gateway”, “PL/SQL”, “APEX”…

Federation vs. Foreign Data Wrappers


There is no feature called “Federation”. 
The closest to your description is Database Links and Heterogeneous Services through Database Gateway. They go further than FDW on many points. But anyway, I would never use that for ETL. ETL needs optimized bulk loads and there are other features for that (like External Tables to read files, and direct-path inserts for fast loads). If your goal is to federate and distribute some small reference tables, then Materialized Views is the feature you may look for.
https://docs.oracle.com/en/database/oracle/oracle-database/20/heter/introduction.html#GUID-EC402025-0CC0-401F-AF93-888B8A3089FE

plSQL vs. everything else


“Oracle has a built-in programming language called plSQL.”
PL/SQL is more than that. It is compiled (to pcode or native), manages dependencies (it tracks dependencies on schema objects), is optimized for data access (a UDF can even be compiled to run within the SQL engine), and can be multithreaded (Parallel Execution). That's different from PL/pgSQL, which is interpreted at execution time. You mention other languages "as plug-ins" and, for this, there are other ways to run different languages (external procedures, OJVM, External Table preprocessor,…), but when it comes to performance, transaction control, dependency tracking,… that's PL/SQL.
https://docs.oracle.com/en/database/oracle/oracle-database/20/lnpls/overview.html#GUID-17166AA4-14DC-48A6-BE92-3FC758DAA940

Application programming


Providing an "API to communicate with the database" is not about open source, as the main goal is encapsulation: hiding the implementation details. In order to access internal structures, which is what you mention, Oracle provides relational views (known as V$ views) accessible with the most appropriate API for a relational database: SQL
https://docs.oracle.com/en/database/oracle/oracle-database/20/refrn/dynamic-performance-views.html#GUID-8C5690B0-DE10-4460-86DF-80111869CF4C

Internationalization and Localization


The "globalization toolkit" is only one part of the globalization features. You can also use "any character encoding, collation and code page", but not relying on the OS implementation makes it cross-platform compatible and OS-upgrade compatible (see https://wiki.postgresql.org/wiki/Locale_data_changes)
https://docs.oracle.com/en/database/oracle/oracle-database/20/nlspg/overview-of-globalization-support.html#GUID-6DD587EE-6686-4802-9C08-124B495978D5

Web Development


"Oracle acknowledges the existence of HTML through HTML DB. PostgreSQL natively supports JSON, XML and plugs in Javascript". HTML DB can be found in paper books, but the name has been "APEX" since 2006. And it is not (only) about HTML, JSON, or XML: it is a low-code Rapid Application Development tool with no equivalent for other databases.
Support for the structures and languages you mention is all there. The latest trend being JSON: https://docs.oracle.com/en/database/oracle/oracle-database/20/adjsn/index.html

Authentication


“Oracle has a built-in authentication system.”
Yes, to be platform-independent, and it also has many other external authentication options: https://docs.oracle.com/en/database/oracle/oracle-database/20/dbseg/configuring-authentication.html#GUID-BF8E5E84-FE7E-449C-8081-755BAA4CF8DB

Extensibility


"Oracle has a plug-in system". I don't know what you are referring to. Oracle is multi-platform proprietary software. Commercial, which means vendor-supported. There are a lot of APIs for extensions, but the vendor has to control what runs in the engine in order to provide support.

Read Scalability


"PostgreSQL can create a virtually unlimited read cluster". Oracle has an active/active cluster (called RAC) and read replicas (called Active Data Guard). For horizontal scalability, you use the same feature as for vertical scalability (Parallel Execution), across multiple nodes (in sync, with instance affinity on partitions,…)
https://docs.oracle.com/en/database/oracle/oracle-database/20/vldbg/parallel-exec-intro.html#GUID-F9A83EDB-42AD-4638-9A2E-F66FE09F2B43

Cost


“they don’t mind charging you again for every single instance.” 
No, that’s wrong, license metrics are on processors (CPU) or users (NUP). You run as many instances as you want on your licensed servers for your licensed users: https://www.oracle.com/a/ocom/docs/corporate/oracle-software-licensing-basics.pdf
“jamming everything into a single instance just to reduce costs”
No, database consolidation is recommended to scale the management of multiple databases, but not for licensing costs. If you go there, there are a lot of features to allow isolation and data movement in consolidated databases: Multitenant, Resource Manager, Online Relocate, Lockdown Profiles,…

Performance


"differentiate the tuning parameters for your warehouse to OLTP to reporting to the data lake": I already mentioned the point about read replicas and about multiple instances on a server. But with Oracle, all the parameters I want to set differently for OLTP or reporting do not require another instance. They can be set at session or PDB level. As Oracle does not need the filesystem buffer cache, there's no need to separate onto different servers to avoid noisy neighbours.

I hope this helps to look further at the features. There are many reasons to migrate, and the main one is the will to move from a commercial model (with license and support) to an open-source one (start at low cost, with help from the community). But decisions must be made on facts, not rumours.

Franck.

The article Some myths about PostgreSQL vs. Oracle appeared first on the dbi services Blog.

Oracle Cloud basics for beginners


Introduction

Cloud, Cloud, Cloud. Everyone is talking about the Cloud but a lot of people are still in the fog with Cloud technologies. Let’s talk about basic features of the Oracle Cloud, called OCI for Oracle Cloud Infrastructure.

What is really OCI?

OCI is physically a lot of servers in datacenters all around the world. These servers are not very different from the servers you probably have in your own datacenter. Some of these servers are already in use by customers, and some are immediately available for existing or new customers. Most customers will not use complete servers but only part of them, thanks to the virtualization layer of OCI. A real server can hold several virtual servers, quite a lot actually. Oracle tells us that there is no overprovisioning on OCI: if you create your own server with 2 CPUs and 64GB of RAM, you can be pretty sure that these resources are available for you on the physical server, even if you don't plan to use them at full throttle. If you need a complete physical server for yourself, it's also possible, and it's as easy to provision as a virtual machine.

What do I need to create a server in OCI?

OCI is actually available through a website, the OCI console, but you’ll have to buy Cloud credits to be able to create resources in this Cloud.

Two other options are available:
– ask your Oracle software provider for free trial Cloud credits for testing OCI
– create a free account and use only always-free resources (quite limiting)

When you're connected to your brand new OCI account on the console, just create a compute instance. A compute instance is a general-purpose server. Several options are available at server creation, like the number of CPUs, the amount of RAM, the size of the boot disk, and the OS that will come pre-installed. Provisioning a simple Linux server takes 2 minutes. Deleting a server is also a matter of minutes.

Can I go straight to server creation?

Not really. You cannot simply create a server, because you'll need to put this server in a compartment, a kind of virtual container for your servers. So the first step is to create a compartment. Compartments are fully isolated from each other.

Then, you'll need a private network (called a Virtual Cloud Network, or VCN) where your server will be placed. This private network should be created with care because it must not overlap your on-premises network, especially if you plan to connect them (you surely need to). Along with the network creation, other basic network components also need to be configured.

What are the basic network resources to configure?

First of all, all these resources are virtual resources in OCI. When configuring your network, you'll also need at least one subnet in your VCN, a firewall (called a security list), a router (a route table) and a gateway for connecting this server (a NAT gateway for outbound internet connections, or an internet gateway for both inbound and outbound connections).

Your OCI network will be linked to your on-premises network with IPSec VPN technology or FastConnect, the latter being a dedicated connection to your existing infrastructure that does not go through the internet.

So before creating your first server, you’ll need to define and configure all these network settings properly.

How to connect to this server?

If you don't want to configure a VPN or a FastConnect link for now, you can associate your compute instance with an internet gateway to make it reachable from everywhere. Security is achieved with SSH keys: you provide your public key(s) on the OCI console for this server, and only you will be able to establish a connection to your server. Later, a VPN or FastConnect configuration will let you reach all your OCI servers as if they were on your own network.

What are the other services available?

If you're thinking about OCI, it's probably because you don't only need servers: you need Oracle databases. Actually, you don't have to provision compute instances to install databases on them. You can directly provision databases, in various versions, Standard or Enterprise Edition, with your own license or without any license (the license fee will then be billed as if it were an OCI resource, on a monthly basis). For sure, an underlying server will be provisioned, but you don't have to create it as a separate task. If you need to connect to this server later, it's possible, as if it were a normal compute instance.

A key feature of OCI is what they call the autonomous database: it's a self-managed database that doesn't give you access to the server or even the SYSDBA role on the database. You control this kind of DB through a dedicated interface (for loading data for example) and let Oracle automatically manage the classic DBA tasks, even the high-level ones. Autonomous database comes in two flavours, OLTP or Data Warehouse, and the embedded autonomous engine will behave differently for each.

Database services also come with automatic backups that you can simply configure when creating the database (or afterwards). Just define what kind of backup you need (mainly choosing from various retentions and frequencies) and RMAN will automatically take care of your backups. Restores can be done directly through the OCI console.

Other services are also available, like load balancers or MySQL databases. Some services are free, some come at a cost.

How about the storage?

Multiple storage options are available for your servers depending on your needs:
– block storage: this is similar to LUNs on a SAN. Choose the size at block storage creation and attach this storage to your server for dedicated use
– file storage: this is similar to NFS. A shared storage for multiple servers
– object storage: this storage is useful to make some files available wherever you need them, just by sharing a link

Storage on OCI relies only on SSD disks, so expect high I/O performance.

How much it costs?

That's the most difficult question, because you'll have to define your needs, build your infrastructure on paper, then compute the cost with the cost calculator provided by Oracle. There are two billing options available at the moment: prepaid, with Universal Cloud Credits, or pay-as-you-go based on service utilization. The costs may vary depending on multiple parameters. The base budget for an OCI infrastructure starts from $1,000 a month. Don't expect an OCI infrastructure to be much less expensive than on-premises servers: it's mainly interesting because you don't have to bother with budgeting, buying, deploying and managing servers on your own. And think about how quickly you can deploy a new environment, or destroy an old one. It's another way of spending your IT budget.

The cost calculator is here.

Conclusion

OCI is a mature Cloud, ready for production, with multiple services available and evolving constantly. Test it to discover how powerful it is, and make sure you understand all the benefits you can get compared to on-premises solutions.

The article Oracle Cloud basics for beginners appeared first on the dbi services Blog.

Oracle GoldenGate 19c: Cannot register Integrated EXTRACT due to ORA-44004


The GLOBAL_NAME of an Oracle database has a direct impact on the GoldenGate Extract process registration we need to do when we create an Integrated Extract.

In my example below, I use the Oracle GoldenGate Microservices architecture, but the same behaviour occurs with the Oracle GoldenGate Classic architecture.

 

Let's start with the creation of the Extract process by clicking on the plus button:

 

Choose Integrated Extract and click on the Next button:

Fill in all mandatory fields:

 

Add the list of tables you want to replicate:

 

Click on Create and Run the process. After a few seconds, the following error message appears:

An Integrated Extract must be registered in the database, and it is this registration process which fails.

By checking the report file, we see that the logmining server does not exist in the database. This error is a consequence of the Extract registration failure.

 

The problem is that Oracle GoldenGate doesn't support the character "-" in the GLOBAL_NAME:

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
DB1.IT.DBI-SERVICES.COM

SQL> sho parameter db_domain

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_domain                            string      it.dbi-services.com

The solution is to modify the db_domain and rename the global_name:

SQL> alter system set db_domain='it.dbiservices.com' scope=spfile;

System altered.


SQL> alter database rename global_name to DB1.ITDBISERVICES.COM;

Database altered.
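
We can verify the new value, which should now show DB1.ITDBISERVICES.COM:


SQL> select * from global_name;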

Let's now try to create the Extract process:

Now it works:

Let's check the report file by clicking on the Action/Details button:

 

The new global_name (DB1.ITDBISERVICES.COM) is now active and the logmining server is enabled in the database.
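
On the database side, the logmining server created by the registration can also be checked with a query like this one (a quick sketch, output not shown):


SQL> select capture_name, status, capture_type from dba_capture;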

The article Oracle GoldenGate 19c: Cannot register Integrated EXTRACT due to ORA-44004 appeared first on the dbi services Blog.


Oracle Autonomous Linux: cron’d ksplice and yum updates


By Franck Pachot

.
Oracle Enterprise Linux (OEL) is a Linux distribution which is binary compatible with Red Hat Enterprise Linux (RHEL). However, unlike RHEL, OEL is open source, free to download, free to use, free to distribute, free to update and gets free bug fixes. And there are more frequent updates in OEL than in CentOS, the free rebuild of RHEL. You can pay a subscription for additional support and features (like Ksplice or DTrace) in OEL. It can run the same kernel as RHEL but also provides, still for free, the 'unbreakable kernel' (UEK), which is compatible with RHEL but enhanced with optimizations, recommended especially when running Oracle products.

This is not new, and I couldn't resist illustrating the previous paragraph with the animated gif from the years of the UEK arrival. What is new is that OEL is also the base for the new Autonomous Linux, which runs in the Oracle Cloud, automates Ksplice to update the system online without restart, and sends notifications about these updates. You can use it in the Oracle Cloud Free Tier.

When creating an Always Free compute instance, you select the Oracle Autonomous Linux image. I've summarized all the steps below:

Autonomous Linux image

Generate an API private key


[opc@al ~]$ mkdir ~/.oci
[opc@al ~]$ openssl genrsa -out ~/.oci/oci_api_key.pem 2048 # no passphrase
[opc@al ~]$ chmod go-rwx ~/.oci/oci_api_key.pem
[opc@al ~]$ openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem
writing RSA key

This generates an API key pair; the private key file is only needed temporarily (al-config asks to delete it after the configuration below).

Configure OCI CLI profile


[opc@al ~]$ sudo al-config -u ocid1.user.oc1..aaaaaaaafo2liscovfefeubflxm2rswrzpnnmosue4lczmgaaodwtqdljj3q -t ocid1.tenancy.oc1..aaaaaaaazlv5wxkdhldyvxkkta7rjn25ocovfefexhidte5zgiyauut2i2gq -k ~/.oci/oci_api_key.pem
Configured OCI CLI profile.
Please delete /home/opc/.oci/oci_api_key.pem

This configures the OCI CLI profile for my user (ocid1.user.oc1.. is my user OCID, which I find in Oracle Cloud » Governance and Administration » Identity » Users » Users Detail » OCID copy) in my tenancy (ocid1.tenancy.oc1.. is my tenancy OCID, which I find in Oracle Cloud » Governance and Administration » Administration » Tenancy Details » OCID copy).

Notifications

When something happens autonomously, you want to be notified about it. This uses the OCI notification service with a topic you subscribe to. This is also available in the Free Tier. The topic is created with Oracle Cloud » Application Integration » Notifications » Create Topic, where you just put a name and a description and get the OCID for it (ocid1.onstopic.oc1.eu-frankfurt-1… for me).

While in the console, I've created a subscription on this topic with my e-mail address. I'll receive all notifications sent to this topic by e-mail.

Configure OCI notification service topic OCID


[opc@al ~]$ sudo al-config -T ocid1.onstopic.oc1.eu-frankfurt-1.aaaaaaaaamo7khj3xab6oec5xtcovfefeokqszapwsafeje6g6ltlnhd363a
Configured OCI notification service topic OCID.
Publishing message 'AL: Notification enabled on instance AL'
Published message 'AL: Notification enabled on instance AL'

In the Autonomous Linux instance, I've set up the OCI notification service topic OCID. And that's all.

Check your e-mails: you have to acknowledge the subscription to receive the notifications, of course.

Kernel version


[opc@al ~]$ uname -a
Linux al 4.14.35-1902.301.1.el7uek.x86_64 #2 SMP Tue Mar 31 16:50:32 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Here is the kernel version that has been installed:


[opc@al ~]$ sudo uptrack-uname -a
Linux al 4.14.35-1902.302.2.el7uek.x86_64 #2 SMP Fri Apr 24 14:24:11 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux

This is the effective kernel, as updated by Ksplice:


[opc@al ~]$ sudo uptrack-show
Installed updates:
[cp1p7rl5] Known exploit detection.
[3kfqruxl] Known exploit detection for CVE-2017-7308.
[6vy9wlov] Known exploit detection for CVE-2018-14634.
[r8wncd28] KPTI enablement for Ksplice.
[3e9je971] Known exploit detection for CVE-2018-18445.
[20bmudk6] Out-of-bounds access when classifying network packets with traffic control index.
[oy5cke5u] NULL dereference while writing Hyper-V SINT14 MSR.
[5jsm8lzj] CVE-2020-9383: Information leak in floppy disk driver.
[5p7yd05e] NULL pointer dereference when initializing Differentiated Services marker driver.
[sajmv0xh] CVE-2018-19854: Information leak in cryptography socket NETLINK_CRYPTO call.
[1gefn4lp] CVE-2019-19965: Denial-of-service in SCSI device removal.
[6hu77eez] Invalid memory access when sending an excessively large packet using Segmentation Offloads.
[f0zxddhg] Livelock in loop device block resize operation.
[2lgm3hz9] CVE-2019-14814, CVE-2019-14815, CVE-2019-14816: Denial-of-service when parsing access point settings in Marvell WiFi-Ex driver.
[3yqxyw42] CVE-2019-20096: Memory leak while changing DCCP socket SP feature values.
[9g5kf79r] Improved fix for CVE-2020-2732: Privilege escalation in Intel KVM nested emulation.
[bq9hiiuj] Race condition in ipoib during high request load causes denial-of-service.
[3youemoz] CVE-2020-11494: Information leak in serial line CAN device communication.
[jpbi3wnm] Use-after-free when removing generic block device.
[if1ety6t] Memory corruption when reading EFI sysfs entries.
[iv8r17d8] CVE-2020-8648: Use-after-free in virtual terminal selection buffer.
[mojwd0zk] Various Spectre-V1 information leaks in KVM.
[nvi6r5wx] CVE-2019-19527: Denial-of-service in USB HID device open.
[o3df6mds] CVE-2020-8647, CVE-2020-8649: Use-after-free in the VGA text console driver.
[kjyqg48a] CVE-2019-19532: Denial-of-service when initializing HID devices.
[74j9dhee] Divide-by-zero when CPU capacity changes causes denial-of-service.
[lgsoxuy7] CVE-2019-19768: Use-after-free when reporting an IO trace.

Effective kernel version is 4.14.35-1902.302.2.el7uek

All the details about the fixes applied by Ksplice, without any reboot, are there.

One month later

I created this instance on May 23rd, 2020 and I am writing this one month later.

Here are the e-mails I’ve received from the topic subscription:

And my current machine state:


[opc@al ~]$ uptime
 19:26:39 up 38 days, 13:49,  2 users,  load average: 0.07, 0.02, 0.00
[opc@al ~]$ uname -a
Linux al 4.14.35-1902.301.1.el7uek.x86_64 #2 SMP Tue Mar 31 16:50:32 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
[opc@al ~]$ sudo uptrack-uname -a
Linux al 4.14.35-1902.303.4.1.el7uek.x86_64 #2 SMP Fri May 29 14:56:41 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
[opc@al ~]$

The VM has been running 24/7 without any outage, and the effective kernel is now at a higher version than the one installed.

Ksplice updates

The effective kernel was updated on Tue Jun 16 08:04:33 GMT 2020, as reported by this e-mail I received:


noreply@notification.eu-frankfurt-1.oraclecloud.com
Jun 16, 2020, 10:04 AM
to AutonomousLinux

+------------------------------------------------------------------------+
|  Summary (Tue Jun 16 08:04:33 GMT 2020)                                |
+------------------------------------------------------------------------+
Ksplice updates installed: yes
Yum updates installed: no
Uptime: 08:04:33 up 24 days,  2:27,  0 users,  load average: 0.72, 0.20, 0.06
+------------------------------------------------------------------------+
|  Ksplice upgrade report                                                |
+------------------------------------------------------------------------+
Running 'ksplice -y all upgrade'.
Updating on-disk packages for new processes
Loaded plugins: langpacks
No packages marked for update
Nothing to do.
The following steps will be taken:
Install [i622mubr] Information leak in KVM_HC_CLOCK_PAIRING hypercall.
Install [35xnb9pi] CVE-2019-9500: Potential heap overflow in Broadcom FullMAC WLAN driver.
Install [ppqwl5uh] CVE-2019-15505: Out-of-bounds access in Technisat DVB-S/S2 USB2.0 driver.
Install [ctobm6wo] CVE-2019-19767: Use-after-free in with malformed ext4 filesystems.
Install [l5so0kqe] CVE-2019-19056, CVE-2019-19057: Denial-of-service in the Marvell mwifiex PCIe driver.
Install [b4iszmv7] CVE-2019-20636: Out-of-bounds write via crafted keycode table.
Install [5oec4s3n] Denial-of-service when mounting an ocfs2 filesystem.
Install [rafq9pe9] CVE-2019-9503: Denial-of-service when receiving firmware event frames over a Broadcom WLAN USB dongle.
Install [nlpu7kxi] Denial-of-service when initializing a serial CAN device.
Install [lnz9di5t] CVE-2020-11608: NULL pointer dereference when initializing USB GSPCA based webcams.
Install [2bodr9yk] CVE-2019-19537: Denial-of-service in USB character device registration.
Install [9iw2y1wn] CVE-2019-19524: Use-after-free when unregistering memoryless force-feedback driver.
Install [h5s7eh41] CVE-2020-11609: NULL pointer dereference when initializing STV06XX USB Camera device.
Install [behlqry8] Denial-of-service via invalid TSC values in KVM.
Install [onllaobw] CVE-2019-12819: Use-after-free during initialization of MDIO bus driver.
Install [fdn63bdc] CVE-2019-11599: Information leak in the coredump implementation.
Install [kb3b03z9] CVE-2019-19058: Denial-of-service in iwlwifi firmware interface.
Install [mgfi6p6r] Use-after-free when writing to SLIP serial line.
Install [hs2h9j8w] CVE-2019-14896, CVE-2019-14897: Denial-of-service when parsing BSS in Marvell 8xxx Libertas WLAN driver.
Install [bb9sd52m] CVE-2020-11668: NULL pointer dereference when initializing Xirlink C-It USB camera device.
Install [p4ygwgyj] Information leak in KVM's VMX operation path.
Install [1uxt1xo6] NFSv4 client fails to correctly renew lease when using fsinfo.
Install [hjoeh3zi] CVE-2020-0543: Side-channel information leak using SRBDS.
Installing [i622mubr] Information leak in KVM_HC_CLOCK_PAIRING hypercall.
Installing [35xnb9pi] CVE-2019-9500: Potential heap overflow in Broadcom FullMAC WLAN driver.
Installing [ppqwl5uh] CVE-2019-15505: Out-of-bounds access in Technisat DVB-S/S2 USB2.0 driver.
Installing [ctobm6wo] CVE-2019-19767: Use-after-free in with malformed ext4 filesystems.
Installing [l5so0kqe] CVE-2019-19056, CVE-2019-19057: Denial-of-service in the Marvell mwifiex PCIe driver.
Installing [b4iszmv7] CVE-2019-20636: Out-of-bounds write via crafted keycode table.
Installing [5oec4s3n] Denial-of-service when mounting an ocfs2 filesystem.
Installing [rafq9pe9] CVE-2019-9503: Denial-of-service when receiving firmware event frames over a Broadcom WLAN USB dongle.
Installing [nlpu7kxi] Denial-of-service when initializing a serial CAN device.
Installing [lnz9di5t] CVE-2020-11608: NULL pointer dereference when initializing USB GSPCA based webcams.
Installing [2bodr9yk] CVE-2019-19537: Denial-of-service in USB character device registration.
Installing [9iw2y1wn] CVE-2019-19524: Use-after-free when unregistering memoryless force-feedback driver.
Installing [h5s7eh41] CVE-2020-11609: NULL pointer dereference when initializing STV06XX USB Camera device.
Installing [behlqry8] Denial-of-service via invalid TSC values in KVM.
Installing [onllaobw] CVE-2019-12819: Use-after-free during initialization of MDIO bus driver.
Installing [fdn63bdc] CVE-2019-11599: Information leak in the coredump implementation.
Installing [kb3b03z9] CVE-2019-19058: Denial-of-service in iwlwifi firmware interface.
Installing [mgfi6p6r] Use-after-free when writing to SLIP serial line.
Installing [hs2h9j8w] CVE-2019-14896, CVE-2019-14897: Denial-of-service when parsing BSS in Marvell 8xxx Libertas WLAN driver.
Installing [bb9sd52m] CVE-2020-11668: NULL pointer dereference when initializing Xirlink C-It USB camera device.
Installing [p4ygwgyj] Information leak in KVM's VMX operation path.
Installing [1uxt1xo6] NFSv4 client fails to correctly renew lease when using fsinfo.
Installing [hjoeh3zi] CVE-2020-0543: Side-channel information leak using SRBDS.
Your kernel is fully up to date.
Effective kernel version is 4.14.35-1902.303.4.1.el7uek
+------------------------------------------------------------------------+
|  Yum upgrade report                                                    |
+------------------------------------------------------------------------+
Running 'yum-cron' with update cmd: default.
+------------------------------------------------------------------------+
|  Ksplice updates status                                                |
+------------------------------------------------------------------------+
Running 'ksplice all show'.
Ksplice user-space updates:
No Ksplice user-space updates installed

Ksplice kernel updates:
Installed updates:
[cp1p7rl5] Known exploit detection.
[3kfqruxl] Known exploit detection for CVE-2017-7308.
[6vy9wlov] Known exploit detection for CVE-2018-14634.
[r8wncd28] KPTI enablement for Ksplice.
[3e9je971] Known exploit detection for CVE-2018-18445.
[20bmudk6] Out-of-bounds access when classifying network packets with traffic control index.
[oy5cke5u] NULL dereference while writing Hyper-V SINT14 MSR.
[5jsm8lzj] CVE-2020-9383: Information leak in floppy disk driver.
[5p7yd05e] NULL pointer dereference when initializing Differentiated Services marker driver.
[sajmv0xh] CVE-2018-19854: Information leak in cryptography socket NETLINK_CRYPTO call.
[1gefn4lp] CVE-2019-19965: Denial-of-service in SCSI device removal.
[6hu77eez] Invalid memory access when sending an excessively large packet using Segmentation Offloads.
[f0zxddhg] Livelock in loop device block resize operation.
[2lgm3hz9] CVE-2019-14814, CVE-2019-14815, CVE-2019-14816: Denial-of-service when parsing access point settings in Marvell WiFi-Ex driver.
[3yqxyw42] CVE-2019-20096: Memory leak while changing DCCP socket SP feature values.
[9g5kf79r] Improved fix for CVE-2020-2732: Privilege escalation in Intel KVM nested emulation.
[bq9hiiuj] Race condition in ipoib during high request load causes denial-of-service.
[3youemoz] CVE-2020-11494: Information leak in serial line CAN device communication.
[jpbi3wnm] Use-after-free when removing generic block device.
[if1ety6t] Memory corruption when reading EFI sysfs entries.
[iv8r17d8] CVE-2020-8648: Use-after-free in virtual terminal selection buffer.
[mojwd0zk] Various Spectre-V1 information leaks in KVM.
[nvi6r5wx] CVE-2019-19527: Denial-of-service in USB HID device open.
[o3df6mds] CVE-2020-8647, CVE-2020-8649: Use-after-free in the VGA text console driver.
[kjyqg48a] CVE-2019-19532: Denial-of-service when initializing HID devices.
[74j9dhee] Divide-by-zero when CPU capacity changes causes denial-of-service.
[lgsoxuy7] CVE-2019-19768: Use-after-free when reporting an IO trace.
[i622mubr] Information leak in KVM_HC_CLOCK_PAIRING hypercall.
[35xnb9pi] CVE-2019-9500: Potential heap overflow in Broadcom FullMAC WLAN driver.
[ppqwl5uh] CVE-2019-15505: Out-of-bounds access in Technisat DVB-S/S2 USB2.0 driver.
[ctobm6wo] CVE-2019-19767: Use-after-free in with malformed ext4 filesystems.
[l5so0kqe] CVE-2019-19056, CVE-2019-19057: Denial-of-service in the Marvell mwifiex PCIe driver.
[b4iszmv7] CVE-2019-20636: Out-of-bounds write via crafted keycode table.
[5oec4s3n] Denial-of-service when mounting an ocfs2 filesystem.
[rafq9pe9] CVE-2019-9503: Denial-of-service when receiving firmware event frames over a Broadcom WLAN USB dongle.
[nlpu7kxi] Denial-of-service when initializing a serial CAN device.
[lnz9di5t] CVE-2020-11608: NULL pointer dereference when initializing USB GSPCA based webcams.
[2bodr9yk] CVE-2019-19537: Denial-of-service in USB character device registration.
[9iw2y1wn] CVE-2019-19524: Use-after-free when unregistering memoryless force-feedback driver.
[h5s7eh41] CVE-2020-11609: NULL pointer dereference when initializing STV06XX USB Camera device.
[behlqry8] Denial-of-service via invalid TSC values in KVM.
[onllaobw] CVE-2019-12819: Use-after-free during initialization of MDIO bus driver.
[fdn63bdc] CVE-2019-11599: Information leak in the coredump implementation.
[kb3b03z9] CVE-2019-19058: Denial-of-service in iwlwifi firmware interface.
[mgfi6p6r] Use-after-free when writing to SLIP serial line.
[hs2h9j8w] CVE-2019-14896, CVE-2019-14897: Denial-of-service when parsing BSS in Marvell 8xxx Libertas WLAN driver.
[bb9sd52m] CVE-2020-11668: NULL pointer dereference when initializing Xirlink C-It USB camera device.
[p4ygwgyj] Information leak in KVM's VMX operation path.
[1uxt1xo6] NFSv4 client fails to correctly renew lease when using fsinfo.
[hjoeh3zi] CVE-2020-0543: Side-channel information leak using SRBDS.

Effective kernel version is 4.14.35-1902.303.4.1.el7uek

--
You are receiving notifications as a subscriber to the topic: AL (Topic OCID: ocid1.onstopic.oc1.eu-frankfurt-1.aaaaaaaaamo7khj3xab6oec5xt5c7ia6eokqszapwsafeje6g6ltlnhd363a). To stop receiving notifications from this topic, unsubscribe.

Please do not reply directly to this email. If you have any questions or comments regarding this email, contact your account administrator.

Ksplice updates

I’ve also seen a notification about failed updates:


+------------------------------------------------------------------------+
|  Summary (Mon Jun 29 08:03:19 GMT 2020)                                |
+------------------------------------------------------------------------+
Ksplice updates installed: failed
Yum updates installed: no
Uptime: 08:03:19 up 37 days,  2:25,  0 users,  load average: 0.31, 0.08, 0.03
+------------------------------------------------------------------------+
|  Ksplice upgrade report                                                |
+------------------------------------------------------------------------+
Running 'ksplice -y all upgrade'.
Updating on-disk packages for new processes
Loaded plugins: langpacks
No packages marked for update
Nothing to do.
Unexpected error communicating with the Ksplice Uptrack server. Please
check your network connection and try again. If this error re-occurs,
e-mail ksplice-support_ww@oracle.com.

(Network error: TCP connection reset by peer)

Ok, network error at that time.
However, the next run was ok:


+------------------------------------------------------------------------+
|  Summary (Tue Jun 30 08:03:13 GMT 2020)                                |
+------------------------------------------------------------------------+
Ksplice updates installed: no
Yum updates installed: no
Uptime: 08:03:13 up 38 days,  2:25,  1 user,  load average: 0.00, 0.00, 0.00

and I can confirm by running manually:


[opc@al ~]$ ksplice -y all upgrade
Error: failed to configure the logger
[opc@al ~]$ sudo ksplice -y all upgrade
Updating on-disk packages for new processes
Loaded plugins: langpacks
ol7_x86_64_userspace_ksplice                                                                                                                     | 2.8 kB  00:00:00
No packages marked for update
100% |################################################################################################################################################################|
Nothing to do.
Nothing to be done.
Your kernel is fully up to date.
Effective kernel version is 4.14.35-1902.303.4.1.el7uek

Ksplice covers the kernel and some user-space libraries such as glibc and openssl.
But Autonomous Linux also keeps the packages themselves up to date.

Yum updates

In addition to kernel patches, the packages are also updated:


The following updates will be applied on al:
================================================================================
 Package                  Arch    Version                  Repository      Size
================================================================================
Installing:
 kernel                   x86_64  3.10.0-1127.13.1.el7     al7             50 M
Updating:
 bpftool                  x86_64  3.10.0-1127.13.1.el7     al7            8.4 M
 ca-certificates          noarch  2020.2.41-70.0.el7_8     al7            382 k
 kernel-tools             x86_64  3.10.0-1127.13.1.el7     al7            8.0 M
 kernel-tools-libs        x86_64  3.10.0-1127.13.1.el7     al7            8.0 M
 libgudev1                x86_64  219-73.0.1.el7_8.8       al7            107 k
 microcode_ctl            x86_64  2:2.1-61.10.0.1.el7_8    al7            2.7 M
 ntpdate                  x86_64  4.2.6p5-29.0.1.el7_8.2   al7             86 k
 python-perf              x86_64  3.10.0-1127.13.1.el7     al7            8.0 M
 python36-oci-cli         noarch  2.12.0-1.el7             al7            4.4 M
 python36-oci-sdk         x86_64  2.17.0-1.el7             al7             10 M
 rsyslog                  x86_64  8.24.0-52.el7_8.2        al7            620 k
 selinux-policy           noarch  3.13.1-266.0.3.el7_8.1   al7            497 k
 selinux-policy-targeted  noarch  3.13.1-266.0.3.el7_8.1   al7            7.2 M
 systemd                  x86_64  219-73.0.1.el7_8.8       al7            5.1 M
 systemd-libs             x86_64  219-73.0.1.el7_8.8       al7            416 k
 systemd-python           x86_64  219-73.0.1.el7_8.8       al7            143 k
 systemd-sysv             x86_64  219-73.0.1.el7_8.8       al7             95 k
Removing:
 kernel                   x86_64  3.10.0-1127.el7          @anaconda/7.8   64 M

Transaction Summary
================================================================================
Install   1 Package
Upgrade  17 Packages
Remove    1 Package
The updates were successfully applied

All packages are maintained up-to-date without human intervention and without downtime.

Package repository

The package repository is limited:


[opc@al ~]$ yum repolist
Loaded plugins: langpacks
ol7_x86_64_userspace_ksplice/primary_db                                                                                                          | 193 kB  00:00:00
repo id                                                       repo name                                                                                           status
!al7/x86_64                                                   Autonomous Linux 7Server (x86_64)                                                                   3,392
ol7_x86_64_userspace_ksplice                                  Ksplice aware userspace packages for Oracle Linux 7Server (x86_64)                                    438
repolist: 3,830
[opc@al ~]$ yum list all | wc -l
1462

About 1,462 packages listed in total.
As a comparison, here is an Oracle Enterprise Linux image:


[opc@ol ~]$ yum repolist
Loaded plugins: langpacks, ulninfo
Repodata is over 2 weeks old. Install yum-cron? Or run: yum makecache fast
repo id                                                  repo name                                                                                                status
!ol7_UEKR5/x86_64                                        Latest Unbreakable Enterprise Kernel Release 5 for Oracle Linux 7Server (x86_64)                            200
!ol7_addons/x86_64                                       Oracle Linux 7Server Add ons (x86_64)                                                                       421
!ol7_developer/x86_64                                    Oracle Linux 7Server Development Packages (x86_64)                                                        1,319
!ol7_developer_EPEL/x86_64                               Oracle Linux 7Server Development Packages (x86_64)                                                       31,78$
!ol7_ksplice                                             Ksplice for Oracle Linux 7Server (x86_64)                                                                 6,41$
!ol7_latest/x86_64                                       Oracle Linux 7Server Latest (x86_64)                                                                     18,86$
!ol7_oci_included/x86_64                                 Oracle Software for OCI users on Oracle Linux 7Server (x86_64)                                              26$
!ol7_optional_latest/x86_64                              Oracle Linux 7Server Optional Latest (x86_64)                                                            13,91$
!ol7_software_collections/x86_64                         Software Collection Library release 3.0 packages for Oracle Linux 7 (x86_64)                             14,47$
repolist: 87,645
[opc@ol ~]$ yum list all | wc -l
Repodata is over 2 weeks old. Install yum-cron? Or run: yum makecache fast
36720
[opc@ol ~]$

There is a lot more here. Remember that OEL is compatible with RHEL.

If you need more packages you can open a SR and ask to have it added to the Autonomous Linux repository. For example, I use tmux everyday, especially in a free tier VM (see https://blog.dbi-services.com/always-free-always-up-tmux-in-the-oracle-cloud-with-ksplice-updates/).

If you don't want to ask for it, you can also add the public-yum-ol7.repo yourself:


[opc@al ~]$ sudo yum-config-manager --add-repo http://yum.oracle.com/public-yum-ol7.repo
Loaded plugins: langpacks
adding repo from: http://yum.oracle.com/public-yum-ol7.repo
grabbing file http://yum.oracle.com/public-yum-ol7.repo to /etc/yum.repos.d/public-yum-ol7.repo
repo saved to /etc/yum.repos.d/public-yum-ol7.repo

This added the public Oracle Linux repository. Is it a good idea? It depends on what you want: stay with the minimum validated by Oracle to be autonomously updated without any problem, or accept a little additional customization.
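If you prefer not to leave the whole public repository enabled, a possible middle ground (not what I did here; the repo id comes from the repolist output above) is to disable it and enable it only for a specific install:


sudo yum-config-manager --disable ol7_latest
sudo yum --enablerepo=ol7_latest install -y tmux

In this post I simply keep the repository enabled.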

And then install the package you want:


[opc@al ~]$ sudo yum install -y tmux

Loaded plugins: langpacks
Resolving Dependencies
--> Running transaction check
---> Package tmux.x86_64 0:1.8-4.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================================================================================================================
 Package                              Arch                                   Version                                   Repository                                  Size
========================================================================================================================================================================
Installing:
 tmux                                 x86_64                                 1.8-4.el7                                 ol7_latest                                 241 k

Transaction Summary
========================================================================================================================================================================
Install  1 Package

Total download size: 241 k
Installed size: 554 k
Downloading packages:
tmux-1.8-4.el7.x86_64.rpm                                                                                                                        | 241 kB  00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : tmux-1.8-4.el7.x86_64                                                                                                                                1/1
  Verifying  : tmux-1.8-4.el7.x86_64                                                                                                                                1/1

Installed:
  tmux.x86_64 0:1.8-4.el7

Now the package is installed and will be kept up to date by the daily job.

Autonomous cron

Those updates are scheduled by cron, but you change the schedule through the provided al-config bash script:


[opc@al ~]$ sudo al-config -s
Current daily auto update time window(24-hour): 7-11
Current daily auto update time(24-hour): 08:03

This has set a random time within the 7 am to 11 am window, which is here 08:03:


[opc@al ~]$ cat /etc/cron.d/al-update
# Daily cron job for AL auto updates.
# Created by al-config, do not modify this file.
# If you want to change update time, use
# 'sudo al-config -w ' to set auto update time window
3 8 * * * root /usr/sbin/al-update >/dev/null

That's the autonomous thing here: you don't set the crontab job yourself. You just call al-config with a time window and it sets the cron job for you at a random time within this window.
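Just to illustrate the idea, here is a minimal sketch of what such a mechanism could do (a hypothetical re-implementation, not the actual al-config code, and it writes a demo file instead of the real /etc/cron.d/al-update):


#!/bin/bash
# hypothetical sketch: pick a random time in the window [START_HOUR, END_HOUR)
START_HOUR=0
END_HOUR=2
RANGE_MIN=$(( (END_HOUR - START_HOUR) * 60 ))
OFFSET=$(( RANDOM % RANGE_MIN ))
HOUR=$(( START_HOUR + OFFSET / 60 ))
MINUTE=$(( OFFSET % 60 ))
# write the schedule as a cron entry (demo file, not the one managed by al-config)
printf '%d %d * * * root /usr/sbin/al-update >/dev/null\n' "$MINUTE" "$HOUR" |
  sudo tee /etc/cron.d/al-update-demo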

Let’s play with this:


[opc@al ~]$ sudo al-config -w 0-2
Configured daily auto update time window(24-hour): 0-2
Configured daily auto update time(24-hour): 01:12
Created cron job file /etc/cron.d/al-update .
[opc@al ~]$ sudo al-config -w 0-2
Configured daily auto update time window(24-hour): 0-2
Configured daily auto update time(24-hour): 01:33
Created cron job file /etc/cron.d/al-update .
[opc@al ~]$ sudo al-config -w 0-2
Configured daily auto update time window(24-hour): 0-2
Configured daily auto update time(24-hour): 00:47
Created cron job file /etc/cron.d/al-update .
[opc@al ~]$ sudo al-config -w 0-2
Configured daily auto update time window(24-hour): 0-2
Configured daily auto update time(24-hour): 00:00
Created cron job file /etc/cron.d/al-update .
[opc@al ~]$ sudo al-config -w 0-2
Configured daily auto update time window(24-hour): 0-2
Configured daily auto update time(24-hour): 00:41
Created cron job file /etc/cron.d/al-update .

You see the idea. Very simple. But simple is awesome, right?

What is this scheduled job doing autonomously every day? You see it in the notification e-mail. Basically it runs:


ksplice -y all upgrade
yum-cron
ksplice all show

and sends the output to your e-mail
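To get the picture, a minimal sketch of such a daily job could look like this (my guess at the principle, not the actual /usr/sbin/al-update script, and it assumes a local mailx is configured):


#!/bin/bash
# sketch only: run the updates, collect the output, mail the report
REPORT=$(mktemp)
{
  echo "== Ksplice upgrade =="
  ksplice -y all upgrade
  echo "== Yum updates =="
  yum-cron
  echo "== Ksplice status =="
  ksplice all show
} > "$REPORT" 2>&1
mailx -s "AL daily update report" root < "$REPORT"
rm -f "$REPORT"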

This is what keeps your Autonomous Linux up-to-date: ksplice, yum, and the output sent to your e-mail through:


Received: by omta-ad1-fd1-102-eu-frankfurt-1.omtaad1.vcndpfra.oraclevcn.com (Oracle Communications Messaging Server 8.1.0.1.20200619 64bit (built Jun 19 2020)) w

This is an excerpt from the notification e-mail headers. "Oracle Communications Messaging Server" is a heritage from Sun which, according to Wikipedia, has its roots in the Netscape Messaging Server. All those little bricks from years of enterprise IT are nicely wired together to bring the automation known as Autonomous.

The post Oracle Autonomous Linux: cron'd ksplice and yum updates appeared first on the dbi services Blog.

Oracle Data Guard RedoRoutes : What is Priority 8 ?


When dealing with cascading or far sync in a Data Guard environment, it is important to understand how to configure the RedoRoutes property.
By default, a primary database sends redo to each transport destination configured for it. Using the RedoRoutes property, we can create a more complex transport topology, depending on our environment.
Basically, the RedoRoutes property has this format:

(redo_routing_rule_1) [(redo_routing_rule_n)]

Where each routing rule contains a redo source field and a redo destination field separated by a colon:

(redo source : redo destination)

More information can be found in the Oracle documentation.

In this blog I try to simply explain how to configure the RedoRoutes property in a Data Guard environment with a Far Sync instance. See my previous blog for Far Sync instance creation.

I am using Oracle 20c.

The first configuration we consider is the following one

We have
1 primary database: prod20_site1
2 standby databases: prod20_site2 and prod20_site4
1 far sync instance fs_site3

For far sync creation with Oracle 20c see my previous blog

Below is the status of the broker configuration:

DGMGRL> show configuration

Configuration - prod20

  Protection Mode: MaxAvailability
  Members:
  prod20_site1 - Primary database
    prod20_site2 - Physical standby database
    prod20_site4 - Physical standby database
    fs_site3     - Far sync instance

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 47 seconds ago)

For now, there are no RedoRoutes configured:

DGMGRL> show database prod20_site1 redoroutes;
  RedoRoutes = ''
DGMGRL> show database prod20_site2 redoroutes;
  RedoRoutes = ''
DGMGRL> show database prod20_site4 redoroutes;
  RedoRoutes = ''
DGMGRL> 

For this configuration I want the primary database to send redo according to the following rules:

prod20_site2 will receive redo directly from prod20_site1
prod20_site1 =====> prod20_site2

prod20_site4 will receive redo via fs_site3 which will forward redo to prod20_site4
prod20_site1 =====> fs_site3 =====> prod20_site4

and if fs_site3 is not available, prod20_site4 will receive redo directly from prod20_site1
prod20_site1 =====> prod20_site4

For this, we first edit the primary database RedoRoutes property:

DGMGRL> edit database prod20_site1 set property redoroutes='(local:prod20_site2,(fs_site3 priority=1,prod20_site4 priority=2))';
Property "redoroutes" updated

This rule has the following meaning:

local:prod20_site2: if prod20_site1 is the primary database then redo will be sent to prod20_site2

local:(fs_site3 priority=1,prod20_site4 priority=2): if prod20_site1 is the primary database, redo will be sent to fs_site3 or to prod20_site4. As fs_site3 has the higher priority (smaller priority numbers mean higher priority), redo is sent to fs_site3 first, and if fs_site3 is unavailable, changes are sent to prod20_site4.
Just note that, as fs_site3 has the higher priority, redo will again be sent to fs_site3 as soon as it becomes available again.

Then we have to tell fs_site3 to forward the redo received from prod20_site1 to prod20_site4.

DGMGRL> edit far_sync fs_site3 set property redoroutes='(prod20_site1:prod20_site4 ASYNC)';
Property "redoroutes" updated

Below are the RedoRoutes we have configured for prod20_site1 and fs_site3:

DGMGRL> show database prod20_site1 redoroutes;
  RedoRoutes = '(local:prod20_site2,(fs_site3 priority=1,prod20_site4 priority=2))'
DGMGRL> show database prod20_site2 redoroutes;
  RedoRoutes = ''
DGMGRL> show database prod20_site4 redoroutes;
  RedoRoutes = ''
DGMGRL> show far_sync  fs_site3 redoroutes;
  RedoRoutes = '(prod20_site1:prod20_site4 ASYNC)'
DGMGRL>

And we can verify the status of our configuration

DGMGRL> show configuration verbose


Configuration - prod20

  Protection Mode: MaxPerformance
  Members:
  prod20_site1 - Primary database
    prod20_site2 - Physical standby database
    fs_site3     - Far sync instance
      prod20_site4 - Physical standby database
    prod20_site4 - Physical standby database (alternate of fs_site3)
…
…
Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS

DGMGRL>

Let's now consider this configuration where we have two far sync instances. As in the first configuration, we want to send redo to the far sync instances first if possible; otherwise redo will be sent directly to the standby databases.

The RedoRoutes property of the primary can be configured as below

DGMGRL> edit database prod20_site1 set property redoroutes='(local:(fs_site5 priority=1,prod20_site2 priority=2),(fs_site3 priority=1,prod20_site4 priority=2))';
Warning: ORA-16677: Standby database has the same or higher priority than other members specified in the RedoRoutes group.

Property "redoroutes" updated
DGMGRL>

And the RedoRoutes for the far sync fs_site5 can be adjusted like this:

DGMGRL> edit far_sync fs_site5 set property redoroutes='(prod20_site1:prod20_site2 ASYNC)';
Property "redoroutes" updated
DGMGRL>

We can then verify the status of the configuration:

DGMGRL> show configuration verbose

Configuration - prod20

  Protection Mode: MaxPerformance
  Members:
  prod20_site1 - Primary database
    fs_site5     - Far sync instance
      prod20_site2 - Physical standby database
    prod20_site2 - Physical standby database (alternate of fs_site5)
    fs_site3     - Far sync instance
      prod20_site4 - Physical standby database
    prod20_site4 - Physical standby database (alternate of fs_site3)

…
…

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS

DGMGRL>

As we can see when configuring RedoRoutes, we sometimes have to deal with the PRIORITY attribute.
This attribute can have a value between 1 and 8, 1 being the highest priority and 8 the lowest.
Let's consider two destinations A and B in the same group.

Case 1: A and B with the same priority

Redo will be sent to A or B, let's say A. When A is unavailable, redo will then be sent to B. And when A becomes reachable again, redo will continue to be sent to B.

(local:(A priority=1,B priority=1))

Case 2: A has a higher priority than B
Redo will be sent to A. If A becomes unavailable, redo will be sent to B. And if A becomes reachable again, redo will be sent to A again as it has the higher priority.

(local:(A priority=1,B priority=2))

But sometimes, in the same group, we may want to send redo to both members. For example, in the following configuration, we just want redo to be sent to fs_site3 if possible, and if fs_site3 is not reachable, changes will be sent to both prod20_site2 and prod20_site4.

In this case we can use PRIORITY 8, which has a special meaning: if the primary sends redo to a member with priority 8, then it must also send that redo to every other member with priority 8 in the group.

In the configuration above, we want the following rules:

prod20_site1 will send changes to fs_site3, which will forward them to prod20_site2 and prod20_site4, and if fs_site3 is not available, prod20_site1 will ship redo directly to both standby databases.

And when fs_site3 becomes available again, redo will again be sent to fs_site3.

The RedoRoutes for the primary database can then be set like this:

DGMGRL> edit database prod20_site1 set property redoroutes='(local:(fs_site3 priority=1,prod20_site2 priority=8,prod20_site4 priority=8))';
Warning: ORA-16677: Standby database has the same or higher priority than other members specified in the RedoRoutes group.

Property "redoroutes" updated
DGMGRL>

And for the far sync instance

DGMGRL> edit far_sync fs_site3 set property redoroutes='(prod20_site1:prod20_site2 ASYNC,prod20_site4 ASYNC)';
Property "redoroutes" updated
DGMGRL>

The status of the configuration

DGMGRL> show configuration verbose

Configuration - prod20

  Protection Mode: MaxPerformance
  Members:
  prod20_site1 - Primary database
    fs_site3     - Far sync instance
      prod20_site2 - Physical standby database
      prod20_site4 - Physical standby database
    prod20_site2 - Physical standby database (alternate of fs_site3)
    prod20_site4 - Physical standby database (alternate of fs_site3)
…
…

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS

DGMGRL>

Conclusion

Depending on the configuration, the redo transport topology can be very complex. What I can recommend when dealing with far sync instances is to think about all possible cases, including switchover and failover, and to design the redo transport architecture based on all of them. In this blog we only considered the case where prod20_site1 is the primary.
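As a starting point for such checks, here is a small sketch of the kind of verification I would script after each role change test (the connection string and the SYS_PWD variable are assumptions to adapt to your environment):


# assumed credentials/TNS aliases, to be adapted
dgmgrl -silent sys/"${SYS_PWD}"@prod20_site2 "show configuration verbose"
dgmgrl -silent sys/"${SYS_PWD}"@prod20_site2 "validate database prod20_site4"
dgmgrl -silent sys/"${SYS_PWD}"@prod20_site2 "show database prod20_site1 redoroutes"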

The post Oracle Data Guard RedoRoutes : What is Priority 8 ? appeared first on the dbi services Blog.

POUG 2020 Workshop Day 1


I had the opportunity to participate in POUG day 1 and wanted, through this blog, to share some feedback on the interesting sessions I could follow.

First of all, I would like to mention the great introduction done by the staff team and the great organization. The POUG staff adapted to the coronavirus situation and organized excellent POUG virtual days. Well done!

I had the chance today to follow a few sessions:

  • Developing Clusterware agents by Michal Wesolowski, for which I will provide some feedback and interesting things to know later in this blog.
  • The Heart of Oracle – how the core RDBMS engine works by Martin Widlake. Martin presented how the heart of Oracle works, from the instance to the database files, going through the redo logs and archive logs. He addressed how blocks are handled and how SQL statements are parsed and optimized. By the way, did you know that during a transaction a single block goes to the SGA (buffer cache) while multiple blocks go to the PGA, and are therefore not shared with other sessions? Same for a full table scan. I did not; I always thought that all blocks went from the data files to the buffer cache. It is also good to know that Oracle uses a hashing algorithm to find a block in the SGA.
  • The foundations of understanding Execution Plans by Jonathan Lewis. Great subject! Using concrete examples, Jonathan covered a complex topic: how execution plans work.
  • Wait Events and DB time by Franck Pachot. My colleague Franck gave a very interesting session explaining wait events and DB time, for which I will provide some of the information later in this blog.
  • Analysis of a Couple AWR Reports by Toon Koppelaars. This presentation was a good follow-up to Franck's session. Toon explained how to interpret an AWR report.

Developing Clusterware agents

Michal did a great interactive presentation and demo, with the demos refreshing the diagrams displayed in the slides through a web-service-based setup. If you ever have the opportunity to follow one of Michal's presentations, I would really encourage you to do so. You will enjoy it and have fun!

We got a great introduction and explanation of how clusters work, from the free solution (Oracle Clusterware) to paid solutions (Veritas Cluster).

A comparison of some solutions can be found in the next picture:

Grid Infrastructure is composed of:

  • Clusterware / Oracle Restart
  • ASM

The cluster architecture looks like this:

Oracle Clusterware: database and clustering

We can use Oracle Clusterware to build highly available applications with built-in agents.

The command crsctl status type provides the configuration information of one or more particular resource types. All resources prefixed with ora. are Oracle objects.

To create your own HA applications, the cluster_resource or local_resource types should be used, as sketched below.
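For example, registering a custom application could look like this (a hypothetical sketch: the resource name and action script path are made up, and the action script itself, which must handle start/stop/check/clean, is not shown):


# hypothetical custom resource; /u01/app/ha/myapp.sh must implement start/stop/check/clean
crsctl add resource myapp -type cluster_resource \
  -attr "ACTION_SCRIPT=/u01/app/ha/myapp.sh,PLACEMENT=restricted,HOSTING_MEMBERS=node1 node2,CHECK_INTERVAL=30,RESTART_ATTEMPTS=2"

# then manage it like any ora.* resource
crsctl start resource myapp
crsctl status resource myapp
crsctl relocate resource myapp -n node2 -f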

Oracle tools to deal with Clusterware:

  • srvctl: dedicated to ora.* resources, to be used from the ORACLE_HOME and not the GI home
  • crsctl: to be used for custom resources, for monitoring all resources, managing OHAS resources, and managing CRS itself (crsctl stop/start crs)

Oracle Grid Infrastructure standalone agents:
HA agents for Oracle applications like GoldenGate, PeopleSoft, WebLogic, MySQL, … Written in Perl, easy to install and manage.

Standard Edition HA:

  • Available from GI 19.7 and DB SE2 19.7
  • No HA/GI for SE2 from 19.3 to 19.6
  • Needs ASM or ACFS

As seen in the next picture, Clusterware is more complex than we could imagine:

Dependencies between resources (node 1 -> RES A -> RES B):
To display the dependency tree, use crsctl stat res -dependency [-stop | -pullup].
To force-stop a resource and all its dependencies: crsctl stop res resA -f.
To start resB and automatically start resA first: crsctl start res resB.
To relocate the whole chain to a new node: crsctl relocate res resA -n node3 -f.
With a hard dependency, both resources should be started on the same node.
The pullup dependency is very important and needs to be used together with a hard dependency: if resource B depends on resource A and resource A fails and then recovers, then resource B is restarted.

Resource states / return codes are displayed in the next picture:

Wait Events and DB time

Wait events have existed since 1992 to see where the time is spent when the database is busy, to verify resource utilization, to know the load profile, and to see which waits can scale or not. Without wait events, tuning would be done blind.

System calls are wait events.

Wait events can be seen in sql_trace, v$ views, Statspack, ASH, AWR, or any other tool like tkprof…

Between the fetch() call and the result set, the database needs to do some work: CPU work, reading blocks, … In between the CPU work, the server process is just waiting (idle, I/O, …). The idea is then to instrument this time and do profiling.

Idle is a system call as well, waiting for the network. Between fetch() and the result set, it is user response time. Tuning will try to reduce this response time.

DB time (session is active) = user response time = CPU and/or wait events
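A quick way to see this split on a running instance is a query like the one below (a minimal sketch, cumulative since instance startup, assuming a sysdba connection; with the Diagnostic Pack you would rather look at AWR/ASH):


sqlplus -s / as sysdba <<'SQL'
set pagesize 100 linesize 150
-- DB time vs DB CPU, in seconds
select stat_name, round(value/1e6,1) seconds
from   v$sys_time_model
where  stat_name in ('DB time','DB CPU');
-- top 5 non-idle wait events
select event, total_waits, round(time_waited_micro/1e6,1) seconds
from   v$system_event
where  wait_class <> 'Idle'
order  by time_waited_micro desc
fetch first 5 rows only;
SQL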

Wait events can be, for example, SQL*Net message to client or from client, or PL/SQL lock timer.

CPU time = parsing queries, reading blocks or sorting rows.

  • Tuning idea: investigate execution plans.
  • Reduce parsing: use bind variables for similar queries.
  • Reduce block reads: use indexes for better selectivity on the predicates, or use hash joins to join many rows.
  • Reduce row sorting: do not select all columns (select *) if not needed.

Wait events:
Only wait events of foreground sessions count for DB time. Wait events of other processes can be troubleshooted further if needed.

I/O from the user session process to the shared buffer cache is named db file, and I/O to the PGA is named direct path:

  • db file read
    • read one block into the buffer cache: db file sequential read
    • read multiple blocks into the buffer cache: db file scattered read, db file parallel read
  • db file sequential read
    • single block read. The wait count is the throughput of single block reads (divide by elapsed time for IOPS). The average wait time is the latency to read 8k.
    • physical reads is the number of blocks, physical IO requests is the number of IO calls

If the average time is too high: look at the storage, get faster disks (NVMe), …
If the count is too high: look for a better execution plan, a larger buffer cache or PGA, …

ASH Viewer can be downloaded if you have no Diagnostic Pack license.

Application wait class: locks.

log file sync = commit wait

  • average time too high: reduce the queue length, get faster disks
  • count too high: avoid row-by-row commits, use nologging operations, look at the nowait option

System I/O comes in most cases from background processes: backups running, contention on control file writes (due to multiplexing), too many log switches.

Tools: SQL trace, tkprof, v$ views.

The following picture is a nice conclusion to summarize wait events. I like it… 😉

Conclusion

This POUG event was really great and I would encourage anybody to participate in the next one. The sessions were really interesting, with a high technical level. Being busy tomorrow, I will unfortunately not be able to participate in day 2. Thanks to the POUG staff team for organizing this event! Well done!

The post POUG 2020 Workshop Day 1 appeared first on the dbi services Blog.

Oracle ACFS: “du” vs. “df” and “acfsutil info”


By Franck Pachot

.
This is a demo about Oracle ACFS snapshots, and how to understand the used and free space, as displayed by "df", when there are modifications in the base parent or the snapshot children. The important concept to understand is that, when you take a snapshot, any modification to the child or to the parent will allocate new extents, and this space is accounted as snapshot space.


[grid@cloud ~]$ asmcmd lsdg DATAC1

State    Type  Rebal  Sector  Logical_Sector  Block       AU   Total_MB    Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  HIGH  N         512             512   4096  4194304  214991104  191881152         11943936        59979016              0             Y  DATAC1/

On a database machine in the Oracle Cloud I have a diskgroup with a lot of free space. I'll use this DATAC1 diskgroup to store my ACFS filesystem. The size in megabytes is not easy to read.
I can get a friendlier overview from acfsutil with human-readable sizes (in terabytes here).


[grid@cloud ~]$ acfsutil info storage -u TB -l DATAC1

Diskgroup: DATAC1 (83% free)
  total disk space:         205.03
  ASM file space:            22.04
  total free space:         182.99
  net free with mirroring:   60.99
  usable after reservation:  57.20
  redundancy type:          HIGH

    Total space used by ASM non-volume files:
      used:                     19.03
      mirror used:               6.34

    volume: /dev/asm/dump-19
      total:                     1.00
      free:                      0.22
      redundancy type:         high
      file system:             /u03/dump
----
unit of measurement: TB

There's already a volume here (/dev/asm/dump-19) which is named 'DUMP' and mounted as an ACFS filesystem (/u03/dump), but I'll create a new one for this demo and remove it afterwards.

Logical volume: asmcmd volcreate

I have a diskgroup which exposes disk space to ASM. The first thing to do, in order to build a filesystem on it, is to create a logical volume, which is called ADVM for “ASM Dynamic Volume Manager”.


[grid@cloud ~]$ asmcmd volcreate -G DATAC1 -s 100G MYADVM

I'm creating a new 100GB volume, identified by its name (MYADVM), within the diskgroup (DATAC1). The output is not verbose at all, but I can check it with volinfo.


[grid@cloud ~]$ asmcmd volinfo -G DATAC1 -a

Diskgroup Name: DATAC1

         Volume Name: DUMP
         Volume Device: /dev/asm/dump-19
         State: ENABLED
         Size (MB): 1048576
         Resize Unit (MB): 512
         Redundancy: HIGH
         Stripe Columns: 8
         Stripe Width (K): 1024
         Usage: ACFS
         Mountpath: /u03/dump

         Volume Name: MYADVM
         Volume Device: /dev/asm/myadvm-19
         State: ENABLED
         Size (MB): 102400
         Resize Unit (MB): 512
         Redundancy: HIGH
         Stripe Columns: 8
         Stripe Width (K): 1024
         Usage:
         Mountpath:

With “-a” I show all volumes on the diskgroup, but I could have mentioned the volume name. This is what we will do later.

“Usage” is empty because there’s no filesystem created yet, and “Mountpath” is empty because it is not mounted. The information I need is the Volume Device (/dev/asm/myadvm-19) as I’ll have access to it from the OS.


[grid@cloud ~]$ lsblk -a /dev/asm/myadvm-19

NAME          MAJ:MIN  RM  SIZE RO TYPE MOUNTPOINT
asm!myadvm-19 248:9731  0  100G  0 disk

[grid@cloud ~]$ lsblk -t /dev/asm/myadvm-19

NAME          ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA
asm!myadvm-19         0    512      0     512     512    1 deadline     128  128

[grid@cloud ~]$ fdisk -l /dev/asm/myadvm-19

Disk /dev/asm/myadvm-19: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Visible from the OS, I have a 100GB volume with nothing there: I need to format it.

Filesystem: Linux mkfs


[grid@cloud ~]$ grep -v nodev /proc/filesystems

        iso9660
        ext4
        fuseblk
        acfs

ACFS is a filesystem supported by the kernel. I can use “mkfs” to format the volume to ACFS.


[grid@cloud ~]$ mkfs.acfs -f /dev/asm/myadvm-19

mkfs.acfs: version                   = 18.0.0.0.0
mkfs.acfs: on-disk version           = 46.0
mkfs.acfs: volume                    = /dev/asm/myadvm-19
mkfs.acfs: volume size               = 107374182400  ( 100.00 GB )
mkfs.acfs: Format complete.

I used “-f” because my volume was already formatted from a previous test. I have now a 100GB filesystem on this volume.


[grid@cloud ~]$ asmcmd volinfo -G DATAC1 MYADVM

Diskgroup Name: DATAC1

         Volume Name: MYADVM
         Volume Device: /dev/asm/myadvm-19
         State: ENABLED
         Size (MB): 102400
         Resize Unit (MB): 512
         Redundancy: HIGH
         Stripe Columns: 8
         Stripe Width (K): 1024
         Usage: ACFS
         Mountpath:

The “Usage” shows that ASM knows that the ADVM volume holds an ACFS filesystem.


[opc@cloud ~]$ sudo mkdir -p /myacfs

I’ll mount this filesystem to /myacfs


[opc@cloud ~]$ sudo mount -t acfs /dev/asm/myadvm-19 /myacfs

[opc@cloud ~]$ mount | grep /myacfs

/dev/asm/myadvm-19 on /myacfs type acfs (rw)

[opc@cloud ~]$ df -Th /myacfs

Filesystem         Type  Size  Used Avail Use% Mounted on
/dev/asm/myadvm-19 acfs  100G  448M  100G   1% /myacfs

[opc@cloud ~]$ ls -alrt /myacfs

total 100
drwxr-xr-x 30 root root  4096 Jun 23 10:40 ..
drwxr-xr-x  4 root root 32768 Jun 23 15:48 .
drwx------  2 root root 65536 Jun 23 15:48 lost+found

[opc@cloud ~]$ sudo umount /myacfs

I wanted to show that you can mount it from the OS but, as we have Grid Infrastructure, we usually want to manage it as a cluster resource.


[opc@cloud ~]$ sudo acfsutil registry -a /dev/asm/myadvm-19 /myacfs -u grid

acfsutil registry: mount point /myacfs successfully added to Oracle Registry

[opc@cloud ~]$ ls -alrt /myacfs

total 100
drwxr-xr-x 30 root root      4096 Jun 23 10:40 ..
drwxr-xr-x  4 grid oinstall 32768 Jun 23 15:48 .
drwx------  2 root root     65536 Jun 23 15:48 lost+found

The difference here is that I’ve set the owner of the filesystem to “grid” as that’s the user I’ll use for the demo.
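Another option (assumed syntax here, I kept the ACFS registry for this demo) would be to register it as a cluster-managed filesystem resource with srvctl:


srvctl add filesystem -device /dev/asm/myadvm-19 -path /myacfs -user grid
srvctl start filesystem -device /dev/asm/myadvm-19
srvctl status filesystem -device /dev/asm/myadvm-19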

Used and free space


[grid@cloud ~]$ asmcmd volinfo -G DATAC1 MYADVM

Diskgroup Name: DATAC1

         Volume Name: MYADVM
         Volume Device: /dev/asm/myadvm-19
         State: ENABLED
         Size (MB): 102400
         Resize Unit (MB): 512
         Redundancy: HIGH
         Stripe Columns: 8
         Stripe Width (K): 1024
         Usage: ACFS
         Mountpath: /myacfs

ADVM has the information about the mount path, but from now on I'll interact with it using "acfsutil" for the ACFS features, or the standard Linux commands for filesystems.


[grid@cloud ~]$ du -ah /myacfs

64K     /myacfs/lost+found
96K     /myacfs

I have no files there: only 96 KB used.


[grid@cloud ~]$ df -Th /myacfs

Filesystem         Type  Size  Used Avail Use% Mounted on
/dev/asm/myadvm-19 acfs  100G  688M  100G   1% /myacfs

The whole size of 100GB is available.


[grid@cloud ~]$ acfsutil info storage -u GB -l DATAC1

Diskgroup: DATAC1 (83% free)
  total disk space:         209952.25
  ASM file space:           22864.75
  total free space:         187083.87
  net free with mirroring:  62361.29
  usable after reservation: 58473.23
  redundancy type:          HIGH

    Total space used by ASM non-volume files:
      used:                    19488.04
      mirror used:             6496.01

    volume: /dev/asm/myadvm-19
      total:                   100.00
      free:                     99.33
      redundancy type:         high
      file system:             /myacfs
...
----
unit of measurement: GB

"acfsutil info storage" shows all volumes in the diskgroup (I removed the output for the 'DUMP' one here)


[grid@cloud ~]$ acfsutil info fs /myacfs
/myacfs
    ACFS Version: 18.0.0.0.0
    on-disk version:       47.0
    compatible.advm:       18.0.0.0.0
    ACFS compatibility:    18.0.0.0.0
    flags:        MountPoint,Available
    mount time:   Tue Jun 23 15:52:35 2020
    mount sequence number: 8
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   107374182400  ( 100.00 GB )
    total free:   106652823552  (  99.33 GB )
    file entry table allocation: 393216
    primary volume: /dev/asm/myadvm-19
        label:
        state:                 Available
        major, minor:          248, 9731
        logical sector size:   512
        size:                  107374182400  ( 100.00 GB )
        free:                  106652823552  (  99.33 GB )
        metadata read I/O count:         1203
        metadata write I/O count:        10
        total metadata bytes read:       4927488  (   4.70 MB )
        total metadata bytes written:    40960  (  40.00 KB )
        ADVM diskgroup:        DATAC1
        ADVM resize increment: 536870912
        ADVM redundancy:       high
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  0
    snapshot space usage: 0  ( 0.00 )
    replication status: DISABLED
    compression status: DISABLED

“acfsutil info fs” is the best way to have all information and here it shows the same as what we see from the OS: size=100GB and free=99.33GB

Add a file


[grid@cloud ~]$ dd of=/myacfs/file.tmp if=/dev/zero bs=1G count=42

42+0 records in
42+0 records out
45097156608 bytes (45 GB) copied, 157.281 s, 287 MB/s

I have created a 42 GB file in my filesystem. Now I will compare the size information.


[grid@cloud ~]$ du -ah /myacfs

64K     /myacfs/lost+found
43G     /myacfs/file.tmp
43G     /myacfs

I see one additional file with 43GB


[grid@cloud ~]$ df -Th /myacfs

Filesystem         Type  Size  Used Avail Use% Mounted on
/dev/asm/myadvm-19 acfs  100G   43G   58G  43% /myacfs

The used space that was 688 MB is now 43GB, the free space which was 100 GB is now 58 GB and the 1% usage is now 43% – this is the exact math (rounded to next GB) as my file had to allocate extents for all its data.


[grid@cloud ~]$ acfsutil info storage -u GB -l DATAC1

Diskgroup: DATAC1 (83% free)
  total disk space:         209952.25
  ASM file space:           22864.75
  total free space:         187083.87
  net free with mirroring:  62361.29
  usable after reservation: 58473.23
  redundancy type:          HIGH

    Total space used by ASM non-volume files:
      used:                    19488.04
      mirror used:             6496.01

    volume: /dev/asm/myadvm-19
      total:                   100.00
      free:                     57.29
      redundancy type:         high
      file system:             /myacfs

Here the "free:" that was 99.33 GB before (total free: 106652823552) is now 57.29 GB, which matches what we see from df.


[grid@cloud ~]$ acfsutil info fs /myacfs

/myacfs
    ACFS Version: 18.0.0.0.0
    on-disk version:       47.0
    compatible.advm:       18.0.0.0.0
    ACFS compatibility:    18.0.0.0.0
    flags:        MountPoint,Available
    mount time:   Tue Jun 23 15:52:35 2020
    mount sequence number: 8
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   107374182400  ( 100.00 GB )
    total free:   61513723904  (  57.29 GB )
    file entry table allocation: 393216
    primary volume: /dev/asm/myadvm-19
        label:
        state:                 Available
        major, minor:          248, 9731
        logical sector size:   512
        size:                  107374182400  ( 100.00 GB )
        free:                  61513723904  (  57.29 GB )
        metadata read I/O count:         1686
        metadata write I/O count:        130
        total metadata bytes read:       8687616  (   8.29 MB )
        total metadata bytes written:    3555328  (   3.39 MB )
        ADVM diskgroup:        DATAC1
        ADVM resize increment: 536870912
        ADVM redundancy:       high
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  0
    snapshot space usage: 0  ( 0.00 )
    replication status: DISABLED
    compression status: DISABLED

What has changed here is only the “total free:” from 99.33 GB to 57.29 GB

Create a snapshot

For the moment, with no snapshot and no compression, the size allocated in the filesystem (as the “df” used) is the same as the size of the files (as the “du” file size).


[grid@cloud ~]$ acfsutil snap info /myacfs

    number of snapshots:  0
    snapshot space usage: 0  ( 0.00 )

The "acfsutil info fs" output shows no snapshots, and "acfsutil snap info" shows the same, but will give more details once I have created snapshots.


[grid@cloud ~]$ acfsutil snap create -w S1 /myacfs

acfsutil snap create: Snapshot operation is complete.

This has created a read-write snapshot of the filesystem.


[grid@cloud ~]$ du -ah /myacfs

64K     /myacfs/lost+found
43G     /myacfs/file.tmp
43G     /myacfs

Nothing is different here as this shows only the base. The snapshots are in a hidden .ACFS directory that is not listed but can be accessed when mentioning the path.


[grid@cloud ~]$ du -ah /myacfs/.ACFS

32K     /myacfs/.ACFS/.fileid
32K     /myacfs/.ACFS/repl
43G     /myacfs/.ACFS/snaps/S1/file.tmp
43G     /myacfs/.ACFS/snaps/S1
43G     /myacfs/.ACFS/snaps
43G     /myacfs/.ACFS

Here I see my 42 GB file in the snapshot. But, as I did no modifications there, it shares the same extents on disk. This means that even if I see two files of 42 GB there is only 42 GB allocated for those two virtual copies.


[grid@cloud ~]$ df -Th /myacfs

Filesystem         Type  Size  Used Avail Use% Mounted on

/dev/asm/myadvm-19 acfs  100G   43G   58G  43% /myacfs

As there are no additional extents allocated for this snapshot, “df” shows the same as before. Here we start to see a difference between the physical “df” size and the virtual “du” size.


[grid@cloud ~]$ acfsutil info storage -u GB -l DATAC1

Diskgroup: DATAC1 (83% free)
  total disk space:         209952.25
  ASM file space:           22864.75
  total free space:         187083.87
  net free with mirroring:  62361.29
  usable after reservation: 58473.23
  redundancy type:          HIGH

    Total space used by ASM non-volume files:
      used:                    19488.04
      mirror used:             6496.01

    volume: /dev/asm/myadvm-19
      total:                   100.00
      free:                     57.29
      redundancy type:         high
      file system:             /myacfs
        snapshot: S1 (/myacfs/.ACFS/snaps/S1)
          used:          0.00
          quota limit:   none
...
----
unit of measurement: GB

Here I have additional information for my snapshot: “snapshot: S1 (/myacfs/.ACFS/snaps/S1)” with 0 GB used because I did not modify anything in the snapshot yet. The volume still has the same free space.


[grid@cloud ~]$ acfsutil snap info /myacfs
    number of snapshots:  1
    snapshot space usage: 393216  ( 384.00 KB )

The new thing here is “number of snapshots: 1” with nearly no space usage (384 KB)


[grid@cloud ~]$ acfsutil info fs /myacfs
/myacfs
    ACFS Version: 18.0.0.0.0
    on-disk version:       47.0
    compatible.advm:       18.0.0.0.0
    ACFS compatibility:    18.0.0.0.0
    flags:        MountPoint,Available
    mount time:   Tue Jun 23 15:52:35 2020
    mount sequence number: 8
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   107374182400  ( 100.00 GB )
    total free:   61513723904  (  57.29 GB )
    file entry table allocation: 393216
    primary volume: /dev/asm/myadvm-19
        label:
        state:                 Available
        major, minor:          248, 9731
        logical sector size:   512
        size:                  107374182400  ( 100.00 GB )
        free:                  61513723904  (  57.29 GB )
        metadata read I/O count:         4492
        metadata write I/O count:        236
        total metadata bytes read:       24363008  (  23.23 MB )
        total metadata bytes written:    12165120  (  11.60 MB )
        ADVM diskgroup:        DATAC1
        ADVM resize increment: 536870912
        ADVM redundancy:       high
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  1
    snapshot space usage: 393216  ( 384.00 KB )
    replication status: DISABLED
    compression status: DISABLED

Here I have more detail: those 384 KB are for the file allocation table only.

Write on snapshot

My snapshot shows no additional space used, but that will not last: the goal is to do some modifications within the same file, like a database would do on its datafiles as soon as it is opened read-write.


[grid@cloud ~]$ dd of=/myacfs/.ACFS/snaps/S1/file.tmp if=/dev/zero bs=1G count=10 conv=notrunc

10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 58.2101 s, 184 MB/s

I've overwritten 10GB within the 42 GB file but the remaining 32 GB are still the same (that's what conv=notrunc does: it does not truncate the end of the file).


[grid@cloud ~]$ du -ah /myacfs

64K     /myacfs/lost+found
43G     /myacfs/file.tmp
43G     /myacfs

Nothing has changed at the base level because I modified only the file in the snapshot.



[grid@cloud ~]$ du -ah /myacfs/.ACFS

32K     /myacfs/.ACFS/.fileid
32K     /myacfs/.ACFS/repl
43G     /myacfs/.ACFS/snaps/S1/file.tmp
43G     /myacfs/.ACFS/snaps/S1
43G     /myacfs/.ACFS/snaps
43G     /myacfs/.ACFS

Nothing has changed either in the snapshot because the file is still the same size.


[grid@cloud ~]$ df -Th /myacfs

Filesystem         Type  Size  Used Avail Use% Mounted on
/dev/asm/myadvm-19 acfs  100G   53G   48G  53% /myacfs

The filesystem usage has increased by the amount I modified in the snapshot: I had 58GB available and now only 48GB, because the 10GB of extents that were shared between the snapshot child and the base (the snapshot parent) are not the same anymore, and new extents had to be allocated for the snapshot. This is similar to copy-on-write.

That’s the main point of this blog post: you don’t see this new allocation with “ls” or “du” but only with “df” or the filesystem specific tools.


[grid@cloud ~]$ acfsutil info storage -u GB -l DATAC1

Diskgroup: DATAC1 (83% free)
  total disk space:         209952.25
  ASM file space:           22864.75
  total free space:         187083.87
  net free with mirroring:  62361.29
  usable after reservation: 58473.23
  redundancy type:          HIGH

    Total space used by ASM non-volume files:
      used:                    19488.04
      mirror used:             6496.01

    volume: /dev/asm/myadvm-19
      total:                   100.00
      free:                     47.09
      redundancy type:         high
      file system:             /myacfs
        snapshot: S1 (/myacfs/.ACFS/snaps/S1)
          used:         10.10
          quota limit:   none

----
unit of measurement: GB

The volume free space has decreased from 57.29 to 47.09 GB.


[grid@cloud ~]$ acfsutil snap info /myacfs
snapshot name:               S1
snapshot location:           /myacfs/.ACFS/snaps/S1
RO snapshot or RW snapshot:  RW
parent name:                 /myacfs
snapshot creation time:      Fri Jul  3 08:01:52 2020
file entry table allocation: 393216   ( 384.00 KB )
storage added to snapshot:   10846875648   (  10.10 GB )

    number of snapshots:  1
    snapshot space usage: 10846875648  (  10.10 GB )

10 GB has been added to the snapshot for the extents that differ from the parent.


[grid@cloud ~]$ acfsutil info fs /myacfs

/myacfs
    ACFS Version: 18.0.0.0.0
    on-disk version:       47.0
    compatible.advm:       18.0.0.0.0
    ACFS compatibility:    18.0.0.0.0
    flags:        MountPoint,Available
    mount time:   Fri Jul  3 07:33:19 2020
    mount sequence number: 6
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   107374182400  ( 100.00 GB )
    total free:   50566590464  (  47.09 GB )
    file entry table allocation: 393216
    primary volume: /dev/asm/myadvm-19
        label:
        state:                 Available
        major, minor:          248, 9731
        logical sector size:   512
        size:                  107374182400  ( 100.00 GB )
        free:                  50566590464  (  47.09 GB )
        metadata read I/O count:         9447
        metadata write I/O count:        1030
        total metadata bytes read:       82587648  (  78.76 MB )
        total metadata bytes written:    89440256  (  85.30 MB )
        ADVM diskgroup:        DATAC1
        ADVM resize increment: 536870912
        ADVM redundancy:       high
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  1
    snapshot space usage: 10846875648  (  10.10 GB )
    replication status: DISABLED

The file entry table allocation has not changed. It was pointing to the parent for all extents before. Now, 10GB of extents are pointing to the snapshot child ones.

Write on parent

If I overwrite a different part on the parent, it will need to allocate new extents as those extents are shared with the snapshot.


[grid@cloud ~]$ dd of=/myacfs/file.tmp if=/dev/zero bs=1G count=10 conv=notrunc seek=20

10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 35.938 s, 299 MB/s

This has written 10GB again, but at the 20GB offset, not overlapping the range of extents modified in the snapshot.


[grid@cloud ~]$ df -Th /myacfs
Filesystem         Type  Size  Used Avail Use% Mounted on
/dev/asm/myadvm-19 acfs  100G   63G   38G  63% /myacfs

Obviously 10GB had to be allocated for those new extents.


[grid@cloud ~]$ acfsutil snap info /myacfs
snapshot name:               S1
snapshot location:           /myacfs/.ACFS/snaps/S1
RO snapshot or RW snapshot:  RW
parent name:                 /myacfs
snapshot creation time:      Fri Jul  3 09:17:10 2020
file entry table allocation: 393216   ( 384.00 KB )
storage added to snapshot:   10846875648   (  10.10 GB )


    number of snapshots:  1
    snapshot space usage: 21718523904  (  20.23 GB )

Those 10GB modified on the base are allocated for the snapshot, because what was a pointer to the parent is now a copy of extents in the state they were before the write.

I’m not going into details of copy-on-write or redirect-on-write. You can see that at file level with “acfsutil info file” and Ludovico Caldara made a great demo of it a few years ago:

Now what happens if I modify, in the parent, the same extents that have been modified in the snapshot? I don’t need a copy of the previous version of them, right?


[grid@cloud ~]$ dd of=/myacfs/file.tmp if=/dev/zero bs=1G count=10 conv=notrunc

10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 34.9208 s, 307 MB/s

The first 10GB of this file, since the snapshot was taken, have now been modified in the base and in all snapshots (only one here).


[grid@cloud ~]$ df -Th /myacfs
Filesystem         Type  Size  Used Avail Use% Mounted on
/dev/asm/myadvm-19 acfs  100G   74G   27G  74% /myacfs

I had 10GB allocated again for this.


[grid@cloud ~]$ acfsutil snap info /myacfs
snapshot name:               S1
snapshot location:           /myacfs/.ACFS/snaps/S1
RO snapshot or RW snapshot:  RW
parent name:                 /myacfs
snapshot creation time:      Fri Jul  3 09:17:10 2020
file entry table allocation: 393216   ( 384.00 KB )
storage added to snapshot:   10846875648   (  10.10 GB )

    number of snapshots:  1
    snapshot space usage: 32564994048  (  30.33 GB )

[grid@cloud ~]$ acfsutil info file /myacfs/.ACFS/snaps/S1/file.tmp -u |
                awk '/(Yes|No)$/{ if ( $7!=l ) o=$1 ; s[o]=s[o]+$2 ; i[o]=$7 ; l=$7} END{ for (o in s) printf "offset: %4dGB size: %4dGB Inherited: %3s\n",o/1024/1024/1024,s[o]/1024/1024/1024,i[o]}' | sort
offset:    0GB size:   10GB Inherited:  No
offset:   10GB size:   31GB Inherited: Yes

I have aggregated, with awk, the ranges of extents of the snapshot's file by whether they are inherited from the parent. The first 10GB are those that I modified in the snapshot. The others, above 10GB, are still inherited from the parent.


[grid@cloud ~]$ acfsutil info file /myacfs/file.tmp -u | 
                awk '/(Yes|No)$/{ if ( $7!=l ) o=$1 ; s[o]=s[o]+$2 ; i[o]=$7 ; l=$7} END{ for (o in s) printf "offset: %4dGB size: %4dGB Inherited: %3s %s\n",o/1024/1024/1024,s[o]/1024/1024/1024,i[o],x[o]}' | sort
offset:    0GB size:   10GB Inherited:  No
offset:   10GB size:    9GB Inherited: Yes
offset:   19GB size:   10GB Inherited:  No
offset:   30GB size:   11GB Inherited: Yes

Here, on the file in the base filesystem, the two ranges that I modified show up as not inherited: 10GB from offset 0GB and 10GB from offset 20GB. The rest is still shared between parent and child.

That’s the important point: when you take a snapshot, the modifications on the parent, as well as the modifications in the snapshot, will allocate new extents to add to the snapshot.

Why is this important?

This explains why, on ODA, all non-CDB databases are created on a snapshot that has been taken when the filesystem was empty.

Read-write snapshots are really cool, especially with multitenant, as they are automated with CREATE PLUGGABLE DATABASE … SNAPSHOT COPY. But you must keep in mind that the snapshot is made at filesystem level (this may change in 20c with File-Based Snapshots). Any change to the master will allocate space, whether it is used by the snapshot copy or not.
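
As a reminder, here is a minimal sketch of this thin clone creation on ACFS (hypothetical PDB names; depending on the version and the local undo mode, the source PDB may have to be opened read-only):


-- snapshot copy clone: the PDB2 datafiles are thin, snapshot-backed copies of PDB1
create pluggable database PDB2 from PDB1 snapshot copy;
alter pluggable database PDB2 open;
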

I blogged about that in the past: https://blog.dbi-services.com/multitenant-thin-provisioning-pdb-snapshots-on-acfs/

Here I just wanted to clarify what you see with “ls” and “du” vs. “df” or “acfsutil info”. “ls” and “du” show the virtual size of the files. “df” shows the extents allocated in the filesystem as base extents or snapshot copies. “acfsutil info” shows those extents allocated as “storage added to snapshot” whether they were allocated for modifications on the parent or child.


[grid@cloud ~]$ acfsutil info file /myacfs/.ACFS/snaps/S1/file.tmp -u
/myacfs/.ACFS/snaps/S1/file.tmp

    flags:        File
    inode:        18014398509482026
    owner:        grid
    group:        oinstall
    size:         45097156608  (  42.00 GB )
    allocated:    45105545216  (  42.01 GB )
    hardlinks:    1
    device index: 1
    major, minor: 248,9731
    create time:  Fri Jul  3 07:36:45 2020
    access time:  Fri Jul  3 07:36:45 2020
    modify time:  Fri Jul  3 09:18:25 2020
    change time:  Fri Jul  3 09:18:25 2020
    extents:
        -----offset ----length | -dev --------offset | inherited
                  0  109051904 |    1    67511517184 | No
          109051904  134217728 |    1    67620569088 | No
          243269632  134217728 |    1    67754786816 | No
          377487360  134217728 |    1    67889004544 | No
          511705088  134217728 |    1    68023222272 | No
          645922816  134217728 |    1    68157440000 | No
          780140544  134217728 |    1    68291657728 | No
          914358272  134217728 |    1    68425875456 | No
         1048576000  134217728 |    1    68560093184 | No
         1182793728  134217728 |    1    68694310912 | No
...

The difference between “du” and “df”, which is also the “storage added to snapshot” displayed by “acfsutil snap info”, is what you see as “inherited”=No in “acfsutil info file -u” on all files, parent and child. Note that I didn’t use compression here, which would be another possible reason for a difference between “du” and “df”.

Cet article Oracle ACFS: “du” vs. “df” and “acfsutil info” est apparu en premier sur Blog dbi services.

The myth of NoSQL (vs. RDBMS) “joins dont scale”

$
0
0

By Franck Pachot

.
I’ll reference Alex DeBrie article “SQL, NoSQL, and Scale: How DynamoDB scales where relational databases don’t“, especially the paragraph about “Why relational databases don’t scale”. But I want to make clear that my post here is not against this article, but against a very common myth that even precedes NoSQL databases. Actually, I’m taking this article as reference because the author, in his website and book, has really good points about data modeling in NoSQL. And because AWS DynamoDB is probably the strongest NoSQL database today. I’m challenging some widespread assertions and better do it based on good quality content and product.

There are many use cases where NoSQL is a correct alternative, but moving from a relational database system to a key-value store because you heard that “joins don’t scale” is probably not a good reason. You should choose a solution for the features it offers (the problem it solves), and not because you are unaware of what your current platform is capable of.

time complexity of joins

The idea in this article, taken from the popular Rick Houlihan talk, is that, by joining tables, you read more data. And that it is a CPU intensive operation. And the time complexity of joins is “( O (M + N) ) or worse” according to Alex DeBrie’s article or “O(log(N))+Nlog(M)” according to Rick Houlihan’s slide. This makes reference to “time complexity” and “Big O” notation. You can read about it. But this supposes that the cost of a join depends on the size of the tables involved (represented by N and M). That would be right with non-partitioned table full scans. But all relational databases come with B*Tree indexes. And when we compare to a key-value store we are obviously retrieving few rows, and the join method will be a Nested Loop with Index Range Scan. This access method is actually not dependent at all on the size of the tables. That’s why relational databases are the kings of OLTP applications.

Let’s test it

The article claims that “there’s one big problem with relational databases: performance is like a black box. There are a ton of factors that impact how quickly your queries will return.” with an example on a query like: “SELECT * FROM orders JOIN users ON … WHERE user.id = … GROUP BY … LIMIT 20”. It says that “As the size of your tables grow, these operations will get slower and slower.”

I will build those tables, in PostgreSQL here, because that’s my preferred Open Source RDBMS, and show that:

  • Performance is not a black box: all RDBMS have an EXPLAIN command that displays exactly the algorithm used (even CPU and memory access) and you can estimate the cost easily from it.
  • Join and Group By here do not depend on the size of the tables at all but only on the rows selected. You can multiply your tables by several orders of magnitude before the number of rows impacts the response time by even a tenth of a millisecond.

Create the tables

I will create small tables here. Because when reading the execution plan, I don’t need large tables to estimate the cost and the response-time scalability. You can run the same with larger tables if you want to see how it scales. And then maybe investigate further the features for big tables (partitioning, parallel query,…). But let’s start with very simple tables without specific optimization.


\timing
create table USERS ( USER_ID bigserial primary key, FIRST_NAME text, LAST_NAME text)
;
CREATE TABLE
create table ORDERS ( ORDER_ID bigserial primary key, ORDER_DATE timestamp, AMOUNT numeric(10,2), USER_ID bigint, DESCRIPTION text)
;

I have created the two tables that I’ll join. Both with an auto-generated primary key for simplicity.

Insert test data

It is important to build the table as they would be in real life. USERS probably come without specific order. ORDERS come by date. This means that ORDERS from one USER are scattered throughout the table. The query would be much faster with clustered data, and each RDBMS has some ways to achieve this, but I want to show the cost of joining two tables here without specific optimisation.
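
For reference only (I will not use it in this test), here is a minimal sketch of one way to physically cluster the data in PostgreSQL, assuming the ORDERS_BY_USER index that I create later in this post. CLUSTER rewrites the table in index order, takes an exclusive lock, and is not maintained automatically afterwards:


cluster ORDERS using ORDERS_BY_USER;
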


insert into  USERS (FIRST_NAME, LAST_NAME)
 with
  random_words as (

  select generate_series id,
   translate(md5(random()::text),'-0123456789','aeioughij') as word

  from generate_series(1,100)

  )
 select
  words1.word ,words2.word

  from
   random_words words1
  cross join
   random_words words2
  order by words1.id+words2.id

;
INSERT 0 10000
Time: 28.641 ms

select * from USERS order by user_id fetch first 10 rows only;
 user_id |           first_name           |           last_name
---------+--------------------------------+--------------------------------
       1 | iooifgiicuiejiaeduciuccuiogib  | iooifgiicuiejiaeduciuccuiogib
       2 | iooifgiicuiejiaeduciuccuiogib  | dgdfeeiejcohfhjgcoigiedeaubjbg
       3 | dgdfeeiejcohfhjgcoigiedeaubjbg | iooifgiicuiejiaeduciuccuiogib
       4 | ueuedijchudifefoedbuojuoaudec  | iooifgiicuiejiaeduciuccuiogib
       5 | iooifgiicuiejiaeduciuccuiogib  | ueuedijchudifefoedbuojuoaudec
       6 | dgdfeeiejcohfhjgcoigiedeaubjbg | dgdfeeiejcohfhjgcoigiedeaubjbg
       7 | iooifgiicuiejiaeduciuccuiogib  | jbjubjcidcgugubecfeejidhoigdob
       8 | jbjubjcidcgugubecfeejidhoigdob | iooifgiicuiejiaeduciuccuiogib
       9 | ueuedijchudifefoedbuojuoaudec  | dgdfeeiejcohfhjgcoigiedeaubjbg
      10 | dgdfeeiejcohfhjgcoigiedeaubjbg | ueuedijchudifefoedbuojuoaudec
(10 rows)

Time: 0.384 ms

This generated 10000 users in the USERS table with random names.


insert into ORDERS ( ORDER_DATE, AMOUNT, USER_ID, DESCRIPTION)
 with
  random_amounts as (
      ------------> this generates 10 random order amounts
  select
   1e6*random() as AMOUNT
  from generate_series(1,10)
 )
 ,random_dates as (
      ------------> this generates 10 random order dates
  select
   now() - random() * interval '1 year' as ORDER_DATE
  from generate_series(1,10)
 )
select ORDER_DATE, AMOUNT, USER_ID, md5(random()::text) DESCRIPTION from
  random_dates
 cross join
  random_amounts
 cross join
  users
 order by ORDER_DATE
      ------------> I sort this by date because that's what happens in real life. Clustering by users would not be a fair test.
;
INSERT 0 1000000
Time: 4585.767 ms (00:04.586)

I have now one million orders, generated as 100 orders for each user in the last year. You can play with the numbers to generate more and see how it scales. This is fast: 5 seconds to generate 1 million orders here, and there are many ways to load data faster if you feel the need to test on very large data sets. But the point is not there.


create index ORDERS_BY_USER on ORDERS(USER_ID);
CREATE INDEX
Time: 316.322 ms
alter table ORDERS add constraint ORDERS_BY_USER foreign key (USER_ID) references USERS;
ALTER TABLE
Time: 194.505 ms

As, in my data model, I want to navigate from USERS to ORDERS, I add an index for it. And I declare the referential integrity to avoid logical corruption in case there is a bug in my application, or a human mistake during some ad-hoc fixes. As soon as the constraint is declared, I have no need to check this assertion in my code anymore. The constraint also lets the query planner know about this one-to-many relationship, which may open some optimizations. That’s the beauty of the relational database: we declare those things rather than having to code their implementation.


vacuum analyze;
VACUUM
Time: 429.165 ms

select version();
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)

I’m in PostgreSQL, version 12, on a Linux box with only 2 CPUs. I’ve run a manual VACUUM to get a reproducible test case rather than relying on autovacuum kicking in after loading those tables.


select relkind,relname,reltuples,relpages,lpad(pg_size_pretty(relpages::bigint*8*1024),20) "relpages * 8k" from pg_class natural join (select oid relnamespace,nspname from pg_namespace) nsp where nspname='public' order by relpages;

 relkind |       relname       | reltuples | relpages |    relpages * 8k
---------+---------------------+-----------+----------+----------------------
 S       | orders_order_id_seq |         1 |        1 |           8192 bytes
 S       | users_user_id_seq   |         1 |        1 |           8192 bytes
 i       | users_pkey          |     10000 |       30 |               240 kB
 r       | users               |     10000 |      122 |               976 kB
 i       | orders_pkey         |     1e+06 |     2745 |                21 MB
 i       | orders_by_user      |     1e+06 |     2749 |                21 MB
 r       | orders              |     1e+06 |    13334 |               104 MB

Here are the sizes of my tables: the 1 million orders take 104MB and the 10 thousand users take about 1MB. I’ll increase this later, but I don’t need it to understand the performance predictability.

When understanding the size, remember that the size of the table is only the size of data. Metadata (like column names) are not repeated for each row. They are stored once in the RDBMS catalog describing the table. As PostgreSQL has a native JSON datatype you can also test with a non-relational model here.
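
For comparison, here is a minimal sketch of what a document-style variant could look like (hypothetical table and index names, not used in the tests below), with one JSONB document per order and an expression index to find a user’s orders:


create table ORDERS_DOC ( ORDER_ID bigserial primary key, DOC jsonb );
create index ORDERS_DOC_BY_USER on ORDERS_DOC ( ((DOC->>'user_id')::bigint) );
-- example lookup: select DOC from ORDERS_DOC where (DOC->>'user_id')::bigint = 42;
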

Execute the query


select order_month,sum(amount),count(*)
from
  (
  select date_trunc('month',ORDER_DATE) order_month, USER_ID, AMOUNT
  from ORDERS join USERS using(USER_ID)
  ) user_orders
where USER_ID=42
group by order_month
;

     order_month     |     sum     | count
---------------------+-------------+-------
 2019-11-01 00:00:00 |  5013943.57 |    10
 2019-09-01 00:00:00 |  5013943.57 |    10
 2020-04-01 00:00:00 |  5013943.57 |    10
 2019-08-01 00:00:00 | 15041830.71 |    30
 2020-02-01 00:00:00 | 10027887.14 |    20
 2020-06-01 00:00:00 |  5013943.57 |    10
 2019-07-01 00:00:00 |  5013943.57 |    10
(7 rows)

Time: 2.863 ms

Here is the query mentioned in the article: get all orders for one user, and do some aggregation on them. Looking at the time is not really interesting as it depends on the RAM to cache the buffers at database or filesystem level, and disk latency. But we are developers and, by looking at the operations and loops that were executed, we can understand how it scales and where we are in the “time complexity” and “Big O” order of magnitude.

Explain the execution

This is as easy as running the query with EXPLAIN:


explain (analyze,verbose,costs,buffers)
select order_month,sum(amount)
from
  (
  select date_trunc('month',ORDER_DATE) order_month, USER_ID, AMOUNT
  from ORDERS join USERS using(USER_ID)
  ) user_orders
where USER_ID=42
group by order_month
;
                                                                   QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=389.34..390.24 rows=10 width=40) (actual time=0.112..0.136 rows=7 loops=1)
   Output: (date_trunc('month'::text, orders.order_date)), sum(orders.amount)
   Group Key: (date_trunc('month'::text, orders.order_date))
   Buffers: shared hit=19
   ->  Sort  (cost=389.34..389.59 rows=100 width=17) (actual time=0.104..0.110 rows=100 loops=1)
         Output: (date_trunc('month'::text, orders.order_date)), orders.amount
         Sort Key: (date_trunc('month'::text, orders.order_date))
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=19
         ->  Nested Loop  (cost=5.49..386.02 rows=100 width=17) (actual time=0.022..0.086 rows=100 loops=1)
               Output: date_trunc('month'::text, orders.order_date), orders.amount
               Buffers: shared hit=19
               ->  Index Only Scan using users_pkey on public.users  (cost=0.29..4.30 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1)
                     Output: users.user_id
                     Index Cond: (users.user_id = 42)
                     Heap Fetches: 0
                     Buffers: shared hit=3
               ->  Bitmap Heap Scan on public.orders  (cost=5.20..380.47 rows=100 width=25) (actual time=0.012..0.031 rows=100 loops=1)
                     Output: orders.order_id, orders.order_date, orders.amount, orders.user_id, orders.description
                     Recheck Cond: (orders.user_id = 42)
                     Heap Blocks: exact=13
                     Buffers: shared hit=16
                     ->  Bitmap Index Scan on orders_by_user  (cost=0.00..5.17 rows=100 width=0) (actual time=0.008..0.008 rows=100 loops=1)
                           Index Cond: (orders.user_id = 42)
                           Buffers: shared hit=3
 Planning Time: 0.082 ms
 Execution Time: 0.161 ms

Here is the execution plan with the execution statistics. We have read 3 pages from USERS to get our USER_ID (Index Only Scan) and navigated, with a Nested Loop, to ORDERS to get the 100 orders for this user. This has read 3 pages from the index (Bitmap Index Scan) and 13 pages from the table. That’s a total of 19 pages. Those pages are 8k blocks.

4x initial size scales with same performance: 19 page reads

Let’s increase the size of the tables by a factor of 4. I change “generate_series(1,100)” to “generate_series(1,200)” in the USERS generation and run the same. That generates 40000 users and 4 million orders.


select relkind,relname,reltuples,relpages,lpad(pg_size_pretty(relpages::bigint*8*1024),20) "relpages * 8k" from pg_class natural join (select
oid relnamespace,nspname from pg_namespace) nsp where nspname='public' order by relpages;

 relkind |       relname       |  reltuples   | relpages |    relpages * 8k
---------+---------------------+--------------+----------+----------------------
 S       | orders_order_id_seq |            1 |        1 |           8192 bytes
 S       | users_user_id_seq   |            1 |        1 |           8192 bytes
 i       | users_pkey          |        40000 |      112 |               896 kB
 r       | users               |        40000 |      485 |              3880 kB
 i       | orders_pkey         |        4e+06 |    10969 |                86 MB
 i       | orders_by_user      |        4e+06 |    10985 |                86 MB
 r       | orders              | 3.999995e+06 |    52617 |               411 MB

explain (analyze,verbose,costs,buffers)
select order_month,sum(amount)
from
  (
  select date_trunc('month',ORDER_DATE) order_month, USER_ID, AMOUNT
  from ORDERS join USERS using(USER_ID)
  ) user_orders
where USER_ID=42
group by order_month
;
                                                                   QUERY PLAN                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=417.76..418.69 rows=10 width=40) (actual time=0.116..0.143 rows=9 loops=1)
   Output: (date_trunc('month'::text, orders.order_date)), sum(orders.amount)
   Group Key: (date_trunc('month'::text, orders.order_date))
   Buffers: shared hit=19
   ->  Sort  (cost=417.76..418.02 rows=104 width=16) (actual time=0.108..0.115 rows=100 loops=1)
         Output: (date_trunc('month'::text, orders.order_date)), orders.amount
         Sort Key: (date_trunc('month'::text, orders.order_date))
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=19
         ->  Nested Loop  (cost=5.53..414.27 rows=104 width=16) (actual time=0.044..0.091 rows=100 loops=1)
               Output: date_trunc('month'::text, orders.order_date), orders.amount
               Buffers: shared hit=19
               ->  Index Only Scan using users_pkey on public.users  (cost=0.29..4.31 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=1)
                     Output: users.user_id
                     Index Cond: (users.user_id = 42)
                     Heap Fetches: 0
                     Buffers: shared hit=3
               ->  Bitmap Heap Scan on public.orders  (cost=5.24..408.66 rows=104 width=24) (actual time=0.033..0.056 rows=100 loops=1)
                     Output: orders.order_id, orders.order_date, orders.amount, orders.user_id, orders.description
                     Recheck Cond: (orders.user_id = 42)
                     Heap Blocks: exact=13
                     Buffers: shared hit=16
                     ->  Bitmap Index Scan on orders_by_user  (cost=0.00..5.21 rows=104 width=0) (actual time=0.029..0.029 rows=100 loops=1)
                           Index Cond: (orders.user_id = 42)
                           Buffers: shared hit=3
 Planning Time: 0.084 ms
 Execution Time: 0.168 ms
(27 rows)

Time: 0.532 ms

Look how we scaled here: I multiplied the size of the tables by 4x and I have exactly the same cost: 3 pages from each index and 13 from the table. This is the beauty of B*Tree: the cost depends only on the height of the tree and not the size of the table. And you can increase the table exponentially before the height of the index increases.
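
If you want to check the index height yourself, a minimal sketch, assuming the pageinspect extension is available (it requires appropriate privileges), is to look at the “level” column of the B*Tree metapage:


create extension if not exists pageinspect;
select * from bt_metap('orders_by_user');  -- the "level" column is the tree depth
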

16x initial size scales with 21/19=1.1 cost factor

Let’s go further and multiply the size of the tables by 4 again. I run with “generate_series(1,400)” in the USERS generation and run the same. That generates 160 thousand users and 16 million orders.


select relkind,relname,reltuples,relpages,lpad(pg_size_pretty(relpages::bigint*8*1024),20) "relpages * 8k" from pg_class natural join (select
oid relnamespace,nspname from pg_namespace) nsp where nspname='public' order by relpages;
 relkind |       relname       |  reltuples   | relpages |    relpages * 8k
---------+---------------------+--------------+----------+----------------------
 S       | orders_order_id_seq |            1 |        1 |           8192 bytes
 S       | users_user_id_seq   |            1 |        1 |           8192 bytes
 i       | users_pkey          |       160000 |      442 |              3536 kB
 r       | users               |       160000 |     1925 |                15 MB
 i       | orders_pkey         |      1.6e+07 |    43871 |               343 MB
 i       | orders_by_user      |      1.6e+07 |    43935 |               343 MB
 r       | orders              | 1.600005e+07 |   213334 |              1667 MB

explain (analyze,verbose,costs,buffers)
select order_month,sum(amount)
from
  (
  select date_trunc('month',ORDER_DATE) order_month, USER_ID, AMOUNT
  from ORDERS join USERS using(USER_ID)
  ) user_orders
where USER_ID=42
group by order_month
;
                                                                   QUERY PLAN                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=567.81..569.01 rows=10 width=40) (actual time=0.095..0.120 rows=8 loops=1)
   Output: (date_trunc('month'::text, orders.order_date)), sum(orders.amount)
   Group Key: (date_trunc('month'::text, orders.order_date))
   Buffers: shared hit=21
   ->  Sort  (cost=567.81..568.16 rows=140 width=17) (actual time=0.087..0.093 rows=100 loops=1)
         Output: (date_trunc('month'::text, orders.order_date)), orders.amount
         Sort Key: (date_trunc('month'::text, orders.order_date))
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=21
         ->  Nested Loop  (cost=6.06..562.82 rows=140 width=17) (actual time=0.026..0.070 rows=100 loops=1)
               Output: date_trunc('month'::text, orders.order_date), orders.amount
               Buffers: shared hit=21
               ->  Index Only Scan using users_pkey on public.users  (cost=0.42..4.44 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)
                     Output: users.user_id
                     Index Cond: (users.user_id = 42)
                     Heap Fetches: 0
                     Buffers: shared hit=4
               ->  Bitmap Heap Scan on public.orders  (cost=5.64..556.64 rows=140 width=25) (actual time=0.014..0.035 rows=100 loops=1)
                     Output: orders.order_id, orders.order_date, orders.amount, orders.user_id, orders.description
                     Recheck Cond: (orders.user_id = 42)
                     Heap Blocks: exact=13
                     Buffers: shared hit=17
                     ->  Bitmap Index Scan on orders_by_user  (cost=0.00..5.61 rows=140 width=0) (actual time=0.010..0.010 rows=100 loops=1)
                           Index Cond: (orders.user_id = 42)
                           Buffers: shared hit=4
 Planning Time: 0.082 ms
 Execution Time: 0.159 ms

Yes, by increasing the table size a lot, the index may have to split the root block to add a new branch level. The consequence is minimal: two additional pages to read here. The additional execution time is in tens of microseconds when cached in RAM, up to milliseconds if read from a mechanical disk. With 10x or 100x larger tables, you will do some physical I/O and only the top index branches will be in memory. You can expect about 20 I/O calls then. With SSD on Fibre Channel (like with AWS RDS PostgreSQL for example), this will still be in single-digit milliseconds. The cheapest AWS EBS Provisioned IOPS for RDS is 1000 IOPS, so you can estimate the number of queries per second you can run. The number of pages here (“Buffers”) is the right metric to estimate the cost of the query, to be compared with DynamoDB RCU.
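
As a back-of-the-envelope estimate (my assumptions: about 20 buffer reads per query, all of them physical, and 1000 provisioned IOPS), that is roughly 50 such queries per second from a completely cold cache:


select 1000 / 20 as estimated_queries_per_second;
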

64x initial size scales with 23/19=1.2 cost factor

Last test for me, but you can go further: I change to “generate_series(1,800)” in the USERS generation and run the same. That generates 640 thousand users and 64 million orders.



select relkind,relname,reltuples,relpages,lpad(pg_size_pretty(relpages::bigint*8*1024),20) "relpages * 8k" from pg_class natural join (select
oid relnamespace,nspname from pg_namespace) nsp where nspname='public' order by relpages;

 relkind |       relname       |   reltuples   | relpages |    relpages * 8k
---------+---------------------+---------------+----------+----------------------
 S       | orders_order_id_seq |             1 |        1 |           8192 bytes
 S       | users_user_id_seq   |             1 |        1 |           8192 bytes
 i       | users_pkey          |        640000 |     1757 |                14 MB
 r       | users               |        640000 |     7739 |                60 MB
 i       | orders_pkey         |       6.4e+07 |   175482 |              1371 MB
 i       | orders_by_user      |       6.4e+07 |   175728 |              1373 MB
 r       | orders              | 6.4000048e+07 |   853334 |              6667 MB


explain (analyze,verbose,costs,buffers)
select order_month,sum(amount)
from
  (
  select date_trunc('month',ORDER_DATE) order_month, USER_ID, AMOUNT
  from ORDERS join USERS using(USER_ID)
  ) user_orders
where USER_ID=42
group by order_month
;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1227.18..1227.33 rows=10 width=40) (actual time=0.102..0.105 rows=7 loops=1)
   Output: (date_trunc('month'::text, orders.order_date)), sum(orders.amount)
   Group Key: date_trunc('month'::text, orders.order_date)
   Buffers: shared hit=23
   ->  Nested Loop  (cost=7.36..1225.65 rows=306 width=17) (actual time=0.027..0.072 rows=100 loops=1)
         Output: date_trunc('month'::text, orders.order_date), orders.amount
         Buffers: shared hit=23
         ->  Index Only Scan using users_pkey on public.users  (cost=0.42..4.44 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
               Output: users.user_id
               Index Cond: (users.user_id = 42)
               Heap Fetches: 0
               Buffers: shared hit=4
         ->  Bitmap Heap Scan on public.orders  (cost=6.94..1217.38 rows=306 width=25) (actual time=0.015..0.037 rows=100 loops=1)
               Output: orders.order_id, orders.order_date, orders.amount, orders.user_id, orders.description
               Recheck Cond: (orders.user_id = 42)
               Heap Blocks: exact=15
               Buffers: shared hit=19
               ->  Bitmap Index Scan on orders_by_user  (cost=0.00..6.86 rows=306 width=0) (actual time=0.011..0.011 rows=100 loops=1)
                     Index Cond: (orders.user_id = 42)
                     Buffers: shared hit=4
 Planning Time: 0.086 ms
 Execution Time: 0.134 ms
(22 rows)

Time: 0.521 ms

You see how the cost increases slowly, with two additional pages to read here.

Please test with more rows. When looking at the time (less than 1 millisecond), keep in mind that it is CPU time here, as all pages are in the memory cache. This is on purpose, because the article mentions CPU and RAM (“joins require a lot of CPU and memory”). Anyway, 20 disk reads will still be within a millisecond on modern storage.

Time complexity of B*Tree is actually very light

Let’s get back to the “Big O” notation. We are definitely not in “O (M + N)”. That would have been the case without any index, where all pages from all tables would have been full scanned. Any database maintains hashed or sorted structures to improve high selectivity queries. This has nothing to do with SQL vs. NoSQL or with RDBMS vs. hierarchical DB. In DynamoDB this structure on the primary key is HASH partitioning (and additional sorting). In RDBMS the most common structure is a sorted index (with the possibility of additional partitioning, zone maps, clustering,…). We are not in “O(log(N))+Nlog(M)” but more like “O(log(N))+log(M)” because the size of the driving table (N) has no influence on the inner loop cost: the first filter, the “O(log(N))” one, has already filtered those N rows down to a few. Using “N” and ignoring the outer selectivity is a mistake. Anyway, this O(log) for B*Tree is really scalable, as we have seen, because the base of this log() function is really large (thanks to large branch blocks – 8KB), which keeps the tree height very small.
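
To put rough numbers on it, here is a small sketch (my assumption: on the order of ~300 index entries per 8kB branch block) of how slowly the B*Tree height grows with the row count:


select n::bigint as num_rows, ceil(log(300::numeric, n)) as estimated_btree_height
from (values (1e6::numeric), (64e6::numeric), (1e9::numeric)) as t(n);
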

Anyway, the point is not there. As you have seen, the cost of Nested Loop is in the blocks that are fetched from the inner loop, and this depends on the selectivity and the clustering factor (the table physical order correlation with the index order), and some prefetching techniques. This O(log) for going through the index is often insignificant on Range Scans.

40 years of RDBMS optimizations

If you have a database size that is several orders of magnitude beyond those million rows, and where this millisecond latency is a problem, the major RDBMS offer many features and structures to go further in performance and predictability of response time. As we have seen, the cost of this query is not about the Index, the Nested Loop or the Group By. It is about the scattered items: ingested by date but queried by user.

Through its 40 years of optimizing OLTP applications, the Oracle Database has a lot of features, with their popularity depending on the IT trends. From the first versions, in the 80’s, because the hierarchical databases were still in people’s minds, and the fear of “joins are expensive” was already there, Oracle has implemented a structure to store the tables together, pre-joined. This is called CLUSTER. It was, from the beginning, indexed with a sorted structure (INDEX CLUSTER) or a hashed structure (HASH CLUSTER). Parallel Query was quickly implemented to scale up and, with Parallel Server now known as RAC, to scale out. Partitioning was introduced to scale further, again with sorted (PARTITION by RANGE) or hashed (PARTITION BY HASH) segments. This can further scale the index access (LOCAL INDEX with partition pruning). And it also scales joins (PARTITION-WISE JOIN) to reduce the CPU and RAM needed to join large datasets. When more agility was needed, B*Tree sorted structures were the winner and Index Organized Tables were introduced to cluster data (to avoid the most expensive operation as we have seen in the examples above). But Heap Tables and B*Tree are still the winners given their agility and because the Join operations were rarely a bottleneck. Currently, partitioning has improved (even scaling horizontally beyond the clusters with sharding), as well as data clustering (Attribute Clustering, Zone Maps and Storage Index). And there’s always the possibility for Index Only access, just by adding more columns to the index, removing the major cost of table access. And materialized views can also act as a pre-built join index and will also reduce the CPU and RAM required for GROUP BY aggregation.

I listed a lot of Oracle features, but Microsoft SQL Server also has many similar features: clustered tables, covering indexes, Indexed Views, and of course partitioning. Open source databases are evolving a lot. PostgreSQL has partitioning, parallel query, clustering,… And, still Open Source, you can web-scale it to a distributed database with YugabyteDB. MySQL has been evolving a lot recently. I’ll not list all databases and all features. You can port the test case to other databases easily.

Oh, and by the way, if you want to test it on Oracle Database you will quickly realize that the query from the article may not even do any Join operation. Because when you query with the USER_ID you probably don’t need to get back to the USERS table. And thanks to the declared foreign key, the optimizer knows that the join is not necessary: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=a5afeba2fdb27dec7533545ab0a6eb0e.
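
If you want to verify the join elimination yourself, here is a minimal sketch (Oracle syntax, assuming the same tables and columns recreated in Oracle as in the linked db&lt;&gt;fiddle); with the foreign key declared and USERS.USER_ID being the primary key, the expected plan shows no access to USERS at all:


explain plan for
select trunc(ORDER_DATE,'MM'), sum(AMOUNT)
from ORDERS join USERS using(USER_ID)
where USER_ID=42
group by trunc(ORDER_DATE,'MM');

select * from dbms_xplan.display();
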

NoSQL databases are good for many use-cases. But remember that it is a key-value optimized data store. When you start to split a document into multiple entities with their own keys, remember that relational databases were invented for that. I blogged about another myth recently: The myth of NoSQL (vs. RDBMS) agility: adding attributes. All feedback welcome, here or on Twitter:

Cet article The myth of NoSQL (vs. RDBMS) “joins dont scale” est apparu en premier sur Blog dbi services.

19c: scalable Top-N queries without further hints to the query planner

$
0
0

By Franck Pachot

.
The FETCH FIRST … ROWS ONLY syntax arrived in Oracle 12c and is much more convenient than wrapping a subquery with ‘ORDER BY’ inside a “WHERE ROWNUM < …”. But as I mentioned in a previous post, it required the FIRST_ROWS() hint to get correct estimations. In SQL you don’t want to overload your code for performance, right? The RDBMS optimizer does the job for you. This was a bug with this new FETCH FIRST syntax, and it is fixed in 19c (see Nigel Bayliss’ post about this):


SQL> select bugno,value,optimizer_feature_enable,description from  V$SYSTEM_FIX_CONTROL where bugno=22174392;

      BUGNO    VALUE    OPTIMIZER_FEATURE_ENABLE                                                      DESCRIPTION
___________ ________ ___________________________ ________________________________________________________________
   22174392        1 19.1.0                      first k row optimization for window function rownum predicate

You can use this query on database metadata to check that you have the fix enabled (VALUE=1 means that the fix is ON). Yes, Oracle Database is not open source, but a lot of information is disclosed: you can query, with SQL, all optimizer enhancements and bug fixes, and the versions in which they appear. And you can even enable or disable them at query level:


select /*+ opt_param('_fix_control' '22174392:OFF') */
continentexp,countriesandterritories,cases from covid where cases>0
order by daterep desc, cases desc fetch first 5 rows only

This simulates previous versions where the fix was not there.
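
The same fix control can also be toggled for the whole session rather than per query; a minimal sketch (underscore parameters should normally only be touched when advised by Oracle Support):


alter session set "_fix_control"='22174392:OFF';
alter session set "_fix_control"='22174392:ON';
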

Here is an example on the COVID table I’ve created in a previous post: https://blog.dbi-services.com/oracle-select-from-file/
I’m running this on the Oracle Cloud 20c preview but you can run the same on any recent version.

FETCH FIRST n ROWS

I’m looking at the Top-5 countries with the highest number of covid cases on the latest date I have in my database. This means ORDER BY the date and the number of cases (both in descending order) and fetching only the first 5 rows.


SQL> select continentexp,countriesandterritories,cases
  2  from covid where cases>0
  3  order by daterep desc, cases desc fetch first 5 rows only
  4  /


   CONTINENTEXP     COUNTRIESANDTERRITORIES    CASES
_______________ ___________________________ ________
America         United_States_of_America       57258
Asia            India                          28701
America         Brazil                         24831
Africa          South_Africa                   12058
Europe          Russia                          6615


SQL> select * from dbms_xplan.display_cursor(format=>'allstats last')
  2  /


                                                                                                         PLAN_TABLE_OUTPUT
__________________________________________________________________________________________________________________________
SQL_ID  753q1ymf0sv0w, child number 1
-------------------------------------
select continentexp,countriesandterritories,cases from covid where
cases>0 order by daterep desc, cases desc fetch first 5 rows only

Plan hash value: 1833981741

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |       |      1 |        |      5 |00:00:00.01 |     239 |       |       |          |
|*  1 |  VIEW                    |       |      1 |      5 |      5 |00:00:00.01 |     239 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK|       |      1 |  20267 |      5 |00:00:00.01 |     239 |   124K|   124K|  110K (0)|
|*  3 |    TABLE ACCESS FULL     | COVID |      1 |  20267 |  18150 |00:00:00.01 |     239 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=5)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY INTERNAL_FUNCTION("DATEREP") DESC ,INTERNAL_FUNCTION("CASES") DESC )<=5)
   3 - filter("CASES">0)

I have queried the execution plan because the RDBMS optimization is not a black box: you can ask for an explanation. Here, it reads the whole table (TABLE ACCESS FULL), SORT it, and FILTER the rows up to number 5.

This is not efficient at all. In a relational database, rather than streaming all changes to another data store, we add purpose-built indexes to scale with a new query use-case:


SQL> create index covid_date_cases on covid(daterep,cases)
  2  /
Index COVID_DATE_CASES created.

That’s all. This index will be automatically maintained with strong consistency, and transparently for any modifications to the table.

NOSORT STOPKEY

I’m running exactly the same query:


SQL> select continentexp,countriesandterritories,cases
  2  from covid where cases>0
  3  order by daterep desc, cases desc fetch first 5 rows only
  4  /

   CONTINENTEXP     COUNTRIESANDTERRITORIES    CASES
_______________ ___________________________ ________
America         United_States_of_America       57258
Asia            India                          28701
America         Brazil                         24831
Africa          South_Africa                   12058
Europe          Russia                          6615

SQL> select * from dbms_xplan.display_cursor(format=>'allstats last')
  2  /

                                                                                              PLAN_TABLE_OUTPUT
_______________________________________________________________________________________________________________
SQL_ID  753q1ymf0sv0w, child number 2
-------------------------------------
select continentexp,countriesandterritories,cases from covid where
cases>0 order by daterep desc, cases desc fetch first 5 rows only

Plan hash value: 1929215041

------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name             | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                  |      1 |        |      5 |00:00:00.01 |       7 |
|*  1 |  VIEW                         |                  |      1 |      5 |      5 |00:00:00.01 |       7 |
|*  2 |   WINDOW NOSORT STOPKEY       |                  |      1 |      6 |      5 |00:00:00.01 |       7 |
|   3 |    TABLE ACCESS BY INDEX ROWID| COVID            |      1 |  20267 |      5 |00:00:00.01 |       7 |
|*  4 |     INDEX FULL SCAN DESCENDING| COVID_DATE_CASES |      1 |      6 |      5 |00:00:00.01 |       3 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=5)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY INTERNAL_FUNCTION("DATEREP") DESC,INTERNAL_FUNCTION("CASES") DESC )<=5)
   4 - filter("CASES">0)

There is a big difference here. I don’t need to read the table anymore. The index provides the entries already sorted, and a FULL SCAN DESCENDING will just read the first ones. Look at the ‘Buffers’ column here, which shows the read units. In the previous test, without the index, it was 239 blocks. What was bad there is that the FULL TABLE SCAN has an O(N) time complexity. Now with the index, I read only 7 blocks (3 to go down the index B*Tree and 4 to get the remaining column values). And this has a constant time complexity: even with billions of rows in the table this query will read fewer than 10 blocks, and most of them probably from memory. This will take less than a millisecond whatever the size of the table is.

3 small conclusions here:

  • If you think your RDBMS doesn’t scale, be sure that you run the latest version of it. Mainstream databases have constant improvements and bug fixes.
  • You don’t need another database each time you have another use case. Creating indexes can solve many problems and you have full agility when the RDBMS can create them online, and when there’s zero code to change.
  • You don’t need to test on billions of rows. The execution plan operations tell you what scales and what doesn’t. You don’t need to tell the optimizer how to proceed, but it tells you how it plans the execution and how it did.

Cet article 19c: scalable Top-N queries without further hints to the query planner est apparu en premier sur Blog dbi services.

DBPod – le podcast Bases de Données

$
0
0

By Franck Pachot

.
I’m trying something new. I publish a lot in English (blog, articles, presentations) but this time it is something 100% French-speaking. Coming out of lockdown, we are back in transport (train, car,…) and it’s an opportunity to relax with music but also to stay informed with podcasts. I have the feeling that this is a format with a future: less constraining than watching a video or reading an article or a newsletter. So I’m testing a 100% free platform: Anchor (it’s a bit like the ‘Medium’ of podcasts).

What I find interesting about Anchor is that there is a smartphone app that also lets you leave voice messages. So don’t hesitate to send remarks or questions. Your experiences with databases can be interesting for others. Cloud, Open Source migrations,… it’s in the air and it is always useful to share. With the health emergency, we see fewer meetups, so why not try other media.

For the moment I have published 3 episodes about Oracle, on the important topics for this year: the versions (19c or 20c), the Release Updates and Multitenant. And a more general episode about serverless databases. The topics will come from current events and from my daily experience in consulting or research. But also from your questions.

Anchor seems practical to me, but it is also available on Spotify (‘Follow’ lets you get notified of new episodes):

And several other platforms, like iTunes:
https://podcasts.apple.com/ch/podcast/dbpod-le-podcast-bases-de-donn%C3%A9es/id1520397763?l=fr

Or RadioPublic:

Breaker:
https://www.breaker.audio/db-pod-le-podcast-bases-de-donnees
Google Podcasts:
https://www.google.com/podcasts?feed=aHR0cHM6Ly9hbmNob3IuZm0vcy8yMDdjNmIyMC9wb2RjYXN0L3Jzcw==
Overcast:
https://overcast.fm/itunes1520397763/db-pod-le-podcast-bases-de-donn-es
Pocket Casts:
https://pca.st/qtv36wbn

For these first episodes, I’m still learning… The recording is much too fast: I was a bit heavy-handed with Audacity, which makes it easy to remove imperfections. And when you listen to yourself… you see imperfections everywhere! Don’t hesitate to give me feedback, on the content (the topics you want me to talk about) and on the form (it can only get better…). And to share it around you, with colleagues or managers. The topics will be less technical and more general than my blog posts.

Cet article DBPod – le podcast Bases de Données est apparu en premier sur Blog dbi services.


Create and configure an Oracle linked server in SQL Server Instance

$
0
0

What is a linked server?

Nowadays, every company runs different applications, and each of them has different needs in terms of database support. When data is spread across multiple databases, managing it becomes a bit harder. For example, moving some data from an Oracle database to a SQL Server database is not easy.

A linked server is configured to enable the Database Engine to execute a Transact-SQL statement that includes tables in another instance of SQL Server, or in another database product such as Oracle.

 

In this blog, we will see a step-by-step guide to create and configure an Oracle linked server in a SQL Server instance.

  • Download ‘oracle database client’
    https://edelivery.oracle.com/osdc/faces/SoftwareDelivery
  • Install ‘Oracle database client’


  • Create a tnsnames.ora file in (C:\app\client\Administrator\product\12.2\client_1\network\admin)
WSDBA01 =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.106)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = WSDBA01)
    )
  )

 

  • Run PowerShell as administrator and then do a tnsping
    Check your firewall on each server and between them. In this example we do not have any firewall.

 

  • Check your provider by creating a ‘test_connection.udl’ file on your Desktop
    Then double-click on it to open it

  • Choose ‘Oracle Provider for OLE DB’ and then click Next


  • Enter your service name (defined in the tnsnames.ora file) for ‘Data Source’
    Enter a valid username and password and then click on ‘Test Connection’

  • Open SSMS (SQL Server Management Studio) and connect to your SQL Server instance
    Under ‘Server Objects’, then ‘Linked Servers’ and then ‘Providers’, you must see the ‘OraOLEDB.Oracle’ provider

  • Right-click on it and then choose ‘Properties’
    Check ‘Allow inprocess’ and then click ‘OK’

 

  • Right click on ‘Linked server’ and then choose ‘New linked server’
  • In General section :
    ‘Linked server’ : Give a name to your Linked server
    ‘Provider’ : Choose ‘Oracle Provider for OLE DB’
    ‘Data source’ : Your service name (defined in tnsnames.ora file)

  • In the Security section
    Enter a valid username and password and then click ‘OK’

  • Now, under your linked server, then ‘Catalogs’ and then ‘default’, you must see every table that your user has the right to query. The same setup can also be scripted in T-SQL, as shown below.
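
Here is a minimal T-SQL sketch of the equivalent configuration (my assumptions: the linked server name WSDBA01 matching the TNS alias above, and placeholder Oracle credentials scott/tiger):


EXEC master.dbo.sp_addlinkedserver
     @server = N'WSDBA01', @srvproduct = N'Oracle',
     @provider = N'OraOLEDB.Oracle', @datasrc = N'WSDBA01';

EXEC master.dbo.sp_addlinkedsrvlogin
     @rmtsrvname = N'WSDBA01', @useself = N'False',
     @locallogin = NULL, @rmtuser = N'scott', @rmtpassword = N'tiger';

-- quick test with a pass-through query:
SELECT * FROM OPENQUERY(WSDBA01, 'select sysdate from dual');
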

 

enjoy 🙂

Cet article Create and configure an Oracle linked server in SQL Server Instance est apparu en premier sur Blog dbi services.

A Serverless Standby Database called Oracle Autonomous Data Guard

$
0
0

By Franck Pachot

.
Announced by Larry Ellison last week, here it is: the Autonomous Data Guard. You can try it, unfortunately not on the Free Tier.
First you create an Autonomous Database (ATP or ADW) and then you enable Autonomous Data Guard.

You know that “Autonomous” is the marketing brand for the services that automate a lot of things, sometimes based on features that have been in the Oracle Database for a long time. So let’s see what is behind it.

Is it a logical copy?

The slide for the announcement mentions that this service maintains a remote copy. That’s right. But the message that it “maintains copy by applying logical changes, not physical changes” is incorrect and misleading. What we call “logical changes apply” is logical replication, where the changes are transformed to SQL statements and can then be applied to another database that can be a different version, with a different design,… Like GoldenGate. But Autonomous Data Guard is replicating physical changes. It applies the redo to an exact physical copy of the datafile blocks. Like… a Data Guard physical standby.
But why did Larry Ellison mention “logical” then? Because the apply is at software level. And this is a big difference from storage level synchronisation. We use the term “logical corruption” when a software bug corrupts some data. And we use “physical corruption” when the software write() is ok but the storage write to disk is wrong. And this is why “logical changes” is mentioned there: this software level replication protects from physical corruptions. Data Guard can even detect lost writes between the replicas.
And this is an important message for Oracle because on AWS RDS the standby database for HA in multi-AZ is synchronized at storage level: AWS RDS doesn’t use Data Guard for multi-AZ Oracle. Note that it is different with other databases like Aurora, where the changes are written to 6 copies from software redo, or RDS SQL Server, where multi-AZ relies on Always On.

So, it is not a logical copy but a physical standby database. The point is that it is synchronized by the database software, which is more reliable (it protects from storage corruption) and more efficient (not all changes need to be replicated, and only a few of them must wait synchronously for the acknowledgment).
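
As a side note, in a self-managed Data Guard configuration (not something you set on the Autonomous service, where all of this is managed for you), the lost write detection mentioned above is controlled by an initialization parameter; a minimal sketch:


-- valid values are NONE, TYPICAL and FULL
alter system set DB_LOST_WRITE_PROTECT=TYPICAL scope=both;
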

Is it Autonomous?

Yes, all is automated. The only thing you do is enable it and switch over. Those things are not new. Data Guard was already automated in previous versions of the Oracle Database, with the Data Guard Broker, with DBCA creating a standby, with recover from service, and even with automatic failover (FSFO and observer). More than that, “autonomous” means transparent: it happens without service interruption. And that again can be based on many existing features: Application Continuity, Connection Manager Traffic Director,…

So yes, it is autonomous, and up to a level where the competitors are lagging behind. Currently, AWS application failover is mostly based on DNS changes, with all the problems coming from caches and timeouts. However, AWS has recently reduced the gap with AWS RDS Proxy, which is quite new.

This time I totally agree with the term “autonomous”. And I even think it could have been labeled as “serverless” because you don’t see the standby server: you don’t choose the shape, you don’t connect to it. I don’t even see the price 😉 but I’ll update this post as soon as I have found it.

Is it Data Guard?

Oh, that’s a great question. And I don’t think we can answer it now. I mentioned many features that can be autonomous, like creating a standby, having a broker to maintain the states, an observer to do the failover,… But that’s all at CDB level in multitenant. However, an Autonomous Database is a PDB. All recovery stuff like redo log shipping is done at CDB level. At least in the current version of it (19c).

However, from the beginning of multitenant, we want to do with a PDB the same things we do with a database. And each release came with more features to look like a standby PDB. Here is a slide I use to illustrate “PDB Switchover”:

So is this Autonomous Data Guard a multitenant feature (refreshable clone) or is it a Data Guard feature? Maybe both. The documentation mentions an RPO of zero for Automatic Failover and an RPO of 5 minutes for manual failover. I don’t think we can have RPO=0 with refreshable clones, as the redo is applied by a job that runs every few minutes. So, the automatic failover is probably at CDB level: when the whole CDB is unavailable, as detected by the observer, and the standby is in sync, then the standby CDB is activated and all sessions are redirected there (we connect through a Connection Manager). A manual failover must touch only our PDB, and that’s done with a refreshable PDB switchover. They mention RPO=5 minutes because that’s probably the automatic refresh frequency. Then, a manual failover may lose 5 minutes of transactions if the primary is not available. You cannot initiate a failover yourself when the autonomous database is available. When it is available, that’s a switchover without any transaction loss.

So, for the moment in 19c I think that this Autonomous Data Guard is a combination of Data Guard to protect from global failure (failover the whole CDB to another availability domain) and Refreshable PDB for manual failover/switchover. But if you look at the 20c binaries and hidden parameters, you will see more and more mentions of “standby redo logs” and “standby control files” in functions that are related to the PDBs. So you know where it goes: Autonomous means PDB and Autonomous Data Guard will probably push many physical replication features at PDB level. And, once again, when this implementation detail is hidden (do you failover to a CDB standby or a PDB clone?) that deserves a “serverless” hashtag, right? Or should I say that an Autonomous Database is becoming like a CDB-less PDB, where you don’t know in which CDB your PDB is running?
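
For reference, here is a minimal sketch of the underlying multitenant mechanics on self-managed CDBs (hypothetical PDB and database link names; the exact switchover syntax and the side it must be run from should be checked against the documentation for your version):


create pluggable database PDB1_STBY from PDB1@CDB1_LINK
  refresh mode every 5 minutes;

alter pluggable database PDB1_STBY
  refresh mode every 5 minutes from PDB1@CDB1_LINK switchover;
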

Manual Switchover

Here is my PDB History after a few manual switchovers:


SQL> select * from dba_pdb_history;


                         PDB_NAME    PDB_ID      PDB_DBID                            PDB_GUID     OP_SCNBAS    OP_SCNWRP    OP_TIMESTAMP    OPERATION    DB_VERSION                                                CLONED_FROM_PDB_NAME    CLONED_FROM_PDB_DBID                CLONED_FROM_PDB_GUID    DB_NAME    DB_UNIQUE_NAME       DB_DBID    CLONETAG    DB_VERSION_STRING 
_________________________________ _________ _____________ ___________________________________ _____________ ____________ _______________ ____________ _____________ ___________________________________________________________________ _______________________ ___________________________________ __________ _________________ _____________ ___________ ____________________ 
DWCSSEED                                  3    2852116004 9D288B8436DC5D21E0530F86E50AABD0          1394780            0 27.01.20        CREATE           318767104 PDB$SEED                                                                         1300478808 9D274DFD3CEF1D1EE0530F86E50A3FEF    POD        POD                  1773519138             19.0.0.0.0           
OLTPSEED                                 13     877058107 9EF2060B285F15A5E053FF14000A3E66         14250364            0 19.02.20        CLONE            318767104 DWCSSEED                                                                         2852116004 9D288B8436DC5D21E0530F86E50AABD0    CTRL5      CTRL5                1593803312             19.0.0.0.0           
OLTPSEED_COPY                             4    3032004105 AA768E27A41904D5E053FF14000A4647         34839618            0 15.07.20        CLONE            318767104 OLTPSEED                                                                          877058107 9EF2060B285F15A5E053FF14000A3E66    CTRL5      CTRL5                1593803312             19.0.0.0.0           
OLTPSEED_COPY                             4    3032004105 AA768E27A41904D5E053FF14000A4647         34841402            0 15.07.20        UNPLUG           318767104                                                                                           0                                     CTRL5      CTRL5                1593803312             19.0.0.0.0           
POOLTENANT_OLTPSEED21594803198          289    3780898157 AA780E2370A0AB9AE053DB10000A1A31       3680404503         8559 15.07.20        PLUG             318767104 POOLTENANT_OLTPSEED21594803198                                                   3032004105 AA768E27A41904D5E053FF14000A4647    E3Z1POD    e3z1pod              1240006038             19.0.0.0.0           
CQWRIAXKGYBKVNX_DB202007151508          289    3780898157 AA780E2370A0AB9AE053DB10000A1A31       3864706491         8559 15.07.20        RENAME           318767104 POOLTENANT_OLTPSEED21594803198                                                   3780898157 AA780E2370A0AB9AE053DB10000A1A31    E3Z1POD    e3z1pod              1240006038             19.0.0.0.0           
CQWRIAXKGYBKVNX_DB202007151508          231    1129302642 AA780E2370A0AB9AE053DB10000A1A31       3877635830         8559 15.07.20        CLONE            318767104 CQWRIAXKGYBKVNX_DB202007151508@POD_CDB_ADMIN$_TEMPDBL_PCOUV6FR5J                 3780898157 AA780E2370A0AB9AE053DB10000A1A31    EKG1POD    ekg1pod               918449036             19.0.0.0.0           
CQWRIAXKGYBKVNX_DB202007151508          336    1353046666 AA780E2370A0AB9AE053DB10000A1A31       4062625844         8559 15.07.20        CLONE            318767104 CQWRIAXKGYBKVNX_DB202007151508@POD_CDB_ADMIN$_TEMPDBL_LXW8COMBRV                 3780898157 AA780E2370A0AB9AE053DB10000A1A31    E3Z1POD    e3z1pod              1240006038             19.0.0.0.0           
CQWRIAXKGYBKVNX_DB202007151508          258    1792891716 AA780E2370A0AB9AE053DB10000A1A31       4090531039         8559 15.07.20        CLONE            318767104 CQWRIAXKGYBKVNX_DB202007151508@POD_CDB_ADMIN$_TEMPDBL_YJSIBZ76EE                 3780898157 AA780E2370A0AB9AE053DB10000A1A31    EKG1POD    ekg1pod               918449036             19.0.0.0.0           
CQWRIAXKGYBKVNX_DB202007151508          353    2591868894 AA780E2370A0AB9AE053DB10000A1A31       4138073371         8559 15.07.20        CLONE            318767104 CQWRIAXKGYBKVNX_DB202007151508@POD_CDB_ADMIN$_TEMPDBL_LJ47TUYFEX                 3780898157 AA780E2370A0AB9AE053DB10000A1A31    E3Z1POD    e3z1pod              1240006038             19.0.0.0.0           


10 rows selected. 

You see the switchover as a ‘CLONE’ operation. A clone with the same GUID. The primary was the PDB_DBID=1792891716 that was CON_ID=258 in its CDB. And the refreshable clone PDB_DBID=2591868894 opened as CON_ID=353 is the one I switched over, which is now the primary.

I have selected from DBA_PDBS as json-formatted (easy in SQLcl) to show the columns in lines:


{
          "pdb_id" : 353,
          "pdb_name" : "CQWRIAXKGYBKVNX_DB202007151508",
          "dbid" : 3780898157,
          "con_uid" : 2591868894,
          "guid" : "AA780E2370A0AB9AE053DB10000A1A31",
          "status" : "NORMAL",
          "creation_scn" : 36764763159835,
          "vsn" : 318767104,
          "logging" : "LOGGING",
          "force_logging" : "NO",
          "force_nologging" : "NO",
          "application_root" : "NO",
          "application_pdb" : "NO",
          "application_seed" : "NO",
          "application_root_con_id" : "",
          "is_proxy_pdb" : "NO",
          "con_id" : 353,
          "upgrade_priority" : "",
          "application_clone" : "NO",
          "foreign_cdb_dbid" : 918449036,
          "unplug_scn" : 36764717154894,
          "foreign_pdb_id" : 258,
          "creation_time" : "15.07.20",
          "refresh_mode" : "NONE",
          "refresh_interval" : "",
          "template" : "NO",
          "last_refresh_scn" : 36764763159835,
          "tenant_id" : "(DESCRIPTION=(TIME=1594839531009)(TENANT_ID=29A6A11B6ACD423CA87B77E1B2C53120,29A6A11B6ACD423CA87B77E1B2C53120.53633434F47346E29EE180E736504429))",
          "snapshot_mode" : "MANUAL",
          "snapshot_interval" : "",
          "credential_name" : "",
          "last_refresh_time" : "15.07.20",
          "cloud_identity" : "{\n  \"DATABASE_NAME\" : \"DB202007151508\",\n  \"REGION\" : \"us-ashburn-1\",\n  \"TENANT_OCID\" : \"OCID1.TENANCY.OC1..AAAAAAAACVSEDMDAKDVTTCMGVFZS5RPB6RTQ4MCQBMCTZCVCR2NWDUPYLYEQ\",\n  \"DATABASE_OCID\" : \"OCID1.AUTONOMOUSDATABASE.OC1.IAD.ABUWCLJSPZJCV2KUXYYEVTOZSX7K5TXGYOHXMIGWUXN47YPGVSAFEJM3SG2A\",\n  \"COMPARTMENT_OCID\" : \"ocid1.tenancy.oc1..aaaaaaaacvsedmdakdvttcmgvfzs5rpb6rtq4mcqbmctzcvcr2nwdupylyeq\",\n  \"OUTBOUND_IP_ADDRESS\" :\n  [\n    \"150.136.133.92\"\n  ]\n}"
        }

You can see, from the LAST_REFRESH_SCN being equal to the CREATION_SCN, that the latest committed transactions were synced at the time of the switchover. The ATP service shows: “Primary database switchover completed. No data loss during transition!”

And here is the result from the same query before the switchover (I display only what had changed):


          "pdb_id" : 258,
          "con_uid" : 1792891716,
          "creation_scn" : 36764715617503,
          "con_id" : 258,
	  "foreign_cdb_dbid" : 1240006038,
          "unplug_scn" : 36764689231834,
          "foreign_pdb_id" : 336,
          "last_refresh_scn" : 36764715617503,
          "tenant_id" : "(DESCRIPTION=(TIME=1594835734415)(TENANT_ID=29A6A11B6ACD423CA87B77E1B2C53120,29A6A11B6ACD423CA87B77E1B2C53120.53633434F47346E29EE180E736504429))",

PDB_ID and CON_ID were different because the PDB was plugged into a different CDB. But DBID and GUID are the same on the primary and the clone because it is the same datafile content. The FOREIGN_CDB_DBID and FOREIGN_PDB_ID are what reference the primary from the standby and the standby from the primary. The LAST_REFRESH_SCN is always equal to the CREATION_SCN when I query it from the primary, as it was activated without data loss. I cannot query the refreshable clone while it is in the standby role.
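
By the way, the JSON-per-line output above comes from SQLcl’s sqlformat; a minimal sketch of how to reproduce it, assuming you are connected to the PDB itself:

SQL> set sqlformat json-formatted
SQL> select * from dba_pdbs where con_id = sys_context('userenv','con_id');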

Autonomous and Serverless don’t mean Traceless, fortunately:



SQL> select payload from GV$DIAG_TRACE_FILE_CONTENTS where TRACE_FILENAME in ('e3z1pod4_ora_108645.trc') order by timestamp;

PAYLOAD
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
*** 2020-07-15T18:33:16.885602+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Started Serial Media Recovery
Loading kcnibr for container 353
Dumping database incarnation table:
Resetlogs 0 scn and time: 0x00000000001df6a8 04/23/2020 20:44:09
Dumping PDB pathvec - index 0
   0000 : pdb 353, dbinc 3, pdbinc 0
          db rls 0x00000000001df6a8 rlc 1038516249
          incscn 0x0000000000000000 ts 0
          br scn 0x0000000000000000 ts 0
          er scn 0x0000000000000000 ts 0
   0001 : pdb 353, dbinc 2, pdbinc 0
          db rls 0x00000000001db108 rlc 1038516090
          incscn 0x0000000000000000 ts 0
          br scn 0x0000000000000000 ts 0
          er scn 0x0000000000000000 ts 0
Recovery target incarnation = 3, activation ID = 0
Influx buffer limit = 100000 min(50% x 29904690, 100000)
Start recovery at thread 7 ckpt scn 36764741795405 logseq 0 block 0
*** 2020-07-15T18:33:17.181814+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery add redo thread 7
*** 2020-07-15T18:33:17.228175+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery add redo thread 1
*** 2020-07-15T18:33:17.273432+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery add redo thread 2
*** 2020-07-15T18:33:17.318517+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery add redo thread 3
*** 2020-07-15T18:33:17.363512+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery add redo thread 4
*** 2020-07-15T18:33:17.412186+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery add redo thread 5
*** 2020-07-15T18:33:17.460044+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery add redo thread 6
*** 2020-07-15T18:33:17.502354+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery add redo thread 8
*** 2020-07-15T18:33:17.575400+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery Log /u02/nfsad1/e19pod/parlog_7_2609_1_1035414033.arc
krr_open_logfile: Restricting nab of log-/u02/nfsad1/e19pod/parlog_7_2609_1_1035414033.arc, thr-7, seq-2609                to 2 blocks.recover pdbid-258
*** 2020-07-15T18:33:17.620308+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery Log /u02/nfsad1/e19pod/parlog_1_2610_1_1035414033.arc
Generating fake header for thr-1, seq-2610
*** 2020-07-15T18:33:17.651042+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery Log /u02/nfsad1/e19pod/parlog_2_2604_1_1035414033.arc
krr_open_logfile: restrict nab of remote log with                  thr#-2, seq#-2604, file-/u02/nfsad1/e19pod/parlog_2_2604_1_1035414033.arc, kcrfhhnab-43080320, newnab-43080320
*** 2020-07-15T18:33:41.867605+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery Log /u02/nfsad1/e19pod/parlog_3_2696_1_1035414033.arc
Generating fake header for thr-3, seq-2696
*** 2020-07-15T18:33:41.900123+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery Log /u02/nfsad1/e19pod/parlog_4_2585_1_1035414033.arc
Generating fake header for thr-4, seq-2585
*** 2020-07-15T18:33:41.931004+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery Log /u02/nfsad1/e19pod/parlog_5_2967_1_1035414033.arc
Generating fake header for thr-5, seq-2967
*** 2020-07-15T18:33:41.963625+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery Log /u02/nfsad1/e19pod/parlog_6_2611_1_1035414033.arc
Generating fake header for thr-6, seq-2611
*** 2020-07-15T18:33:41.998296+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery Log /u02/nfsad1/e19pod/parlog_8_2586_1_1035414033.arc
Generating fake header for thr-8, seq-2586
*** 2020-07-15T18:33:53.033726+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery drop redo thread 8
*** 2020-07-15T18:33:53.033871+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery drop redo thread 6
*** 2020-07-15T18:33:53.033946+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery drop redo thread 5
*** 2020-07-15T18:33:53.034015+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery drop redo thread 4
*** 2020-07-15T18:33:53.034154+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery drop redo thread 3
*** 2020-07-15T18:33:53.034239+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery drop redo thread 1
*** 2020-07-15T18:33:53.034318+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery drop redo thread 7
==== Redo read statistics for thread 2 ====
Total physical reads (from disk and memory): 21540159Kb
-- Redo read_disk statistics --
Read rate (ASYNC): 21540159Kb in 35.77s => 588.07 Mb/sec
Total redo bytes: 21540159Kb Longest record: 12Kb, moves: 20988/2785829 moved: 179Mb (0%)
Longest LWN: 31106Kb, reads: 5758
Last redo scn: 0x0000216ff58a3d72 (36764744564082)
Change vector header moves = 264262/2971514 (8%)
----------------------------------------------
*** 2020-07-15T18:33:53.034425+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Media Recovery drop redo thread 2
KCBR: Number of read descriptors = 1024
KCBR: Media recovery blocks read (ASYNC) = 77
KCBR: Influx buffers flushed = 9 times
KCBR: Reads = 2 reaps (1 null, 0 wait), 1 all
KCBR: Redo cache copies/changes = 667/667
*** 2020-07-15T18:33:53.187031+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
Completed Media Recovery
----- Abridged Call Stack Trace -----
ksedsts()+426<-krdsod()+251<-kss_del_cb()+218<-kssdel()+216<-krdemr()+5836<-krd_end_rcv()+609<-dbs_do_recovery()+3232<-dbs_rcv_start_main()+4890<-dbs_rcv_start()+115<-kpdbcRecoverPdb()+1039<-kpdbSwitch()+3585<-kpdbcApplyRecovery()+2345<-kpdbcRefreshPDB()+12073<-kpdbSwitchRunAsSysCbk()+20<-rpiswu2()+2004<-kpdbSwitch()+3563<-kpdbcRefreshDrv()+1410<-kpdbSwitch()+3585<-kpdbadrv()+34922<-opiexe()+26658<-opiosq0()+4635<-opipls()+14388<-opiodr()+1202<-rpidrus()+198<-skgmstack()+65<-rpidru()+132<-rpiswu2()+543<-rpidrv()+1266
*** 2020-07-15T18:33:53.252481+00:00 (CQWRIAXKGYBKVNX_DB202007151508(353))
<-psddr0()+467<-psdnal()+624<-pevm_EXIM()+282<-pfrinstr_EXIM()+43<-pfrrun_no_tool()+60<-pfrrun()+902<-plsql_run()+755<-peicnt()+279<-kkxexe()+720<-opiexe()+31050<-kpoal8()+2226<-opiodr()+1202<-ttcpip()+1239<-opitsk()+1897<-opiino()+936<-opiodr()+1202<-opidrv()+1094
<-sou2o()+165<-opimai_real()+422<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
----- End of Abridged Call Stack Trace -----
Partial short call stack signature: 0x43b703b8697309f4
Unloading kcnibr
Elapsed: 01:00:03.376

Looking at the call stack (kpdbcRecoverPdb<-kpdbSwitch<-kpdbcApplyRecovery<-kpdbcRefreshPDB<-kpdbSwitchRunAsSysCbk<-rpiswu2<-kpdbSwitch<-kpdbcRefreshDrv<-kpdbSwitch) it is clear that the Autonomous Data Guard manual switchover is nothing other than a "Refreshable PDB Switchover", a feature introduced in 18c and available only on Oracle Engineered Systems (Exadata and ODA) and in the Oracle Cloud. At least for the moment.

This article, A Serverless Standby Database called Oracle Autonomous Data Guard, first appeared on Blog dbi services.

Oracle ASH SQL_PLAN_LINE_ID in adaptive plans


There are several methods to find out where time is spent in the execution plan of a query running in an Oracle database: classical methods like SQL Trace with a formatter tool like tkprof on the raw trace, or newer methods like SQL Monitor (when the Tuning Pack has been licensed), or running a query with the GATHER_PLAN_STATISTICS hint (or with statistics_level=all set in the session) and then using DBMS_XPLAN.DISPLAY_CURSOR(format=>’ALLSTATS LAST’). However, what I often use is the SQL_PLAN_LINE_ID information in Active Session History (ASH), i.e. the one-second samples of the recent past in V$ACTIVE_SESSION_HISTORY or of the AWR history in DBA_HIST_ACTIVE_SESS_HISTORY. However, you have to be careful when interpreting the ASH SQL_PLAN_LINE_ID in an adaptive plan.

Here’s an example:

REMARK: I artificially slowed down the processing with the method described by Chris Antognini here: https://antognini.ch/2013/12/adaptive-plans-in-active-session-history/
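
The method relies on a PL/SQL function that burns CPU for a given number of seconds, called from the WHERE clause. Here is a minimal sketch of such a function (hypothetical: the exact implementation in the referenced post may differ):

create or replace function burn_cpu(p_seconds in number) return number is
  l_end timestamp := systimestamp + numtodsinterval(p_seconds, 'second');
begin
  -- busy loop: consume CPU until p_seconds have elapsed
  while systimestamp < l_end loop
    null;
  end loop;
  return 1;
end;
/

With burn_cpu(e.employee_id/e.employee_id/4) in the predicate, each of the 107 EMPLOYEES rows burns about 0.25 seconds, which explains the roughly 28 seconds of elapsed time below.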

SQL> @test

DEPARTMENT_NAME
------------------------------
Executive
IT
Finance
Purchasing
Shipping
Sales
Administration
Marketing
Human Resources
Public Relations
Accounting

11 rows selected.

Elapsed: 00:00:27.83
SQL> select * from table(dbms_xplan.display_cursor(format=>'ALLSTATS LAST'));

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
SQL_ID  8nuct789bh45m, child number 0
-------------------------------------
 select distinct DEPARTMENT_NAME from DEPARTMENTS d   join EMPLOYEES e
using(DEPARTMENT_ID)   where d.LOCATION_ID like '%0' and e.SALARY>2000
burn_cpu(e.employee_id/e.employee_id/4) = 1

Plan hash value: 1057942366

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name        | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |             |      1 |        |     11 |00:00:27.82 |      12 |       |       |          |
|   1 |  HASH UNIQUE        |             |      1 |      1 |     11 |00:00:27.82 |      12 |  1422K|  1422K|61440  (0)|
|*  2 |   HASH JOIN         |             |      1 |      1 |    106 |00:00:27.82 |      12 |  1572K|  1572K| 1592K (0)|
|*  3 |    TABLE ACCESS FULL| DEPARTMENTS |      1 |      1 |     27 |00:00:00.01 |       6 |       |       |          |
|*  4 |    TABLE ACCESS FULL| EMPLOYEES   |      1 |      1 |    107 |00:00:27.82 |       6 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("D"."DEPARTMENT_ID"="E"."DEPARTMENT_ID")
   3 - filter(TO_CHAR("D"."LOCATION_ID") LIKE '%0')
   4 - filter(("E"."SALARY">2000 AND "BURN_CPU"("E"."EMPLOYEE_ID"/"E"."EMPLOYEE_ID"/4)=1))

Note
-----
   - this is an adaptive plan


30 rows selected.

So I’d expect to see SQL_PLAN_LINE_ID = 4 in ASH for the 28 seconds of wait time here. But it’s different:

SQL> select sql_plan_line_id, count(*) from v$active_session_history 
   2 where sql_id='8nuct789bh45m'and sql_plan_hash_value=1057942366 group by sql_plan_line_id;

SQL_PLAN_LINE_ID   COUNT(*)
---------------- ----------
               9         28

1 row selected.

The 28 seconds are spent on SQL_PLAN_LINE_ID 9. Why is that?

The reason is that ASH reports the SQL_PLAN_LINE_ID of adaptive plans according to the whole adaptive plan:

SQL> select * from table(dbms_xplan.display_cursor('8nuct789bh45m',format=>'+ADAPTIVE'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------
SQL_ID  8nuct789bh45m, child number 0
-------------------------------------
 select distinct DEPARTMENT_NAME from DEPARTMENTS d   join EMPLOYEES e
using(DEPARTMENT_ID)   where d.LOCATION_ID like '%0' and e.SALARY>2000
burn_cpu(e.employee_id/e.employee_id/4) = 1

Plan hash value: 1057942366

------------------------------------------------------------------------------------------------------
|   Id  | Operation                      | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT               |                   |       |       |     5 (100)|          |
|     1 |  HASH UNIQUE                   |                   |     1 |    30 |     5  (20)| 00:00:01 |
|  *  2 |   HASH JOIN                    |                   |     1 |    30 |     4   (0)| 00:00:01 |
|-    3 |    NESTED LOOPS                |                   |     1 |    30 |     4   (0)| 00:00:01 |
|-    4 |     NESTED LOOPS               |                   |    10 |    30 |     4   (0)| 00:00:01 |
|-    5 |      STATISTICS COLLECTOR      |                   |       |       |            |          |
|  *  6 |       TABLE ACCESS FULL        | DEPARTMENTS       |     1 |    19 |     3   (0)| 00:00:01 |
|- *  7 |      INDEX RANGE SCAN          | EMP_DEPARTMENT_IX |    10 |       |     0   (0)|          |
|- *  8 |     TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |     1 |    11 |     1   (0)| 00:00:01 |
|  *  9 |    TABLE ACCESS FULL           | EMPLOYEES         |     1 |    11 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("D"."DEPARTMENT_ID"="E"."DEPARTMENT_ID")
   6 - filter(TO_CHAR("D"."LOCATION_ID") LIKE '%0')
   7 - access("D"."DEPARTMENT_ID"="E"."DEPARTMENT_ID")
   8 - filter(("E"."SALARY">2000 AND "BURN_CPU"("E"."EMPLOYEE_ID"/"E"."EMPLOYEE_ID"/4)=1))
   9 - filter(("E"."SALARY">2000 AND "BURN_CPU"("E"."EMPLOYEE_ID"/"E"."EMPLOYEE_ID"/4)=1))

Note
-----
   - this is an adaptive plan (rows marked '-' are inactive)


37 rows selected.

So here we see the correct line 9. It’s actually a good idea to use

format=>'+ADAPTIVE'

when gathering plan statistics (through GATHER_PLAN_STATISTICS-hint or STATISTICS_LEVEL=ALL):

SQL> @test

DEPARTMENT_NAME
------------------------------
Executive
IT
Finance
Purchasing
Shipping
Sales
Administration
Marketing
Human Resources
Public Relations
Accounting

11 rows selected.

Elapsed: 00:00:27.96
SQL> select * from table(dbms_xplan.display_cursor(format=>'ALLSTATS LAST +ADAPTIVE'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------
SQL_ID  8nuct789bh45m, child number 0
-------------------------------------
 select distinct DEPARTMENT_NAME from DEPARTMENTS d   join EMPLOYEES e
using(DEPARTMENT_ID)   where d.LOCATION_ID like '%0' and e.SALARY>2000
burn_cpu(e.employee_id/e.employee_id/4) = 1

Plan hash value: 1057942366

-------------------------------------------------------------------------------------------------------------------------------------------
|   Id  | Operation                      | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT               |                   |      1 |        |     11 |00:00:27.85 |    1216 |       |       |          |
|     1 |  HASH UNIQUE                   |                   |      1 |      1 |     11 |00:00:27.85 |    1216 |  1422K|  1422K|73728  (0)|
|  *  2 |   HASH JOIN                    |                   |      1 |      1 |    106 |00:00:27.85 |    1216 |  1572K|  1572K| 1612K (0)|
|-    3 |    NESTED LOOPS                |                   |      1 |      1 |     27 |00:00:00.01 |       6 |       |       |          |
|-    4 |     NESTED LOOPS               |                   |      1 |     10 |     27 |00:00:00.01 |       6 |       |       |          |
|-    5 |      STATISTICS COLLECTOR      |                   |      1 |        |     27 |00:00:00.01 |       6 |       |       |          |
|  *  6 |       TABLE ACCESS FULL        | DEPARTMENTS       |      1 |      1 |     27 |00:00:00.01 |       6 |       |       |          |
|- *  7 |      INDEX RANGE SCAN          | EMP_DEPARTMENT_IX |      0 |     10 |      0 |00:00:00.01 |       0 |       |       |          |
|- *  8 |     TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|  *  9 |    TABLE ACCESS FULL           | EMPLOYEES         |      1 |      1 |    107 |00:00:27.85 |    1210 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("D"."DEPARTMENT_ID"="E"."DEPARTMENT_ID")
   6 - filter(TO_CHAR("D"."LOCATION_ID") LIKE '%0')
   7 - access("D"."DEPARTMENT_ID"="E"."DEPARTMENT_ID")
   8 - filter(("E"."SALARY">2000 AND "BURN_CPU"("E"."EMPLOYEE_ID"/"E"."EMPLOYEE_ID"/4)=1))
   9 - filter(("E"."SALARY">2000 AND "BURN_CPU"("E"."EMPLOYEE_ID"/"E"."EMPLOYEE_ID"/4)=1))

Note
-----
   - this is an adaptive plan (rows marked '-' are inactive)


37 rows selected.

Summary: Be careful checking SQL_PLAN_LINE_IDs in adaptive plans. This seems obvious in simple plans like the one above, but you may go in the wrong direction if there are plans with hundreds of lines, and your analysis may be wrong if you are looking at the wrong plan step. For further validation, you may query sql_plan_operation and sql_plan_options from v$active_session_history as well.
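
For example, a minimal sketch of such a cross-check, reusing the SQL_ID from above:

select sql_plan_line_id, sql_plan_operation, sql_plan_options, count(*) as samples
from v$active_session_history
where sql_id='8nuct789bh45m' and sql_plan_hash_value=1057942366
group by sql_plan_line_id, sql_plan_operation, sql_plan_options
order by samples desc;

With the +ADAPTIVE plan at hand, line 9 then maps unambiguously to the TABLE ACCESS FULL on EMPLOYEES.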

This article, Oracle ASH SQL_PLAN_LINE_ID in adaptive plans, first appeared on Blog dbi services.

The myth of NoSQL (vs. RDBMS) “a simpler API to bound resources”


By Franck Pachot

.
NoSQL provides an API that is much simpler than SQL, and one claimed advantage is that users cannot exceed a defined amount of resources in one call. You can read this in Alex DeBrie’s article https://www.alexdebrie.com/posts/dynamodb-no-bad-queries/#relational-queries-are-unbounded which I take as a base for some of my “Myth of NoSQL vs RDBMS” posts because he explains very well how SQL and NoSQL are perceived by the users. But this idea of a simpler API to limit what users can do is quite common, precedes the NoSQL era, and is still valid with some SQL databases. Here I’m demonstrating that some RDBMS provide a powerful API and can still bound what users can do. Oracle Database has had a resource manager for a long time, which can define resource limits on a per-service basis, and those features are very simple to use in the Oracle Autonomous Database – the managed database in the Oracle Cloud.

I am using the example schema of the ATP database in the free tier, so that anyone can play with this. As usual, what I show on 1 million rows and one thread can scale to multiple vCPUs and nodes. Once you understand the algorithms (execution plan) you know how it scales.


06:36:08 SQL> set echo on serveroutput on time on timing on

06:36:14 SQL> select count(*) ,sum(s.amount_sold),sum(p.prod_list_price) 
              from sh.sales s join sh.products p using(prod_id);


   COUNT(*)    SUM(S.AMOUNT_SOLD)    SUM(P.PROD_LIST_PRICE)
___________ _____________________ _________________________
     918843           98205831.21               86564235.57

Elapsed: 00:00:00.092

I have scanned nearly one million rows from the SALES table, joined it to the PRODUCTS table, and aggregated the data to show the sum of columns from both tables. That takes 92 milliseconds here (including the network roundtrip). You are not surprised to get a fast response with a join because you have read The myth of NoSQL (vs. RDBMS) “joins dont scale” 😉

Ok, now let’s say that a developer never learned about SQL joins and wants to do the same with a simpler scan/query API:


06:36:14 SQL> declare
    l_count_sales    number:=0;
    l_amount_sold    number:=0;
    l_sum_list_price number:=0;
   begin
    -- scan SALES
    for s in (select * from sh.sales) loop
     -- query PRODUCTS
     for p in (select * from sh.products where prod_id=s.prod_id) loop
      -- aggregate SUM and COUNT
      l_count_sales:=l_count_sales+1;
      l_amount_sold:=l_amount_sold+s.amount_sold;
      l_sum_list_price:=l_sum_list_price+p.prod_list_price;
     end loop;
    end loop;
    dbms_output.put_line('count_sales='||l_count_sales||' amount_sold='||l_amount_sold||' sum_list_price='||l_sum_list_price);
   end;
   /

PL/SQL procedure successfully completed.

Elapsed: 00:02:00.374

I have run this within the database with PL/SQL because I don’t want to add network roundtrips and process switches to this bad design. You see that it takes 2 minutes here. Why? Because the risk when providing an API that doesn’t support joins is that the developer will do the join in procedural code. Without SQL, the developer has no efficient and agile way to do this GROUP BY and SUM that was a one-liner in SQL: they will either loop on this simple scan/get API, or add a lot of code to initialize and maintain an aggregate derived from this table.

So, what can I do to prevent a user from running this kind of query that takes a lot of CPU and I/O resources? A simpler API will not solve this problem, as the user will work around it with many small queries. In the Oracle Autonomous Database, the admin can set some limits per service:

This says: when connected to the ‘TP’ service (which is the one for transactional processing with high concurrency) a user query cannot use more than 5 seconds of elapsed time or the query is canceled.

Now if I run the statement again:


Error starting at line : 54 File @ /home/opc/demo/tmp/atp-resource-mgmt-rules.sql
In command -
declare
 l_count_sales    number:=0;
 l_amount_sold    number:=0;
 l_sum_list_price number:=0;
begin
 -- scan SALES
 for s in (select * from sh.sales) loop
  -- query PRODUCTS
  for p in (select * from sh.products where prod_id=s.prod_id) loop
   -- aggregate SUM and COUNT
   l_count_sales:=l_count_sales+1;
   l_amount_sold:=l_amount_sold+s.amount_sold;
   l_sum_list_price:=l_sum_list_price+p.prod_list_price;
  end loop;
 end loop;
 dbms_output.put_line('count_sales='||l_count_sales||' amount_sold='||l_amount_sold||' sum_list_price='||l_sum_list_price);
end;
Error report -
ORA-56735: elapsed time limit exceeded - call aborted
ORA-06512: at line 9
ORA-06512: at line 9
56735. 00000 -  "elapsed time limit exceeded - call aborted"
*Cause:    The Resource Manager SWITCH_ELAPSED_TIME limit was exceeded.
*Action:   Reduce the complexity of the update or query, or contact your
           database administrator for more information.

Elapsed: 00:00:05.930

I get a message that I exceeded the limit. I hope that, from the message “Action: Reduce the complexity”, the user will understand something like “Please use SQL to process data sets” and will write a query with the join.

Of course, if the developer is thick-headed, he will run his loop from the application code and execute one million short queries that do not exceed the time limit per execution. And it will be worse because of the roundtrips between the application and the database. The “Set Resource Management Rules” page has another tab besides “Run-away criteria”, which is “CPU/IO shares”, so that one service can be throttled when the overall resources are saturated. With this, we can give higher priority to critical services. But I prefer to address the root cause and show the developer that when you need to join data, the most efficient way is a SQL JOIN, and when you need to aggregate data, the most efficient way is SQL GROUP BY. Of course, we can also re-design the tables to pre-join (materialized views in SQL, or single-table design in DynamoDB for example) when data is ingested, but that’s another topic.

In the autonomous database, the GUI makes it simple, but you can query V$ views to monitor it. For example:


06:38:20 SQL> select sid,current_consumer_group_id,state,active,yields,sql_canceled,last_action,last_action_reason,last_action_time,current_active_time,active_time,current_consumed_cpu_time,consumed_cpu_time 
              from v$rsrc_session_info where sid=sys_context('userenv','sid');


     SID    CURRENT_CONSUMER_GROUP_ID      STATE    ACTIVE    YIELDS    SQL_CANCELED    LAST_ACTION     LAST_ACTION_REASON       LAST_ACTION_TIME    CURRENT_ACTIVE_TIME    ACTIVE_TIME    CURRENT_CONSUMED_CPU_TIME    CONSUMED_CPU_TIME
________ ____________________________ __________ _________ _________ _______________ ______________ ______________________ ______________________ ______________________ ______________ ____________________________ ____________________
   41150                        30407 RUNNING    TRUE             21               1 CANCEL_SQL     SWITCH_ELAPSED_TIME    2020-07-21 06:38:21                       168           5731                         5731                 5731


06:39:02 SQL> select id,name,cpu_wait_time,cpu_waits,consumed_cpu_time,yields,sql_canceled 
              from v$rsrc_consumer_group;


      ID            NAME    CPU_WAIT_TIME    CPU_WAITS    CONSUMED_CPU_TIME    YIELDS    SQL_CANCELED
________ _______________ ________________ ____________ ____________________ _________ _______________
   30409 MEDIUM                         0            0                    0         0               0
   30406 TPURGENT                       0            0                    0         0               0
   30407 TP                           286           21                 5764        21               1
   30408 HIGH                           0            0                    0         0               0
   30410 LOW                            0            0                    0         0               0
   19515 OTHER_GROUPS                 324           33                18320        33               0

You can see one SQL canceled here in the TP consumer group, and my session had consumed about 5.7 seconds of CPU time.

I could have set the same programmatically with:


exec cs_resource_manager.update_plan_directive(consumer_group => 'TP', elapsed_time_limit => 5);
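
Presumably the CPU/IO shares mentioned above can be set through the same package; a minimal sketch, assuming your version of CS_RESOURCE_MANAGER.UPDATE_PLAN_DIRECTIVE exposes a shares parameter (check the documentation of your service):

exec cs_resource_manager.update_plan_directive(consumer_group => 'TP', shares => 8);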

So, rather than limiting the API, it is better to give full SQL possibilities and limit the resources used per service: it makes sense to accept only short queries from the transaction processing services (TP/TPURGENT) and allow more time, but fewer shares, for the reporting ones (LOW/MEDIUM/HIGH).

This article, The myth of NoSQL (vs. RDBMS) “a simpler API to bound resources”, first appeared on Blog dbi services.

ODA: odacli now supports Data Guard in 19.8


Introduction

I’ve been dreaming of this kind of feature, just because most ODA configurations now include Disaster Recovery capabilities, through Data Guard or Dbvisit Standby. While Dbvisit will obviously never be integrated into odacli, the lack of Data Guard features is now addressed by the very latest 19.8 ODA software appliance kit.

How was Data Guard implemented before 19.8?

Those who have been using ODA and Data Guard for a while know that ODA was not aware of a Data Guard configuration. With odacli, you can create databases, mostly primaries, and you can also create an instance, which is a database without any files, just to get the database item in the repository and a started instance. Creating a standby was done with the standard tools: RMAN for database duplication, and possibly system commands to edit and copy the pfile to the standby server. Configuring Data Guard was done with the Data Guard broker dgmgrl, nothing specific to ODA. Quite a lot of steps were also needed: the standby redo log creation, the unique naming of the databases, the configuration of standby_file_management, and so on. Actually quite a lot of operations, not so easy to fully automate.
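
For example, the broker part alone looked like this; a minimal sketch with hypothetical DB_UNIQUE_NAMEs and TNS aliases (PROD_SITE1 as primary, PROD_SITE2 as standby):

DGMGRL> create configuration 'PROD' as primary database is 'PROD_SITE1' connect identifier is PROD_SITE1;
DGMGRL> add database 'PROD_SITE2' as connect identifier is PROD_SITE2 maintained as physical;
DGMGRL> enable configuration;
DGMGRL> show configuration;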

What are the promises of this 19.8?

As this 19.8 is brand new, I picked this up from the documentation. I’m quite impatient to test these features as soon as possible.

First, odacli is now aware of Data Guard and can manage a Data Guard configuration. A Data Guard configuration is now an object in the repository: you can manage multiple Data Guard configurations, linking primary and standby databases together. You can also do the SWITCHOVER and FAILOVER operations with odacli; the use of dgmgrl doesn’t seem mandatory here.

The goal of odacli’s Data Guard integration is to simplify everything, as this is the purpose of an appliance.

What are the prerequisites?

For using Data Guard on ODA you will need:
– at least 2 ODAs
– at least 2 different sites (reliable Disaster Recovery absolutely requires different sites)
– similar ODA configuration: mix of lite and HA ODAs is not supported
– Enterprise Edition on your ODAs: Data Guard is embedded in Enterprise Edition and does not exist in Standard Edition 2
– Twice the database size in space (never forget that)
– similar database configuration (shape and settings) and storage configuration. Mix of ASM and ACFS is not supported. Actually, these are best practices.
– All ODAs deployed or upgraded to 19.8 or later but with the same version
– OS customizations, if any, should be the same on all the ODAs
– ODA backup configuration should exist, to OCI Object Storage or to NFS server (odacli create-backupconfig and odacli modify-database)
– your primary database should already exist

What are the steps for creating a Data Guard configuration?

First, create a backup on the primary with:
odacli create-backup
Once done, save the backup report to a text file with:
odacli describe-backupreport
Copy this backup report to the standby ODA and do the restore with:
odacli irestore-database
You will need to restore this database with the STANDBY type (this will basically flag the controlfile as a standby controlfile).

From now on, you have two nearly identical databases. You can now create the Data Guard configuration, from the primary ODA, with:
odacli configure-dataguard
This command will prompt you for various parameters, like the standby ODA IP address, the name of the Data Guard configuration you want, the network to use for Data Guard, the transport type, the listener ports, the protection mode, and so on. What’s interesting here is that you can also provide a json file with all these parameters, the same way you do when you deploy the appliance. This is far more convenient, and much faster:
odacli configure-dataguard -r my_DG_config.json
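
Putting these steps together, the whole flow may look like the following sketch (the database name, backup report id and file names are hypothetical placeholders; check the 19.8 documentation for the exact option that restores the database as a standby):

# on the primary ODA: level-0 backup, then save the backup report
odacli create-backup -in MYDB -bt Regular-L0
odacli describe-backupreport -i <backup_report_id> > backup_report_MYDB.json
# copy backup_report_MYDB.json to the standby ODA and restore it there with the STANDBY type
odacli irestore-database -r backup_report_MYDB.json
# back on the primary ODA: create the Data Guard configuration
odacli configure-dataguard -r my_DG_config.json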

You’ll also have to manage the TrustStore passwords, as described in the documentation. I don’t know if this is new, but I never had to manage it before.

How to manage Data Guard through odacli?

To have an overview of all your Data Guard configurations, you can use:

odacli list-dataguardstatus

As you can imagine, each Data Guard configuration is identified by an id, and with this id you can get a detailed view of the configuration:

odacli describe-dataguardstatus -i xxx

odacli is even able to do the switchover and failover operations you were doing with dgmgrl before:

odacli switchover-dataguard -i xxx
odacli failover-dataguard -i xxx

In case of a failover, you probably know that reinstating the old primary is needed (because its current SCN is probably greater than the SCN on the standby when the failover was done). You can do the reinstate with odacli too:

odacli reinstate-dataguard -i xxx

Other features

With ODA, you can quite simply migrate a database from one home to another. This feature now supports a Data Guard configuration, but you will have to manually disable transport and apply during the migration with dgmgrl. It’s quite nice to be able to keep the configuration and not have to rebuild it in case of a database home migration.
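
A minimal sketch of that manual part, assuming the same hypothetical DB_UNIQUE_NAMEs as in the broker example above:

DGMGRL> edit database 'PROD_SITE1' set state='TRANSPORT-OFF';
DGMGRL> edit database 'PROD_SITE2' set state='APPLY-OFF';
... migrate the database to the new home with odacli ...
DGMGRL> edit database 'PROD_SITE1' set state='TRANSPORT-ON';
DGMGRL> edit database 'PROD_SITE2' set state='APPLY-ON';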

If you need to remove a Data Guard configuration, you can do that through odacli: odacli deconfigure-dataguard -i xxx

If you want to use a dedicated network for Data Guard, this is also possible with the network management option of odacli.

Conclusion

The Data Guard management was missing on ODA, and this new version seems to bring nearly all the features. Let’s try it on the next projects!

This article, ODA: odacli now supports Data Guard in 19.8, first appeared on Blog dbi services.
