Theme 2 - Bologna Unit
• Stefano Rizzi
• Dario Maio
• Matteo Golfarelli
• Ettore Saltarelli
Techniques for logical design and efficient querying of DWs
• NGPSJ expressions (Nested Generalized Projection / Selection / Join)
• Materialization of views based on a complex workload
• Estimation of the cardinality of candidate views, taking into account cardinality constraints suggested by the application domain
• Querying techniques (CS)
Nested GPSJ expressions: higher expressive power
• NGPSJ expressions extend GPSJ expressions (Gupta-Harinarayan-Quass) by considering nesting.
  – Generalized projection: a generalized projection π_P,M(R) is an extension of duplicate-eliminating projection, where P is a set of GROUP-BY attributes and M is a set of aggregate measures, each defined by applying an aggregate function to an expression involving attributes in R.
  – GPSJ expression: a GPSJ expression is a selection over a generalized projection over a selection over a set of joins.
• Nesting GPSJ expressions means using the result of one expression as the input for another, through n levels from the innermost (level 1) to the outermost (level n):
  – Sequences of aggregate operators can be applied to the same measure
  – Selections may affect the results of aggregations
  – Derived measures can be added
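The definitions above can be illustrated with a minimal Python sketch. It is not the authors' formalism: the helper `generalized_projection`, the `sales` rows and all attribute names are hypothetical, chosen only to show a generalized projection (GROUP-BY attributes P plus aggregate measures M) and one level of nesting in which a selection on an aggregated measure changes the result of the outer aggregation.

```python
from collections import defaultdict

def generalized_projection(rows, group_by, measures):
    """Group `rows` (list of dicts) on `group_by` and compute `measures`.

    `measures` maps an output name to (aggregate_function, input_attribute).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in group_by)].append(row)
    result = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for name, (func, attr) in measures.items():
            out[name] = func([m[attr] for m in members])
        result.append(out)
    return result

# Hypothetical fact rows (illustration only).
sales = [
    {"store": "S1", "month": "Jan", "qty": 10},
    {"store": "S1", "month": "Jan", "qty": 30},
    {"store": "S1", "month": "Feb", "qty": 20},
    {"store": "S2", "month": "Jan", "qty": 5},
]

# Inner expression: total quantity per (store, month).
monthly = generalized_projection(sales, ["store", "month"], {"total": (sum, "qty")})

# Selection on the aggregated measure, then an outer generalized projection:
# a second aggregate (max) applied on top of the first (sum).
selected = [r for r in monthly if r["total"] > 10]
best = generalized_projection(selected, ["store"], {"best_month_total": (max, "total")})
```

Here the selection `total > 10` removes store S2 before the outer aggregation, which is the kind of interaction between selections and aggregations that a non-nested GPSJ expression cannot express.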
The ancestor
• Rewritable: given two NGPSJ expressions e, e' on schema S, we say that e' is rewritable on e if, by applying a sequence of generalized projections and selections to e, it is possible to obtain an NGPSJ expression that is equivalent to e'.
• Ancestor: given two NGPSJ expressions e and e', the ancestor of e and e' is the least NGPSJ expression on which both e and e' can be rewritten.
View materialization
• In our approach a candidate view:
  – may contain a subset of the tuples at a given aggregation pattern, as a consequence of selections on dimension attributes and measures;
  – may contain only a subset of the measures in the fact table, as a result of projections;
  – may include measures obtained by applying different aggregation sequences to the same measure;
  – may include derived measures, as well as support measures needed to answer queries based on algebraic operators.
• The candidate views and the relationships between them can be represented in a query view graph.
Results
• Computation of the ancestor of two NGPSJ expressions
• Comparison between two NGPSJ expressions
• Construction of the query view graph for a given workload, i.e. determination of the set of candidate views
Estimating the cardinality of views
• Accurately estimating the actual cardinality of views in DWs is crucial for logical and physical design as well as for query processing and optimization.
• If the DW has already been loaded, cardinalities can be estimated by using statistical techniques based on histograms or sampling.
• Such techniques cannot be applied if the data warehouse is still under development and view cardinalities must be estimated for design purposes.
• Current approaches are based on estimation models that exploit only the cardinality of the base cube and that of the single attribute domains, which leads to significant overestimation.
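To make the overestimation concrete, here is a small Python sketch of one widely used model of this kind, Cardenas' formula, which uses only the base cube cardinality and the attribute domain cardinalities. Whether this is the specific model the slides refer to is an assumption; the function name and the numbers are illustrative only.

```python
import math

def cardenas_estimate(base_cardinality, domain_cardinalities):
    """Expected number of distinct groups when `base_cardinality` tuples are
    spread uniformly over the product of the attribute domains (Cardenas)."""
    d = math.prod(domain_cardinalities)
    return d * (1.0 - (1.0 - 1.0 / d) ** base_cardinality)

# Hypothetical view on (city, region): 100 cities, 10 regions, 10,000 base tuples.
estimate = cardenas_estimate(10_000, [100, 10])

# Each city belongs to one region, so the view can never exceed 100 tuples,
# yet the model (which ignores this domain knowledge) returns close to 1000.
true_upper_bound = 100
```

The gap between `estimate` and `true_upper_bound` comes entirely from a domain constraint (city determines region) that a model based only on domain sizes cannot see.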
Approach overview
• We propose a novel approach to estimating the cardinality of views, based on a-priori information derived from the application domain (cardinality constraints).
• Two-step approach: first compute bounds for the cardinality, then determine a probabilistic estimate for it.
Cardinality constraints
1. Lower (w⁻) and/or upper (w⁺) bound on the cardinality w of a view W;
2. k-dependency X →ₖ Y, expressing an upper bound on the ratio between the cardinalities of two views X and Y.
• k-dependencies naturally generalize functional dependencies and are useful to characterize the knowledge of the business domain held by the experts in the field.
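A minimal Python sketch of how a k-dependency could be read and used. It assumes the interpretation "each value of X is associated with at most k distinct values of Y", under which the view on X ∪ Y has at most k times as many tuples as the view on X; the slide only states that k bounds a ratio of cardinalities, so this reading, the function names and the sample rows are all assumptions.

```python
def smallest_k(rows, x_attrs, y_attrs):
    """Smallest k such that every X value co-occurs with at most k distinct Y values."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in x_attrs)
        y = tuple(row[a] for a in y_attrs)
        seen.setdefault(x, set()).add(y)
    return max((len(ys) for ys in seen.values()), default=0)

def upper_bound_xy(upper_bound_x, k):
    """Upper bound on the cardinality of the view on X ∪ Y, given a bound on X."""
    return k * upper_bound_x

# Hypothetical rows: each product is sold under at most two promotions.
rows = [
    {"product": "P1", "promo": "A"},
    {"product": "P1", "promo": "B"},
    {"product": "P2", "promo": "A"},
    {"product": "P1", "promo": "A"},
]

k = smallest_k(rows, ["product"], ["promo"])   # 2 for these rows
bound = upper_bound_xy(upper_bound_x=2, k=k)   # at most 4 (product, promo) pairs
```

With k = 1 the constraint reduces to an ordinary functional dependency from X to Y, which is the sense in which k-dependencies generalize functional dependencies.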
Results
• Bounding strategy to determine an upper bound for the cardinality of V, given a set of cardinality constraints.
• In the absence of k-dependencies:
  – Domination and minimality results
  – Branch-and-bound approach
• Preliminary results on domination in the presence of k-dependencies
• Preliminary results on lower bounding
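For the case without k-dependencies, the sketch below shows the basic idea of upper bounding in Python. It relies on a standard fact: if the attributes of V are covered by views X1, ..., Xm, the cardinality of V cannot exceed the product of their cardinalities. It enumerates all covers exhaustively rather than using the branch-and-bound approach named above, so it is a simplified stand-in, not the authors' algorithm; attribute names and bounds are hypothetical.

```python
from itertools import combinations
from math import prod

def upper_bound(view, known_bounds):
    """Tightest product bound for `view` (a frozenset of attributes).

    `known_bounds` maps frozensets of attributes to known upper bounds.
    Returns None if no combination of known views covers `view`.
    """
    best = None
    views = list(known_bounds)
    for r in range(1, len(views) + 1):
        for combo in combinations(views, r):
            if view <= frozenset().union(*combo):       # combo covers the view
                bound = prod(known_bounds[x] for x in combo)
                if best is None or bound < best:
                    best = bound
    return best

# Hypothetical upper bounds supplied by domain experts.
known = {
    frozenset({"store"}): 100,
    frozenset({"product"}): 1000,
    frozenset({"date"}): 365,
    frozenset({"store", "product"}): 20_000,
    frozenset({"store", "product", "date"}): 2_000_000,  # base cube
}

ub_product_date = upper_bound(frozenset({"product", "date"}), known)
ub_store_product = upper_bound(frozenset({"store", "product"}), known)
```

Exhaustive enumeration is exponential in the number of constrained views; pruning covers that cannot improve on the current best bound is what a branch-and-bound strategy would add.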
Optimal index selection in data warehouse systems
1 Introduction
  1.1 Logical design
  1.2 Physical design
2 Architecture of the index selection component
3 Determining the access plans
  3.1 Characterization of the queries
  3.2 CBO and RBO
  3.3 Elements of an access plan
  3.4 Plan selection algorithm
4 The cost model
  4.1 Basic costs
  4.2 Composite cost
5 Index selection algorithm
  5.1 Domination between indexes
  5.2 Description of the algorithm
6 Conclusions and open issues
Architecture of the index selection component
[Figure: architecture diagram. Inputs: Workload, Data Volume, System Constraints. Modules: Physical Scheme Generation, Execution Plan Generation, Cost Evaluation. Output: Physical Scheme. Labeled flows: queries, physical scheme, execution plan, cardinalities, constraints, cost.]
Access plan selection (1)
• RedBrick and Oracle
• Rule-Based Optimizer and Cost-Based Optimizer
• Indexes considered:
  – B+-tree / Bitmap Index
  – On single attributes and on primary keys
• Types of execution plan considered:
  – Nested Loops
  – Hybrid Hash Join
Access plan selection (2)
• Basic components:
  – Table Scan (TS): operator TS(table)_Pred; output {(tid, value)}
  – Index Scan (XS): operator XS(index)_Pred; output {(tid, value)}
  – Table Access (TA): input tid; operator TA(table)_Pred; output (tid, value)
  – Index Access (XA): input value; operator XA(index); output {(tid, value)}
  – Hash Join (HJ): input [{value}; {value}]; operator HJ(table, table)_Pred; output {(tid, value)}
  – TID Intersection (TI): input [{tid}; …; {tid}]; operator TI(table); output {tid}
  – Aggregation (AG): input [{value}; …; {value}]; operator AG(); output {value}
Cost model
• Cost = number of disk pages accessed
• A cost function is defined for each basic component
• The cost of a plan is determined from the elementary costs and the output cardinalities
Example plan: TS(FT)_PredM → [XA(PK_DT1) → TA(DT1); XA(PK_DT2) → TA(DT2)] → AG()
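The composition of elementary costs can be sketched as follows in Python. The slides do not give the actual cost functions, so everything here is a labeled assumption: a table scan reads every page of the table, an index access reads one page per B+-tree level, a table access by tid reads one page, and there is no buffering. The plan mirrors a nested-loops shape: scan the fact table, then probe each dimension's primary-key index and table once per surviving fact tuple.

```python
import math

def table_scan_cost(n_tuples, tuples_per_page):
    # Assumption: a scan reads every page of the table exactly once.
    return math.ceil(n_tuples / tuples_per_page)

def index_access_cost(tree_height):
    # Assumption: one page read per B+-tree level.
    return tree_height

def table_access_cost():
    # Assumption: fetching one tuple by tid costs one page, with no buffering.
    return 1

def nested_loops_plan_cost(ft_tuples, ft_tuples_per_page, scan_output_card, dim_index_heights):
    """Pages read by: scan fact table, then one index probe and one tuple
    fetch on every dimension for each fact tuple that survives the predicate."""
    cost = table_scan_cost(ft_tuples, ft_tuples_per_page)
    for height in dim_index_heights:
        cost += scan_output_card * (index_access_cost(height) + table_access_cost())
    return cost

# Hypothetical volumes: 1,000,000 fact tuples, 100 per page, 10,000 survive
# the predicate, two dimension tables with primary-key indexes of height 2.
plan_cost = nested_loops_plan_cost(1_000_000, 100, 10_000, [2, 2])
```

The output cardinality of the scan (`scan_output_card`) drives the cost of every downstream component, which is why cardinality estimates feed directly into the cost evaluation.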
Index selection algorithm
• Generation of the set of possible indexes
• Elimination of dominated index sets
  – intra-view domination
  – inter-view domination
• Determination of the optimal set
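The three steps above can be sketched in Python under a strong simplification. The slides do not define domination, so the sketch assumes a Pareto-style criterion: one candidate index set dominates another if it is no worse on both workload cost and disk space and strictly better on at least one. The distinction between intra-view and inter-view domination is not modeled, and the candidate names and figures are hypothetical.

```python
def dominates(a, b):
    """Assumed criterion: `a` is no worse than `b` on cost and space, better on one."""
    return (a["cost"] <= b["cost"] and a["space"] <= b["space"]
            and (a["cost"] < b["cost"] or a["space"] < b["space"]))

def prune_dominated(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

def pick_best(candidates, space_budget):
    """Cheapest non-dominated candidate that fits the space budget, or None."""
    feasible = [c for c in prune_dominated(candidates) if c["space"] <= space_budget]
    return min(feasible, key=lambda c: c["cost"], default=None)

# Hypothetical candidate index sets: workload cost in pages, space in MB.
candidates = [
    {"name": "A", "cost": 100, "space": 50},
    {"name": "B", "cost": 120, "space": 60},   # dominated by A
    {"name": "C", "cost": 80, "space": 90},
    {"name": "D", "cost": 150, "space": 20},
]

survivors = [c["name"] for c in prune_dominated(candidates)]
best = pick_best(candidates, space_budget=60)
```

Pruning dominated candidates first shrinks the search space without ever discarding the optimum, because any budget that admits a dominated candidate also admits the candidate that dominates it.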
Deliverables planned for phase 3
• D2.P1: prototype developed for logical-physical design