1 Information Architectures course (Corso di Architetture della Informazione), Academic Year 2009-2010, Carlo Batini. 5.4.1 Schema integration in data integration architectures.

2 Note: the symbol marks a slide provided for further study

3 Two types of integration: integration of schemas; integration of databases

4 Data integration. Schema 1. Typical technique used: record linkage

5 Relevant phases of schema integration. Figure labels: Requirements 1, Requirements 2; A B C D; B D E F; E F; A B C D

6 Integration of conceptual schemas: Schema 1, Schema 2, ..., Schema n → Integrated schema. In what follows we look at this case

7 Examples used

8 Warning: this handout uses several formalisms, whose definitions and graphical notations are 1. variants of the Entity Relationship model with generalizations described in the Databases 1 course, or 2. an Object Oriented model, 3. an example of a spatial database schema, 4. an ontological/taxonomic description, 5. a relational model

9 Library Example (homogeneous models). Schema 1: Book (ISBN, title, authors (name, birthdate)). Schema 2: Author (name, birthdate, books (title, ISBN)).

10 Thesis example (heterogeneous models). Schema S1 (OO): Person (Pin, Name) with subclasses Student (GPA) and Faculty (Rank). Schema S2 (relational): Thesis (Phd-advisor, Phd-student, title). The integrated schema (OO), figure labels: Person (Pin, Name), Student (GPA), Faculty (Rank), PhD Student, Phd-advisor, Thesis (Title), Adv.Student.

11 Spatial Example. Schema 1, figure labels: Road-Section, Overstepping, Node; links begin, end, on/under. Schema 2, figure labels: Way-Section, Bridge, Separator, Toll, Crossroad, End-of-Tunnel, Extremity; links on, separate, under, end, begin.

12 Marriage Example. S1: Person (Pin, name, gender); Marriage (wife, husband, date). S2: Woman (Pin, name); Man (Pin, name); Married to (0:1).

13 Two ontologies in Protégé: S and S'

14 Generation of the global schema. Schema S1: 1. Person (SSN, Age, City); 2. City (Name, Region). Schema S2: 1. Person (SSN, City)

15 Problems in the integration of conceptual schemas. The main purpose of integration is to identify all the portions of the different conceptual schemas that refer to the same aspect of the reality of interest, in order to unify their representation. The approach is oriented to the identification, analysis and resolution of conflicts among schemas. The types of schema conflicts were described in handout 3.1 and are recalled in the next slides.

16 Schema-level heterogeneities. NB: heterogeneity, conflict and correspondence are used as synonyms in what follows. They are of two types: name heterogeneities and type heterogeneities. In the more general context we also have to consider model heterogeneities.

17 Name heterogeneities (conflicts/correspondences)
– Synonyms: different names for the same concept, e.g. employee, clerk; exam, course; code, num
– Homonyms: the same name for different concepts, e.g. City as city of birth in one schema and as city of residence in another schema
– Hyperonymies: two concepts related by an IS-A relationship

18 Type heterogeneities. The same concept is represented with different conceptual structures in two schemas:
– Different definition domains for the same attribute in two schemas
– Attribute in one schema and derived value in another schema
– Attribute in one schema and entity in another schema
– Attribute in one schema and generalization hierarchy in another schema
– Entity in one schema and relationship in another schema
– Different abstraction levels for the same concept in two schemas, e.g. two entities with homonym names related by an IS-A hierarchy in two schemas
– Different granularities in the definition domains
– Different cardinalities in the same relationships
– Key conflicts

19 Examples of heterogeneities in spatial databases

20 Spatial Example: same space, overlapping and non-overlapping contents. Figure labels: DB1, DB2, DB3; Toll, Bridge, Tunnel; Toll, Overstepping, Tunnel; Overstepping, Tunnel.

21 Schema 1. Figure labels: Road-Section, Overstepping, Node; links begin, end, on/under; Toll, Overstepping, Tunnel, Overstepping, Tunnel.

22 Schema 2. Figure labels: Way-Section, Bridge, Separator, Toll, Crossroad, End-of-Tunnel, Extremity; links on, separate, under, end, begin; Toll, Bridge, Tunnel.

23 An integration methodology

24 Input/output view when the schemas adopt the same model: S1, S2, ..., Sn → Schema Integration → IS

25 Input/output view when schemas use different models (model heterogeneities): each source schema goes through reverse engineering into a common model CM, then schema integration with the common model produces the IS. It is convenient to adopt a semantically rich model, because the integration phases can then be carried out with more knowledge available; but not too rich, so as not to make the integration too complex.

26 A Generic Framework for Integration. Figure labels: schemas; schemas transformation (transformation rules); schemas matching (investigation rules); schemas integration (integration rules); integrated schema; mapping rules; data integration system.

27 Phases of the methodology: inputs, outputs, and methods used
0. Define the integration strategy. Input: n source schemas. Output: n source schemas + integration strategies. Method used: heuristics.
1. Schema transformation (or pre-integration). Input: n source schemas. Output: n homogenized source schemas. Methods used: model transformation + reverse engineering.
2. Correspondences investigation. Input: n source schemas. Output: n source schemas + correspondences. Method used: techniques to discover correspondences.
3. Schemas integration and mapping generation. Input: n source schemas + correspondences. Output: integrated schema + mapping rules between the integrated schema and the input source schemas. Method used: new classification of conflicts + conflict resolution transformations.

28 Step 0: Define the integration strategy

30 Strategies. If there are many schemas to integrate (at least n > 5), one issue to address at the outset is the order in which the schemas are integrated. Several strategies exist, each with advantages and disadvantages relative to the others.

31 One-shot strategy: S1, S2, ..., Sn → IS. + Optimizes the integration process. - Many correspondences among concepts of the different schemas have to be considered together.

32 "Two at a time" strategy: S1 and S2 give IS1, IS1 and S3 give IS2, and so on up to ISn. It is best to give priority to the most important and most stable schemas, so that the integration process proceeds more efficiently.

33 Balanced strategy: the schemas are partitioned into groups, each group is integrated into an intermediate schema (IS1, IS2, ..., ISn), and the intermediate schemas are then integrated into the IS. Example: Sales, Production, Marketing. Preferable when there is strong cohesion within groups of schemas.

34 Skeleton schema strategy. The skeleton schema SS represents the most relevant concepts of the universe of discourse. Its advantage is that the main heterogeneities have already been resolved in the SS, which makes the process more efficient.
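
The strategies on slides 31-34 differ mainly in the order in which an integration step is applied. The following is only a sketch under a strong assumption: `integrate` is a hypothetical stand-in that merely unions sets of concept names, and `balanced` pairs adjacent schemas instead of grouping them by cohesion.

```python
from functools import reduce

def integrate(*schemas):
    """Hypothetical integration step: union of the concept names."""
    merged = set()
    for schema in schemas:
        merged |= set(schema)
    return frozenset(merged)

def one_shot(schemas):
    # All n schemas are integrated in a single step.
    return integrate(*schemas)

def two_at_a_time(schemas):
    # IS1 = S1 + S2, IS2 = IS1 + S3, and so on.
    return reduce(integrate, schemas)

def balanced(schemas):
    # Integrate adjacent pairs, then pairs of intermediate schemas.
    level = list(schemas)
    while len(level) > 1:
        level = [integrate(*level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]

def with_skeleton(skeleton, schemas):
    # Start from the skeleton schema SS and fold each source schema into it.
    return reduce(integrate, schemas, frozenset(skeleton))

# Hypothetical concept sets for Sales, Production and Marketing.
sales = {"Customer", "Order"}
production = {"Order", "Product"}
marketing = {"Product", "Campaign"}
print(sorted(balanced([sales, production, marketing])))
```

With a plain union every strategy yields the same result; in a real integration step the order matters because conflicts are resolved pair by pair.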

35 1. Pre Integration

37 Step 1: Pre-Integration. Given the heterogeneity of the local source models (DBMSs, GISs, XML, UML, OWL, RDF, WSDL, …), the goal of Step 1 is to reduce these model heterogeneities as much as possible, to make the sources more suitable for integration. REMARK: often the source data sets cannot be modified – Keep current application programs alive – Proprietary sources with independent use – => Modifications through a view mechanism

38 1. Pre-integration is made of: 1.1 Data model homogenization; 1.2 Design homogenization; 1.3 Reverse engineering

39 1.1. Data Model Homogenization (source DBs → transformation → homogenized DBs → integration → DW). Goal: use a single, common data model and format – A rich one facilitates translations: semantic data model, OO, F-Logic – A poor one reduces the number of structural conflicts: binary relations, … – A standard one (Entity Relationship with generalizations) is a frequent pragmatic choice

40 1.2 Design Homogenization. Make the semantics of the schema explicit and enforce standard design rules to reduce the number of structural conflicts – Relational normalization: one fact in one place – Entity relationship normalization – Complex object normalization – E.g. objectify multivalued attributes: Person with a multivalued attribute Child becomes Person and Child linked by Family (1,n)

41 Example of relational normalization: R1 (#Student, Name, LastName, #Course, CourseName, Grade, Date) → R11 (#Student, Name, LastName), R12 (#Course, CourseName), R13 (#Student, #Course, Grade, Date)
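
The decomposition on slide 41 can be illustrated as three projections of R1. This is a minimal sketch; the sample rows are invented for illustration and are not part of the course material.

```python
def project(rows, attributes):
    """Relational projection with duplicate elimination."""
    result = []
    for row in rows:
        projected = {a: row[a] for a in attributes}
        if projected not in result:
            result.append(projected)
    return result

# Hypothetical instance of R1: one fact (e.g. a student's name) is repeated.
R1 = [
    {"#Student": 1, "Name": "Ada", "LastName": "Rossi", "#Course": 10,
     "CourseName": "Databases", "Grade": 28, "Date": "2010-01-15"},
    {"#Student": 1, "Name": "Ada", "LastName": "Rossi", "#Course": 20,
     "CourseName": "Networks", "Grade": 30, "Date": "2010-02-01"},
    {"#Student": 2, "Name": "Leo", "LastName": "Bianchi", "#Course": 10,
     "CourseName": "Databases", "Grade": 25, "Date": "2010-01-15"},
]

R11 = project(R1, ["#Student", "Name", "LastName"])
R12 = project(R1, ["#Course", "CourseName"])
R13 = project(R1, ["#Student", "#Course", "Grade", "Date"])
print(len(R11), len(R12), len(R13))
```

After the decomposition each student and each course is stored once, which is the "one fact in one place" rule of slide 40.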

42 Filling the semantic gap: 1.3 Reverse engineering step. Legacy data sources: COBOL files, network (CODASYL-like) databases, relational databases, object-oriented databases, spreadsheets. Problem: what do these legacy data sources contain? A detailed methodology for reverse engineering is provided in slides 4.2, which are not part of the exam.

43 In our example there are no transformations to perform on the schemas. Schema S1: 1. Person (SSN, Age, City); 2. City (Name, Region). Schema S2: 1. Person (SSN, City)

44 Step 2. Schema matching (or correspondence investigation)

46 Definition of schema matching. Formally, schema matching is defined as follows:
– Given two input schemas, in any data model, an input mapping and, optionally, auxiliary information, compute a mapping between the elements of the two schemas.
– We define the Match operator as a function that takes two schemas S1 and S2 as input and returns a mapping between the two schemas, called the match result.
– Mapping: a set of elements linked by correspondences (mapping elements), each of which maps some elements of the first schema to some elements of the second schema.
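
To make the shape of the Match operator concrete, here is a toy sketch. The `MappingElement` type, the name-equality test and the synonym dictionary passed as auxiliary information are all assumptions made for illustration, not the definition used in the course.

```python
from typing import NamedTuple

class MappingElement(NamedTuple):
    source: tuple    # elements of the first schema
    target: tuple    # elements of the second schema
    relation: str    # kind of correspondence

def match(s1, s2, auxiliary=None):
    """Toy Match operator: pair elements with equal names (ignoring case)
    or listed as synonyms in the optional auxiliary information."""
    synonyms = auxiliary or {}
    result = []
    for a in s1:
        for b in s2:
            if a.lower() == b.lower() or synonyms.get(a) == b:
                result.append(MappingElement((a,), (b,), "EQUAL"))
    return result

# Attributes of Person in S1 and S2, taken from the running example.
for element in match(["SSN", "Age", "City"], ["SSN", "City"]):
    print(element)
```

A purely name-based matcher pairs City with City, although slide 97 flags an interschema homonymy in this example; this is one reason why matching needs more than name comparison.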

47 Why schema matching is a difficult task

48 Semantic Relativism: information representation depends on perception. Figure labels: cable classified by color (blue, red); cable classified by material (copper, fiber); cable with attributes color and material.

49 Semantics of correspondences Real World Schema 1 X Schema 2 Y Database System

50 Examples of correspondences Schema S1 (OO) Schema S2 (relational) Thesis (Phd-advisor, Phd-student, title) Person PinName Student GPA Faculty Rank

51 Book ISBNauthorstitle namebirthdate Author namebooks titleISBN Birth date Some Library Correspondences Schema 1 Schema 2

52 Complex correspondences in spatial databases A road-section of DB1 corresponds to Several way-sections and separator-sections of DB2. DB2 : 4 WaySections 2 Separators DB1 : 1 Road Section Way Separator Road

53 A language for expressing correspondences in schema matching

54 A language for asserting Correspondences Syntax: S1 thing1 set_relationship S2 thing2 With Corresponding Identifiers (WCI): thing1id-predicate = thing2id-predicate [With Corresponding Properties (WCP): thing1attribute set_relationship thing2attribute,... ]

55 Where… Predicate is any expression of a query language that allows instances to be selected. In what follows we choose relational algebra, hence with projection, selection and join operators. set_relationship: EQUAL, CONTAIN, CONTAINED_IN, INTERSECT, DISJOINT
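
The five set relationships can be read as relationships between the sets of real-world objects that two schema elements describe. A minimal sketch, assuming the instances are identified by comparable keys such as Pin values (the data below is invented):

```python
def set_relationship(a, b):
    """Classify the relationship between two sets of instance identifiers."""
    a, b = set(a), set(b)
    if a == b:
        return "EQUAL"
    if a > b:            # proper superset
        return "CONTAIN"
    if a < b:            # proper subset
        return "CONTAINED_IN"
    if a & b:
        return "INTERSECT"
    return "DISJOINT"

# Hypothetical Pin values; the selection predicate is gender = "F".
persons = [(1, "M"), (2, "F"), (3, "M"), (4, "F")]
all_pins = {pin for pin, _ in persons}
women = {pin for pin, gender in persons if gender == "F"}
men = all_pins - women

print(set_relationship(all_pins, women | men))
print(set_relationship(all_pins, women))
print(set_relationship(women, men))
```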

56 Example of Schema Correspondences S1.Person  S2.Person, WCI: Pin, WCP: name S1.Marriage  S2.Marriage, WCI: S1.  [gender=“F”] Marriage.Person = S2.Marriage.wife, WCP: date contract# date Marriage R Pin name Person Marriage 2:2 0:1 Pin name gender Person wife husband S1 S2 date

57 Correspondences in the Library Example (homogeneous). Schema 1: Book (ISBN, title, authors (name, birthdate)). Schema 2: Author (name, birthdate, books (title, ISBN)). Correspondences: Book EQUAL books; authors EQUAL Author; Book--authors EQUAL books--Author

58 Corresponding schema elements, Thesis example. Schema S1 (OO): Person (Pin, Name), Student (GPA), Faculty (Rank). Schema S2 (relational): Thesis (Phd-advisor, Phd-student, title). Correspondences: Faculty CONTAIN Phd-advisor; Student CONTAIN Phd-student

59 1:N Correspondences Marriage Example S1.Person  S2.(Man U Woman), WCI: Pin, WCP: name S1.   [gender=“F”] Person  S2.Woman, WCI: Pin, WCP: name S1.  [gender=“M”] Person  S2.Man, WCI: Pin, WCP: name S1 S2 Pin name gender Person Marriage wife husband date Woman Married to 0:1 Pin name Man Pin name

60 Road-Section Overstepping Node begin end on/under Way-Section Bridge Separator TollCrossroadEnd-of-Tunnel Extremity on separate under end begin Schema 1 Schema 2 Correspondences in the spatial example

61 N:M Correspondences Two road-sections of DB1 correspond to several way-sections and separator-sections of DB2. RoadSection CONTAINED-IN SET ( WaySection, Separator ) WCI: road_way (R: RoadSection, W: SET (WaySection), S: SET (Separator) ) % R.geometry = GENERALIZATION ( AGGREGATION (W,S).geometry ) DB1 DB2

62 Techniques for Schema matching (Correspondence discovery)

63 A typical matching architecture. Figure labels: target schema T and source schema S; domain knowledge and data; similarity estimator; similarity matrix; match generator; match candidates; match selector; 1-1 and complex matches; explanation module; user.

64 An example that shows the different phases intuitively, in sequence; afterwards we go deeper mainly into the techniques for the match selector and similarity estimator steps

65 Context. Suppose an e-commerce company has to acquire another company. In doing so it will have to integrate the databases of the two companies. The documents of both companies are stored according to the XML schemas S and S' that follow.

66 Match selector (S, S'). A first step in schema integration is to identify the candidates to merge. This step corresponds to a schema matching activity. For example, the elements labelled Electronics in S and in S' are candidates for merging, while the element labelled Digital Cameras in S' should be included in Photo and Cameras of S.

67 Equivalence (S, S'): candidate elements for equivalence

68 Generalization (S, S'): candidate elements for generalization

69 Disjointness (S, S'): candidate elements for disjointness

70 Similarity estimator and match selector. (1) We use matching algorithms based on linguistic and structural analysis (see the techniques below). (2) The confidence measure (for the equivalence relation) between the entities Photo_and_Cameras in S and Cameras_and_Photos in S' might be 0.67. (3) We use a threshold of 0.55 to determine the alignment.
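
The selection step with a threshold can be sketched as a filter over the similarity matrix. Only the 0.67 score and the 0.55 threshold come from the slide; the other scores are invented, and treating a score equal to the threshold as accepted is an assumption.

```python
def select_matches(similarity, threshold=0.55):
    """Keep the candidate pairs whose confidence reaches the threshold."""
    return sorted(
        (pair, score) for pair, score in similarity.items() if score >= threshold
    )

similarity = {
    ("Photo_and_Cameras", "Cameras_and_Photos"): 0.67,  # value from the slide
    ("Electronics", "Electronics"): 1.0,                # hypothetical
    ("Photo_and_Cameras", "Electronics"): 0.20,         # hypothetical
}

for pair, score in select_matches(similarity):
    print(pair, score)
```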

71 Techniques for discovery of matching candidates and similarity estimator

72 Classification of techniques (approaches)

74 Individual matchers. Lots of details in the following

76 Combining different matchers. Matchers can be combined using:
– A hybrid approach: several matching approaches are combined directly to determine the candidates, based on multiple criteria and information sources.
– A composite approach: the results of several independently executed matchers, including hybrid matchers, are combined.

77 Hybrid. Architecture: sequential (hybrid)

78 Composite. Architecture: parallel (composite). Aggregation (for example Min, Max, Average)
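
A composite combination can be sketched as an aggregation of the scores that independent matchers assign to the same pair. The matcher outputs below are invented, and scoring a missing pair as 0.0 is an assumption of this sketch.

```python
from statistics import mean

def combine(scores_by_matcher, aggregate=mean):
    """Aggregate, pair by pair, the scores produced by independent matchers."""
    pairs = set().union(*(scores.keys() for scores in scores_by_matcher))
    return {
        pair: aggregate([scores.get(pair, 0.0) for scores in scores_by_matcher])
        for pair in pairs
    }

# Hypothetical outputs of a string-based and a linguistic matcher.
string_scores = {("Photo_and_Cameras", "Cameras_and_Photos"): 0.5}
linguistic_scores = {("Photo_and_Cameras", "Cameras_and_Photos"): 0.75}

for aggregate in (min, max, mean):
    print(aggregate.__name__, combine([string_scores, linguistic_scores], aggregate))
```

Min is the most conservative choice, since a pair is accepted only if every matcher supports it; Max accepts a pair as soon as one matcher does.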

79 Classification of techniques (approaches)

80 Element-Level Techniques

81 Techniques 1. Element level, single elements are compared –String-based –Language-based –Constraint-based –Linguistic resources –Alignment reuse

82 1.1 String-based Techniques
– Prefix / Suffix: checks whether one string starts (ends) with the other. Prefix: net = network (good!), but also hot = hotel (bad!). Suffix: phone = telephone (good!), but also word = sword (bad!)
– Edit distance: the number of insertions / deletions / substitutions of characters required to transform one string into another, divided by max(length(string1), length(string2)). EditDistance (NKN, Nikon) = 2/5 = 0.4
– N-gram: the number of identical n-grams (i.e., sequences of n characters) shared by the two strings. The trigrams (n = 3) of the string nikon are nik, iko, kon. n-gram (2) (nikon, konnichiwa) = 3
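
The three string-based measures can be sketched in a few lines. Comparing in lower case is an assumption of this sketch; it is what makes NKN and Nikon differ by exactly two insertions.

```python
def is_prefix(a, b):
    """True if the longer string starts with the shorter one."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return long_.startswith(short)

def is_suffix(a, b):
    """True if the longer string ends with the shorter one."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return long_.endswith(short)

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    a, b = a.lower(), b.lower()
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def normalized_edit_distance(a, b):
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

def ngrams(text, n):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def common_ngrams(a, b, n):
    return len(ngrams(a, n) & ngrams(b, n))

print(normalized_edit_distance("NKN", "Nikon"))   # 0.4
print(sorted(ngrams("nikon", 3)))                 # ['iko', 'kon', 'nik']
print(common_ngrams("nikon", "konnichiwa", 2))    # 3
```

The prefix and suffix tests also accept hot/hotel and word/sword, which is exactly the weakness the slide points out.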

83 1.2 Language-based Techniques
– Elimination of stop words: tokens that are articles, prepositions, conjunctions, and so on, are marked to be discarded (a, the, by, type of)
– Tokenization: names are parsed into tokens by recognizing punctuation and case changes (Hands-Free_Kits =>)
– Lemmatization: tokens are morphologically analyzed in order to find all their possible basic forms (Kits => Kit)
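
The three language-based steps can be chained into a small normalization pipeline. This is only a sketch: the regular expressions are one possible tokenization, "type" and "of" are treated as two separate stop words, and the plural-stripping `lemmatize` is a deliberately naive stand-in for real morphological analysis.

```python
import re

STOP_WORDS = {"a", "the", "by", "type", "of"}

def tokenize(name):
    """Split on punctuation, underscores, whitespace and case changes."""
    tokens = []
    for part in re.split(r"[\W_]+", name):
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [token.lower() for token in tokens]

def lemmatize(token):
    # Naive assumption: only strips a plural "s" from tokens longer than 3 letters.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def normalize(name):
    return [lemmatize(t) for t in tokenize(name) if t not in STOP_WORDS]

print(tokenize("Hands-Free_Kits"))
print(normalize("Hands-Free_Kits"))
print(normalize("typeOfTheCameras"))
```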

84 1.3 Constraint-based Techniques. Datatype comparison: integer < real; date in [1/4/2005, 30/6/2005] < date [year = 2005]. Multiplicity comparison: [1:1] < [1:10]

85 Linguistic resources. Thesauri (common knowledge or domain-specific) are used to match words; the names of schema/ontology entities are treated as words of a natural language and matched on the basis of their linguistic relationships (for example, synonyms).

86 1.4 Linguistic resources - 1. Sense-based: WordNet hierarchy distance – It relies on an already available lexicon. – A lexicon is a set of terms linked by various relationships: – Synonymy, when two different terms have the same meaning, – Hyperonymy, when one of the two terms is a generalization of the other – Etc. The most extensive lexicon available for the English language is WordNet, http://wordnet.princeton.edu/. The lexicon can be extended with new terms and relationships between terms.

87 1.4 Linguistic resources - 2. Sense-based: WordNet hierarchy distance – Relations between schema entities can be computed in terms of lexical relationships. For example, an equivalence relation is returned if the distance between two input senses is less than a given threshold, e.g. red and pink, which are both chromatic colors – A ⊑ B if A is a hyponym of B: Brand ⊑ Name – B ⊑ A if A is a hyperonym of B: Greece ⊑ Europe – A = B if they are synonyms: Quantity = Amount

88 Gloss-based: WordNet gloss comparison. The number of identical words occurring in both input glosses increases the similarity value. The equivalence relation is returned if the resulting similarity value exceeds a given threshold. EXAMPLE: – The Maltese dog is a breed of small dogs that have a long, straight, silky white coat. – The Afghan hound is a graceful breed of hound; it is tall, with a long silky coat.
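
Gloss comparison can be sketched as counting the words two glosses share. The glosses below are an English rendering of the slide's two examples, not text taken from WordNet, and the stop-word list is an assumption.

```python
import re

STOP_WORDS = frozenset({"the", "is", "a", "of", "that", "have", "it", "with"})

def gloss_words(gloss):
    """Lower-cased content words of a gloss."""
    return set(re.findall(r"[a-z]+", gloss.lower())) - STOP_WORDS

def gloss_overlap(gloss1, gloss2):
    """Number of distinct content words occurring in both glosses."""
    return len(gloss_words(gloss1) & gloss_words(gloss2))

maltese = ("The Maltese dog is a breed of small dogs that have "
           "a long, straight, silky white coat.")
afghan = ("The Afghan hound is a graceful breed of hound; "
          "it is tall, with a long silky coat.")

print(sorted(gloss_words(maltese) & gloss_words(afghan)))
print(gloss_overlap(maltese, afghan))
```

An equivalence would then be reported when the overlap (possibly normalized by gloss length) exceeds the chosen threshold.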

89 1.5 Alignment Reuse. Previous matchings are reused in the new matching activity – Entire schemas – Schema fragments. Techniques to match schemas o' and o'', given the alignments between o and o' and between o and o'', obtained from external resources that store the results of previous match operations
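
One simple way to reuse stored alignments is to compose them through the shared schema o. The sketch below assumes that alignments are stored as lists of (element of o, element of the other schema) pairs; the element names are invented.

```python
def reuse_alignments(o_to_o1, o_to_o2):
    """Derive candidate (o', o'') pairs that share a pivot element in o."""
    return sorted({
        (e1, e2)
        for pivot1, e1 in o_to_o1
        for pivot2, e2 in o_to_o2
        if pivot1 == pivot2
    })

# Hypothetical stored results of two previous match operations.
o_to_o1 = [("Person", "Employee"), ("City", "Town")]
o_to_o2 = [("Person", "Clerk"), ("Region", "Area")]

print(reuse_alignments(o_to_o1, o_to_o2))
```

The derived pairs are only candidates: composing two correspondences does not guarantee that the resulting one holds.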

90 Classification of techniques (approaches)

91 2. Structure-level Techniques Goal: compare elements taking into account their environment / context Taxonomy-based Graph-based Model-based (not examined)

92 2.1 Taxonomy-based Techniques - 1 Schemas are viewed as graph-like structures containing terms and their inter-relationships Bounded path matching – These techniques take two paths with links between classes defined by the hierarchical relations, compare terms and their positions along these paths, and identify similar terms Pupil University City born attends Schema 1: Student University City born attends Schema 2:

93 2.1 Taxonomy-based Techniques - 2 Schemas are viewed as graph-like structures containing terms and their inter-relationships Super(sub)-concepts rules – If super-concepts are the same, the actual concepts are similar to each other City Region Public Administration Unit See next page

94 Taxonomy based Upward cotopic distance

95 2.2. Graph-based Techniques Children –Two non-leaf schema elements are structurally similar if their immediate children sets are highly similar

96 Graph-based: Leaves. Two non-leaf schema elements are structurally similar if their leaf sets are highly similar, even if their immediate children are not. From EXAMPLE 1:
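
The difference between the children criterion and the leaves criterion can be sketched with a Jaccard coefficient over the two kinds of sets. Using Jaccard as the set similarity and comparing element names literally are assumptions of this sketch, and the two trees are invented.

```python
def leaves(tree, node):
    """Leaf descendants of a node; tree maps a node to its list of children."""
    children = tree.get(node, [])
    if not children:
        return {node}
    result = set()
    for child in children:
        result |= leaves(tree, child)
    return result

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def children_similarity(tree1, node1, tree2, node2):
    return jaccard(tree1.get(node1, []), tree2.get(node2, []))

def leaves_similarity(tree1, node1, tree2, node2):
    return jaccard(leaves(tree1, node1), leaves(tree2, node2))

# Hypothetical schemas: the first one has an extra intermediate level.
s = {"Shop": ["Cameras", "Accessories"],
     "Cameras": ["Digital", "Analog"],
     "Accessories": ["Lenses"]}
s_prime = {"Store": ["Digital", "Analog", "Lenses"]}

print(children_similarity(s, "Shop", s_prime, "Store"))
print(leaves_similarity(s, "Shop", s_prime, "Store"))
```

Here the immediate children have nothing in common, while the leaf sets coincide, so only the leaves criterion reports Shop and Store as similar.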

97 In our example. Schema S1: 1. Person (SSN, Age, City); 2. City (Name, Region). Schema S2: 1. Person (SSN, City). Interschema homonymy; intraschema synonymy

98 3. Schema integration

100 Conflicts revisited The second step (find matching correspondences), through the use of a rich language for expressing correspondences, has fixed the types of conflicts expressed on the schemas. Let us first provide the new classification of types of conflicts, and corresponding conflict resolution transformations (also called integration rules)

101 New classification of types of conflicts. New types: 1. Classification conflicts; 2. Descriptive conflicts (formerly name conflicts); 3. Structural conflicts (formerly type conflicts); 4. Fragmentation conflicts

102 1. Classification Conflicts. Corresponding elements describe different sets of real-world objects. Correspondence: S1.Faculty CONTAINS S2.PhD-advisor. Conflict resolution: – Introduction of a generalization/specialization hierarchy, with Faculty (S1) as the superclass of Phd-advisor (S2)

103 2. Descriptive conflicts - 1. Corresponding types have different properties, or corresponding properties are described in different ways. Object / Entity / Relationship type: – naming conflicts: synonyms (Node, Extremity); homonyms (Highway (EU), Highway (USA)) – composition conflicts: different attributes and methods, e.g. Employee (E#, name, address) and Employee (E#, position, salary, department)

104 2. Descriptive conflicts - 2. The solution depends upon: – the type of the descriptive conflict – the type of classification conflict. S1: Employee (E#, name, address). S2: Employee (E#, position, salary). S1.Employee EQUAL S2.Employee => IS.Employee (E#, name, address, position, salary). S1.Employee CONTAINS S2.Employee => IS.Employee (E#, name, address, [position, salary])
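
The two resolutions on slide 104 can be sketched as one merge function whose outcome depends on the classification relationship. Reading the square brackets around position and salary as "optional attributes" is an interpretation assumed here, and `merge_attributes` is a hypothetical helper.

```python
def merge_attributes(s1_attrs, s2_attrs, relationship):
    """Return (attribute, is_optional) pairs for the integrated entity type."""
    if relationship == "EQUAL":
        s2_optional = False   # every instance is described in both schemas
    elif relationship == "CONTAINS":
        s2_optional = True    # only some S1 instances also appear in S2
    else:
        raise ValueError(f"unsupported relationship: {relationship}")
    merged = [(a, False) for a in s1_attrs]
    merged += [(a, s2_optional) for a in s2_attrs if a not in s1_attrs]
    return merged

s1_employee = ["E#", "name", "address"]
s2_employee = ["E#", "position", "salary"]

print(merge_attributes(s1_employee, s2_employee, "EQUAL"))
print(merge_attributes(s1_employee, s2_employee, "CONTAINS"))
```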

105 3. Structural Conflicts Different schema element types, e.g.: class, attribute, relationship Library example : – S1 : Book is an Entity – S2 : Books is an attribute of Author Spatial example : – S1 : Tunnel is an attribute of RoadSection – S2 : Tunnel is an Entity Conflict resolution : Choose the less constraining structure – IS: Book, Tunnel are Entities

106 4. Fragmentation Conflicts (in traditional databases) The same phenomenon of the real world is perceived as a single object in one database and as several objects in the other Example – DB1: Soccer Team – DB2: Soccer Player Solution – aggregation relationship Player Team S1 S2 Player Team S1 S2 * Aggregator operator, usually not present natively in conceptual models, has to be simulated

107 DB2 : 4 WaySection 2 Separator DB1 : 1 RoadSection 4. Fragmentation Conflicts (in spatial databases) The same phenomenon of the real world is perceived as a single object in one database and as several objects in the other Example – DB1 : 1 RoadSection – DB2 : N WaySections + M Separators Solution – aggregation relationship

108 Integration rules Rules defining the strategy to solve conflicts Example rules: –If an entity in S1 corresponds to an attribute in S2, keep the entity in the IS –If the population of an entity in S1 is included in the population of another entity in S2, build an is-a hierarchy

109 Definition of mapping rules. Mapping rules are defined between the integrated schema and the source schemas. For a source schema S they can be found by looking at the correspondences and at the conflict resolution transformations adopted for S in the integration process. Example in a spatial database (S1: Node; S2: Extremity; integrated schema: Node): 1. Corr: Node synonym of Extremity. 2. CRes: change Extremity in S2 into Node. 3. S1 MappingR: Node EQUALS Node. 3. S2 MappingR: Node EQUALS Extremity

110 In our example. Schema S1: 1. Person (SSN, Age, City); 2. City (Name, Region). Schema S2: 1. Person (SSN, City). Global schema: Person (SSN, Age, CityofRes., CityofBirth, Region). Generation of the global schema and of the global view (see 5.3.1)

111 Integration Methods: Manual. First method: manual integration, "do it yourself". The DBA uses a language to write the mapping rules between the schemas and the integrated schema. Easy to implement and flexible, BUT time consuming for the DBA.

112 Integration Methods: Semi-Automatic. Second method: semi-automatic integration, "tell me about the problem, I will try to fix it". The DBA supplies the schemas and the correspondences; a TOOL produces the integrated schema and the mapping rules. Opens the way to visual CASE tools and integration servers, BUT knowledge acquisition can be painful.

113 For further reading: C. Batini et al., Conceptual Database Design, Benjamin and Cummings, 1992.

114 Optional further material

115 The Newest Trends … XML and Web-data Ontological Assistance Ontology Integration Spatial and Temporal Data Multimedia Data

116 XML Data Integration (Aida Boukottaya). Technologies shown: CSS, XSL-FO, XQuery, XSLT, XLink, XPointer, XML Schema, XPath, XForms, SOAP, XFrames, XHTML, MathML, SMIL, CC/PP, SVG, RDF, OWL, WSDL, RDF Schema

117 XML-based Data Integration: technologies involved (courtesy Oracle). XML? Different data formats. XQuery? A declarative way to query XML documents. J2EE™? A standards-based infrastructure platform. XML database? – Native XML storage – XML data management – Performance optimizations. Flow: get information from diverse sources in XML; join, filter and transform data with XQuery. Figure labels: XML database, XQuery engine, J2EE™ platform.

118 Reuse XML Documents (Aida Boukottaya, 29/10/2004). Figure labels: source schema, target schema, structure transformation, XML data structures

119 XML Schema Graph (Aida Boukottaya). Directed labelled graph with constraint sets: – Nodes (atomic/complex) – Edges (containment/of-property/association) – Constraints (over nodes and edges) – Binding between nodes and types (types, abstract types, type derivation)

120 Mapping Operations (Aida Boukottaya). Six cases:
(1) Target and source nodes are equivalent: t = connect (s)
(2) Semantically similar concepts that use different names: t = rename (s) (e.g., author = rename (writer))
(3) A particular source element may have a subset of the desired target values: t = union (s1, s2) (e.g., publication = union (paper, book)); or a superset of the desired values: t = select_P (s) (e.g., paper = select_{kind=paper} (publication))

121 Mapping Operations, continued (Aida Boukottaya)
(4) Target values at a different level of atomicity, merged or split values: t = merge (s1, s2, s3) (e.g., Name = merge (FirstName, LastName)); (t1, t2, t3) = split_c (s) (e.g., (City, State, Zip) = split_ws (Address))
(5) Target elements obtained by a natural join: t = join_P (s1, s2)
(6) Target values obtained by applying specific functions: t = apply_f (s1, s2) (e.g., DateT = apply_{Date-Conversion} (DateS))
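
The six mapping operations can be mimicked on plain Python values. This is only an illustrative sketch: the function signatures, the sample records and the date formats used in `date_conversion` are assumptions, not the operators' actual definitions.

```python
from datetime import datetime

def rename(source):
    # (1)/(2) connect and rename copy the values; only the label changes.
    return list(source)

def union(*sources):
    # (3) t = union(s1, s2): each source holds a subset of the target values.
    return [value for source in sources for value in source]

def select(predicate, source):
    # (3) t = select_P(s): the source holds a superset of the target values.
    return [value for value in source if predicate(value)]

def merge(*values, sep=" "):
    # (4) t = merge(s1, s2, s3): several source values form one target value.
    return sep.join(values)

def split(value, sep=None):
    # (4) (t1, t2, t3) = split_c(s); sep=None splits on whitespace ("ws").
    return tuple(value.split(sep))

def join(predicate, s1, s2):
    # (5) t = join_P(s1, s2): combine the records that satisfy the predicate.
    return [{**a, **b} for a in s1 for b in s2 if predicate(a, b)]

def apply(function, *values):
    # (6) t = apply_f(s1, s2): the target value is computed by a function.
    return function(*values)

def date_conversion(text):
    # Assumed formats: day/month/year in the source, ISO 8601 in the target.
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()

publication = union([{"kind": "paper", "title": "P"}], [{"kind": "book", "title": "B"}])
paper = select(lambda p: p["kind"] == "paper", publication)
print(paper)
print(merge("Ada", "Lovelace"))
print(split("Lausanne VD 1015"))
print(apply(date_conversion, "29/10/2004"))
```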

122 Path similarity (Aida Boukottaya). Similarity measures, where P1 is a target path and P2 is a source path:
– The path P2 includes most of the nodes of P1 in the right order: Longest Common Subsequence (LCS) measure
– The occurrences of the P1 nodes are closer to the beginning of P2 than to the tail: Average Positioning (AP) measure
– The occurrences of the P1 nodes in P2 are close to each other: LCS with minimum gaps (gaps) measure
– If several match candidates that match exactly the same nodes in P1 exist, P2 is the shortest one: Length difference (ld) measure

123 Measuring path similarity (Aida Boukottaya). P1 = book/chapter/title
P2                                                   LCS  AP  Gaps  Ld  Pr
Media/book/chapter/title/number                      3    3   0     2   0.84
Media/chapter/book/title/number                      2    3   0     3   0.53
Book/chapter/title/subtitle/number                   3    2   0     2   0.92
Catalog/book/chapters/chapter/section/title/number   3    4   2     4   0.68
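
The LCS column of slide 123 can be reproduced with the textbook dynamic-programming algorithm for the longest common subsequence, applied to path nodes instead of characters. Comparing node names in lower case is an assumption of this sketch; the other measures (AP, gaps, ld, Pr) are not computed here.

```python
def lcs_length(p1, p2):
    """Length of the longest common subsequence of the nodes of two paths."""
    a = [node.strip().lower() for node in p1.split("/")]
    b = [node.strip().lower() for node in p2.split("/")]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

P1 = "book/chapter/title"
candidates = [
    "Media/book/chapter/title/number",
    "Media/chapter/book/title/number",
    "Book/chapter/title/subtitle/number",
    "Catalog/book/chapters/chapter/section/title/number",
]
for p2 in candidates:
    print(lcs_length(P1, p2), p2)
```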

124 Structural matching: node context (Aida Boukottaya). Node context = ancestor-context ∪ child-context ∪ leaf-context. Structural similarity = context similarity

125 Other New Trends

126 Space and Time Information Data Types and topological relationships

127 Spatial Data Matching. Road-number matching: – Semantic. Crossroad matching: – Geometric – Topologic. Section matching: – Hausdorff distance – Road-number matching
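
For section matching, the Hausdorff distance between two geometries can be sketched on point sets. The assumption here is that each road section has been sampled into a list of (x, y) points; a GIS would normally work on the full polylines, and the coordinates are invented.

```python
from math import dist

def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point of b."""
    return max(min(dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical samples of the same road in two databases.
road_section = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
way_section = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]

print(hausdorff(road_section, way_section))
```

Two sections are then match candidates when this distance stays below a tolerance chosen for the scale of the maps.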

128 Descriptive conflicts. Corresponding types have different properties, or corresponding properties are described in different ways. New problem: different geometries – Different representations of the same geometry: DB1: tunnel is a line; DB2: tunnel is represented by its end-points – Conflict resolution: multi-representations of spatial objects

129 Different geometries (cont'd). Spatially heterogeneous databases: different scales – different geometric types: DB1: Toll (point); DB2: Toll (area) – same geometric type, but more or less precise form: buildings in DB1 and DB2 – Conflict resolution: spatial objects with multiple geometries and cartographic generalization functions

130 Multimedia Data (Katsumi Tanaka, tanaka@dl.kuis.kyoto-u.ac.jp). Software technologies for search and integration across heterogeneous-media archives – Research and development of cross-media search engines for distributed and heterogeneous digital contents, such as a cross-media meta-search engine for digital images. Figure labels: hierarchical video segmentation; video segmentation by closed caption; integration of related information; retrieval of related information; user-generated Web content; TV programs with metadata; Web content.

131 Conclusion. Information integration: – A lot of work has been done – Few tools exist – No global automatic tool – Current focus on matching. Newest trends: – XML, ontologies, space. Still to come: – Hybrid approaches: DB + reasoning

132 Bibliography (2nd)
– Christine Parent and Stefano Spaccapietra. Database integration: The key to data interoperability. In Advances in Object-Oriented Data Modeling, M.P. Papazoglou, S. Spaccapietra, and Z. Tari (Eds.), The MIT Press, 2000.
– Pavel Shvaiko and Jérôme Euzenat. A Survey of Schema-Based Matching Approaches. Journal on Data Semantics IV, LNCS, 2005.
– Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10: 334-350 (2001).
– Anastasiya Sotnykova et al. Semantic Mappings in description logics for spatio-temporal database schema integration. Journal on Data Semantics III, LNCS 3534, 2005.

133 Example of an interschema property. Schema 1: Person. Schema 2: Worker

139 Leftovers

140 Link correspondences. Integration of Book (title, ISBN) and Author (name, birthdate) without the link correspondence Book--authors EQUAL books--Author would generate two relationships, R1 and R2, between Book and Author.

141 Link correspondences. Integration of Book (title, ISBN) and Author (name, birthdate) with the link correspondence Book--authors EQUAL books--Author would generate a single relationship R between Book and Author.

142 GAV & LAV Integration revisited GAV (Global As View): the Integrated Schema provides an integrated description of all the data available in the sources – the IS is used to access data from any sources – queries to the IS are mapped into queries to the sources (as in distributed databases) – the IS is defined to allow access to all data: in case of conflicting specifications, an all-encompassing specification is elaborated for the IS LAV (Local As View): the Integrated Schema provides an integrated description of all the data that is desirable and that somehow matches the requirements of users of the IS – the IS may define data that does not exist in any of the sources (missing / incomplete information problem) Other Sub-Goals for integration: – minimality vs. exhaustiveness of the integrated schema,...
