XML代价估计方法研究综述课件.ppt

资源ID：4029710 资源大小：538KB 全文页数：45页
资源格式： PPT 下载积分：16金币

快捷下载

会员登录下载

三方登录下载：

下载资源需要16金币

邮箱/手机：
温馨提示：	用户名和密码都是您填写的邮箱或者手机号，方便查询和重复下载（系统自动生成）
支付方式：
验证码：	换一换

加入VIP免费专享

账号：
密码：
验证码：	换一换
当日自动登录忘记密码？

友情提示

1、下载资料失败解决办法

2、PDF文件下载后，可能会被浏览器默认打开，此种情况可以点击浏览器菜单，保存网页到桌面，就可以正常下载了。

3、本站不支持迅雷下载，请使用电脑自带的IE浏览器，或者360浏览器、谷歌浏览器下载即可。

4、本站资源下载后的文档和图纸-无水印,预览文档经过压缩，下载后原文更清晰。

5、试题试卷类文档，如果标题没有明确说明有答案则都视为没有答案，请知晓。

网站客服

侵权投诉

XML代价估计方法研究综述课件.ppt

XML 代价估计方法研究综述,XML Group,Outline,研究代价估计的必要性xml中代价估计研究的难点背景知识介绍现有方法分类,研究代价估计的必要性,查询优化的基础不同分支对查询目标的选择率不同，如果选择性低的分支先于选择性高的分支执行，就会有效的减少中间结果从而提供查询效率结构连接在XML中是一个很重要的操作，连接的顺序的选择需要计算不同连接顺序的代价及早给用户提供反馈信息,xml中代价估计研究的难点,XML数据的不规则性是对传统统计信息方法的重要挑战，具体表现在以下几个方面：路径依赖性，如同为name结点，在person下和在city下出现意义就完全不同结构相关性嵌套在不同祖先下面的同类结点的个数差别可能很大，如book结点下的author个数是不确定的值相关性/purchase/Itemtype=bookprice100XML的有序性制约了转换规则的灵活性所有这些问题，都使得在xml中采用传统的代价估计方法不切实际，会带来很大的误差。针对xml数据的特点，我们应该寻求一种新的代价估计方法。,Background Knowledge,XML Data ModelA large,node-labeled tree T(V,E),Example XML Document Tree,Background Knowledge,XML Data ModelA large,node-labeled tree T(V,E),Example XML Document,Background Knowledge,XML Query ModelA node-labeled query tree TQEach node of TQ is labeled with a variable name qi in Q Each edge(qi,qj)of TQ is annotated with an XPath expression path(qi,qj)that describes the specific structural constraints specified in Q,Query Fragment：for$b in doc(“bib.xml”)/bib/book$a in$b/author$t in$b/title,现有估计方法的分类,路径选择性计算Twig匹配个数统计值谓词选择率估计结构连接代价估计,Overview,data tree,different summarization methods based on:,path,all m-length path,F/B-stability,LoreMcHugh et al,VLDB99,pruning,PathTree,Markov TableAboulnaga et al,VLDB01,XSketchPolyzotis et al,VLDB02,type,XSketchesPolyzotis et al,SIGMOD02,+value,+twig,XSketchesPolyzotis et al,SIGMOD04,Region Code,2D Model,1D Model,StatiXFreire et al,SIGMOD02,PH HistogramWu Y EDBT 2002,Interva&Position ModelH.Lu SIGMOD 2003,路径选择性计算,问题描述/a/bc/d/eg/e/f/*/*/e/f,Path Tree&Markov Table,Ashraf Aboulnaga,Alaa R.Alameldeen,and Jeffry F.Naughton.Estimating the selectivity of XML path expressions for Internet scale applications.VLDB 2001,Path Tree,ConstructionStart from an original path treeCount of nodes in absolute pathsAggregate the tree to fit in the limited spaceSibling-*EstimationMatch the query against the path tree,matching*as the last resort.Need to decide whether to use total count or average count.,An XML document and its path tree,Estimation:count(A/B/D)=1 count(A/C/D)=1,Aggregate Path Tree,Aggregate Path Tree,A,1,H,6,E,5,D,7,J,4,I,2,红色结点表示那些不频繁出现的结点，需要删除，从而保证path树能够放入内存,Sibling-*Summarization,A,1,C,9,*,*,K,*,f=23n=2,6,8,3,把查询和path 树匹配需要决定用总数还是平均数,估计方法,Estimation:count(/C/H/K)=11 count(/C/*/K)=23,Markov Table,构造一个表，存储长度=2。当查询路径长度=m时，可以直接从表中读出结果，否则，用公式计算。例如，m3，查询为/A/B/C/D，则当表不能装入内存的时候，进行剪枝操作。,Markov Table:Example,a,b,b,c,d,d,e,e,e,f,d,a,b,d,d,e,1,2,2,1,3,c,f,1,1,m=2,suffix-*,Q1:a/c/e,Twig 匹配个数统计,问题描述Twig Query(basic building block of declarative query languages for XML),FOR$f IN document(“person.xml”)/department/faculty FOR$r in$f/RA,FOR$t in$f/TA,Department,Faculty,RA,TA,$f,$r,$t,XSketch方法,N.Polyzotis,M.Garofalakis.Statistical Synopses for Graph-Structured XML Databases,In Proc.of ACM SIGMOD Conf.2002 N.Polyzotis and M.Garofalakis.Structure and value synopses for xml data graphs.In Proceedings of 28th VLDB Conf.,2002.N.Polyzotis,M.Garofalakis,and Yannis Iosnnidis.Selectivity Estimation for XML Twigs.In Proc.of the 20th Intl.Conf.on Data Engineering,2004N.Polyzotis and M.Garofalakis and Y.Ioannidis.Approximate XML Query Answers.In Proc.ACM SIGMOD 2004,XSketch方法,B/F stability,Definition：Let U,V be sets of elements in an XML data Tree T.We say that V is backward-stable(B-stale)with respect to U,iff for each vV there exists a u U such that edge(u,v)is in T.Similarly,U is said to be forward-stable(F-stable)with respect to V,iff for each u U there exists a v V such that the edge(u,v)is in G.,XSketch方法,How to estimate the cardinality of xpath query such as A/P/K or A/p?,Utilizing edge stability,We can give an accurate match number:Count(A/P/K)=6Count(A/p)=3,But what about A/P/T which doesnt conform to backward stability?,2 assumptions Path independence Backward-edge uniformityCount(A/P/T)=f(A/P/T)*count(T)f(A/P/T)=f(A/P)*f(P/T|A/P)=f(P/T|A/P)=f(P/T)count(P)/(count(B)+count(P),XSketch方法(处理值谓词),v1,v3,v5,v2,v4,B,B,F,F,STN(v5),Dep(v5)=v1,v4,v1,v4,Histogram at v5,H(1,4)=count(v11v2/v3v44/v5),v3 v5 count(v11v2/v33v44/v5)=H(1,4)*count(v33)/count(v3),A1Edge-Value IndependenceAcross Non-Satble Edgesf(u/v|v)f(u/v)A2Value-Independence Outside Correlation Scope,XSkecth方法（处理twig）,Problem with the former modelLet us see a simple example,for t0 in A,t1 in t0/B,t2 in t0/C,(a),(b),(c)Structural XSketch,2000 binding tuples on the first document10100 tuples on the second document,XSkecth方法（处理twig）,Singe-path XSketch model dose not store detailed enough information to capture the correlation patterns between multiple path expressions.If node A records a two-dimensional distribution fA(b,c)for the fraction of elements in A that have b children in B and c children in C.,XSkecth方法（处理twig）,CK,CY,Elements,fp,2,1,p4,0.25,P5,4(0.2521100.25110.51)5,CP,CN,2,1,2,1,0.25,2,1,1,1,0.5,1,1,p8，p9,Another Example,Edge-distribution:fP(CY,CK,CP，CN),P,Y,P,K,A,P,A,N,Statix方法,Freire J,Jayant R,Ramanath M,Roy P and Simeon J.StatiX:Making XML Count.In:Proc.of ACM SIGMOD 2002,USA.,Statix方法,Constructionevery instance node is assigned a unique id(type id+sequential node id)during parsing and validation,meanwhile gathering statisticsConstructing the histogramsHistogramBuild histogram for each typeMight contain histogram describing the distribution of the parents of a typeValue histogram can be built for types that are defined in terms of base types(eg.integer)EstimationConvert the query into SQL(i.e.,relational join on IDs)Use standard histogram multiplication,StatiX:Example,FOR$s in document(“imdb.xml”)/show,$r in$s/reviewWHERE$s/year 1992RETURN$s/title,$r,SELECT title,reviewFROM Show,ReviewWHERE Show.year 1992 AND Show.ID=Review.ParID,STAT Show CARD:5 ID:16 STAT Show_Year VALUE:19902001 BUCKET_NUM:2 BUCKETS:1990,1994):3;1994,2001):2STAT Review CARD:8 ID:3038 PARENT HISTOGRAM Show BUCKET_NUM:3 BUCKETS:1,2):4;2,3):1;3,5):3,STAT Aka CARD:6 ID:6-12 PARENT HISTOGRAM Show_Episode BUCKET_NUM:3 BUCKETS:1,2):1;2,6):0;6,12):7,31/28/5=2.4,Summary,结构连接代价估计,问题描述给定任意的祖先节点集合A和后代节点集合D，计算A/D结果集合的大小，估计匹配的祖先后代的结果个数当同一祖先有多个后代，或者同一后代有多个祖先时，节点对个数可能远远大于祖先或后代的个数。应用连接顺序的选择,结构连接代价估计,Existed work2D model:pH histogram Wu Y,Patel J and Jagadish H.V.Estimating Answer Sizes for XML Queries.In:Proc.of the EDBT 2002.Interval and Positional model W.Wang,H.Jiang,H.Lu,and J.F.Xu.Containment join Size Estimation:Models and Methods.In:Proc.Of ACM SIGMOD 2003,Absolute Region Code(start,end),blah 1234 5678 0000,contact,name,phone,blah,office,home,mobile,1234,5678,0000,(1,16),(2,4),(3,3),(5,15),(6,8),(7,7),(9,11),(10,10),(12,14),(13,13),2,3,4,a.start d.start and d.end a.end,pH Histogram,Forbidden Regions For a node,R1和R2是结点A的ForbiddenRegion 两个结点的(start,end)满足：（1）no overlap（2）nested 不可能有部分重叠的(partiallyoverlap)情况,pH Histogram,EstPA=HistP1A1/4HistP2A+HistP2B+HistP2C+HistP2E+1/2(HistP2D+HistP2F),Ancestor-based Join Estimation,EstPA=HistP2A1/4HistP1A+HistP1F+HistP2G+HistP1H,Descendant-based Join Estimation,pH Histogram,est=3*(12+1+2+0.25*5)=48.75,start,end,start,start,end,end,For off-diagonal cells:,pH for A,pH for D,Interval&Position Models,Map into new problems under 2 newly proposed models.Interval model and Position modelHistogram based method that is adaptive to the data:PL histogramAssumption:1D uniform of the descendant setEstimation is robust when correlation is lowSampling based method that captures the correlation:IM Sampling and PM SamplingAssumption:Height of the XML data tree is a small constantBoth have theoretical bounds on the accuracy of the estimation,Interval Model,Interval ModelMap elements to 1D line segments(points)IMA(A):interval setIMD(D):point set|A SJ D|=|interval-point pairs|,interval set,point set,Position Model,Position ModelUse frequency tables to record coverage and starting point informationPMA(S):coverage tablePMD(S):start table|A SJ D|=Containment joinEquijoin,coverage table,start table,Point-Line(PL)-Histogram,Partition the pace into partitionsEstimate the overlapping(interval,point)pairs within each bucket.AssumptionsData distribution of A and D are independentD conforms to uniform distribution,Sampling in Interval Model,Adapted from bifocal samplingIM Algorithm:Choose m samples from IMD(D)Sum up the m subjoin sizes.X=Scale up the result by|IMD(D)|/mStatistical GuaranteeE(X)=X(join size)with high probability(by Hoeffding bound),2,1,3,6,m=2,Problem,How to handle parent-child relationship?How to count|A/B/C|How to use estimation for Query Optimization,Question?,Then We done!,Thank you!,

注意事项

本文（XML代价估计方法研究综述课件.ppt）为本站会员（牧羊曲112）主动上传，三一办公仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若此文所含内容侵犯了您的版权或隐私，请立即通知三一办公（点击联系客服），我们立即给予删除！

温馨提示：如果因为网速或其他原因下载失败请重新下载，重复下载不扣分。