
    走进数据科学英文版课件.pptx (Approaching Data Science: English-language course slides)

    • Resource ID: 1922174       File size: 8.90 MB        Pages: 92
    • Format: PPTX

    Data Mining: Theory & Algorithms

    Mining? Warehousing?

    Technology Advancement

    The World of Data

    Data Rich, Information Poor

    Learning Resources
      • Conferences:
          • International Conference on Data Mining
          • International Conference on Data Engineering
          • International Conference on Machine Learning
          • International Joint Conference on Artificial Intelligence
          • Pacific-Asia Conference on Knowledge Discovery and Data Mining
          • ACM SIGKDD Conference on Knowledge Discovery and Data Mining
      • Xindong Wu, Zhihua Zhou, Jiawei Han, Jian Pei, Qiang Yang, Chih-Jen Lin, Philip S. Yu, Changshui Zhang

    Interdisciplinary

    Ubiquitous

    Comprehensive Learning

    Learning Listening

    Data
      • Definition: “Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.”
      • Data Types: Continuous, Binary, Discrete, String, Symbolic
      • Storage: Physical, Logical
      • Major Issues: Transformation; Errors and Corruption

    What is Big Data?
      • “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)
      • “Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (McKinsey & Company)

    Big Data

    Public Security

    Health Care Application
      • Effectiveness Research
      • Personalized Medicine

    Location Data: Urban Planning

    Location Data: Mobile User

    Location Data: Shopper

    Retail Data: Targeted Marketing

    Retail Data: Sentiment Analysis

    Social Networks

    Sports

    Attractiveness Mining

    Open Data
      • Technically Open: available in a machine-readable standard format, which means it can be retrieved and meaningfully processed by a computer application.
      • Legally Open: explicitly licensed in a way that permits commercial and non-commercial use without restrictions.

    Where to find data?

    Open Government Data

    Data Mining
      • People have been analysing and investigating data for centuries.
          • Statistics: Mean, Variance, Correlation, Distribution
      • In modern days, data are often far beyond human comprehension.
          • Diversity, Volume, Dimensionality
      • Definition: Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.
      • Not a fully automatic process: human interventions are often inevitable.
          • Domain Knowledge
          • Data Collection and Pre-processing
      • Synonym: Knowledge Discovery

    Is DM really important?
      • “If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.”
      • An interview with Google Chief Economist Hal Varian from the New York Times

    Business Intelligence

    From Data To Intelligence

    Data Integration & Analysis

    The Process of Data Mining

    DM Techniques - Classification
      • “Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”
      • Given a training set (x1, y1), ..., (xn, yn), produce a classifier (function) that maps any unknown object xi to its class label yi.
      • Algorithms: Decision Trees, K-Nearest Neighbours, Neural Networks, Support Vector Machines
      • Applications: Churn Prediction, Medical Diagnosis

    Classification Boundaries
      • [Figure: axes X and Y]

    Overfitting - Classification

    Cross Validation
      • [Diagram labels: Data, Training Set, Test Set, Evaluation, Generated Models]

    Confusion Matrix
      • TPR = TP / (TP + FN)
      • TNR = TN / (TN + FP)
      • Accuracy = (TP + TN) / (P + N)

    Receiver Operating Characteristic
      • [Figure labels: Very small threshold, Very large threshold, Random guess]

    Cost Sensitive Learning

    Lift Analysis

    DM Techniques - Clustering
      • “Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”
      • Distance Metrics: Euclidean Distance, Manhattan Distance, Mahalanobis Distance
      • Algorithms: K-Means, Sequential Leader, Affinity Propagation
      • Applications: Market Research, Image Segmentation, Social Network Analysis
      • What is the difference between classification and clustering?

    Hierarchical Clustering

    DM Techniques - Association Rule

    Association Rule

    DM Techniques - Regression

    Overfitting - Regression
      • [Figure: axes x and y]

    Seeing is Knowing

    Performance Dashboard

    Data Preprocessing
      • Real data are often surprisingly dirty.
      • A Major Challenge for Data Mining
      • Typical Issues: Missing Attribute Values, Different Coding/Naming Schemes, Infeasible Values, Inconsistent Data, Outliers
      • Data Quality: Accuracy, Completeness, Consistency, Interpretability, Credibility, Timeliness

    Data Preprocessing
      • Data Cleaning
          • Fill in missing values.
          • Correct inconsistent data.
          • Identify outliers and noisy data.
      • Data Integration
          • Combine data from different sources.
      • Data Transformation
          • Normalization
          • Aggregation
          • Type Conversion
      • Data Reduction
          • Feature Selection
          • Sampling

    Internet Privacy

    Privacy Protection
      • Data: A Double-Edged Sword
          • People can benefit greatly from data analysis.
          • The consequence of information leakage can be catastrophic.
          • People may be reluctant to give sensitive information due to privacy concerns (Drug, Tax, Sexuality).
      • How to find out the percentage of people with a certain attribute?
          • The interviewer should not know the true answer of each respondent.
      • Randomized Response
          • Used in structured survey research.
          • Can maintain the confidentiality of respondents.

    Privacy Protection
      • Two questions are presented:
          • Q1: I have the attribute A.
          • Q2: I do not have the attribute A.
      • The respondent uses a random device to:
          • Answer Q1 with probability p.
          • Answer Q2 with probability 1-p.
      • The interviewer has no idea about which question is answered.

    Cloud Computing

    Cloud Computing
      • Pay As You Go
      • Software as a Service
      • Platform as a Service
      • Infrastructure as a Service

    Parallel Computing

    Mobile Supercomputing

    Intel MIC

    The Big Picture

    No Free Lunch
      • Why bother with so many different algorithms?
          • No algorithm is always superior to others.
          • No parameter setting is optimal over all problems.
      • Look for the best match between problem and algorithm.
          • Experience
          • Trial and Error
      • Factors to consider: Applicability, Computational Complexity, Interpretability
      • Always start with simple ones.

    Just in Case Someone Asks

    Grouping
      • [Figure: axes X and Y; Group A, Group B]

    Violent Crime vs. Video Game

    Tricky?
      • Is there correlation between height and business success?
      • Average American male is 5'9".
      • Only 3.9% of adult American men are taller than 6'2".
      • Around 30% of Fortune 500 CEOs are taller than 6'2".

    Survivorship Bias

    Is this true?

    Where did the time go?

    Thank you for listening!
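    The Confusion Matrix slide defines TPR = TP/(TP+FN), TNR = TN/(TN+FP) and Accuracy = (TP+TN)/(P+N). A minimal sketch of those three formulas follows, assuming P = TP + FN (all actual positives) and N = TN + FP (all actual negatives). The function name and the counts in the usage line are illustrative and not taken from the slides.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Compute TPR, TNR and accuracy from confusion-matrix counts."""
    p = tp + fn  # actual positives (assumed meaning of P on the slide)
    n = tn + fp  # actual negatives (assumed meaning of N on the slide)
    if p == 0 or n == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    return {
        "TPR": tp / p,                   # true positive rate (sensitivity)
        "TNR": tn / n,                   # true negative rate (specificity)
        "accuracy": (tp + tn) / (p + n),
    }


if __name__ == "__main__":
    # Made-up counts: 50 actual positives, 50 actual negatives.
    print(confusion_metrics(tp=40, fn=10, tn=45, fp=5))
```

    Sweeping a decision threshold and plotting TPR against 1 - TNR at each setting is one way to trace the Receiver Operating Characteristic curve named on the following slide.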
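    The Privacy Protection slides describe the randomized response procedure but not how the interviewer recovers the percentage. One way to do it, assuming every respondent answers the selected statement truthfully with "yes" or "no": if the true proportion is π, then P(yes) = π·p + (1 − π)·(1 − p), so π = (P(yes) − (1 − p)) / (2p − 1) for p ≠ 0.5. The sketch below simulates such a survey and inverts the formula. All function names, the seed and the sample size are illustrative assumptions.

```python
import random


def simulate_yes_fraction(true_rate, p, n, seed=0):
    """Simulate n respondents and return the observed fraction of 'yes' answers."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        has_a = rng.random() < true_rate  # respondent really has attribute A
        asked_q1 = rng.random() < p       # random device selects Q1 with probability p
        # Q1 "I have A" gets 'yes' iff has_a; Q2 "I do not have A" gets 'yes' iff not has_a.
        answer_yes = has_a if asked_q1 else not has_a
        yes += answer_yes
    return yes / n


def estimate_proportion(yes_fraction, p):
    """Invert P(yes) = pi*p + (1 - pi)*(1 - p) to estimate pi."""
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p = 0.5 makes the answers uninformative")
    return (yes_fraction - (1 - p)) / (2 * p - 1)


if __name__ == "__main__":
    observed = simulate_yes_fraction(true_rate=0.2, p=0.7, n=50_000)
    print("estimated proportion:", estimate_proportion(observed, p=0.7))
```

    The interviewer only ever sees the aggregate fraction of "yes" answers, which is why individual confidentiality is preserved. Values of p close to 0.5 protect respondents more but make the estimate noisier.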
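    The clustering slide names three distance metrics without defining them. Below is a small sketch using their standard textbook definitions. For the Mahalanobis distance the caller is assumed to supply the inverse covariance matrix directly (`inv_cov`, an illustrative parameter name), since estimating and inverting a covariance matrix is beyond a short example.

```python
import math


def euclidean(x, y):
    """Straight-line distance: sqrt(sum((xi - yi)^2))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum(|xi - yi|)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mahalanobis(x, y, inv_cov):
    """sqrt(d^T * inv_cov * d) with d = x - y; inv_cov is the inverse covariance matrix."""
    d = [a - b for a, b in zip(x, y)]
    size = len(d)
    quad = sum(d[i] * inv_cov[i][j] * d[j] for i in range(size) for j in range(size))
    return math.sqrt(quad)


if __name__ == "__main__":
    a, b = (0.0, 0.0), (3.0, 4.0)
    print(euclidean(a, b))   # 5.0
    print(manhattan(a, b))   # 7.0
    # With the identity matrix, Mahalanobis reduces to Euclidean distance.
    print(mahalanobis(a, b, [[1.0, 0.0], [0.0, 1.0]]))  # 5.0
```

    The Mahalanobis form rescales each direction by the spread of the data, so a difference along a high-variance axis counts for less than the same difference along a low-variance axis.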
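    The classification slide lists K-Nearest Neighbours among its algorithms and frames the task as mapping an unknown object to a class label, given a labelled training set. The following is a minimal sketch of that idea, assuming Euclidean distance and majority voting. The training points, labels and the choice k = 3 are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_tuple, label) pairs.
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the training set size")
    # Sort training items by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training_set = [
        ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
        ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
    ]
    print(knn_predict(training_set, (0.5, 0.5)))  # A
    print(knn_predict(training_set, (5.5, 5.5)))  # B
```

    In the spirit of the Cross Validation slide, a value of k would normally be chosen by evaluating on held-out data and not on the training set itself.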
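    The Data Preprocessing slides list "Fill in missing values" and "Normalization" without naming specific methods. As one possible reading, the sketch below uses mean imputation and min-max scaling to [0, 1]. Both choices are assumptions made here, as is the convention that `None` marks a missing value.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values (mean imputation)."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute: every value is missing")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    raw = [1.0, None, 3.0]
    cleaned = fill_missing_with_mean(raw)
    print(cleaned)                     # [1.0, 2.0, 3.0]
    print(min_max_normalize(cleaned))  # [0.0, 0.5, 1.0]
```

    Min-max scaling is sensitive to the outliers mentioned under Typical Issues, so outlier handling would usually come before this step.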

    Note

    This file (走进数据科学英文版课件.pptx) was uploaded by site member 小飞机.
