Into Data Science: English-Language Courseware (走进数据科学英文版课件.pptx, 92 pages)

Data Mining: Theory & Algorithms

Mining? Warehousing?

Technology Advancement

The World of Data

Data Rich, Information Poor

Learning Resources
  - International Conference on Data Mining
  - International Conference on Data Engineering
  - International Conference on Machine Learning
  - International Joint Conference on Artificial Intelligence
  - Pacific-Asia Conference on Knowledge Discovery and Data Mining
  - ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Learning Resources
  - Xindong Wu, Zhihua Zhou, Jiawei Han, Jian Pei, Qiang Yang, Chih-Jen Lin, Philip S. Yu, Changshui Zhang

Interdisciplinary

Ubiquitous

Comprehensive Learning

Learning Listening

Data
  - Definition: “Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.”
  - Data Types
      - Continuous, Binary
      - Discrete, String
      - Symbolic
  - Storage
      - Physical
      - Logical
  - Major Issues
      - Transformation
      - Errors and Corruption

What is Big Data?
  - “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)
  - “Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (McKinsey & Company)

Big Data

Public Security

Health Care Application
  - Effectiveness Research
  - Personalized Medicine

Location Data: Urban Planning

Location Data: Mobile User

Location Data: Shopper

Retail Data: Targeted Marketing

Retail Data: Sentiment Analysis

Social Networks

Sports

Attractiveness Mining

Open Data
  - Technically Open: available in a machine-readable standard format, which means it can be retrieved and meaningfully processed by a computer application.
  - Legally Open: explicitly licensed in a way that permits commercial and non-commercial use without restrictions.

Where to find data?

Open Government Data

Data Mining
  - People have been analysing and investigating data for centuries.
      - Statistics: Mean, Variance, Correlation, Distribution
  - In modern days, data are often far beyond human comprehension.
      - Diversity, Volume, Dimensionality
  - Definition: Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.
  - Not a fully automatic process
      - Human interventions are often inevitable.
      - Domain Knowledge
      - Data Collection and Pre-processing
  - Synonym: Knowledge Discovery

Is DM really important?
  - “If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.”
    (An interview with Google Chief Economist Hal Varian, from the New York Times)

Business Intelligence

From Data To Intelligence

Data Integration & Analysis

The Process of Data Mining

DM Techniques - Classification
  - “Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”
  - Given a training set (x1, y1), ..., (xn, yn), produce a classifier (function) that maps any unknown object xi to its class label yi.
  - Algorithms
      - Decision Trees
      - K-Nearest Neighbours
      - Neural Networks
      - Support Vector Machines
  - Applications
      - Churn Prediction
      - Medical Diagnosis

Classification Boundaries
  (figure; axes: X, Y)
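
The classification setting described above (a labeled training set, and a function that maps an unseen object to a class label) can be illustrated with k-nearest neighbours, one of the listed algorithms. This is a minimal sketch only: the function name `knn_predict`, the toy data points, the Euclidean distance, and k = 3 are all illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter


def knn_predict(train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points.

    train: list of (features, label) pairs, where features are equal-length
           tuples of numbers (hypothetical data layout chosen for this sketch).
    """
    # Sort the training points by Euclidean distance to x and keep the k closest.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (made up for illustration): two well-separated groups.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.8, 6.2)))
```

With groups this far apart, the three nearest neighbours of each query point all come from a single group, so the vote is unanimous. On real data the choice of k and of the distance metric matters and is usually tuned by cross validation.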

Overfitting Classification

Cross Validation
  (diagram labels: Data, Training Set, Test Set, Generated Models, Evaluation)

Confusion Matrix
  - TPR = TP / (TP + FN)
  - TNR = TN / (TN + FP)
  - Accuracy = (TP + TN) / (P + N)

Receiver Operating Characteristic
  (figure labels: Very small threshold, Very large threshold, Random guess)

Cost Sensitive Learning

Lift Analysis

DM Techniques - Clustering
  - “Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”
  - Distance Metrics
      - Euclidean Distance
      - Manhattan Distance
      - Mahalanobis Distance
  - Algorithms
      - K-Means
      - Sequential Leader
      - Affinity Propagation
  - Applications
      - Market Research
      - Image Segmentation
      - Social Network Analysis

What is the difference between classification and clustering?

Hierarchical Clustering

DM Techniques - Association Rule

Association Rule

DM Techniques - Regression

Overfitting Regression
  (figure; axes: x, y)

Seeing is Knowing

Performance Dashboard

Data Preprocessing
  - Real data are often surprisingly dirty.
  - A Major Challenge for Data Mining
  - Typical Issues
      - Missing Attribute Values
      - Different Coding/Naming Schemes
      - Infeasible Values
      - Inconsistent Data
      - Outliers
  - Data Quality
      - Accuracy
      - Completeness
      - Consistency
      - Interpretability
      - Credibility
      - Timeliness

Data Preprocessing
  - Data Cleaning
      - Fill in missing values.
      - Correct inconsistent data.
      - Identify outliers and noisy data.
  - Data Integration
      - Combine data from different sources.
  - Data Transformation
      - Normalization
      - Aggregation
      - Type Conversion
  - Data Reduction
      - Feature Selection
      - Sampling
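
Two of the preprocessing steps listed above, filling in missing values and normalization, can be sketched in a few lines. The slides do not say which imputation or scaling method is meant; mean imputation and min-max scaling are assumed here as common simple choices, and the function names and the `ages` sample are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up attribute column with one missing value.
ages = [20, None, 40, 60]
filled = fill_missing_with_mean(ages)  # the None becomes the mean of 20, 40, 60
scaled = min_max_normalize(filled)

print(filled)
print(scaled)
```

Mean imputation is easy to apply but shrinks the variance of the attribute, so it is best treated as a baseline rather than a default.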

Internet Privacy

Privacy Protection
  - Data: A Double-Edged Sword
      - People can benefit greatly from data analysis.
      - The consequences of information leakage can be catastrophic.
  - People may be reluctant to give sensitive information due to privacy concerns.
      - Drug, Tax, Sexuality
  - How to find out the percentage of people with a certain attribute?
      - The interviewer should not know the true answer of each respondent.
  - Randomized Response
      - Used in structured survey research.
      - Can maintain the confidentiality of respondents.

Privacy Protection
  - Two questions are presented:
      - Q1: I have the attribute A.
      - Q2: I do not have the attribute A.
  - The respondent uses a random device to:
      - Answer Q1 with probability p.
      - Answer Q2 with probability 1-p.
  - The interviewer has no idea which question is answered.

Cloud Computing
  - Pay As You Go
  - Software as a Service
  - Platform as a Service
  - Infrastructure as a Service

Parallel Computing

Mobile Supercomputing

Intel MIC

The Big Picture

No Free Lunch
  - Why bother with so many different algorithms?
      - No algorithm is always superior to others.
      - No parameter setting is optimal over all problems.
  - Look for the best match between problem and algorithm.
      - Experience
      - Trial and Error
  - Factors to consider:
      - Applicability
      - Computational Complexity
      - Interpretability
  - Always start with simple ones.

Just in Case Someone Asks

Grouping
  (figure; axes: X, Y; labels: Group A, Group B)

Violent Crime vs. Video Game

Tricky?
  - Is there a correlation between height and business success?
  - The average American male is 5'9".
  - Only 3.9% of adult American men are taller than 6'2".
  - Around 30% of Fortune 500 CEOs are taller than 6'2".

Survivorship Bias

Is this true?

Where has the time gone?

Thank you for listening!
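
The randomized response scheme from the Privacy Protection slides (answer Q1 with probability p, Q2 with probability 1-p) stops short of showing how the interviewer recovers the population percentage. A minimal sketch of the standard estimator for this two-question design follows. The estimator is not in the slides: it comes from solving P(yes) = p·π + (1-p)·(1-π) for π, and it needs p ≠ 0.5. The function names, the simulated population, and the values p = 0.7 and π = 0.2 are illustrative assumptions.

```python
import random


def simulate_responses(true_flags, p, rng):
    """Simulate truthful yes/no answers under the two-question design.

    Each respondent answers Q1 ("I have A") with probability p and
    Q2 ("I do not have A") with probability 1 - p. Only the yes/no
    answer is recorded, never which question was answered.
    """
    answers = []
    for has_a in true_flags:
        if rng.random() < p:
            answers.append(has_a)       # truthful answer to Q1
        else:
            answers.append(not has_a)   # truthful answer to Q2
    return answers


def estimate_proportion(yes_fraction, p):
    """Estimate the share pi of people with attribute A.

    P(yes) = p * pi + (1 - p) * (1 - pi)
    =>  pi = (P(yes) - (1 - p)) / (2 * p - 1)
    """
    if p == 0.5:
        raise ValueError("p = 0.5 makes the answers uninformative")
    return (yes_fraction - (1 - p)) / (2 * p - 1)


rng = random.Random(0)                   # fixed seed so the run is repeatable
n, true_pi, p = 100_000, 0.2, 0.7        # made-up survey parameters
population = [rng.random() < true_pi for _ in range(n)]
answers = simulate_responses(population, p, rng)
estimate = estimate_proportion(sum(answers) / n, p)

print(f"true share: {true_pi}, estimated share: {estimate:.3f}")
```

The closer p is to 0.5, the stronger the privacy for each respondent but the noisier the estimate, because the denominator 2p - 1 shrinks toward zero.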