Practical text mining with Perl

Bilisoly, Roger

90,77 € (VAT incl.)

Contents

As a discipline, text mining is relatively young within the larger field of data mining, but it is expected to become more prevalent as methodology matures and as computing power increases. This book covers text mining ideas from several perspectives (statistics, data mining, linguistics, and information retrieval) and shows readers how to actually perform text mining tasks using Perl. End-of-chapter exercises and an extensive case study are also included. The book is appropriate for data mining analysts, computational biologists, software engineers, students, and anyone interested in extracting information from text documents.

Table of contents:
Preface
Acknowledgments
1. Introduction 1.1 Overview of this Book 1.2 Text Mining and Related Fields 1.2.1 Chapter 2: Pattern Matching 1.2.2 Chapter 3: Data Structures 1.2.3 Chapter 4: Probability 1.2.4 Chapter 5: Information Retrieval 1.2.5 Chapter 6: Corpus Linguistics 1.2.6 Chapter 7: Multivariate Statistics 1.2.7 Chapter 8: Clustering 1.2.8 Chapter 9: Three Additional Topics 1.3 Advice for Reading this Book
2. Text Patterns 2.1 Introduction 2.2 Regular Expressions 2.2.1 First Regex: Finding the Word ‘Cat’ 2.2.2 Character Ranges and Finding Telephone Numbers 2.2.3 Testing Regexes with Perl 2.3 Finding Words in a Text 2.3.1 Regex Summary 2.3.2 Nineteenth Century Literature 2.3.3 Perl Variables and the Function split 2.3.4 Match Variables 2.4 Decomposing Poe’s ‘The Tell-Tale Heart’ into Words 2.4.1 Dashes and String Substitutions 2.4.2 Hyphens 2.4.3 Apostrophes 2.5 A Simple Concordance 2.5.1 Command Line Arguments 2.5.2 Writing to Files 2.6 First Attempt at Extracting Sentences 2.6.1 Sentence Segmentation Preliminaries 2.6.2 Sentence Segmentation for ‘A Christmas Carol’ 2.6.3 Leftmost Greediness and Sentence Segmentation 2.7 Regex Odds and Ends 2.7.1 Match Variables and Backreferences 2.7.2 Regular Expression Operators and Their Output 2.7.3 Lookaround 2.8 References Problems
3. Quantitative Text Summaries 3.1 Introduction 3.2 Scalars, Interpolation and Context in Perl 3.3 Arrays and Context in Perl 3.4 Word Length Application 3.5 Arrays and Functions 3.5.1 Adding and Removing Entries from Arrays 3.5.2 Selecting Subsets of an Array 3.5.3 Sorting an Array 3.6 Hashes 3.6.1 Using a Hash 3.7 Two Text Applications 3.7.1 Zipf’s Law 3.7.2 Perl for Word Games 3.7.2.1 An Aid to Crossword Puzzles 3.7.2.2 Word Anagrams 3.7.2.3 Finding Words in a Set of Letters 3.8 Complex Data Structures 3.8.1 References and Pointers 3.8.2 Arrays of Arrays and Beyond 3.8.3 Application: Comparing the Words in Two Poe Stories 3.9 References 3.10 First Transition Problems
4. Probability and Texts 4.1 Introduction 4.2 Probability 4.2.1 Probability and Coin Flipping 4.2.2 Probabilities and Texts 4.2.2.1 Estimating Letter Probabilities 4.2.2.2 Estimating Letter Bigram Probabilities 4.3 Conditional Probability 4.3.1 Independence 4.4 Mean and Variance of Random Variables 4.4.1 Sampling and Error Estimates 4.5 The Bag-of-Words Model 4.6 The Effect of Sample Size 4.6.1 Tokens vs. Types 4.7 References Problems
5. Applying Information Retrieval to Text Mining 5.1 Introduction 5.2 Text Counts and Vectors 5.2.1 Counting Words with Perl 5.2.2 Pronouns 5.3 Text Counts and Vectors 5.3.1 Vectors and Angles 5.3.2 Computing Angles between Vectors 5.3.2.1 Subroutines in Perl 5.3.2.2 Computing the Angle between Vectors 5.4 The Term-Document Matrix 5.5 Matrix Multiplication 5.5.1 A Text Application of Matrix Multiplication 5.6 Functions of Counts 5.7 Document Similarity 5.7.1 Inverse Document Frequency 5.7.2 Poe Story Angles Revisited 5.8 References Problems
6. Concordance Lines and Corpus Linguistics 6.1 Introduction 6.2 Sampling 6.2.1 Statistical Survey Sampling 6.2.2 Text Sampling 6.3 Corpus as Baseline 6.3.1 Function vs. Content Words 6.4 Concordancing 6.4.1 Sorting Concordance Lines 6.4.1.1 Code for Sorting Concordance Lines 6.4.2 Application: Word Usage 6.4.3 Application: Word Morphology 6.5 Collocations and Concordance Lines 6.5.1 More Ways to Sort Concordance Lines 6.5.2 Application: Phrasal Verbs 6.5.3 Rare Events 6.6 Applications with References 6.7 Second Transition Problems
7. Multivariate Techniques with Text 7.1 Introduction 7.2 Basic Statistics 7.2.1 Z-Scores 7.2.2 Z-Scores and Correlations 7.2.3 Correlations and Cosines 7.2.4 Correlations and Covariances 7.3 Basic Linear Algebra 7.3.1 2 by 2 Correlation Matrices 7.4 Principal Components Analysis 7.4.1 Finding the Principal Components 7.4.2 PCA and the 68 Poe Short Stories 7.4.3 Another PCA Example with Poe’s Short Stories 7.4.4 Rotations 7.5 Text Applications 7.5.1 A Word on Factor Analysis 7.6 Applications and References Problems
8. Text Clustering 8.1 Introduction 8.2 Clustering 8.2.1 Two Variable Example of K-Means 8.2.2 K-Means with R 8.2.3 ‘He’ versus ‘She’ in Poe’s Short Stories 8.2.4 Poe Clusters using Eight Pronouns 8.2.5 Clustering Poe using Principal Components 8.2.6 Hierarchical Clustering of Poe’s Short Stories 8.3 A Note on Classification 8.3.1 Decision Trees and Over-fitting 8.4 References 8.5 Last Transition Problems
9. Three Additional Topics 9.1 Introduction 9.2 Perl Modules 9.2.1 Modules for Number Words 9.2.2 The StopWords Module 9.2.3 The Sentence Segmentation Module 9.2.4 An Object Oriented Module for Tagging 9.2.5 Miscellaneous Modules 9.3 Other Languages: German 9.4 Permutation Tests 9.4.1 Runs and Hypothesis Testing 9.4.2 Distribution of Character Names 9.5 References
Appendix A: Overview of Perl for Text Mining A.1 Basic Data Structures A.1.1 Special Variables and Arrays A.2 Operators A.3 Branching and Looping A.4 A Few Perl Functions A.5 Introduction to Regular Expressions
Appendix B: Summary of R used in this Book B.1 Basics of R B.1.1 Data Entry B.1.2 Basic Operators B.1.3 Matrix Manipulation B.2 This Book’s R Code
References

Book details

ISBN: 978-0-470-17643-6
Publisher: John Wiley & Sons
Binding: Hardcover
Pages: 320
Publication date: 05/09/2008
No. of volumes: 1
Language: English