Text Mining - Concepts, Implementation, and Big Data Challenge
by: Taeho Jo
Springer-Verlag, 2018
ISBN: 9783319918150, 376 pages
Format: PDF, read online
Copy protection: watermark
Price: 149.79 EUR
Preface
Contents
Part I Foundation
1 Introduction
1.1 Definition of Text Mining
1.2 Texts
1.2.1 Text Components
1.2.2 Text Formats
1.3 Data Mining Tasks
1.3.1 Classification
1.3.2 Clustering
1.3.3 Association
1.4 Data Mining Types
1.4.1 Relational Data Mining
1.4.2 Web Mining
1.4.3 Big Data Mining
1.5 Summary
2 Text Indexing
2.1 Overview of Text Indexing
2.2 Steps of Text Indexing
2.2.1 Tokenization
2.2.2 Stemming
2.2.3 Stop-Word Removal
2.2.4 Term Weighting
2.3 Text Indexing: Implementation
2.3.1 Class Definition
2.3.2 Stemming Rule
2.3.3 Method Implementations
2.4 Additional Steps
2.4.1 Index Filtering
2.4.2 Index Expansion
2.4.3 Index Optimization
2.5 Summary
3 Text Encoding
3.1 Overview of Text Encoding
3.2 Feature Selection
3.2.1 Wrapper Approach
3.2.2 Principal Component Analysis
3.2.3 Independent Component Analysis
3.2.4 Singular Value Decomposition
3.3 Feature Value Assignment
3.3.1 Assignment Schemes
3.3.2 Similarity Computation
3.4 Issues of Text Encoding
3.4.1 Huge Dimensionality
3.4.2 Sparse Distribution
3.4.3 Poor Transparency
3.5 Summary
4 Text Association
4.1 Overview of Text Association
4.2 Data Association
4.2.1 Functional View
4.2.2 Support and Confidence
4.2.3 Apriori Algorithm
4.3 Word Association
4.3.1 Word Text Matrix
4.3.2 Functional View
4.3.3 Simple Example
4.4 Text Association
4.4.1 Functional View
4.4.2 Simple Example
4.5 Overall Summary
Part II Text Categorization
5 Text Categorization: Conceptual View
5.1 Definition of Text Categorization
5.2 Data Classification
5.2.1 Binary Classification
5.2.2 Multiple Classification
5.2.3 Classification Decomposition
5.2.4 Regression
5.3 Classification Types
5.3.1 Hard vs Soft Classification
5.3.2 Flat vs Hierarchical Classification
5.3.3 Single vs Multiple Viewed Classification
5.3.4 Independent vs Dependent Classification
5.4 Variants of Text Categorization
5.4.1 Spam Mail Filtering
5.4.2 Sentimental Analysis
5.4.3 Information Filtering
5.4.4 Topic Routing
5.5 Summary and Further Discussions
6 Text Categorization: Approaches
6.1 Machine Learning
6.2 Lazy Learning
6.2.1 K Nearest Neighbor
6.2.2 Radius Nearest Neighbor
6.2.3 Distance-Based Nearest Neighbor
6.2.4 Attribute Discriminated Nearest Neighbor
6.3 Probabilistic Learning
6.3.1 Bayes Rule
6.3.2 Bayes Classifier
6.3.3 Naive Bayes
6.3.4 Bayesian Learning
6.4 Kernel Based Classifier
6.4.1 Perceptron
6.4.2 Kernel Functions
6.4.3 Support Vector Machine
6.4.4 Optimization Constraints
6.5 Summary and Further Discussions
7 Text Categorization: Implementation
7.1 System Architecture
7.2 Class Definitions
7.2.1 Classes: Word, Text, and PlainText
7.2.2 Interface and Class: Classifier and KNearestNeighbor
7.2.3 Class: TextClassificationAPI
7.3 Method Implementations
7.3.1 Class: Word
7.3.2 Class: PlainText
7.3.3 Class: KNearestNeighbor
7.3.4 Class: TextClassificationAPI
7.4 Graphic User Interface and Demonstration
7.4.1 Class: TextClassificationGUI
7.4.2 Preliminary Tasks and Encoding
7.4.3 Classification Process
7.4.4 System Upgrading
7.5 Summary and Further Discussions
8 Text Categorization: Evaluation
8.1 Evaluation Overview
8.2 Text Collections
8.2.1 NewsPage.com
8.2.2 20NewsGroups
8.2.3 Reuter21578
8.2.4 OHSUMED
8.3 F1 Measure
8.3.1 Contingency Table
8.3.2 Micro-Averaged F1
8.3.3 Macro-Averaged F1
8.3.4 Example
8.4 Statistical t-Test
8.4.1 Student's t-Distribution
8.4.2 Unpaired Difference Inference
8.4.3 Paired Difference Inference
8.4.4 Example
8.5 Summary and Further Discussions
Part III Text Clustering
9 Text Clustering: Conceptual View
9.1 Definition of Text Clustering
9.2 Data Clustering
9.2.1 SubSubsectionTitle
9.2.2 Association vs Clustering
9.2.3 Classification vs Clustering
9.2.4 Constraint Clustering
9.3 Clustering Types
9.3.1 Static vs Dynamic Clustering
9.3.2 Crisp vs Fuzzy Clustering
9.3.3 Flat vs Hierarchical Clustering
9.3.4 Single vs Multiple Viewed Clustering
9.4 Derived Tasks from Text Clustering
9.4.1 Cluster Naming
9.4.2 Subtext Clustering
9.4.3 Automatic Sampling for Text Categorization
9.4.4 Redundant Project Detection
9.5 Summary and Further Discussions
10 Text Clustering: Approaches
10.1 Unsupervised Learning
10.2 Simple Clustering Algorithms
10.2.1 AHC Algorithm
10.2.2 Divisive Clustering Algorithm
10.2.3 Single Pass Algorithm
10.2.4 Growing Algorithm
10.3 K Means Algorithm
10.3.1 Crisp K Means Algorithm
10.3.2 Fuzzy K Means Algorithm
10.3.3 Gaussian Mixture
10.3.4 K Medoid Algorithm
10.4 Competitive Learning
10.4.1 Kohonen Networks
10.4.2 Learning Vector Quantization
10.4.3 Two-Dimensional Self-Organizing Map
10.4.4 Neural Gas
10.5 Summary and Further Discussions
11 Text Clustering: Implementation
11.1 System Architecture
11.2 Class Definitions
11.2.1 Classes in Text Categorization System
11.2.2 Class: Cluster
11.2.3 Interface: ClusterAnalyzer
11.2.4 Class: AHCAlgorithm
11.3 Method Implementations
11.3.1 Methods in Previous Classes
11.3.2 Class: Cluster
11.3.3 Class: AHC Algorithm
11.4 Class: ClusterAnalysisAPI
11.4.1 Class: ClusterAnalysisAPI
11.4.2 Class: ClusterAnalyzerGUI
11.4.3 Demonstration
11.4.4 System Upgrading
11.5 Summary and Further Discussions
12 Text Clustering: Evaluation
12.1 Introduction
12.2 Cluster Validations
12.2.1 Intra-Cluster and Inter-Cluster Similarities
12.2.2 Internal Validation
12.2.3 Relative Validation
12.2.4 External Validation
12.3 Clustering Index
12.3.1 Computation Process
12.3.2 Evaluation of Crisp Clustering
12.3.3 Evaluation of Fuzzy Clustering
12.3.4 Evaluation of Hierarchical Clustering
12.4 Parameter Tuning
12.4.1 Clustering Index for Unlabeled Documents
12.4.2 Simple Clustering Algorithm with Parameter Tuning
12.4.3 K Means Algorithm with Parameter Tuning
12.4.4 Evolutionary Clustering Algorithm
12.5 Summary and Further Discussions
Part IV Advanced Topics
13 Text Summarization
13.1 Definition of Text Summarization
13.2 Text Summarization Types
13.2.1 Manual vs Automatic Text Summarization
13.2.2 Single vs Multiple Text Summarization
13.2.3 Flat vs Hierarchical Text Summarization
13.2.4 Abstraction vs Query-Based Summarization
13.3 Approaches to Text Summarization
13.3.1 Heuristic Approaches
13.3.2 Mapping into Classification Task
13.3.3 Sampling Schemes
13.3.4 Application of Machine Learning Algorithms
13.4 Combination with Other Text Mining Tasks
13.4.1 Summary-Based Classification
13.4.2 Summary-Based Clustering
13.4.3 Topic-Based Summarization
13.4.4 Text Expansion
13.5 Summary and Further Discussions
14 Text Segmentation
14.1 Definition of Text Segmentation
14.2 Text Segmentation Type
14.2.1 Spoken vs Written Text Segmentation
14.2.2 Ordered vs Unordered Text Segmentation
14.2.3 Exclusive vs Overlapping Segmentation
14.2.4 Flat vs Hierarchical Text Segmentation
14.3 Machine Learning-Based Approaches
14.3.1 Heuristic Approaches
14.3.2 Mapping into Classification
14.3.3 Encoding Adjacent Paragraph Pairs
14.3.4 Application of Machine Learning
14.4 Derived Tasks
14.4.1 Temporal Topic Analysis
14.4.2 Subtext Retrieval
14.4.3 Subtext Synthesization
14.4.4 Virtual Text
14.5 Summary and Further Discussions
15 Taxonomy Generation
15.1 Definition of Taxonomy Generation
15.2 Relevant Tasks to Taxonomy Generation
15.2.1 Keyword Extraction
15.2.2 Word Categorization
15.2.3 Word Clustering
15.2.4 Topic Routing
15.3 Taxonomy Generation Schemes
15.3.1 Index-Based Scheme
15.3.2 Clustering-Based Scheme
15.3.3 Association-Based Scheme
15.3.4 Link Analysis-Based Scheme
15.4 Taxonomy Governance
15.4.1 Taxonomy Maintenance
15.4.2 Taxonomy Growth
15.4.3 Taxonomy Integration
15.4.4 Ontology
15.5 Summary and Further Discussions
16 Dynamic Document Organization
16.1 Definition of Dynamic Document Organization
16.2 Online Clustering
16.2.1 Online Clustering in Functional View
16.2.2 Online K Means Algorithm
16.2.3 Online Unsupervised KNN Algorithm
16.2.4 Online Fuzzy Clustering
16.3 Dynamic Organization
16.3.1 Execution Process
16.3.2 Maintenance Mode
16.3.3 Creation Mode
16.3.4 Additional Tasks
16.4 Issues of Dynamic Document Organization
16.4.1 Text Representation
16.4.2 Binary Decomposition
16.4.3 Transition into Creation Mode
16.4.4 Variants of DDO System
16.5 Summary and Further Discussions
References
Index