My work: March 2005

Sunday, March 27, 2005

Algorithms in the Real World (Guy Blelloch, Fall 97)

http://www-2.cs.cmu.edu/~guyb/real-world/compress/

Readings for Algorithms for Indexing and Searching

http://www-2.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15850-s99/www/readings.html

Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994. Chapter 3: Indexing

Christos Faloutsos. Searching Multimedia Data Bases by Content. Kluwer Academic, 1996.
Frakes and Baeza-Yates (eds.). Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
Frants V. J., Shapiro J., Voiskunskii V. G. Automated Information Retrieval: Theory and Methods. Academic Press, Aug 1997.
Lesk M. Practical Digital Libraries: Books, Bytes, and Bucks. Morgan Kaufmann Publishers, 1997.
Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
Sparck Jones K. and Willett P. (editors). Readings in Information Retrieval. Morgan Kaufmann Publishers, 1997.
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.

Thursday, March 24, 2005

Some papers about feature selection

http://www.tsi.enst.fr/~campedel/Biblio/biblio.html

Feature selection

Overview
"An Introduction to Variable and Feature Selection" [pdf]
Isabelle Guyon & Andre Elisseeff
Journal of Machine Learning Research 3 (2003) 1157-1182

related papers [Link]

"Selection of Relevant Features and Examples in Machine Learning" [ps]
Avrim L. Blum & Pat Langley (1997)
Special issue of Artificial Intelligence on 'Relevance', R.Greiner and D.Subramanian(Eds.)

"Wrappers for Feature Subset Selection" [ps]
Ron Kohavi and George H. John (1997)

"Attribute Selection for Modeling" [pdf]
I. Kononenko and S.J. Hong
Future Generation Computer Systems, November 1997

Supervised methods
"Dimensionality Reduction via Sparse Support Vector Machines" [pdf]
Jinbo Bi, Kristin P. Bennett, Mark Embrechts, Curt M. Breneman and Minghu Song
Journal of Machine Learning Research 3 (2003) 1229-1243

"Feature Extraction by Non-Parametric Mutual Information Maximization" [pdf]
Kari Torkkola
Journal of Machine Learning Research 3 (2003) 1415-1438

"Use of the Zero-Norm with Linear Models and Kernel Methods" [pdf] similar : [ps]
Jason Weston, Andre Elisseeff, Bernhard Scholkopf and Mike Tipping
Journal of Machine Learning Research 3 (2003) 1439-1461

"Iterative Relief" [pdf]
Bruce Draper, Carol Kaito and Jose Bins (2003)

"Theoretical and Empirical Analysis of ReliefF and RReliefF" [pdf]
Marko Robnik-Sikonja and Igor Kononenko
Machine Learning 53 (2003) 23-69

"Bayesian Learning of Sparse Classifiers" [pdf]
Mario A.T. Figueiredo & Anil K. Jain (2001)

"Gene Selection for Cancer Classification using Support Vector Machines" [pdf]
Isabelle Guyon, Jason Weston, Stephen Barnhill, M.D. and Vladimir Vapnik
submitted to Machine Learning (2000)

Unsupervised methods
"Feature Selection for Unsupervised and Supervised Inference: the Emergence of Sparsity in a Weighted-based Approach" [pdf] + more details [pdf]
Lior Wolf and Amnon Shashua (2003)

"Unsupervised Feature Selection Using Multi-Objective Genetic Algorithms for Handwritten Word Recognition" [pdf]
M. Morita, R. Sabourin, F. Bortolozzi and C.Y. Suen (2003)

"Unsupervised Feature Selection Using Feature Similarity" [pdf]
Pabitra Mitra, C.A. Murthy and Sankar K. Pal
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, April 2002

"Evolutionary model selection in unsupervised learning" [pdf]
YongSeog Kim, W. Nick Street and Filippo Menczer
Intelligent Data Analysis 6 (2002) 531-556

"Unsupervised Clustering and Feature Discrimination with Application to Image Database Categorization" [pdf]
Hichem Frigui, Nozha Boujemaa and Soon-Ann Lim (2001)

Tuesday, March 15, 2005

a lot of useful code

http://cis.stvincent.edu/swd/index.html

Thursday, March 10, 2005

Analysis of sequential, temporal and spatial data

http://www.cs.helsinki.fi/u/gionis/seminar.html

Overview
Many interesting data-mining applications rely on processing sequential, temporal, and/or spatial data. Examples include mining of genetic sequences, pattern discovery and rule extraction in time series, understanding the market from stock price movements, maintaining information about moving agents in a field, modeling biological data distributed over a geographical terrain, etc.

This seminar focuses on studying recent research in the above-mentioned areas. The objectives of the seminar are

to provide an overview of the latest papers in the area,
to study common underlying techniques, and
to help identify potential research projects.

Topics to be addressed include techniques for pattern discovery, indexing, clustering, and segmentation.

Format and Participation
The format of the seminar will be weekly presentations from the participants. Discussion will follow the presentations.

Students who make one presentation and show adequate attendance will receive 2 credit units. It is also possible to work on a research or programming project and receive 3 credit units. Auditors (no requirements and no credit units) are welcome.

Topics
Sequential data
Frequent-subsequences mining
Discovery of frequent episodes in event sequences, Mannila, Toivonen, and Verkamo, Data Mining and Knowledge Discovery, 1997.
SPADE: An Efficient Algorithm for Mining Frequent Sequences, Zaki, Machine Learning, 2000.
Reliable Detection of Episodes in Event Sequences, Gwadera, Atallah, and Szpankowski, ICDM, 2003.
Structure discovery
DNA Segmentation as A Model Selection Process, Li, RECOMB, 2001.
An Unsupervised Algorithm for Segmenting Categorical Timeseries into Episodes, Cohen, Heeringa, and Adams, ICDM, 2002.
Regulatory Element Detection using a Probabilistic Segmentation Model, Bussemaker, Li, and Siggia, ISMB, 2002.
Sequence Modeling with Mixtures of Conditional Maximum Entropy Distributions, Pavlov, ICDM, 2003.

Temporal data
Similarity search
A signature technique for similarity-based queries, Faloutsos, Jagadish, Mendelzon, and Milo, International Conference on Compression and Complexity of Sequences, 1997.
Finding similar time series, Das, Gunopulos, and Mannila, European Symposium on Principles of Data Mining and Knowledge Discovery, 1997.
Time-Series Similarity Problems and Well-Separated Geometric Sets, Bollobas, Das, Gunopulos, and Mannila, Nordic Journal of Computing, 2001.
Fast similarity search in the presence of noise, scaling, and translation in time-series databases, Agrawal, Lin, Sawhney, and Shim, VLDB, 1995.
Similarity-based queries for time series data, Rafiei and Mendelzon, ICDE, 1997.
Efficiently supporting ad hoc queries in large datasets of time sequences, Korn, Jagadish, and Faloutsos, SIGMOD, 1997.
Locally adaptive dimensionality reduction for indexing large time series databases, Keogh, Chakrabarti, Mehrotra, and Pazzani, SIGMOD, 2001.
Pattern discovery
Finding patterns in time series: a dynamic programming approach, Berndt and Clifford, Advances in Knowledge Discovery and Data Mining, 1996.
Rule discovery from time series, Das, Lin, Mannila, Renganathan, and Smyth, KDD, 1998.
Event detection from time series data, Guralnik and Srivastava, SIGKDD, 1999.
A general probabilistic framework for clustering individuals and objects, Cadez, Gaffney, and Smyth, SIGKDD, 2000.
Finding simple intensity descriptions from event sequence data, Mannila and Salmenkivi, SIGKDD, 2001.
Mining surprising patterns using temporal description length, Chakrabarti, Sarawagi, and Dom, VLDB, 1998.
Infominer: mining surprising periodic patterns, Yang, Wang, and Yu, SIGKDD, 2001.
Finding Surprising Patterns in a Time Series Database in Linear Time and Space, Keogh, Lonardi, and Chiu, SIGKDD, 2002.
Finding Motifs in Time Series, Lin, Keogh, Lonardi, and Patel, Second Workshop on Temporal Data Mining, 2002.
A New Approach to Analyzing Gene Expression Time Series Data, Bar-Joseph, Gerber, Gifford, and Jaakkola, RECOMB, 2002.
Bursty and Hierarchical Structure in Streams, Kleinberg, SIGKDD, 2002.
Segmentation
An Online Algorithm for Segmenting Time Series, Keogh, Chu, Hart, and Pazzani, ICDM, 2001.
Finding recurrent sources in sequences, Gionis and Mannila, RECOMB, 2003.

Spatial data
Clustering
Clustering for Mining in Large Spatial Databases, Ester, Kriegel, and Sander, Special Issue on Data Mining, KI-Journal, 1998.
Clustering Spatial Data Using Random Walks, Harel and Koren, SIGKDD, 2001.
A Hypergraph Based Clustering Algorithm for Spatial Data Sets, Cherng and Lo, ICDM, 2001.
Mining frequent neighboring class sets in spatial databases, Morimoto, SIGKDD, 2001.
Mining Confident Co-location Rules without a Support Threshold, Huang, Xiong and Shekhar, ACM SAC, 2003.
Data Mining Techniques for Autonomous Exploration of Large Volumes of Geo-referenced Crime Data, Estivill-Castro and Lee, International Conference on Geocomputation, 2001.
A Weighted Average Likelihood Ratio Test for Spatial Disease Clustering, Gangnon and Clayton, Statistics in Medicine, 2001.

Data Mining - Reading List

Books

Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2001.
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, The MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001
T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2001

Papers
Introduction (INR)

V. Ganti, J. Gehrke, R. Ramakrishnan. Mining very large databases. COMPUTER, 32(8):38-45, 1999.
Michael Goebel and Le Gruenwald, “A Survey of Data Mining Software Tools”, ACM SIGKDD Explorations, June 1999. Volume 1, Issue 1
David Hand, “Statistics and Data Mining: Intersecting Disciplines”, ACM SIGKDD Explorations, June 1999. Volume 1, Issue 1
S. Chaudhuri, U. Dayal, and V. Ganti, Database Technology for Decision Support Systems. Computer, 34(12):48-55, Dec. 2001.

Data Preprocessing

D. Barbará et al. The New Jersey Data Reduction Report. Bulletin of the Technical Committee on Data Engineering, 20, Dec. 1997, pp. 3-45.
Liu H.; Hussain F.; Tan C.L.; Dash M. Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6(4): 393-423, 2002.
V. Raman and J. M. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System, Proc. 2001 Int. Conf. on Very Large Data Bases (VLDB'01), Rome, Italy, pp. 381-390, Sept. 2001.
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative Data Cleaning: Language, Model, and Algorithms. Proc. 2001 Int. Conf. on Very Large Data Bases (VLDB'01), Rome, Italy, pp. 371-380, Sept. 2001.
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining Database Structure; Or, How to Build a Data Quality Browser. Proc. 2002 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'02), Madison, WI, pp. 240-251, June 2002.

Data Warehouse, OLAP, and Data Generalization

S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1):65-74, 1997.
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1(1):29-54, 1997.
V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In SIGMOD'96, pp. 205-216, Montreal, Canada, June 1996.
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pp. 506-521, Bombay, India, Sept. 1996.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD'97, pp. 159-170, Tucson, Arizona, May 1997.
R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc. 1997 Int. Conf. Data Engineering (ICDE'97), Birmingham, England, April 1997.
S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. In Proc. Int. Conf. of Extending Database Technology (EDBT'98), Valencia, Spain, pp. 168-182, March 1998.
S. Sarawagi. Explaining Differences in Multidimensional Aggregates. In Proc. Int. Conf. of Very Large Data Bases (VLDB'99), pp. 42-53.
K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. In EDBT'98, pp. 263-277, Valencia, Spain, March 1998.
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In SIGMOD'99, pp. 359-370, Philadelphia, PA, June 1999.
J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In SIGMOD'01, pp. 1-12, Santa Barbara, CA, May 2001.
G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. In VLDB'01, Rome, Italy, Sept. 2001.
W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. In Proc. 2002 Int. Conf. Data Engineering (ICDE'02), San Francisco, CA, April 2002.
L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the Semantics of a Data Cube, Proc. 2002 Int. Conf. on Very Large Data Bases (VLDB'02), Hong Kong, China, Aug. 2002.
D. Xin, J. Han, X. Li, B. W. Wah, “Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration”, Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03), Berlin, Germany, Sept. 2003.
J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998.
J. Han, Y. Cai and N. Cercone, Knowledge Discovery in Databases: An Attribute-Oriented Approach. In VLDB'92, Vancouver, Canada, August 1992, pp. 547-559.
G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. In Proc. Int. Conf. of Very Large Data Bases (VLDB'01), Rome, Italy, pp. 531-540

Mining Frequent Patterns and Association Rules in Large Databases

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499, Santiago, Chile, Sept. 1994.
J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In VLDB'95, pp. 420-431, Zürich, Switzerland, Sept. 1995.
R. Srikant and R. Agrawal. Mining generalized association rules. In VLDB'95, pp. 407-419, Zürich, Switzerland, Sept. 1995.
R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In SIGMOD'96, pp. 1-12, Montreal, Canada, June 1996.
B. Lent, A. Swami, and J. Widom. Clustering association rules. In ICDE'97, pp. 220-231, Birmingham, England, April 1997.
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. In SIGMOD'97, pp. 265-276, Tucson, Arizona, May 1997.
R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In SIGMOD'98, pp. 13-24, Seattle, Washington, June 1998.
Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules. Proc. 1999 Int. Conf. Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, pp. 261-270, Aug. 1999.
J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based, multidimensional data mining. COMPUTER, 32(8): 46-50, 1999.
J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation, Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000.
J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints, Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases, Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, Nov. 2001.
Zaki and Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining, Proc. 2002 SIAM Int. Conf. Data Mining (SDM'02), Arlington, VA, pp. 457-473, April 2002.
J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets”, Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003.
Y. Xu, J. X. Yu, G. Liu, H. Lu, From Path Tree To Frequent Patterns: A Framework for Mining Frequent Patterns, Proc. 2002 Int. Conf. on Data Mining (ICDM'02), Japan, Dec. 2002.
F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki, CARPENTER: Finding Closed Patterns in Long Biological Datasets, Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003.
G. Liu, H. Lu, Y. Xu, J. X. Yu, Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns, Proc. 2003 Int. Conf. on Database Systems for Advanced Applications (DASFAA’03), Kyoto, Japan, March 2003.
G. Liu, H. Lu, W. Lou, J. X. Yu, On Computing, Storing and Querying Frequent Patterns, Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003.
J. Han, J. Wang, Y. Lu, and P. Tzvetkov, “Mining Top-K Frequent Closed Patterns without Minimum Support”, Proc. 2002 Int. Conf. on Data Mining (ICDM'02), Maebashi, Japan, Dec. 2002.
Mohammad El-Hajj and Osmar R. Zaïane, Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining, in Proc. 2003 Int'l Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD), Washington, DC, USA, August 24-27, 2003
Mohammad El-Hajj and Osmar R. Zaïane, Non Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations, in Proc. of 5th International Conference on Data Warehousing and Knowledge Discovery (DaWak'2003), Prague, Czech Republic, September 3-5, 2003
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In VLDB'96, pp. 122-133, Bombay, India, Sept. 1996.
T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3(4): 373-408, 1999.
A. Savasere, E. Omiecinski, S. B. Navathe, Mining for Strong Negative Associations in a Large Database of Customer Transactions, In ICDE'98, Orlando, Florida, Feb. 1998.
E. Omiecinski. Alternative Interest Measures for Mining Associations, IEEE Trans. Knowledge and Data Engineering, 15(1):57-69, 2003.
Cristian Bucila, Johannes Gehrke, Daniel Kifer, Walker White: DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. Data Mining and Knowledge Discovery, Vol. 7, Issue 4, July 2003, pages 241-272.
B. Goethals, M. Zaki: FIMI: Workshop on Frequent Itemset Mining Implementations (An Introduction). ICDM-FIMI Workshop, Melbourne, Florida, Nov. 2003.
Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava, Selecting the Right Interestingness Measure for Association Patterns. In Proc. of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, pp. 32-41, 2002.

Classification and Prediction

J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In VLDB'96, pp. 544-555, Bombay, India, Sept. 1996.
J. Gehrke, R. Ramakrishnan, V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In VLDB'98, pp. 416-427, New York, NY, August 1998.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree Construction. In SIGMOD'99, Philadelphia, Pennsylvania, 1999.
S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), New York, NY, Aug. 1998.
W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, Nov. 2001.
X. Yin and J. Han, “CPAR: Classification based on Predictive Association Rules”, Proc. 2003 SIAM Int. Conf. on Data Mining (SDM'03), San Francisco, CA, May 2003.
H. Yu, J. Yang, and J. Han, “Classifying Large Data Sets Using SVM with Hierarchical Clusters”, Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003.

Cluster Analysis

R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In VLDB'94, pp. 144-155, Santiago, Chile, Sept. 1994.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In SIGMOD'96, pp. 103-114, Montreal, Canada, June 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. In KDD'96, pp. 226-231, Portland, Oregon, August 1996.
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In SIGMOD'98, pp. 73-84, Seattle, Washington, June 1998.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD'98, pp. 94-105, Seattle, Washington, June 1998.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure. In SIGMOD'99, pp. 49-60, Philadelphia, PA, June 1999.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In VLDB'98, pp. 428-439, New York, NY, August 1998.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases, Proc. 2001 Int. Conf. on Database Theory (ICDT'01), London, U.K., Jan. 2001.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles, Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets, Proc. the ACM SIGMOD International Conference on Management of Data (SIGMOD), Madison, Wisconsin, 2002.
Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD'02), Edmonton, Alberta, Canada, 2002.

Stream Data Mining (STR)

S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams, Proc. IEEE Symposium on Foundations of Computer Science (FOCS'00), Redondo Beach, CA, pp. 359-366, 2000
S. Babu and J. Widom. Continuous Queries over Data Streams. SIGMOD Record, pp. 109-120, Sept. 2001.
B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom, “Models and Issues in Data Stream Systems”, Proc. 2002 ACM-SIGACT/SIGART/SIGMOD Int. Conf. on Principles of Database Systems (PODS'02), Madison, WI, June 2002. (Conference tutorial)
M. Garofalakis, J. Gehrke, R. Rastogi, “Querying and Mining Data Streams: You Only Get One Look”, Tutorial at 2002 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'02), Madison, WI, June 2002.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-Dimensional Regression Analysis of Time-Series Data Streams”, Proc. 2002 Int. Conf. on Very Large Data Bases (VLDB'02), Hong Kong, China, Aug. 2002.
Stratis Viglas, Jeffrey Naughton, Rate-Based Query Optimization for Streaming Information Sources, SIGMOD’02
Samuel Madden, Mehul Shah, Joseph Hellerstein, Vijayshankar Raman, Continuously Adaptive Continuous Queries over Streams, SIGMOD’02
Alin Dobra, Minos N. Garofalakis, Johannes Gehrke, Rajeev Rastogi, Processing Complex Aggregate Queries over Data Streams, SIGMOD’02
Gurmeet Singh Manku, Rajeev Motwani. Approximate Frequency Counts over Data Streams, VLDB’02
Yunyue Zhu, Dennis Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, VLDB’02
J. Gehrke, F. Korn, D. Srivastava. On computing correlated aggregates over continuous data streams. Proc. 2001 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'01), Santa Barbara, CA, pp. 13-24, May 2001.
Geoff Hulten, Laurie Spencer, Pedro Domingos: Mining time-changing data streams. KDD 2001: 97-106
J. Han, “Mining Dynamics of Data Streams in Multidimensional Space” (in PowerPoint), ICDM'02 Keynote Speech, Maebashi City, Japan, Dec. 2002.
C. Aggarwal, J. Han, J. Wang, P. S. Yu, “A Framework for Clustering Evolving Data Streams”, Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03), Berlin, Germany, Sept. 2003.
H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining Concept-Drifting Data Streams using Ensemble Classifiers”, Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003.
C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu, “Mining Frequent Patterns in Data Streams at Multiple Time Granularities”, H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), Next Generation Data Mining, 2003.

Spatio-temporal and Time-series Data Mining (STT)

K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. In Proc. 4th Int'l Symp. on Large Spatial Databases (SSD'95), pp. 47-66, Portland, Maine, Aug. 1995.
S. Shekhar, P. Zhang, Y. Huang, R. Vatsavai, Trends in Spatial Data Mining, as a chapter to appear in Data Mining: Next Generation Challenges and Future Directions, Hillol Kargupta and Anupam Joshi (eds.), AAAI/MIT Press, 2003.
J. Han, R. B. Altman, V. Kumar, H. Mannila and D. Pregibon, “Emerging Scientific Applications in Data Mining”, Communications of ACM, 45(8):54-58, 2002.
Shashi Shekhar and Yan Huang, “Discovering Spatial Co-location Patterns: a Summary of Results”, In Proc. of 7th Intl. Symp. on Spatial and Temporal Databases (SSTD), Redondo Beach, CA, July 2001
R. Agrawal and R. Srikant. Mining sequential patterns. In ICDE'95, pp. 3-14, Taipei, Taiwan, March 1995.
Mannila H.; Toivonen H.; Inkeri Verkamo A., Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery, 1997, vol. 1, no. 3, pp. 259-289.
Ester M., Kriegel H.-P., Sander J., Algorithms and Applications for Spatial Data Mining, in: Geographic Data Mining and Knowledge Discovery, Research Monographs in GIS, Taylor and Francis, 2001, pp. 160-187.
M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. In Proc. 1999 Int. Conf. Very Large Data Bases (VLDB'99), pp. 223-234, Edinburgh, UK, Sept. 1999.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
R. Agrawal, K.-I. Lin, H.S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In VLDB'95, pp. 490-501, Zurich, Switzerland, Sept. 1995.
Y.-S. Moon, K.-Y. Whang, W.-K. Loh. Duality-Based Subsequence Matching in Time-Series Databases, Proc. 2001 Int. Conf. Data Engineering (ICDE'01), Heidelberg, Germany, pp. 263-272, April 2001.
R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. In VLDB'95, pp. 502-514, Zürich, Switzerland, Sept. 1995.
J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In ICDE'99, pp. 106-115, Sydney, Australia, April 1999.
J. Pei, J. Han, and W. Wang, “Mining Sequential Patterns with Constraints in Large Databases”, Proc. 2002 Int. Conf. on Information and Knowledge Management (CIKM'02), Washington, D.C., Nov. 2002.
X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, Proc. 2002 Int. Conf. on Data Mining (ICDM'02), Maebashi, Japan, Dec. 2002.
X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003.
X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns in Large Datasets”, Proc. 2003 SIAM Int. Conf. on Data Mining (SDM'03), San Francisco, CA, May 2003.
S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations, Proc. 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Steve Klooster, Christopher Potter, Discovery of Climate Indices using Clustering, Proc of the Ninth ACM SIGKDD Int'l Conf on Knowledge Discovery and Data Mining (KDD-2003), Washington, DC, Aug 24-27 (2003).

Information Retrieval and Web Mining (IRW)

S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the Web's link structure. COMPUTER, 32(8):60-67, 1999.
J. M. Kleinberg. “Authoritative Sources in a Hyperlinked Environment”. Journal of ACM, 46(5):604-632, 1999.
H. Yu, J. Han, and K. C.-C. Chang, “PEBL: Positive Example Based Learning for Web Page Classification Using SVM”, Proc. 2002 Int. Conf. on Knowledge Discovery in Databases (KDD'02), Edmonton, Canada, July 2002.
K. Wang, S. Zhou and S. C. Liew. “Building hierarchical classifiers using class proximity”. In VLDB'99, Edinburgh, UK, Sept. 1999.
Mukund Deshpande and George Karypis, Selective Markov Models for Predicting Web-Page Accesses, 1st SIAM Data Mining Conference, 2001
J. Han, and K. C.-C. Chang, “Data Mining for Web Intelligence”, Computer, Nov. 2002
Chris Ridings and Mike Shishigin, “PageRank Uncovered”, Google Tech, September, 2002
Pang-Ning Tan, Vipin Kumar, “Discovery of Web Robot Sessions based on their Navigational Patterns”, Data Mining and Knowledge Discovery, 6(1): 9-35 (2002)

Bio-mining (BIO)

J. Yang, P. Yu, W. Wang, and J. Han, “Mining Long Sequential Patterns in a Noisy Environment”, Proc. 2002 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'02), Madison, WI, June 2002.
Ying Zhao and George Karypis, “Prediction of Contact Maps Using Support Vector Machines”, IEEE Symposium on Bioinformatics and Bioengineering, 2003
Mukund Deshpande, Michihiro Kuramochi, and George Karypis, Frequent Sub-structure Based Approaches for Classifying Chemical Compounds, IEEE International Conference on Data Mining, 2003
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets, Proc. the ACM SIGMOD International Conference on Management of Data (SIGMOD), Madison, Wisconsin, 2002.

Visual Data Mining (VIS)

Tutorial KDD-2002 on "Visual Data Mining: Background, Techniques, and Drug Discovery Applications" by M. Ankerst, G. Grinstein, and D. Keim, Tutorial Notes (14 MByte), Edmonton, Canada.
Tutorial IEEE Visualization 2000 on "An Introduction to Information Visualization Techniques for Exploring Large Databases", Tutorial Notes (11 MByte)

Data Mining Applications and Trends in Data Mining (TRD)

H. Mannila, Theoretical Frameworks of Data Mining. SIGKDD Explorations, 1(2): 30-32, 2000.
C. Clifton and D. Marks. Security and Privacy Implications of Data Mining. In Proc. 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), Montreal, Canada, pp. 15-20, June 1996.
R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pages 439-450, Dallas, TX, May 2000.

Data Mining - Related Websites

http://www.kdnuggets.com/
Popular site containing links to data sets used for KDD research. Also for data mining consulting, data mining jobs and advertising

http://www.ualberta.ca/~unisecr/policy/sec30.html
Code of Student Behaviour

http://kdd.ics.uci.edu/
UCI KDD Database Repository
Popular site containing links to data sets used for KDD research.

Wednesday, March 09, 2005

Datasets

WHO’s Communicable Disease Global Atlas
http://globalatlas.who.int/

UCI Knowledge Discovery in Databases Archive
http://kdd.ics.uci.edu/

Raw input data for "small" Sequoia benchmark
http://epoch.cs.berkeley.edu:8000/sequoia/benchmark/

Kent Ridge Biomedical Data Set Repository
http://sdmc.i2r.a-star.edu.sg/rp/
Summarization of Kent Ridge Biomedical Data Set Repository:
Breast Cancer: 78+19 samples, 24481 features (genes), 2 classes;
Central Nervous System : 60 samples, 7129 genes, 2 classes;
Colon Tumor: 62 samples, 2000 genes, 2 classes;

Diffuse Large B-Cell Lymphoma (DLBCL)
DLBCL-Stanford: 47 samples, 4026 genes, two classes;
DLBCL-Harvard : 58+19 samples, 6817 genes, 2 classes;
DLBCL-NIH: 240 samples, 7399 microarray features, 2 classes;

Leukemia
Leukemia-ALLAML (WhiteHead, MIT) : 38+34 samples, 7129 probes from 6817 genes, 2 classes;
Leukemia-MLL (WhiteHead, MIT) : 57+15 samples, 12582 genes, 3 classes;
Leukemia-subtype (Stjude) : 215+112 samples, 12558 genes, 7 classes;

Lung Cancer
LungCancer-DanaFarberCancerInstitute-HarvardMedicalSchool : 203 samples, 12600 genes, 5 classes;
LungCancer-BrighamAndWomenHospital-HarvardMedicalSchool : 181 samples, 12533 genes, 2 classes;
LungCancer-Michigan : 86+10 samples, 7129 genes, 2 classes;
LungCancer-Ontario : 39 samples, 2880 genes, 2 classes;

Ovarian Cancer
OvarianCancer-NCI-PBSII-061902 : 91+162 samples, 15154 M/Z identities, 2 classes;
OvarianCancer-NCI-QStar : 216 samples, 373401 features, 2 classes;

Prostate Cancer : (a) 52+50+25+9 samples, 12600 genes, two classes; (b) 21 samples, two classes

Genomic Sequences
Translation Initiation Site Prediction : 3312 sequences, 927 features, two classes;
Polyadenylation Signal Prediction : 2327 (training) + 982 (testing), 168 features, two classes

Spatial-temporal Data Mining - Reading List

http://people.cas.sc.edu/guod/courses/geog763/syllabus.html

Textbooks and Readings:
Pattern Classification (Duda, Hart and Stork, 2001)
The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Hastie, Tibshirani and Friedman, 2001)
Geographic Data Mining and Knowledge Discovery (Miller and Han, 2001)
Quantitative Geography--Perspectives on Spatial Data Analysis (Fotheringham, Brunsdon and Charlton, 2000)
