Data Mining Algorithms Using Leukemia Incidence & Survivability
Dept. of Computer Science, Hood College
PROJECT DIRECTIONS & RESEARCH PLAN
Week #0: Familiarization with WEKA and data mining techniques (book reading). [DONE 6/14]
Week #1: Topic familiarization and data pre-processing training.
- Learn the basics about leukemia. Understand the disease, how it
manifests, and what its symptoms and causes are. Articles to read:
article from the Mayo Clinic
article from WebMD
- Download the sample dataset: sample data. The data set is an ASCII, "pipe" (|) delimited file [?].
Remove the data not related to our study (anything whose SITE is not Leukemia) to create our "working" dataset (a scripted filtering sketch appears at the end of this week's list).
- Open the file using a spreadsheet (ex. Excel). There are over half a million records. However, not all of them are relevant to us. Using spreadsheet functions answer the following questions:
(a) Using the SITE and the EVENT_TYPE columns: how many patients total were either diagnosed (Incidence) or died (Mortality) from Leukemia?
(b) What percentage of all occurrences was incidence vs. mortality?
(c) Create a table with one row per state (column AREA) and columns for the Incidence and Mortality rates (report both the actual counts and the percentages).
- Load up the data into WEKA using the Explorer facility. In order to do this you will need to properly format the data as WEKA requires.
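If you prefer to script this step instead of (or in addition to) using a spreadsheet, a minimal Java sketch along the following lines can build the "working" dataset and produce the counts for questions (a)-(c). The file name and the column names (SITE, EVENT_TYPE, AREA) are taken from the directions above; their exact positions in the export are assumptions, which is why the header row is parsed to locate them. If the export is aggregated (one row per area/site/event with a count field), sum that field instead of counting rows.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class FilterLeukemia {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("sample_data.txt"));
        PrintWriter out = new PrintWriter("leukemia_working.txt");

        // Locate the columns by name from the header row (column names assumed from the directions).
        String[] header = in.readLine().split("\\|", -1);
        int siteCol = -1, eventCol = -1, areaCol = -1;
        for (int i = 0; i < header.length; i++) {
            String h = header[i].trim().replace("\"", "");
            if (h.equalsIgnoreCase("SITE")) siteCol = i;
            if (h.equalsIgnoreCase("EVENT_TYPE")) eventCol = i;
            if (h.equalsIgnoreCase("AREA")) areaCol = i;
        }

        Map<String, Integer> byEvent = new HashMap<>();          // Incidence vs Mortality totals
        Map<String, Integer> byAreaAndEvent = new HashMap<>();   // per-state breakdown

        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\\|", -1);
            if (siteCol < 0 || siteCol >= f.length) continue;
            String site = f[siteCol].replace("\"", "").trim();
            if (!site.toLowerCase().contains("leukemia")) continue;  // keep only Leukemia rows

            out.println(line);                                        // build the "working" dataset

            String event = f[eventCol].replace("\"", "").trim();
            String area  = f[areaCol].replace("\"", "").trim();
            byEvent.merge(event, 1, Integer::sum);
            byAreaAndEvent.merge(area + " / " + event, 1, Integer::sum);
        }
        in.close();
        out.close();

        int total = byEvent.values().stream().mapToInt(Integer::intValue).sum();
        for (Map.Entry<String, Integer> e : byEvent.entrySet()) {
            System.out.printf("%s: %d (%.1f%%)%n", e.getKey(), e.getValue(), 100.0 * e.getValue() / total);
        }
        byAreaAndEvent.forEach((k, v) -> System.out.println(k + ": " + v));
    }
}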
Week #2: Pre-processing and pre-analysis of the SEER dataset.
Week #3: Analysis based on specific algorithmic techniques.
- We now have access to the data. You can download the full data (~210MB) here.
- Browse the contents of the Documentation for the ASCII Text Data Files web page but pay particular attention
to: Data Dictionary for the Population Data Files,
Population Estimates Used in SEER*Stat, SEER Extent of Disease 1988, SEER Site Recode
ICD-O-3, Site/histology recode based on
International Classification of Childhood Cancer and the Extended Classification Table.
NOTE: You don't need to "learn" or "memorize" any of this information, but what you
browse through here is very important because it contains descriptions of the data and information
about how the data was collected that will help us better understand what the dataset contains.
- We want to work on predictive analysis of the data and assess the effectiveness of data
mining algorithms. To do this, we need to identify two "chunks" of the data. The first chunk is the
training dataset (TrD); for example, we can pick 10 consecutive years (1950-1959). If we mine this
data set we should come up with some prediction of what should happen in the future years. But we know
what happened in the future years (we have the data from 1960), which will be our target data set (TaD),
so we can compare how well our algorithm(s) performed. To do this, you need to identify a
10-year period in the actual data. Make sure that the consecutive years you select are
homogeneous: they all contain the same fields and the information is coded consistently (you will
have to check this against the documentation you have covered in item #2 above).
- Counts. As you move through the data sets you need to keep counts of the number of
records in each version of the dataset. How many records are in the dataset? (A year-splitting and counting sketch follows this item.)
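A sketch of the TrD/TaD split and the per-year record counts, assuming the working dataset built in Week #1 and a pipe-delimited layout in which one field holds the year of diagnosis. The field index, file names, and year range below are placeholders; take the real positions from the SEER documentation and the range you actually select.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.TreeMap;

public class SplitByYear {
    // Hypothetical parameters: adjust the file name, the year-field index,
    // and the year ranges to match the record layout and the range you picked.
    static final int YEAR_COL = 4;
    static final int TRAIN_START = 1970, TRAIN_END = 1979, TARGET_YEAR = 1980;

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("leukemia_working.txt"));
        PrintWriter trd = new PrintWriter("TrD_1970_1979.txt");
        PrintWriter tad = new PrintWriter("TaD_1980.txt");
        TreeMap<Integer, Integer> perYear = new TreeMap<>();   // record counts per year

        String line;
        int totalRecords = 0;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\\|", -1);
            int year;
            try { year = Integer.parseInt(f[YEAR_COL].trim()); } catch (Exception e) { continue; }
            totalRecords++;
            perYear.merge(year, 1, Integer::sum);
            if (year >= TRAIN_START && year <= TRAIN_END) trd.println(line);   // training dataset (TrD)
            else if (year == TARGET_YEAR) tad.println(line);                   // target dataset (TaD)
        }
        in.close(); trd.close(); tad.close();

        System.out.println("Total records read: " + totalRecords);
        perYear.forEach((y, n) -> System.out.println(y + ": " + n));
    }
}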
- Integrity/Consistency and Cleaning-up of the TrD. If there are inconsistencies (ex. in the first two
years the code for Leukemia is #3456J and in the last eight years
the code for leukemia is #PPPL6 because they changed the way they were coding the data), we will have
to adjust and make the whole data set consistent (a recoding sketch follows this item).
Then we need to remove any data that is not related to our disease. Having entries in the data that are
not related to leukemia will not help us very much, so this data needs to be removed. We need to be
careful here and make sure that we don't remove data that may indicate the signs of early incidence of
the disease (if such a thing exists) in case the disease has evolutionary characteristics (ex. starts as
brain cancer and then is diagnosed as leukemia [this is not the case with leukemia, I am just
giving a, perhaps bad, example])...
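A sketch of the recoding step, assuming the inconsistency is limited to a site code changing value partway through the range. The mapping uses the made-up codes from the example above; the real old-to-new pairs and the field index come from the SEER recode documentation.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class RecodeTrainingData {
    public static void main(String[] args) throws Exception {
        // Hypothetical mapping: leukemia codes used in the early years of the range
        // rewritten to the code used in the later years, so the whole TrD is consistent.
        Map<String, String> recode = new HashMap<>();
        recode.put("3456J", "PPPL6");   // placeholder codes taken from the example above

        int SITE_CODE_COL = 2;          // index of the site-code field (assumption)

        BufferedReader in = new BufferedReader(new FileReader("TrD_1970_1979.txt"));
        PrintWriter out = new PrintWriter("TrD_1970_1979_clean.txt");
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\\|", -1);
            String code = f[SITE_CODE_COL].trim();
            if (recode.containsKey(code)) f[SITE_CODE_COL] = recode.get(code);
            out.println(String.join("|", f));
        }
        in.close();
        out.close();
    }
}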
- Once you have the appropriate training dataset for 10 years of Leukemia
(10TrDL), load up the data into WEKA using the Explorer facility. In order to do
this you will need to properly format the data as WEKA requires (an ARFF conversion sketch follows this item).
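WEKA's native format is ARFF: a header that declares each attribute, followed by comma-separated data rows. A minimal conversion sketch, assuming we keep only a handful of fields; the attribute names, types, and field indices below are placeholders to be matched against the fields you actually decide to keep.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

public class ToArff {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("TrD_1970_1979_clean.txt"));
        PrintWriter out = new PrintWriter("10TrDL.arff");

        // Header: one @attribute line per field we keep (names and types are placeholders).
        out.println("@relation leukemia_10TrDL");
        out.println("@attribute year numeric");
        out.println("@attribute area string");
        out.println("@attribute site_code string");
        out.println("@attribute event_type {Incidence,Mortality}");
        out.println("@data");

        // Body: emit the kept fields as comma-separated values, in the same order as the header.
        int YEAR_COL = 4, AREA_COL = 0, SITE_CODE_COL = 2, EVENT_COL = 5;   // assumed indices
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\\|", -1);
            out.printf("%s,'%s','%s',%s%n",
                    f[YEAR_COL].trim(), f[AREA_COL].trim(), f[SITE_CODE_COL].trim(), f[EVENT_COL].trim());
        }
        in.close();
        out.close();
    }
}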
- NOTE: Anything you do from now on needs to be carefully
documented in a single document. Also, as you create datasets for different year
ranges in ARFF format for WEKA you need to save these files using
consistent, descriptive file names (ex. include the year range in the name).
- Now that we have multiple years (1973-2008) of Leukemia records,
select a consecutive 15-year range. Select a range that is as consistent
as possible (ex. if they changed the coding for diagnosis in 1975, it is
best not to pick the range 1973-1987 but instead pick 1976-1990). The year
immediately after the end of the range is our target prediction year. Does
the 15-year range ending just before 2006 (making 2006 the target year) work?
- Load the latest 10 years (before the target year) into WEKA.
- Using WEKA and the techniques you have read in the green book, run
predictive algorithms. I am leaving it up to you to decide which ones to
run (definitely include J48 please). A sketch using WEKA's Java API follows this item.
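The runs can be scripted against WEKA's Java API (with weka.jar on the classpath) instead of clicking through the Explorer. The sketch below trains on the 10-year ARFF file and evaluates against the target-year file; the file names are the ones assumed in the earlier sketches, and the class attribute is assumed to be the last one. J48 is included, and other classifiers can be added to the array.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunClassifiers {
    public static void main(String[] args) throws Exception {
        // Training and target-year ARFF files prepared in the earlier steps (names are assumptions).
        Instances train = DataSource.read("10TrDL.arff");
        Instances test  = DataSource.read("TaD_target_year.arff");

        // The class attribute is whatever we are predicting; the last attribute is assumed here.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Classifier[] classifiers = { new J48(), new NaiveBayes() };   // add others as desired
        for (Classifier c : classifiers) {
            c.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(c, test);
            System.out.println("=== " + c.getClass().getSimpleName() + " ===");
            System.out.println(eval.toSummaryString());   // paste this output into the results document
        }
    }
}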
- Record (cut-n-paste) the output for each algorithm in a document.
- Count the number of Leukemia instances in the target year. How close
were the predictions? This is our key question.
For these results you need to create a table like the one below:
||Algorithm||Prediction||Actual Target Year||
|Algorithm x| | |
|Algorithm y| | |
|Algorithm z| | |
|etc.| | |
- Which one is the algorithm with the "strongest" prediction? Why do
you think this is the case?
- Using the "strongest" algorithm, repeat the experiment by
(a) removing from the 10-year data one year of data at a time (from the beginning
of the range) until you only have one year left, and
(b) adding to the 10-year data one more year at a time (to the beginning of the
range), ex. if you selected 1970-1979 with target 1980, run 11 years
1969-1979, 12 years 1968-1979, 13 years 1967-1979, etc. until 15 years.
(A driver-loop sketch follows this item.)
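A small driver loop makes the windows explicit: every window ends the year before the target and grows from 1 to 15 years. The sketch below only prints the plan; for each window you would regenerate the training ARFF (as in the earlier sketches) and rerun the "strongest" classifier, recording its prediction next to the actual target-year count. The target year is the placeholder used earlier.

public class WindowExperiment {
    public static void main(String[] args) {
        int targetYear = 1980;                    // assumption: same target year as in the earlier sketches
        for (int size = 1; size <= 15; size++) {
            int startYear = targetYear - size;    // e.g. size 10 -> 1970-1979
            int endYear = targetYear - 1;
            System.out.printf("Run with training window %d-%d (target %d)%n", startYear, endYear, targetYear);
            // For this window: regenerate the training ARFF for [startYear, endYear],
            // rerun the strongest classifier, and record prediction vs. actual in the table below.
        }
    }
}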
- Record the results in a table:
||Training window||Prediction||Actual Target Year||
|7 years| | |
|8 years| | |
|9 years| | |
|10 years| | |
|11 years| | |
|12 years| | |
|13 years| | |
Week #4: Results.
Week #5: TBA