Data Mining Algorithms Using Leukemia Incidence & Survivability

George Dimitoglou
Dept. of Computer Science, Hood College


Week #0: Familiarization with WEKA and data mining techniques (book reading). [DONE 6/14]

Week #1: Topic familiarization and data pre-processing training.

  1. Learn the basics about leukemia: understand the disease, how it manifests, and what its symptoms and causes are. Articles to read:
    article from the Mayo Clinic
    article from WebMD
  2. Download the sample dataset: sample data. The data set is an ASCII, "pipe" (|) delimited file [?].
  3. Remove the data not related to our study (any record whose SITE is not Leukemia) to create our "working" dataset.
  4. Load the data into WEKA using the Explorer facility. To do this, you will need to properly format the data as WEKA requires.
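Steps 3 and 4 above can be sketched in Python. This is only a minimal sketch under stated assumptions: the column names (SITE, YEAR, AGE) and the sample records are hypothetical, the first line of the file is assumed to be a header row, and every non-numeric field is written as a nominal ARFF attribute. Check the real layout of the sample data before relying on any of it.

```python
import csv
import io


def read_pipe_delimited(text):
    """Parse pipe-delimited text whose first line is assumed to be a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|", restval=""))


def keep_leukemia(rows, site_field="SITE"):
    """Keep only the rows whose SITE value mentions leukemia (case-insensitive)."""
    return [r for r in rows if "leukemia" in r[site_field].lower()]


def quote(value):
    """Single-quote a value for ARFF, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(rows, relation, numeric_fields=()):
    """Render a non-empty list of rows as ARFF text.

    Fields listed in numeric_fields become numeric attributes;
    all other fields become nominal attributes.
    """
    fields = list(rows[0].keys())
    lines = ["@relation " + relation, ""]
    for name in fields:
        if name in numeric_fields:
            lines.append("@attribute " + name + " numeric")
        else:
            levels = sorted({r[name] for r in rows if r[name] != ""})
            lines.append("@attribute " + name + " {" + ",".join(quote(v) for v in levels) + "}")
    lines += ["", "@data"]
    for r in rows:
        cells = []
        for name in fields:
            value = r[name]
            if value == "":
                cells.append("?")  # ARFF marker for a missing value
            elif name in numeric_fields:
                cells.append(value)
            else:
                cells.append(quote(value))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical records, for illustration only (not taken from the real data).
SAMPLE = "SITE|YEAR|AGE\nLeukemia|1975|54\nBreast|1975|61\nLeukemia|1976|8\n"

rows = read_pipe_delimited(SAMPLE)
working = keep_leukemia(rows)
arff_text = to_arff(working, "leukemia_working", numeric_fields=("YEAR", "AGE"))
print(len(rows), "records read;", len(working), "kept for the working dataset")
print(arff_text)
```

The printed text can be saved to a file with an .arff extension and opened from the WEKA Explorer. Printing the record counts before and after filtering also gives a quick check that the filter removed what was expected.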
Week #2: Pre-processing and pre-analysis of the SEER dataset.
  1. We now have access to the data. You can download the full data (~210MB) here.
  2. Browse the contents of the Documentation for the ASCII Text Data Files web page but pay particular attention to: Data Dictionary for the Population Data Files, Population Estimates Used in SEER*Stat, SEER Extent of Disease 1988, SEER Site Recode ICD-O-3, Site/histology recode based on International Classification of Childhood Cancer and the Extended Classification Table.
    NOTE: You don't need to "learn" or "memorize" any of the information. But what you browse through here is very important, because it contains descriptions of the data and information about how the data was collected, which will help us better understand what the dataset contains.
  3. We want to work on predictive analysis of the data and assess the effectiveness of data mining algorithms. To do this, we need to identify two "chunks" of the data. The first chunk is the training dataset (TrD); for example, we can pick 10 consecutive years (1950-1959). If we mine this dataset, we should come up with some prediction of what should happen in the following years. But we already know what happened next (we have the data from 1960), so that year becomes our target dataset (TaD), and we can compare the predictions against it to see how well our algorithm(s) performed. To do this, you need to identify a 10-year period in the actual data. Make sure that the consecutive years you select are homogeneous: they must all contain the same fields, and the information must be coded consistently (you will have to check this against the documentation you covered in item #2 above).
  4. Counts. As you move through the datasets, keep a count of the number of records in each version of the dataset. How many records are in the dataset?
  5. Integrity/Consistency and Cleaning-up of the TrD. If there are inconsistencies (ex. the first two years the code for leukemia is #3456J and the last eight years the code for leukemia is #PPPL6 because they changed the way they were coding the data), we will have to adjust and make the whole dataset consistent. Then we need to remove any data that is not related to our disease. Entries that are not related to leukemia will not help us very much, so this data needs to be removed. We need to be careful here and make sure that we don't remove data that may indicate signs of early incidence of the disease (if such a thing exists), in case the disease has evolutionary characteristics (ex. starts as brain cancer and then is diagnosed as leukemia [this is not the case with leukemia, I am just giving a, perhaps bad, example])...
  6. Once you have the appropriate training dataset for 10 years of Leukemia (10TrDL), load the data into WEKA using the Explorer facility. To do this, you will need to properly format the data as WEKA requires.
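The splitting, counting and consistency checks in items 3 to 5 above could look roughly like the following Python sketch. The field names (YEAR, SITE) are assumptions, the records are made up, and the two codes are just the placeholder codes from the example in item 5, not real SEER codes. The consistency test used here (every year must use the same set of codes as the earliest year) is one simple choice among several.

```python
from collections import Counter, defaultdict


def split_by_year(rows, first_year, n_years, year_field="YEAR"):
    """Return (training, target): n_years consecutive years, then the year right after."""
    target_year = first_year + n_years
    training = [r for r in rows if first_year <= int(r[year_field]) < target_year]
    target = [r for r in rows if int(r[year_field]) == target_year]
    return training, target


def codes_by_year(rows, code_field="SITE", year_field="YEAR"):
    """Map each year to a Counter of the code values seen in that year."""
    seen = defaultdict(Counter)
    for r in rows:
        seen[int(r[year_field])][r[code_field]] += 1
    return dict(seen)


def inconsistent_years(by_year):
    """List the years whose set of codes differs from that of the earliest year."""
    years = sorted(by_year)
    if not years:
        return []
    reference = set(by_year[years[0]])
    return [y for y in years if set(by_year[y]) != reference]


def recode(rows, mapping, code_field="SITE"):
    """Return copies of the rows with old codes replaced according to mapping."""
    return [{**r, code_field: mapping.get(r[code_field], r[code_field])} for r in rows]


# Made-up records: one per year, 1950-1960, with a coding change after two years.
rows = [
    {"YEAR": str(1950 + i), "SITE": "#3456J" if i < 2 else "#PPPL6"}
    for i in range(11)
]

training, target = split_by_year(rows, first_year=1950, n_years=10)
print("TrD records:", len(training), "| TaD records:", len(target))

before = inconsistent_years(codes_by_year(training))
print("Years coded differently from the first year:", before)

# Harmonise the old code to the new one, then check again.
cleaned = recode(training, {"#3456J": "#PPPL6"})
after = inconsistent_years(codes_by_year(cleaned))
print("TrD records after recoding:", len(cleaned), "| inconsistent years:", after)
```

Recoding rather than dropping the older rows keeps the record count unchanged, which makes the counts requested in item 4 easier to reconcile between versions of the dataset.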
Week #3: Analysis based on specific algorithmic techniques.
  1. NOTE: Anything you do from now on needs to be carefully recorded in a single document. Also, as you create datasets for different year ranges in ARFF format for WEKA you need to save these files using meaningful names.
  2. Now that we have multiple years (1973-2008) of Leukemia records, select a consecutive 15-year range. Select a range that is as consistent as possible (ex. if they changed the coding for diagnosis in 1975, it is best not to pick the range 1973-1987 but instead pick 1976-1990). The year immediately after the end of the range is our target prediction year. Does the range ending in 2005 (which makes 2006 the target year) work?
  3. Load the latest 10 years (before the target year) into WEKA.
  4. Using WEKA and the techniques you have read about in the green book, run predictive algorithms. I am leaving it up to you to decide which ones to run (definitely include J48 please).
  5. Record (cut-n-paste) the output for each algorithm in a document.
  6. Count the number of Leukemia instances in the target year. How close were the predictions? This is our key question.

    For these results you need to create a table like the one below:

    Algorithm      Prediction   Actual Target Year Counts   Success percentage
    Algorithm x
    Algorithm y
    Algorithm z

  7. Which one is the algorithm with the "strongest" prediction? Why do you think that is the case?
  8. Using the "strongest" algorithm, repeat the experiment by

    (a) removing one year of data at a time from the beginning of the 10-year range, until you only have one year left

    (b) adding one more year at a time to the beginning of the 10-year range, ex. if you selected 1970-1979 with target 1980, run 11 years 1969-1979, 12 years 1968-1979, 13 years 1967-1979, etc., until you reach 15 years.

  9. Record the results in a table:
    Training years   Prediction   Actual Target Year Counts   Success percentage
    7 years
    8 years
    9 years
    10 years
    11 years
    12 years
    13 years
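The year ranges for item 8 and the results table for item 9 can be generated with a short Python sketch such as the one below. The text above does not define the "success percentage", so the formula used here, 100 * (actual - |predicted - actual|) / actual with a floor of 0, is an assumption; replace it with whatever measure we agree on. The counts in the example are placeholders, not results.

```python
def year_windows(last_year, min_years=1, max_years=15):
    """Yield (n_years, first_year, last_year) for every range ending at last_year."""
    for n in range(min_years, max_years + 1):
        yield n, last_year - n + 1, last_year


def success_percentage(predicted, actual):
    """Assumed measure: how close the predicted count is to the actual count, in percent."""
    if actual <= 0:
        raise ValueError("actual count must be positive")
    return max(0.0, 100.0 * (actual - abs(predicted - actual)) / actual)


def format_table(rows):
    """Format (label, predicted, actual) triples as a plain-text table."""
    lines = [f"{'Training years':<16}{'Prediction':>12}{'Actual':>10}{'Success %':>12}"]
    for label, predicted, actual in rows:
        pct = success_percentage(predicted, actual)
        lines.append(f"{label:<16}{predicted:>12}{actual:>10}{pct:>12.1f}")
    return "\n".join(lines)


# Ranges for the example in the text: 1970-1979 with target year 1980.
windows = list(year_windows(last_year=1979))
for n, first, last in windows:
    print(f"{n:>2} years: {first}-{last}")

# Placeholder counts only, to show the table layout.
placeholder_rows = [
    ("9 years", 900, 1000),
    ("10 years", 950, 1000),
    ("11 years", 1100, 1000),
]
table = format_table(placeholder_rows)
print(table)
```

Each range printed by the loop corresponds to one ARFF file to prepare and one row to fill in, so the same list can double as a checklist of file names.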

Week #4: Results.
Week #5: TBA