Question
Problem 1:
Clustering analysis on the "CCND3 Cyclin D3" gene expression values of the Golub et al. (1999) data.
(a) Conduct hierarchical clustering using single linkage and Ward linkage. Plot the cluster dendrogram for both fits. Get two clusters from each of the methods. Use the function table() to compare the clusters with the two patient groups ALL/AML. Which linkage function seems to work better here?
(b) Use k-means cluster analysis to get two clusters. Use table() to compare the two clusters with the two patient groups ALL/AML.
(c) Which clustering approach (hierarchical versus k-means) produces the best match to the two diagnostic groups ALL/AML?
(d) Find the two cluster means from the k-means cluster analysis. Perform a bootstrap on the cluster means. Do the confidence intervals for the cluster means overlap? Which of these two cluster means is estimated more accurately?
(e) Produce a plot of K versus SSE, for K=1, …, 30. How many clusters does this plot suggest?
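
The bootstrap in part (d) can be sketched as follows. This is only an illustration under stated assumptions: the vector x is simulated and merely stands in for the 38 CCND3 expression values, and the names n.boot, boot.means and ci are hypothetical. The cluster centers are sorted in every replicate so that each column always refers to the same (lower or upper) cluster.

```r
# Simulated stand-in for 38 expression values (assumption, NOT the Golub data)
set.seed(1)
x <- c(rnorm(27, mean = 1.9, sd = 0.4), rnorm(11, mean = 0.6, sd = 0.5))

# Two-cluster k-means on the one-dimensional data
fit  <- kmeans(x, centers = 2, nstart = 10)
init <- sort(as.vector(fit$centers))   # sorted: column 1 = lower cluster mean

# Nonparametric bootstrap: resample the values, refit, store sorted centers
n.boot <- 1000
boot.means <- matrix(NA_real_, nrow = n.boot, ncol = 2)
for (i in seq_len(n.boot)) {
  xb <- sample(x, replace = TRUE)
  fb <- kmeans(xb, centers = init)     # start from the original centers
  boot.means[i, ] <- sort(as.vector(fb$centers))
}

# 95% percentile intervals, one column per cluster mean
ci <- apply(boot.means, 2, quantile, probs = c(0.025, 0.975))
colnames(ci) <- c("lower.cluster", "upper.cluster")
ci

# The narrower interval belongs to the more accurately estimated mean
ci.width <- ci[2, ] - ci[1, ]
ci.width
```

The two intervals overlap if the upper limit of lower.cluster exceeds the lower limit of upper.cluster. With the real data, x would be replaced by the gene's expression vector.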

Problem 2:
Cluster analysis on part of Golub data.
(a) Select the oncogenes and antigens from the Golub data. (Hint: Use grep() ).
(b) On the selected data, do clustering analysis for the genes (not for the patients). Use K-means and K-medoids with K=2 to cluster the genes. Use table() to compare the resulting two clusters with the two gene groups, oncogenes and antigens, for each of the two clustering analyses.
(c) Use appropriate tests (from previous modules) to test the marginal independence in the two by two tables in (b). Which clustering method provides clusters related to the two gene groups?
(d) Plot the cluster dendrograms for this part of the Golub data with single linkage and complete linkage, using Euclidean distance.
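
A minimal sketch of the workflow in parts (a) to (c), under loud assumptions: the gene descriptions in gene.names and the matrix expr are invented for illustration and are not the Golub data, and Fisher's exact test is just one reasonable choice of independence test for a two by two table. pam() comes from the cluster package.

```r
library(cluster)   # provides pam() for K-medoids

set.seed(2)
# Hypothetical gene descriptions and expression matrix (genes in rows)
gene.names <- c(paste("oncogene", 1:20), paste("antigen", 1:25),
                paste("kinase", 1:15))
expr <- matrix(rnorm(length(gene.names) * 38), nrow = length(gene.names))

# (a) Select the rows whose description mentions either gene group
onco <- grep("oncogene", gene.names)
anti <- grep("antigen", gene.names)
sel  <- c(onco, anti)
gene.group <- factor(rep(c("oncogene", "antigen"),
                         times = c(length(onco), length(anti))))
sub <- expr[sel, ]

# (b) Cluster the genes (rows) with K = 2
km <- kmeans(sub, centers = 2, nstart = 10)
pm <- pam(sub, k = 2)
tab.km <- table(cluster = km$cluster,    group = gene.group)
tab.pm <- table(cluster = pm$clustering, group = gene.group)
tab.km
tab.pm

# (c) Test independence of cluster membership and gene group
fisher.test(tab.km)$p.value
fisher.test(tab.pm)$p.value
```

A small p-value would suggest that the clusters are related to the gene groups. With purely random data such as this, no relationship is expected.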

Problem 3:
Clustering analysis on the NCI60 cancer cell line microarray data (Ross et al. 2000). We use the data set in the package ISLR from r-project (not Bioconductor). You can use the following commands to load the data set.

install.packages('ISLR')
library(ISLR)
ncidata<-NCI60$data
ncilabs<-NCI60$labs

The ncidata (64 by 6830 matrix) contains 6830 gene expression measurements on 64 cancer cell lines. The cancer cell line labels are contained in ncilabs. We do clustering analysis on the 64 cell lines (the rows).
(a) Using k-means clustering, produce a plot of K versus SSE, for K=1,…, 30. How many clusters appear to be present?
(b) Do K-medoids clustering (K=7) with 1-correlation as the dissimilarity measure on the data. Compare the clusters with the cell line labels.
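
Both parts can be sketched on a small simulated matrix. The assumptions here are that x (64 rows by 200 columns of random numbers) stands in for ncidata, and that SSE means the total within-cluster sum of squares reported by kmeans() as tot.withinss. Because cor() correlates columns, the matrix is transposed so that the dissimilarity is computed between rows.

```r
library(cluster)   # provides pam() for K-medoids

set.seed(3)
# Simulated stand-in: 64 "cell lines" (rows) by 200 "genes" (columns)
x <- matrix(rnorm(64 * 200), nrow = 64)

# (a) Total within-cluster sum of squares for K = 1, ..., 30
K <- 1:30
sse <- sapply(K, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})
plot(K, sse, type = "b", xlab = "K", ylab = "SSE")

# (b) K-medoids on the dissimilarity 1 - correlation between rows
d  <- as.dist(1 - cor(t(x)))
pm <- pam(d, k = 7)
table(pm$clustering)
```

On the real data, a bend in the SSE curve suggests a number of clusters, and table(pm$clustering, ncilabs) would compare the clusters with the cell line labels.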
Solution Preview

library("multtest")
data(golub)
dim(golub)
# [1] 3051   38
golub <- data.frame(golub)
gol.fac <- factor( golub.cl, levels=0:1, labels=c("ALL","AML"))

# Problem 1
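# Row 1042 of golub holds the CCND3 Cyclin D3 expression values (one per patient)
clusdata <- data.frame(golub[1042,])
clusdata <- as.vector(clusdata, mode="numeric")
# dist() only accepts the spelling "euclidean"
hc.sing <- hclust( dist(clusdata,method="euclidean"),method="single")
hc.ward <- hclust( dist(clusdata,method="euclidean"),method="ward.D2")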

plot(hc.sing, labels=gol.fac)
rect.hclust(hc.sing,k=2)
groups <- cutree(hc.sing,k=2)
table(groups, gol.fac)
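#        gol.fac
# groups ALL AML
#      1  27  10
#      2   0   1
# Single linkage puts 37 of the 38 patients into cluster 1 (all 27 ALL
# patients plus 10 AML patients) and isolates a single AML patient in
# cluster 2. That lone patient is correctly labelled AML, but the other
# 10 AML patients are grouped with the ALL patients.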
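
The same cut-and-tabulate steps can be repeated for Ward linkage to compare the two methods. The sketch below uses simulated values rather than the Golub data. The group sizes (27 ALL, 11 AML) match the table above, but the means and standard deviations are invented, and agree() is a hypothetical helper that reports the agreement rate under the better of the two possible cluster-to-label matchings.

```r
set.seed(4)
# Simulated stand-in for one gene's expression in 27 ALL and 11 AML patients
grp <- factor(rep(c("ALL", "AML"), times = c(27, 11)))
x   <- c(rnorm(27, mean = 1.9, sd = 0.4), rnorm(11, mean = 0.6, sd = 0.5))

d <- dist(x, method = "euclidean")
hc.single <- hclust(d, method = "single")
hc.ward   <- hclust(d, method = "ward.D2")

# Cut each tree into two clusters and cross-tabulate against the labels
tab.single <- table(cluster = cutree(hc.single, k = 2), group = grp)
tab.ward   <- table(cluster = cutree(hc.ward,   k = 2), group = grp)
tab.single
tab.ward

# Hypothetical helper: share of patients matched under the better labelling
agree <- function(tab) {
  max(sum(diag(tab)), sum(tab) - sum(diag(tab))) / sum(tab)
}
agree(tab.single)
agree(tab.ward)
```

The linkage with the higher agreement rate matches the patient groups more closely on these simulated values. The result for the real data has to be read from its own tables.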