Understanding Python Data Mining and Cluster Analysis in Context
This chapter delves into using Python lists for data storage, implementing data mining applications, and exploring cluster analysis. Learn about clusters, centroids, Euclidean distance, visualization, and how to write functions for data analysis. The concepts of indefinite iteration and loop control are also discussed, along with practical examples to enhance understanding.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Python Programming in Context Chapter 7
Objectives To use Python lists as a means of storing data To implement a nontrivial data mining application To understand and implement cluster analysis To use visualization as a means of displaying patterns
Cluster Data points that have something in common Clusters are dissimilar to each other Use simple Euclidean distance to measure how close one point is to another Centroid is a point that represents a cluster (not necessarily a real data point)
Listing 7.1 def euclidD(point1, point2): sum = 0 for index in range(len(point1)): diff = (point1[index]-point2[index]) ** 2 sum = sum + diff euclidDistance = math.sqrt(sum) return euclidDistance
Listing 7.2 def readFile(filename): datafile = open(filename, "r") datadict = {} key = 0 for aline in datafile: key = key + 1 score = int(aline) datadict[key] = [score] return datadict
Indefinite Iteration Repeating a process an unknown number of times Control is based on a boolean expression Infinite loop is possible Any for loop can be written as a while loop
Listing 7.3 while <condition>: statement1 statement2 ... statementn
Listing 7.4 sum = 0 for anum in range(1,11): sum = sum + anum print(sum)
Listing7.5 sum = 0 anum = 1 #initialization while anum <= 10: #condition sum = sum + anum anum = anum + 1 #change of state print(sum)
Listing 7.6 sum = 0 anum = 1 while anum <= 10: sum = sum + anum print(sum)
Listing 7.7 def readFile(filename): datafile = open(filename, "r") datadict = {} key = 0 aline = datafile.readline() while aline != "": key = key + 1 score = int(aline) datadict[key] = [score] aline = datafile.readline() return datadict
Creating Clusters Decide on number of clusters Choose data points to be initial centroids Assign data points to be members of a centroid Recompute centroids Repeat
Listing 7.8 def createCentroids(k, datadict): centroids=[] centroidCount = 0 centroidKeys = [] while centroidCount < k: rkey = random.randint(1,len(datadict)) if rkey not in centroidKeys: centroids.append(datadict[rkey]) centroidKeys.append(rkey) centroidCount = centroidCount + 1 return centroids
Listing 7.9 def createClusters(k, centroids, datadict, repeats): for apass in range(repeats): print("****PASS",apass,"****") clusters = [] for i in range(k): clusters.append([]) for akey in datadict: distances = [] for clusterIndex in range(k): dist = euclidD(datadict[akey],centroids[clusterIndex]) distances.append(dist) mindist = min(distances) index = distances.index(mindist) clusters[index].append(akey) dimensions = len(datadict[1])
Listing 7.9 continued for clusterIndex in range(k): sums = [0]*dimensions for akey in clusters[clusterIndex]: datapoints = datadict[akey] for ind in range(len(datapoints)): sums[ind] = sums[ind] + datapoints[ind] for ind in range(len(sums)): clusterLen = len(clusters[clusterIndex]) if clusterLen != 0: sums[ind] = sums[ind]/clusterLen centroids[clusterIndex] = sums for c in clusters: print ("CLUSTER") for key in c: print(datadict[key], end=" ") print() return clusters
Listing 7.10 def clusterAnalysis(dataFile): examDict = readFile(dataFile) examCentroids = createCentroids(5, examDict) examClusters = createClusters(5, examCentroids, examDict, 3) clusterAnalysis("cs150exams.txt")
Visualizing Clusters Earthquake data Show clusters on a map Use turtle module to plot data
Listing 7.11 def visualizeQuakes(dataFile): datadict = readFile(dataFile) quakeCentroids = createCentroids(6, datadict) clusters = createClusters(6, quakeCentroids, datadict, 7) quakeT = turtle.Turtle() quakeWin = turtle.Screen() quakeWin.bgpic("worldmap.gif") quakeWin.screensize(448,266) quakeWin.setup(width=500, height=300) wFactor = (quakeWin.screensize()[0]/2)/180 hFactor = (quakeWin.screensize()[1]/2)/90 quakeT.hideturtle() quakeT.up() colorlist = ["red","green","blue","orange","cyan","yellow"] for clusterIndex in range(6): quakeT.color(colorlist[clusterIndex]) for akey in clusters[clusterIndex]: lon = datadict[akey][0] lat = datadict[akey][1] quakeT.goto(lon*wFactor,lat*hFactor) quakeT.dot() quakeWin.exitonclick()