Understanding Python Data Mining and Cluster Analysis in Context

This chapter covers using Python lists for data storage, implementing a data mining application, and performing cluster analysis. It introduces clusters, centroids, Euclidean distance, and visualization, and shows how to write functions for data analysis. Indefinite iteration and loop control with while are also discussed, with code listings throughout.

Uploaded by olu_m on Oct 07, 2024

Presentation Transcript

Python Programming in Context, Chapter 7

Objectives
- To use Python lists as a means of storing data
- To implement a nontrivial data mining application
- To understand and implement cluster analysis
- To use visualization as a means of displaying patterns

Cluster
- A cluster is a group of data points that have something in common
- Clusters are dissimilar to each other
- Simple Euclidean distance is used to measure how close one point is to another
- A centroid is a point that represents a cluster (not necessarily a real data point)
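As a small illustration of the centroid idea, the sketch below averages each coordinate of a cluster's points. The function name centroid_of and the sample points are assumptions made for illustration and do not come from the chapter.

```python
def centroid_of(points):
    """Return the coordinate-wise mean of a non-empty list of equal-length points."""
    dimensions = len(points[0])
    count = len(points)
    return [sum(p[d] for p in points) / count for d in range(dimensions)]


# Hypothetical two-dimensional cluster.
cluster = [[1, 1], [3, 1], [2, 4]]
center = centroid_of(cluster)
print(center)  # [2.0, 2.0]
```

The resulting centroid [2.0, 2.0] is not one of the three data points, which matches the remark that a centroid need not be a real data point.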

Figure 7.1

Figure 7.2

Figure 7.3

Figure 7.4

Listing 7.1

    import math

    def euclidD(point1, point2):
        # Add up the squared differences in every dimension.
        total = 0
        for index in range(len(point1)):
            diff = (point1[index] - point2[index]) ** 2
            total = total + diff
        euclidDistance = math.sqrt(total)
        return euclidDistance
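A quick way to check a distance function like the one in Listing 7.1 is to try it on a 3-4-5 right triangle. The sketch below restates the function so that it runs on its own and compares it with the standard library's math.dist, which assumes Python 3.8 or later; the sample points are made up.

```python
import math


def euclidD(point1, point2):
    """Euclidean distance between two points of equal dimension."""
    total = 0
    for index in range(len(point1)):
        total = total + (point1[index] - point2[index]) ** 2
    return math.sqrt(total)


d = euclidD([0, 0], [3, 4])
print(d)  # 5.0

# math.dist computes the same quantity (Python 3.8+).
print(math.isclose(d, math.dist([0, 0], [3, 4])))  # True
```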

Figure 7.5

Listing 7.2

    def readFile(filename):
        datafile = open(filename, "r")
        datadict = {}
        key = 0
        for aline in datafile:
            key = key + 1
            score = int(aline)
            datadict[key] = [score]   # each value is a one-element list
        datafile.close()
        return datadict
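To see the shape of the dictionary that a reader like Listing 7.2 produces, the sketch below writes three made-up scores to a temporary file and reads them back. It restates readFile using a with block, which is a stylistic assumption rather than the chapter's form; the scores and the temporary file are hypothetical.

```python
import os
import tempfile


def readFile(filename):
    """Read one integer score per line into {line_number: [score]}."""
    datadict = {}
    key = 0
    with open(filename, "r") as datafile:   # closes the file automatically
        for aline in datafile:
            key = key + 1
            datadict[key] = [int(aline)]
    return datadict


# Write three made-up scores to a temporary file, one per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("85\n92\n70\n")
    path = tmp.name

scores = readFile(path)
os.remove(path)
print(scores)  # {1: [85], 2: [92], 3: [70]}
```

Keys start at 1, and each score is wrapped in a list so that it can be treated as a one-dimensional point by the distance function.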

Indefinite Iteration
- Repeating a process an unknown number of times
- Control is based on a boolean expression
- An infinite loop is possible
- Any for loop can be written as a while loop
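A minimal sketch of indefinite iteration: the number of passes through the loop below is not known in advance and depends on the starting value. The function name and the halving task are assumptions chosen only for illustration.

```python
def halvings_until_below_one(value):
    """Count how many times value must be halved before it drops below 1."""
    count = 0
    while value >= 1:        # condition: a boolean expression
        value = value / 2    # change of state, so the loop can end
        count = count + 1
    return count


print(halvings_until_below_one(10))  # 4
```

Starting from 10 the values are 5, 2.5, 1.25, and 0.625, so the loop body runs four times. Removing the line that changes value would make the condition stay true forever.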

Listing 7.3

    while <condition>:
        statement1
        statement2
        ...
        statementn

Figure 7.6

Listing 7.4

    sum = 0
    for anum in range(1,11):
        sum = sum + anum
    print(sum)

Listing 7.5

    sum = 0
    anum = 1                 #initialization
    while anum <= 10:        #condition
        sum = sum + anum
        anum = anum + 1      #change of state
    print(sum)

Listing 7.6

    sum = 0
    anum = 1
    while anum <= 10:
        sum = sum + anum     # anum never changes, so the condition stays true: an infinite loop
    print(sum)               # never reached

Listing 7.7

    def readFile(filename):
        datafile = open(filename, "r")
        datadict = {}
        key = 0
        aline = datafile.readline()
        while aline != "":               # readline returns "" at end of file
            key = key + 1
            score = int(aline)
            datadict[key] = [score]
            aline = datafile.readline()
        datafile.close()
        return datadict

Creating Clusters
1. Decide on the number of clusters
2. Choose data points to be the initial centroids
3. Assign each data point to the cluster of its nearest centroid
4. Recompute the centroids
5. Repeat the assignment and recomputation steps
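The steps above can be sketched compactly as follows. This is a minimal sketch under stated assumptions, not the chapter's listing: the names assign, recompute, and k_means are hypothetical, the initial centroids are passed in rather than chosen at random, math.dist (Python 3.8+) stands in for a hand-written distance function, and the loop stops when assignments stop changing instead of after a fixed number of passes.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def recompute(points, labels, centroids):
    """Return new centroids as the mean of each cluster's members."""
    new_centroids = []
    for c in range(len(centroids)):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            dims = len(members[0])
            new_centroids.append(
                [sum(m[d] for m in members) / len(members) for d in range(dims)])
        else:
            new_centroids.append(centroids[c])  # assumption: an empty cluster keeps its centroid
    return new_centroids


def k_means(points, centroids, max_passes=100):
    """Repeat assignment and recomputation until the labels stop changing."""
    labels = None
    passes = 0
    while passes < max_passes:            # safety cap (an assumption)
        new_labels = assign(points, centroids)
        if new_labels == labels:          # nothing moved, so stop
            break
        labels = new_labels
        centroids = recompute(points, labels, centroids)
        passes = passes + 1
    return labels, centroids


# Hypothetical one-dimensional data with two obvious groups.
points = [[1], [2], [3], [10], [11], [12]]
labels, centroids = k_means(points, [[1], [2]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # [[2.0], [11.0]]
```

Stopping on unchanged assignments is an instance of indefinite iteration: the number of passes depends on the data and the starting centroids.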

Listing 7.8

    import random

    def createCentroids(k, datadict):
        centroids=[]
        centroidCount = 0
        centroidKeys = []

        # Keep drawing random keys until k distinct data points are chosen.
        while centroidCount < k:
            rkey = random.randint(1,len(datadict))
            if rkey not in centroidKeys:
                centroids.append(datadict[rkey])
                centroidKeys.append(rkey)
                centroidCount = centroidCount + 1

        return centroids
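The same selection of k distinct starting points can also be written with random.sample from the standard library. This is offered as an alternative sketch, not as the chapter's approach; the function name create_centroids_sample and the sample data are hypothetical.

```python
import random


def create_centroids_sample(k, datadict):
    """Pick k distinct data points as initial centroids using random.sample."""
    chosen_keys = random.sample(list(datadict), k)  # k distinct keys, no repeats
    return [datadict[key] for key in chosen_keys]


# Hypothetical exam scores keyed 1..5, in the shape readFile produces.
data = {1: [85], 2: [92], 3: [70], 4: [60], 5: [99]}
random.seed(0)  # fixed seed only so repeated runs pick the same keys
centroids = create_centroids_sample(3, data)
print(centroids)
```

One behavioral difference is worth noting: if k is larger than the number of data points, random.sample raises ValueError, whereas the while loop in Listing 7.8 would never finish because no unused key remains.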

Listing 7.9

    def createClusters(k, centroids, datadict, repeats):
        for apass in range(repeats):
            print("****PASS",apass,"****")
            clusters = []
            for i in range(k):
                clusters.append([])

            for akey in datadict:
                distances = []
                for clusterIndex in range(k):
                    dist = euclidD(datadict[akey],centroids[clusterIndex])
                    distances.append(dist)

                mindist = min(distances)
                index = distances.index(mindist)

                clusters[index].append(akey)

            dimensions = len(datadict[1])

Listing 7.9 continued

            # Still inside the "for apass" loop of createClusters:
            # recompute each centroid as the mean of its cluster's points.
            for clusterIndex in range(k):
                sums = [0]*dimensions
                for akey in clusters[clusterIndex]:
                    datapoints = datadict[akey]
                    for ind in range(len(datapoints)):
                        sums[ind] = sums[ind] + datapoints[ind]
                for ind in range(len(sums)):
                    clusterLen = len(clusters[clusterIndex])
                    if clusterLen != 0:
                        sums[ind] = sums[ind]/clusterLen

                centroids[clusterIndex] = sums

            for c in clusters:
                print ("CLUSTER")
                for key in c:
                    print(datadict[key], end=" ")
                print()

        return clusters
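The assignment step in Listing 7.9 finds the nearest centroid with min followed by list.index. A short worked example, using made-up distances, shows what happens when two centroids are equally close.

```python
# Hypothetical distances from one data point to four centroids.
distances = [4.0, 2.5, 2.5, 7.0]

mindist = min(distances)           # smallest distance in the list
index = distances.index(mindist)   # position of its first occurrence
print(mindist, index)  # 2.5 1
```

Because list.index returns the first match, a tie is resolved in favor of the lower-numbered cluster.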

Figure 7.7

Listing 7.10

    def clusterAnalysis(dataFile):
        examDict = readFile(dataFile)
        examCentroids = createCentroids(5, examDict)
        examClusters = createClusters(5, examCentroids, examDict, 3)

    clusterAnalysis("cs150exams.txt")

Visualizing Clusters
- Earthquake data
- Show clusters on a map
- Use the turtle module to plot data

Figure 7.8

Listing 7.11

    import turtle

    def visualizeQuakes(dataFile):
        # Each datadict value is expected to be [longitude, latitude].
        datadict = readFile(dataFile)
        quakeCentroids = createCentroids(6, datadict)
        clusters = createClusters(6, quakeCentroids, datadict, 7)

        quakeT = turtle.Turtle()
        quakeWin = turtle.Screen()
        quakeWin.bgpic("worldmap.gif")
        quakeWin.screensize(448,266)
        quakeWin.setup(width=500, height=300)

        # Scale degrees to pixels: 180 degrees of longitude and 90 degrees
        # of latitude each span half of the map image.
        wFactor = (quakeWin.screensize()[0]/2)/180
        hFactor = (quakeWin.screensize()[1]/2)/90

        quakeT.hideturtle()
        quakeT.up()

        colorlist = ["red","green","blue","orange","cyan","yellow"]

        # Draw one dot per earthquake, colored by cluster.
        for clusterIndex in range(6):
            quakeT.color(colorlist[clusterIndex])
            for akey in clusters[clusterIndex]:
                lon = datadict[akey][0]
                lat = datadict[akey][1]
                quakeT.goto(lon*wFactor,lat*hFactor)
                quakeT.dot()

        quakeWin.exitonclick()
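The scaling in Listing 7.11 can be examined without opening a turtle window. The sketch below isolates the longitude/latitude conversion in a hypothetical helper named to_screen, assuming the 448 by 266 map size used in the listing and a turtle coordinate system whose origin is the center of the map.

```python
def to_screen(lon, lat, width=448, height=266):
    """Map longitude/latitude in degrees to turtle x/y for a width x height map."""
    wFactor = (width / 2) / 180    # pixels per degree of longitude
    hFactor = (height / 2) / 90    # pixels per degree of latitude
    return lon * wFactor, lat * hFactor


print(to_screen(0, 0))      # (0.0, 0.0), the center of the map
print(to_screen(180, 90))   # roughly (224, 133), the top-right corner
print(to_screen(-90, -45))  # roughly (-112, -66.5), the lower-left quadrant
```

Dividing half the image size by the maximum longitude or latitude gives the number of pixels per degree, so every valid coordinate lands inside the map image.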

Figure 7.9
