Understanding Python Data Mining and Cluster Analysis in Context

This chapter delves into using Python lists for data storage, implementing data mining applications, and exploring cluster analysis. Learn about clusters, centroids, Euclidean distance, visualization, and how to write functions for data analysis. The concepts of indefinite iteration and loop control are also discussed, along with practical examples to enhance understanding.


Uploaded on Oct 07, 2024


Presentation Transcript


  1. Python Programming in Context Chapter 7

  2. Objectives
     - To use Python lists as a means of storing data
     - To implement a nontrivial data mining application
     - To understand and implement cluster analysis
     - To use visualization as a means of displaying patterns

  3. Cluster
     - Data points that have something in common
     - Clusters are dissimilar to each other
     - Use simple Euclidean distance to measure how close one point is to another
     - Centroid is a point that represents a cluster (not necessarily a real data point)
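To make the centroid idea concrete, here is a minimal sketch, assuming points are stored as lists of numbers as in the rest of the chapter. The helper name `centroid` and the sample points are illustrative and do not come from the slides.

```python
def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    dimensions = len(points[0])
    sums = [0] * dimensions
    for point in points:
        for index in range(dimensions):
            sums[index] = sums[index] + point[index]
    return [total / len(points) for total in sums]

cluster = [[1, 1], [2, 4], [6, 1]]   # made-up sample points
center = centroid(cluster)
print(center)              # [3.0, 2.0]
print(center in cluster)   # False: the centroid need not be a real data point
```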

  4. Figure 7.1

  5. Figure 7.2

  6. Figure 7.3

  7. Figure 7.4

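  8. Listing 7.1
     import math

     def euclidD(point1, point2):
         # Add up the squared difference in each dimension
         sum = 0
         for index in range(len(point1)):
             diff = (point1[index]-point2[index]) ** 2
             sum = sum + diff

         # The distance is the square root of that total
         euclidDistance = math.sqrt(sum)
         return euclidDistance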
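As a quick check of the distance calculation, the sketch below restates the function from Listing 7.1 (with the accumulator renamed to `total`, a choice made here to avoid shadowing the built-in `sum`) and compares it with `math.dist`, which the standard library provides in Python 3.8 and later. The test points are illustrative.

```python
import math

def euclidD(point1, point2):
    """Euclidean distance between two points of equal dimension."""
    total = 0
    for index in range(len(point1)):
        total = total + (point1[index] - point2[index]) ** 2
    return math.sqrt(total)

print(euclidD([0, 0], [3, 4]))     # 5.0
print(math.dist([0, 0], [3, 4]))   # 5.0
```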

  9. Figure 7.5

  10. Listing 7.2
      def readFile(filename):
          datafile = open(filename, "r")
          datadict = {}

          # Number the lines from 1 and store each score as a one-element list
          key = 0
          for aline in datafile:
              key = key + 1
              score = int(aline)
              datadict[key] = [score]

          datafile.close()
          return datadict

  11. Indefinite Iteration
      - Repeating a process an unknown number of times
      - Control is based on a boolean expression
      - Infinite loop is possible
      - Any for loop can be written as a while loop
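A small sketch of indefinite iteration, where the number of repetitions is not known in advance. The function name `halvings_until_below` is hypothetical; the guard on `threshold` is an addition here so that the loop cannot run forever.

```python
def halvings_until_below(value, threshold):
    """Count how many times value must be halved to drop below threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive or the loop could never end")
    count = 0
    while value >= threshold:   # boolean condition controls the loop
        value = value / 2       # change of state, so the condition can become false
        count = count + 1
    return count

print(halvings_until_below(100, 1))   # 7
```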

  12. Listing 7.3
      while <condition>:
          statement1
          statement2
          ...
          statementn

  13. Figure 7.6

  14. Listing 7.4
      sum = 0
      for anum in range(1,11):
          sum = sum + anum
      print(sum)

  15. Listing 7.5
      sum = 0
      anum = 1                #initialization
      while anum <= 10:       #condition
          sum = sum + anum
          anum = anum + 1     #change of state
      print(sum)

  16. Listing 7.6
      sum = 0
      anum = 1
      while anum <= 10:
          sum = sum + anum    # anum never changes, so this loop never ends
      print(sum)              # never reached

  17. Listing 7.7
      def readFile(filename):
          datafile = open(filename, "r")
          datadict = {}

          key = 0
          aline = datafile.readline()
          while aline != "":
              key = key + 1
              score = int(aline)
              datadict[key] = [score]
              aline = datafile.readline()

          return datadict

  18. Creating Clusters
      - Decide on number of clusters
      - Choose data points to be initial centroids
      - Assign data points to be members of a centroid
      - Recompute centroids
      - Repeat
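The steps above can be sketched on one-dimensional data. This is a simplified illustration, not the chapter's implementation: the helper names `assign_points` and `recompute` and the scores are made up, and keeping the old centroid when a cluster is empty is a choice of this sketch.

```python
def assign_points(points, centroids):
    """Group each 1-D point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        distances = [abs(point - c) for c in centroids]
        clusters[distances.index(min(distances))].append(point)
    return clusters

def recompute(clusters, centroids):
    """Replace each centroid with its cluster's mean (unchanged if the cluster is empty)."""
    new_centroids = []
    for cluster, old in zip(clusters, centroids):
        if cluster:
            new_centroids.append(sum(cluster) / len(cluster))
        else:
            new_centroids.append(old)
    return new_centroids

scores = [55, 60, 62, 88, 90, 95]   # made-up data
centroids = [55, 60]                # two clusters, seeded with two data points
for _ in range(3):                  # repeat a fixed number of times
    clusters = assign_points(scores, centroids)   # assign points to centroids
    centroids = recompute(clusters, centroids)    # recompute centroids

print(clusters)    # [[55, 60, 62], [88, 90, 95]]
print(centroids)   # [59.0, 91.0]
```

After the first pass the second centroid moves to 79.0, which pulls 60 and 62 into the first cluster on the second pass; the third pass changes nothing.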

  19. Listing 7.8
      import random

      def createCentroids(k, datadict):
          centroids=[]
          centroidCount = 0
          centroidKeys = []

          # Keep drawing random keys until k distinct ones have been chosen
          while centroidCount < k:
              rkey = random.randint(1,len(datadict))
              if rkey not in centroidKeys:
                  centroids.append(datadict[rkey])
                  centroidKeys.append(rkey)
                  centroidCount = centroidCount + 1

          return centroids
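An alternative way to pick k distinct starting points is `random.sample` from the standard library. This is a sketch of a different technique from the while loop in Listing 7.8, and the name `pick_centroids` and the sample dictionary are hypothetical. One practical difference: if k is larger than the number of data points, the while loop in Listing 7.8 never terminates, whereas `random.sample` raises `ValueError`.

```python
import random

def pick_centroids(k, datadict):
    """Pick k distinct data points as initial centroids."""
    chosen_keys = random.sample(list(datadict), k)   # k distinct keys, no repeats
    return [datadict[key] for key in chosen_keys]

data = {1: [70], 2: [85], 3: [92], 4: [60]}   # made-up scores keyed from 1
print(pick_centroids(2, data))
```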

  20. Listing 7.9
      def createClusters(k, centroids, datadict, repeats):
          for apass in range(repeats):
              print("****PASS",apass,"****")
              clusters = []
              for i in range(k):
                  clusters.append([])

              for akey in datadict:
                  distances = []
                  for clusterIndex in range(k):
                      dist = euclidD(datadict[akey],centroids[clusterIndex])
                      distances.append(dist)

                  mindist = min(distances)
                  index = distances.index(mindist)

                  clusters[index].append(akey)

              dimensions = len(datadict[1])

  21. Listing 7.9 continued
              # (still inside the "for apass" loop) recompute each centroid
              for clusterIndex in range(k):
                  sums = [0]*dimensions
                  for akey in clusters[clusterIndex]:
                      datapoints = datadict[akey]
                      for ind in range(len(datapoints)):
                          sums[ind] = sums[ind] + datapoints[ind]
                  for ind in range(len(sums)):
                      clusterLen = len(clusters[clusterIndex])
                      if clusterLen != 0:
                          sums[ind] = sums[ind]/clusterLen

                  centroids[clusterIndex] = sums

              for c in clusters:
                  print ("CLUSTER")
                  for key in c:
                      print(datadict[key], end=" ")
                  print()

          return clusters

  22. Figure 7.7

  23. Listing 7.10
      def clusterAnalysis(dataFile):
          examDict = readFile(dataFile)
          examCentroids = createCentroids(5, examDict)
          examClusters = createClusters(5, examCentroids, examDict, 3)

      clusterAnalysis("cs150exams.txt")

  24. Visualizing Clusters
      - Earthquake data
      - Show clusters on a map
      - Use turtle module to plot data
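Plotting on a map comes down to scaling longitude and latitude to drawing coordinates. The sketch below shows that scaling on its own, without opening a turtle window. The function name `to_screen` is hypothetical, and it assumes a canvas whose origin is at the centre, with longitude in [-180, 180] and latitude in [-90, 90].

```python
def to_screen(lon, lat, width, height):
    """Map longitude/latitude in degrees to x/y on a canvas centred at (0, 0)."""
    x = (lon / 180) * (width / 2)    # -180..180 degrees spans the full width
    y = (lat / 90) * (height / 2)    # -90..90 degrees spans the full height
    return x, y

print(to_screen(0, 0, 448, 266))      # (0.0, 0.0)
print(to_screen(180, 90, 448, 266))   # (224.0, 133.0)
print(to_screen(-90, 45, 448, 266))   # (-112.0, 66.5)
```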

  25. Figure 7.8

  26. Listing 7.11
      import turtle

      def visualizeQuakes(dataFile):
          datadict = readFile(dataFile)
          quakeCentroids = createCentroids(6, datadict)
          clusters = createClusters(6, quakeCentroids, datadict, 7)

          quakeT = turtle.Turtle()
          quakeWin = turtle.Screen()
          quakeWin.bgpic("worldmap.gif")
          quakeWin.screensize(448,266)
          quakeWin.setup(width=500, height=300)

          # Pixels per degree of longitude and latitude
          wFactor = (quakeWin.screensize()[0]/2)/180
          hFactor = (quakeWin.screensize()[1]/2)/90

          quakeT.hideturtle()
          quakeT.up()

          colorlist = ["red","green","blue","orange","cyan","yellow"]

          # Draw every point in a cluster as a dot in that cluster's color
          for clusterIndex in range(6):
              quakeT.color(colorlist[clusterIndex])
              for akey in clusters[clusterIndex]:
                  lon = datadict[akey][0]
                  lat = datadict[akey][1]
                  quakeT.goto(lon*wFactor,lat*hFactor)
                  quakeT.dot()

          quakeWin.exitonclick()

  27. Figure 7.9
