Python Data Mining and Cluster Analysis in Context

Python Programming in Context
Chapter 7
Objectives
To use Python lists as a means of storing data
To implement a nontrivial data mining application
To understand and implement cluster analysis
To use visualization as a means of displaying patterns
Cluster
Data points that have something in common
Clusters are dissimilar to each other
Use simple Euclidean distance to measure how close one point is to another
Centroid is a point that represents a cluster (not necessarily a real data point)
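As a small worked example of these two ideas, the sketch below measures the distance between two points and computes the centroid of a three-point cluster. The point values and the helper name `centroid` are made up for illustration and do not come from the chapter's data. Note that the resulting centroid is not one of the cluster's own points.

```python
import math

# Hypothetical 2-D points, chosen only for illustration.
p = [0, 0]
q = [3, 4]

# Euclidean distance: square root of the summed squared differences.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Return the per-dimension mean of a non-empty list of points."""
    dimensions = len(points[0])
    return [sum(pt[d] for pt in points) / len(points) for d in range(dimensions)]

cluster = [[1, 1], [2, 4], [6, 1]]
center = centroid(cluster)

print(distance)  # 5.0
print(center)    # [3.0, 2.0]
```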
Figure 7.1
Figure 7.2
Figure 7.3
Figure 7.4
Listing 7.1
import math

def euclidD(point1, point2):
    # Sum the squared differences in each dimension, then take the square root.
    sum = 0
    for index in range(len(point1)):
        diff = (point1[index] - point2[index]) ** 2
        sum = sum + diff
    euclidDistance = math.sqrt(sum)
    return euclidDistance
Figure 7.5
Listing 7.2
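def readFile(filename):
    # Map line numbers (1, 2, ...) to one-element lists holding each score.
    datadict = {}
    key = 0
    with open(filename, "r") as datafile:   # the file is closed automatically
        for aline in datafile:
            key = key + 1
            score = int(aline)
            datadict[key] = [score]
    return datadict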
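To make the data layout concrete, here is a minimal sketch of what such a reader returns for a three-line file. The function name `read_scores`, the temporary file, and the scores in it are hypothetical; the real data file is not shown on the slides.

```python
import os
import tempfile

def read_scores(filename):
    """Map line numbers (starting at 1) to one-element lists of int scores."""
    datadict = {}
    with open(filename, "r") as datafile:
        for key, aline in enumerate(datafile, start=1):
            datadict[key] = [int(aline)]
    return datadict

# Write a tiny made-up score file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("90\n72\n85\n")
    path = tmp.name

scores = read_scores(path)
os.remove(path)
print(scores)  # {1: [90], 2: [72], 3: [85]}
```

Storing each score inside a list means the same distance code can later handle points with more than one dimension.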
Indefinite Iteration
Repeating a process an unknown number of times
Control is based on a boolean expression
Infinite loop is possible
Any for loop can be written as a while loop
Listing 7.3
while <condition>:
   statement1
   statement2
   ...
   statementn
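A minimal sketch of indefinite iteration, assuming a made-up task: counting how many times a number can be halved before it drops below 1. The function name `halvings` is hypothetical. The number of repetitions depends on the input, so it is not known when the loop starts.

```python
def halvings(n):
    """Count how many times n can be halved before it drops below 1."""
    count = 0
    while n >= 1:            # condition: a boolean expression
        n = n / 2            # change of state, so the loop eventually ends
        count = count + 1
    return count

print(halvings(10))  # 4
```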
Figure 7.6
Listing 7.4
sum = 0
for anum in range(1,11):
    sum = sum + anum
print(sum)
Listing 7.5
sum = 0
anum = 1                 # initialization
while anum <= 10:        # condition
    sum = sum + anum
    anum = anum + 1      # change of state
print(sum)
Listing 7.6
sum = 0
anum = 1
while anum <= 10:        # anum never changes, so this is always True
    sum = sum + anum     # no change of state: infinite loop
print(sum)               # never reached
Listing 7.7
def readFile(filename):
    datafile = open(filename, "r")
    datadict = {}
    key = 0
    aline = datafile.readline()
    while aline != "":              # readline() returns "" at end of file
        key = key + 1
        score = int(aline)
        datadict[key] = [score]
        aline = datafile.readline()
    datafile.close()
    return datadict
Creating Clusters
Decide on number of clusters
Choose data points to be initial centroids
Assign each data point to the cluster of its nearest centroid
Recompute centroids
Repeat
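The steps above can be sketched in a compact form. This is not the chapter's implementation: the chapter repeats a fixed number of passes, whereas this sketch swaps in a `while` loop that stops once no point changes cluster. The names `k_means` and `seed`, and the data values, are assumptions made for illustration. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math
import random

def k_means(points, k, seed=0):
    """Cluster points, stopping when assignments no longer change."""
    rng = random.Random(seed)
    # Choose k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    while True:
        # Assign each point to the index of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # nothing moved: stop repeating
            return centroids, assignment
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

# Made-up one-dimensional scores forming two obvious groups.
data = [[1], [2], [3], [50], [51], [52]]
centers, labels = k_means(data, 2)
print(sorted(centers))  # [[2.0], [51.0]]
```

Stopping on unchanged assignments is one common choice; a fixed pass count, as in the chapter, is simpler to reason about but may do more or fewer passes than needed.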
Listing 7.8
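import random

def createCentroids(k, datadict):
    # Pick k distinct data points as the initial centroids.
    # Keys are assumed to run from 1 to len(datadict), as built by readFile;
    # k must not exceed len(datadict) or the loop never ends.
    centroids = []
    centroidCount = 0
    centroidKeys = []
    while centroidCount < k:
        rkey = random.randint(1, len(datadict))
        if rkey not in centroidKeys:
            centroids.append(datadict[rkey])
            centroidKeys.append(rkey)
            centroidCount = centroidCount + 1
    return centroids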
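An alternative sketch of the same selection step, assuming the standard library's `random.sample` is acceptable in place of the retry loop. The function name `create_centroids_sample` and the example data are hypothetical. Unlike the loop version, it raises `ValueError` when `k` is larger than the number of data points instead of looping forever.

```python
import random

def create_centroids_sample(k, datadict):
    """Pick k distinct data points as initial centroids."""
    chosen_keys = random.sample(list(datadict), k)   # keys without repeats
    return [datadict[key] for key in chosen_keys]

# Made-up data in the same shape the file reader produces.
example = {1: [90], 2: [72], 3: [85], 4: [60]}
picked = create_centroids_sample(2, example)
print(picked)
```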
Listing 7.9
def createClusters(k, centroids, datadict, repeats):
    for apass in range(repeats):
        print("****PASS", apass, "****")
        clusters = []
        for i in range(k):
            clusters.append([])
        # Assign each data point to the cluster of its nearest centroid.
        for akey in datadict:
            distances = []
            for clusterIndex in range(k):
                dist = euclidD(datadict[akey], centroids[clusterIndex])
                distances.append(dist)
            mindist = min(distances)
            index = distances.index(mindist)
            clusters[index].append(akey)
        dimensions = len(datadict[1])
Listing 7.9 continued
        # Recompute each centroid as the mean of its cluster's points.
        # An empty cluster leaves its centroid at all zeros.
        for clusterIndex in range(k):
            sums = [0] * dimensions
            for akey in clusters[clusterIndex]:
                datapoints = datadict[akey]
                for ind in range(len(datapoints)):
                    sums[ind] = sums[ind] + datapoints[ind]
            for ind in range(len(sums)):
                clusterLen = len(clusters[clusterIndex])
                if clusterLen != 0:
                    sums[ind] = sums[ind] / clusterLen
            centroids[clusterIndex] = sums
        for c in clusters:
            print("CLUSTER")
            for key in c:
                print(datadict[key], end=" ")
            print()
    return clusters
Figure 7.7
Listing 7.10
def clusterAnalysis(dataFile):
    examDict = readFile(dataFile)
    examCentroids = createCentroids(5, examDict)
    examClusters = createClusters(5, examCentroids, examDict, 3)

clusterAnalysis("cs150exams.txt")
Visualizing Clusters
Earthquake data
Show clusters on a map
Use turtle module to plot data
Figure 7.8
Listing 7.11
import turtle

def visualizeQuakes(dataFile):
    # readFile must return [longitude, latitude] for each key here,
    # unlike the exam-score version shown earlier.
    datadict = readFile(dataFile)
    quakeCentroids = createCentroids(6, datadict)
    clusters = createClusters(6, quakeCentroids, datadict, 7)
    quakeT = turtle.Turtle()
    quakeWin = turtle.Screen()
    quakeWin.bgpic("worldmap.gif")
    quakeWin.screensize(448, 266)
    quakeWin.setup(width=500, height=300)
    # Scale degrees to pixels: 180 degrees of longitude and 90 degrees
    # of latitude each span half the canvas.
    wFactor = (quakeWin.screensize()[0] / 2) / 180
    hFactor = (quakeWin.screensize()[1] / 2) / 90
    quakeT.hideturtle()
    quakeT.up()
    colorlist = ["red", "green", "blue", "orange", "cyan", "yellow"]
    for clusterIndex in range(6):
        quakeT.color(colorlist[clusterIndex])
        for akey in clusters[clusterIndex]:
            lon = datadict[akey][0]
            lat = datadict[akey][1]
            quakeT.goto(lon * wFactor, lat * hFactor)
            quakeT.dot()
    quakeWin.exitonclick()
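The scaling arithmetic in the listing can be checked without opening a turtle window. The sketch below uses the 448 by 266 canvas size from the listing; the function name `to_screen` is hypothetical, and it assumes the map image is centered on longitude 0 and latitude 0 with degrees spaced evenly across it.

```python
def to_screen(lon, lat, canvas_width=448, canvas_height=266):
    """Scale longitude/latitude in degrees to turtle coordinates centered at (0, 0)."""
    w_factor = (canvas_width / 2) / 180    # pixels per degree of longitude
    h_factor = (canvas_height / 2) / 90    # pixels per degree of latitude
    return lon * w_factor, lat * h_factor

print(to_screen(0, 0))        # (0.0, 0.0)
x, y = to_screen(180, 90)     # the top-right corner of the map
print(round(x), round(y))     # 224 133
```

Results are rounded for display because the factors are floating-point values.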
Figure 7.9

This chapter delves into using Python lists for data storage, implementing data mining applications, and exploring cluster analysis. Learn about clusters, centroids, Euclidean distance, visualization, and how to write functions for data analysis. The concepts of indefinite iteration and loop control are also discussed, along with practical examples to enhance understanding.

  • Python
  • Data Mining
  • Cluster Analysis
  • Visualization
  • Indefinite Iteration

Uploaded on Oct 07, 2024 | 2 Views



