Python Data Mining and Cluster Analysis in Context

Python Programming in Context

Chapter 7

Objectives

• To use Python lists as a means of storing data
• To implement a nontrivial data mining application
• To understand and implement cluster analysis
• To use visualization as a means of displaying patterns

Cluster

• Data points that have something in common
• Clusters are dissimilar to each other
• Use simple Euclidean distance to measure how close one point is to another
• Centroid is a point that represents a cluster (not necessarily a real data point)
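To make the last bullet concrete, here is a minimal sketch that assumes the centroid is taken as the coordinate-wise mean of the cluster's points (the approach the later listings use). The function name `centroid` and the sample points are hypothetical, chosen only for illustration.

```python
def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of equal-length points."""
    dimensions = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dimensions)]

# Three made-up 2-D points forming one cluster
cluster = [[1, 2], [3, 4], [5, 9]]
center = centroid(cluster)
print(center)             # [3.0, 5.0]
print(center in cluster)  # False: the centroid is not one of the data points
```

The result represents the cluster even though no data point sits at that location.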

Figure 7.1

Figure 7.2

Figure 7.3

Figure 7.4

Listing 7.1

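import math

def euclidD(point1, point2):
    # Sum the squared differences along every dimension
    total = 0
    for index in range(len(point1)):
        diff = (point1[index] - point2[index]) ** 2
        total = total + diff
    # The distance is the square root of that sum
    euclidDistance = math.sqrt(total)
    return euclidDistance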
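As a quick check of the distance function, the sketch below restates it so the block runs on its own and compares it with `math.dist` from the standard library (available in Python 3.8 and later). The sample points are made up for illustration.

```python
import math

def euclidD(point1, point2):
    # Restated from Listing 7.1 so this block is self-contained
    total = 0
    for a, b in zip(point1, point2):
        total = total + (a - b) ** 2
    return math.sqrt(total)

p, q = [0, 0], [3, 4]
d = euclidD(p, q)
print(d)                                   # 5.0 (a 3-4-5 right triangle)
print(math.isclose(d, math.dist(p, q)))    # True
```

The same function works unchanged for one-dimensional points such as `[85]` and `[90]`, which is how the exam scores are stored later.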

Figure 7.5

Listing 7.2

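def readFile(filename):
    datadict = {}
    key = 0
    # The with block closes the file once the loop finishes
    with open(filename, "r") as datafile:
        for aline in datafile:
            key = key + 1
            score = int(aline)
            # Each score is stored as a one-element list (a 1-D point)
            datadict[key] = [score]
    return datadict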
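The sketch below shows what the resulting dictionary looks like. It assumes a file with one integer score per line; the file name `scores.txt` and its contents are invented for the example, and the function is restated so the block runs on its own.

```python
import os
import tempfile

def readFile(filename):
    # Restated from Listing 7.2 so this block is self-contained
    datadict = {}
    key = 0
    with open(filename, "r") as datafile:
        for aline in datafile:
            key = key + 1
            datadict[key] = [int(aline)]
    return datadict

# Write a small, made-up score file into a temporary folder and read it back
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scores.txt")
    with open(path, "w") as out:
        out.write("90\n72\n85\n")
    scores = readFile(path)

print(scores)   # {1: [90], 2: [72], 3: [85]}
```

Keys start at 1 and each value is a list, so the same structure can hold points with more than one coordinate.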

Indefinite Iteration

• Repeating a process an unknown number of times
• Control is based on a boolean expression
• Infinite loop is possible
• Any for loop can be written as a while loop
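A small illustration of the first two bullets: the number of repetitions below is not known in advance and depends only on a boolean condition. The function name `halvings_until_below_one` is hypothetical.

```python
def halvings_until_below_one(value):
    """Count how many times value must be halved before it drops below 1."""
    count = 0
    while value >= 1:          # boolean condition controls the loop
        value = value / 2      # change of state moves toward termination
        count = count + 1
    return count

print(halvings_until_below_one(100))   # 7
print(halvings_until_below_one(0.5))   # 0: the body never runs
```

A for loop over a fixed range would not fit here, because the iteration count depends on the starting value.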

Listing 7.3

while <condition>:
    statement1
    statement2
    ...
    statementn

Figure 7.6

Listing 7.4

total = 0    # named total to avoid shadowing the built-in sum
for anum in range(1, 11):
    total = total + anum
print(total)

Listing 7.5

total = 0
anum = 1                  # initialization
while anum <= 10:         # condition
    total = total + anum
    anum = anum + 1       # change of state
print(total)

Listing 7.6

total = 0
anum = 1
while anum <= 10:
    total = total + anum  # anum never changes, so the condition stays True
print(total)              # never reached: this loop is infinite
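One way to catch a loop like this while debugging is to cap the number of iterations. The sketch below is not from the slides; `guarded_sum`, its `step` parameter, and `max_iterations` are hypothetical names. Passing `step=0` reproduces the missing change of state.

```python
def guarded_sum(limit, step=1, max_iterations=1000):
    """Sum 1..limit with a while loop, raising if the loop runs too long."""
    total = 0
    anum = 1
    iterations = 0
    while anum <= limit:
        if iterations >= max_iterations:
            raise RuntimeError("loop did not terminate; is the loop variable changing?")
        total = total + anum
        anum = anum + step         # with step=0 this never changes anum
        iterations = iterations + 1
    return total

print(guarded_sum(10))             # 55
try:
    guarded_sum(10, step=0)        # same bug as the listing above
except RuntimeError as err:
    print("caught:", err)
```

The cap is a debugging aid only; the real fix is to make sure the loop body changes the state that the condition tests.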

Listing 7.7

def readFile(filename):
    datafile = open(filename, "r")
    datadict = {}
    key = 0
    aline = datafile.readline()
    # readline returns "" only at end of file
    while aline != "":
        key = key + 1
        score = int(aline)
        datadict[key] = [score]
        aline = datafile.readline()
    datafile.close()
    return datadict

Creating Clusters

• Decide on number of clusters
• Choose data points to be initial centroids
• Assign each data point to the cluster of its nearest centroid
• Recompute centroids
• Repeat the assignment and recomputation steps
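The steps above can be traced by hand on a tiny one-dimensional data set. The sketch below is illustrative only: the name `assign_and_recompute`, the sample values, and the fixed (non-random) starting centroids are assumptions, and an empty cluster simply keeps its old centroid here.

```python
def assign_and_recompute(values, centroids):
    """One pass: assign each value to its nearest centroid, then average each cluster."""
    clusters = [[] for _ in centroids]
    for value in values:
        distances = [abs(value - c) for c in centroids]
        nearest = distances.index(min(distances))
        clusters[nearest].append(value)
    new_centroids = []
    for members, old in zip(clusters, centroids):
        if members:
            new_centroids.append(sum(members) / len(members))
        else:
            new_centroids.append(old)   # assumption: keep the old centroid if empty
    return clusters, new_centroids

values = [1, 2, 3, 10, 11, 12]
first_clusters, first_centroids = assign_and_recompute(values, [1, 2])
second_clusters, second_centroids = assign_and_recompute(values, first_centroids)

print(first_clusters, first_centroids)     # [[1], [2, 3, 10, 11, 12]] [1.0, 7.6]
print(second_clusters, second_centroids)   # [[1, 2, 3], [10, 11, 12]] [2.0, 11.0]
```

Even with a poor starting choice, the second pass separates the low values from the high ones, which is why the process is repeated.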

Listing 7.8

import random

def createCentroids(k, datadict):
    centroids = []
    centroidCount = 0
    centroidKeys = []
    # Keep drawing random keys until k distinct points are chosen
    # (assumes k <= len(datadict), otherwise the loop never ends)
    while centroidCount < k:
        rkey = random.randint(1, len(datadict))
        if rkey not in centroidKeys:
            centroids.append(datadict[rkey])
            centroidKeys.append(rkey)
            centroidCount = centroidCount + 1
    return centroids
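For comparison, the standard library can pick k distinct keys in one call with `random.sample`. This is an alternative sketch, not the slides' method; `pick_centroids` and the sample data are hypothetical.

```python
import random

def pick_centroids(k, datadict):
    """Choose k distinct data points as starting centroids."""
    # random.sample never repeats an element, so no duplicate check is needed;
    # it raises ValueError if k is larger than the number of keys
    chosen_keys = random.sample(list(datadict), k)
    return [datadict[key] for key in chosen_keys]

random.seed(0)   # fixed seed only so repeated runs pick the same points
data = {1: [90], 2: [72], 3: [85], 4: [60], 5: [77]}
centroids = pick_centroids(3, data)
print(centroids)
```

The while-loop version in the listing remains useful as an example of indefinite iteration, since the number of draws it needs is not known in advance.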

Listing 7.9

def createClusters(k, centroids, datadict, repeats):
    for apass in range(repeats):
        print("****PASS", apass, "****")
        clusters = []
        for i in range(k):
            clusters.append([])

        # Assign each point to the cluster of its nearest centroid
        for akey in datadict:
            distances = []
            for clusterIndex in range(k):
                dist = euclidD(datadict[akey], centroids[clusterIndex])
                distances.append(dist)
            mindist = min(distances)
            index = distances.index(mindist)
            clusters[index].append(akey)

        dimensions = len(datadict[1])

Listing 7.9 continued

        # Recompute each centroid as the mean of its members
        for clusterIndex in range(k):
            sums = [0] * dimensions
            for akey in clusters[clusterIndex]:
                datapoints = datadict[akey]
                for ind in range(len(datapoints)):
                    sums[ind] = sums[ind] + datapoints[ind]
            clusterLen = len(clusters[clusterIndex])
            for ind in range(len(sums)):
                if clusterLen != 0:
                    sums[ind] = sums[ind] / clusterLen
            # An empty cluster leaves its centroid at all zeros
            centroids[clusterIndex] = sums

        # Show the members of every cluster after this pass
        for c in clusters:
            print("CLUSTER")
            for key in c:
                print(datadict[key], end=" ")
            print()
    return clusters
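The listing runs a fixed number of passes. A possible variation, tying back to indefinite iteration, is to repeat until no point changes cluster. The sketch below is an assumption-laden illustration and not the slides' code: `nearest`, `cluster_until_stable`, `max_passes`, and the sample data are hypothetical, it uses `math.dist` (Python 3.8+), and it keeps the old centroid when a cluster is empty.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def cluster_until_stable(datadict, centroids, max_passes=100):
    """Repeat assign/recompute until assignments stop changing."""
    centroids = [list(c) for c in centroids]   # copy so the caller's list is untouched
    assignment = {}
    passes = 0
    changed = True
    while changed and passes < max_passes:
        passes = passes + 1
        new_assignment = {key: nearest(point, centroids)
                          for key, point in datadict.items()}
        changed = new_assignment != assignment
        assignment = new_assignment
        for index in range(len(centroids)):
            members = [datadict[key] for key, i in assignment.items() if i == index]
            if members:   # assumption: an empty cluster keeps its old centroid
                centroids[index] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids, passes

data = {1: [1], 2: [2], 3: [3], 4: [10], 5: [11], 6: [12]}
assignment, centroids, passes = cluster_until_stable(data, [[1], [2]])
print(assignment)   # {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
print(centroids)    # [[2.0], [11.0]]
print(passes)       # 3: the third pass confirms nothing changed
```

The `max_passes` cap guards against the infinite-loop risk noted earlier.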

Figure 7.7

Listing 7.10

def clusterAnalysis(dataFile):
    examDict = readFile(dataFile)
    examCentroids = createCentroids(5, examDict)
    examClusters = createClusters(5, examCentroids, examDict, 3)

clusterAnalysis("cs150exams.txt")

Visualizing Clusters

• Earthquake data
• Show clusters on a map
• Use turtle module to plot data

Figure 7.8

Listing 7.11

import turtle

def visualizeQuakes(dataFile):
    # Assumes a readFile variant that stores [longitude, latitude] for each key,
    # not the one-score version of Listing 7.7
    datadict = readFile(dataFile)
    quakeCentroids = createCentroids(6, datadict)
    clusters = createClusters(6, quakeCentroids, datadict, 7)

    quakeT = turtle.Turtle()
    quakeWin = turtle.Screen()
    quakeWin.bgpic("worldmap.gif")
    quakeWin.screensize(448, 266)
    quakeWin.setup(width=500, height=300)

    # Scale degrees to pixels: longitude spans -180..180, latitude -90..90
    wFactor = (quakeWin.screensize()[0] / 2) / 180
    hFactor = (quakeWin.screensize()[1] / 2) / 90

    quakeT.hideturtle()
    quakeT.up()

    colorlist = ["red", "green", "blue", "orange", "cyan", "yellow"]

    # Plot every quake as a dot in its cluster's color
    for clusterIndex in range(6):
        quakeT.color(colorlist[clusterIndex])
        for akey in clusters[clusterIndex]:
            lon = datadict[akey][0]
            lat = datadict[akey][1]
            quakeT.goto(lon * wFactor, lat * hFactor)
            quakeT.dot()

    quakeWin.exitonclick()
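The scaling step can be checked without opening a turtle window. The sketch below pulls the same arithmetic into a plain function; `to_screen` is a hypothetical helper, and the default canvas size is taken from the `screensize(448, 266)` call in the listing.

```python
import math

def to_screen(lon, lat, canvas_width=448, canvas_height=266):
    """Map longitude/latitude in degrees to turtle x/y with (0, 0) at the center."""
    w_factor = (canvas_width / 2) / 180    # pixels per degree of longitude
    h_factor = (canvas_height / 2) / 90    # pixels per degree of latitude
    return lon * w_factor, lat * h_factor

x, y = to_screen(180, 90)     # far east, north pole: close to the top-right corner
print(round(x), round(y))     # 224 133
print(to_screen(0, 0))        # the center of the map
```

Negative longitudes and latitudes map to the left and bottom halves of the canvas, which matches turtle's coordinate system with its origin at the center.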

Figure 7.9


This chapter delves into using Python lists for data storage, implementing data mining applications, and exploring cluster analysis. Learn about clusters, centroids, Euclidean distance, visualization, and how to write functions for data analysis. The concepts of indefinite iteration and loop control are also discussed, along with practical examples to enhance understanding.

Uploaded by olu_m on Oct 07, 2024

