Neural Networks – Part 2

This lecture covers the training process of perceptrons, the notation for training sets, and how to optimize the model parameters of a single perceptron. It also shows how perceptrons serve as regression models and binary classifiers, and how they can handle multiclass problems.

  • Perceptrons
  • Multiclass Problems
  • Training Sets
  • Regression Models
  • Binary Classifiers


Presentation Transcript


  1. Neural Networks Part 2: Training Perceptrons, Handling Multiclass Problems. CSE 4309 Machine Learning. Vassilis Athitsos, Computer Science and Engineering Department, University of Texas at Arlington.

  2. Training a Neural Network In some cases, the training process can find the best solution using a closed-form formula. Example: linear regression, for the sum-of-squares error. In some cases, the training process can find the best weights using an iterative method. Example: sequential learning for linear regression. In neural networks, we cannot find the best weights (unless we have an astronomical amount of luck). Instead, we use gradient descent to find local minima of the error function. In recent years this approach has produced spectacular results in real-world applications.

  3. Notation for Training Set We have a set X of N training examples: X = {x_1, x_2, ..., x_N}. Each x_n is a D-dimensional column vector: x_n = (x_{n,1}, x_{n,2}, ..., x_{n,D}). We also have a set T of N target outputs: T = {t_1, t_2, ..., t_N}, where t_n is the target output for training example x_n. Each t_n is a K-dimensional column vector: t_n = (t_{n,1}, t_{n,2}, ..., t_{n,K}). Note: K typically is not equal to D. In your assignment, K is equal to the number of classes. K is also equal to the number of units in the output layer.

  4. Perceptron Learning Before we discuss how to train an entire neural network, we start with a single perceptron. Remember: given input x_n, a perceptron computes its output z using this formula: z(x_n) = h(b + w^T x_n), where h is the activation function. What are the model parameters that we want to optimize during training?

  5. Perceptron Learning Before we discuss how to train an entire neural network, we start with a single perceptron. Remember: given input x_n, a perceptron computes its output z using this formula: z(x_n) = h(b + w^T x_n), where h is the activation function. What are the model parameters that we want to optimize during training? The weights: the bias weight b, and the weight vector w.

  6. Regression or Classification? If using the sigmoid as activation function, a perceptron outputs a continuous value between 0 and 1. With other frequently used activation functions (ReLU, tanh, etc.), the output is still a continuous value. Thus, perceptrons and neural networks are regression models, since they produce continuous outputs. However, perceptrons and neural networks can easily be used for classification. A perceptron can be treated as a binary classifier: one class label is 0, the other class label is 1. Neural networks can do multiclass classification (more details on that later).

  7. Perceptron Learning Given input x_n, a perceptron computes its output z using this formula: z(x_n) = h(b + w^T x_n). We use sum-of-squares as our error function. E_n(b, w) is the contribution of training example x_n: E_n(b, w) = (1/2) (z(x_n) - t_n)^2. The error E over the entire training set is defined as: E(b, w) = Σ_{n=1}^{N} E_n(b, w). Important: a single perceptron has a single output. Therefore, for perceptrons (but NOT for neural networks in general), we assume that t_n is one-dimensional.

  8. Perceptron Learning Suppose that a perceptron is using the step function as its activation function: z(x) = h(b + w^T x) = 0 if b + w^T x < 0, and 1 if b + w^T x ≥ 0. Can we apply gradient descent in that case? No, because E(b, w) has a gradient that is the zero vector most of the time.

  9. Perceptron Learning E(b, w) has a gradient that is the zero vector most of the time. Small changes of b or w usually lead to no changes in h(b + w^T x). The only exception is when the change in b or w causes b + w^T x to switch signs (from positive to negative, or from negative to positive). Why is that a problem for learning?

  10. Perceptron Learning E(b, w) has a gradient that is the zero vector most of the time. Small changes of b or w usually lead to no changes in h(b + w^T x). The only exception is when the change in b or w causes b + w^T x to switch signs (from positive to negative, or from negative to positive). Why is that a problem for learning? If we want to do gradient descent, and the gradient is the zero vector, we end up making no updates to the parameters, so nothing is learned.

  11. Perceptron Learning A better option is setting h to the sigmoid function: z(x) = h(b + w^T x) = 1 / (1 + e^{-(b + w^T x)}). Then, measured just on a single training object x_n, the error E_n(b, w) is defined as: E_n(b, w) = (1/2) (z(x_n) - t_n)^2 = (1/2) (1 / (1 + e^{-(b + w^T x_n)}) - t_n)^2. Reminder: if our neural network is a single perceptron, then the target output t_n must be one-dimensional. These formulas, so far, deal only with training a single perceptron.
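The sigmoid output and the per-example error above can be sketched in a few lines of Python with NumPy (the function names `perceptron_output` and `example_error` are my own, not from the slides):

```python
import numpy as np

def perceptron_output(b, w, x):
    """Sigmoid perceptron: z(x) = 1 / (1 + exp(-(b + w^T x)))."""
    return 1.0 / (1.0 + np.exp(-(b + np.dot(w, x))))

def example_error(b, w, x, t):
    """Per-example sum-of-squares error: E_n = (1/2) (z(x) - t)^2."""
    return 0.5 * (perceptron_output(b, w, x) - t) ** 2
```

For instance, with b = 0 and w = 0 the output is exactly 0.5 for any input, so the error against target t = 1 is (1/2)(0.5)^2 = 0.125.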

  12. Computing the Gradient E_n(b, w) = (1/2) (1 / (1 + e^{-(b + w^T x_n)}) - t_n)^2. In this form, E_n(b, w) is differentiable. If we do the calculations, the gradients turn out to be: ∂E_n/∂b = (z(x_n) - t_n) z(x_n) (1 - z(x_n)), and ∂E_n/∂w = (z(x_n) - t_n) z(x_n) (1 - z(x_n)) x_n.

  13. Computing the Gradient From the previous slide, the gradients are: ∂E_n/∂b = (z(x_n) - t_n) z(x_n) (1 - z(x_n)), and ∂E_n/∂w = (z(x_n) - t_n) z(x_n) (1 - z(x_n)) x_n. Note that ∂E_n/∂w is a D-dimensional vector: it is a scalar, (z(x_n) - t_n) z(x_n) (1 - z(x_n)), multiplied by vector x_n.

  14. Weight Update ∂E_n/∂b = (z(x_n) - t_n) z(x_n) (1 - z(x_n)), and ∂E_n/∂w = (z(x_n) - t_n) z(x_n) (1 - z(x_n)) x_n. So, we update the bias weight b and the weight vector w as follows: b = b - η (z(x_n) - t_n) z(x_n) (1 - z(x_n)), and w = w - η (z(x_n) - t_n) z(x_n) (1 - z(x_n)) x_n.

  15. Weight Update (From previous slide) Update formulas: b = b - η (z(x_n) - t_n) z(x_n) (1 - z(x_n)), and w = w - η (z(x_n) - t_n) z(x_n) (1 - z(x_n)) x_n. As before, η is the learning rate parameter. It is a positive real number that should be chosen carefully, so as not to be too big or too small. In terms of individual weights w_d, the update rule is: w_d = w_d - η (z(x_n) - t_n) z(x_n) (1 - z(x_n)) x_{n,d}.
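The update formulas above translate directly into code. A minimal sketch of one gradient-descent step on a single example (the helper name `sgd_step` is my own):

```python
import numpy as np

def sgd_step(b, w, x, t, eta):
    """One gradient-descent update on training example (x, t).

    z = sigmoid(b + w^T x)
    dE/db = (z - t) * z * (1 - z)
    dE/dw = (z - t) * z * (1 - z) * x
    """
    z = 1.0 / (1.0 + np.exp(-(b + np.dot(w, x))))
    s = (z - t) * z * (1.0 - z)   # the common scalar factor
    return b - eta * s, w - eta * s * x
```

With b = 0, w = 0, x = (1, 1), t = 1, and η = 1: z = 0.5, the scalar factor is (0.5 - 1)(0.5)(0.5) = -0.125, so both b and each w_d move up by 0.125, toward the target.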

  16. Perceptron Learning - Summary Input: training inputs x_1, ..., x_N, target outputs t_1, ..., t_N. For a binary classification problem, each t_n is set to 0 or 1. 1. Initialize b and each w_d to small random numbers. For example, set b and each w_d to a random value between -0.1 and 0.1. 2. For n = 1 to N: a) Compute z(x_n). b) b = b - η (z(x_n) - t_n) z(x_n) (1 - z(x_n)). c) For d = 1 to D: w_d = w_d - η (z(x_n) - t_n) z(x_n) (1 - z(x_n)) x_{n,d}. 3. If some stopping criterion has been met, exit. 4. Else, go to step 2.

  17. Stopping Criterion At step 3 of the perceptron learning algorithm, we need to decide whether to stop or not. One thing we can do is: compute the cumulative squared error E(b, w) of the perceptron at that point: E(b, w) = Σ_{n=1}^{N} E_n(b, w) = Σ_{n=1}^{N} (1/2) (z(x_n) - t_n)^2. Compare the current value of E(b, w) with the value of E(b, w) computed at the previous iteration. If the difference is too small (e.g., smaller than 0.00001), we stop.
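Putting the update rule and the stopping criterion together, one possible implementation of the full perceptron learning algorithm might look like this sketch (function and parameter names are my own; the initialization range and the error tolerance follow the slides' suggestions, while the learning rate here is an arbitrary choice):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_perceptron(X, T, eta=0.5, tol=1e-5, max_epochs=10000, seed=0):
    """Train a single sigmoid perceptron by gradient descent.

    X: (N, D) array of training inputs; T: (N,) array of 0/1 targets.
    Stops when the epoch-to-epoch drop in total squared error
    falls below tol, or after max_epochs passes over the data.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize b and each w_d between -0.1 and 0.1.
    b = rng.uniform(-0.1, 0.1)
    w = rng.uniform(-0.1, 0.1, size=D)
    prev_error = np.inf
    for _ in range(max_epochs):
        # Step 2: one update per training example.
        for n in range(N):
            z = sigmoid(b + w @ X[n])
            s = (z - T[n]) * z * (1.0 - z)
            b -= eta * s
            w -= eta * s * X[n]
        # Step 3: cumulative squared error as stopping criterion.
        error = 0.5 * np.sum((sigmoid(X @ w + b) - T) ** 2)
        if prev_error - error < tol:
            break
        prev_error = error
    return b, w
```

On a small linearly separable set, the learned (b, w) should place the sigmoid's 0.5 threshold between the two classes.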

  18. Using Perceptrons for Multiclass Problems Multiclass means that we have more than two classes. A perceptron outputs a number between 0 and 1. This is sufficient only for binary classification problems. For more than two classes, there are many different options. We will follow a general approach called one-versus-all classification (also known as OVA classification). This is a general method that can be combined with various binary classification methods to solve multiclass problems. Here we see the method applied to perceptrons.

  19. A Multiclass Example Suppose we have this training set: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, t_1 = "dog"; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, t_2 = "dog"; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, t_3 = "cat"; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, t_4 = "fox"; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, t_5 = "cat"; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, t_6 = "fox". In this training set: we have three classes; each training input x_n is a five-dimensional vector; the class labels t_n are strings.

  20. Converting to One-Versus-All Suppose we have this training set: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, t_1 = "dog"; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, t_2 = "dog"; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, t_3 = "cat"; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, t_4 = "fox"; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, t_5 = "cat"; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, t_6 = "fox". Step 1: Generate new class labels y_n, where classes are numbered sequentially starting from 1. Thus, in our example, the class labels become 1, 2, 3: y_1 = 1, y_2 = 1, y_3 = 2, y_4 = 3, y_5 = 2, y_6 = 3.

  21. Converting to One-Versus-All Training set: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, y_1 = 1; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, y_2 = 1; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, y_3 = 2; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, y_4 = 3; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, y_5 = 2; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, y_6 = 3. Step 2: Convert each label y_n to a one-hot vector t_n. Vector t_n has as many dimensions as the number of classes. How many dimensions should we use in our example?

  22. Converting to One-Versus-All Training set: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, y_1 = 1; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, y_2 = 1; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, y_3 = 2; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, y_4 = 3; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, y_5 = 2; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, y_6 = 3. Step 2: Convert each label y_n to a one-hot vector t_n. Vector t_n has as many dimensions as the number of classes. In our example we have three classes, so each t_n is 3-dimensional. If y_n = i, then set the i-th dimension of t_n to 1. Otherwise, set the i-th dimension of t_n to 0. t_1 = (?, ?, ?)^T, t_2 = (?, ?, ?)^T, t_3 = (?, ?, ?)^T, t_4 = (?, ?, ?)^T, t_5 = (?, ?, ?)^T, t_6 = (?, ?, ?)^T.

  23. Converting to One-Versus-All Training set: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, y_1 = 1; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, y_2 = 1; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, y_3 = 2; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, y_4 = 3; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, y_5 = 2; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, y_6 = 3. Step 2: Convert each label y_n to a one-hot vector t_n. Vector t_n has as many dimensions as the number of classes. In our example we have three classes, so each t_n is 3-dimensional. If y_n = i, then set the i-th dimension of t_n to 1. Otherwise, set the i-th dimension of t_n to 0. t_1 = (1, 0, 0)^T, t_2 = (1, 0, 0)^T, t_3 = (0, 1, 0)^T, t_4 = (0, 0, 1)^T, t_5 = (0, 1, 0)^T, t_6 = (0, 0, 1)^T.
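Steps 1 and 2 of this conversion can be sketched in plain Python (the helper name `to_one_hot` is my own; classes are numbered in order of first appearance, matching the example's dog = 1, cat = 2, fox = 3):

```python
def to_one_hot(labels):
    """Map string class labels to one-hot vectors.

    Returns (classes, one_hot), where classes lists the distinct
    labels in order of first appearance, and one_hot[n] is the
    one-hot vector t_n for the n-th training example.
    """
    classes = list(dict.fromkeys(labels))   # distinct, in first-seen order
    index = {c: i for i, c in enumerate(classes)}
    one_hot = []
    for lab in labels:
        v = [0] * len(classes)
        v[index[lab]] = 1                   # 1 in the label's dimension
        one_hot.append(v)
    return classes, one_hot
```

On the labels dog, dog, cat, fox, cat, fox this produces exactly the six vectors shown above.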

  24. Converting to One-Versus-All Training set: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, y_1 = 1; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, y_2 = 1; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, y_3 = 2; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, y_4 = 3; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, y_5 = 2; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, y_6 = 3. One-hot vectors: t_1 = (1, 0, 0)^T, t_2 = (1, 0, 0)^T, t_3 = (0, 1, 0)^T, t_4 = (0, 0, 1)^T, t_5 = (0, 1, 0)^T, t_6 = (0, 0, 1)^T. Step 3: Train three separate perceptrons (as many as the number of classes). For training the first perceptron, use the first dimension of each t_n as target output for x_n.

  25. Training Set for the First Perceptron Training set used to train the first perceptron: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, t_1 = 1; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, t_2 = 1; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, t_3 = 0; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, t_4 = 0; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, t_5 = 0; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, t_6 = 0. Essentially, the first perceptron is trained to output 1 when: the original class label is "dog"; the sequentially numbered class label y_n is 1.

  26. Converting to One-Versus-All Training set for the multiclass problem: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, y_1 = 1; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, y_2 = 1; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, y_3 = 2; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, y_4 = 3; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, y_5 = 2; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, y_6 = 3. One-hot vectors: t_1 = (1, 0, 0)^T, t_2 = (1, 0, 0)^T, t_3 = (0, 1, 0)^T, t_4 = (0, 0, 1)^T, t_5 = (0, 1, 0)^T, t_6 = (0, 0, 1)^T. Step 3: Train three separate perceptrons (as many as the number of classes). For training the second perceptron, use the second dimension of each t_n as target output for x_n.

  27. Training Set for the Second Perceptron Training set used to train the second perceptron: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, t_1 = 0; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, t_2 = 0; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, t_3 = 1; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, t_4 = 0; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, t_5 = 1; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, t_6 = 0. Essentially, the second perceptron is trained to output 1 when: the original class label is "cat"; the sequentially numbered class label y_n is 2.

  28. Converting to One-Versus-All Training set for the multiclass problem: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, y_1 = 1; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, y_2 = 1; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, y_3 = 2; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, y_4 = 3; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, y_5 = 2; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, y_6 = 3. One-hot vectors: t_1 = (1, 0, 0)^T, t_2 = (1, 0, 0)^T, t_3 = (0, 1, 0)^T, t_4 = (0, 0, 1)^T, t_5 = (0, 1, 0)^T, t_6 = (0, 0, 1)^T. Step 3: Train three separate perceptrons (as many as the number of classes). For training the third perceptron, use the third dimension of each t_n as target output for x_n.

  29. Training Set for the Third Perceptron Training set used to train the third perceptron: x_1 = (0.5, 2.4, 8.3, 1.2, 4.5)^T, t_1 = 0; x_2 = (3.4, 0.6, 4.4, 6.2, 1.0)^T, t_2 = 0; x_3 = (4.7, 1.9, 6.7, 1.2, 3.9)^T, t_3 = 0; x_4 = (2.6, 1.3, 9.4, 0.7, 5.1)^T, t_4 = 1; x_5 = (8.5, 4.6, 3.6, 2.0, 6.2)^T, t_5 = 0; x_6 = (5.2, 8.1, 7.3, 4.2, 1.6)^T, t_6 = 1. Essentially, the third perceptron is trained to output 1 when: the original class label is "fox"; the sequentially numbered class label y_n is 3.
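Extracting the per-perceptron targets shown on these slides amounts to reading off one dimension of each one-hot vector. A minimal sketch (the helper name is my own):

```python
def per_class_targets(one_hot_targets, k):
    """Return the k-th dimension (0-based) of each one-hot target
    vector: the 0/1 targets for the k-th one-versus-all perceptron."""
    return [t[k] for t in one_hot_targets]
```

Applied with k = 2 to the six one-hot vectors of our example, it yields the targets 0, 0, 0, 1, 0, 1 used for the third (fox) perceptron.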

  30. One-Versus-All Perceptrons: Recap Suppose we have K classes C_1, ..., C_K, where K > 2. We have training inputs x_1, ..., x_N, and target values t_1, ..., t_N. Each target value t_n is a K-dimensional vector: t_n = (t_{n,1}, t_{n,2}, ..., t_{n,K}), where t_{n,k} = 0 if the class of x_n is not C_k, and t_{n,k} = 1 if the class of x_n is C_k. For each class C_k, train a perceptron P_k by using t_{n,k} as the target value for x_n. So, perceptron P_k is trained to recognize if an object belongs to class C_k or not. In total, we train K perceptrons, one for each class.

  31. One-Versus-All Perceptrons To classify a test pattern x: compute the responses z_k(x) for all K perceptrons; find the perceptron P_k such that the value z_k(x) is higher than all other responses; output that the class of x is C_k. In summary: we assign x to the class whose perceptron produced the highest output value for x.
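This classification rule can be sketched as follows, assuming each trained perceptron is stored as a (b, w) pair and uses the sigmoid activation (the function name `ova_classify` is my own):

```python
import numpy as np

def ova_classify(x, perceptrons):
    """perceptrons: list of (b, w) pairs, one per class, in class order.

    Returns the (0-based) index of the class whose perceptron
    produces the highest response z_k(x) for input x.
    """
    responses = [1.0 / (1.0 + np.exp(-(b + np.dot(w, x))))
                 for b, w in perceptrons]
    return int(np.argmax(responses))
```

Since the sigmoid is monotonic, comparing the sigmoid outputs gives the same winner as comparing the raw values b + w^T x.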

  32. Parenthesis: Categorical Attributes A few slides ago we saw how to convert strings representing class labels to one-hot vectors. The same approach can also be used for attributes. In our examples so far, the attributes have always been numbers. Sometimes, we have categorical attributes: attributes, represented as strings, that belong to a specific set of categories. Examples: blood type of a person: A, B, AB, or O. Type of vehicle: passenger car, motorcycle, truck, bus. Occupation: student, teacher, bus driver, police officer, ...

  33. Mixed Attributes A dataset can contain inputs where some attributes are numerical, and some attributes are categorical. Example: UCI Census Income Dataset, accessible from: https://archive.ics.uci.edu/ml/datasets/Census+Income. Classification goal: predict if someone earns more than $50K/year. Some of the attributes in that dataset: age: continuous. workclass: Private, Self-emp-not-inc, Self-emp-inc, ... education: Bachelors, Some-college, 11th, HS-grad, Prof-school, ... occupation: Tech-support, Craft-repair, Other-service, Sales, ... race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. hours-per-week: continuous.

  34. Mixed Attributes How can we train a perceptron (or a Naïve Bayes classifier, or a linear regression model) on such a dataset? Our methods so far have assumed that each input is a vector of real numbers. Answer: convert each categorical attribute, separately, into a one-hot vector. The input can then be represented as a vector that is a concatenation of all these one-hot vectors, plus all the numerical attributes.
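A sketch of this conversion, under the assumption that each input is a dict of attribute names to values and that the list of possible values for each categorical attribute is known in advance (all names here are hypothetical, not from the dataset's full schema):

```python
def encode_mixed(record, categories):
    """Encode a mixed-attribute record as a flat numeric vector.

    record:     dict mapping attribute name -> value.
    categories: dict mapping each CATEGORICAL attribute name to its
                list of possible values; attributes not listed here
                are treated as numerical.
    Categorical values become one-hot sub-vectors; numerical values
    are kept as-is, all concatenated in the record's attribute order.
    """
    vec = []
    for name, value in record.items():
        if name in categories:
            # One-hot: 1.0 in the matching category's position.
            vec.extend(1.0 if value == c else 0.0
                       for c in categories[name])
        else:
            vec.append(float(value))
    return vec
```

For example, a record with age 39, workclass "Private" (out of two possible workclass values), and 40 hours per week becomes the vector (39.0, 1.0, 0.0, 40.0).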

  35. Multiclass Neural Networks For perceptrons, we saw that we can perform multiclass classification (i.e., for more than two classes) using the one-versus-all (OVA) approach: we train one perceptron for each class. These multiple perceptrons can also be thought of as a single neural network.

  36. OVA Perceptrons as a Single Network [Figure: a two-layer network, with five input units (Layer 1, the input layer), each connected to all three output units (Layer 2, the output layer).]

  37. Multiclass Neural Networks For perceptrons, we saw that we can perform multiclass classification (i.e., for more than two classes) using the one-versus-all (OVA) approach: we train one perceptron for each class. These multiple perceptrons can also be thought of as a single neural network. In the simplest case, a neural network designed to recognize multiple classes looks like the previous example. In the general case, there are also hidden layers.

  38. A Network for Our Example [Figure: a four-layer network: Layer 1 (input layer) with five units, Layer 2 (1st hidden layer) with four units, Layer 3 (2nd hidden layer) with four units, and Layer 4 (output layer) with three units.]

  39. Input Layer: How many units does it have? Could we have a different number? Is the number of input units a hyperparameter? [Figure: the same four-layer network, with the input layer (Layer 1) highlighted.]

  40. In our example, the input layer must have five units, because each input is five-dimensional. We don't have a choice. [Figure: the same four-layer network, with the input layer (Layer 1) highlighted.]

  41. This network has two hidden layers, with four units per layer. The number of hidden layers and the number of units per layer are hyperparameters; they can take different values. [Figure: the same four-layer network, with the hidden layers (Layers 2 and 3) highlighted.]

  42. Output Layer: How many units does it have? Could we have a different number? Is the number of output units a hyperparameter? [Figure: the same four-layer network, with the output layer (Layer 4) highlighted.]

  43. In our example, the output layer must have three units, because we want to recognize three different classes (dog, cat, fox). We have no choice. [Figure: the same four-layer network, with the output layer (Layer 4) highlighted.]

  44. Network connectivity: in this neural network, at layers 2, 3, and 4, every unit receives as input the output of ALL units in the previous layer. This is a design choice; other choices are also possible. [Figure: the same fully connected four-layer network.]

  45. Next: Training The next set of slides will describe how to train such a network. Training a neural network is done using gradient descent. The specific method is called backpropagation, but it really is just a straightforward application of gradient descent to neural networks.
