Resistant Learning on the Envelope Bulk for Anomalous Patterns Detection

Resistant Learning on the Envelope Bulk for Identifying Anomalous Patterns

Fang Yu
Department of Management Information Systems
National Chengchi University
Taipei, Taiwan
yuf@nccu.edu.tw

Rua-Huan Tsaih
Department of Management Information Systems
National Chengchi University
Taipei, Taiwan
tsaih@mis.nccu.edu.tw

Shin-Ying Huang
Research Center for Information Technology Innovation
Academia Sinica
Taipei, Taiwan
smichelle19@citi.sinica.edu.tw

Yennun Huang
Research Center for Information Technology Innovation
Academia Sinica
Taipei, Taiwan
yennunhuang@citi.sinica.edu.tw

Introduction

• Outlier detection usually relies on incremental learning techniques that constantly adjust the boundaries identifying the majority of the observations, which evolve over time, and thereby recognize the anomalous ones.

• Deriving an algorithm for such detection of anomalous patterns is challenging.

• Resistant learning procedures are those whose numerical results are not impacted significantly by outlying observations. However, conventional outlier detection studies do not appear to generalize to resistant learning problems.

Introduction (con.)

• This study proposes a new resistant learning algorithm with an envelope module that learns to evolve a nonlinear fitting function wrapped with a constant-width envelope, which contains the majority of the observations and thereby identifies anomalous patterns.

• Tsaih and Cheng (2009) propose an outlier detection algorithm that can cope with the context of resistant learning; however, the algorithm is rather complicated and time-consuming when both the size of the reference pool and the input dimensionality are large.

R. H. Tsaih and T. C. Cheng, “A Resistant Learning Procedure for Coping with Outliers,” Annals of Mathematics and Artificial Intelligence, vol. 57, no. 2, pp. 161-180, 2009.

Related works

• Methods for model selection and variable selection in the presence of outliers have been discussed in Hoeting et al. (1996) and Atkinson and Riani (2002).

• Knorr and Ng (1997, 1998) and Knorr et al. (2000) focus on the development of algorithms for identifying distance-based outliers in large data sets.

• Chuang et al. (2002) propose a robust support vector regression method for the problem of function approximation with outliers.

• Sluban et al. (2014) aim at detecting noisy instances for improved data understanding, data cleaning, and outlier identification by presenting an ensemble-based noise ranking methodology for explicit noise and outlier identification.

The resistant learning

• In the context of model estimation, the response y is modeled as f(x, w) + δ, where w is the parameter vector and δ is the error term.

• The least squares estimator (LSE) is one of the most popular methods for performing the estimation.

• Resistant is equivalent to robust.

– Robust procedures are those whose results are not impacted significantly by violations of the model assumptions.

– Resistant procedures are those whose numerical results are not impacted significantly by outlying observations.

The resistant learning (con.)

The concept of envelope module

• This study changes the algorithm proposed by Tsaih and Cheng (2009), which uses a tiny ε value, into the envelope module with a non-tiny ε value that evolves a nonlinear fitting function f wrapped with an envelope whose width is 2ε.

• The setting of the ε value depends on the user’s perception of the data and its associated outliers.

• Given that the error terms follow the normal distribution, the user can set the ε value of the proposed envelope module to 1.96 and define the outliers as the points whose residuals are greater than ε*γ*σ.

• Parameter σ is the standard deviation of the residuals of the current reference observations.

• Parameter γ is a constant that is equal to or greater than 1.0. The larger the γ value is, the stricter the outlier detection.
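The residual threshold described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flag_outliers`, the sample residuals, and the choice of the population standard deviation for σ are hypothetical and are not taken from the slides.

```python
import statistics

def flag_outliers(residuals, reference_idx, epsilon=1.96, gamma=1.0):
    """Return indices whose |residual| exceeds epsilon * gamma * sigma."""
    # sigma: standard deviation of the residuals of the current reference
    # observations (population form is an assumption of this sketch).
    sigma = statistics.pstdev([residuals[i] for i in reference_idx])
    threshold = epsilon * gamma * sigma
    return [i for i, r in enumerate(residuals) if abs(r) > threshold]

# Hypothetical residuals and reference pool, for illustration only.
residuals = [0.1, -0.4, 0.3, 2.9, -0.2, 0.5, -3.4, 0.0]
reference = [0, 1, 2, 4, 5, 7]
print(flag_outliers(residuals, reference))
```

Raising `gamma` raises the threshold, so fewer points are flagged.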

The envelope module

The envelope module (con.)

(Step 2: If n > N, STOP.)

The envelope module (con.)

• The modeling procedure implemented by Step 6 to Step 7 requires proper values of w and p so that the obtained envelope contains at least n observations at the end of the nth stage.

• The augmenting mechanism should recruit extra hidden nodes to obtain an acceptable SLFN estimate. Tsaih and Cheng (2009) add two extra hidden nodes to the previous SLFN estimate.

• The proposed envelope module aims to evolve the fitting function and its envelope so that the envelope contains at least n observations at the nth stage. Therefore, the ε value adopted here is much larger than the value in Tsaih and Cheng (2009).

• The proposed module would result in a fitting function with an envelope that includes all of the observations. The outliers are expected to be included at later stages.

The envelope module (con.)

• Assume that the errors follow a normal distribution N(0, σ^2) and that the outliers are the points whose residuals are greater than ε in absolute value.

• If the diagnostic quantity is greater than ε*γ, then the next point is treated as a potential outlier. Here, γ is a constant that is equal to or greater than one, depending on how stringent the threshold is for the outliers.

• Regarding the order information for identifying the outliers, we propose two approaches, fixed and flexible.

– The fixed approach is to treat the last 5% of the observations as potential outliers. If n ≥ 0.95N AND the diagnostic quantity is greater than ε*γ, then the next point is recorded as the identified outlier.

– For the flexible approach, if n ≥ N-k AND the diagnostic quantity is greater than ε*γ, then the next point is recorded as the identified outlier.
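The way the order information (the stage n) combines with the deviance information (the diagnostic quantity) can be sketched as below. The function name and arguments are hypothetical, and the comparison "n ≥ ..." is an assumption of this sketch, since the relation symbol did not survive in the slide text.

```python
def is_identified_outlier(n, N, diagnostic, epsilon, gamma,
                          approach="fixed", k=None):
    """Decide whether the next point at stage n is recorded as an outlier."""
    if approach == "fixed":
        # Fixed approach: only the last 5% of the stages are candidates.
        late_enough = n >= 0.95 * N
    elif approach == "flexible":
        # Flexible approach: the user supplies k, the number of late stages.
        if k is None:
            raise ValueError("the flexible approach needs a value for k")
        late_enough = n >= N - k
    else:
        raise ValueError("approach must be 'fixed' or 'flexible'")
    # Deviance information: the diagnostic quantity must exceed epsilon * gamma.
    return late_enough and diagnostic > epsilon * gamma

print(is_identified_outlier(96, 100, 2.5, epsilon=1.96, gamma=1.0))
print(is_identified_outlier(50, 100, 5.0, epsilon=1.96, gamma=1.0))
```

A large deviation at an early stage is therefore not recorded as an outlier; both conditions must hold.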

An illustrative experiment

Supplementary information

Interval estimation

Confidence interval (CI)

[Figure: standard normal curve over z from -3.0 to 3.0. The successive areas are 2%, 14%, 34%, 34%, 14%, and 2%; the critical values ±1.96 and ±2.58 are marked.]

An illustrative experiment (con.)

• Table II shows the number of theoretical outliers in the 100 simulated data sets. There are 60 runs with five or fewer theoretical outliers, and there are 3.55 theoretical outliers on average.

An illustrative experiment (con.)

• Without loss of generality, the 10th data set is shown in Fig. 1. There are six theoretical outliers.

• Fig. 2 shows the graphs of {(x_c, y_c): c ∈ I(n-1)} and the corresponding next point obtained at Step 3 of the 71st, 72nd, 96th, 98th, and 100th stages, as well as the graph of the final fitting function and its envelope for the 10th simulation run.

Fig. 2. The graphs of {(x_c, y_c): c ∈ I(n-1)} and the corresponding next point obtained in Step 3 of the 71st, 72nd, 96th, 98th, and 100th stages, and the graph of the final fitting function and its envelope for the 10th data set.

The evaluation

• We use the nonlinear regression method associated with the function form of equation (5) as the benchmark for evaluating the outlier detection performance of our proposed algorithm.

• Table III lists the mean and standard deviation of the Type I and Type II errors of the outlier detection.

• The envelope module with the flexible approach contributes a 42.48% (= 95% - 52.52%) effect on the outlier detection, which is significantly large.

• No information: Type II error is 95%.

• Knowing the function form (5): Type II error is 19.51%.

Table III. Type I and II errors

Discussion

• This study proposes an envelope module that adopts both the deviance information and the order information to identify the outliers.

• At the nth stage, the envelope has evolved to contain the reference observations {(x_c, y_c): c ∈ I(n-1)} together with the next point, and the identified outlier is that next point when its deviation from the fitting function f is greater than ε*γ*σ, where σ is the standard deviation of the residuals of {(x_c, y_c): c ∈ I(n-1)}.

• In contrast with the algorithm proposed by Tsaih and Cheng (2009), the envelope module uses a non-tiny ε value instead of a tiny ε value, which results in a nonlinear fitting function f wrapped with an envelope whose width is 2ε. Also, the envelope should contain at least n observations at the nth stage, which tends to result in overfitting to the noisy data.

Discussion (con.)

• This study has fulfilled the following two objectives:

1. Revise the algorithm of Tsaih and Cheng (2009) to form an effective way of identifying outliers in the context of resistant learning.

2. Set up an illustrative experiment to justify the effectiveness of the envelope module in identifying outliers in the context of resistant learning.

• This study is the first to derive an effective module for outlier detection in the contexts of both resistant learning and changing environments.

Future Research

• Integrate the moving window strategy with the envelope module into an outlier detection algorithm that can work in the contexts of both resistant learning and changing environments.

• Set up a real-world experiment (regarding security applications such as the detection of abnormal network behaviors and zero-day attacks) to explore the effectiveness of the derived outlier detection algorithm.

• Examine, in a real-world experiment, whether the identified outliers are genuine anomalies.

Thank you for listening

Q & A


Outlier detection often relies on incremental learning to adjust boundaries for identifying majority patterns and recognizing anomalies. This study introduces a resistant learning algorithm with an envelope module that evolves a nonlinear fitting function wrapped in a constant-width envelope. The algorithm aims to efficiently identify anomalous patterns amidst large datasets, providing a robust approach to outlier detection.


Uploaded on Sep 18, 2024 | 0 Views


Presentation Transcript

Resistant Learning on the Envelope Bulk for Identifying Anomalous Patterns

Shin-Ying Huang, Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan, smichelle19@citi.sinica.edu.tw

Fang Yu, Department of Management Information Systems, National Chengchi University, Taipei, Taiwan, yuf@nccu.edu.tw

Rua-Huan Tsaih, Department of Management Information Systems, National Chengchi University, Taipei, Taiwan, tsaih@mis.nccu.edu.tw

Yennun Huang, Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan, yennunhuang@citi.sinica.edu.tw

Introduction

Outlier detection usually relies on incremental learning techniques that constantly adjust the boundaries identifying the majority of the observations, which evolve over time, and thereby recognize the anomalous ones. Deriving an algorithm for such detection of anomalous patterns is challenging. Resistant learning procedures are those whose numerical results are not impacted significantly by outlying observations. However, conventional outlier detection studies do not appear to generalize to resistant learning problems.

Introduction (con.)

This study proposes a new resistant learning algorithm with an envelope module that learns to evolve a nonlinear fitting function wrapped with a constant-width envelope, which contains the majority of the observations and thereby identifies anomalous patterns. Tsaih and Cheng (2009) propose an outlier detection algorithm that can cope with the context of resistant learning; however, the algorithm is rather complicated and time-consuming when both the size of the reference pool and the input dimensionality are large.

R. H. Tsaih and T. C. Cheng, “A Resistant Learning Procedure for Coping with Outliers,” Annals of Mathematics and Artificial Intelligence, vol. 57, no. 2, pp. 161-180, 2009.

Related works

Methods for model selection and variable selection in the presence of outliers have been discussed in Hoeting et al. (1996) and Atkinson and Riani (2002). Knorr and Ng (1997, 1998) and Knorr et al. (2000) focus on the development of algorithms for identifying distance-based outliers in large data sets. Chuang et al. (2002) propose a robust support vector regression method for the problem of function approximation with outliers. Sluban et al. (2014) aim at detecting noisy instances for improved data understanding, data cleaning, and outlier identification by presenting an ensemble-based noise ranking methodology for explicit noise and outlier identification.

The resistant learning

In the context of model estimation, the response y is modeled as f(x, w) + δ, where w is the parameter vector and δ is the error term. The least squares estimator (LSE) is one of the most popular methods for performing the estimation. Resistant is equivalent to robust. Robust procedures are those whose results are not impacted significantly by violations of the model assumptions. Resistant procedures are those whose numerical results are not impacted significantly by outlying observations.

The resistant learning (con.)

Tsaih and Cheng (2009) propose an algorithm with a tiny pre-specified ε value (say, 10^-6) that can deduce a proper nonlinear function form f and ŵ such that |y_c - f(x_c, ŵ)| ≤ ε for all c, where ŵ denotes an estimate of w.

Robustness analysis entails adopting the idea of a C-step (Rousseeuw and Van Driessen, 1999) for deriving an (initial) subset of m+1 reference observations to fit the linear regression model, ordering the residuals of all N observations at each stage, and then augmenting the reference subset gradually based upon the smallest trimmed sum of squared residuals principle.

The weight-tuning mechanism, the recruiting mechanism, and the reasoning mechanism allow the single-hidden-layer feed-forward neural network (SLFN) to adapt dynamically during the process.

The deletion diagnostic approach is employed, with the diagnostic quantity being the number of pruned hidden nodes when one observation is excluded from the reference pool.

The concept of envelope module

This study changes the algorithm proposed by Tsaih and Cheng (2009), which uses a tiny ε value, into the envelope module with a non-tiny ε value that evolves a nonlinear fitting function f wrapped with an envelope whose width is 2ε. The setting of the ε value depends on the user's perception of the data and its associated outliers. Given that the error terms follow the normal distribution, the user can set the ε value of the proposed envelope module to 1.96 and define the outliers as the points whose residuals are greater than ε*γ*σ. Parameter σ is the standard deviation of the residuals of the current reference observations. Parameter γ is a constant that is equal to or greater than 1.0. The larger the γ value is, the stricter the outlier detection.

The envelope module

Given the observation x, all of the corresponding values of the hidden nodes are first calculated with a_i = tanh(w_{i0}^H + \sum_{j=1}^{m} w_{ij}^H x_j) for all i, and the corresponding value f(x) is then calculated as f(x) = w_0^o + \sum_{i=1}^{p} w_i^o a_i.

Step 1: Arbitrarily obtain the initial m+1 reference observations. Let I(m+1) be the set of indices of these observations. Set up an acceptable SLFN estimate with one hidden node regarding the reference observations {(x_c, y_c): c ∈ I(m+1)}. Set n = m+2.

Step 2: If n > N, STOP.

Step 3: Present the n reference observations (x_c, y_c) with the smallest n squared residuals among the current N squared residuals. Let I(n) be the set of indices of these observations.

Step 4: If every reference observation c ∈ I(n) lies within the envelope, go to Step 7.

Step 5: Assume that one observation in I(n) lies outside the envelope while every other c in I(n) lies within it. Set w̃ = w (a saved copy of the current weights).

The envelope module (con.)

Step 6: Apply the gradient descent mechanism to adjust the weights w until one of the following two cases occurs:

6.1 If the deduced envelope (with the width 2ε) contains at least n observations, then go to Step 7.

6.2 If the deduced envelope does not contain at least n observations, then restore the saved weights (set w = w̃) and apply the augmenting mechanism to add extra hidden nodes to obtain an acceptable SLFN estimate.

Step 7: Implement the pruning mechanism to delete all of the potentially irrelevant hidden nodes; n + 1 → n; go to Step 2.

(Step 2: If n > N, STOP.)
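The stage-wise structure of Steps 1 to 7 can be sketched as follows. This is not the authors' implementation: the SLFN with its weight-tuning, augmenting, and pruning mechanisms is replaced here by an ordinary least-squares fit on a fixed basis [1, x, exp(-x)], the "next point" is simplified to the worst-fitting member of the new reference pool, and the names `design`, `envelope_stages`, and `n_init` are hypothetical.

```python
import numpy as np

def design(x):
    # Stand-in for the SLFN: a fixed basis fitted by least squares.
    return np.column_stack([np.ones_like(x), x, np.exp(-x)])

def envelope_stages(x, y, epsilon=1.96, gamma=1.0, n_init=10):
    N = len(x)
    B = design(x)
    # Step 1: initial reference observations (evenly spaced in this sketch).
    ref = np.linspace(0, N - 1, n_init).astype(int)
    w = np.linalg.lstsq(B[ref], y[ref], rcond=None)[0]
    flagged = []
    for n in range(n_init + 1, N + 1):       # Step 2: the loop ends once n > N
        res = y - B @ w
        sigma = res[ref].std()               # spread of the reference residuals
        new_ref = np.argsort(res ** 2)[:n]   # Step 3: n smallest squared residuals
        nxt = int(new_ref[-1])               # simplification: worst-fitting member
        late = n >= 0.95 * N                 # order information (fixed approach)
        if late and abs(res[nxt]) > epsilon * gamma * sigma and nxt not in flagged:
            flagged.append(nxt)
        # Stand-in for Steps 4 to 7: refit on the enlarged reference pool.
        w = np.linalg.lstsq(B[new_ref], y[new_ref], rcond=None)[0]
        ref = new_ref
    return flagged

rng = np.random.default_rng(0)
x = np.linspace(0.0, 20.0, 100)
y = 0.5 + 0.4 * x + 0.8 * np.exp(-x) + rng.normal(0.0, 1.0, 100)
y[50] += 15.0                                # planted anomaly for illustration
print(envelope_stages(x, y))
```

Because the reference pool grows by taking the best-fitting observations first, a grossly deviating point tends to enter only at the last stages, which is where the order information makes it eligible to be recorded.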

The envelope module (con.)

The modeling procedure implemented by Step 6 to Step 7 requires proper values of w and p so that the obtained envelope contains at least n observations at the end of the nth stage. The augmenting mechanism should recruit extra hidden nodes to obtain an acceptable SLFN estimate. Tsaih and Cheng (2009) add two extra hidden nodes to the previous SLFN estimate. The proposed envelope module aims to evolve the fitting function and its envelope so that the envelope contains at least n observations at the nth stage. Therefore, the ε value adopted here is much larger than the value in Tsaih and Cheng (2009). The proposed module would result in a fitting function with an envelope that includes all of the observations. The outliers are expected to be included at later stages.
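For orientation, the output of an SLFN with p hidden nodes can be sketched as below. The tanh activation and the argument names are assumptions of this sketch, not details confirmed by the slide text.

```python
import numpy as np

def slfn_forward(x, W_hidden, b_hidden, w_out, b_out):
    """Output of a single-hidden-layer feed-forward network.

    x: (m,) input; W_hidden: (p, m); b_hidden: (p,); w_out: (p,); b_out: scalar.
    """
    a = np.tanh(b_hidden + W_hidden @ x)   # hidden-node values a_i
    return b_out + w_out @ a               # fitted value f(x)

# One hidden node, two inputs; all numbers are hypothetical.
x = np.array([1.0, 2.0])
print(slfn_forward(x, np.array([[0.5, -0.25]]), np.array([0.0]),
                   np.array([2.0]), 0.3))
```

Adding a hidden node amounts to appending a row to `W_hidden` and an entry to `b_hidden` and `w_out`, which is one way to picture what the augmenting mechanism changes.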

The envelope module (con.)

Assume that the errors follow a normal distribution N(0, σ^2) and that the outliers are the points whose residuals are greater than ε in absolute value. If the diagnostic quantity is greater than ε*γ, then the next point is treated as a potential outlier. Here, γ is a constant that is equal to or greater than one, depending on how stringent the threshold is for the outliers. Regarding the order information for identifying the outliers, we propose two approaches, fixed and flexible. The fixed approach is to treat the last 5% of the observations as potential outliers: if n ≥ 0.95N AND the diagnostic quantity is greater than ε*γ, then the next point is recorded as the identified outlier. For the flexible approach, if n ≥ N-k AND the diagnostic quantity is greater than ε*γ, then the next point is recorded as the identified outlier.

An illustrative experiment

We apply the proposed envelope module to 100 simulation runs to evaluate its effectiveness in detecting the outliers. We use the nonlinear model stated in (5) to generate a set of 100 observations for which the explanatory variable X is equally spaced from 0.0 to 20.0 and the error is normally distributed, with a mean of 0 and a variance of 1.

Y = 0.5 + 0.4*X + 0.8*Exp(-X) + Error. (5)

Here, we set the value of the proposed envelope module to 3, which is smaller than but close to 1.96, the threshold for the theoretical outliers. The value of the proposed envelope module is set such that 3 * is equal to 2.5. The idea behind taking a larger rejection bound of 2.5 is similar to that in the Repeated Significance Tests (1971) and Group Sequential Tests (1977).
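One simulated data set of the kind described above can be generated as follows. The function name `simulate` and the seed are hypothetical; the theoretical outliers are taken to be the observations whose error exceeds 1.96 in absolute value, following the threshold mentioned in the text.

```python
import numpy as np

def simulate(n_obs=100, seed=0):
    """Generate one data set from model (5) and mark its theoretical outliers."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 20.0, n_obs)             # X equally spaced on [0, 20]
    error = rng.normal(0.0, 1.0, size=n_obs)      # mean 0, variance 1
    y = 0.5 + 0.4 * x + 0.8 * np.exp(-x) + error  # model (5)
    theoretical = np.flatnonzero(np.abs(error) > 1.96)
    return x, y, error, theoretical

x, y, error, theoretical = simulate()
print(len(theoretical), "theoretical outliers in this run")
```

Repeating the call with 100 different seeds would give the 100 simulation runs; the count of theoretical outliers varies from run to run.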

Supplementary information

Interval estimation: confidence interval (CI)

[Figure: standard normal curve over z from -3.0 to 3.0. The successive areas are 2%, 14%, 34%, 34%, 14%, and 2%; the critical values ±1.96 and ±2.58 are marked.]
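The critical values on this slide can be checked numerically. The sketch below computes the central coverage of a standard normal variable with the error function; the helper name `central_coverage` is hypothetical.

```python
import math

def central_coverage(z):
    """P(-z < Z < z) for a standard normal variable Z."""
    return math.erf(z / math.sqrt(2.0))

for z in (1.0, 1.96, 2.58):
    print(z, round(central_coverage(z), 4))
```

The values for 1.96 and 2.58 correspond to the usual 95% and 99% confidence levels, so about 5% of normally distributed errors fall outside ±1.96.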

An illustrative experiment (con.)

Table II shows the number of theoretical outliers in the 100 simulated data sets. There are 60 runs with five or fewer theoretical outliers, and there are 3.55 theoretical outliers on average.

Table II. Number of theoretical outliers in the simulated data sets

Number of theoretical outliers   1   2   3   4   5   6   7   8   9   11
Number of simulation runs        2   11  16  14  17  21  7   5   4   3

An illustrative experiment (con.)

Without loss of generality, the 10th data set is shown in Fig. 1. There are six theoretical outliers.

[Fig. 1: scatter plot of the 10th data set, Y (from -5 to 11) against X (from 0 to 20).]

Fig. 2 shows the graphs of {(x_c, y_c): c ∈ I(n-1)} and the corresponding next point obtained at Step 3 of the 71st, 72nd, 96th, 98th, and 100th stages, as well as the graph of the final fitting function and its envelope for the 10th simulation run.

Step 3: Present the n reference observations (x_c, y_c) with the smallest n squared residuals among the current N squared residuals. Let I(n) be the set of indices of these observations.

Fig. 2. The graphs of {(x_c, y_c): c ∈ I(n-1)} and the corresponding next point obtained in Step 3 of the 71st, 72nd, 96th, 98th, and 100th stages, and the graph of the final fitting function and its envelope for the 10th data set.

[Fig. 2 panels: the 71st stage, the 72nd stage, the 96th stage, the 98th stage, the 100th stage, and the final result; each panel plots Y against X from 0 to 20.]

The evaluation

We use the nonlinear regression method associated with the function form of equation (5) as the benchmark for evaluating the outlier detection performance of our proposed algorithm.

Y = β0 + β1*X + β2*Exp(-X) + Error. (5)

Table III lists the mean and standard deviation of the Type I and Type II errors of the outlier detection.

Table III. Type I and II errors

                 Envelope module (flexible)   Envelope module (fixed)   Benchmark
Type I error     1.19%                        1.22%                     0.47%
Type II error    52.52%                       54.53%                    19.51%

The envelope module with the flexible approach contributes a 42.48% (= 95% - 52.52%) effect on the outlier detection, which is significantly large. No information: Type II error is 95%. Knowing the function form (5): Type II error is 19.51%.
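The error rates in Table III can be computed for a single run along the following lines. The definitions used here are assumptions of this sketch, since the slides do not spell them out: a Type I error is a non-outlier that is flagged, and a Type II error is a theoretical outlier that is missed. The function name and the sample indices are hypothetical.

```python
def error_rates(true_outliers, identified, n_obs):
    """Return (Type I rate, Type II rate) for one simulation run."""
    true_set, found = set(true_outliers), set(identified)
    n_normal = n_obs - len(true_set)
    # Type I: normal observations wrongly identified as outliers.
    type1 = len(found - true_set) / n_normal if n_normal else 0.0
    # Type II: theoretical outliers that were not identified.
    type2 = len(true_set - found) / len(true_set) if true_set else 0.0
    return type1, type2

t1, t2 = error_rates(true_outliers=[3, 6, 9, 12], identified=[3, 6, 20], n_obs=100)
print(f"Type I = {t1:.4f}, Type II = {t2:.2f}")

# The 42.48% figure quoted above is the reduction in Type II error
# relative to the 95% no-information baseline.
print(round(95.0 - 52.52, 2))
```

Averaging these two rates over the 100 runs would give the kind of summary reported in Table III.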

Discussion

This study proposes an envelope module that adopts both the deviance information and the order information to identify the outliers. At the nth stage, the envelope has evolved to contain the reference observations {(x_c, y_c): c ∈ I(n-1)} together with the next point, and the identified outlier is that next point when its deviation from the fitting function f is greater than ε*γ*σ, where σ is the standard deviation of the residuals of {(x_c, y_c): c ∈ I(n-1)}.

In contrast with the algorithm proposed by Tsaih and Cheng (2009), the envelope module uses a non-tiny ε value instead of a tiny ε value, which results in a nonlinear fitting function f wrapped with an envelope whose width is 2ε. Also, the envelope should contain at least n observations at the nth stage, which tends to result in overfitting to the noisy data.

Discussion (con.)

This study has fulfilled the following two objectives:

1. Revise the algorithm of Tsaih and Cheng (2009) to form an effective way of identifying outliers in the context of resistant learning.

2. Set up an illustrative experiment to justify the effectiveness of the envelope module in identifying outliers in the context of resistant learning.

This study is the first to derive an effective module for outlier detection in the contexts of both resistant learning and changing environments.

Future Research

Integrate the moving window strategy with the envelope module into an outlier detection algorithm that can work in the contexts of both resistant learning and changing environments. Set up a real-world experiment (regarding security applications such as the detection of abnormal network behaviors and zero-day attacks) to explore the effectiveness of the derived outlier detection algorithm. Examine, in a real-world experiment, whether the identified outliers are genuine anomalies.

Thank you for listening

Q & A
