Conditional and Reference Class Linear Regression

Brendan Juba
Washington University in St. Louis

Based on the ITCS’17 paper and joint works with:
Hainline, Le, and Woodruff (AISTATS’19); Calderon, Li, Li, and Ruan (arXiv:1806.02326).
 
Outline

1. Introduction and motivation
2. Overview of sparse regression via list-learning
3. General regression via list-learning
4. Recap and challenges for future work
 
How can we determine which data is useful and relevant for making data-driven inferences?

Conditional Linear Regression

Given: data drawn from a joint distribution over x ∈ {0,1}^n, y ∈ R^d, z ∈ R
Find:
- a k-DNF c(x) (recall: a k-DNF is an OR of terms of size k, where terms are ANDs of Boolean literals)
- parameters w ∈ R^d
such that on the condition c(x), the linear rule <w,y> predicts z.

[Figure: a data table with Boolean attributes x, real attributes y, and target z; the rows selected by c(x) = 1 are fit by the linear rule w.]
 
Reference Class Regression

Given: data drawn from a joint distribution over x ∈ {0,1}^n, y ∈ R^d, z ∈ R, and a point of interest x*
Find:
- a k-DNF c(x)
- parameters w ∈ R^d
such that on the condition c(x), the linear rule <w,y> predicts z, and c(x*) = 1.

[Figure: the same data table, with the reference point x* required to satisfy the learned condition c.]

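To make the objective concrete, here is a minimal Python sketch (illustrative names only, not code from the papers) that scores a candidate pair (c, w) by its conditional squared loss E[(<w,y>-z)^2 | c(x)] and its coverage Pr[c(x)] on a sample. A condition c is represented as a list of terms, each term a tuple of (attribute index, negated) literals; for the reference-class variant one would additionally check that x* satisfies c.

```python
import numpy as np

def evaluate_pair(c, w, X, Y, z):
    """Empirical conditional squared loss and coverage of the pair (c, w).
    X: (m, n) Boolean attributes, Y: (m, d) real attributes, z: (m,) targets."""
    m = X.shape[0]
    selected = np.zeros(m, dtype=bool)
    for term in c:                        # a k-DNF is an OR over terms
        sat = np.ones(m, dtype=bool)
        for idx, negated in term:         # each term is an AND of literals
            sat &= (X[:, idx] == 0) if negated else (X[:, idx] == 1)
        selected |= sat
    coverage = selected.mean()
    if coverage == 0:
        return np.inf, 0.0
    residuals = Y[selected] @ w - z[selected]
    return float(np.mean(residuals ** 2)), float(coverage)
```
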
Motivation

- Rosenfeld et al. 2015: some sub-populations have strong risk factors for cancer that are insignificant in the full population
- “Intersectionality” in social sciences: subpopulations may behave differently
- Good experiments isolate a set of conditions in which a desired effect has a simple model

It’s useful to find these “segments” or “conditions” in which simple models fit. And we don’t expect to be able to model all cases simply…
 
Results: algorithms for conditional linear regression

We can solve this problem for k-DNF conditions on n Boolean attributes, regression on d real attributes:

When w is sparse (‖w‖_0 ≤ s for constant s), for all l_p norms:
- loss ε → Õ(ε n^{k/2}) for general k-DNF
- loss ε → Õ(ε T log log n) for T-term k-DNF

For general coefficients w with σ-subgaussian residuals and max_{terms t of c} ‖Cov(y|t) - Cov(y|c)‖_op sufficiently small:
- l_2 loss ε → Õ(ε T (log log n + log σ)) for T-term k-DNF

Technique: “list regression” (BBV’08, J’17, CSV’17)

Why only k-DNF?

Theorem (informal): algorithms to find w and c satisfying E[(<w,y>-z)² | c(x)] ≤ α(n)·ε whenever a conjunction c* exists would enable PAC-learning of DNF (in the usual, distribution-free model).

Sketch: encode a DNF’s labels for x as follows.
- Label 1: z ≡ 0 (easy to predict with w = 0)
- Label 0: z has high variance (high prediction error)
Terms of the DNF are conjunctions c* with easy z, so a condition c selecting x with easy-to-predict z gives a weak learner for DNF.

Why only approximate?
 
Same construction: algorithms for conditional linear regression solve the agnostic learning task for c(x) on Boolean x.

State-of-the-art for k-DNFs suggests we must accept a poly(n) blow-up of the loss, α(n) (n = # Boolean attributes):
- ZMJ’17: achieve Õ(n^{k/2}) blow-up for k-DNF for the corresponding agnostic learning task
- JLM’18: achieve Õ(T log log n) blow-up for T-term k-DNF

Formal evidence of hardness? Open question!
 
Outline

1. Introduction and motivation
2. Overview of sparse regression via list-learning
3. General regression via list-learning
4. Recap and challenges for future work

Main technique – “list regression”

Given examples, we can find a polynomial-size list of candidate linear predictors including an approximately optimal linear rule w’ on the unknown subset

  S = {(x^(j), y^(j), z^(j)) : c*(x^(j)) = 1, j = 1,…,m}

for the unknown condition c*(x).

We then learn a condition c for each w and take a pair (w’, c’) for which c’ satisfies a μ-fraction of the data.

Sparse max-norm “list regression” (J.’17)

Fix a set of s coordinates i_1,…,i_s. For the unknown subset S, the optimal linear rule using i_1,…,i_s is the solution to the following linear program:

  minimize ε subject to
  -ε ≤ w_{i_1} x_{i_1}^(j) + … + w_{i_s} x_{i_s}^(j) - z^(j) ≤ ε  for j ∈ S.

s+1 dimensions – the optimum is attained at a basic feasible solution given by s+1 tight constraints.

Sparse max-norm “list regression” (J.’17)

The optimal linear rule using i_1,…,i_s is given by the solution to the system

  w_{i_1} x_{i_1}^(j_r) + … + w_{i_s} x_{i_s}^(j_r) - z^(j_r) = σ_r ε  for r = 1,…,s+1

for some j_1,…,j_{s+1} ∈ S and σ_1,…,σ_{s+1} ∈ {-1,+1}.

Enumerate (i_1,…,i_s, j_1,…,j_{s+1}, σ_1,…,σ_{s+1}) in [d]^s × [m]^{s+1} × {-1,+1}^{s+1} and solve for each w.
- This includes all (j_1,…,j_{s+1}) ∈ S^{s+1} (regardless of S!)
- The list has size d^s m^{s+1} 2^{s+1} = poly(d,m) for constant s.

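A minimal Python sketch of this enumeration, assuming the m × d matrix of real-valued attributes and the m targets are available as numpy arrays X and z; the function and variable names are illustrative, not taken from the original implementation. For each choice of support, example indices, and signs it solves the (s+1) × (s+1) linear system above for (w_{i_1},…,w_{i_s}, ε).

```python
import itertools
import numpy as np

def candidate_sparse_rules(X, z, s):
    """Yield (w, eps) candidates from every choice of s coordinates, s+1 example
    indices, and s+1 signs (order within each group does not matter, so
    combinations stand in for the ordered tuples on the slide)."""
    m, d = X.shape
    for support in itertools.combinations(range(d), s):
        for rows in itertools.combinations(range(m), s + 1):
            for signs in itertools.product((-1.0, 1.0), repeat=s + 1):
                # Unknowns: (w_{i_1}, ..., w_{i_s}, eps).
                # Row r encodes: sum_k w_{i_k} x_{i_k}^(j_r) - sigma_r * eps = z^(j_r).
                A = np.zeros((s + 1, s + 1))
                A[:, :s] = X[np.ix_(rows, support)]
                A[:, s] = -np.asarray(signs)
                try:
                    sol = np.linalg.solve(A, z[list(rows)])
                except np.linalg.LinAlgError:
                    continue  # degenerate choice of rows; skip this tuple
                w = np.zeros(d)
                w[list(support)] = sol[:s]
                yield w, abs(sol[s])
```
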
Summary – algorithm for max-norm conditional linear regression (J.’17)

For each (i_1,…,i_s, j_1,…,j_{s+1}, σ_1,…,σ_{s+1}) in [d]^s × [m]^{s+1} × {-1,+1}^{s+1}:
1. Solve w_{i_1} x_{i_1}^(j_r) + … + w_{i_s} x_{i_s}^(j_r) - z^(j_r) = σ_r ε for r = 1,…,s+1.
2. If ε > ε* (given), continue to the next iteration.
3. Initialize c to the k-DNF over all terms of size k.
4. For j = 1,…,m: if |<w,y^(j)> - z^(j)| > ε, then for each term T in c with T(x^(j)) = 1, remove T from c.
   [Steps 3–4 learn the condition c using “labels” provided by w.]
5. If #{j : c(x^(j)) = 1} > μ’m (μ’ initially 0), put μ’ = #{j : c(x^(j)) = 1}/m, w’ = w, c’ = c.
   [Step 5 chooses the condition that includes the most data.]

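A hedged Python sketch of steps 3–5 for a single candidate (w, ε), assuming Boolean attributes X (m × n), real attributes Y (m × d), targets z, and term size k; the names and the representation of a term as a tuple of (index, negated) literals are illustrative, matching the evaluation sketch earlier. The outer loop of the algorithm then keeps the (w’, c’) with the largest coverage among candidates with ε ≤ ε*.

```python
import itertools
import numpy as np

def learn_condition(X, Y, z, w, eps, k):
    """Steps 3-5 for one candidate w: start from the k-DNF containing every term
    of size k, drop any term satisfied by a badly-fit point, and return the
    surviving condition together with its empirical coverage."""
    m, n = X.shape
    badly_fit = np.abs(Y @ w - z) > eps
    condition, covered = [], np.zeros(m, dtype=bool)
    for attrs in itertools.combinations(range(n), k):
        for negs in itertools.product((False, True), repeat=k):
            sat = np.ones(m, dtype=bool)
            for idx, negated in zip(attrs, negs):
                sat &= (X[:, idx] == 0) if negated else (X[:, idx] == 1)
            if not np.any(sat & badly_fit):   # keep only terms that never select
                condition.append(tuple(zip(attrs, negs)))  # a badly-fit point
                covered |= sat
    return condition, covered.mean()
```
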
Extension to l_p-norm list-regression
Joint work with Hainline, Le, Woodruff (AISTATS’19)

Consider the matrix S = [y^(j), z^(j)] over the j = 1,…,m with c*(x^(j)) = 1: the l_p-norm of S(w,-1) approximates the l_p-loss of w on c*.
Since w* ∈ R^d is s-sparse, there is a small sketch matrix Ŝ such that ‖Ŝv‖_p ≈ ‖Sv‖_p for all vectors v on these s coordinates (Cohen-Peng’15); moreover, the rows of Ŝ can be taken to be O(1)-rescaled rows of S.
New algorithm: search over approximate weights and minimize the l_p-loss to find candidates for w.

l_p-norm conditional linear regression
Joint work with Hainline, Le, Woodruff (AISTATS’19)

Using the polynomial-size list containing an approximation to w*, we still need to extract a condition c such that E[|<w,y>-z|^p | c(x)] ≤ α(n)·ε. Use |<w,y^(i)>-z^(i)|^p as the weight/label for the i-th point.
Easy Õ(T log log n) approximation for T-term k-DNF:
- only consider terms t with E[|<w,y>-z|^p | t(x)]·Pr[t(x)] ≤ ε·Pr[c*(x)]
- the greedy algorithm for partial set cover (terms = sets of points), covering a (1-γ)Pr[c*(x)]-fraction, obtains a cover of size T log m, i.e. a small k-DNF c
- Haussler’88: to estimate T-term k-DNFs we only require m = O(Tk log n) points, giving Õ(εT log log n) loss on c
Reference class: add any surviving term satisfied by x*.

l_p-norm conditional linear regression
Joint work with Hainline, Le, Woodruff (AISTATS’19)

ZMJ’17/Peleg’07: a more sophisticated algorithm achieves an Õ(n^{k/2}) approximation for general k-DNF (plug in directly to obtain conditional linear regression).
J., Li’19: the same guarantee can be obtained for the reference class.
 
l_2-norm conditional linear regression vs. selective linear regression

[Plots on LIBSVM benchmarks. Red: conditional regression; Black: selective regression.
Datasets: Bodyfat (m=252, d=14), Boston (m=506, d=13), Cpusmall (m=8192, d=12), Space_GA (m=3107, d=6).]
 
Outline

1. Introduction and motivation
2. Overview of sparse regression via list-learning
3. General regression via list-learning
   Joint work with Calderon, Li, Li, and Ruan (arXiv).
4. Recap and challenges for future work

CSV’17-style List Learning

Basic algorithm, alternating between:
- a soft relaxation of fitting the unknown S
- outlier detection and reduction

Improve accuracy by clustering the output w and “recentering”.
We reformulate CSV’17 in terms of terms (rather than individual points).

Relaxation of fitting unknown c

For fixed weights (“inlier” indicators) u^(1),…,u^(T) ∈ [0,1], each term t has its own parameters w^(t).

Solve:  min_{w,Y} Σ_t u^(t) |t| l_t(w^(t)) + λ tr(Y)
(Y: enclosing ellipsoid; λ: carefully chosen constant)

[Figure: candidate parameters w^(1), w^(2), w^(3) enclosed by the ellipsoid Y, with outlier parameters w^(outlier) falling outside.]

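For squared loss this relaxation is a convex program. Below is a hedged cvxpy sketch of one such step for fixed inlier weights u; it is an illustrative formulation under assumed inputs (term_data, u, lam), not the authors’ implementation. The enclosing-ellipsoid requirement is written as matrix_frac(w_t, Y) ≤ 1, i.e. w_tᵀ Y⁻¹ w_t ≤ 1 with the ellipsoid centered at the origin; in the full algorithm the ellipsoid is recentered on each cluster center.

```python
import cvxpy as cp

def soft_relaxation_step(term_data, u, lam, d):
    """term_data: list of (Y_t, z_t) per term t, the real attributes and targets
    of the points satisfying t; u: fixed inlier weights in [0,1]."""
    W = [cp.Variable(d) for _ in term_data]
    Y_ell = cp.Variable((d, d), PSD=True)           # enclosing ellipsoid Y
    objective = lam * cp.trace(Y_ell)
    constraints = []
    for (Y_t, z_t), u_t, w_t in zip(term_data, u, W):
        # u^(t) |t| l_t(w^(t)) when l_t is the mean squared loss on term t
        objective += u_t * cp.sum_squares(Y_t @ w_t - z_t)
        # w^(t) must lie inside the ellipsoid: w^T Y^{-1} w <= 1
        constraints.append(cp.matrix_frac(w_t, Y_ell) <= 1)
    problem = cp.Problem(cp.Minimize(objective), constraints)
    problem.solve()
    return [w.value for w in W], Y_ell.value
```
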
Outlier detection and reduction

Fix parameters w^(1),…,w^(T) ∈ R^d, and give each term t its own coalition indicators ĉ_{t’}^(t) ∈ [0,1]. Each term t must find a coalition c^(t) covering a μ-fraction of the data (weighted by term sizes |t’|), and is compared against the coalition’s weighted average ŵ^(t) = (1/|c^(t)|) Σ_{t’} ĉ_{t’}^(t) |t’| w^(t’).

Reduce the inlier indicator u^(t) by a factor of 1 - |l_t(w^(t)) - l_t(ŵ^(t))| / max_{t’} |l_{t’}(w^(t’)) - l_{t’}(ŵ^(t’))|.

Intuition: points fit by parameters in a small ellipsoid have a good coalition for which the objective value changes little (for smooth loss l). Outliers cannot find a good coalition and are damped/removed.

[Figure: inlier parameters w^(1), w^(2), w^(3) close to their coalition averages ŵ^(1),…; the outlier w^(outlier) sits far from its coalition average ŵ^(outlier).]

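A minimal numpy sketch of the down-weighting rule, assuming the per-term losses l_t(w^(t)) and l_t(ŵ^(t)) have already been computed into arrays loss_own and loss_coalition; the names are illustrative.

```python
import numpy as np

def reduce_inlier_weights(u, loss_own, loss_coalition):
    """Multiplicatively damp each term's inlier indicator according to how much
    its own parameters disagree with its coalition's averaged parameters."""
    gap = np.abs(loss_own - loss_coalition)
    worst = gap.max()
    if worst == 0:
        return u                       # every term agrees with its coalition
    return u * (1.0 - gap / worst)     # the worst offender is driven to weight 0
```
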
Clustering and recentering

Full algorithm (initially, one data cluster), alternating between:
- the basic algorithm on cluster centers
- clustering the outputs

Next iteration: run the basic algorithm with parameters of radius R/2 centered on each cluster.

[Figure: the output parameters w^(1), w^(2), w^(3), …, w^(j), w^(k), w^(l) grouped into clusters.]

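As a rough, illustrative stand-in for the clustering step (the analysis below uses padded decompositions, FRT’04, rather than this heuristic), here is a greedy R-ball clustering of the output parameters; the next round would rerun the basic algorithm with radius R/2 around each returned center. All names are assumptions.

```python
import numpy as np

def greedy_ball_clusters(W, R):
    """W: (T, d) array of per-term parameters w^(t). Repeatedly take the first
    unassigned parameter vector, group everything within distance R of it into
    one cluster, and return the cluster means as the new centers."""
    T = W.shape[0]
    unassigned = np.ones(T, dtype=bool)
    centers = []
    while unassigned.any():
        i = int(np.argmax(unassigned))                    # first unassigned index
        close = np.linalg.norm(W - W[i], axis=1) <= R
        centers.append(W[close & unassigned].mean(axis=0))
        unassigned &= ~close
    return centers
```
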
Overview of analysis (1/3)
Loss bounds from basic algorithm

This follows the same essential outline as CSV’17. Guarantee for the basic algorithm: we find ŵ such that, given ‖w*‖ ≤ R,
  E[l(ŵ)|c*] - E[l(w*)|c*] ≤ O(R · max_{w, t∈c*} ‖∇E[l_t(w)|t] - ∇E[l(w)|c*]‖ / √μ),
where ‖∇E[l_t(w)|t] - ∇E[l(w)|c*]‖ is bounded by ‖(w-w*)ᵀ(Cov(y|t) - Cov(y|c*))‖.
So the bound is O(R² max_{t∈c*} ‖Cov(y|t) - Cov(y|c*)‖ / √μ) for all R ≥ 1/poly(γμm/σT) (errors σ-subgaussian on c*).

Overview of analysis (2/3)
From loss to accuracy via convexity

We can find ŵ such that, given ‖w*‖ ≤ R,
  E[l(ŵ)|c*] - E[l(w*)|c*] ≤ O(R² max_{t∈c*} ‖Cov(y|t) - Cov(y|c*)‖ / √μ).
This implies that for all significant t in c*,
  ‖w^(t) - w*‖² ≤ O(R² T max_{t∈c*} ‖Cov(y|t) - Cov(y|c*)‖ / κ√μ),
where κ is the convexity parameter of l(w). We then iteratively reduce R with clustering and recentering.

Overview of analysis (3/3)
Reduce R by clustering & recentering

In each iteration, for all significant t in c*,
  ‖w^(t) - w*‖² ≤ O(R² T max_{t∈c*} ‖Cov(y|t) - Cov(y|c*)‖ / κ√μ),
where κ is the convexity parameter of l(w).

Padded decompositions (FRT’04): we can find a list of clusterings such that one cluster in each contains w^(t) for all significant t in c* with high probability.

Moreover, if κ ≥ Ω(T log(1/μ) max_{t∈c*} ‖Cov(y|t) - Cov(y|c*)‖ / μ), then we obtain a new cluster center ŵ in our list such that ‖ŵ - w*‖ ≤ R/2 with high probability.

So: we can iterate, reducing R down to 1/poly(γμm/σT); as m grows large, ‖ŵ^(t) - w*‖ → 0.

Finishing up: obtaining a k-DNF from ŵ

Cast as a weighted partial set cover instance (terms = sets of points), using Σ_{i: t(x^(i))=1} l_i(w) as the weight of term t and the ratio objective (cost/size).
- ZMJ’17: with the ratio objective, we still obtain an O(log μm) approximation.
- Recall: we chose ŵ to optimize Σ_{t∈c} Σ_{i: t(x^(i))=1} l_i(w) (= the cost of the cover c in set cover) – this adds only a T-factor.
- Recall: we only consider terms satisfied with probability ≥ γμ/T – so we use at most T/γ terms.
- Haussler’88, again: we only need O(Tk/γ · log n) points.

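A hedged Python sketch of the greedy weighted partial-cover step, assuming term_sets maps each surviving term to the set of point indices satisfying it, point_cost[i] holds l_i(w), and target_covered is the number of points to cover; the names and the exact greedy rule (pick the term minimizing the cost of its newly covered points divided by their number, per the ratio objective) are illustrative.

```python
def greedy_partial_cover(term_sets, point_cost, target_covered):
    """Greedily choose terms by the ratio objective until enough points are covered."""
    chosen, covered = [], set()
    remaining = dict(term_sets)
    while len(covered) < target_covered and remaining:
        def ratio(item):
            _, points = item
            new = points - covered
            if not new:
                return float("inf")
            return sum(point_cost[i] for i in new) / len(new)
        best_term, best_points = min(remaining.items(), key=ratio)
        if not (best_points - covered):
            break                          # no remaining term covers anything new
        chosen.append(best_term)
        covered |= best_points
        del remaining[best_term]
    return chosen, covered
```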
 
Summary: guarantee for general conditional linear regression

Theorem. Suppose D is a distribution over x ∈ {0,1}^n, y ∈ R^d, and z ∈ R such that for some T-term k-DNF c* and w* ∈ R^d, <w*,y>-z is σ-subgaussian on D|c* with E[(<w*,y>-z)² | c*(x)] ≤ ε and Pr[c*(x)] ≥ μ, and E[(<w,y>-z)² | c*(x)] is κ-strongly convex in w with κ ≥ Ω(T log(1/μ) max_{t∈c*} ‖Cov(y|t) - Cov(y|c*)‖ / μ). Then there is a polynomial-time algorithm that uses examples from D to find w and c such that, with probability 1-δ, E[(<w,y>-z)² | c(x)] ≤ Õ(εT (log log n + log σ)) and Pr[c(x)] ≥ (1-γ)μ.

Comparison to sparse conditional linear regression on benchmarks

https://github.com/wumming/lud

[Plots: Space_GA (m=3107, d=6); Cpusmall (m=8192, d=12).]
Small benchmarks (Bodyfat, m=252, d=14; Boston, m=506, d=13): does not converge.
 
Outline

1. Introduction and motivation
2. Overview of sparse regression via list-learning
3. General regression via list-learning
4. Recap and challenges for future work
 
Summary: new algorithms for conditional linear regression

We can solve conditional linear regression for k-DNF conditions on n Boolean attributes, regression on d real attributes:

When w is sparse (‖w‖_0 ≤ s for constant s), for all l_p norms:
- loss ε → Õ(ε n^{k/2}) for general k-DNF
- loss ε → Õ(ε T log log n) for T-term k-DNF

For general coefficients w with σ-subgaussian residuals and max_{terms t of c} ‖Cov(y|t) - Cov(y|c)‖_op sufficiently small:
- l_2 loss ε → Õ(ε T (log log n + log σ)) for T-term k-DNF

Technique: “list regression” (BBV’08, J’17, CSV’17)
 
Open problems

1. Remove the covariance requirement!
2. Improve large-formula error bounds to O(n^{k/2} ε) for general (dense) regression.
3. Algorithms without semidefinite programming? Without padded decompositions?
   Note: the algorithms for sparse regression already satisfy 1–3.
4. Formal evidence for hardness of polynomial-factor approximations for agnostic learning?
5. Conditional supervised learning for other hypothesis classes?
 
References
 
Hainline, Juba, Le, Woodruff. Conditional sparse l_p-norm regression with optimal probability. In AISTATS, PMLR 89:1042-1050, 2019.
Calderon, Juba, Li, Li, Ruan. Conditional Linear Regression. arXiv:1806.02326 [cs.LG],
2018.
Juba. Conditional sparse linear regression. In ITCS, 2017.
Rosenfeld, Graham, Hamoudi, Butawan, Eneh, Kahn, Miah, Niranjan, Lovat. MIAT: A
novel attribute selection approach to better predict upper gastrointestinal cancer. In
DSAA, pp.1—7, 2015.
Balcan, Blum, Vempala. A discriminative framework for clustering via similarity
functions. In STOC pp.671—680, 2008.
Charikar, Steinhardt, Valiant. Learning from untrusted data. In STOC, pp.47—60, 2017.
Fakcharoenphol, Rao, Talwar. A tight bound on approximating arbitrary metrics by tree
metrics. JCSS 69(3):485-497, 2004.
Zhang, Mathew, Juba. An improved algorithm for learning to perform exception-
tolerant abduction. In AAAI, pp.1257—1265, 2017.
Juba, Li, Miller. Learning abduction under partial observability. In AAAI, pp.1888—1896,
2018.
Cohen, Peng. l_p row sampling by Lewis weights. In STOC, 2015.
Haussler. Quantifying inductive bias: AI learning algorithms and Valiant’s learning
framework. Artificial Intelligence, 36:177—221, 1988.
Peleg. Approximation algorithms for the label cover_max and red-blue set cover problems. J. Discrete Algorithms, 5:55—64, 2007.