Conge 精进

Machine Learning笔记 第02周

2016-01-27
本文 4003 字,阅读全文约需 12 分钟

本学期学习Machine Learning。本课程在Udacity上有免费版本,并提供详细的笔记。本文的笔记是超级脱水版,目的是自用。

Week 02 tasks: * Lectures: Instance Based Learning, Ensemble B&B, and Kernel Methods & SVMS.

  • Readings: Mitchel Chapter 8, Schapire’s overview to Machine learning and Jiri Matas’ slides on AdaBoost.

SL4: Instance Based Learning

Instance Based Learning Before

  • in supervised learning, we learn a function from data and use the function to predict future data.

Instance Based Learning Before

  • Now we keep everything in a database and f(x)=lookup(x). this database approach has pros (fast and simple) and cons (no generalizable and overfitting).

 Cost of the House example

  • classify unknow price houses by
    • nearest neighbor
    • k nearest neighbors

K-NN

  • in K-NN, there are a lot decisions can be made to get the result: e.g. classification: by voting or weighted voting; regression: average or weighted average.

Quiz 1: compute my neighbors

  • the running time and space required for each algorithm.
  • Sorted data points is more efficient in query.
  • Lazy learning (push work to when it’s needed) V.S. eager learning

quiz 2: Domain K-nnowledge

  • Different distance function and K makes very different prediction of the query point q.

K-NN Bias

  • K-NN works well when data has 1) locality, 2) smoothness and 3) equality of features

 Curse of Dimensionality

  • This is bad
  • the curse is not just for KNN, it’s for ML.

Some Other Stuff

  • Picking d(), picking K, replacing average with local weighted regression.

Wrap up


SL5: Ensemble Learning Boosting

 Ensemble Learning Boosting

  • example of spam email rules: but the rules are not definitive.

Ensemble learning algrithm

algorithm:

  1. pick up rules from subset of training data and
  2. combine rules.

Ensemble learning algrimth continue

For each step of ensemble learning algorithm, there are multiple ways of doing things.

  • picking up dataset uniformly randomly and
  • combine by average give us bagging. See examples below.

Quize 1. Ensemble Learning Outputs

Example 1:

  • N points and learner is zeor order polynormial
  • combining method: mean
  • Subsets are N disjoint, each w/ one data point.

What’s the ensemble output? A: average of N data points.

Now, another example:

Ensemble learning example

  • 3rd order polynomial (redline) on subset is better than 4th order polynomial on whole data on testing. It somehow removes overfitting.
  • Bagging or bootstrapping: take a random subset and combine by the mean

Ensemble Boosting

New subsetting and combination method:

  • Choose hard examples: hard relative to if we were to stop now, how well we do?
    • hard samples are samples which hypotheis do not match with the target function
  • Use weighted average

Quiz 2: Error

  • subscript D stands for distribution. So h is the hypothesis and C is the true underlying concept.
  • Olde definition: an error rate or an error percentage as the total number of mismatches over the total number of examples
  • New Definition: error is the probability, given the underlying distribution, that I will disagree with the true concept on some particular instance X.
  • Quiz 2: say we have 4 instance, we got 2 correct and 2 wrong, what’s the error? A: 0.5, assuming each instance has the same possibility to be seen.

Quiz 3: Error when instances have different possibility

  • What’s the error rate using new definition? A: 0.1.

Weak Learner

  • a weak learner is a learner that, no matter what the distribution is over your data, will do better than chance when it tries to learn labels on that data.

Quiz4: Weak Learning

  • Good D: find distribution over the 4 different examples, such that a learning algorithm that has to choose between one of those hypotheses will in fact be able to find one that does better than chance (by having an expected error less than 1/2)

  • evil D: find a distribution such that if you have that distribution over the four examples, a learning algorithm that only looked at H1, H2 and H3 would not be able to return one of them that has an expected error less than 1/2.

Quiz4: answer, and my answer above also passed the grader

Boosting in code

  • a training set made up of a bunch of xi, yi pairs.
  • loop at every time step t, from step 1, to some T
    • construct a distribution Dt
    • Given that distribution, find a weak classifier to output hypothesis Ht with some small error  ϵt, and ϵt can’t be bigger than 1/2.
  • eventually, output some final hypothesis. Hfinal.
  • the two really important things are left out:
    1. where do we get this distribution and
    2. where do we get this final hypothesis?

Quiz, what will happen when D agrees

  • Important part 1: how to construct distributions
    • initial D(i) = 1/n. No special weights
    • for Dt(i), use normalized (by Zt) distribution which is weighted by e to the alpha, Y, and ht .
  • When y and ht disagree, the distribution will change. but whether the change is increase or decrease will depends on normalization factor and others.

Final Hypothesis

  • Hfinal is defined as a SGN function of the sum of all the weighted hypothesis.

Boosting example

  • in three steps, boosting will solve the problem of separating red positive sign and green negative sign.
  • Size of sign represents their weights.

Quiz 6 answer

  • by average the three hypothesis, we can get the correct Hypothesis here.

Intuition of why boosting will converge to a good answer?

I am lost here, and need to read the proof!

Recap

In practice, Boosting might not have a overfitting problem (testing error goes high)

2016-01-24 SL4 完成
2016-21-25 SL5,初稿完成
原文地址 https://conge.livingwithfcs.org/2016/01/27/Machine-Learning-bi-ji-di-02-zhou/

Comments

Content