Conge 精进

Machine Learning Notes: Week 04


Week 04 tasks:

  • Lectures: VC Dimensions and Bayesian Learning.
  • Reading: Mitchell Chapter 7 and Chapter 6.

SL8: VC Dimensions

Quiz 1: Which Hypothesis Spaces Are Infinite

  • $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$. Here the sample size m depends on the size of the hypothesis space |H|, the error ε, and the failure parameter δ. What happens if H is infinite?
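As a rough illustration of the bound above, the sketch below computes the smallest integer m that satisfies it for a finite hypothesis space. The function name `sample_bound` and the example numbers are assumptions made for illustration; they do not come from the lecture.

```python
import math


def sample_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta)).

    h_size  : number of hypotheses |H| (must be finite and >= 1)
    epsilon : error tolerance, 0 < epsilon < 1
    delta   : failure probability, 0 < delta < 1
    """
    if h_size < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and epsilon, delta in (0, 1)")
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)


# Hypothetical numbers: 2**10 hypotheses, 10% error, 5% failure probability.
m = sample_bound(h_size=2**10, epsilon=0.1, delta=0.05)
print(m)
```

The bound grows only logarithmically in |H|, but it still becomes useless once |H| is infinite, which is the gap the VC dimension is meant to fill.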

 Maybe It Is Not So Bad

  • In the example above, although the hypothesis space is syntactically infinite, we can still explore it efficiently, because many hypotheses are not meaningfully different from one another (the semantic hypothesis space is much smaller).

What Does VC Stand For

  • VC dimension: the size of the largest set of inputs that the hypothesis class can shatter, that is, label in every possible way.
  • VC stands for Vapnik-Chervonenkis.
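To make "shatter" concrete, here is a minimal brute-force sketch. It assumes, purely for illustration, a hypothesis class of closed intervals [a, b] on the real line that label a point 1 when it falls inside the interval. The function names are hypothetical.

```python
from itertools import product


def interval_can_realize(points, labels):
    """True if some closed interval [a, b] labels exactly the 1-points as 1."""
    positives = [x for x, y in zip(points, labels) if y == 1]
    if not positives:
        # An interval placed away from every point labels them all 0.
        return True
    a, b = min(positives), max(positives)
    # The tightest interval around the positives must not capture a negative.
    return all(not (a <= x <= b) for x, y in zip(points, labels) if y == 0)


def shatters(points):
    """True if intervals can produce every possible 0/1 labeling of points."""
    return all(
        interval_can_realize(points, labels)
        for labels in product([0, 1], repeat=len(points))
    )


print(shatters([1.0, 2.0]))       # two points
print(shatters([1.0, 2.0, 3.0]))  # three points
```

For this interval class, two points can be shattered but three cannot (the labeling 1, 0, 1 is impossible), so its VC dimension is 2.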

Quiz 2: Interval Training

  • not sure how to answer this question. need to rewatch.

Quiz 3: Linear Separators

  • Here VC = 3.

The ring

  • The VC dimension ends up being d + 1, because the number of parameters needed to represent a d-dimensional hyperplane is d + 1.

Quiz 4: Polygons

  • If the hypothesis class is "points inside some convex polygon", then the VC dimension is infinite.

Sample size with infinite hypothesis space

VC of finite H

recap lesson 8

Bayesian Learning

Bayesian Learning

  • the best hypothesis is the most probable hypothesis given data and domain knowledge.
  • $\arg\max_{h \in H} \Pr(h \mid D)$

Bayes Rule


  • Bayes Rule: $\Pr(h \mid D) = \frac{\Pr(D \mid h)\,\Pr(h)}{\Pr(D)}$
    • $\Pr(D)$ is the prior on the data.
    • $\Pr(h)$ is the prior on the hypothesis; this is where domain knowledge enters.
    • $\Pr(D \mid h)$ is the likelihood of the data given h; it is usually much easier to compute than $\Pr(h \mid D)$.

Quiz 1

  • Compare the probability of having versus not having spleentitis.
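The comparison in this quiz can be sketched numerically with Bayes' rule. The numbers below (0.8% prevalence, 98% true-positive rate, 97% true-negative rate) are assumed for illustration and should be checked against the lecture; the function name is hypothetical.

```python
def posterior_given_positive(prevalence, true_positive_rate, true_negative_rate):
    """Pr(disease | positive test) via Bayes' rule."""
    # Unnormalized posteriors: Pr(+ | h) * Pr(h) for each hypothesis h.
    has_it = true_positive_rate * prevalence
    does_not = (1.0 - true_negative_rate) * (1.0 - prevalence)
    # Pr(+) is the normalizer; it is the same for both hypotheses.
    return has_it / (has_it + does_not)


# Assumed illustrative numbers, not taken from these notes.
p = posterior_given_positive(
    prevalence=0.008, true_positive_rate=0.98, true_negative_rate=0.97
)
print(f"Pr(disease | +)    = {p:.4f}")
print(f"Pr(no disease | +) = {1 - p:.4f}")
```

With these assumed numbers, a positive test still leaves "no disease" as the more probable hypothesis, because the prior on the disease is so small.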

Bayesian Learning


  • To find the h with the largest $\Pr(h \mid D)$, we can drop $\Pr(D)$ from Bayes' rule: it does not depend on h, so it does not change which h is best. This gives MAP (maximum a posteriori): $h_{MAP} = \arg\max_{h \in H} \Pr(D \mid h)\,\Pr(h)$.
  • If we don't have a strong prior, or we assume the prior is uniform over every h, we can also drop $\Pr(h)$. This gives ML (maximum likelihood): $h_{ML} = \arg\max_{h \in H} \Pr(D \mid h)$.
  • The hard part is that we have to look at every h.
  • Since H is often very large, this learning algorithm is not practical.
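A minimal sketch of the MAP versus ML distinction over a tiny, finite hypothesis space. The coin-bias hypotheses, the prior, and the data are all hypothetical, chosen only so that the two criteria disagree.

```python
def coin_likelihood(data, p_heads):
    """Pr(D | h) for independent coin flips when h says Pr(heads) = p_heads."""
    result = 1.0
    for flip in data:
        result *= p_heads if flip == "H" else (1.0 - p_heads)
    return result


# Hypothetical hypothesis space: three candidate values for Pr(heads).
hypotheses = [0.3, 0.5, 0.9]
# Hypothetical prior: strong belief that the coin is fair.
prior = {0.3: 0.1, 0.5: 0.8, 0.9: 0.1}
data = ["H", "H", "H", "T"]

# ML: maximize Pr(D | h) only.
h_ml = max(hypotheses, key=lambda h: coin_likelihood(data, h))

# MAP: maximize Pr(D | h) * Pr(h); Pr(D) is dropped because it is constant in h.
h_map = max(hypotheses, key=lambda h: coin_likelihood(data, h) * prior[h])

print("h_ML  =", h_ml)
print("h_MAP =", h_map)
```

Both criteria loop over every h, which is exactly why this direct approach stops being practical once H is large.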

 Bayesian Learning in Action

Bayesian Learning when the data has no noise

  • Given noise-free data, the probability that a particular hypothesis is the correct one is simply uniform over all hypotheses in the version space, that is, over the hypotheses consistent with the data we have seen.

Quiz 2:

  • Given <x, d> pairs where $d_i = k \cdot x_i$ and the multiplier k occurs with probability $1/2^k$, what is the probability of D given h?

Bayes learning given Gaussian error

  • Given training data, the goal is to figure out f(x) even though each observed value carries an error term.
  • If the error can be modeled as Gaussian noise, finding $h_{ML}$ simplifies to minimizing the sum of squared errors.
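The step from maximum likelihood to squared error can be written out as follows. This sketch assumes each observation is $d_i = f(x_i) + \epsilon_i$ with independent zero-mean Gaussian noise of fixed variance $\sigma^2$; the symbols $d_i$, $x_i$ and $\sigma$ are notation introduced here.

```latex
\begin{align*}
h_{ML} &= \arg\max_{h \in H} \Pr(D \mid h) \\
       &= \arg\max_{h \in H} \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2}}
          \exp\!\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right) \\
       &= \arg\max_{h \in H} \sum_{i} -\frac{(d_i - h(x_i))^2}{2\sigma^2} \\
       &= \arg\min_{h \in H} \sum_{i} (d_i - h(x_i))^2
\end{align*}
```

The third line takes the logarithm, which does not change the argmax, and drops the constant normalizing term; the last line drops the positive factor $1/(2\sigma^2)$ and flips the sign.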

Quiz 3

  • Find the best hypothesis among the three.
    • Calculate and compare the squared error of each.
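The procedure for this quiz can be sketched as below. The data points and the three candidate hypotheses are hypothetical stand-ins and may not match the ones used in the lecture.

```python
def sum_squared_error(h, data):
    """Sum over (x, d) pairs of (d - h(x))^2."""
    return sum((d - h(x)) ** 2 for x, d in data)


# Hypothetical training data as (x, d) pairs.
data = [(1, 1), (3, 0), (6, 5), (10, 2), (11, 1), (13, 4)]

# Hypothetical candidate hypotheses.
candidates = {
    "constant 2": lambda x: 2,
    "x / 3": lambda x: x / 3,
    "x mod 9": lambda x: x % 9,
}

errors = {name: sum_squared_error(h, data) for name, h in candidates.items()}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name:>10}: {err:.3f}")
print("best:", best)
```

Under the Gaussian-noise assumption, the candidate with the smallest sum of squared errors is the maximum likelihood hypothesis among the three.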

Quiz 4: small trees

  • $h_{MAP}$ can be rewritten as minimizing the length of the hypothesis (the size of h) plus the length of D given h (the misclassification error).
  • There is a tradeoff between the size of h and its error; this is called minimum description length.
  • There is a unit problem: the error and the size must be expressed in comparable units.
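The rewriting of $h_{MAP}$ as a description-length problem can be sketched as follows, assuming an optimal code in which an event of probability p takes $-\log_2 p$ bits. The $\mathrm{length}(\cdot)$ notation is introduced here for illustration.

```latex
\begin{align*}
h_{MAP} &= \arg\max_{h \in H} \Pr(D \mid h)\,\Pr(h) \\
        &= \arg\max_{h \in H} \left[\log_2 \Pr(D \mid h) + \log_2 \Pr(h)\right] \\
        &= \arg\min_{h \in H} \left[-\log_2 \Pr(D \mid h) - \log_2 \Pr(h)\right] \\
        &= \arg\min_{h \in H} \left[\mathrm{length}(D \mid h) + \mathrm{length}(h)\right]
\end{align*}
```

Here $\mathrm{length}(h)$ is the number of bits needed to describe the hypothesis (for a decision tree, roughly its size) and $\mathrm{length}(D \mid h)$ is the number of bits needed to describe the data the hypothesis gets wrong.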

 Bayesian Classification


  • When we classify, we let every hypothesis vote.
  • Bayes optimal classifier = voting by every h, weighted by $\Pr(h \mid D)$.
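A minimal sketch of weighted voting follows. The three hypotheses, their posteriors, and their predictions are hypothetical numbers chosen so that the vote disagrees with the single most probable hypothesis.

```python
from collections import defaultdict


def bayes_optimal_label(posteriors, predictions):
    """Return the label with the largest posterior-weighted vote, plus the tally.

    posteriors  : dict mapping hypothesis name -> Pr(h | D)
    predictions : dict mapping hypothesis name -> label predicted for one input
    """
    votes = defaultdict(float)
    for h, p in posteriors.items():
        votes[predictions[h]] += p
    return max(votes, key=votes.get), dict(votes)


# Hypothetical posteriors and predictions for one new input.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

label, votes = bayes_optimal_label(posteriors, predictions)
h_map = max(posteriors, key=posteriors.get)

print("weighted votes:", votes)
print("Bayes optimal label:", label)
print("label from the MAP hypothesis alone:", predictions[h_map])
```

In this example the MAP hypothesis h1 predicts "+", but the combined weight of the other hypotheses makes "-" the more probable label, which is why voting can beat picking a single best h.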
2016-02-08: SL8 completed.
2016-02-08, early morning: SL9 completed. First draft published.