
Reflex Based Models

Reflex agent
  • Receives input from the environment
  • Runs the input through a predictor to predict the output
  • Outputs the result
  • Non-linearity

    Quadratic predictors
    Quadratic classifiers
    Decision boundary - a circle
    Piecewise constant predictors
    Predictors with periodicity structure

    Linear predictors

    Feature template
    A group of features that are all computed in a similar way,
    e.g. whether a string ends with .com or .cn (see the sketch below).
    • dense feature
    • sparse feature
      • mostly zeros
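    A minimal sketch of such a feature template, returning a sparse feature vector as a dict of non-zero entries; the feature names and the example string are made up for illustration:

```python
# Minimal sketch: a feature template "ends with ___", computed the same way for each suffix.
def extract_features(x: str) -> dict:
    """Return a sparse feature vector: only non-zero features are stored."""
    features = {}
    for suffix in (".com", ".cn"):
        if x.endswith(suffix):
            features[f"ends with {suffix}"] = 1
    return features

print(extract_features("example.com"))  # {'ends with .com': 1}
```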

    Linear classifier

    f_w(x) = \text{sign}(s(x,w)) = \begin{cases} +1 & \text{if } w \cdot \phi(x) > 0 \\ -1 & \text{if } w \cdot \phi(x) < 0 \\ ? & \text{if } w \cdot \phi(x) = 0 \end{cases}

    Margin

    • larger values are better
    m(x,y,w) = s(x,w) \times y
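    A minimal sketch of the linear classifier and its margin, assuming a hand-picked weight vector and feature vector:

```python
import numpy as np

def score(phi_x, w):
    return np.dot(w, phi_x)          # s(x, w) = w · φ(x)

def classify(phi_x, w):
    s = score(phi_x, w)
    return +1 if s > 0 else (-1 if s < 0 else None)   # "?" when the score is exactly 0

def margin(phi_x, y, w):
    return score(phi_x, w) * y       # m(x, y, w) = s(x, w) · y, larger is better

w = np.array([2.0, -1.0])            # illustrative weights
phi_x = np.array([1.0, 0.5])         # illustrative feature vector
print(classify(phi_x, w), margin(phi_x, +1, w))  # 1, 1.5
```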

    Linear regression

    f_w(x) = s(x,w) = w \cdot \phi(x)

    Residual

    Amount by which the prediction f_w(x) overshoots the target y.

    r(x,y,w) = f_w(x) - y = s(x,w) - y
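    A minimal sketch of the linear regression prediction and its residual, with made-up values:

```python
import numpy as np

# Illustrative weights, feature vector, and target.
w = np.array([2.0, -1.0])
phi_x = np.array([1.0, 0.5])
y = 1.0

f_wx = np.dot(w, phi_x)      # f_w(x) = w · φ(x) = 1.5
residual = f_wx - y          # r(x, y, w) = f_w(x) - y = 0.5 (overshoot)
print(f_wx, residual)
```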

    Loss minimization

    Loss function
    \text{Loss}(x,y,w) - a function of the input x, the output y, and the weights w.
    Classification case
    • Zero-one loss: 1_{m(x,y,w) \le 0}
    • Hinge loss: \max(1-m(x,y,w),\,0)
    • Logistic loss: \log(1+e^{-m(x,y,w)})

    Regression case

    • Squared loss: (\textrm{res}(x,y,w))^2
    • Absolute deviation loss: |\textrm{res}(x,y,w)|

    Zero-one loss

    \text{Loss}_{0-1}(x,y,w) = 1[f_w(x) \ne y] = 1[\underbrace{(w \cdot \phi(x))\,y}_{\text{margin}} \le 0]

    Logistic regression

    \text{Loss}_\text{logistic}(x,y,w) = \log(1+e^{-m(x,y,w)})
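    A minimal sketch of the classification and regression losses above, written directly in terms of the margin m and the residual r (example values are made up):

```python
import numpy as np

def zero_one_loss(m):  return float(m <= 0)            # 1[m(x,y,w) <= 0]
def hinge_loss(m):     return max(1 - m, 0)            # max(1 - m, 0)
def logistic_loss(m):  return np.log(1 + np.exp(-m))   # log(1 + e^{-m})

def squared_loss(r):   return r ** 2                   # (res)^2
def absolute_loss(r):  return abs(r)                   # |res|

m = 0.5   # example margin
r = -2.0  # example residual
print(zero_one_loss(m), hinge_loss(m), logistic_loss(m))
print(squared_loss(r), absolute_loss(r))
```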

    Loss minimization framework

    \textrm{TrainLoss}(w)=\frac{1}{|\mathcal{D}_{\textrm{train}}|}\sum_{(x,y)\in\mathcal{D}_{\textrm{train}}}\textrm{Loss}(x,y,w)
    Group DRO
    Group distributionally robust optimization
    \textrm{TrainLoss}_\text{max}(w) = \max_g \textrm{TrainLoss}_g(w)
    \nabla\textrm{TrainLoss}_\text{max}(w) = \nabla\textrm{TrainLoss}_{g^*}(w) \quad \text{where } g^* = \underset{g}{\textrm{argmax}}\,\textrm{TrainLoss}_g(w)
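    A minimal sketch of TrainLoss as the average per-example loss and of the group DRO objective as the worst per-group training loss; the hinge loss and the toy groups are illustrative assumptions:

```python
import numpy as np

def hinge(m):
    return max(1.0 - m, 0.0)

def train_loss(examples, w):
    # TrainLoss(w) = (1/|D|) * sum over (phi_x, y) of Loss(x, y, w)
    return np.mean([hinge(np.dot(w, phi_x) * y) for phi_x, y in examples])

def group_dro_loss(groups, w):
    # TrainLoss_max(w) = max_g TrainLoss_g(w); the gradient is taken on the worst group g*.
    return max(train_loss(g, w) for g in groups)

w = np.array([1.0, -0.5])
group_a = [(np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]
group_b = [(np.array([2.0, 2.0]), -1)]
print(train_loss(group_a + group_b, w), group_dro_loss([group_a, group_b], w))
```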

    Non-linear predictors

    k-nearest neighbors (KNN)
    Used for both classification and regression; a minimal classification sketch appears after the bullets below.

    • k
      • larger k → higher bias
      • larger k → lower variance
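    A minimal sketch of KNN classification by majority vote over the k nearest training points (toy data):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)         # distance to every training point
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # 1
```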
    Neural networks
    神经网络
    z_j^{[i]} = {w_j^{[i]}}^T x + b_j^{[i]}
    • w - weight
    • b - bias
    • x - input
    • z - pre-activation (non-activated) output (see the sketch below)
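    A minimal sketch of a single layer computing z = Wx + b followed by a sigmoid activation; the layer sizes and values are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])        # input
W = rng.normal(size=(4, 3))          # weights of a layer with 4 hidden units
b = np.zeros(4)                      # biases
z = W @ x + b                        # pre-activation output
a = sigmoid(z)                       # activated output
print(z.shape, a)
```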

    Stochastic gradient descent

    Gradient descent

    w \longleftarrow w - \eta\nabla_w \textrm{Loss}(x,y,w)
    • \eta \in \mathbb{R}
      • learning rate - the step size, i.e. how much to update on each step
    Stochastic gradient descent - SGD
    Stochastic updates - update after every single training example
    Batch gradient descent - BGD
    Batch updates - update once per pass over the training set
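    A minimal sketch contrasting the two update schemes on the squared loss for a linear predictor; the data, learning rate, and epoch count are made up:

```python
import numpy as np

def grad_squared_loss(w, phi_x, y):
    # ∇_w (w·φ(x) - y)^2 = 2 (w·φ(x) - y) φ(x)
    return 2.0 * (np.dot(w, phi_x) - y) * phi_x

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
eta = 0.1

w_sgd = np.zeros(2)
for _ in range(100):                       # SGD: one update per training example
    for phi_x, y in zip(X, Y):
        w_sgd -= eta * grad_squared_loss(w_sgd, phi_x, y)

w_bgd = np.zeros(2)
for _ in range(100):                       # batch GD: one update per pass over the set
    grad = np.mean([grad_squared_loss(w_bgd, phi_x, y) for phi_x, y in zip(X, Y)], axis=0)
    w_bgd -= eta * grad

print(w_sgd, w_bgd)   # both should approach w ≈ [1, 2]
```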

    Fine-tuning models

    Hypothesis class
    \mathcal{F}=\left\{f_w : w\in\mathbb{R}^d\right\}
    Logistic function
    \sigma - sigmoid function
    \boxed{\forall z\in\,]-\infty,+\infty[,\quad\sigma(z)=\frac{1}{1+e^{-z}}}\qquad\sigma'(z)=\sigma(z)(1-\sigma(z))
    Backpropagation
    g_i=\frac{\partial\,\textrm{out}}{\partial f_i}
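    A minimal sketch numerically checking the sigmoid derivative identity σ'(z) = σ(z)(1 − σ(z)) used when backpropagating through a sigmoid unit (the test point z = 0.7 is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference derivative
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)   # both ≈ 0.2217
```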

    Approximation error \epsilon_\text{approx}
    Gap between the best predictor in the hypothesis class and the target predictor.
    Estimation error \epsilon_\text{est}
    Gap between the learned predictor and the best predictor in the hypothesis class.

    • Regularization
      • keeps the model from overfitting (a sketch comparing the three methods below appears after the dataset list)
      • LASSO
        • Shrinks some coefficients exactly to 0
        • Good for variable selection
      • Ridge
        • Makes coefficients smaller
      • Elastic Net
        • Tradeoff between variable selection and small coefficients
    • Hyperparameters
    • Dataset vocabulary
    • Training set (\mathcal{D}_{\textrm{train}}) - about 80% of the data, used to train the model
    • Validation set - also called hold-out or development set, about 20% of the data, used to assess the model
    • Testing set (\mathcal{D}_{\textrm{test}}) - data the model has never seen, used for the final evaluation
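    As referenced above, a minimal sketch comparing the three regularizers on synthetic data, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic data: only the first two of five features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# LASSO tends to zero out the irrelevant coefficients, Ridge only shrinks them,
# Elastic Net sits in between.
for model in (Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1)):
    print(type(model).__name__, model.fit(X, y).coef_.round(2))
```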

    Unsupervised Learning

    k-means

    Clustering
    Given the training points x_i \in \mathcal{D}_{\textrm{train}}, clustering assigns each point \phi(x_i) to one of k clusters, z_i \in \{1,...,k\}.
    Objective function
    The loss that k-means minimizes over the assignments z and the centers \mu:
    \textrm{Loss}_{\textrm{k-means}}(x,\mu)=\sum_{i=1}^n\|\phi(x_i)-\mu_{z_i}\|^2
    k-means algorithm
  1. Randomly pick k points as the initial centers of the k clusters
  2. Compute the distance from each point to each of the k cluster centers
  3. Assign each point to the cluster with the nearest center
  4. Recompute the center of each cluster
  5. Repeat steps 2-4 until convergence

    \boxed{z_i=\underset{j}{\textrm{arg min}}\|\phi(x_i)-\mu_j\|^2}\quad\textrm{and}\quad\boxed{\mu_j=\frac{\displaystyle\sum_{i=1}^n 1_{\{z_i=j\}}\phi(x_i)}{\displaystyle\sum_{i=1}^n 1_{\{z_i=j\}}}}
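    A minimal sketch of the algorithm above on toy 2-D data (k = 2, fixed number of iterations, no empty-cluster handling):

```python
import numpy as np

def kmeans(phi, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = phi[rng.choice(len(phi), size=k, replace=False)]               # step 1: random initial centers
    for _ in range(iters):
        dists = np.linalg.norm(phi[:, None, :] - mu[None, :, :], axis=2)  # step 2: distances to centers
        z = np.argmin(dists, axis=1)                                     # step 3: assign to nearest center
        mu = np.array([phi[z == j].mean(axis=0) for j in range(k)])      # step 4: recompute centers
    return z, mu

# Two well-separated toy blobs.
phi = np.vstack([np.random.default_rng(1).normal(0, 0.2, (10, 2)),
                 np.random.default_rng(2).normal(3, 0.2, (10, 2))])
z, mu = kmeans(phi, k=2)
print(z, mu.round(2))
```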

    Principal Component Analysis

    • Eigenvalue, eigenvector
    \boxed{Az=\lambda z}

    Spectral theorem

    \exists\Lambda\textrm{ diagonal},\quad A=U\Lambda U^T
    Before running PCA, normalize each feature to zero mean and unit variance:
    \boxed{\phi_j(x_i)\leftarrow\frac{\phi_j(x_i)-\mu_j}{\sigma_j}}\quad\textrm{where}\quad\mu_j=\frac{1}{n}\sum_{i=1}^n\phi_j(x_i)\quad\textrm{and}\quad\sigma_j^2=\frac{1}{n}\sum_{i=1}^n(\phi_j(x_i)-\mu_j)^2
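    A minimal sketch of PCA using the normalization and eigendecomposition above, on synthetic data, projecting onto the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 3))                   # synthetic feature matrix

mu = phi.mean(axis=0)
sigma = phi.std(axis=0)
phi_norm = (phi - mu) / sigma                     # φ_j(x_i) ← (φ_j(x_i) − μ_j) / σ_j

A = phi_norm.T @ phi_norm / len(phi_norm)         # symmetric matrix to diagonalize
eigvals, U = np.linalg.eigh(A)                    # A = U Λ U^T (spectral theorem)
top = U[:, np.argsort(eigvals)[::-1][:1]]         # eigenvector with the largest eigenvalue
projected = phi_norm @ top                        # projection onto the first principal component
print(projected.shape)                            # (100, 1)
```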

    Misc

    • y = w_1x_1 + w_2x_2 + ... + w_nx_n + b (see the sketch at the end of this section)
      • y is the output
      • x_i is the input
      • w_i is the weight
      • b is the bias
    • Supervised learning
      • Has both inputs and outputs
      • Inputs are features
      • Outputs are labels
      • By learning the relationship between inputs and outputs, the model can predict outputs for unseen inputs
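    As referenced above, a minimal sketch of evaluating y = w_1 x_1 + ... + w_n x_n + b with made-up weights, inputs, and bias:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # weights
b = 0.1                          # bias
x = np.array([1.0, 2.0, 3.0])    # input features
y = np.dot(w, x) + b             # output: 0.5 - 2.0 + 6.0 + 0.1 = 4.6
print(y)
```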