To install click the Add extension button. That's it.

The source code for the WIKI 2 extension is being checked by specialists of the Mozilla Foundation, Google, and Apple. You could also do it yourself at any point in time.

4,5
Kelly Slayton
Congratulations on this excellent venture… what a great idea!
Alexander Grigorievskiy
I use WIKI 2 every day and almost forgot how the original Wikipedia looks like.
Live Statistics
English Articles
Improved in 24 Hours
Added in 24 Hours
What we do. Every page goes through several hundred of perfecting techniques; in live mode. Quite the same Wikipedia. Just better.
.
Leo
Newton
Brights
Milds

Local case-control sampling

From Wikipedia, the free encyclopedia

In machine learning, local case-control sampling [1] is an algorithm used to reduce the complexity of training a logistic regression classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of a (unreliable) pilot estimation of the parameters. It then performs a single pass over the entire dataset using the pilot estimation to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or training using a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structures of conditional imbalanced datasets more efficiently than alternative methods, such as case control sampling and weighted case control sampling.

YouTube Encyclopedic

  • 1/3
    Views:
    24 144
    47 449
    425
  • Mod-01 Lec-18 Acceptance Sampling
  • 9. Volatility Modeling
  • Module 11: Sampling Instrumentation for Airborne Nanomaterials

Transcription

Imbalanced datasets

In classification, a dataset is a set of N data points , where is a feature vector, is a label. Intuitively, a dataset is imbalanced when certain important statistical patterns are rare. The lack of observations of certain patterns does not always imply their irrelevance. For example, in medical studies of rare diseases, the small number of infected patients (cases) conveys the most valuable information for diagnosis and treatments.

Formally, an imbalanced dataset exhibits one or more of the following properties:

  • Marginal Imbalance. A dataset is marginally imbalanced if one class is rare compared to the other class. In other words, .
  • Conditional Imbalance. A dataset is conditionally imbalanced when it is easy to predict the correct labels in most cases. For example, if , the dataset is conditionally imbalanced if and .

Algorithm outline

In logistic regression, given the model , the prediction is made according to . The local-case control sampling algorithm assumes the availability of a pilot model . Given the pilot model, the algorithm performs a single pass over the entire dataset to select the subset of samples to include in training the logistic regression model. For a sample , define the acceptance probability as . The algorithm proceeds as follows:

  1. Generate independent for .
  2. Fit a logistic regression model to the subsample , obtaining the unadjusted estimates .
  3. The output model is , where and .

The algorithm can be understood as selecting samples that surprises the pilot model. Intuitively these samples are closer to the decision boundary of the classifier and is thus more informative.

Obtaining the pilot model

In practice, for cases where a pilot model is naturally available, the algorithm can be applied directly to reduce the complexity of training. In cases where a natural pilot is nonexistent, an estimate using a subsample selected through another sampling technique can be used instead. In the original paper describing the algorithm, the authors propose to use weighted case-control sampling with half the assigned sampling budget. For example, if the objective is to use a subsample with size , first estimate a model using samples from weighted case control sampling, then collect another samples using local case-control sampling.

Larger or smaller sample size

It is possible to control the sample size by multiplying the acceptance probability with a constant . For a larger sample size, pick and adjust the acceptance probability to . For a smaller sample size, the same strategy applies. In cases where the number of samples desired is precise, a convenient alternative method is to uniformly downsample from a larger subsample selected by local case-control sampling.

Properties

The algorithm has the following properties. When the pilot is consistent, the estimates using the samples from local case-control sampling is consistent even under model misspecification. If the model is correct then the algorithm has exactly twice the asymptotic variance of logistic regression on the full data set. For a larger sample size with , the factor 2 is improved to .

References

  1. ^ Fithian, William; Hastie, Trevor (2014). "Local case-control sampling: Efficient subsampling in imbalanced data sets". The Annals of Statistics. 42 (5): 1693–1724. arXiv:1306.3706. doi:10.1214/14-aos1220. PMC 4258397. PMID 25492979.
This page was last edited on 22 August 2022, at 13:19
Basis of this page is in Wikipedia. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with Wikimedia Foundation.