The dot product is a scalar. The dot product of two vectors equals the magnitude of one vector multiplied by the signed length of the projection of the other vector onto the first. It is therefore negative when the angle between the vectors exceeds 90°.

The cross product is a vector. The magnitude of the cross product of two vectors is the magnitude of one vector multiplied by the magnitude of the projection of the other vector in the direction orthogonal to the first vector.
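As a small illustration of both definitions, the sketch below computes the two products for a pair of example vectors chosen here for convenience (they are not from the text). The helper names `dot`, `cross`, and `norm` are hypothetical.

```python
import math

def dot(a, b):
    """Dot product: a scalar."""
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    """Cross product of two 3-D vectors: a vector orthogonal to both."""
    ax, ay, az = a
    bx, by, bz = b
    return (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)

def norm(a):
    """Euclidean magnitude of a vector."""
    return math.sqrt(dot(a, a))

# Example vectors (assumed for illustration): a lies along the x-axis.
a = (3.0, 0.0, 0.0)
b = (2.0, 2.0, 0.0)

# The projection of b onto a has length 2, so a . b = |a| * 2 = 6.
dot_ab = dot(a, b)

# The component of b orthogonal to a has length 2, so |a x b| = |a| * 2 = 6.
cross_ab = cross(a, b)
cross_magnitude = norm(cross_ab)
```

Here the cross product points along the z-axis, orthogonal to both inputs, which is why its dot product with either input is zero.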

**Q-learning** is a model-free reinforcement learning algorithm that learns the quality of actions, telling an agent what action to take under what circumstances. It does not require a model of the environment (hence the term “model-free”), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

For any finite Markov decision process (FMDP), *Q*-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state.

*Q*-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy.

“Q” refers to the function that the algorithm computes: the maximum expected reward for an action taken in a given state.
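A minimal sketch of tabular Q-learning, using the standard update Q(s, a) ← Q(s, a) + α [r + γ max Q(s′, ·) − Q(s, a)]. The four-state chain environment, the function names, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions made for illustration. The epsilon-greedy behaviour plays the role of the partly-random policy mentioned above.

```python
import random

N_STATES = 4        # hypothetical chain: states 0..3, state 3 is terminal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right

def step(state, action):
    """Deterministic toy environment: reward 1 only when the goal is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Partly-random (epsilon-greedy) action selection.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            nxt, reward, done = step(state, action)
            # No future value is bootstrapped from a terminal state.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q_table = q_learning()
# Greedy policy for the non-terminal states.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

No model of `step` is used by the learner: it only observes the sampled transition and reward, which is what "model-free" means here.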

In statistics, a **sampling distribution** or **finite-sample distribution** is the probability distribution of a given random-sample-based statistic. If an arbitrarily large number of samples, each involving multiple observations (data points), were separately used in order to compute one value of a statistic (such as, for example, the sample mean or sample variance) for each sample, then the sampling distribution is the probability distribution of the values that the statistic takes on. In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.

Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the joint probability distribution of all the individual sample values.
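The idea can be approximated by simulation. The sketch below draws many samples from a uniform distribution on [0, 1) (an assumed population, chosen only for illustration) and collects the sample mean of each. The spread of those means should be close to σ/√n, which for this population is √(1/12)/√30 ≈ 0.0527.

```python
import random
import statistics

def sample_means(n_samples=5000, sample_size=30, seed=0):
    """Return one sample mean per simulated sample of uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.random() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

means = sample_means()

# Empirical centre and spread of the sampling distribution of the mean.
centre = statistics.fmean(means)
spread = statistics.stdev(means)

# Theoretical standard error for a uniform(0, 1) population and n = 30.
theoretical_spread = (1 / 12) ** 0.5 / 30 ** 0.5
```

In practice only one sample is usually available, so the theoretical value (or a bootstrap estimate) stands in for this simulation.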

**Hyperparameter optimization** or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.

A hyperparameter is a parameter whose value is used to control the learning process.

By contrast, the values of other parameters (typically node weights) are learned.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns.

These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem.

Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.

The objective function takes a tuple of hyperparameters and returns the associated loss.

Cross-validation is often used to estimate this generalization performance.
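A minimal sketch of this procedure as a grid search, under several assumptions: the model is a one-dimensional ridge regression through the origin, the only hyperparameter is the regularization strength `lam`, the data are synthetic, and all function names are hypothetical.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form weight for 1-D ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_loss(xs, ys, lam, k=5):
    """Objective function: hyperparameter in, k-fold cross-validated loss out."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        val_idx = set(range(fold, n, k))   # every k-th point is held out
        train = [(xs[i], ys[i]) for i in range(n) if i not in val_idx]
        val = [(xs[i], ys[i]) for i in range(n) if i in val_idx]
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        total += mse(w, [x for x, _ in val], [y for _, y in val])
    return total / k

# Synthetic data (assumed): y = 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

# Evaluate every candidate and keep the one with the lowest estimated loss.
grid = (0.001, 0.1, 1.0, 100.0)
scores = {lam: cv_loss(xs, ys, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
```

Note that the weight `w` is learned from the training folds, while `lam` is fixed before fitting and chosen only by comparing held-out losses.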

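**k-means clustering** is a method of vector quantization, originally from signal processing, that aims to partition *n* observations into *k* clusters in which each observation belongs to the cluster with the nearest mean (the cluster centroid).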
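One common heuristic for this partitioning problem is Lloyd's algorithm, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. The sketch below is a minimal version under stated assumptions: the function name, the example points, and the initial centroids are all chosen here for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm; `centroids` holds the initial guesses (iterations >= 1)."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of example points (assumed).
points = [(0, 0), (0, 2), (2, 0), (2, 2), (10, 10), (10, 12), (12, 10), (12, 12)]
centroids, clusters = kmeans(points, centroids=[points[0], points[-1]])
```

The result depends on the initial centroids in general. Practical implementations typically run several initializations and keep the best partition.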

An **ROC curve** (**receiver operating characteristic curve**) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

- True Positive Rate
- False Positive Rate

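**True Positive Rate** (**TPR**) is a synonym for recall and is therefore defined as follows:

$$TPR = \frac{TP}{TP + FN}$$

**False Positive Rate** (**FPR**) is defined as follows:

$$FPR = \frac{FP}{FP + TN}$$

Here TP, FN, FP, and TN are the counts of true positives, false negatives, false positives, and true negatives.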

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.

**Figure 4. TP vs. FP rate at different classification thresholds.** …
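The points on an ROC curve can be computed directly from the definitions. The sketch below is illustrative only: the function name `roc_points`, the labels, the scores, and the thresholds are assumptions, and an item is classified as positive when its score is at or above the threshold.

```python
def roc_points(labels, scores, thresholds):
    """Return one (FPR, TPR) pair per classification threshold."""
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Toy data (assumed): true labels and the scores a classifier gave them.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves the curve from (0, 0) toward (1, 1).
curve = roc_points(labels, scores, thresholds=[0.9, 0.5, 0.3, 0.0])
```

The function assumes the data contain at least one positive and one negative example. Otherwise one of the rates is undefined.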

**Figure. Distribution of first digits according to Benford’s law.** Each bar represents a digit, and the height of the bar is the percentage of numbers that start with that digit.

**Figure. Frequency of the first significant digit of physical constants, plotted against Benford’s law.**

**Benford’s law**, also called the **Newcomb–Benford law**, the **law of anomalous numbers**, or the **first-digit law**, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data.

The law states that in many naturally occurring collections of numbers, the leading digit is likely to be small.

For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. …
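The percentages quoted above follow from the usual statement of the law, P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. The sketch below evaluates that formula and, as an assumed example data set, compares it with the leading digits of the first 1000 powers of 2.

```python
import math
from collections import Counter

# Benford probabilities for leading digits 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Example data set (assumed): leading digits of 2**0 .. 2**999.
leading = Counter(int(str(2 ** n)[0]) for n in range(1000))
observed = {d: leading[d] / 1000 for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: expected {benford[d]:.3f}, observed {observed[d]:.3f}")
```

Powers of 2 are a convenient test case because they span many orders of magnitude. Data confined to a narrow range generally do not follow the law.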

The **gambler’s fallacy**, also known as the **Monte Carlo fallacy** or the **fallacy of the maturity of chances**, is the erroneous belief that if a particular event has occurred more frequently than normal in the past, it is less likely to happen in the future (or vice versa), when it has otherwise been established that the probability of such events does not depend on what has happened in the past. Such events, having the quality of historical independence, are referred to as statistically independent.

The fallacy is commonly associated with gambling, where it may be believed, for example, that the next dice roll is more likely than usual to be a six because there have recently been fewer sixes than usual.
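A quick simulation makes the independence concrete. The sketch below rolls a simulated fair die many times and measures how often a six appears immediately after a run of at least ten rolls without one. The function name, the run length, and the number of rolls are assumptions for illustration.

```python
import random

def six_rate_after_drought(n_rolls=300_000, drought=10, seed=0):
    """Frequency of a six on rolls that follow `drought` or more non-six rolls."""
    rng = random.Random(seed)
    since_six = 0          # consecutive rolls since the last six
    hits = trials = 0
    for _ in range(n_rolls):
        roll = rng.randint(1, 6)
        if since_six >= drought:
            trials += 1
            hits += roll == 6
        since_six = 0 if roll == 6 else since_six + 1
    return hits / trials

rate = six_rate_after_drought()
print(f"P(six | no six in last 10 rolls) ≈ {rate:.3f}, unconditional = {1 / 6:.3f}")
```

The conditional frequency stays close to 1/6: a six is not "due" after a long run without one.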

The term “Monte Carlo fallacy” originates from the best known example of the phenomenon, which occurred in the Monte Carlo Casino in 1913.

In machine learning and statistics, the **learning rate** is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.

Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model “learns”.

In the adaptive control literature, the learning rate is commonly referred to as **gain**.

In setting a learning rate, there is a trade-off between the rate of convergence and overshooting.

While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction.

A learning rate that is too high will make the learning jump over minima, while one that is too low will either take too long to converge or get stuck in an undesirable local minimum.
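The trade-off can be seen with plain gradient descent on the loss f(x) = x², whose gradient is 2x and whose minimum is at x = 0. The loss function, starting point, step count, and the three learning rates below are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, x0=1.0, steps=50):
    """Minimise f(x) = x**2 with a fixed learning rate; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x                 # gradient of x**2 gives the descent direction
        x -= learning_rate * grad    # the learning rate sets the step size
    return x

well_tuned = gradient_descent(0.1)    # each step shrinks x by a factor of 0.8
too_low = gradient_descent(0.001)     # barely moves from x0 in 50 steps
too_high = gradient_descent(1.1)      # overshoots: |x| grows by 1.2 each step

print(well_tuned, too_low, too_high)
```

Because this loss is convex, the sketch shows slow convergence and divergence but not the local-minimum case, which requires a non-convex loss.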

**Web scraping**, **web harvesting**, or **web data extraction** is data scraping used for extracting data from websites.

It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Scraping a web page involves fetching it and then extracting data from it.
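A minimal sketch of the extraction step using Python's standard-library `html.parser`. To keep the example self-contained, the page is an inline string rather than a fetched document. In a real scraper the HTML would first be downloaded, for example with `urllib.request.urlopen`. The class name `LinkExtractor`, the helper `extract_links`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    """Parse an HTML string and return its link targets in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Stand-in for a fetched page (assumed content).
SAMPLE_PAGE = (
    '<html><body><a href="/a">A</a><p>text</p>'
    '<a href="https://example.com/b">B</a></body></html>'
)

links = extract_links(SAMPLE_PAGE)
```

The extracted values would then typically be written to a database or spreadsheet for later analysis.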

Web scraping is used for contact scraping, and as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups, and web data integration.

Keep in mind that web scraping may be against the terms of use of some websites.
