The dot product is a scalar. The dot product of two vectors gives you the value of the magnitude of one vector multiplied by the magnitude of the projection of the other vector on the first vector.

The cross product is a vector. The magnitude of the cross product of two vectors is the magnitude of one vector multiplied by the magnitude of the projection of the other vector in the direction orthogonal to the first vector.

Q-learning is a model-free reinforcement learning algorithm to learn quality of actions telling an agent what action to take under what circumstances. It does not require a model (hence the connotation “model-free”) of the environment, and it can handle problems with stochastic transitions and rewards, without requiring adaptations.

For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state.

Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy.

“Q” names the function that the algorithm computes with the maximum expected rewards for an action taken in a given state.

In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given random-sample-based statistic. If an arbitrarily large number of samples, each involving multiple observations (data points), were separately used in order to compute one value of a statistic (such as, for example, the sample mean or sample variance) for each sample, then the sampling distribution is the probability distribution of the values that the statistic takes on. In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.

Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the joint probability distribution of all the individual sample values.

Hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.

A hyperparameter is a parameter whose value is used to control the learning process.

By contrast, the values of other parameters (typically node weights) are learned.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns.

These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem.

Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.

The objective function takes a tuple of hyperparameters and returns the associated loss.

Cross-validation is often used to estimate this generalization performance.

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the clusterwith the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. …

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

  • True Positive Rate
  • False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

False Positive Rate (FPR) is defined as follows:

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.

Figure 4. TP vs. FP rate at different classification thresholds.

Image for post
Image for post

Each bar represents a digit, and the height of the bar is the percentage of numbers that start with that digit.

Frequency of first significant digit of physical constants plotted against Benford’s law

Benford’s law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data.

The law states that in many naturally occurring collections of numbers, the leading digit is likely to be small.

For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. …

The gambler’s fallacy, also known as the Monte Carlo fallacy or the fallacy of the maturity of chances, is the erroneous belief that if a particular event occurs more frequently than normal during the past it is less likely to happen in the future (or vice versa), when it has otherwise been established that the probability of such events does not depend on what has happened in the past. Such events, having the quality of historical independence, are referred to as statistically independent.

The fallacy is commonly associated with gambling, where it may be believed, for example, that the next dice roll is more than usually likely to be six because there have recently been less than the usual number of sixes.

The term “Monte Carlo fallacy” originates from the best known example of the phenomenon, which occurred in the Monte Carlo Casino in 1913.

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.

Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model “learns”.

In the adaptive control literature, the learning rate is commonly referred to as gain.

In setting a learning rate, there is a trade-off between the rate of convergence and overshooting.

While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction.

A too high learning rate will make the learning jump over minima but a too low learning rate will either take too long to converge or get stuck in an undesirable local minimum.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Web scraping a web page involves fetching it and extracting from it.

Web scraping is used for contact scraping, and as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup and, web data integration.

Keep in mind that web scraping may be against the terms of use of some websites.


Mau Ruanova

Welcome to my Data Science blog. Please visit my career portfolio at 🚀🌎

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store