from wystswolf

The consequences of a touched eyeball are that you can run, but you cannot hide.

Wolfinwool · Isaiah 14-16

NARRATOR:

For Jehovah will show mercy to Jacob, and he will again choose Israel. He will settle them in their land, and the foreign residents will join them and attach themselves to the house of Jacob.

And peoples will take them and bring them to their own place, and the house of Israel will possess them as male and female servants in Jehovah’s land; and they will be the captors of those who held them captive, and they will have in subjection those who were forcing them to work.

In the day when Jehovah gives you rest from your pain and from your turmoil and from the hard slavery imposed on you, you will recite this proverb against the king of Babylon:


ISRAEL (PROVERB AGAINST THE KING OF BABYLON):

How the one forcing others to work has met his end! How the oppression has ended!

Jehovah has broken the rod of the wicked, the staff of the rulers, the one furiously striking peoples with unceasing blows, the one angrily subduing nations with relentless persecution.

The whole earth now rests, free of disturbance. People cry out for joy.

Even the juniper trees rejoice over you, along with the cedars of Lebanon. They say, ‘Ever since you have fallen, no woodcutter comes up against us.’

Even the Grave underneath is stirred up to meet you when you come. Because of you, it awakens those powerless in death, all the oppressive leaders of the earth. It makes all the kings of the nations rise from their thrones.

All of them speak up and say to you: ‘Have you also become weak like us? Have you become like us?

Down to the Grave your pride has been brought, the sound of your stringed instruments. Maggots are spread beneath you as a bed, and worms are your covering.’

How you have fallen from heaven, O shining one, son of the dawn! How you have been cut down to the earth, you who vanquished nations!

You said in your heart, ‘I will ascend to the heavens. Above the stars of God I will lift up my throne, and I will sit down on the mountain of meeting, in the remotest parts of the north. I will go up above the tops of the clouds; I will make myself resemble the Most High.’

Instead, you will be brought down to the Grave, to the remotest parts of the pit.

Those seeing you will stare at you; they will closely examine you, saying: ‘Is this the man who was shaking the earth, who made kingdoms tremble, who made the inhabited earth like the wilderness and overthrew its cities, who refused to let his prisoners go home?’

All other kings of the nations, yes, all of them, lie down in glory, each one in his own tomb.

But you are discarded without a grave, like a detested sprout, clothed with the slain who were stabbed with the sword, who go down to the stones of a pit, like a carcass trampled underfoot.

You will not join them in a grave, for you destroyed your own land, you killed your own people. The offspring of evildoers will never again be named.

Prepare a slaughtering block for his sons because of the guilt of their forefathers, so that they will not rise up and take over the earth and fill the land with their cities.


JEHOVAH OF ARMIES:

I will rise up against them. And I will wipe out from Babylon name and remnant and descendants and posterity.

And I will make her a possession of porcupines and a region of marshes, and I will sweep her with the broom of annihilation.


NARRATOR:

Jehovah of armies has sworn: “Just as I have intended, so it will occur, and just as I have decided, that is what will come true.

I will crush the Assyrian in my land, and I will trample him on my mountains. His yoke will be removed from them, and his load will be removed from their shoulder.”

This is what has been decided against all the earth, and this is the hand that is stretched out against all the nations.

For Jehovah of armies has decided, and who can thwart it? His hand is stretched out, and who can turn it back?

In the year that King Ahaz died, this pronouncement was made:


JEHOVAH (PRONOUNCEMENT AGAINST PHILISTIA):

Do not rejoice, Philistia, any of you, just because the staff of the one striking you has been broken. For from the root of the serpent will come a poisonous snake, and its offspring will be a flying fiery snake.

While the firstborn of the lowly feed and the poor lie down in security, I will put your root to death with famine, and what is left of you will be killed.

Wail, O gate! Cry out, O city! All of you will lose heart, O Philistia! For a smoke is coming from the north, and there are no stragglers in his ranks.

How should they answer the messengers of the nation? That Jehovah has laid the foundation of Zion, and that the lowly ones of his people will take refuge in her.


CHAPTER 15

NARRATOR (PRONOUNCEMENT AGAINST MOAB):

Because it has been devastated in a night, Ar of Moab has been silenced. Because it has been devastated in a night, Kir of Moab has been silenced.

He has gone up to the House and to Dibon, to the high places to weep. Moab wails over Nebo and over Medeba. Every head is shaved bald, every beard is clipped.

In its streets they have put on sackcloth. On their roofs and in their public squares they all wail; they go down weeping.

Heshbon and Elealeh cry out; their voice is heard as far as Jahaz. That is why the armed men of Moab keep shouting. He is trembling.

My heart cries out over Moab. Its fugitives have fled as far as Zoar and Eglath-shelishiyah. On the ascent of Luhith they weep as they go up; on the way to Horonaim they cry out over the catastrophe.

For the waters of Nimrim are desolate; the green grass has dried up, the grass is gone and nothing green is left.

That is why they are carrying away what is left of their stores and their riches; they are crossing the valley of poplars.

For the outcry echoes throughout the territory of Moab. The wailing reaches to Eglaim; the wailing reaches to Beer-elim.

For the waters of Dimon are full of blood, and I have more in store for Dimon: a lion for those of Moab who escape and for those remaining in the land.


CHAPTER 16

NARRATOR:

Send a ram to the ruler of the land, from Sela through the wilderness to the mountain of the daughter of Zion.

Like a bird chased away from its nest, so the daughters of Moab will be at the fords of Arnon.


COUNSEL TO MOAB:

Offer counsel, carry out the decision. Make your shadow at high noon like the night. Conceal the dispersed and do not betray those fleeing.

May my dispersed ones reside in you, O Moab. Become a place of concealment to them because of the destroyer. The oppressor will reach his end, the destruction will come to an end, and those trampling others down will perish from the earth.

Then a throne will be firmly established in loyal love. The one who sits on it in the tent of David will be faithful; he will judge fairly and will swiftly execute righteousness.


NARRATOR:

We have heard about the pride of Moab—he is very proud— his haughtiness and his pride and his fury; but his empty talk will come to nothing.

So Moab will wail for Moab; they will all wail. Those who are stricken will moan for the raisin cakes of Kir-hareseth.

For the terraces of Heshbon have withered, the vine of Sibmah. The rulers of the nations have trampled its bright-red branches; they had reached as far as Jazer; they had extended into the wilderness. Its shoots had spread out and gone as far as the sea.

That is why I will weep over the vine of Sibmah as I weep for Jazer. With my tears I will drench you, O Heshbon and Elealeh, because the shouting over your summer fruit and your harvest has ended.

Rejoicing and joyfulness have been taken away from the orchard, and there are no songs of joy or shouting in the vineyards. The treader no longer treads out wine in the presses, for I have caused the shouting to cease.

That is why deep within me I am boisterous over Moab, like the strumming of a harp, and my innermost being over Kir-hareseth.

Even when Moab wears himself out on the high place and goes to pray in his sanctuary, he will accomplish nothing.

This is the word that Jehovah previously spoke concerning Moab.

And now Jehovah says: “Within three years, like the years of a hired worker, the glory of Moab will be disgraced with much tumult of every sort, and those who remain will be very few and insignificant.”

 

from Justina Revolution

I did my 5 phase routine with Loosening, Cosmos Palm, Silk Reeling, and Swimming Dragon Baguazhang. This was so good as the sun rose behind me. I am increasing my power, my flexibility, my meditative abilities, and my body, mind, and spirit senses.

Weaving energy around my body, spreading my awareness from horizon to horizon. Generating stillness in both limited and unlimited forms. This is glorious. I am generating a world of benefits and my evolution, the activation of my DNA upgrades all beings in the multiverse.

There is no separation. It’s all one thing. I did the Monroe guided portal meditation last night. I know this energy of the portal. It is Akasha and I am joined with all beings in that beautiful pregnant void.

The Void is not emptiness or annihilation. It is the pregnant field from whence all things arise and to which all things return. This is my reality. As solid and true as my fist. Nothing is ever gone. Nothing is ever lost. There is no past and no future because there is no time. There is no loss because there is no space. Nothing can come to you or leave you. It is all here right now in this very moment.

 

from Stefan Angrick

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This fourth module covers critical considerations when building and deploying ML models in the real world, including productionisation best practices, automation, and responsible engineering.

Production ML systems

Introduction

The model is only a small part of real-world production ML systems. It often represents only 5% or less of the total codebase in the system.

[Image: MlSystem.png]

Source: Production ML systems | Machine Learning | Google for Developers

Static versus dynamic training

Machine learning models can be trained statically (once) or dynamically (continuously).

Static training (offline training):

  • Advantage: Simpler. You only need to develop and test the model once.
  • Disadvantage: Sometimes stale. Can become outdated if data patterns change, requiring data monitoring.

Dynamic training (online training):

  • Advantage: More adaptable. Keeps up with changes in data patterns, providing more accurate predictions.
  • Disadvantage: More work. You must build, test, and release a new product continuously.

Choosing between static and dynamic training depends on the specific dataset and how frequently it changes.

Monitoring input data is essential for both static and dynamic training to ensure reliable predictions.

Source: Production ML systems: Static versus dynamic training | Machine Learning | Google for Developers

Static versus dynamic inference

Inference involves using a trained model to make predictions on unlabelled examples, and it can be done as follows:

  • Static inference (offline inference, batch inference) generates predictions in advance and caches them, which suits scenarios where prediction speed is critical.

  • Dynamic inference (online inference, real-time inference) generates predictions on demand, offering flexibility for diverse inputs.

Static inference (offline inference, batch inference):

  • Advantage: No need to worry about the cost of inference; allows post-verification of predictions before pushing.
  • Disadvantage: Limited ability to handle uncommon inputs.

Dynamic inference (online inference, real-time inference):

  • Advantage: Can infer a prediction on any new item as it comes in.
  • Disadvantage: Compute-intensive and latency-sensitive; monitoring needs are intensive.

Choosing between static and dynamic inference depends on factors such as model complexity, desired prediction speed, and the nature of the input data.

Static inference is advantageous when cost and prediction verification are prioritised, while dynamic inference excels in handling diverse, real-time predictions.

Source: Production ML systems: Static versus dynamic inference | Machine Learning | Google for Developers

When to transform data?

Feature engineering can be performed before or during model training, each with its own advantages and disadvantages.

  • Transforming data before training allows for a one-time transformation of the entire dataset, but the same transformation must be carefully recreated at prediction time to avoid training-serving skew (see the sketch after this list).
  • Transforming data during training ensures consistency between training and prediction but can increase model latency and complicate batch processing.
    • When transforming data during training, considerations such as Z-score normalisation across batches with varying distributions need to be addressed.
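
To make the first bullet concrete, here is a minimal NumPy sketch (all values and names are illustrative): the Z-score parameters are computed once on the training set, and the identical transformation is reapplied at prediction time, which is what prevents training-serving skew.

```python
import numpy as np

# Illustrative training data for one numeric feature.
train_values = np.array([12.0, 15.0, 20.0, 22.0, 30.0])

# Compute the transformation parameters once, before training...
mu = train_values.mean()
sigma = train_values.std()

def z_score(x):
    """Normalise with the stored training-set statistics, not per-batch ones."""
    return (x - mu) / sigma

train_normalised = z_score(train_values)  # used for training
serving_value = z_score(25.0)             # ...and reuse them at prediction time
```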

Source: Production ML systems: When to transform data? | Machine Learning | Google for Developers

Deployment testing

Deploying a machine learning model involves validating data, features, model versions, serving infrastructure, and pipeline integration.

Reproducible model training involves deterministic seeding, fixed initialisation order, averaging multiple runs, and using version control.
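
A minimal sketch of the deterministic-seeding part (the exact framework call varies; tf.random.set_seed is the TensorFlow 2 equivalent):

```python
import os
import random
import numpy as np

SEED = 42

# Seed every source of randomness the pipeline touches.
os.environ["PYTHONHASHSEED"] = str(SEED)  # Python hash randomisation
random.seed(SEED)                         # Python's built-in RNG
np.random.seed(SEED)                      # NumPy's global RNG
# tf.random.set_seed(SEED)                # framework RNG, if using TensorFlow 2
```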

Integration tests ensure that different components of the ML pipeline work together seamlessly and should run continuously and for new model or software versions.

Before serving a new model, validate its quality by checking for sudden and gradual degradations against previous versions and fixed thresholds.

Ensure model-infrastructure compatibility by staging the model in a sandboxed server environment to avoid dependency conflicts.

Source: Production ML systems: Deployment testing | Machine Learning | Google for Developers

Monitoring pipelines

ML pipeline monitoring involves validating data (using data schemas) and features (using unit tests), tracking real-world metrics, and addressing potential biases in data slices.

Monitoring training-serving skew, label leakage, model age, and numerical stability is crucial for maintaining pipeline health and model performance.

  • Training-serving skew means that input data during training differs from input data during serving, for example because training and serving data use different schemas (schema skew) or because engineered data differs between training and serving (feature skew).
  • Label leakage means that the ground truth labels being predicted have inadvertently entered the training features.
  • Numerical stability involves writing tests to check for NaN and Inf values in weights and layer outputs, and testing that more than half of the outputs of a layer are not zero (a minimal version of such a check is sketched below).
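
A minimal version of the numerical stability checks described in the last bullet, using NumPy on illustrative arrays:

```python
import numpy as np

def check_numerical_stability(weights, layer_output):
    # No NaN or Inf values anywhere in the weights or layer outputs.
    assert np.all(np.isfinite(weights)), "NaN/Inf found in weights"
    assert np.all(np.isfinite(layer_output)), "NaN/Inf found in layer outputs"
    # More than half of the layer's outputs should be non-zero.
    nonzero_fraction = np.count_nonzero(layer_output) / layer_output.size
    assert nonzero_fraction > 0.5, "more than half of the outputs are zero"

# Illustrative arrays; in practice these come from the trained model.
check_numerical_stability(
    weights=np.array([0.1, -0.3, 0.7]),
    layer_output=np.array([0.4, 0.0, 1.2, 0.9]),
)
```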

Live model quality testing uses methods such as human labelling and statistical analysis to ensure ongoing model effectiveness in real-world scenarios.

Implementing proper randomisation through deterministic data generation enables reproducible experiments and consistent analysis.

Maintaining invariant hashing ensures that data splits remain consistent across experiments, contributing to reliable analysis and model evaluation.
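
Invariant hashing is often implemented by hashing a stable example ID into a bucket, as in this sketch (hashlib is standard-library Python; the ID scheme is illustrative):

```python
import hashlib

def assign_split(example_id: str, test_fraction: float = 0.2) -> str:
    """Hash a stable ID so an example lands in the same split every time."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # deterministic bucket in [0, 100)
    return "test" if bucket < test_fraction * 100 else "train"

assert assign_split("user_12345") == assign_split("user_12345")  # invariant
```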

Source: Production ML systems: Monitoring pipelines | Machine Learning | Google for Developers

Questions to ask

Continuously monitor models in production to evaluate feature importance and potentially remove unnecessary features, ensuring prediction quality and resource efficiency.

  • Regularly assess whether features are genuinely helpful and whether their value outweighs the cost of inclusion.

Data reliability is crucial. Consider data source stability, potential changes in upstream data processes, and the creation of local data copies to control versioning and mitigate risks.

Be aware of feedback loops, where a model's predictions influence future input data, potentially leading to unexpected behaviour or biased outcomes, especially in interconnected systems.

Source: Production ML systems: Questions to ask | Machine Learning | Google for Developers

Automated machine learning

Introduction

AutoML automates tasks in the machine learning workflow, such as data engineering (feature selection and engineering), training (algorithm selection and hyperparameter tuning), and analysis, making model building faster and easier.

[Image: ml-workflow.png]

While manual training involves writing code and iteratively adjusting it, AutoML reduces repetitive work and the need for specialised skills.

Source: Automated Machine Learning (AutoML) | Google for Developers

Benefits and limitations

Benefits:

  • To save time.
  • To improve the quality of an ML model.
  • To build an ML model without needing specialised skills.
  • To smoke test a dataset. AutoML can give quick baseline estimates of whether a dataset has enough signal relative to noise.
  • To evaluate a dataset. AutoML can help determine which features may be worth using.
  • To enforce best practices. Automation includes built-in support for applying ML best practices.

Limitations:

  • Model quality may not match that of manual training.
  • Model search and complexity can be opaque. Models generated with AutoML are difficult to reproduce manually.
  • Multiple AutoML runs may show greater variance.
  • Models cannot be customised during training.

Large amounts of data are generally required for AutoML, although specialised systems using transfer learning (taking a model trained on one task and adapting its learned representations to a different but related task) can reduce this requirement.

AutoML suits teams with limited ML experience or those seeking productivity gains without customisation needs. Custom (manual) training suits cases where model quality and customisation matter most.

Source: AutoML: Benefits and limitations | Machine Learning | Google for Developers

Getting started

AutoML tools fall into two categories:

  • Tools that require no coding.
  • API and CLI tools.

The AutoML workflow follows steps similar to traditional machine learning, including problem definition, data gathering, preparation, model development, evaluation, and potential retraining.

  • Some AutoML systems also support model deployment.

Data preparation is crucial for AutoML and involves labelling, cleaning and formatting data, and applying feature transformations.

No-code AutoML tools guide users through model development with steps such as data import, analysis, refinement, and configuration of run parameters before starting the automated training process.

  • Users still need to carry out semantic checks to select the appropriate semantic type for each feature (for example recognising that postal codes are categorical rather than numeric), and to set transformations accordingly.

Source: AutoML: Getting started | Machine Learning | Google for Developers

Fairness

Introduction

Before putting a model into production, it is critical to audit training data and evaluate predictions for bias.

Source: Fairness | Machine Learning | Google for Developers

Types of bias

Machine learning models can be susceptible to bias due to human involvement in data selection and curation.

Understanding common human biases is crucial for mitigating their impact on model predictions.

Types of bias include reporting bias, historical bias, automation bias, selection bias, coverage bias, non-response bias, sampling bias, group attribution bias (in-group bias and out-group homogeneity bias), implicit bias, confirmation bias, and experimenter's bias, among others.

Source: Fairness: Types of bias | Machine Learning | Google for Developers

Identifying bias

Missing or unexpected feature values in a dataset can indicate potential sources of bias.

Data skew, where certain groups are under- or over-represented, can introduce bias and should be addressed.

Evaluating model performance by subgroup ensures fairness and equal performance across different characteristics.

Source: Fairness: Identifying bias | Machine Learning | Google for Developers

Mitigating bias

Machine learning engineers use two primary strategies to mitigate bias in models:

  • Augmenting training data.
  • Adjusting the model's loss function.

Augmenting training data involves collecting additional data to address missing, incorrect, or skewed data, but it can be infeasible due to data availability or resource constraints.

Adjusting the model's loss function involves using fairness-aware optimisation functions rather than the common default log loss.

The TensorFlow Model Remediation Library provides optimisation functions designed to penalise errors in a fairness-aware manner:

  • MinDiff aims to balance errors between different data slices by penalising differences in prediction distributions.
  • Counterfactual Logit Pairing (CLP) penalises discrepancies in predictions for similar examples with different sensitive attribute values.

Source: Fairness: Mitigating bias | Machine Learning | Google for Developers

Evaluating for bias

Aggregate model performance metrics such as precision, recall, and accuracy can hide biases against minority groups.

Fairness in model evaluation involves ensuring equitable outcomes across different demographic groups.

Fairness metrics can help assess model predictions for bias.

  • Demographic parity
  • Equality of opportunity
  • Counterfactual fairness

Candidate pool of 100 students: 80 students belong to the majority group (blue), and 20 students belong to the minority group (orange):

[Image: fairness_metrics_candidate_pool.png]

Source: Fairness: Evaluating for bias | Machine Learning | Google for Developers

Demographic parity

Demographic parity aims to ensure equal acceptance rates for majority and minority groups, regardless of individual qualifications.

Both the majority (blue) and minority (orange) groups have an acceptance rate of 20%:

[Image: fairness_metrics_demographic_parity.png]

While demographic parity promotes equal representation, it can overlook differences in individual qualifications within each group, potentially leading to unfair outcomes.

Qualified students in both groups are shaded in green, and qualified students who were rejected are marked with an X:

[Image: fairness_metrics_demographic_parity_by_qualifications.png]

Majority acceptance rate = Qualified majority accepted / Qualified majority = 16/35 = 46%

Minority acceptance rate = Qualified minority accepted / Qualified minority = 4/15 = 27%

When the distribution of a preferred label (“qualified”) differs substantially between groups, demographic parity may not be the most appropriate fairness metric.

There may be additional benefits/drawbacks of demographic parity not discussed here that are also worth considering.

Source: Fairness: Demographic parity | Machine Learning | Google for Developers

Equality of opportunity

Equality of opportunity focuses on ensuring that qualified individuals have an equal chance of acceptance, regardless of demographic group.

Qualified students in both groups are shaded in green:

[Image: fairness_metrics_equality_of_opportunity_by_qualifications.png]

Majority acceptance rate = Qualified majority accepted / Qualified majority = 14/35 = 40%

Minority acceptance rate = Qualified minority accepted / Qualified minority = 6/15 = 40%

Equality of opportunity has limitations, including reliance on a clearly defined preferred label and challenges in settings that lack demographic data.

It is possible for a model to satisfy both demographic parity and equality of opportunity under specific conditions where positive prediction rates and true positive rates align across groups.
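
As a toy illustration of both metrics (the arrays are invented, not the course's candidate pool): demographic parity compares raw acceptance rates per group, while equality of opportunity compares acceptance rates among qualified candidates only.

```python
import numpy as np

group     = np.array(["blue", "blue", "blue", "orange", "orange", "orange"])
qualified = np.array([1, 0, 1, 1, 1, 0])
accepted  = np.array([1, 0, 0, 1, 0, 0])

for g in ("blue", "orange"):
    mask = group == g
    acceptance_rate = accepted[mask].mean()         # demographic parity
    tpr = accepted[mask & (qualified == 1)].mean()  # equality of opportunity
    print(g, acceptance_rate, tpr)
```

In this invented example both groups happen to have the same acceptance rate and the same true positive rate, which is exactly the condition under which both metrics are satisfied at once.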

Source: Fairness: Equality of opportunity | Machine Learning | Google for Developers

Counterfactual fairness

Counterfactual fairness evaluates fairness by comparing predictions for similar individuals who differ only in a sensitive attribute such as demographic group.

This metric is particularly useful when datasets lack complete demographic information for most examples but contain it for a subset.

Candidate pool, with demographic group membership unknown for most candidates (icons shaded in grey):

[Image: fairness_metrics_counterfactual_satisfied.png]

Counterfactual fairness may not capture broader systemic biases across subgroups. Other fairness metrics, such as demographic parity and equality of opportunity, provide a more holistic view but may require complete demographic data.

Summary

Selecting the appropriate fairness metric depends on the specific application and desired outcome, with no single “right” metric universally applicable.

For example, if the goal is to achieve equal representation, demographic parity may be the optimal metric. If the goal is to achieve equal opportunity, equality of opportunity may be the best metric.

Some definitions of fairness are mutually incompatible.

Source: Fairness: Counterfactual fairness | Machine Learning | Google for Developers

 

from Stefan Angrick

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This third module covers advanced ML model architectures.

Neural networks

Introduction

Neural networks are a model architecture designed to automatically identify non-linear patterns in data, eliminating the need for manual feature cross experimentation.

Source: Neural networks | Machine Learning | Google for Developers

Nodes and hidden layers

In neural network terminology, additional layers between the input layer and the output layer are called hidden layers, and the nodes in these layers are called neurons.

[Image: HiddenLayerBigPicture.png]

Source: Neural networks: Nodes and hidden layers | Machine Learning | Google for Developers

Activation functions

Each neuron in a neural network performs the following two-step action:

  • Calculates the weighted sum of input values.
  • Applies an activation function to that sum.

Common activation functions include sigmoid, tanh, and ReLU.

The sigmoid function maps input x to an output value between 0 and 1: $$ F(x) = \frac{1}{1 + e^{-x}} $$

[Image: sigmoid.png]

The tanh function (short for “hyperbolic tangent”) maps input x to an output value between -1 and 1: $$ F(x) = \tanh{(x)} $$

[Image: tanh.png]

The rectified linear unit activation function (or ReLU, for short) applies a simple rule:

  • If the input value is less than 0, return 0.
  • If the input value is greater than or equal to 0, return the input value.

$$ F(x) = \max{(0,x)} $$

ReLU often outperforms sigmoid and tanh because it reduces vanishing gradient issues and requires less computation.

[Image: relu.png]
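
The three activation functions above fit in a few lines of NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # output in (0, 1)

def tanh(x):
    return np.tanh(x)                # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)        # 0 for negative inputs, x otherwise

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), tanh(x), relu(x))
```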

A neural network consists of:

  • A set of nodes, analogous to neurons, organised in layers.
  • A set of learned weights and biases connecting layers.
  • Activation functions that transform each node's output, which may differ across layers.

Source: Neural networks: Activation functions | Machine Learning | Google for Developers

Training using backpropagation

Backpropagation is the primary training algorithm for neural networks. It calculates how much each weight and bias in the network contributed to the overall prediction error by applying the chain rule of calculus. It works backwards from the output layer to tell the gradient descent algorithm which equations to adjust to reduce loss.

In practice, this involves a forward pass, where the network makes a prediction and the loss function measures the error, followed by a backward pass that propagates that error back through the layers to compute gradients for each parameter.
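
A worked example for a single linear neuron with squared loss shows the chain-rule bookkeeping that backpropagation automates (all numbers are invented for illustration):

```python
# One linear neuron with squared loss: forward pass, then gradients
# via the chain rule.
x, y_true = 2.0, 1.0           # single training example (illustrative)
w, b = 0.5, 0.1                # current parameters

# Forward pass.
y_pred = w * x + b             # prediction: 1.1
loss = (y_pred - y_true) ** 2  # squared error: 0.01

# Backward pass (chain rule): dL/dw = dL/dy * dy/dw, dL/db = dL/dy * dy/db.
dL_dy = 2.0 * (y_pred - y_true)  # 0.2
dL_dw = dL_dy * x                # 0.4
dL_db = dL_dy * 1.0              # 0.2

# One gradient descent step.
learning_rate = 0.1
w -= learning_rate * dL_dw       # 0.5 - 0.04 = 0.46
b -= learning_rate * dL_db       # 0.1 - 0.02 = 0.08
```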

Best practices for neural network training:

  • Vanishing gradients occur when gradients in earlier layers become very small, slowing or stalling training, and can be mitigated by using the ReLU activation function.
  • Exploding gradients happen when large weights cause excessively large gradients in early layers, disrupting convergence, and can be addressed with batch normalisation or by lowering the learning rate.
  • Dead ReLU units emerge when a ReLU unit's output gets stuck at 0, halting gradient flow during backpropagation, and can be avoided by lowering the learning rate or using ReLU variants like LeakyReLU.
  • Dropout regularisation is a technique to prevent overfitting by randomly dropping unit activations in a network for a single gradient step, with higher dropout rates indicating stronger regularisation (0 = no regularisation, 1 = drop out all nodes).

Source: Neural Networks: Training using backpropagation | Machine Learning | Google for Developers

Multi-class classification

Multi-class classification models predict from multiple possibilities (binary classification models predict just two).

Multi-class classification can be achieved through two main approaches:

  • One-vs.-all
  • One-vs.-one (softmax)

One-vs.-all uses multiple binary classifiers, one for each possible outcome, to determine the probability of each class independently.

[Image: one_vs_all_binary_classifiers.png]

This approach is fairly reasonable when the total number of classes is small.

We can create a more efficient one-vs.-all model with a deep neural network in which each output node represents a different class.

[Image: one_vs_all_neural_net.png]

Note that the probabilities do not sum to 1. With a one-vs.-all approach, the probability of each binary set of outcomes is determined independently of all the other sets (the sigmoid function is applied to each output node independently).

One-vs.-one (softmax) predicts probabilities of each class relative to all other classes, ensuring all probabilities sum to 1 using the softmax function in the output layer. It assigns decimal probabilities to each class such that all probabilities add up to 1.0. This additional constraint helps training converge more quickly.

Note that the softmax layer must have the same number of nodes as the output layer.

[Image: one_vs_one_neural_net.png]

The softmax formula extends logistic regression to multiple classes: $$ p(y = j|\textbf{x}) = \frac{e^{(\textbf{w}_j^{T}\textbf{x} + b_j)}}{\sum_{k\in K} e^{(\textbf{w}_k^{T}\textbf{x} + b_k)}} $$

Full softmax is fairly cheap when the number of classes is small but can become computationally expensive with many classes.
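
A NumPy sketch of the formula above, contrasted with the independent sigmoid scores used by one-vs.-all:

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())  # shift for numerical stability
    return exp / exp.sum()

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits).sum())  # sums to 1: classes compete
print(sigmoid(logits).sum())  # generally != 1: each class scored independently
```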

Candidate sampling offers an alternative for increased efficiency. It computes probabilities for all positive labels but only a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we do not have to provide probabilities for every non-dog example.

One label versus many labels

Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For multi-label problems, use multiple independent logistic regressions instead.

Example: To classify dog breeds from images, including mixed-breed dogs, use one-vs.-all, since it predicts each breed independently and can assign high probabilities to multiple breeds, unlike softmax, which forces probabilities to sum to 1.

Source: Neural networks: Multi-class classification | Machine Learning | Google for Developers

Embeddings

Introduction

Embeddings are lower-dimensional representations of sparse data that address problems associated with one-hot encodings.

A one-hot encoded feature “meal” of 5,000 popular meal items:

[Image: food_images_one_hot_encodings.png]

This representation of data has several problems:

  • Large input vectors mean a huge number of weights for a neural network.
  • The more weights in your model, the more data you need to train effectively.
  • The more weights, the more computation required to train and use the model.
  • The more weights in your model, the more memory is needed on the accelerators that train and serve it.
  • Poor suitability for on-device machine learning (ODML).

Embeddings, lower-dimensional representations of sparse data, address these issues.

Source: Embeddings | Machine Learning | Google for Developers

Embedding space and static embeddings

Embeddings are low-dimensional representations of high-dimensional data, often used to capture semantic relationships between items.

Embeddings place similar items closer together in the embedding space, allowing for efficient machine learning on large datasets.

Example of a 1D embedding of a sparse feature vector representing meal items:

[Image: embeddings_1D.png]

2D embedding:

[Image: embeddings_2D.png]

3D embedding:

[Image: embeddings_3D_tangyuan.png]

Distances in the embedding space represent relative similarity between items.

Real-world embeddings can encode complex relationships, such as those between countries and their capitals, allowing models to detect patterns.

In practice, embedding spaces have many more than three dimensions, although far fewer than the original data, and the meaning of individual dimensions is often unclear.

Embeddings usually are task-specific, but one task with broad applicability is predicting the context of a word.

Static embeddings like word2vec represent all meanings of a word with a single point, which can be a limitation in some cases. When each word or data point has a single embedding vector, this is called a static embedding.

word2vec can refer both to an algorithm for obtaining static word embeddings and to a set of word vectors that were pre-trained with that algorithm.

Source: Embeddings: Embedding space and static embeddings | Machine Learning | Google for Developers

Obtaining embeddings

Embeddings can be created using dimensionality reduction techniques such as PCA or by training them as part of a neural network.

Training an embedding within a neural network allows customisation for specific tasks, where the embedding layer learns optimal weights to represent data in a lower-dimensional space, but it may take longer than training the embedding separately.

In general, you can create a hidden layer of size d in your neural network that is designated as the embedding layer, where d represents both the number of nodes in the hidden layer and the number of dimensions in the embedding space.

[Image: one_hot_hot_dog_embedding.png]
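
A minimal Keras sketch of this setup (layer sizes are illustrative; tf.keras.layers.Embedding is the standard embedding layer):

```python
import tensorflow as tf

vocab_size = 5000  # e.g. 5,000 meal items (illustrative)
d = 3              # embedding dimensions = nodes in the embedding layer

model = tf.keras.Sequential([
    # Learns one d-dimensional vector per item as part of normal training.
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=d),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # downstream task head
])
```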

Word embeddings, such as word2vec, leverage the distributional hypothesis to map semantically similar words to geometrically close vectors. However, such static word embeddings have limitations because they assign a single representation per word.

Contextual embeddings offer multiple representations based on context. For example, “orange” would have a different embedding for every unique sentence containing the word in the dataset (as it could be used as a colour or a fruit).

Contextual embeddings encode positional information, while static embeddings do not. Because contextual embeddings incorporate positional information, one token can have multiple contextual embedding vectors. Static embeddings allow only a single representation of each token.

Methods for creating contextual embeddings include ELMo, BERT, and transformer models with a self-attention layer.

Source: Embeddings: Obtaining embeddings | Machine Learning | Google for Developers

Large language models

Introduction

A language model estimates the probability of a token or sequence of tokens given surrounding text, enabling tasks such as text generation, translation, and summarisation.

Tokens, the atomic units of language modelling, represent words, subwords, or characters and are crucial for understanding and processing language.

Example: “unwatched” would be split into three tokens: un (the prefix), watch (the root), ed (the suffix).

N-grams are ordered sequences of words used to build language models, where N is the number of words in the sequence.

Short N-grams capture too little information, while very long N-grams fail to generalise due to insufficient repeated examples in training data (sparsity issues).

Recurrent neural networks improve on N-grams by processing sequences token by token and learning which past information to retain or discard, allowing them to model longer dependencies across sentences and gain more context.

  • Note that training recurrent neural networks for long contexts is constrained by the vanishing gradient problem.

Model performance depends on training data size and diversity.

While recurrent neural networks improve context understanding compared to N-grams, they have limitations, paving the way for the emergence of large language models that evaluate the whole context simultaneously.

Source: Large language models | Machine Learning | Google for Developers

What's a large language model?

Large language models (LLMs) predict sequences of tokens and outperform previous models because they use far more parameters and exploit much wider context.

Transformers form the dominant architecture for LLMs and typically combine an encoder that converts input text into an intermediate representation with a decoder that generates output text, for example translating between languages.

[Image: TransformerBasedTranslator.png]

Partial transformers

Encoder-only models focus on representation learning and embeddings (which may serve as input for another system), while decoder-only models specialise in generating long sequences such as dialogue or text continuations.

Self-attention allows the model to weigh the importance of different words in relation to each other, enhancing context understanding.

Example: “The animal didn't cross the street because it was too tired.”

The self-attention mechanism determines the relevance of each nearby word to the pronoun “it”. The bluer the line, the more important that word is to the pronoun. As shown, “animal” is more important than “street” to the pronoun “it”.

[Image: Theanimaldidntcrossthestreet.png]

  • Some self-attention mechanisms are bidirectional, meaning they calculate relevance scores for tokens preceding and following the word being attended to. This is useful for generating representations of whole sequences (encoders).
  • By contrast, a unidirectional self-attention mechanism can gather context only from words on one side of the word being attended to. This suits applications that generate sequences token by token (decoders).

Multi-head multi-layer self-attention

Each self-attention layer contains multiple self-attention heads. The output of a layer is a mathematical operation (such as a weighted average or dot product) of the outputs of the different heads.

A complete transformer model stacks multiple self-attention layers. The output from one layer becomes the input for the next, allowing the model to build increasingly complex representations, from basic syntax to more nuanced concepts.

Self-attention is an O(N² · S · D) problem (a minimal single-head sketch follows the list below).

  • N is the number of tokens in the context.
  • S is the number of self-attention layers.
  • D is the number of heads per layer.
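
A minimal single-head, unmasked self-attention sketch in NumPy; the N × N score matrix is where the quadratic term in the cost comes from (shapes and weights are illustrative):

```python
import numpy as np

def softmax(z):
    exp = np.exp(z - z.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head (no masking)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # N x N: every token attends to every token
    return softmax(scores) @ V

rng = np.random.default_rng(0)
N, d = 4, 8                          # 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```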

LLMs are trained using masked predictions on massive datasets, enabling them to learn patterns and generate text based on probabilities. You probably will never train an LLM from scratch.

Instruction tuning can improve an LLM's ability to follow instructions.

Why transformers are so large

This course generally recommends building models with a smaller number of parameters, but research shows that transformers with more parameters consistently achieve better performance.

Text generation

LLMs generate text by repeatedly predicting the most probable next token, effectively acting as highly powerful autocomplete systems. You can think of a user's question to an LLM as the “given” sentence followed by a masked response.

Benefits and problems

While LLMs offer benefits such as clear text generation, they also present challenges.

  • Training an LLM involves gathering enormous training sets, consuming vast computational resources and electricity, and solving parallelism challenges.
  • Using an LLM for inference raises issues such as hallucinations, high computational and electricity costs, and bias.

Source: LLMs: What's a large language model? | Machine Learning | Google for Developers

Fine-tuning, distillation, and prompt engineering

General-purpose LLMs, also known as foundation LLMs, base LLMs, or pre-trained LLMs, are pre-trained on vast amounts of text, enabling them to understand language structure and generate creative content, but they act as platforms rather than complete solutions for tasks such as classification or regression.

Fine-tuning updates the parameters of a model to improve its performance on a specialised task, improving prediction quality.

  • Adapts a foundation LLM to a specific task by training on task-specific examples, often only hundreds or thousands, which improves performance for that task but retains the original model size (same number of parameters) and can still be computationally expensive.
  • Parameter-efficient tuning reduces fine-tuning costs by updating only a subset of model parameters during training rather than all weights and biases.

Distillation aims to reduce model size, typically at the cost of some prediction quality.

  • Distillation compresses an LLM into a smaller student model that runs faster and uses fewer resources, at the cost of some predictive accuracy.
  • It typically uses a large teacher model to label data, often with rich numerical scores rather than simple labels, and trains a smaller student model on those outputs.

Prompt engineering allows users to customise an LLM's output by providing examples or instructions within the prompt, leveraging the model's existing pattern-recognition abilities without changing its parameters.

One-shot, few-shot, and zero-shot prompting differ by how many examples the prompt provides, with more examples usually improving reliability by giving clearer context.

Prompt engineering does not alter the model's parameters. Prompts leverage the pattern-recognition abilities of the existing LLM.

Offline inference pre-computes and caches LLM predictions for tasks where real-time response is not critical, saving resources and enabling the use of larger models.

Responsible use of LLMs requires awareness that models inherit biases from their training and distillation data.

Source: LLMs: Fine-tuning, distillation, and prompt engineering | Machine Learning | Google for Developers

 

from Stefan Angrick

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This second module covers fundamental techniques and best practices for working with machine learning data.

Working with numerical data

Introduction

Numerical data: Integers or floating-point values that behave like numbers. They are additive, countable, ordered, and so on. Examples include temperature, weight, or the number of deer wintering in a nature preserve.

Source: Working with numerical data | Machine Learning | Google for Developers

How a model ingests data with feature vectors

A machine learning model ingests data through floating-point arrays called feature vectors, which are derived from dataset features. Feature vectors often utilise processed or transformed values instead of raw dataset values to enhance model learning.

Example of a feature vector: [0.13, 0.47]

Feature engineering is the process of converting raw data into suitable representations for the model. Common techniques are:

  • Normalization: Converting numerical values into a standard range.
  • Binning (bucketing): Converting numerical values into buckets or ranges.

Non-numerical data like strings must be converted into numerical values for use in feature vectors.

Source: Numerical data: How a model ingests data using feature vectors | Machine Learning | Google for Developers

First steps

Before creating feature vectors, it is crucial to analyse numerical data to detect anomalies and patterns early in the data analysis process. Useful first steps include:

  • Visualising it through plots and graphs (like scatter plots or histograms)
  • Calculating basic statistics like mean, median, standard deviation, or values at the quartile divisions (0th, 25th, 50th, 75th, 100th percentiles, where the 50th percentile is the median)

Outliers, values significantly distant from others, should be identified and handled appropriately.

  • If the outlier is due to a mistake: For example, an experimenter incorrectly entered data, or an instrument malfunctioned. We generally delete examples containing mistake outliers.
  • If the outlier is a legitimate data point: If the model needs to infer good predictions on these outliers, keep them. If not, delete them or apply more invasive feature engineering techniques, such as clipping.

A dataset probably contains outliers when:

  • The delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles
  • The standard deviation is almost as high as the mean

Source: Numerical data: First steps | Machine Learning | Google for Developers

Normalization

Data normalization is crucial for enhancing machine learning model performance by scaling features to a similar range. It is also recommended to normalise a single numeric feature that covers a wide range (for example, city population).

Normalisation has the following benefits:

  • Helps a model converge more quickly.
  • Helps models infer better predictions.
  • Helps avoid the NaN trap (large numerical values exceeding the floating-point precision limit and flipping into NaN values).
  • Helps the model learn appropriate weights (so the model does not pay too much attention to features with wide ranges).
  • Linear scaling: $$x' = \frac{x - x_\text{min}}{x_\text{max} - x_\text{min}}$$ Use when the feature is roughly uniformly distributed across its range (flat-shaped).
  • Z-score scaling: $$x' = \frac{x - \mu}{\sigma}$$ Use when the feature is normally distributed (bell-shaped, peak close to the mean).
  • Log scaling: $$x' = \ln(x)$$ Use when the feature distribution is heavily skewed, with a long tail on one side (heavy-tail-shaped).
  • Clipping: if $$x > \text{max}$$, set $$x' = \text{max}$$; if $$x < \text{min}$$, set $$x' = \text{min}$$. Use when the feature contains extreme outliers.
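
The four techniques in one NumPy sketch (the feature values are illustrative):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 50000.0])

linear  = (x - x.min()) / (x.max() - x.min())  # linear scaling into [0, 1]
z_score = (x - x.mean()) / x.std()             # Z-score scaling
logged  = np.log(x)                            # log scaling for heavy tails
clipped = np.clip(x, None, 1000.0)             # clipping extreme outliers
```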

Source: Numerical data: Normalization | Machine Learning | Google for Developers

Binning

Binning (bucketing) is a feature engineering technique used to group numerical data into categories (bins). In many cases, this turns numerical data into categorical data.

For example, if a feature X has values ranging from 15 to 425, we can apply binning to represent X as a feature vector divided into specific intervals:

Bin number Range Feature vector
1 15-34 [1.0, 0.0, 0.0, 0.0, 0.0]
2 35-117 [0.0, 1.0, 0.0, 0.0, 0.0]
3 118-279 [0.0, 0.0, 1.0, 0.0, 0.0]
4 280-392 [0.0, 0.0, 0.0, 1.0, 0.0]
5 393-425 [0.0, 0.0, 0.0, 0.0, 1.0]

Even though X is a single column in the dataset, binning causes a model to treat X as five separate features. Therefore, the model learns separate weights for each bin.
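
A pandas sketch of the binning in the table above (pd.cut uses half-open intervals, so the edge at 14 keeps the value 15 in bin 1):

```python
import pandas as pd

X = pd.Series([15, 80, 200, 300, 425])  # illustrative feature values

# Edges matching the table above; pd.cut assigns each value to one bin.
binned = pd.cut(X, bins=[14, 34, 117, 279, 392, 425], labels=[1, 2, 3, 4, 5])

# One-hot encode the bins: the model now learns one weight per bin.
feature_vectors = pd.get_dummies(binned)
```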

Binning offers an alternative to scaling or clipping and is particularly useful for handling outliers and improving model performance on non-linear data.

When to use: Binning works well when features exhibit a “clumpy” distribution, that is, the overall linear relationship between the feature and label is weak or nonexistent, or when feature values are clustered.

Example: Number of shoppers versus temperature. By binning temperature, the model learns separate weights for each bin.

[Image: binning_temperature_vs_shoppers_divided_into_3_bins.png]

While creating multiple bins is possible, it is generally recommended to avoid an excessive number, as this can lead to insufficient training examples per bin and increased feature dimensionality.

Quantile bucketing is a specific binning technique that ensures each bin contains a roughly equal number of examples, which can be particularly useful for datasets with skewed distributions.

  • Quantile buckets give extra information space to the large torso while compacting the long tail into a single bucket.
  • Equal intervals give extra information space to the long tail while compacting the large torso into a single bucket.

[Image: QuantileBucketing.png]

Source: Numerical data: Binning | Machine Learning | Google for Developers

Scrubbing

Common problem categories, with an example of each:

  • Omitted values: A census taker fails to record a resident's age.
  • Duplicate examples: A server uploads the same logs twice.
  • Out-of-range feature values: A human accidentally types an extra digit.
  • Bad labels: A human evaluator mislabels a picture of an oak tree as a maple.

You can use programs or scripts to identify and handle data issues such as omitted values, duplicates, and out-of-range feature values by removing or correcting them.

Source: Numerical data: Scrubbing | Machine Learning | Google for Developers

Qualities of good numerical features

  • Good feature vectors require features that are clearly named and have obvious meanings to anyone on the project.
  • Data should be checked and tested for bad data or outliers, such as inappropriate values, before being used for training.
  • Features should be sensible, avoiding “magic values” that create discontinuities (for example, setting the value “watch_time_in_seconds” to -1 to indicate an absence of measurement); instead, use separate boolean features or new discrete values to indicate missing data.

Source: Numerical data: Qualities of good numerical features | Machine Learning | Google for Developers

Polynomial transformations

Synthetic features, such as polynomial transforms, enable linear models to represent non-linear relationships by introducing new features based on existing ones.

By incorporating synthetic features, linear regression models can effectively separate data points that are not linearly separable, using curves instead of straight lines. For example, we can separate two classes with y = x^2.

[Image: ft_cross1.png]

Feature crosses, a related concept for categorical data, synthesise new features by combining existing features, further enhancing model flexibility.

Source: Numerical data: Polynomial transforms | Machine Learning | Google for Developers

Working with categorical data

Introduction

Categorical data has a specific set of possible values. Examples include species of animals, names of streets, whether or not an email is spam, and binned numbers.

Categorical data can include numbers that behave like categories. An example is postal codes.

  • Genuinely numerical data can be meaningfully multiplied or averaged; values such as postal codes cannot.
  • Integer values that behave like labels rather than quantities should therefore be represented as categorical data.

Encoding means converting categorical or other data to numerical vectors that a model can train on.

Preprocessing includes converting non-numerical data, such as strings, to floating-point values.

Source: Working with categorical data | Machine Learning | Google for Developers

Vocabulary and one-hot encoding

Machine learning models require numerical input; therefore, categorical data such as strings must be converted to numerical representations.

The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low dimensional. For example:

Feature name # of categories Sample categories
snowed_today 2 True, False
skill_level 3 Beginner, Practitioner, Expert
season 4 Winter, Spring, Summer, Autumn
dayofweek 7 Monday, Tuesday, Wednesday
planet 8 Mercury, Venus, Earth
car_colour 8 Red, Orange, Blue, Yellow

When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. This treats each category as a separate feature, allowing the model to learn distinct weights for each during training.

One-hot encoding transforms categorical values into numerical vectors (arrays) of N elements, where N is the number of categories. Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.

Feature Red Orange Blue Yellow Green Black Purple Brown
“Red” 1 0 0 0 0 0 0 0
“Orange” 0 1 0 0 0 0 0 0
“Blue” 0 0 1 0 0 0 0 0
“Yellow” 0 0 0 1 0 0 0 0
“Green” 0 0 0 0 1 0 0 0
“Black” 0 0 0 0 0 1 0 0
“Purple” 0 0 0 0 0 0 1 0
“Brown” 0 0 0 0 0 0 0 1

It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The model learns a separate weight for each element of the feature vector.

The end-to-end process to map categories to feature vectors: vocabulary-index-sparse-feature.png

In a true one-hot encoding, only one element has the value 1.0. In a variant known as multi-hot encoding, multiple values can be 1.0.

A feature whose values are predominantly zero (or empty) is termed a sparse feature.

Sparse representation efficiently stores one-hot encoded data by only recording the position of the '1' value to reduce memory usage.

  • For example, the one-hot vector for “car_colour” “Blue” is: [0, 0, 1, 0, 0, 0, 0, 0].
  • Since the 1 is in position 2 (when starting the count at 0), the sparse representation is: 2.

Notice that the sparse representation consumes far less memory. Importantly, the model must train on the one-hot vector, not the sparse representation.

The sparse representation of a multi-hot encoding stores the positions of all the non-zero elements. For example, the sparse representation of a car that is both “Blue” and “Black” is 2, 5.
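
A pandas sketch of the vocabulary, one-hot, and sparse representations above (a fixed category order is used so positions match the table):

```python
import pandas as pd

order = ["Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"]
colours = pd.Series(pd.Categorical(["Blue", "Black"], categories=order))

one_hot = pd.get_dummies(colours)       # one column per category, in order
sparse = one_hot.values.argmax(axis=1)  # positions of the 1s: [2, 5]
```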

Categorical features can have outliers. If “car_colour” includes rare values such as “Mauve” or “Avocado”, you can group them into one out-of-vocabulary (OOV) category. All rare colours go into this single bucket, and the model learns one weight for it.

For high-dimensional categorical features with many categories, one-hot encoding might be inefficient, and embeddings or hashing (also called the hashing trick) are recommended.

  • For example, a feature like “words_in_english” has around 500,000 categories.
  • Embeddings substantially reduce the number of dimensions, which helps the model train faster and infer predictions more quickly.

Source: Categorical data: Vocabulary and one-hot encoding | Machine Learning | Google for Developers

Common issues with categorical data

Categorical data quality hinges on how categories are defined and labelled, impacting data reliability.

Human-labelled data, known as “gold labels”, is generally preferred for training due to its higher quality, but it is essential to check for human errors and biases.

  • Any two human beings may label the same example differently. The difference between human raters' decisions is called inter-rater agreement.
  • Inter-rater agreement can be measured using kappa and intra-class correlation (Hallgren, 2012), or Krippendorff's alpha (Krippendorff, 2011).

Machine-labelled data, or “silver labels”, can introduce biases or inaccuracies, necessitating careful quality checks and awareness of potential common-sense violations.

  • For example, a computer-vision model might mislabel a photo of a chihuahua as a muffin, or a photo of a muffin as a chihuahua.
  • Similarly, a sentiment analyser that scores neutral words as -0.25, when 0.0 is the neutral value, might be scoring all words with an additional negative bias.

High dimensionality in categorical data increases training complexity and costs, leading to techniques such as embeddings for dimensionality reduction.

Source: Categorical data: Common issues | Machine Learning | Google for Developers

Feature crosses

Feature crosses are created by combining two or more categorical or bucketed features to capture interactions and non-linearities within a dataset.

For example, consider a leaf dataset with the categorical features:

  • “edges”, containing values {smooth, toothed, lobed}
  • “arrangement”, containing values {opposite, alternate}

The feature cross, or Cartesian product, of these two features would be:

{Smooth_Opposite, Smooth_Alternate, Toothed_Opposite, Toothed_Alternate, Lobed_Opposite, Lobed_Alternate}

For example, if a leaf has a lobed edge and an alternate arrangement, the feature-cross vector will have a value of 1 for “Lobed_Alternate”, and a value of 0 for all other terms:

{0, 0, 0, 0, 0, 1}

This dataset could be used to classify leaves by tree species, since these characteristics do not vary within a species.
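
A small sketch of the leaf-feature cross using Python's itertools, matching the Cartesian product above:

```python
import itertools

edges = ["Smooth", "Toothed", "Lobed"]
arrangement = ["Opposite", "Alternate"]

# Cartesian product: one synthetic feature per (edge, arrangement) pair.
cross = [f"{e}_{a}" for e, a in itertools.product(edges, arrangement)]

def encode(edge, arr):
    """One-hot vector over the crossed vocabulary."""
    return [1 if c == f"{edge}_{arr}" else 0 for c in cross]

print(encode("Lobed", "Alternate"))  # 1 only in the Lobed_Alternate slot
```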

Feature crosses are somewhat analogous to polynomial transforms.

Feature crosses can be particularly effective when guided by domain expertise. It is often possible, though computationally expensive, to use neural networks to automatically find and apply useful feature combinations during training.

Overuse of feature crosses with sparse features should be avoided, as it can lead to excessive sparsity in the resulting feature set. For example, if feature A is a 100-element sparse feature and feature B is a 200-element sparse feature, a feature cross of A and B yields a 20,000-element sparse feature.

Source: Categorical data: Feature crosses | Machine Learning | Google for Developers

Datasets, generalization, and overfitting

Introduction

  • Data quality impacts model performance more than the choice of algorithm does.
  • Machine learning practitioners typically dedicate a substantial portion of their project time (around 80%) to data preparation and transformation, including tasks such as dataset construction and feature engineering.

Source: Datasets, generalization, and overfitting | Machine Learning | Google for Developers

Data characteristics

A machine learning model's performance is heavily reliant on the quality and quantity of the dataset it is trained on, with larger, high-quality datasets generally leading to better results.

Datasets can contain various data types, including numerical, categorical, text, multimedia, and embedding vectors, each requiring specific handling for optimal model training.

The following are common causes of unreliable data in datasets:

  • Omitted values
  • Duplicate examples
  • Bad feature values
  • Bad labels
  • Bad sections of data

Maintaining data quality involves addressing issues such as label errors, noisy features, and proper filtering to ensure the reliability of the dataset for accurate predictions.

Incomplete examples with missing feature values should be handled by either deletion or imputation to avoid negatively impacting model training.

When imputing missing values, use reliable methods such as mean/median imputation and consider adding an indicator column to signal imputed values to the model. For example, alongside temperature include “temperature_is_imputed”. This lets the model learn to trust real observations more than imputed ones.
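A minimal sketch of this approach with scikit-learn's SimpleImputer, whose add_indicator option appends exactly such a flag column (the temperature values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

temperature = np.array([[18.0], [np.nan], [21.5], [np.nan], [19.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
transformed = imputer.fit_transform(temperature)

# Column 0: temperature, with missing values replaced by the mean.
# Column 1: the "temperature_is_imputed" flag (1.0 where the value was missing).
print(transformed)
```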

Source: Datasets: Data characteristics | Machine Learning | Google for Developers

Labels

Direct labels are generally preferred but often unavailable.

  • Direct labels exactly match the prediction target and appear explicitly in the dataset, such as a “bicycle_owner” column for predicting bicycle ownership.
  • Proxy labels approximate the target and correlate with it, such as a bicycle magazine subscription as a signal of bicycle ownership.

Use a proxy label when no direct label exists or when the direct concept resists easy numeric representation. Carefully evaluate proxy labels to ensure they are a suitable approximation.

Human-generated labels, while offering flexibility and nuanced understanding, can be expensive to produce and prone to errors, requiring careful quality control.

Models can train on a mix of automated and human-generated labels, but an extra set of human labels often adds complexity without sufficient benefit.

Source: Datasets: Labels | Machine Learning | Google for Developers

Imbalanced datasets

Imbalanced datasets occur when one label (majority class) is significantly more frequent than another (minority class), potentially hindering model training on the minority class.

Note: Accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset.

A highly imbalanced floral dataset containing far more sunflowers (200) than roses (2): FloralDataset200Sunflowers2Roses.png

During training, a model should learn two things:

  • What each class looks like, that is, what feature values correspond to which class.
  • How common each class is, that is, what the relative distribution of the classes is.

Standard training conflates these two goals. In contrast, a two-step technique of downsampling and upweighting the majority class separates these two goals, enabling the model to achieve both.

Step 1: Downsample the majority class by training on only a small fraction of majority class examples, which makes an imbalanced dataset more balanced during training and increases the chance that each batch contains enough minority examples.

For example, with a class-imbalanced dataset consisting of 99% majority class and 1% minority class examples, we could downsample the majority class by a factor of 25 to create a more balanced training set (80% majority class and 20% minority class).

Downsampling the majority class by a factor of 25: FloralDatasetDownsampling.png

Step 2: Upweight the downsampled majority class by the same factor used for downsampling, so each majority class error counts proportionally more during training. This corrects the artificial class distribution and bias introduced by downsampling, because the training data no longer reflects real-world frequencies.

Continuing the example from above, we must upweight the majority class by a factor of 25. That is, when the model mistakenly predicts the majority class, treat the loss as if it were 25 errors (multiply the regular loss by 25).

Upweighting the majority class by a factor of 25: FloralDatasetUpweighting.png
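A minimal sketch of both steps with per-example weights, mirroring the 99%/1% example above (counts and names are illustrative; most training APIs accept such weights, e.g. as sample_weight in scikit-learn):

```python
import numpy as np

rng = np.random.default_rng(0)
majority = np.zeros(9900)  # 99% majority-class labels
minority = np.ones(100)    # 1% minority-class labels

factor = 25

# Step 1: downsample -- keep roughly 1 in 25 majority-class examples.
majority_down = majority[rng.random(majority.size) < 1 / factor]

labels = np.concatenate([majority_down, minority])  # now roughly 80% / 20%

# Step 2: upweight the downsampled class by the same factor, so one
# majority-class error counts as 25 during training.
weights = np.where(labels == 0, float(factor), 1.0)

print(f"majority kept: {majority_down.size}, minority: {minority.size}")
```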

Experiment with different downsampling and upweighting factors just as you would experiment with other hyperparameters.

Benefits of this technique include a better model (the resultant model knows what each class looks like and how common each class is) and faster convergence.

Source: Datasets: Class-imbalanced datasets | Machine Learning | Google for Developers

Dividing the original dataset

Machine learning models should be tested against unseen data.

It is recommended to split the dataset into three subsets: training, validation, and test sets. PartitionThreeSets.png

The validation set is used for initial testing during training (to determine hyperparameter tweaks, add, remove, or transform features, and so on), and the test set is used for final evaluation. workflow_with_validation_set.png

The validation and test sets can “wear out” with repeated use. For this reason, it is a good idea to collect more data to “refresh” the test and validation sets.

A good test set is:

  • Large enough to yield statistically significant results
  • Representative of the dataset as a whole
  • Representative of real-world data the model will encounter (if your model performs poorly on real-world data, determine how your dataset differs from real-life data)
  • Free of duplicates from the training set

In theory, the validation set and test set should contain the same number of examples, or nearly so.
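A minimal sketch of an 80/10/10 three-way split using scikit-learn (the synthetic data is only a stand-in for a real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 examples with one feature and a label.
X = np.arange(1000).reshape(-1, 1)
y = (X.ravel() % 2 == 0).astype(int)

# First carve off 20%, then split that half-and-half into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```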

Source: Datasets: Dividing the original dataset | Machine Learning | Google for Developers

Transforming data

Machine learning models require all data, including features such as street names, to be transformed into numerical (floating-point) representations for training.

Normalisation improves model training by converting existing floating-point features to a constrained range.

When dealing with large datasets, select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions. Safeguard privacy by omitting examples containing personally identifiable information.

Source: Datasets: Transforming data | Machine Learning | Google for Developers

Generalization

Generalisation refers to a model's ability to perform well on new, unseen data.

Source: Generalization | Machine Learning | Google for Developers

Overfitting

Overfitting means creating a model that matches the training set so closely that the model fails to make correct predictions on new data.

Generalization is the opposite of overfitting. That is, a model that generalises well makes good predictions on new data.

An overfit model is analogous to an invention that performs well in the lab but is worthless in the real world. An underfit model is like a product that does not even do well in the lab.

Overfitting can be detected by observing diverging loss curves for training and validation sets on a generalization curve (a graph that shows two or more loss curves). A generalization curve for a well-fit model shows two loss curves that have similar shapes.

Common causes of overfitting include:

  • A training set that does not adequately represent real-life data (or the validation set or test set).
  • A model that is too complex.

Dataset conditions for good generalization include:

  • Examples must be independently and identically distributed, which is a fancy way of saying that your examples cannot influence each other.
  • The dataset is stationary, meaning it does not change significantly over time.
  • The dataset partitions have the same distribution, meaning the examples in the training set, validation set, test set, and real-world data are statistically similar.

Source: Overfitting | Machine Learning | Google for Developers

Model complexity

Simpler models often generalise better to new data than complex models, even if they perform slightly worse on training data.

Occam's Razor favours simpler explanations and models.

Model training should minimise both loss and complexity for optimal performance on new data. $$ \text{minimise}(\text{loss} + \text{complexity}) $$

Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases.

Regularisation techniques help prevent overfitting by penalising model complexity during training.

  • L1 regularisation (also called LASSO) uses model weights to measure model complexity.
  • L2 regularisation (also called ridge regularisation) uses squares of model weights to measure model complexity.

Source: Overfitting: Model complexity | Machine Learning | Google for Developers

L2 regularization

L2 regularisation is a popular regularisation metric to reduce model complexity and prevent overfitting. It uses the following formula: $$ L_2 \text{ regularisation} = w^2_1 + w^2_2 + \ldots + w^2_n $$

It penalises especially large weights.

L2 regularisation encourages weights towards 0, but never pushes them all the way to zero.

A regularisation rate (lambda) controls the strength of regularisation. $$ \text{minimise}(\text{loss} + \lambda \text{ complexity}) $$

  • A high regularisation rate reduces the likelihood of overfitting and tends to produce a histogram of model weights that are normally distributed around 0.
  • A low regularisation rate lowers the influence of regularisation and tends to produce a histogram of model weights with a flat distribution.

Tuning is required to find the ideal regularisation rate.
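A minimal sketch of the penalty and the regularised objective (the weights and lambda value are made up):

```python
import numpy as np

weights = np.array([0.2, -0.5, 5.0, 0.25])
lam = 0.1  # regularisation rate (lambda)

# L2 complexity = w1^2 + w2^2 + ... + wn^2. The single large weight (5.0)
# contributes 25.0 of the 25.3525 total, which is why L2 regularisation
# penalises especially large weights.
l2_penalty = np.sum(weights ** 2)

def regularised_objective(data_loss: float) -> float:
    # minimise(loss + lambda * complexity)
    return data_loss + lam * l2_penalty

print(l2_penalty, regularised_objective(1.0))
```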

Early stopping is an alternative regularisation method that involves ending training before the model fully converges to prevent overfitting. It usually increases training loss but decreases test loss. It is a quick but rarely optimal form of regularisation.

Learning rate and regularisation rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero, while a high regularisation rate pulls weights towards zero. The goal is to find the equilibrium.

Source: Overfitting: L2 regularization | Machine Learning | Google for Developers

Interpreting loss curves

An ideal loss curve looks like this: metric-curve-ideal.png

To improve an oscillating loss curve:

  • Reduce the learning rate.
  • Reduce the training set to a tiny number of trustworthy examples.
  • Check your data against a data schema to detect bad examples, then remove the bad examples from the training set. metric-curve-ex03.png

Possible reasons for a loss curve with a sharp jump include:

  • The input data contains a burst of outliers.
  • The input data contains one or more NaNs (for example, a value caused by a division by zero). metric-curve-ex02.png

Test loss diverges from training loss when:

  • The model overfits the training set. metric-curve-ex01.png

The loss curve gets stuck when:

  • The training set is not shuffled well. metric-curve-ex05.png

Source: Overfitting: Interpreting loss curves | Machine Learning | Google for Developers

 
Read more...

from Stefan Angrick

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This first module covers the fundamentals of building regression and classification models.

Linear regression

Introduction

The linear regression model uses an equation $$ y' = b + w_1x_1 + w_2x_2 + \ldots $$ to represent the relationship between features and the label.

  • y' is the predicted label—the output
  • b is the bias of the model (the y-intercept in algebraic terms), sometimes referred to as w_0
  • w_1 is the weight of the feature (the slope in algebraic terms)
  • x_1 is a feature—the input

y and features x are given. b and w are calculated from training by minimizing the difference between predicted and actual values.
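A minimal sketch of computing a prediction from this equation, with made-up bias and weights:

```python
import numpy as np

b = 0.5                       # bias (y-intercept)
w = np.array([2.0, -1.0])     # one weight per feature
x = np.array([3.0, 4.0])      # one example's feature values

y_pred = b + np.dot(w, x)     # y' = b + w1*x1 + w2*x2 = 0.5 + 6 - 4
print(y_pred)                 # 2.5
```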

Source: Linear regression | Machine Learning | Google for Developers

Loss

Loss is a numerical value indicating the difference between a model's predictions and the actual values.

The goal of model training is to minimize loss, bringing it as close to zero as possible.

Loss type Definition Equation
L1 loss The sum of the absolute values of the difference between the predicted values and the actual values. $$\sum |\text{actual value}-\text{predicted value}|$$
Mean absolute error (MAE) The average of L1 losses across a set of N examples. $$\frac{1}{N}\sum |\text{actual value}-\text{predicted value}|$$
L2 loss The sum of the squared difference between the predicted values and the actual values. $$\sum (\text{actual value}-\text{predicted value})^2$$
Mean squared error (MSE) The average of L2 losses across a set of N examples. $$\frac{1}{N}\sum (\text{actual value}-\text{predicted value})^2$$

The most common methods for calculating loss are Mean Absolute Error (MAE) and Mean Squared Error (MSE), which differ in their sensitivity to outliers.

A model trained with MSE sits closer to the outliers but further away from most of the other data points. model-mse.png

A model trained with MAE is farther from the outliers but closer to most of the other data points. model-mae.png
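A minimal sketch of the four definitions from the table above, computed on made-up values where the last example acts as an outlier:

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0, 20.0])       # 20.0 is the outlier
predicted = np.array([2.5, 5.5, 6.0, 8.0])

l1 = np.sum(np.abs(actual - predicted))        # L1 loss
mae = np.mean(np.abs(actual - predicted))      # Mean absolute error
l2 = np.sum((actual - predicted) ** 2)         # L2 loss
mse = np.mean((actual - predicted) ** 2)       # Mean squared error

# L2/MSE are pulled up far more by the outlier than L1/MAE.
print(l1, mae, l2, mse)  # 14.0 3.5 145.5 36.375
```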

Source: Linear regression: Loss | Machine Learning | Google for Developers

Gradient descent

Gradient descent is an iterative optimisation algorithm used to find the best weights and bias for a linear regression model by minimising the loss function.

  1. Calculate the loss with the current weight and bias.
  2. Determine the direction to move the weights and bias that reduce loss.
  3. Move the weight and bias values a small amount in the direction that reduces loss.
  4. Return to step one and repeat the process until the model can't reduce the loss any further.
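A minimal sketch of the four steps above for a one-feature linear model trained with MSE on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])       # roughly y = 2x + 1

w, b = 0.0, 0.0
learning_rate = 0.05

for step in range(1000):
    y_pred = w * x + b                   # step 1: loss uses current w and b
    error = y_pred - y
    grad_w = 2 * np.mean(error * x)      # step 2: gradient gives the direction
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w          # step 3: move a small amount downhill
    b -= learning_rate * grad_b          # step 4: repeat until loss stops falling

print(round(w, 2), round(b, 2))          # approx 1.94 and 1.15, near 2 and 1
```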

A model is considered to have converged when further iterations do not significantly reduce the loss, indicating it has found the weights and bias that produce the lowest possible loss.

Loss curves visually represent the model's progress during training, showing how the loss decreases over iterations and helping to identify convergence.

Linear models have convex loss functions, ensuring that gradient descent will always find the global minimum, resulting in the best possible model for the given data.

Source: Linear regression: Gradient descent | Google for Developers

Hyperparameters

Hyperparameters, such as learning rate, batch size, and epochs, are external configurations that influence the training process of a machine learning model.

The learning rate determines the step size during gradient descent, impacting the speed and stability of convergence.

  • If the learning rate is too low, the model can take a long time to converge.
  • However, if the learning rate is too high, the model never converges, but instead bounces around the weights and bias that minimise the loss.

Batch size dictates the number of training examples processed before updating model parameters, influencing training speed and noise.

  • When a dataset contains hundreds of thousands or even millions of examples, using the full batch isn't practical.
  • Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent.
    • Stochastic gradient descent uses only a single random example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy.
    • Mini-batch stochastic gradient descent is a compromise between full-batch and SGD. For N number of data points, the batch size can be any number greater than 1 and less than N. The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.

Model trained with SGD: noisy-gradient.png

Model trained with mini-batch SGD: mini-batch-sgd.png

Epochs represent the number of times the entire training dataset is used during training, affecting model performance and training time.

  • For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch.
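A minimal sketch of that bookkeeping: shuffled mini-batches, with 1,000 examples and a batch size of 100 giving 10 iterations per epoch (the gradient update itself is left as a stub):

```python
import numpy as np

num_examples, batch_size, num_epochs = 1000, 100, 3
rng = np.random.default_rng(0)

for epoch in range(num_epochs):
    order = rng.permutation(num_examples)  # choose batches at random each epoch
    for start in range(0, num_examples, batch_size):
        batch = order[start:start + batch_size]
        # ...average the gradients over `batch`, then update weights and bias...
    print(f"epoch {epoch + 1}: {num_examples // batch_size} iterations")
```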

Source: Linear regression: Hyperparameters | Machine Learning | Google for Developers

Logistic regression

Introduction

Logistic regression is a model used to predict the probability of an outcome, unlike linear regression which predicts continuous numerical values.

Logistic regression models output probabilities, which can be used directly or converted to binary categories.

Source: Logistic Regression | Machine Learning | Google for Developers

Calculating a probability with the sigmoid function

A logistic regression model uses a linear equation and the sigmoid function to calculate the probability of an event.

The sigmoid function ensures the output of logistic regression is always between 0 and 1, representing a probability. $$ f(x) = \frac{1}{1 + e^{-x}} $$ sigmoid_function_with_axes.png

Linear component of a logistic regression model: $$ z = b + w_1 x_1 + w_2 x_2 + \ldots + w_N x_N $$ To obtain the logistic regression prediction, the z value is then passed to the sigmoid function, yielding a value (a probability) between 0 and 1: $$ y' = \frac{1}{1+e^{-z}} $$

  • y' is the output of the logistic regression model.
  • z is the linear output (as calculated in the preceding equation).

z is referred to as the log-odds because if you solve the sigmoid function for z you get: $$ z = \log(\frac{y}{1-y}) $$ This is the log of the ratio of the probabilities of the two possible outcomes: y and 1 – y.
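A minimal sketch of the two stages, with made-up weights and inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b = -1.0
w = np.array([0.5, 2.0])
x = np.array([1.0, 0.8])

z = b + np.dot(w, x)          # linear output: the log-odds
y_prob = sigmoid(z)           # probability between 0 and 1
print(round(z, 2), round(y_prob, 3))            # 1.1 0.75

# Solving back: log(y / (1 - y)) recovers z, which is why z is the log-odds.
print(round(np.log(y_prob / (1 - y_prob)), 2))  # 1.1
```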

When the linear equation becomes input to the sigmoid function, it bends the straight line into an s-shape. linear_to_logistic.png

Source: Logistic regression: Calculating a probability with the sigmoid function | Machine Learning | Google for Developers

Loss and regularisation

Logistic regression models are trained similarly to linear regression models but use Log Loss instead of squared loss and require regularisation.

Log Loss is used in logistic regression because the sigmoid's rate of change isn't constant: predictions approaching 0 or 1 need far greater precision than the squared loss used in linear regression can capture.

The Log Loss equation returns the logarithm of the magnitude of the change, rather than just the distance from data to prediction. Log Loss is calculated as follows: $$ \text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1-y)\log(1-y') $$

  • (x,y) is the dataset containing many labelled examples, which are (x, y) pairs.
  • y is the label in a labelled example. Since this is logistic regression, every value of y must either be 0 or 1.
  • y' is your model's prediction (somewhere between 0 and 1), given the set of features in x.
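A minimal sketch of the sum on a tiny made-up set of labels and predictions:

```python
import numpy as np

y = np.array([1, 0, 1, 1])               # true labels, each 0 or 1
y_pred = np.array([0.9, 0.2, 0.7, 0.6])  # model probabilities

log_loss = np.sum(-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred))
print(round(log_loss, 3))  # 1.196
```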

Regularisation, such as L2 regularisation or early stopping, is crucial in logistic regression to prevent overfitting (due to the model's asymptotic nature) and improve generalisation.

Source: Logistic regression: Loss and regularization | Machine Learning | Google for Developers

Classification

Introduction

Logistic regression models can be converted into binary classification models for predicting categories instead of probabilities.

Source: Classification | Machine Learning | Google for Developers

Thresholds and the confusion matrix

To convert the raw output from a logistic regression model into binary classification (positive and negative class), you need a classification threshold.

Confusion matrix

Actual positive Actual negative
Predicted positive True positive (TP) False positive (FP)
Predicted negative False negative (FN) True negative (TN)

The total of each row gives all predicted positives (TP + FP) and all predicted negatives (FN + TN); the total of each column gives all actual positives (TP + FN) and all actual negatives (FP + TN).

  • When positive examples and negative examples are generally well differentiated, with most positive examples having higher scores than negative examples, the dataset is separated.
  • When the total of actual positives is not close to the total of actual negatives, the dataset is imbalanced.
  • When many positive examples have lower scores than negative examples, and many negative examples have higher scores than positive examples, the dataset is unseparated.

When we increase the classification threshold, both TP and FP decrease, and both TN and FN increase.
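A minimal sketch of thresholding model scores into the four cells (the scores and labels are made up):

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.65, 0.4, 0.3, 0.1])
labels = np.array([1,   1,   0,    1,   0,   0])

threshold = 0.5
predicted = (scores >= threshold).astype(int)

tp = np.sum((predicted == 1) & (labels == 1))
fp = np.sum((predicted == 1) & (labels == 0))
fn = np.sum((predicted == 0) & (labels == 1))
tn = np.sum((predicted == 0) & (labels == 0))
print(tp, fp, fn, tn)  # raising the threshold shrinks TP and FP, grows TN and FN
```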

Source: Thresholds and the confusion matrix | Machine Learning | Google for Developers

Accuracy, Recall, Precision, and related metrics are all calculated at a single classification threshold value.

Accuracy is the proportion of all classifications that were correct. $$ \text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} = \frac{TP+TN}{TP+TN+FP+FN} $$

  • Use as a rough indicator of model training progress/convergence for balanced datasets. Typically the default.
  • For model performance, use only in combination with other metrics.
  • Avoid for imbalanced datasets. Consider using another metric.

Recall, or true positive rate, is the proportion of all actual positives that were classified correctly as positives. Also known as probability of detection. $$ \text{Recall (or TPR)} = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP+FN} $$

  • Use when false negatives are more expensive than false positives.
  • Better than Accuracy in imbalanced datasets.
  • Improves when false negatives decrease.

False positive rate is the proportion of all actual negatives that were classified incorrectly as positives. Also known as probability of a false alarm. $$ \text{FPR} = \frac{\text{incorrectly classified actual negatives}}{\text{all actual negatives}}=\frac{FP}{FP+TN} $$

  • Use when false positives are more expensive than false negatives.
  • Less meaningful and useful in a dataset where the number of actual negatives is very, very low.

Precision is the proportion of all the model's positive classifications that are actually positive. $$ \text{Precision} = \frac{\text{correctly classified actual positives}}{\text{everything classified as positive}}=\frac{TP}{TP+FP} $$

  • Use when it's very important for positive predictions to be accurate.
  • Less meaningful and useful in a dataset where the number of actual positives is very, very low.
  • Improves as false positives decrease.

Precision and Recall often show an inverse relationship.

F1 score is the harmonic mean of Precision and Recall. $$ \text{F1} = 2 * \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}} = \frac{2TP}{2TP + FP + FN} $$

  • Preferable for class-imbalanced datasets.
  • When Precision and Recall are close in value, F1 will be close to their value.
  • When Precision and Recall are far apart, F1 will be similar to whichever metric is worse.
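A minimal sketch computing all four metrics from the confusion-matrix counts of the previous sketch (TP=2, FP=1, FN=1, TN=2):

```python
tp, fp, fn, tn = 2, 1, 1, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # true positive rate
fpr = fp / (fp + tn)          # false positive rate
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, recall, fpr, precision, f1)
```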

Source: Classification: Accuracy, recall, precision, and related metrics | Machine Learning | Google for Developers

ROC and AUC

ROC and AUC evaluate a model's quality across all possible thresholds.

The ROC curve, or receiver operating characteristic curve, plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. A perfect model would pass through (0,1), while a random guesser forms a diagonal line from (0,0) to (1,1).

AUC, or area under the curve, represents the probability that the model will rank a randomly chosen positive example higher than a negative example. A perfect model has AUC = 1.0, while a random model has AUC = 0.5.

ROC and AUC of a hypothetical perfect model (AUC = 1.0) and of completely random guesses (AUC = 0.5): auc_1-0.png auc_0-5.png

ROC and AUC are effective when class distributions are balanced. For imbalanced data, precision-recall curves (PRCs) can be more informative. prauc.png

A higher AUC generally indicates a better-performing model.

ROC and AUC of two hypothetical models; the second curve (AUC = 0.93) represents the better of the two models: auc_0-65.png auc_0-93.png

Threshold choice depends on the cost of false positives versus false negatives. The most relevant thresholds are those closest to (0,1) on the ROC curve. For costly false positives, a conservative threshold (like A in the chart below) is better. For costly false negatives, a more sensitive threshold (like C) is preferable. If costs are roughly equivalent, a threshold in the middle (like B) may be best. auc_abc.png
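A minimal sketch of sweeping all thresholds with scikit-learn, reusing the made-up scores from the earlier thresholding example:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

labels = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.65, 0.4, 0.3, 0.1])

fpr, tpr, thresholds = roc_curve(labels, scores)  # one (FPR, TPR) per threshold
auc = roc_auc_score(labels, scores)
print(auc)  # ~0.89: the chance a random positive outranks a random negative
```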

Source: Classification: ROC and AUC | Machine Learning | Google for Developers

Prediction bias

Prediction bias measures the difference between the average of a model's predictions and the average of the true labels in the data. For example, if 5% of emails in the dataset are spam, a model without prediction bias should also predict about 5% as spam. A large mismatch between these averages indicates potential problems.
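A minimal sketch of the check on that spam example, with 5% positive labels and made-up predictions:

```python
import numpy as np

labels = np.array([1] * 5 + [0] * 95)      # 5% of examples are actually spam
predictions = np.full(labels.size, 0.05)   # model's predicted probabilities

prediction_bias = predictions.mean() - labels.mean()
print(prediction_bias)  # close to 0 here; a large gap signals a problem
```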

Prediction bias can be caused by:

  • Biased and noisy data (e.g., skewed sampling)
  • Overly strong regularisation that oversimplifies the model
  • Bugs in the model training pipeline
  • Insufficient features provided to the model

Source: Classification: Prediction bias | Machine Learning | Google for Developers

Multi-class classification

Multi-class classification extends binary classification to cases with more than two classes.

If each example belongs to only one class, the problem can be broken down into a series of binary classifications. For instance, with three classes (A, B, C), you could first separate C from A+B, then distinguish A from B within the A+B group.

Source: Classification: Multi-class classification | Machine Learning | Google for Developers

 
Read more...

from Bloc de notas

i don't know if you remember, or if you move so fast that by now you no longer care what happened / what's gone is gone, and so you really did learn something from me / a little about how to live

 
Read more...

from Stefan Angrick

I like to revisit Google's Machine Learning Crash Course every now and then to refresh my understanding of key machine learning concepts. It's a fantastic free resource, published under the Creative Commons Attribution 4.0 License, now updated with content on recent developments like large language models and automated ML.

I take notes as a personal reference, and I thought I would post them here to keep track of what I've learned—while hopefully offering something useful to others doing the same.

The notes are organised into four posts, one for each course module:

  1. Google ML Crash Course #1: ML Models
  2. Google ML Crash Course #2: Data
  3. Google ML Crash Course #3: Advanced ML Models
  4. Google ML Crash Course #4: Real-World ML
 
Read more...

from Context[o]s

Photograph of a beach with a boat on the sand and a child beside it. In the background, a sky full of stars in a photomontage made from a photo of the universe. Black-and-white photo. © Photo Ricard Ramon. Digital collage composed of an analogue photograph (Leica R and Kodak TriX) and a royalty-free photo from the Smithsonian Institution.

Friends who take the trouble to read these lines: a new year is approaching, full of uncertainty and expected to be even harder for the future of humanity than the last. Every indicator, and the plain evidence of reality, shows us a landscape of confusion, of constant lies, of threats from fascist governments in East and West, of the rise of fascists on this very patch of earth we inhabit, whose memory holds fresh and terrible recollections of the criminal, savage deeds those same fascists always commit. Supposed democrats, opening the doors to the devil right under our noses and with shameless glee.

The outlook is disheartening: a runaway, unstoppable climate crisis with unforeseeable consequences, and governments made up of gangs of lunatics to whom we have granted power by the grace of Meta, X, and Google, who dictate to us through their algorithms what we are to think or, worse still, generate enough confusion and sense of defeat that we stop believing in anything at all. Add to this the infectious rubbish of generative AI from OpenAI and its derivatives, with their mega data centres extracting water and energy and generating poverty and misery in rural areas that were poor and battered to begin with.

Election results around the world seem to be proving one thing: the loss of hope and of faith in a better world. Some people are rushing to vote for parties that can promise only one thing, the acceleration of our collective self-destruction, because that is essentially what fascism is: a system for suppressing freedoms whose novelty, in its contemporary version, is its homage and obedience to the great technology magnates (curiously foreign ones, for people who claim to be so nationalist in their various colourful guises).

It is a despair that infects even the hopeful and the optimistic, who remain prisoners of their networks, in every sense of the word, and refuse to abandon a raging sea in which their hope is manipulated and destroyed every day. A sea ruled by the algorithms of power, deepening humanity's misery. As with any addiction, the first step is to acknowledge it; the second, indispensable step is to abandon the networks of the algorithm of power with urgency, to delete X, Facebook, Instagram, TikTok, and the rest forever, and to run without looking back. There is hope, beyond the dictates of the algorithm and its millionaire owners, in free networks: federated, ownerless, algorithm-free, belonging to the people who make them up, where neither addiction nor virality can be manufactured.

We need political impulses that are not born of the rage or the reaction these networks feed (once again, the algorithm of the rich, used to oppress and manipulate the poor), but that emerge instead from the optimistic proposition of how things ought to be done for the good. Not as a way of confronting the evil that stalks us, but because the world can only be conceived along the path of good, of the search for truth, of the creation of spaces of community and genuinely empathetic thought.

We must use creative media to stimulate the possibility of imagining new possible worlds, not to remind ourselves permanently of the threats hanging over us. Evidence and diagnosis serve only that purpose: to propose and to act, not to remain stuck in reaction and perpetual complaint against the other. That is a waste of time and energy that should instead be channelled towards restoring goodness and hope, towards imagination and the active exercise of possible worlds. We have no shortage of examples, or of people already on that path.

Perhaps we need to propose more utopias in cinema and art, and fewer dystopias, which seem to be all that floods our series and our cinematic and narrative offerings. We must paint, write, and dance, projecting ourselves towards a better imagined future, simply because life is better in the scenarios of that future, and for it to exist it must be created, invented, imagined, and projected collectively. And we must make visible, without fear and with joy, what is already being done. We must make it fashionable again to do good, to believe in others and in a collective future that lays some minimal foundations of shared hope. A future exists only in what we hold in common; this is incontestable empirical evidence. An undeniable truth, in times when they want us to believe that truth does not exist.

We live gripped by fear and we move by reaction, and we need to start moving by action. Acts, the creative, artistic, imaginative act, are the only thing that can save us from falling into despair or nihilism. Activism cannot always be reactive, because then we play as prisoners on the opponent's pitch, where we know we cannot win, as we see every day. Reason is on our side: the search for truth, justice, human rights, animal and environmental rights, science, the science of art, even the common sense some fascists invoke so often, all of it resides in reason and optimism.

Yes, certainly, part of the culture has turned hostile, and a subculture of unreason and lies is emerging; we know that. Nor is it anything new or especially surprising; it has always been there in other forms, and one need only review history. But darkness is not fought with more darkness, nor with an excess of light. Our obligation is to create the contrast between light and shadow, which is where colour emerges, as Goethe rightly demonstrated; let us get to it.

#society #internet #fediverse #politics

 
Read more...

from

contradictions i am full of them at times.
logic dominates my perception,
fully aligned with who i truly am.
yet a sense of doubt can still set me
back, and thinking becomes dominated
by the heart. such is the human condition,
which i find difficult to accept within me,
wanting to be holy while human.
this has been my biggest challenge,
accepting that condition means coexisting
with this body and time.

asking others for reassurance,
knowing well i do not truly need it.
the body has ways of acting
i do not always expect,
learned behaviors shaped by a wounded ego. for a moment i forget i am human
and turn my anger inward,
but somewhere i remember
i am not too much: i am simply human.

i find extreme beauty in this so many layers a human has, no flat lines. and still at times feel shame in it = contradiction

 
Read more... Discuss...

from SmarterArticles

The game changed in May 2025 when Anthropic released Claude 4 Opus and Sonnet, just three months after Google had stunned the industry with Gemini 2.5's record-breaking benchmarks. Within a week, Anthropic's new models topped those same benchmarks. Two months later, OpenAI countered with GPT-5. By September, Claude Sonnet 4.5 arrived. The pace had become relentless.

This isn't just competition. It's an arms race that's fundamentally altering the economics of building on artificial intelligence. For startups betting their futures on specific model capabilities, and enterprises investing millions in AI integration, the ground keeps shifting beneath their feet. According to MIT's “The GenAI Divide: State of AI in Business 2025” report, whilst generative AI holds immense promise, about 95% of AI pilot programmes fail to achieve rapid revenue acceleration, with the vast majority stalling and delivering little to no measurable impact on profit and loss statements.

The frequency of model releases has accelerated to a degree that seemed impossible just two years ago. Where annual or semi-annual updates were once the norm, major vendors now ship significant improvements monthly, sometimes weekly. This velocity creates a peculiar paradox: the technology gets better faster than organisations can adapt to previous versions.

The New Release Cadence

The numbers tell a striking story. Anthropic alone shipped seven major model versions in 2025, starting with Claude 3.7 Sonnet in February, followed by Claude 4 Opus and Sonnet in May, Claude Opus 4.1 in August, and culminating with Claude Sonnet 4.5 in September and Claude Haiku 4.5 in October. OpenAI maintained a similarly aggressive pace, releasing GPT-4.5 in February and its landmark GPT-5 in August, alongside o3-pro (an enhanced reasoning model), Codex (an autonomous code agent), and the gpt-oss family of open-weight models.

Google joined the fray with Gemini 3, which topped industry benchmarks and earned widespread praise from researchers and developers across social platforms. The company simultaneously released Veo 3, a video generation model capable of synchronised 4K video with natural audio integration, and Imagen 4, an advanced image synthesis system.

The competitive dynamics are extraordinary. More than 800 million people use ChatGPT each week, yet OpenAI faces increasingly stiff competition from rivals who are matching or exceeding its capabilities in specific domains. When Google released Gemini 3, it set new records on numerous benchmarks. The following week, Anthropic's Claude Opus 4.5 achieved even higher scores on some of the same evaluations.

This leapfrogging pattern has become the industry's heartbeat. Each vendor's release immediately becomes the target for competitors to surpass. The cycle accelerates because falling behind, even briefly, carries existential risks when customers can switch providers with relative ease.

The Startup Dilemma

For startups building on these foundation models, rapid releases create a sophisticated risk calculus. Every API update or model deprecation forces developers to confront rising switching costs, inconsistent documentation, and growing concerns about vendor lock-in.

The challenge is particularly acute because opportunities to innovate with AI exist everywhere, yet every niche has become intensely competitive. As one venture analysis noted, whilst innovation potential is ubiquitous, what's most notable is the fierce competition in every sector going after the same customer base. For customers, this drives down costs and increases choice. For startups, however, customer acquisition costs continue rising whilst margins erode.

The funding landscape reflects this pressure. AI companies now command 53% of all global venture capital invested in the first half of 2025. Despite unprecedented funding levels exceeding $100 billion, 81% of AI startups will fail within three years. The concentration of capital in mega-rounds means early-stage founders face increased competition for attention and investment. Geographic disparities persist sharply: US companies received 71% of global funding in Q1 2025, with Bay Area startups alone capturing 49% of worldwide venture capital.

Beyond capital, startups grapple with infrastructure constraints that large vendors navigate more easily. Training and running AI models requires computing power that the world's chip manufacturers and cloud providers struggle to supply. Startups often queue for chip access or must convince cloud providers that their projects merit precious GPU allocation. The 2024 State of AI Infrastructure Report painted a stark picture: 82% of organisations experienced AI performance issues.

Talent scarcity compounds these challenges. The demand for AI expertise has exploded whilst supply of qualified professionals hasn't kept pace. Established technology giants actively poach top talent, creating fierce competition for the best engineers and researchers. This “AI Execution Gap” between C-suite ambition and organisational capacity to execute represents a primary reason for high AI project failure rates.

Yet some encouraging trends have emerged. With training costs dramatically reduced through algorithmic and architectural innovations, smaller companies can compete with established leaders, spurring a more dynamic and diverse market. Over 50% of foundation models are now available openly, meaning startups can download state-of-the-art models and build upon them rather than investing millions in training from scratch.

Model Deprecation and Enterprise Risk

The rapid release cycle creates particularly thorny problems around model deprecation. OpenAI's approach illustrates the challenge. The company uses “sunset” and “shut down” interchangeably to indicate when models or endpoints become inaccessible, whilst “legacy” refers to versions that no longer receive updates.

In 2024, OpenAI announced that access to the v1 beta of its Assistants API would shut down by year's end when releasing v2. Access discontinued on 18 December 2024. On 29 August 2024, developers learned that fine-tuning babbage-002 and davinci-002 models would no longer support new training runs starting 28 October 2024. By June 2024, only existing users could continue accessing gpt-4-32k and gpt-4-vision-preview.

The 2025 deprecation timeline proved even more aggressive. GPT-4.5-preview was removed from the API on 14 July 2025. Access to o1-preview ended 28 July 2025, whilst o1-mini survived until 27 October 2025. In November 2025 alone, OpenAI deprecated the chatgpt-4o-latest model snapshot (removal scheduled for 17 February 2026), codex-mini-latest (removal scheduled for 16 January 2026), and DALL·E model snapshots (removal set for 12 May 2026).

For enterprises, this creates genuine operational risk. Whilst OpenAI indicated that API deprecations for business customers receive significant advance notice (typically three months), the sheer frequency of changes forces constant adaptation. Interestingly, OpenAI told VentureBeat that it has no plans to deprecate older models on the API side, stating “In the API, we do not currently plan to deprecate older models.” However, ChatGPT users experienced more aggressive deprecation, with subscribers on the ChatGPT Enterprise tier retaining access to all models whilst individual users lost access to popular versions.

Azure OpenAI's policies attempt to provide more stability. Generally Available model versions remain accessible for a minimum of 12 months. After that period, existing customers can continue using older versions for an additional six months, though new customers cannot access them. Preview models have much shorter lifespans: retirement occurs 90 to 120 days from launch. Azure provides at least 60 days' notice before retiring GA models and 30 days before preview model version upgrades.

These policies reflect a fundamental tension. Vendors need to maintain older models whilst advancing rapidly, but supporting numerous versions simultaneously creates technical debt and resource strain. Enterprises, meanwhile, need stability to justify integration investments that can run into millions of pounds.

According to nearly 60% of AI leaders surveyed, their organisations' primary challenges in adopting agentic AI are integrating with legacy systems and addressing risk and compliance concerns. Agentic AI thrives in dynamic, connected environments, but many enterprises rely on rigid legacy infrastructure that makes it difficult for autonomous AI agents to integrate, adapt, and orchestrate processes. Overcoming this requires platform modernisation, API-driven integration, and process re-engineering.

Strategies for Managing Integration Risk

Successful organisations have developed sophisticated strategies for navigating this turbulent landscape. The most effective approach treats AI implementation as business transformation rather than technology deployment. Organisations achieving 20% to 30% return on investment focus on specific business outcomes, invest heavily in change management, and implement structured measurement frameworks.

A recommended phased approach introduces AI gradually, running AI models alongside traditional risk assessments to compare results, build confidence, and refine processes before full adoption. Real-time monitoring, human oversight, and ongoing model adjustments keep AI risk management sharp and reliable. The first step involves launching comprehensive assessments to identify potential vulnerabilities across each business unit. Leaders then establish robust governance structures, implement real-time monitoring and control mechanisms, and ensure continuous training and adherence to regulatory requirements.

At the organisational level, enterprises face the challenge of fine-tuning vendor-independent models that align with their own governance and risk frameworks. This often requires retraining on proprietary or domain-specific data and continuously updating models to reflect new standards and business priorities. With players like Mistral, Hugging Face, and Aleph Alpha gaining traction, enterprises can now build model strategies that are regionally attuned and risk-aligned, reducing dependence on US-based vendors.

MIT's Center for Information Systems Research identified four critical challenges enterprises must address to move from piloting to scaling AI: Strategy (aligning AI investments with strategic goals), Systems (architecting modular, interoperable platforms), Synchronisation (creating AI-ready people, roles, and teams), and Stewardship (embedding compliant, human-centred, and transparent AI practices).

How companies adopt AI proves crucial. Purchasing AI tools from specialised vendors and building partnerships succeed about 67% of the time, whilst internal builds succeed only one-third as often. This suggests that expertise and pre-built integration capabilities outweigh the control benefits of internal development for most organisations.

Agile practices enable iterative development and quick adaptation. AI models should grow with business needs, requiring regular updates, testing, and improvements. Many organisations cite worries about data confidentiality and regulatory compliance as top enterprise AI adoption challenges. By 2025, regulations like GDPR, CCPA, HIPAA, and similar data protection laws have become stricter and more globally enforced. Financial institutions face unique regulatory requirements that shape AI implementation strategies, with compliance frameworks needing to be embedded throughout the AI lifecycle rather than added as afterthoughts.

The Abstraction Layer Solution

One of the most effective risk mitigation strategies involves implementing an abstraction layer between applications and AI providers. A unified API for AI models provides a single, standardised interface allowing developers to access and interact with multiple underlying models from different providers. It acts as an abstraction layer, simplifying integration of diverse AI capabilities by providing a consistent way to make requests regardless of the specific model or vendor.

This approach abstracts away provider differences, offering a single, consistent interface that reduces development time, simplifies code maintenance, and allows easier switching or combining of models without extensive refactoring. The strategy reduces vendor lock-in and keeps applications shipping even when one provider rate-limits or changes policies.

According to Gartner's Hype Cycle for Generative AI 2025, AI gateways have emerged as critical infrastructure components, no longer optional but essential for scaling AI responsibly. By 2025, expectations from gateways have expanded beyond basic routing to include agent orchestration, Model Context Protocol compatibility, and advanced cost governance capabilities that transform gateways from routing layers into long-term platforms.

Key features of modern AI gateways include model abstraction (hiding specific API calls and data formats of individual providers), intelligent routing (automatically directing requests to the most suitable or cost-effective model based on predefined rules or real-time performance), fallback mechanisms (ensuring service continuity by automatically switching to alternative models if primary models fail), and centralised management (offering a single dashboard or control plane for managing API keys, usage, and billing across multiple services).

Several solutions have emerged to address these needs. LiteLLM is an open-source gateway supporting over 100 models, offering a unified API and broad compatibility with frameworks like LangChain. Bifrost, designed for enterprise-scale deployment, offers unified access to over 12 providers (including OpenAI, Anthropic, AWS Bedrock, and Google Vertex) via a single OpenAI-compatible API, with automatic failover, load balancing, semantic caching, and deep observability integrations.

OpenRouter provides a unified endpoint for hundreds of AI models, emphasising user-friendly setup and passthrough billing, well-suited for rapid prototyping and experimentation. Microsoft.Extensions.AI offers a set of core .NET libraries developed in collaboration across the .NET ecosystem, providing a unified layer of C# abstractions for interacting with AI services. The Vercel AI SDK provides a standardised approach to interacting with language models through a specification that abstracts differences between providers, allowing developers to switch between providers whilst using the same API.

Best practices for avoiding vendor lock-in include coding against OpenAI-compatible endpoints, keeping prompts decoupled from code, using a gateway with portable routing rules, and maintaining a model compatibility matrix for provider-specific quirks. The foundation of any multi-model system is this unified API layer. Instead of writing separate code for OpenAI, Claude, Gemini, or LLaMA, organisations build one internal method (such as generate_response()) that handles any model type behind the scenes, simplifying logic and future-proofing applications against API changes.
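A minimal sketch of that pattern, with provider adapters as stubs (the class and model names are illustrative assumptions; a real gateway such as LiteLLM or Bifrost adds routing, caching, and observability on top):

```python
from abc import ABC, abstractmethod

class ChatProvider(ABC):
    """Common interface hiding each vendor's API details."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class OpenAIProvider(ChatProvider):
    def complete(self, prompt: str) -> str:
        # A real implementation would call the OpenAI API here.
        return f"[openai] response to: {prompt}"

class AnthropicProvider(ChatProvider):
    def complete(self, prompt: str) -> str:
        # A real implementation would call the Anthropic API here.
        return f"[anthropic] response to: {prompt}"

PRIMARY = OpenAIProvider()
FALLBACK = AnthropicProvider()

def generate_response(prompt: str) -> str:
    """Single internal entry point: route to the primary provider and
    fall back automatically if it fails or rate-limits."""
    try:
        return PRIMARY.complete(prompt)
    except Exception:
        return FALLBACK.complete(prompt)

print(generate_response("Summarise this quarter's sales figures."))
```

Because application code only ever calls generate_response(), swapping or adding providers touches one adapter rather than every call site, which is precisely the lock-in protection the gateway vendors productise.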

The Multimodal Revolution

Whilst rapid release cycles create integration challenges, they've also unlocked powerful new capabilities, particularly in multimodal AI systems that process text, images, audio, and video simultaneously. According to Global Market Insights, the multimodal AI market was valued at $1.6 billion in 2024 and is projected to grow at a remarkable 32.7% compound annual growth rate through 2034. Gartner research predicts that 40% of generative AI solutions will be multimodal by 2027, up from just 1% in 2023.

The technology represents a fundamental shift. Multimodal AI refers to artificial intelligence systems that can process, understand, and generate multiple types of data (text, images, audio, video, and more) often simultaneously. By 2025, multimodal AI reached mass adoption, transforming from experimental capability to essential infrastructure.

GPT-4o exemplifies this evolution. ChatGPT's general-purpose flagship as of mid-2025, GPT-4o is a unified multimodal model that integrates all media formats into a singular platform. It handles real conversations with 320-millisecond response times, fast enough that users don't notice delays. The model processes text, images, and audio without separate preprocessing steps, creating seamless interactions.

Google's Gemini series was designed for native multimodality from inception, processing text, images, audio, code, and video. The latest Gemini 2.5 Pro Preview, released in May 2025, excels in coding and building interactive web applications. Gemini's long context window (up to 1 million tokens) allows it to handle vast datasets, enabling entirely new use cases like analysing complete codebases or processing comprehensive medical histories.

Claude has evolved into a highly capable multimodal assistant, particularly for knowledge workers dealing with documents and images regularly. Whilst it doesn't integrate image generation, it excels when analysing visual content in context, making it valuable for professionals processing mixed-media information.

Even mobile devices now run sophisticated multimodal models. Phi-4, at 5.6 billion parameters, fits in mobile memory whilst handling text, image, and audio inputs. It's designed for multilingual and hybrid use with actual on-device processing, enabling applications that don't depend on internet connectivity or external servers.

The technical architecture behind these systems employs three main fusion techniques. Early fusion combines raw data from different modalities at the input stage. Intermediate fusion processes and preserves modality-specific features before combining them. Late fusion analyses streams separately and merges outputs from each modality. Images are converted to 576 to 3,000 tokens depending on resolution. Audio becomes spectrograms converted to audio tokens. Video becomes frames transformed into image tokens plus temporal tokens.

The breakthroughs of 2025 happened because of leaps in computation and chip design. NVIDIA Blackwell GPUs enable massive parallel multimodal training. Apple Neural Engines optimise multimodal inference on consumer devices. Qualcomm Snapdragon AI chips power real-time audio and video AI on mobile platforms. This hardware evolution made previously theoretical capabilities commercially viable.

Audio AI Creates New Revenue Streams

Real-time audio processing represents one of the most lucrative domains unlocked by recent model advances. The global AI voice generators market was worth $4.9 billion in 2024 and is estimated to reach $6.40 billion in 2025, growing to $54.54 billion by 2033 at a 30.7% CAGR. Voice AI agents alone will account for $7.63 billion in global spend by 2025, with projections reaching $139 billion by 2033.

The speech and voice recognition market was valued at $15.46 billion in 2024 and is projected to reach $19.09 billion in 2025, expanding to $81.59 billion by 2032 at a 23.1% CAGR. The audio AI recognition market was estimated at $5.23 billion in 2024 and projected to surpass $19.63 billion by 2033 at a 15.83% CAGR.

Integrating 5G and edge computing presents transformative opportunities. 5G's ultra-low latency and high-speed data transmission enable real-time sound generation and processing, whilst edge computing ensures data is processed closer to the source. This opens possibilities for live language interpretation, immersive video games, interactive virtual assistants, and real-time customer support systems.

The Banking, Financial Services, and Insurance sector represents the largest industry vertical, accounting for 32.9% of market share, followed by healthcare, retail, and telecommunications. Enterprises across these sectors rapidly deploy AI-generated voices to automate customer engagement, accelerate content production, and localise digital assets at scale.

Global content distribution creates another high-impact application. Voice AI enables real-time subtitles across more than 50 languages with sub-two-second delay, transforming how content reaches global audiences. The media and entertainment segment accounted for the largest revenue share in 2023 due to high demand for innovative content creation. AI voice technology proves crucial for generating realistic voiceovers, dubbing, and interactive experiences in films, television, and video games.

Smart devices and the Internet of Things drive significant growth. Smart speakers including Amazon Alexa, Google Home, and Apple HomePod use audio AI tools for voice recognition and natural language processing. Modern smart speakers increasingly incorporate edge AI chips. Amazon's Echo devices feature the AZ2 Neural Edge processor, a quad-core chip 22 times more powerful than its predecessor, enabling faster on-device voice recognition.

Geographic distribution of revenue shows distinct patterns. North America dominated the Voice AI market in 2024, capturing more than 40.2% of market share with revenues amounting to $900 million. The United States market alone reached $1.2 billion. Asia-Pacific is expected to witness the fastest growth, driven by rapid technological adoption in China, Japan, and India, fuelled by increasing smartphone penetration, expanding internet connectivity, and government initiatives promoting digital transformation.

Recent software developments encompass real-time language translation modules and dynamic emotion recognition engines. In 2024, 104 specialised voice biometrics offerings were documented across major platforms, and 61 global financial institutions incorporated voice authentication within their mobile banking applications. These capabilities create entirely new business models around security, personalisation, and user experience.

Video Generation Transforms Content Economics

AI video generation represents another domain where rapid model improvements have unlocked substantial commercial opportunities. The technology enables businesses to automate video production at scale, dramatically reducing costs whilst maintaining quality. Market analysis indicates that the AI content creation sector will see a 25% compound annual growth rate through 2028, as forecasted by Statista. The global AI market is expected to soar to $826 billion by 2030, with video generation being one of the biggest drivers behind this explosive growth.

Marketing and advertising applications demonstrate immediate return on investment. eToro, a global trading and investing platform, pioneered using Google's Veo to create advertising campaigns, enabling rapid generation of professional-quality, culturally specific video content across the global markets it serves. Businesses can generate multiple advertisement variants from one creative brief and test different hooks, visuals, calls-to-action, and voiceovers across Meta Ads, Google Performance Max, and programmatic platforms. For example, an e-commerce brand running A/B testing on AI-generated advertisement videos for flash sales doubled click-through rates.

Corporate training and internal communications represent substantial revenue opportunities. Synthesia's most popular use case is training videos, but it's versatile enough to handle a wide range of needs. Businesses use it for internal communications, onboarding new employees, and creating customer support or knowledge base videos. Companies of every size (including more than 90% of the Fortune 100) use it to create training, onboarding, product explainers, and internal communications in more than 140 languages.

Business applications include virtual reality experiences and training simulations, where Veo 2's ability to simulate realistic scenarios can cut costs by 40% in corporate settings. Traditional video production may take days, but AI can generate full videos in minutes, enabling brands to respond quickly to trends. AI video generators dramatically reduce production time, with some users creating post-ready videos in under 15 minutes.

Educational institutions leverage AI video tools to develop course materials that make abstract concepts tangible. Complex scientific processes, historical events, or mathematical principles transform into visual narratives that enhance student comprehension. Instructors describe scenarios in text, and the AI generates corresponding visualisations, democratising access to high-quality educational content.

Social media content creation has become a major use case. AI video generators excel at generating short-form videos (15 to 90 seconds) for social media and e-commerce, applying pre-designed templates for Instagram Reels, YouTube Shorts, or advertisements, and synchronising AI voiceovers to scripts for human-like narration. Businesses can produce dozens of platform-specific videos per campaign with hook-based storytelling, smooth transitions, and animated captions with calls-to-action. For instance, a beauty brand uses AI to adapt a single tutorial into 10 personalised short videos for different demographics.

The technology demonstrates potential for personalised marketing, synthetic media, and virtual environments, indicating a major shift in how industries approach video content generation. On the marketing side, AI video tools excel in producing personalised sales outreach videos, B2B marketing content, explainer videos, and product demonstrations.

Marketing teams deploy the technology at unprecedented speed: a campaign that previously required weeks of planning, shooting, and editing can now yield initial concepts within minutes. Tools like Sora and Runway lead innovation in cinematic and motion-rich content, whilst Vyond and Synthesia excel in corporate use cases.

Multi-Reference Systems and Enterprise Knowledge

Whilst audio and video capabilities create new customer-facing applications, multi-reference systems built on Retrieval-Augmented Generation have become critical for enterprise internal operations. RAG has evolved from an experimental AI technique to a board-level priority for data-intensive enterprises seeking to unlock actionable insights from their multimodal content repositories.

The RAG market reached $1.85 billion in 2024 and is expanding rapidly, with analyst estimates ranging from 44.7% to 49% compound annual growth through 2030, as organisations move beyond proof-of-concepts to deploy production-ready systems. RAG has become a cornerstone of enterprise AI applications, enabling developers to build factually grounded systems without the cost and complexity of fine-tuning large language models.
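
For readers new to the pattern, a minimal sketch of the retrieve-then-generate loop follows; the embed and generate functions are toy stand-ins for whatever embedding model and language model a given deployment actually uses.

```python
import numpy as np

# Toy stand-ins: a real system would call an embedding model and an LLM.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(128)

def generate(prompt: str) -> str:
    return f"[LLM answer grounded in a prompt of {len(prompt)} chars]"

documents = [
    "Q3 revenue grew 12% year over year.",
    "The refund policy allows returns within 30 days.",
    "On-call rotations are scheduled in PagerDuty.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def rag_answer(question: str, top_k: int = 2) -> str:
    q = embed(question)
    # Cosine similarity of the query against every stored document.
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    )
    context = "\n".join(documents[i] for i in np.argsort(sims)[::-1][:top_k])
    # Grounding the prompt in retrieved text is what makes it RAG.
    return generate(f"Answer using only this context:\n{context}\n\nQ: {question}")

print(rag_answer("What is the refund window?"))
```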

Elastic Enterprise Search stands as one of the most widely adopted RAG platforms, offering enterprise-grade search capabilities powered by the industry's most-used vector database. Pinecone is a vector database built for production-scale AI applications with efficient retrieval capabilities, widely used for enterprise RAG implementations with a serverless architecture that scales automatically based on demand.

Ensemble RAG systems combine multiple retrieval methods, such as semantic matching and structured relationship mapping. By integrating these approaches, they deliver more context-aware and comprehensive responses than single-method systems. Various RAG techniques have emerged, including Traditional RAG, Long RAG, Self-RAG, Corrective RAG, Golden-Retriever RAG, Adaptive RAG, and GraphRAG, each tailored to different complexities and specific requirements.
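
One widely used way to merge the rankings such ensembles produce is reciprocal rank fusion; the sketch below combines the (hypothetical) outputs of a semantic retriever and a keyword retriever into a single ranking.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document ids into one ranking.
    Each document scores sum(1 / (k + rank)) across the lists, so items
    ranked highly by any retriever float towards the top."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of two retrievers over the same corpus:
semantic = ["doc7", "doc2", "doc9"]   # vector-similarity ranking
keyword = ["doc2", "doc4", "doc7"]    # BM25-style keyword ranking

print(reciprocal_rank_fusion([semantic, keyword]))
# doc2 and doc7, endorsed by both retrievers, come out on top
```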

The interdependence between RAG and AI agents has deepened considerably, whether as the foundation of agent memory or as the engine behind deep-research capabilities. From an agent's perspective, RAG may be just one tool among many, but because it manages unstructured data and memory, it stands as one of the most fundamental and critical tools. Without robust RAG, practical enterprise deployment of agents would be unfeasible.

The most urgent pressure on RAG today comes from the rise of AI agents: autonomous or semi-autonomous systems designed to perform multistep processes. These agents don't just answer questions; they plan, execute, and iterate, interfacing with internal systems, making decisions, and escalating when necessary. But these agents only work if they're grounded in deterministic, accurate knowledge and operate within clearly defined guardrails.

Emerging trends in RAG technology for 2025 and beyond include real-time RAG for dynamic data retrieval, multimodal content integration (text, images, and audio), hybrid models combining semantic search and knowledge graphs, on-device AI for enhanced privacy, and RAG as a Service for scalable deployment. RAG is evolving from simple text retrieval into multimodal, real-time, and autonomous knowledge integration.

Key developments include multimodal retrieval. Rather than focusing primarily on text, AI will retrieve images, videos, structured data, and live sensor inputs. For example, medical AI could analyse scans alongside patient records, whilst financial AI could cross-reference market reports with real-time trading data. This creates opportunities for systems that reason across diverse information types simultaneously.
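
A sketch of the mechanics, assuming a shared embedding space of the kind CLIP-style models provide: one index holds items of every modality, and a single text query ranks them all. The encode function here is a toy stand-in, so the similarity scores are meaningless; the single query path across modalities is the point.

```python
import numpy as np

# Toy stand-in for a shared multimodal encoder; a real system would map
# text, images, and audio into one vector space with a trained model.
def encode(payload: str, modality: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash((modality, payload))) % (2**32))
    return rng.standard_normal(64)

corpus = [
    ("text", "Radiology note: no acute findings."),
    ("image", "chest_xray_0042.png"),
    ("audio", "cardiology_dictation_17.wav"),
]
vectors = np.stack([encode(payload, modality) for modality, payload in corpus])

def retrieve(query: str, top_k: int = 2):
    q = encode(query, "text")
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(sims)[::-1][:top_k]]

# One query can surface hits from any modality in the same index:
print(retrieve("signs of pneumonia on the scan"))
```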

Major challenges include high computational costs, real-time latency constraints, data security risks, and the complexity of integrating multiple external data sources; ensuring seamless access control and optimising retrieval efficiency are also key concerns. Enterprise deployments must retrieve proprietary data securely and at scale, and performance is benchmarked on retrieval accuracy, generation fluency, latency, and computational efficiency. Persistent issues such as retrieval quality, privacy, and integration overhead remain under critical assessment.

Looking Forward

The competitive landscape created by rapid model releases shows no signs of stabilising. In 2025, three names dominate the field: OpenAI, Google, and Anthropic. Each is chasing the same goal: building faster, safer, and more intelligent AI systems that will define the next decade of computing. The leapfrogging pattern, where one vendor's release immediately becomes the target for competitors to surpass, has become the industry's defining characteristic.

For startups, the challenge is navigating intense competition in every niche whilst managing the technical debt of constant model updates. The positive developments around open models and reduced training costs democratise access, but talent scarcity, infrastructure constraints, and regulatory complexity create formidable barriers. Success increasingly depends on finding specific niches where AI capabilities unlock genuine value, rather than competing directly with incumbents who can absorb switching costs more easily.

For enterprises, the key lies in treating AI as business transformation rather than technology deployment. The organisations achieving meaningful returns focus on specific business outcomes, implement robust governance frameworks, and build flexible architectures that can adapt as models evolve. Abstraction layers and unified APIs have shifted from nice-to-have to essential infrastructure, enabling organisations to benefit from model improvements without being held hostage to any single vendor's deprecation schedule.
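
A minimal sketch of what such an abstraction layer can look like: application code depends on one small interface, and each vendor sits behind a thin adapter. The adapter bodies and model names below are illustrative placeholders, not real SDK calls.

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

# One thin adapter per vendor; swapping or upgrading a model means
# editing an adapter, not the application.
class OpenAIAdapter:
    def __init__(self, model: str = "gpt-4o"):  # illustrative name
        self.model = model
    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt[:40]}..."  # real SDK call goes here

class AnthropicAdapter:
    def __init__(self, model: str = "claude-sonnet"):  # illustrative name
        self.model = model
    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt[:40]}..."  # real SDK call goes here

def summarise(llm: ChatModel, report: str) -> str:
    # Application logic sees only the ChatModel protocol.
    return llm.complete(f"Summarise in one sentence: {report}")

print(summarise(OpenAIAdapter(), "Q3 revenue grew 12% year over year."))
print(summarise(AnthropicAdapter(), "Q3 revenue grew 12% year over year."))
```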

The specialised capabilities in audio, video, and multi-reference systems represent genuine opportunities for new revenue streams and operational improvements. Voice AI's trajectory from $4.9 billion to a projected $54.54 billion by 2033 reflects real demand for capabilities that weren't commercially viable 18 months ago. Video generation's ability to reduce production costs by 40% whilst accelerating campaign creation from weeks to minutes creates a compelling return on investment for marketing and training applications. RAG's growth of up to 49% CAGR demonstrates that enterprises will pay substantial premiums for AI that reasons reliably over their proprietary knowledge.

The treadmill won't slow down. If anything, the pace may accelerate as models approach new capability thresholds and vendors fight to maintain competitive positioning. The organisations that thrive will be those that build for change itself, creating systems flexible enough to absorb improvements whilst stable enough to deliver consistent value. In an industry where the cutting edge shifts monthly, that balance between agility and reliability may be the only sustainable competitive advantage.

Tim Green

Tim Green is a UK-based systems theorist and independent technology writer.

Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.

His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.

ORCID: 0009-0002-0156-9795
Email: tim@smarterarticles.co.uk

 

from Justina Revolution

I went out into the cool evening air and did my five-phase routine. I loosened my body. I did Cosmos Palm. (This is my signature qigong sequence for power training.) I then did my Swimming Dragon Baguazhang. God, that form feels so good. I did Fut Gar and White Crane earlier today, so it has been a very complete workout day.

Tomorrow I will be talking to Dillon and Bre and then I will call Dr. Abad and get her help to get myself vaccinated so I can send Aldo the cards necessary for our residency. This is a good thing. I think things will work out.

I want to maybe play with makeup later but I never do.

The question becomes what nourishes me? What drains me?

 

from The Catechetic Converter

Detail of a window at Chartres Cathedral showing the Massacre of the Innocents, taken from Wikimedia Commons

I was first really exposed to the Christian commemorations of the Holy Innocents thanks to a church name. Holy Innocents Episcopal Church outside Atlanta, to be exact. I visited that church and liked the architecture and liturgy and it inspired me to learn more about a story I had known since childhood but seldom dwelt on—much less saw as a focus of devotion.

It’s a story that largely gets left out of our Christmas commemorations in the Episcopal Church, partly because it is such a horrible story (and likely partly due to more modern doubts about the story’s historical accuracy, which we’ll talk about in a bit). No one wants to follow up Christmas morning with a service about the mass murder of children.

At the same time, especially this year, this is a highly relevant story. Tragically, all over the world, politicians are playing Herod and systematically executing anyone they deem a threat—including children.

Holy Innocents, also known as Childermas, commemorates an event that, in all likelihood, never happened. Josephus, an important Jewish historian, took great care to showcase the brutalities of the Herodians and never once mentioned a mass slaughter of children. Outside of the gospel of Matthew there are no other historical accounts of this story, and it seems likely to be something the evangelist intended as a means of drawing connections between Jesus and Moses, a common theme throughout that particular gospel. So what are we to make of this fact? That we not only have a day marked on our calendar but also name churches and schools for an event that probably never happened?

This is one of the tough parts of reading the Bible. It’s not always “factual” in the ways to which we are accustomed today. Nevertheless, elements that we deem “fictional” can have a huge impact on our faith and wind up speaking Truth despite their (in)accuracy.

Consider the typical Christmas pageant. Aside from Mary, Joseph, a baby, angels, and some shepherds, most of the story we dramatize is completely fictional and not related to what is written in the Bible. We tend to think of the birth of Jesus as an event that culminates after Mary and Joseph, alone on a donkey, have gone to every house or inn in Bethlehem, been told “no vacancy,” and so set up shop in a nearby stable. But none of the gospels mention a donkey, and we’re only told that there was no room in “the inn”—nothing at all about conversations with inn-keepers or a door-to-door journey. Further, given the nature of the census, there was probably a caravan of people traveling to Bethlehem and others taking residence among the livestock because Bethlehem was not prepared for such an influx of extra people. What we think of when we think of the Christmas story is largely fictional, but that doesn’t mean there’s not truth in those elements. We crafted those details over the centuries in order to “flesh out” the story a bit, to give it the sort of texture that it invites. And those added details speak much of the faith and mindset of the church that crafted them.

The same is true of the Massacre of the Innocents. It might not have happened, but it’s very telling that no one finds the story improbable. There might not be any records to back it up, but the story sounds like the sort of thing Herod would have done—indeed, the sort of thing that rulers all over the world and all over our history books have done.

The sort of government that gleefully cancels aid and assistance to poor countries is acting like Herod. The one that uses starvation, particularly of children, as a weapon of retaliation is acting like Herod. The political entities that travel throughout villages to murder women and children are the ones acting like Herod.

The actual Herod may not have ordered a campaign to murder the children of Bethlehem out of some fear of losing power, but Herod certainly murdered plenty of children and other innocents during his reign, out of a sense that because he was in charge he could do so—without any fear of God. And in this, Herod is an archetype. Plenty of gilded so-called rulers kill innocents in the name of preserving their name on the side of buildings. If they were honest, they would admit they do so out of a desire to kill the God that they are not.

Yesterday’s saint records Jesus saying, “If the world hates you, know that it hated me first.” The poet Diane di Prima says in her poem “Rant” that “the only war that matters is the war against the imagination, all other wars are subsumed by it.” I tend to think that it’s more the case that all hatred is subsumed in hatred for Jesus and, therefore, all wars are the Battle of Armageddon, the war against Christ Himself.

If the story behind Holy Innocents is fictional, then it is worth asking what it is we’re commemorating this day. I think the answer is simple: Holy Innocents commemorates all children sacrificed on the altar of expedience or inconvenience by those in power attempting to cast themselves as gods. Those killed by starvation from the abrupt end to programs like USAID or in Gaza by the Israeli government. Those killed by radicals in Somalia and Sudan. Those dying thanks to bombs dropped on Ukraine. And that’s only what has appeared in the news in recent weeks. These are who we commemorate on Holy Innocents. The gospel story is subsumed in the stories we see right now, and is itself reflective of those stories. It helps us Christians see the shape of the story happening around us and reminds us where our allegiance lies.

Herod is the one who oversees the death of innocents. Christ is the one who sees them as holy.

***

The Rev. Charles Browning II is the rector of Saint Mary’s Episcopal Church in Honolulu, Hawai’i. He is a husband, father, surfer, and frequent over-thinker. Follow him on Mastodon and Pixelfed.

#Christmas #HolyInnocents #History #Theology #Church #Christianity #War #Gaza #Ukraine

 

from Attronarch's Athenaeum

OSRIC 3.0 Player Guide PDF has just been released for free on DriveThruRPG. Offset print and print-on-demand will be available next year, as well as GM Guide, adventures, and a host of other material.

OSRIC, the Old School Reference and Index Compilation, was the first retroclone of Advanced Dungeons & Dragons. Released almost 20 years ago, it led the charge during the early days of the OSR, providing a means to legally publish content compatible with AD&D.

OSRIC 3.0 brings a host of improvements: more explanations and examples of play, dense blocks of text replaced with a more accessible layout, the OGL discarded, and rules brought even closer to AD&D, to name a few.

Learn more about OSRIC 3.0 on BackerKit.

#News #OSRIC #OSR

 

from Zéro Janvier

Le jeu du cormoran is the fourth novel in Le Rêve du Démiurge, Francis Berthelot's novel cycle.

The fresco continues, as we meet characters already encountered in the previous novels: Ivan Algeiba, glimpsed as a young circus boy in Le jongleur interrompu and now a young man; Tom-Boulon, the stage manager of the Dragon theatre, and Katri, the former actress who has rediscovered her passion for singing, both of whom we had just left in the previous novel, Mélusath; and the cormorant that gives the novel its title. Could it be the reincarnation of Constantin, the juggler Ivan adored, who believed so fervently in the legend of the mythical island where departed souls are reborn as birds?

We also discover other characters, such as Moa-Toa, an androgynous young Asian of indeterminate sex, and a mysterious stranger with steel-blue eyes who seems to pursue Ivan and prey on his worst passions.

The story takes place in 1974. Ivan leaves the circus where he spent his childhood and adolescence and makes the acquaintance of Moa-Toa and Tom-Boulon. Guided by the cormorant, they undertake a journey from the Landes to Paris, and then on to Finland. The stages and the destination allow each of them to confront their past and, perhaps, find answers to the questions they carry.

In the vein of the first three novels, Francis Berthelot delivers a sensitive tale, steeped in symbolism, with a touch of the fantastic that grows stronger with each novel.

 
