from 💚

Fabled Entry

An episode of the palace Where the sky took up rain For Cooks and a dream The cougar that waits To cross by the stream Was an early river Taking chance

We in Ontario Take timing breaks to exist And enjoy open play For Hammond At the cenotaph Laurie at the gate Solemn news, Was war And I printed last Summer For a wear of resistance Typing rain And hearing doughnuts The simplest mood But afraid of existence- For its afterwards Laying on a table Being fed As time goes up

And so by dawn I work carefully But to know an amend I like peace And am on the phone

📱

 
Read more...

from 💚

Xylophone

David sat here By the rooftop, looking South A prayer for the first responders They were German and feeling well Six pairs of lungs today A solemn bit of Earth being turned A thousand trillion Euros for keep Kids on notice- There was a war and an accident Three years for better days A stink for redemption But the peers in line- We’re not our best We invest in freedom And finding our renew The Earth’s project And just at last An attempted standing Will see the coup And bear on our Sun In perfect hiding For his law- The one of the land And only day In his life To recover- Unarmed And likely injured For poetic frost

 
Read more...

from jolek78's blog

3:00 AM. Another one of those nights where my brain decided sleep was overrated. After my usual nocturnal walk through the streets of a remote Scottish town—where even a fox observed me with that “humans are weird” look—I sat back down at my server. Just a quick scan of my RSS feeds, I told myself, then I can start work. When...

We backed up Spotify (metadata and music files). It's distributed in bulk torrents (~300TB), grouped by popularity. This release includes the largest publicly available music metadata database with 256 million tracks and 186 million unique ISRCs. It's the world's first “preservation archive” for music which is fully open (meaning it can easily be mirrored by anyone with enough disk space), with 86 million music files, representing around 99.6% of listens.

The news came from Anna's Archive—the world's largest pirate library—which had just scraped Spotify's entire catalog. Not just metadata, but also the audio files. 86 million tracks, 300 terabytes. I stopped to reread those numbers, then thought: holy shit, how big is this thing?

And so, while the rest of the world slept, I started digging. This is one of those stories that needs to be told—a story weaving together hacker idealism, technology, billions of dollars in AI training data, and an ethical paradox few want to truly confront.

When Z-Library fell

November 3, 2022. The FBI seized the domains of Z-Library, one of the world's largest pirate libraries. Two alleged operators were arrested in Argentina. The community panicked—Z-Library served millions of students, researchers, and readers. And suddenly, everything vanished.

But someone was prepared. A group called PiLiMi (Pirate Library Mirror) had been creating complete backups of the shadow libraries for years. LibGen, Z-Library, Sci-Hub. Everything. When Z-Library fell, these backups were ready. But there was a problem: petabytes of data with no practical way to search them.

Enter Anna Archivist—a pseudonym, probably a collective—who understood something fundamental: preserving data is useless if it's not accessible. Days after Z-Library's seizure, Anna's Archive was online with a meta-search engine aggregating all shadow library catalogs, making them searchable and—crucially—virtually impossible to censor.

The numbers

December 2025:

  • 61.3 million books (PDF, EPUB, MOBI, DjVu)
  • 95.5 million academic papers
  • 256 million music tracks (Spotify metadata)
  • 86 million audio files (~300TB)
  • Total: ~1.1 Petabyte in unified torrents

To put this in perspective: the sum of all academic knowledge produced by humanity, plus a gigantic slice of world literary production, plus now music. All indexed, searchable, downloadable. Free. And virtually impossible to shut down.

Why it can't be killed

Remember Napster? Centralized servers, one lawsuit, shut down in a day. BitTorrent learned from that—decentralized everything. But Anna's Archive goes further, combining layers of resilience that make it practically immortal:

Distributed Frontend: Multiple domain mirrors (.li, .se, .org, .gs), Tor hidden service, Progressive Web App that works offline. Block one, others continue.

Distributed Database: Elasticsearch + PostgreSQL + public API. Anyone can download the entire database and host their own instance. No central server to attack.

Distributed Files: This is the genius part. Anna's Archive hosts almost nothing directly. Instead:

  • IPFS (InterPlanetary File System): Files identified by cryptographic hash, served by volunteer nodes worldwide
  • BitTorrent: Classic torrents with multiple trackers, self-sustaining swarms
  • HTTP Gateways: For normal users who just want to click-and-download, links redirect to public IPFS gateways

Result: user downloads via normal HTTP, but content comes from a decentralized network. Can't shut down IPFS. Can't stop BitTorrent. Can block gateways, but hundreds exist and anyone can create new ones.
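To make the gateway layer concrete, here is a minimal sketch (mine, not code from Anna's Archive) of what a client with fallback looks like: it asks several public IPFS gateways for the same content hash and moves on if one is blocked or down. The CID is a made-up placeholder.

    import urllib.request

    # Hypothetical content identifier (CID); a real file has its own hash.
    CID = "bafybeigdyrzt5example"

    # A few public IPFS gateways; hundreds exist and anyone can run a new one.
    GATEWAYS = [
        "https://ipfs.io/ipfs/",
        "https://dweb.link/ipfs/",
        "https://cloudflare-ipfs.com/ipfs/",
    ]

    def fetch(cid: str) -> bytes:
        """Try each gateway in turn; the content is identical everywhere
        because the CID is a cryptographic hash of the file itself."""
        for gateway in GATEWAYS:
            try:
                with urllib.request.urlopen(gateway + cid, timeout=10) as resp:
                    return resp.read()
            except OSError:
                continue  # gateway blocked or unreachable, try the next one
        raise RuntimeError("no gateway reachable")

Block one domain and the loop simply skips to the next; only the hash matters.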

OpSec: Domains registered via privacy-focused Icelandic registrar, bulletproof hosting in non-cooperative jurisdictions, Bitcoin payments, PGP-encrypted communications, zero personal information.

The only way to stop Anna's Archive would be to shut down the internet. Or convince every single seeder to stop. Good luck.

81.7 terabytes free for Meta

And here's where it gets disturbing.

February 2025. Documents from Kadrey v. Meta are unsealed—a class action by authors against Meta for using their pirated books to train Llama AI models. Internal emails reveal a shocking timeline:

October 2022 – Melanie Kambadur, Senior Research Manager:

I don't think we should use pirated material. I really need to draw a line there.

Eleonora Presani, Meta employee:

Using pirated material should be beyond our ethical threshold. SciHub, ResearchGate, LibGen are basically like PirateBay... they're distributing content that is protected by copyright and they're infringing it.

January 2023 – Meeting with Mark Zuckerberg present:

[Zuckerberg] wants to move this stuff forward, and we need to find a way to unblock all this.

April 2023 – Nikolay Bashlykov, Meta engineer:

Using Meta IP addresses to load through torrents pirate content... torrenting from a corporate laptop doesn't feel right.

2023-2024: The Operation

Meta downloaded:

  • 81.7 TB via Anna's Archive torrents (35.7 TB from Z-Library alone)
  • 80.6 TB from LibGen
  • Total: ~162 TB of pirated books

Method: BitTorrent client on separate infrastructure, VPN to obscure origin, active seeding to other peers. Result: 197,000 copyrighted books integrated into Llama training data.

June 2025: the ruling

Judge Vince Chhabria (Northern District California) applied the four-factor fair use test. The decision is legally fascinating and ethically disturbing.

Factor 1 – Transformative Use: Meta wins decisively. The judge ruled AI training is “spectacularly transformative”—fundamentally different from human reading. The purpose isn't to express the content but to learn statistical relationships between words.

Factor 2 – Nature of Work: Neutral. Creative fiction gets more copyright protection than factual works, but this didn't tip the scales either way.

Factor 3 – Amount Used: Meta wins. Even though they used entire books, the judge found this necessary for training. You can't cherry-pick sentences and expect an AI to learn language patterns.

Factor 4 – Market Effect: This is where the judge's discomfort shows through:

Generative AI has the potential to flood the market with endless amounts of images, songs, articles, books... So by training generative AI models with copyrighted works, companies are creating something that often will dramatically undermine the market for those works, and thus dramatically undermine the incentive for human beings to create things the old-fashioned way.

He sees the problem clearly. AI trained on copyrighted works will compete with and potentially destroy the market for those very works. But the plaintiffs couldn't prove specific economic harm with hard data.

The final ruling: “Given the state of the record, the Court has no choice but to grant summary judgment.” Meta wins on these specific facts. But the judge adds a critical caveat: “In most cases, training LLMs on copyrighted works without permission is likely infringing and not fair use.”

Meta didn't win because what they did was legitimate. They won because the authors' lawyers didn't build a strong enough evidentiary case. It's a technical legal victory that sidesteps the ethical question entirely.

The precedent this sets is chilling: AI companies can pirate with relative impunity if they have good lawyers and plaintiffs can't prove specific damages.

The math

Scenario A (legal):

  • Meta negotiates licenses with publishers
  • Cost: $50-100 million (conservative estimate)
  • Authors receive royalties

Scenario B (what they did):

  • Download 81.7 TB for free
  • Legal defense: ~$5 million
  • Win in court
  • Authors receive: $0

Meta's savings: $45-95 million

And now every AI company knows: download from Anna's Archive, risk a lawsuit with weak evidence, save tens of millions.

Anna's Archive also revealed they provide “SFTP bulk access to approximately 30 companies”—primarily Chinese LLM startups and data brokers—who contribute money or data. DeepSeek publicly admitted using Anna's Archive data for training. No consequences in Chinese jurisdiction.

Aaron Swartz and the question that haunts this story

There's a ghost here. His name is Aaron Swartz, and his story illuminates everything wrong with how we treat information access.

2011: Aaron, 24, brilliant programmer, Reddit co-founder, and information freedom activist, connected to MIT's network and downloaded 4.8 million academic papers from JSTOR. His intent was to make publicly-funded research freely available. He wasn't enriching himself. He was acting on principle.

The response was swift and brutal. Federal prosecutors threw the book at him: 13 felony charges, maximum penalty of 50 years in prison and $1 million in fines. For downloading academic papers. The prosecution was led by U.S. Attorney Carmen Ortiz, who declared that “stealing is stealing, whether you use a computer command or a crowbar.”

The pressure was immense. Aaron faced financial ruin, decades in prison, complete destruction of his life. In January 2013, at age 26, he hanged himself. His family and partner blamed the aggressive prosecution. The internet mourned a brilliant mind and passionate advocate crushed by prosecutorial overreach.

Now consider the parallel:

Aaron Swartz: 4.8 million papers → federal persecution, suicide at 26

Meta: 162 TB (~162 million papers) → wins in court, saves $95 million

Aaron was an individual acting on idealistic principles about information freedom. Meta is a trillion-dollar corporation acting on profit motives. Aaron faced the full weight of federal prosecution. Meta faced a civil lawsuit they successfully defended with their massive legal team.

The system punishes idealism and rewards profit. The disparity isn't just unjust—it reveals something fundamental about who gets to break rules and who doesn't.

The paradox no one wants to see

Anna's Archive claims to fight publishing monopolies and inequality in access to knowledge. But the reality:

Who benefits most?

  • Meta: 81.7 TB free, $95M saved
  • ~30 AI companies: privileged access
  • Corporations with $100M+ compute budgets

Resources needed to benefit:

  • Storage/Bandwidth: trivial for Meta ($1000s)
  • Computing for training: MASSIVE ($10-100M)
  • Legal defense: MASSIVE ($millions)

Only big tech can afford this. The result:

  • Data: socialized (Anna's Archive, shared risk)
  • Profits: privatized (proprietary LLMs, paid APIs)
  • Costs: externalized (authors not compensated)

But what about students in the Global South?

This is where the story gets complicated, because the benefits are real and they matter immensely.

Consider a medical student in India. Her family earns about $400/month. A single medical textbook costs $300-500. She needs fifteen of them. The math is impossible. Her options: don't graduate, or Anna's Archive. She chose the latter and completed her degree. She's now a practicing physician.

Or take a PhD researcher in South Africa studying climate change impacts. The critical papers for his dissertation are behind Elsevier's paywall at $35 each. He needs twenty papers minimum—$700 his university can't afford. Without Sci-Hub (accessible through Anna's Archive), his dissertation would have been impossible. He completed it, published findings that inform local climate policy.

An art history teacher in Argentina wanted to enrich her curriculum with Renaissance art analysis. The books she needed weren't available in local libraries. Importing them? Prohibitive between shipping costs and customs. Anna's Archive gave her access to rare texts that transformed her teaching.

The data backs this up: literature review times for researchers in developing countries have fallen by 60-80%. Citation patterns show researchers in Nigeria, Bangladesh, Ecuador now cite contemporary research at parity with Harvard and Oxford. Publications from developing countries have increased. Methodological quality has improved. International collaborations have expanded.

This matters. This changes lives. This is not hypothetical.

The problem is: both things are simultaneously true.

  1. Anna's Archive saves academic careers in the Global South
  2. Anna's Archive allows Meta to save $95 million

But Meta downloaded more data in one week than all Indian students download in a year. How do we square that?

The broken system that created this monster

To understand why Anna's Archive exists and why it's grown so explosively, you need to understand how fundamentally broken academic publishing has become.

Here's the perverse cycle:

  1. Researcher writes paper (unpaid)
  2. Other researchers peer review it (unpaid)
  3. Publisher publishes it
  4. Researcher's own university must pay to read it
  5. Publisher profits: Elsevier and Wiley report 35-40% profit margins

Today, over 70% of academic papers sit behind paywalls. Access costs $35-50 per paper for individuals, or $10,000-100,000+ per year for institutional subscriptions. Universities in developing countries simply cannot afford these subscriptions. Neither can most universities in developed countries—Harvard famously called journal subscription costs “fiscally unsustainable” in 2012.

The system extracts free labor from researchers, locks up publicly-funded research behind paywalls, charges exorbitant fees to access it, and funnels enormous profits to publishers who add relatively little value. Academic institutions create the knowledge, do the quality control, and then pay again to access their own work.

Sci-Hub and Anna's Archive didn't emerge from nowhere. They're responses to a genuinely broken system. The question is whether they're the right response—and who ultimately benefits most from that response.

The architecture determines the ethics

Anna's Archive can't discriminate because:

  1. Open source philosophy: everyone or no one
  2. Technical impossibility: how do you block Meta but not students?
  3. Legal strategy: claiming “non-hosting” makes usage control impossible

IPFS and BitTorrent are magnificent tools for resisting censorship. But resistance to censorship also means resistance to ethical control. You can't have one without the other.

The system is structurally designed to be unkillable. Which also means it's structurally designed to serve whoever has the resources to benefit most.

Where does it end?

December 2025: Anna's Archive announced they'd scraped Spotify. The same preservation narrative, the same pattern. 256 million tracks, 86 million audio files, 300TB available to anyone with the infrastructure to use it.

“This Spotify scrape is our humble attempt to start such a 'preservation archive' for music,” they wrote. The justification mirrors the books argument: Spotify loses licenses, music disappears; platform risk if Spotify fails; regional blocks prevent access; long tail poorly preserved.

All true. But who downloads 300TB of music? Not the kid in Malawi who just wants to listen to his favorite artist. ByteDance, training the next AI music generator. Startups building Spotify competitors. The same companies with compute budgets in the tens of millions.

Anna's Archive is pivoting from text to multimedia, and each escalation follows a predictable pattern:

  • Books → Justified by paywalls and academic access
  • Papers → Justified by broken academic publishing
  • Music → Justified by platform risk and preservation
  • Video? → What's the justification for the next step?

With each escalation:

  • The value for big tech increases exponentially
  • The proportion of benefit for individual students decreases
  • Mass piracy becomes normalized as “preservation”
  • The ethical questions get harder to answer

And the international precedent is already being set. Japan's AI Minister (January 2025) stated explicitly: “AI companies in Japan can use whatever they want for AI training... whether it is content obtained from illegal sites or otherwise.”

The message from governments: pirate freely if it serves AI supremacy. We're in a race to the bottom where copyright becomes meaningless for AI training, and the companies with the most resources benefit most.

Conclusions: I don't know which way to turn

I started from that sleepless night, 256 million songs in an RSS feed, and ended up here with more questions than answers.

Anna's Archive is a technological marvel—IPFS, BitTorrent, distributed databases creating something genuinely uncensorable. It's also a lifeline for millions of students and researchers locked out of knowledge by an exploitative publishing system. And simultaneously, it's the largest intellectual property expropriation operation in history, saving corporations hundreds of millions while creators receive nothing.

All of these things are true at once. This isn't a simple story with heroes and villains.

The academic publishing system is genuinely broken. Researchers create knowledge for free, review it for free, then their institutions must pay exorbitant fees to access it while publishers extract 35-40% profit margins. This system deserves to be disrupted.

But Anna's Archive isn't disrupting it equitably. The architecture that makes it uncensorable also makes it impossible to distinguish between a student in Lagos accessing a textbook and Meta downloading 162TB for AI training. You can't have selective resistance to censorship—it's all or nothing.

Aaron Swartz died fighting for information freedom with idealistic principles. Meta achieves the same result with corporate profit motives and walks away victorious. The system rewards power and punishes principle.

Can this be fixed? Copyright reform moves at the speed of politics—years, decades. Compulsory licensing for AI training? Just beginning to be discussed. Open access mandates? Facing massive publisher resistance. Meanwhile, Anna's Archive operates at the speed of software, and data flows freely to those with $100M compute clusters.

The question isn't whether Anna's Archive will be stopped—it won't be, that's the point of the architecture. The question is what world we're building where the same technology that liberates a medical student in India also bankrolls Meta's AI ambitions, and we can't separate one from the other.

I don't have answers. I have a functioning IPFS node, a Tor relay, and the uncomfortable knowledge that every byte I help distribute might be saving a researcher's career or training someone's proprietary AI model. Probably both.

Free for everyone. The problem is that “everyone” has very different resources to benefit from that freedom.

Now, if you'll excuse me, I'm going to check how much bandwidth my nodes are using. And reflect on whether participation is complicity or resistance. Maybe it's both. Maybe that's the point.

#AnnaArchive #AI #Copyright #AaronSwartz #Meta #AcademicPublishing #IPFS #InformationFreedom

 
Read more... Discuss...

from Unvarnished diary of a lill Japanese mouse

JOURNAL

29 December 2025

Grandma and grandpa have gone to bed, and we have the inn all to ourselves. We settled in around the hearth, lit three pieces of wood and a candle, and we're quietly warming up some sake. What a celebration! Suddenly we're back in the age of the shoguns, except for the cellphone screen; I'm going to turn it off, it doesn't fit the scene at all. We're happy, we'd like it to last like this for a thousand years; we know full well it's fleeting, so we make the most of it, we soak in it.

 
Read more...

from Olhar Convexo

WARNING: This text contains material that may be unsuitable for some readers.

Who would we be without our pleasures?

Well, let's look at the four basic pleasures every human being has. Every human being has the desire to have sex, to drink, to eat, and to sleep. After all, these are the desires that life requires: “reproduction, thirst, hunger, and sleep.”

In this text, the issue at hand is a different one. It is the hyperstimulation that ends up causing problems in a specific part of the brain called the prefrontal cortex. That part is responsible for governing attention.

As with any drug, addiction, or rather dependence, can become a disease, depending on how severe it is.

We have “allowed ourselves” to create a disease: cellphone addiction, known as nomophobia.

Like any dependence, nomophobia is being treated by medicine as a disease, which is indeed what should happen.

(Note: when we applied the law banning cellphone use in schools, in my view we were going to end up with nomophobic teenagers everywhere. And that is exactly what happened.)

Why do I bring up this subject in the middle of the four basic human desires?

Because it is the most pronounced one in our society today. And from it comes immediacy. Restlessness and ADHD (Attention Deficit Hyperactivity Disorder) also follow.

There is a segment of the revolutionary generation (now middle-aged) that believes we are “overdiagnosing,” especially ADHD, but also ASD (Autism Spectrum Disorder).

The revolutionary generation did not experience what is experienced today by the most affected generation, the .com generation (geração.com).

The .com generation has lived through peaks of dependence from excessive cellphone use; it has lived through peaks of technology use in general, and a large part of it has experienced, and still experiences, the hyperstimulation provided by short videos, Instagram Reels, and TikTok.

The act of swiping up to watch one kitten video after another is a disease! Especially because it is not two or three videos; it is 400 in a row, and the young person does not realise that the algorithm has already led them to watch 398 more videos than they intended.

Something new: today there are soap operas (I repeat, SOAP OPERAS) in reels format.

These soap operas are designed to create more immediacy and more dependence.

They have NO pauses between lines of dialogue, precisely because the .com generation could not stand the wait and would skip the video.

These soap operas are designed for addiction.

(Not that regular soap operas aren't.)

But the potential for addiction is extreme.

The health of the .com generation is seen as fragile, but this generation has its own problems, problems that were designed to affect it.

There is indeed a debate in the scientific community about whether we may be making many diagnoses without proper criteria, but at the same time more people are gaining access to specialist doctors and to information, which has become essential for questioning these “overdiagnoses.” The conclusion? More people are exposed to problems caused by cellphones, and a huge number of people have gained access to medical care, making the number of diagnoses grow exponentially. But the fact is, we are sicker than at any other time.

Nowadays, overcoming dependence on some drug is no longer the pinnacle.

The pinnacle is overcoming dependence on cellphone use.

Rio de Janeiro,

29 December 2025.

SOURCE: https://pubmed.ncbi.nlm.nih.gov/35253285/

 
Read more... Discuss...

from An Open Letter

I started a new workout routine, no longer doing my own but using a PPLUL split from the app I use. And holy shit, that leg day beat the fuck out of me. I feel good again. I think I miss that intensity and level of pain, and overcoming that helps so much. Wanting to quit and cut it short, but not actually doing it, helps me a lot.

 
Read more...

from Justina Revolution

I am quite fast on my feet and I am very, very dextrous in my foot placement. My sparring partners have always marveled at how easily I can traverse distances and remain just out of reach of their strikes.

I credit this to my Fut Gar training. I practice four short stepping drills that enable me to absolutely focus on dropping my weight and delivering power from every possible stance and position.

These Butterfly Drills contain the true essence of Southern Shaolin and have enhanced my fighting capabilities by forcing me to endlessly drill real positions over and over until finally I cannot get them wrong.

It improved my kickboxing and grappling abilities by enabling me to be stable even in the most awkward positions.

 
Read more... Discuss...

from The Europe–China Monitor

Which visas are not suitable for a China Internship?

Because internships involve work and work-related activities, they are treated as employment under Chinese immigration law. According to official Chinese government sources, lawful work in China requires a Z visa, a Foreigner’s Work Permit, and a work-type residence permit. This would automatically exclude those on the following visas from legally undertaking an internship in China:

The F (Visitor) Visa

The F visa is a non-commercial visa for foreigners entering China for exchanges, visits, or study tours. As it does not include work authorisation or permit income-generating activities, it cannot be used for employment or internships, which by nature involve work and remuneration. According to the Beijing authorities, a person holding an F visa cannot obtain an Employment Permit (Electronic Social Security Card) or a work-type residence permit, which are mandatory for working in China.

The M (Business) Visa

The China Business Visa (M visa) is issued to foreigners for commercial and trade activities, such as visiting clients, attending trade fairs, and meeting business partners — not for employment. In addition, on most China business (M) visas, holders can stay in China for a limited period (often 30–120 days per visit), and if longer continuous time in the country is needed, travelers may need to exit and re-enter or apply for an extension. The M Visa cannot be directly converted to a residence permit in China.

The L (Tourist) Visa

The China L Visa (Tourist Visa) is for foreigners visiting China for sightseeing, tourism, or visiting friends/relatives, allowing for short stays (often 30-90 days).

The X1 and X2 (Study) Visas

The X1 visa permits study exceeding 180 days, while the X2 visa is limited to study periods of less than 180 days. In some regions of China, students may apply for permission to engage in part-time work; however, this involves additional administrative requirements.

Such permission can only be applied for after arrival in China and requires formal approval and official documentation from the university or college, the employer, immigration and, in some cases, the municipal authorities. Another potential issue is that not all educational institutions or employers are authorised or willing to support applications for part-time work or internships.

Why the Z Visa Is the Best Option for a China Internship

The Z visa is widely regarded as the gold standard for internships and work-related activities in China. The key benefits of holding a Z visa for an internship include the following:

Why the China International Leadership Programme Is the Best Internship in China

The China International Leadership Programme offers applicants Z visa sponsorship, a work-type residence permit, and an Employment Permit (Electronic Social Security Card), allowing participants to legally work and receive remuneration in China.

In addition, the programme’s HSK-aligned Mandarin language lessons and immersion are delivered as work-related activities, supporting participants in carrying out the teaching and internship components more effectively and enabling clear, professional communication within a Chinese working environment.

https://youtu.be/3NL2hWs6XT0?si=WRNlK14cwlmeaoqN

© 2025 Europe China Monitor News Team

 
Read more... Discuss...

from wystswolf

The consequences of a touched eyeball are that you can run, but you cannot hide.

Wolfinwool · Isaiah 14-16

NARRATOR:

For Jehovah will show mercy to Jacob, and he will again choose Israel. He will settle them in their land, and the foreign residents will join them and attach themselves to the house of Jacob.

And peoples will take them and bring them to their own place, and the house of Israel will possess them as male and female servants in Jehovah’s land; and they will be the captors of those who held them captive, and they will have in subjection those who were forcing them to work.

In the day when Jehovah gives you rest from your pain and from your turmoil and from the hard slavery imposed on you, you will recite this proverb against the king of Babylon:


ISRAEL (PROVERB AGAINST THE KING OF BABYLON):

How the one forcing others to work has met his end! How the oppression has ended!

Jehovah has broken the rod of the wicked, the staff of the rulers, the one furiously striking peoples with unceasing blows, the one angrily subduing nations with relentless persecution.

The whole earth now rests, free of disturbance. People cry out for joy.

Even the juniper trees rejoice over you, along with the cedars of Lebanon. They say, ‘Ever since you have fallen, no woodcutter comes up against us.’

Even the Grave underneath is stirred up to meet you when you come. Because of you, it awakens those powerless in death, all the oppressive leaders of the earth. It makes all the kings of the nations rise from their thrones.

All of them speak up and say to you: ‘Have you also become weak like us? Have you become like us?

Down to the Grave your pride has been brought, the sound of your stringed instruments. Maggots are spread beneath you as a bed, and worms are your covering.’

How you have fallen from heaven, O shining one, son of the dawn! How you have been cut down to the earth, you who vanquished nations!

You said in your heart, ‘I will ascend to the heavens. Above the stars of God I will lift up my throne, and I will sit down on the mountain of meeting, in the remotest parts of the north. I will go up above the tops of the clouds; I will make myself resemble the Most High.’

Instead, you will be brought down to the Grave, to the remotest parts of the pit.

Those seeing you will stare at you; they will closely examine you, saying: ‘Is this the man who was shaking the earth, who made kingdoms tremble, who made the inhabited earth like the wilderness and overthrew its cities, who refused to let his prisoners go home?’

All other kings of the nations, yes, all of them, lie down in glory, each one in his own tomb.

But you are discarded without a grave, like a detested sprout, clothed with the slain who were stabbed with the sword, who go down to the stones of a pit, like a carcass trampled underfoot.

You will not join them in a grave, for you destroyed your own land, you killed your own people. The offspring of evildoers will never again be named.

Prepare a slaughtering block for his sons because of the guilt of their forefathers, so that they will not rise up and take over the earth and fill the land with their cities.


JEHOVAH OF ARMIES:

I will rise up against them. And I will wipe out from Babylon name and remnant and descendants and posterity.

And I will make her a possession of porcupines and a region of marshes, and I will sweep her with the broom of annihilation.


NARRATOR:

Jehovah of armies has sworn: “Just as I have intended, so it will occur, and just as I have decided, that is what will come true.

I will crush the Assyrian in my land, and I will trample him on my mountains. His yoke will be removed from them, and his load will be removed from their shoulder.”

This is what has been decided against all the earth, and this is the hand that is stretched out against all the nations.

For Jehovah of armies has decided, and who can thwart it? His hand is stretched out, and who can turn it back?

In the year that King Ahaz died, this pronouncement was made:


JEHOVAH (PRONOUNCEMENT AGAINST PHILISTIA):

Do not rejoice, Philistia, any of you, just because the staff of the one striking you has been broken. For from the root of the serpent will come a poisonous snake, and its offspring will be a flying fiery snake.

While the firstborn of the lowly feed and the poor lie down in security, I will put your root to death with famine, and what is left of you will be killed.

Wail, O gate! Cry out, O city! All of you will lose heart, O Philistia! For a smoke is coming from the north, and there are no stragglers in his ranks.

How should they answer the messengers of the nation? That Jehovah has laid the foundation of Zion, and that the lowly ones of his people will take refuge in her.


CHAPTER 15

NARRATOR (PRONOUNCEMENT AGAINST MOAB):

Because it has been devastated in a night, Ar of Moab has been silenced. Because it has been devastated in a night, Kir of Moab has been silenced.

He has gone up to the House and to Dibon, to the high places to weep. Moab wails over Nebo and over Medeba. Every head is shaved bald, every beard is clipped.

In its streets they have put on sackcloth. On their roofs and in their public squares they all wail; they go down weeping.

Heshbon and Elealeh cry out; their voice is heard as far as Jahaz. That is why the armed men of Moab keep shouting. He is trembling.

My heart cries out over Moab. Its fugitives have fled as far as Zoar and Eglath-shelishiyah. On the ascent of Luhith they weep as they go up; on the way to Horonaim they cry out over the catastrophe.

For the waters of Nimrim are desolate; the green grass has dried up, the grass is gone and nothing green is left.

That is why they are carrying away what is left of their stores and their riches; they are crossing the valley of poplars.

For the outcry echoes throughout the territory of Moab. The wailing reaches to Eglaim; the wailing reaches to Beer-elim.

For the waters of Dimon are full of blood, and I have more in store for Dimon: a lion for those of Moab who escape and for those remaining in the land.


CHAPTER 16

NARRATOR:

Send a ram to the ruler of the land, from Sela through the wilderness to the mountain of the daughter of Zion.

Like a bird chased away from its nest, so the daughters of Moab will be at the fords of Arnon.


COUNSEL TO MOAB:

Offer counsel, carry out the decision. Make your shadow at high noon like the night. Conceal the dispersed and do not betray those fleeing.

May my dispersed ones reside in you, O Moab. Become a place of concealment to them because of the destroyer. The oppressor will reach his end, the destruction will come to an end, and those trampling others down will perish from the earth.

Then a throne will be firmly established in loyal love. The one who sits on it in the tent of David will be faithful; he will judge fairly and will swiftly execute righteousness.


NARRATOR:

We have heard about the pride of Moab—he is very proud— his haughtiness and his pride and his fury; but his empty talk will come to nothing.

So Moab will wail for Moab; they will all wail. Those who are stricken will moan for the raisin cakes of Kir-hareseth.

For the terraces of Heshbon have withered, the vine of Sibmah. The rulers of the nations have trampled its bright-red branches; they had reached as far as Jazer; they had extended into the wilderness. Its shoots had spread out and gone as far as the sea.

That is why I will weep over the vine of Sibmah as I weep for Jazer. With my tears I will drench you, O Heshbon and Elealeh, because the shouting over your summer fruit and your harvest has ended.

Rejoicing and joyfulness have been taken away from the orchard, and there are no songs of joy or shouting in the vineyards. The treader no longer treads out wine in the presses, for I have caused the shouting to cease.

That is why deep within me I am boisterous over Moab, like the strumming of a harp, and my innermost being over Kir-hareseth.

Even when Moab wears himself out on the high place and goes to pray in his sanctuary, he will accomplish nothing.

This is the word that Jehovah previously spoke concerning Moab.

And now Jehovah says: “Within three years, like the years of a hired worker, the glory of Moab will be disgraced with much tumult of every sort, and those who remain will be very few and insignificant.”

 
Read more... Discuss...

from Justina Revolution

I did my 5 phase routine with Loosening, Cosmos Palm, Silk Reeling, and Swimming Dragon Baguazhang. This was so good as the sun rose behind me. I am increasing my power, my flexibility, my meditative abilities, and my body, mind, and spirit senses.

Weaving energy around my body, spreading my awareness from horizon to horizon. Generating stillness in both limited and unlimited forms. This is glorious. I am generating a world of benefits and my evolution, the activation of my DNA upgrades all beings in the multiverse.

There is no separation. It’s all one thing. I did the Monroe guided portal meditation last night. I know this energy of the portal. It is Akasha and I am joined with all beings in that beautiful pregnant void.

The Void is not emptiness or annihilation. It is the pregnant field from whence all things arise and to which all things return. This is my reality. As solid and true as my fist. Nothing is ever gone. Nothing is ever lost. There is no past and no future because there is no time. There is no loss because there is no space. Nothing can come to you or leave you. It is all here right now in this very moment.

 
Read more... Discuss...

from Stefan Angrick

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This fourth module covers critical considerations when building and deploying ML models in the real world, including productionisation best practices, automation, and responsible engineering.

Production ML systems

Introduction

The model is only a small part of real-world production ML systems. It often represents only 5% or less of the total codebase in the system.

[Image: MlSystem.png]

Source: Production ML systems | Machine Learning | Google for Developers

Static versus dynamic training

Machine learning models can be trained statically (once) or dynamically (continuously).

Static training (offline training):
  • Advantages: Simpler. You only need to develop and test the model once.
  • Disadvantages: Sometimes stale. Can become outdated if data patterns change, requiring data monitoring.

Dynamic training (online training):
  • Advantages: More adaptable. Keeps up with changes in data patterns, providing more accurate predictions.
  • Disadvantages: More work. You must build, test, and release a new product continuously.

Choosing between static and dynamic training depends on the specific dataset and how frequently it changes.

Monitoring input data is essential for both static and dynamic training to ensure reliable predictions.
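As an illustration that is not part of the course, here is a minimal scikit-learn sketch contrasting the two modes: a static model fitted once on a fixed dataset, versus a dynamic model that keeps updating as new batches arrive (the data here is random and purely for shape).

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

    # Static (offline) training: fit once, then freeze and ship the model.
    static_model = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)

    # Dynamic (online) training: keep updating as fresh batches stream in.
    dynamic_model = SGDClassifier(loss="log_loss", random_state=0)
    dynamic_model.partial_fit(X, y, classes=np.array([0, 1]))
    for _ in range(10):                              # e.g. one new batch per day
        X_new = rng.normal(size=(100, 5))
        y_new = rng.integers(0, 2, size=100)
        dynamic_model.partial_fit(X_new, y_new)      # model keeps up with new data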

Source: Production ML systems: Static versus dynamic training | Machine Learning | Google for Developers

Static versus dynamic inference

Inference involves using a trained model to make predictions on unlabelled examples, and it can be done as follows:

  • Static inference (offline inference, batch inference) generates predictions in advance and caches them, which suits scenarios where prediction speed is critical.

  • Dynamic inference (online inference, real-time inference) generates predictions on demand, offering flexibility for diverse inputs.

Static inference (offline inference, batch inference):
  • Advantages: No need to worry about the cost of inference; allows post-verification of predictions before pushing.
  • Disadvantages: Limited ability to handle uncommon inputs.

Dynamic inference (online inference, real-time inference):
  • Advantages: Can infer a prediction on any new item as it comes in.
  • Disadvantages: Compute-intensive and latency-sensitive; monitoring needs are intensive.

Choosing between static and dynamic inference depends on factors such as model complexity, desired prediction speed, and the nature of the input data.

Static inference is advantageous when cost and prediction verification are prioritised, while dynamic inference excels in handling diverse, real-time predictions.
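A minimal sketch of the difference, with a toy stand-in for a trained model (the model and items are placeholders, not part of the course):

    class TinyModel:
        """Stand-in for a trained model."""
        def predict(self, features):
            return sum(features) > 1.0

    model = TinyModel()
    known_items = {"item_a": [0.2, 0.9], "item_b": [0.1, 0.3]}

    # Static (batch) inference: predict everything up front, serve from a cache.
    cache = {item_id: model.predict(f) for item_id, f in known_items.items()}

    def serve_static(item_id):
        return cache[item_id]            # fast and verifiable offline; misses new items

    # Dynamic (online) inference: compute the prediction when the request arrives.
    def serve_dynamic(features):
        return model.predict(features)   # handles any new input, but adds latency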

Source: Production ML systems: Static versus dynamic inference | Machine Learning | Google for Developers

When to transform data?

Feature engineering can be performed before or during model training, each with its own advantages and disadvantages.

  • Transforming data before training allows for a one-time transformation of the entire dataset but requires careful recreation of transformations during prediction to avoid training-serving skew.
  • Transforming data during training ensures consistency between training and prediction but can increase model latency and complicate batch processing.
    • When transforming data during training, considerations such as Z-score normalisation across batches with varying distributions need to be addressed.
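For example, a common way to avoid training-serving skew with Z-score normalisation (a sketch, not code from the course) is to compute the statistics once on the training data and reuse exactly the same values at serving time:

    import numpy as np

    train = np.array([12.0, 15.0, 9.0, 20.0, 14.0])   # raw training feature values

    # Compute the transformation parameters on the training set only.
    mean, std = train.mean(), train.std()

    def z_score(x):
        """Apply the identical transformation at training and at serving time."""
        return (x - mean) / std

    train_scaled = z_score(train)     # used to train the model
    serving_value = z_score(17.0)     # same formula and statistics when serving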

Source: Production ML systems: When to transform data? | Machine Learning | Google for Developers

Deployment testing

Deploying a machine learning model involves validating data, features, model versions, serving infrastructure, and pipeline integration.

Reproducible model training involves deterministic seeding, fixed initialisation order, averaging multiple runs, and using version control.
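A minimal sketch of deterministic seeding in Python and NumPy (TensorFlow and PyTorch offer analogous tf.random.set_seed and torch.manual_seed calls):

    import os
    import random

    import numpy as np

    SEED = 42

    def set_all_seeds(seed: int = SEED) -> None:
        """Seed every source of randomness the pipeline uses."""
        os.environ["PYTHONHASHSEED"] = str(seed)   # hash-based randomness
        random.seed(seed)                          # Python's built-in RNG
        np.random.seed(seed)                       # NumPy's global RNG
        # If a framework is used, seed it here as well, e.g.:
        # tf.random.set_seed(seed) or torch.manual_seed(seed)

    set_all_seeds()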

Integration tests ensure that different components of the ML pipeline work together seamlessly and should run continuously and for new model or software versions.

Before serving a new model, validate its quality by checking for sudden and gradual degradations against previous versions and fixed thresholds.

Ensure model-infrastructure compatibility by staging the model in a sandboxed server environment to avoid dependency conflicts.

Source: Production ML systems: Deployment testing | Machine Learning | Google for Developers

Monitoring pipelines

ML pipeline monitoring involves validating data (using data schemas) and features (using unit tests), tracking real-world metrics, and addressing potential biases in data slices.

Monitoring training-serving skew, label leakage, model age, and numerical stability is crucial for maintaining pipeline health and model performance.

  • Training-serving skew means that input data during training differs from input data during serving, for example because training and serving data use different schemas (schema skew) or because engineered data differs between training and serving (feature skew).
  • Label leakage means that the ground truth labels being predicted have inadvertently entered the training features.
  • Numerical stability involves writing tests to check for NaN and Inf values in weights and layer outputs, and testing that more than half of the outputs of a layer are not zero.
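As a sketch of what such numerical-stability tests might look like (assuming weights and layer outputs are available as NumPy arrays):

    import numpy as np

    def check_finite(weights: np.ndarray) -> None:
        """Fail fast if any weight is NaN or Inf."""
        assert np.isfinite(weights).all(), "found NaN or Inf in weights"

    def check_layer_outputs(outputs: np.ndarray, max_zero_fraction: float = 0.5) -> None:
        """Fail if more than half of a layer's outputs are zero (e.g. dead ReLUs)."""
        zero_fraction = float(np.mean(outputs == 0))
        assert zero_fraction <= max_zero_fraction, (
            f"{zero_fraction:.0%} of layer outputs are zero")

    # Example usage with dummy values:
    check_finite(np.array([[0.1, -0.3], [0.7, 0.0]]))
    check_layer_outputs(np.array([0.0, 1.2, 0.4, 0.0, 0.9]))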

Live model quality testing uses methods such as human labelling and statistical analysis to ensure ongoing model effectiveness in real-world scenarios.

Implementing proper randomisation through deterministic data generation enables reproducible experiments and consistent analysis.

Maintaining invariant hashing ensures that data splits remain consistent across experiments, contributing to reliable analysis and model evaluation.
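One common way to get such an invariant split (a sketch, not the course's exact recipe) is to hash a stable key such as an example ID and bucket on the hash, so the same example always lands in the same split, on any machine and in any run:

    import hashlib

    def split_for(example_id: str, test_percent: int = 20) -> str:
        """Deterministically assign an example to 'train' or 'test' by its ID."""
        digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100           # stable bucket in [0, 100)
        return "test" if bucket < test_percent else "train"

    # The same ID always maps to the same split, across runs and machines.
    assert split_for("user_123/item_456") == split_for("user_123/item_456")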

Source: Production ML systems: Monitoring pipelines | Machine Learning | Google for Developers

Questions to ask

Continuously monitor models in production to evaluate feature importance and potentially remove unnecessary features, ensuring prediction quality and resource efficiency.

  • Regularly assess whether features are genuinely helpful and whether their value outweighs the cost of inclusion.

Data reliability is crucial. Consider data source stability, potential changes in upstream data processes, and the creation of local data copies to control versioning and mitigate risks.

Be aware of feedback loops, where a model's predictions influence future input data, potentially leading to unexpected behaviour or biased outcomes, especially in interconnected systems.

Source: Production ML systems: Questions to ask | Machine Learning | Google for Developers

Automated machine learning

Introduction

AutoML automates tasks in the machine learning workflow, such as data engineering (feature selection and engineering), training (algorithm selection and hyperparameter tuning), and analysis, making model building faster and easier.

[Image: ml-workflow.png]

While manual training involves writing code and iteratively adjusting it, AutoML reduces repetitive work and the need for specialised skills.

Source: Automated Machine Learning (AutoML) | Google for Developers

Benefits and limitations

Benefits:

  • To save time.
  • To improve the quality of an ML model.
  • To build an ML model without needing specialised skills.
  • To smoke test a dataset. AutoML can give quick baseline estimates of whether a dataset has enough signal relative to noise.
  • To evaluate a dataset. AutoML can help determine which features may be worth using.
  • To enforce best practices. Automation includes built-in support for applying ML best practices.

Limitations:

  • Model quality may not match that of manual training.
  • Model search and complexity can be opaque. Models generated with AutoML are difficult to reproduce manually.
  • Multiple AutoML runs may show greater variance.
  • Models cannot be customised during training.

Large amounts of data are generally required for AutoML, although specialised systems using transfer learning (taking a model trained on one task and adapting its learned representations to a different but related task) can reduce this requirement.

AutoML suits teams with limited ML experience or those seeking productivity gains without customisation needs. Custom (manual) training suits cases where model quality and customisation matter most.

Source: AutoML: Benefits and limitations | Machine Learning | Google for Developers

Getting started

AutoML tools fall into two categories:

  • Tools that require no coding.
  • API and CLI tools.

The AutoML workflow follows steps similar to traditional machine learning, including problem definition, data gathering, preparation, model development, evaluation, and potential retraining.

  • Some AutoML systems also support model deployment.

Data preparation is crucial for AutoML and involves labelling, cleaning and formatting data, and applying feature transformations.

No-code AutoML tools guide users through model development with steps such as data import, analysis, refinement, and configuration of run parameters before starting the automated training process.

  • Users still need to carry out semantic checks to select the appropriate semantic type for each feature (for example recognising that postal codes are categorical rather than numeric), and to set transformations accordingly.

Source: AutoML: Getting started | Machine Learning | Google for Developers

Fairness

Introduction

Before putting a model into production, it is critical to audit training data and evaluate predictions for bias.

Source: Fairness | Machine Learning | Google for Developers

Types of bias

Machine learning models can be susceptible to bias due to human involvement in data selection and curation.

Understanding common human biases is crucial for mitigating their impact on model predictions.

Types of bias include reporting bias, historical bias, automation bias, selection bias, coverage bias, non-response bias, sampling bias, group attribution bias (in-group bias and out-group homogeneity bias), implicit bias, confirmation bias, and experimenter's bias, among others.

Source: Fairness: Types of bias | Machine Learning | Google for Developers

Identifying bias

Missing or unexpected feature values in a dataset can indicate potential sources of bias.

Data skew, where certain groups are under- or over-represented, can introduce bias and should be addressed.

Evaluating model performance by subgroup ensures fairness and equal performance across different characteristics.

Source: Fairness: Identifying bias | Machine Learning | Google for Developers

Mitigating bias

Machine learning engineers use two primary strategies to mitigate bias in models:

  • Augmenting training data.
  • Adjusting the model's loss function.

Augmenting training data involves collecting additional data to address missing, incorrect, or skewed data, but it can be infeasible due to data availability or resource constraints.

Adjusting the model's loss function involves using fairness-aware optimisation functions rather than the common default log loss.

The TensorFlow Model Remediation Library provides optimisation functions designed to penalise errors in a fairness-aware manner:

  • MinDiff aims to balance errors between different data slices by penalising differences in prediction distributions.
  • Counterfactual Logit Pairing (CLP) penalises discrepancies in predictions for similar examples with different sensitive attribute values.

Source: Fairness: Mitigating bias | Machine Learning | Google for Developers

Evaluating for bias

Aggregate model performance metrics such as precision, recall, and accuracy can hide biases against minority groups.

Fairness in model evaluation involves ensuring equitable outcomes across different demographic groups.

Fairness metrics can help assess model predictions for bias.

  • Demographic parity
  • Equality of opportunity
  • Counterfactual fairness

Candidate pool of 100 students: 80 students belong to the majority group (blue), and 20 students belong to the minority group (orange).

[Image: fairness_metrics_candidate_pool.png]

Source: Fairness: Evaluating for bias | Machine Learning | Google for Developers

Demographic parity

Demographic parity aims to ensure equal acceptance rates for majority and minority groups, regardless of individual qualifications.

Both the majority (blue) and minority (orange) groups have an acceptance rate of 20%.

[Image: fairness_metrics_demographic_parity.png]

While demographic parity promotes equal representation, it can overlook differences in individual qualifications within each group, potentially leading to unfair outcomes.

Qualified students in both groups are shaded in green, and qualified students who were rejected are marked with an X.

[Image: fairness_metrics_demographic_parity_by_qualifications.png]

Majority acceptance rate = Qualified majority accepted / Qualified majority = 16/35 = 46%
Minority acceptance rate = Qualified minority accepted / Qualified minority = 4/15 = 27%

When the distribution of a preferred label (“qualified”) differs substantially between groups, demographic parity may not be the most appropriate fairness metric.

There may be additional benefits/drawbacks of demographic parity not discussed here that are also worth considering.

Source: Fairness: Demographic parity | Machine Learning | Google for Developers

Equality of opportunity

Equality of opportunity focuses on ensuring that qualified individuals have an equal chance of acceptance, regardless of demographic group.

Qualified students in both groups are shaded in green.

[Image: fairness_metrics_equality_of_opportunity_by_qualifications.png]

Majority acceptance rate = Qualified majority accepted / Qualified majority = 14/35 = 40%
Minority acceptance rate = Qualified minority accepted / Qualified minority = 6/15 = 40%
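Both metrics can be read straight off the group-level counts. A small sketch using the numbers from this running example (80/20 candidates, 35/15 qualified, 16 and 4 acceptances under the demographic-parity outcome, 14 and 6 under the equality-of-opportunity outcome):

    def acceptance_rate(accepted: int, total: int) -> float:
        return accepted / total

    # Demographic parity compares overall acceptance rates per group.
    print(acceptance_rate(16, 80), acceptance_rate(4, 20))    # 0.20 vs 0.20: satisfied

    # Equality of opportunity compares acceptance rates among qualified candidates.
    print(acceptance_rate(16, 35), acceptance_rate(4, 15))    # ~0.46 vs ~0.27: not satisfied
    print(acceptance_rate(14, 35), acceptance_rate(6, 15))    # 0.40 vs 0.40: satisfied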

Equality of opportunity has limitations, including reliance on a clearly defined preferred label and challenges in settings that lack demographic data.

It is possible for a model to satisfy both demographic parity and equality of opportunity under specific conditions where positive prediction rates and true positive rates align across groups.

Source: Fairness: Equality of opportunity | Machine Learning | Google for Developers

Counterfactual fairness

Counterfactual fairness evaluates fairness by comparing predictions for similar individuals who differ only in a sensitive attribute such as demographic group.

This metric is particularly useful when datasets lack complete demographic information for most examples but contain it for a subset.

Candidate pool, with demographic group membership unknown for most candidates (icons shaded in grey).

[Image: fairness_metrics_counterfactual_satisfied.png]

Counterfactual fairness may not capture broader systemic biases across subgroups. Other fairness metrics, such as demographic parity and equality of opportunity, provide a more holistic view but may require complete demographic data.

Summary

Selecting the appropriate fairness metric depends on the specific application and desired outcome, with no single “right” metric universally applicable.

For example, if the goal is to achieve equal representation, demographic parity may be the optimal metric. If the goal is to achieve equal opportunity, equality of opportunity may be the best metric.

Some definitions of fairness are mutually incompatible.

Source: Fairness: Counterfactual fairness | Machine Learning | Google for Developers

 
Read more...

from Stefan Angrick

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This third module covers advanced ML model architectures.

Neural networks

Introduction

Neural networks are a model architecture designed to automatically identify non-linear patterns in data, eliminating the need for manual feature cross experimentation.

Source: Neural networks | Machine Learning | Google for Developers

Nodes and hidden layers

In neural network terminology, additional layers between the input layer and the output layer are called hidden layers, and the nodes in these layers are called neurons.

[Image: HiddenLayerBigPicture.png]

Source: Neural networks: Nodes and hidden layers | Machine Learning | Google for Developers

Activation functions

Each neuron in a neural network performs the following two-step action:

  • Calculates the weighted sum of input values.
  • Applies an activation function to that sum.

Common activation functions include sigmoid, tanh, and ReLU.

The sigmoid function maps input x to an output value between 0 and 1:

$$ F(x) = \frac{1}{1 + e^{-x}} $$

[Image: sigmoid.png]

The tanh function (short for “hyperbolic tangent”) maps input x to an output value between -1 and 1:

$$ F(x) = \tanh{(x)} $$

[Image: tanh.png]

The rectified linear unit activation function (or ReLU, for short) applies a simple rule:

  • If the input value is less than 0, return 0.
  • If the input value is greater than or equal to 0, return the input value.

$$ F(x) = \max{(0,x)} $$

ReLU often outperforms sigmoid and tanh because it reduces vanishing gradient issues and requires less computation.

[Image: relu.png]
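All three activation functions are one-liners in NumPy; a small sketch for reference:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))    # output in (0, 1)

    def tanh(x):
        return np.tanh(x)                  # output in (-1, 1)

    def relu(x):
        return np.maximum(0.0, x)          # 0 for negative inputs, identity otherwise

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(x), tanh(x), relu(x))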

A neural network consists of:

  • A set of nodes, analogous to neurons, organised in layers.
  • A set of learned weights and biases connecting layers.
  • Activation functions that transform each node's output, which may differ across layers.

Source: Neural networks: Activation functions | Machine Learning | Google for Developers

Training using backpropagation

Backpropagation is the primary training algorithm for neural networks. It applies the chain rule of calculus to calculate how much each weight and bias in the network contributed to the overall prediction error, working backwards from the output layer so that gradient descent knows how much to adjust each parameter to reduce the loss.

In practice, this involves a forward pass, where the network makes a prediction and the loss function measures the error, followed by a backward pass that propagates that error back through the layers to compute gradients for each parameter.
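A minimal NumPy sketch of one forward and backward pass for a single sigmoid neuron with squared-error loss, just to make the chain rule concrete (an illustration, not the full algorithm for deep networks):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.2])    # one training example
    y = 1.0                      # its label
    w = np.array([0.1, 0.4])     # weights
    b = 0.0                      # bias

    # Forward pass: prediction and loss.
    z = w @ x + b
    y_hat = sigmoid(z)
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: chain rule from the loss back to each parameter.
    grad_z = (y_hat - y) * y_hat * (1.0 - y_hat)   # dL/dy_hat * dy_hat/dz
    grad_w = grad_z * x                            # dz/dw = x
    grad_b = grad_z * 1.0                          # dz/db = 1

    # One gradient descent step.
    learning_rate = 0.1
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b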

Best practices for neural network training:

  • Vanishing gradients occur when gradients in earlier layers become very small, slowing or stalling training, and can be mitigated by using the ReLU activation function.
  • Exploding gradients happen when large weights cause excessively large gradients in early layers, disrupting convergence, and can be addressed with batch normalisation or by lowering the learning rate.
  • Dead ReLU units emerge when a ReLU unit's output gets stuck at 0, halting gradient flow during backpropagation, and can be avoided by lowering the learning rate or using ReLU variants like LeakyReLU.
  • Dropout regularisation is a technique to prevent overfitting by randomly dropping unit activations in a network for a single gradient step, with higher dropout rates indicating stronger regularisation (0 = no regularisation, 1 = drop out all nodes).

Source: Neural Networks: Training using backpropagation | Machine Learning | Google for Developers

Multi-class classification

Multi-class classification models predict from multiple possibilities (binary classification models predict just two).

Multi-class classification can be achieved through two main approaches:

  • One-vs.-all
  • One-vs.-one (softmax)

One-vs.-all uses multiple binary classifiers, one for each possible outcome, to determine the probability of each class independently. one_vs_all_binary_classifiers.png

This approach is fairly reasonable when the total number of classes is small.

We can create a more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. one_vs_all_neural_net.png

Note that the probabilities do not sum to 1. With a one-vs.-all approach, the probability of each binary set of outcomes is determined independently of all the other sets (the sigmoid function is applied to each output node independently).

One-vs.-one (softmax) predicts the probability of each class relative to all other classes, using the softmax function in the output layer to assign decimal probabilities that sum to 1.0. This additional constraint helps training converge more quickly.

Note that the softmax layer must have the same number of nodes as the output layer. one_vs_one_neural_net.png

The softmax formula extends logistic regression to multiple classes: $$ p(y = j|\textbf{x}) = \frac{e^{(\textbf{w}_j^{T}\textbf{x} + b_j)}}{\sum_{k\in K} e^{(\textbf{w}_k^{T}\textbf{x} + b_k)}} $$
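
A minimal NumPy sketch of this formula (the max-subtraction is a standard numerical-stability trick, not something the formula itself requires):

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability; the result is unchanged.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])   # w_j^T x + b_j for each class j
probs = softmax(logits)
print(probs, probs.sum())            # probabilities sum to 1.0
```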

Full softmax is fairly cheap when the number of classes is small but can become computationally expensive with many classes.

Candidate sampling offers an alternative for increased efficiency. It computes probabilities for all positive labels but only a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we do not have to provide probabilities for every non-dog example.

One label versus many labels

Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For multi-label problems, use multiple independent logistic regressions instead.

Example: To classify dog breeds from images, including mixed-breed dogs, use one-vs.-all, since it predicts each breed independently and can assign high probabilities to multiple breeds, unlike softmax, which forces probabilities to sum to 1.

Source: Neural networks: Multi-class classification | Machine Learning | Google for Developers

Embeddings

Introduction

Embeddings are lower-dimensional representations of sparse data that address problems associated with one-hot encodings.

A one-hot encoded feature “meal” of 5,000 popular meal items: food_images_one_hot_encodings.png

This representation of data has several problems:

  • Large input vectors mean a huge number of weights for a neural network.
  • The more weights in your model, the more data you need to train effectively.
  • The more weights, the more computation required to train and use the model.
  • The more weights in your model, the more memory is needed on the accelerators that train and serve it.
  • Poor suitability for on-device machine learning (ODML).

Embeddings, lower-dimensional representations of sparse data, address these issues.

Source: Embeddings | Machine Learning | Google for Developers

Embedding space and static embeddings

Embeddings are low-dimensional representations of high-dimensional data, often used to capture semantic relationships between items.

Embeddings place similar items closer together in the embedding space, allowing for efficient machine learning on large datasets.

Example of a 1D embedding of a sparse feature vector representing meal items: embeddings_1D.png

2D embedding: embeddings_2D.png

3D embedding: embeddings_3D_tangyuan.png

Distances in the embedding space represent relative similarity between items.

Real-world embeddings can encode complex relationships, such as those between countries and their capitals, allowing models to detect patterns.

In practice, embedding spaces have many more than three dimensions, although far fewer than the original data, and the meaning of individual dimensions is often unclear.

Embeddings usually are task-specific, but one task with broad applicability is predicting the context of a word.

Static embeddings like word2vec represent all meanings of a word with a single point, which can be a limitation in some cases. When each word or data point has a single embedding vector, this is called a static embedding.

word2vec can refer both to an algorithm for obtaining static word embeddings and to a set of word vectors that were pre-trained with that algorithm.

Source: Embeddings: Embedding space and static embeddings | Machine Learning | Google for Developers

Obtaining embeddings

Embeddings can be created using dimensionality reduction techniques such as PCA or by training them as part of a neural network.

Training an embedding within a neural network allows customisation for specific tasks, where the embedding layer learns optimal weights to represent data in a lower-dimensional space, but it may take longer than training the embedding separately.

In general, you can create a hidden layer of size d in your neural network that is designated as the embedding layer, where d represents both the number of nodes in the hidden layer and the number of dimensions in the embedding space. one_hot_hot_dog_embedding.png
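
Concretely, an embedding layer is just a learned weight matrix: multiplying a one-hot input by that matrix selects a single row, which is the item's d-dimensional embedding. A minimal sketch with made-up sizes and random (untrained) weights:

```python
import numpy as np

vocab_size, d = 5000, 8                      # 5,000 meal items, 8-dimensional embedding
embedding_matrix = np.random.default_rng(0).normal(size=(vocab_size, d))

item_index = 1234                            # index of one meal item (illustrative)
one_hot = np.zeros(vocab_size)
one_hot[item_index] = 1.0

# Multiplying by the one-hot vector is equivalent to a simple row lookup.
via_matmul = one_hot @ embedding_matrix
via_lookup = embedding_matrix[item_index]
print(np.allclose(via_matmul, via_lookup))   # True
```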

Word embeddings, such as word2vec, leverage the distributional hypothesis to map semantically similar words to geometrically close vectors. However, such static word embeddings have limitations because they assign a single representation per word.

Contextual embeddings offer multiple representations based on context. For example, “orange” would have a different embedding for every unique sentence containing the word in the dataset (as it could be used as a colour or a fruit).

Contextual embeddings encode positional and surrounding-context information, while static embeddings do not. As a result, one token can have multiple contextual embedding vectors, whereas a static embedding allows only a single representation of each token.

Methods for creating contextual embeddings include ELMo, BERT, and transformer models with a self-attention layer.

Source: Embeddings: Obtaining embeddings | Machine Learning | Google for Developers

Large language models

Introduction

A language model estimates the probability of a token or sequence of tokens given surrounding text, enabling tasks such as text generation, translation, and summarisation.

Tokens, the atomic units of language modelling, represent words, subwords, or characters and are crucial for understanding and processing language.

Example: “unwatched” would be split into three tokens: un (the prefix), watch (the root), ed (the suffix).

N-grams are ordered sequences of words used to build language models, where N is the number of words in the sequence.

Short N-grams capture too little information, while very long N-grams fail to generalise due to insufficient repeated examples in training data (sparsity issues).

Recurrent neural networks improve on N-grams by processing sequences token by token and learning which past information to retain or discard, allowing them to model longer dependencies across sentences and gain more context.

  • Note that training recurrent neural networks for long contexts is constrained by the vanishing gradient problem.

Model performance depends on training data size and diversity.

While recurrent neural networks improve context understanding compared to N-grams, they have limitations, paving the way for the emergence of large language models that evaluate the whole context simultaneously.

Source: Large language models | Machine Learning | Google for Developers

What's a large language model?

Large language models (LLMs) predict sequences of tokens and outperform previous models because they use far more parameters and exploit much wider context.

Transformers form the dominant architecture for LLMs and typically combine an encoder that converts input text into an intermediate representation with a decoder that generates output text, for example translating between languages. TransformerBasedTranslator.png

Partial transformers

Encoder-only models focus on representation learning and embeddings (which may serve as input for another system), while decoder-only models specialise in generating long sequences such as dialogue or text continuations.

Self-attention allows the model to weigh the importance of different words in relation to each other, enhancing context understanding.

Example: “The animal didn't cross the street because it was too tired.”

The self-attention mechanism determines the relevance of each nearby word to the pronoun “it”: the bluer the line, the more important that word is to the pronoun. As shown, “animal” is more important than “street” to “it”. Theanimaldidntcrossthestreet.png

  • Some self-attention mechanisms are bidirectional, meaning they calculate relevance scores for tokens preceding and following the word being attended to. This is useful for generating representations of whole sequences (encoders).
  • By contrast, a unidirectional self-attention mechanism can gather context only from words on one side of the word being attended to. This suits applications that generate sequences token by token (decoders).

Multi-head multi-layer self-attention

Each self-attention layer contains multiple self-attention heads. The output of a layer is a mathematical operation (such as a weighted average or dot product) of the outputs of the different heads.

A complete transformer model stacks multiple self-attention layers. The output from one layer becomes the input for the next, allowing the model to build increasingly complex representations, from basic syntax to more nuanced concepts.

Self-attention is an O(N^2 * S * D) problem.

  • N is the number of tokens in the context.
  • S is the number of self-attention layers.
  • D is the number of heads per layer.
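
To make the N^2 term concrete, here is a minimal NumPy sketch of a single self-attention head (scaled dot-product attention). The (N, N) score matrix is what grows quadratically with context length, and this cost repeats for every head in every layer. Sizes and weights are made up.

```python
import numpy as np

def self_attention_head(X, Wq, Wk, Wv):
    # X: (N tokens, d_model); Wq/Wk/Wv: learned projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (N, N) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # context-weighted values

rng = np.random.default_rng(0)
N, d_model, d_head = 6, 16, 4                        # 6 tokens in the context
X = rng.normal(size=(N, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention_head(X, Wq, Wk, Wv).shape)      # (6, 4)
```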

LLMs are trained using masked predictions on massive datasets, enabling them to learn patterns and generate text based on probabilities. You probably will never train an LLM from scratch.

Instruction tuning can improve an LLM's ability to follow instructions.

Why transformers are so large

This course generally recommends building models with a smaller number of parameters, but research shows that transformers with more parameters consistently achieve better performance.

Text generation

LLMs generate text by repeatedly predicting the most probable next token, effectively acting as highly powerful autocomplete systems. You can think of a user's question to an LLM as the “given” sentence followed by a masked response.

Benefits and problems

While LLMs offer benefits such as clear text generation, they also present challenges.

  • Training an LLM involves gathering enormous training sets, consuming vast computational resources and electricity, and solving parallelism challenges.
  • Using an LLM for inference raises issues such as hallucinations, high computational and electricity costs, and bias.

Source: LLMs: What's a large language model? | Machine Learning | Google for Developers

Fine-tuning, distillation, and prompt engineering

General-purpose LLMs, also known as foundation LLMs, base LLMs, or pre-trained LLMs, are pre-trained on vast amounts of text, enabling them to understand language structure and generate creative content, but they act as platforms rather than complete solutions for tasks such as classification or regression.

Fine-tuning updates the parameters of a model to improve its performance on a specialised task, improving prediction quality.

  • Adapts a foundation LLM to a specific task by training on task-specific examples, often only hundreds or thousands, which improves performance for that task but retains the original model size (same number of parameters) and can still be computationally expensive.
  • Parameter-efficient tuning reduces fine-tuning costs by updating only a subset of model parameters during training rather than all weights and biases.

Distillation aims to reduce model size, typically at the cost of some prediction quality.

  • Distillation compresses an LLM into a smaller student model that runs faster and uses fewer resources, at the cost of some predictive accuracy.
  • It typically uses a large teacher model to label data, often with rich numerical scores rather than simple labels, and trains a smaller student model on those outputs.

Prompt engineering allows users to customise an LLM's output by providing examples or instructions within the prompt, leveraging the model's existing pattern-recognition abilities without changing its parameters.

One-shot, few-shot, and zero-shot prompting differ by how many examples the prompt provides, with more examples usually improving reliability by giving clearer context.

Prompt engineering does not alter the model's parameters. Prompts leverage the pattern-recognition abilities of the existing LLM.

Offline inference pre-computes and caches LLM predictions for tasks where real-time response is not critical, saving resources and enabling the use of larger models.

Responsible use of LLMs requires awareness that models inherit biases from their training and distillation data.

Source: LLMs: Fine-tuning, distillation, and prompt engineering | Machine Learning | Google for Developers

 
Read more...

from Stefan Angrick

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This second module covers fundamental techniques and best practices for working with machine learning data.

Working with numerical data

Introduction

Numerical data: Integers or floating-point values that behave like numbers. They are additive, countable, ordered, and so on. Examples include temperature, weight, or the number of deer wintering in a nature preserve.

Source: Working with numerical data | Machine Learning | Google for Developers

How a model ingests data with feature vectors

A machine learning model ingests data through floating-point arrays called feature vectors, which are derived from dataset features. Feature vectors often utilise processed or transformed values instead of raw dataset values to enhance model learning.

Example of a feature vector: [0.13, 0.47]

Feature engineering is the process of converting raw data into suitable representations for the model. Common techniques are:

  • Normalization: Converting numerical values into a standard range.
  • Binning (bucketing): Converting numerical values into buckets or ranges.

Non-numerical data like strings must be converted into numerical values for use in feature vectors.

Source: Numerical data: How a model ingests data using feature vectors | Machine Learning | Google for Developers

First steps

Before creating feature vectors, it is crucial to analyse numerical data to detect anomalies and patterns and to identify potential issues early, for example by:

  • Visualising it through plots and graphs (like scatter plots or histograms)
  • Calculating basic statistics like mean, median, standard deviation, or values at the quartile divisions (0th, 25th, 50th, 75th, 100th percentiles, where the 50th percentile is the median)

Outliers, values significantly distant from others, should be identified and handled appropriately.

  ‱ If the outlier is due to a mistake (for example, an experimenter incorrectly entered data, or an instrument malfunctioned), we generally delete the examples containing it.
  ‱ If the outlier is a legitimate data point, keep it when the model needs to infer good predictions on such outliers; otherwise, delete it or apply more invasive feature engineering techniques, such as clipping.

A dataset probably contains outliers when:

  • The delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles
  • The standard deviation is almost as high as the mean

Source: Numerical data: First steps | Machine Learning | Google for Developers

Normalization

Data normalization is crucial for enhancing machine learning model performance by scaling features to a similar range. It is also recommended to normalise a single numeric feature that covers a wide range (for example, city population).

Normalisation has the following benefits:

  • Helps a model converge more quickly.
  • Helps models infer better predictions.
  • Helps avoid the NaN trap (large numerical values exceeding the floating-point precision limit and flipping into NaN values).
  • Helps the model learn appropriate weights (so the model does not pay too much attention to features with wide ranges).

| Normalization technique | Formula | When to use |
| --- | --- | --- |
| Linear scaling | $$x'=\frac{x-x_\text{min}}{x_\text{max}-x_\text{min}}$$ | When the feature is roughly uniformly distributed across its range (flat-shaped) |
| Z-score scaling | $$x' = \frac{x-\mu}{\sigma}$$ | When the feature is roughly normally distributed, with a peak close to the mean (bell-shaped) |
| Log scaling | $$x'=\ln(x)$$ | When the feature distribution is heavily skewed, with a long tail (power-law shaped) |
| Clipping | If $$x > \text{max}$$, set $$x'=\text{max}$$; if $$x < \text{min}$$, set $$x'=\text{min}$$ | When the feature contains extreme outliers |
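
A minimal NumPy sketch of the four techniques applied to an illustrative feature column (the values and the clipping bounds are made up):

```python
import numpy as np

x = np.array([12.0, 15.0, 14.0, 300.0, 13.0, 16.0])   # illustrative feature values

linear = (x - x.min()) / (x.max() - x.min())            # linear scaling to [0, 1]
zscore = (x - x.mean()) / x.std()                        # z-score scaling
log_scaled = np.log(x)                                   # log scaling (requires x > 0)
clipped = np.clip(x, a_min=10.0, a_max=20.0)             # clipping to [10, 20]

print(linear.round(3))
print(zscore.round(3))
print(log_scaled.round(3))
print(clipped)
```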

Source: Numerical data: Normalization | Machine Learning | Google for Developers

Binning

Binning (bucketing) is a feature engineering technique used to group numerical data into categories (bins). In many cases, this turns numerical data into categorical data.

For example, if a feature X has values ranging from 15 to 425, we can apply binning to represent X as a feature vector divided into specific intervals:

Bin number Range Feature vector
1 15-34 [1.0, 0.0, 0.0, 0.0, 0.0]
2 35-117 [0.0, 1.0, 0.0, 0.0, 0.0]
3 118-279 [0.0, 0.0, 1.0, 0.0, 0.0]
4 280-392 [0.0, 0.0, 0.0, 1.0, 0.0]
5 393-425 [0.0, 0.0, 0.0, 0.0, 1.0]

Even though X is a single column in the dataset, binning causes a model to treat X as five separate features. Therefore, the model learns separate weights for each bin.
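
A minimal NumPy sketch of the binning in the table above, using np.digitize to assign each value to a bin and then one-hot encoding the bin index:

```python
import numpy as np

x = np.array([20, 50, 130, 300, 410])              # illustrative values of feature X
edges = np.array([35, 118, 280, 393])              # bin boundaries from the table above

bin_index = np.digitize(x, edges)                  # bin index 0..4 for each value
one_hot_bins = np.eye(5)[bin_index]                # each value becomes a 5-element feature vector
print(bin_index)                                   # [0 1 2 3 4]
print(one_hot_bins)
```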

Binning offers an alternative to scaling or clipping and is particularly useful for handling outliers and improving model performance on non-linear data.

When to use: Binning works well when features exhibit a “clumpy” distribution, that is, the overall linear relationship between the feature and label is weak or nonexistent, or when feature values are clustered.

Example: Number of shoppers versus temperature. By binning them, the model learns separate weights for each bin. binning_temperature_vs_shoppers_divided_into_3_bins.png

While creating multiple bins is possible, it is generally recommended to avoid an excessive number, as this can lead to insufficient training examples per bin and increased feature dimensionality.

Quantile bucketing is a specific binning technique that ensures each bin contains a roughly equal number of examples, which can be particularly useful for datasets with skewed distributions.

  • Quantile buckets give extra information space to the large torso while compacting the long tail into a single bucket.
  • Equal intervals give extra information space to the long tail while compacting the large torso into a single bucket. QuantileBucketing.png

Source: Numerical data: Binning | Machine Learning | Google for Developers

Scrubbing

Problem category Example
Omitted values A census taker fails to record a resident's age
Duplicate examples A server uploads the same logs twice
Out-of-range feature values A human accidentally types an extra digit
Bad labels A human evaluator mislabels a picture of an oak tree as a maple

You can use programs or scripts to identify and handle data issues such as omitted values, duplicates, and out-of-range feature values by removing or correcting them.

Source: Numerical data: Scrubbing | Machine Learning | Google for Developers

Qualities of good numerical features

  • Good feature vectors require features that are clearly named and have obvious meanings to anyone on the project.
  • Data should be checked and tested for bad data or outliers, such as inappropriate values, before being used for training.
  • Features should be sensible, avoiding “magic values” that create discontinuities (for example, setting the value “watch_time_in_seconds” to -1 to indicate an absence of measurement); instead, use separate boolean features or new discrete values to indicate missing data.

Source: Numerical data: Qualities of good numerical features | Machine Learning | Google for Developers

Polynomial transformations

Synthetic features, such as polynomial transforms, enable linear models to represent non-linear relationships by introducing new features based on existing ones.

By incorporating synthetic features, linear regression models can effectively separate data points that are not linearly separable, using curves instead of straight lines. For example, we can separate two classes with y = x^2. ft_cross1.png

Feature crosses, a related concept for categorical data, synthesise new features by combining existing features, further enhancing model flexibility.

Source: Numerical data: Polynomial transforms | Machine Learning | Google for Developers

Working with categorical data

Introduction

Categorical data has a specific set of possible values. Examples include species of animals, names of streets, whether or not an email is spam, and binned numbers.

Categorical data can include numbers that behave like categories. An example is postal codes.

  ‱ Unlike true numerical data, such values cannot be meaningfully multiplied (doubling a postal code is meaningless).
  ‱ Integer values that behave like identifiers rather than quantities should therefore be represented as categorical data.

Encoding means converting categorical or other data to numerical vectors that a model can train on.

Preprocessing includes converting non-numerical data, such as strings, to floating-point values.

Source: Working with categorical data | Machine Learning | Google for Developers

Vocabulary and one-hot encoding

Machine learning models require numerical input; therefore, categorical data such as strings must be converted to numerical representations.

The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low dimensional. For example:

Feature name # of categories Sample categories
snowed_today 2 True, False
skill_level 3 Beginner, Practitioner, Expert
season 4 Winter, Spring, Summer, Autumn
dayofweek 7 Monday, Tuesday, Wednesday
planet 8 Mercury, Venus, Earth
car_colour 8 Red, Orange, Blue, Yellow

When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. This treats each category as a separate feature, allowing the model to learn distinct weights for each during training.

One-hot encoding transforms categorical values into numerical vectors (arrays) of N elements, where N is the number of categories. Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.

Feature Red Orange Blue Yellow Green Black Purple Brown
“Red” 1 0 0 0 0 0 0 0
“Orange” 0 1 0 0 0 0 0 0
“Blue” 0 0 1 0 0 0 0 0
“Yellow” 0 0 0 1 0 0 0 0
“Green” 0 0 0 0 1 0 0 0
“Black” 0 0 0 0 0 1 0 0
“Purple” 0 0 0 0 0 0 1 0
“Brown” 0 0 0 0 0 0 0 1

It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The model learns a separate weight for each element of the feature vector.

The end-to-end process to map categories to feature vectors: vocabulary-index-sparse-feature.png

In a true one-hot encoding, only one element has the value 1.0. In a variant known as multi-hot encoding, multiple values can be 1.0.

A feature whose values are predominantly zero (or empty) is termed a sparse feature.

Sparse representation efficiently stores one-hot encoded data by only recording the position of the '1' value to reduce memory usage.

  • For example, the one-hot vector for “car_colour” “Blue” is: [0, 0, 1, 0, 0, 0, 0, 0].
  • Since the 1 is in position 2 (when starting the count at 0), the sparse representation is: 2.

Notice that the sparse representation consumes far less memory. Importantly, the model must train on the one-hot vector, not the sparse representation.

The sparse representation of a multi-hot encoding stores the positions of all the non-zero elements. For example, the sparse representation of a car that is both “Blue” and “Black” is 2, 5.
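
A minimal sketch of the one-hot, sparse, and multi-hot representations for the “car_colour” example (illustrative code, not from the course):

```python
import numpy as np

colours = ["Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"]
index = {c: i for i, c in enumerate(colours)}      # vocabulary: category -> index

def one_hot(category):
    v = np.zeros(len(colours))
    v[index[category]] = 1.0
    return v

print(one_hot("Blue"))                             # [0. 0. 1. 0. 0. 0. 0. 0.]
print(index["Blue"])                               # sparse representation: 2

# Multi-hot: a car that is both "Blue" and "Black".
multi_hot = one_hot("Blue") + one_hot("Black")
print(multi_hot)                                   # 1.0 at positions 2 and 5
print(np.flatnonzero(multi_hot))                   # sparse representation: [2 5]
```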

Categorical features can have outliers. If “car_colour” includes rare values such as “Mauve” or “Avocado”, you can group them into one out-of-vocabulary (OOV) category. All rare colours go into this single bucket, and the model learns one weight for it.

For high-dimensional categorical features with many categories, one-hot encoding might be inefficient, and embeddings or hashing (also called the hashing trick) are recommended.

  • For example, a feature like “words_in_english” has around 500,000 categories.
  • Embeddings substantially reduce the number of dimensions, which helps the model train faster and infer predictions more quickly.

Source: Categorical data: Vocabulary and one-hot encoding | Machine Learning | Google for Developers

Common issues with categorical data

Categorical data quality hinges on how categories are defined and labelled, impacting data reliability.

Human-labelled data, known as “gold labels”, is generally preferred for training due to its higher quality, but it is essential to check for human errors and biases.

  • Any two human beings may label the same example differently. The difference between human raters' decisions is called inter-rater agreement.
  • Inter-rater agreement can be measured using kappa and intra-class correlation (Hallgren, 2012), or Krippendorff's alpha (Krippendorff, 2011).

Machine-labelled data, or “silver labels”, can introduce biases or inaccuracies, necessitating careful quality checks and awareness of potential common-sense violations.

  ‱ For example, a computer-vision model might mislabel a photo of a chihuahua as a muffin, or a photo of a muffin as a chihuahua.
  • Similarly, a sentiment analyser that scores neutral words as -0.25, when 0.0 is the neutral value, might be scoring all words with an additional negative bias.

High dimensionality in categorical data increases training complexity and costs, leading to techniques such as embeddings for dimensionality reduction.

Source: Categorical data: Common issues | Machine Learning | Google for Developers

Feature crosses

Feature crosses are created by combining two or more categorical or bucketed features to capture interactions and non-linearities within a dataset.

For example, consider a leaf dataset with the categorical features:

  • “edges”, containing values {smooth, toothed, lobed}
  • “arrangement”, containing values {opposite, alternate}

The feature cross, or Cartesian product, of these two features would be:

{Smooth_Opposite, Smooth_Alternate, Toothed_Opposite, Toothed_Alternate, Lobed_Opposite, Lobed_Alternate}

For example, if a leaf has a lobed edge and an alternate arrangement, the feature-cross vector will have a value of 1 for “Lobed_Alternate”, and a value of 0 for all other terms:

{0, 0, 0, 0, 0, 1}

This dataset could be used to classify leaves by tree species, since these characteristics do not vary within a species.
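
A minimal sketch of building this feature cross and one-hot encoding a single leaf (illustrative code, not from the course):

```python
from itertools import product

edges = ["Smooth", "Toothed", "Lobed"]
arrangement = ["Opposite", "Alternate"]

# The Cartesian product of the two vocabularies gives the crossed feature's categories.
cross = [f"{e}_{a}" for e, a in product(edges, arrangement)]
print(cross)
# ['Smooth_Opposite', 'Smooth_Alternate', 'Toothed_Opposite',
#  'Toothed_Alternate', 'Lobed_Opposite', 'Lobed_Alternate']

# One-hot encode a leaf with a lobed edge and an alternate arrangement.
leaf = "Lobed_Alternate"
feature_vector = [1 if c == leaf else 0 for c in cross]
print(feature_vector)                              # [0, 0, 0, 0, 0, 1]
```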

Feature crosses are somewhat analogous to polynomial transforms.

Feature crosses can be particularly effective when guided by domain expertise. It is often possible, though computationally expensive, to use neural networks to automatically find and apply useful feature combinations during training.

Overuse of feature crosses with sparse features should be avoided, as it can lead to excessive sparsity in the resulting feature set. For example, if feature A is a 100-element sparse feature and feature B is a 200-element sparse feature, a feature cross of A and B yields a 20,000-element sparse feature.

Source: Categorical data: Feature crosses | Machine Learning | Google for Developers

Datasets, generalization, and overfitting

Introduction

  • Data quality significantly impacts model performance more than algorithm choice.
  • Machine learning practitioners typically dedicate a substantial portion of their project time (around 80%) to data preparation and transformation, including tasks such as dataset construction and feature engineering.

Source: Datasets, generalization, and overfitting | Machine Learning | Google for Developers

Data characteristics

A machine learning model's performance is heavily reliant on the quality and quantity of the dataset it is trained on, with larger, high-quality datasets generally leading to better results.

Datasets can contain various data types, including numerical, categorical, text, multimedia, and embedding vectors, each requiring specific handling for optimal model training.

The following are common causes of unreliable data in datasets:

  • Omitted values
  • Duplicate examples
  • Bad feature values
  • Bad labels
  • Bad sections of data

Maintaining data quality involves addressing issues such as label errors, noisy features, and proper filtering to ensure the reliability of the dataset for accurate predictions.

Incomplete examples with missing feature values should be handled by either deletion or imputation to avoid negatively impacting model training.

When imputing missing values, use reliable methods such as mean/median imputation and consider adding an indicator column to signal imputed values to the model. For example, alongside temperature include “temperature_is_imputed”. This lets the model learn to trust real observations more than imputed ones.

Source: Datasets: Data characteristics | Machine Learning | Google for Developers

Labels

Direct labels are generally preferred but often unavailable.

  • Direct labels exactly match the prediction target and appear explicitly in the dataset, such as a “bicycle_owner” column for predicting bicycle ownership.
  • Proxy labels approximate the target and correlate with it, such as a bicycle magazine subscription as a signal of bicycle ownership.

Use a proxy label when no direct label exists or when the direct concept resists easy numeric representation. Carefully evaluate proxy labels to ensure they are a suitable approximation.

Human-generated labels, while offering flexibility and nuanced understanding, can be expensive to produce and prone to errors, requiring careful quality control.

Models can train on a mix of automated and human-generated labels, but an extra set of human labels often adds complexity without sufficient benefit.

Source: Datasets: Labels | Machine Learning | Google for Developers

Imbalanced datasets

Imbalanced datasets occur when one label (majority class) is significantly more frequent than another (minority class), potentially hindering model training on the minority class.

Note: Accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset.

A highly imbalanced floral dataset containing far more sunflowers (200) than roses (2): FloralDataset200Sunflowers2Roses.png

During training, a model should learn two things:

  • What each class looks like, that is, what feature values correspond to which class.
  • How common each class is, that is, what the relative distribution of the classes is.

Standard training conflates these two goals. In contrast, a two-step technique of downsampling and upweighting the majority class separates these two goals, enabling the model to achieve both.

Step 1: Downsample the majority class by training on only a small fraction of majority class examples, which makes an imbalanced dataset more balanced during training and increases the chance that each batch contains enough minority examples.

For example, with a class-imbalanced dataset consisting of 99% majority class and 1% minority class examples, we could downsample the majority class by a factor of 25 to create a more balanced training set (80% majority class and 20% minority class).

Downsampling the majority class by a factor of 25: FloralDatasetDownsampling.png

Step 2: Upweight the downsampled majority class by the same factor used for downsampling, so each majority class error counts proportionally more during training. This corrects the artificial class distribution and bias introduced by downsampling, because the training data no longer reflects real-world frequencies.

Continuing the example from above, we must upweight the majority class by a factor of 25. That is, when the model makes a prediction error on a majority class example, treat the loss as if it were 25 errors (multiply the regular loss by 25).

Upweighting the majority class by a factor of 25: FloralDatasetUpweighting.png
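
A minimal NumPy sketch of both steps on a made-up 99%/1% dataset; the resulting weights would typically be passed to the training loss as per-example weights (for instance, via a sample_weight argument where a library supports one):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 9900 + [1] * 100)       # 99% majority (0), 1% minority (1)

factor = 25
majority_idx = np.flatnonzero(labels == 0)
minority_idx = np.flatnonzero(labels == 1)

# Step 1: downsample the majority class by the factor.
kept_majority = rng.choice(majority_idx, size=len(majority_idx) // factor, replace=False)
train_idx = np.concatenate([kept_majority, minority_idx])

# Step 2: upweight the kept majority examples by the same factor.
weights = np.where(labels[train_idx] == 0, float(factor), 1.0)

print(len(kept_majority), len(minority_idx))    # 396 vs 100 (~80% / 20%)
print(weights.sum(), len(labels))               # weighted count matches the original dataset size
```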

Experiment with different downsampling and upweighting factors just as you would experiment with other hyperparameters.

Benefits of this technique include a better model (the resultant model knows what each class looks like and how common each class is) and faster convergence.

Source: Datasets: Class-imbalanced datasets | Machine Learning | Google for Developers

Dividing the original dataset

Machine learning models should be tested against unseen data.

It is recommended to split the dataset into three subsets: training, validation, and test sets. PartitionThreeSets.png

The validation set is used for initial testing during training (to determine hyperparameter tweaks, add, remove, or transform features, and so on), and the test set is used for final evaluation. workflow_with_validation_set.png
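
One common way to produce the three subsets is to split twice; the sketch below uses scikit-learn's train_test_split with an illustrative 70/15/15 split (the ratios and data are made up, and the course does not prescribe a specific tool):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)      # illustrative features
y = np.arange(1000) % 2                 # illustrative labels

# First carve out the test set, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=150, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 700 150 150
```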

The validation and test sets can “wear out” with repeated use. For this reason, it is a good idea to collect more data to “refresh” the test and validation sets.

A good test set is:

  • Large enough to yield statistically significant results
  • Representative of the dataset as a whole
  • Representative of real-world data the model will encounter (if your model performs poorly on real-world data, determine how your dataset differs from real-life data)
  • Free of duplicates from the training set

In theory, the validation set and test set should contain the same number of examples, or nearly so.

Source: Datasets: Dividing the original dataset | Machine Learning | Google for Developers

Transforming data

Machine learning models require all data, including features such as street names, to be transformed into numerical (floating-point) representations for training.

Normalisation improves model training by converting existing floating-point features to a constrained range.

When dealing with large datasets, select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions. Safeguard privacy by omitting examples containing personally identifiable information.

Source: Datasets: Transforming data | Machine Learning | Google for Developers

Generalization

Generalisation refers to a model's ability to perform well on new, unseen data.

Source: Generalization | Machine Learning | Google for Developers

Overfitting

Overfitting means creating a model that matches the training set so closely that the model fails to make correct predictions on new data.

Generalization is the opposite of overfitting. That is, a model that generalises well makes good predictions on new data.

An overfit model is analogous to an invention that performs well in the lab but is worthless in the real world. An underfit model is like a product that does not even do well in the lab.

Overfitting can be detected by observing diverging loss curves for training and validation sets on a generalization curve (a graph that shows two or more loss curves). A generalization curve for a well-fit model shows two loss curves that have similar shapes.

Common causes of overfitting include:

  • A training set that does not adequately represent real-life data (or the validation set or test set).
  • A model that is too complex.

Dataset conditions for good generalization include:

  • Examples must be independently and identically distributed, which is a fancy way of saying that your examples cannot influence each other.
  • The dataset is stationary, meaning it does not change significantly over time.
  • The dataset partitions have the same distribution, meaning the examples in the training set, validation set, test set, and real-world data are statistically similar.

Source: Overfitting | Machine Learning | Google for Developers

Model complexity

Simpler models often generalise better to new data than complex models, even if they perform slightly worse on training data.

Occam's Razor favours simpler explanations and models.

Model training should minimise both loss and complexity for optimal performance on new data. $$ \text{minimise}(\text{loss + complexity}) $$

Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases.

Regularisation techniques help prevent overfitting by penalising model complexity during training.

  ‱ L1 regularisation (also called LASSO) measures model complexity as the sum of the absolute values of the model's weights.
  ‱ L2 regularisation (also called ridge regularisation) measures model complexity as the sum of the squares of the model's weights.

Source: Overfitting: Model complexity | Machine Learning | Google for Developers

L2 regularization

L2 regularisation is a popular regularisation metric to reduce model complexity and prevent overfitting. It uses the following formula: $$ L_2 \text{ regularisation} = w^2_1 + w^2_2 + \ldots + w^2_n $$

It penalises especially large weights.

L2 regularisation encourages weights towards 0, but never pushes them all the way to zero.

A regularisation rate (lambda) controls the strength of regularisation. $$ \text{minimise}(\text{loss} + \lambda \text{ complexity}) $$

  • A high regularisation rate reduces the likelihood of overfitting and tends to produce a histogram of model weights that are normally distributed around 0.
  • A low regularisation rate lowers the influence of regularisation and tends to produce a histogram of model weights with a flat distribution.

Tuning is required to find the ideal regularisation rate.
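
A minimal sketch of the regularised objective, with made-up weights, data loss, and regularisation rate:

```python
import numpy as np

def l2_penalty(weights):
    # L2 regularisation term: sum of squared weights.
    return np.sum(np.square(weights))

def regularised_loss(data_loss, weights, lam):
    # minimise(loss + lambda * complexity)
    return data_loss + lam * l2_penalty(weights)

w = np.array([0.5, -2.0, 0.1, 3.0])
print(l2_penalty(w))                        # 13.26
print(regularised_loss(1.7, w, lam=0.01))   # 1.8326
```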

Early stopping is an alternative regularisation method that involves ending training before the model fully converges to prevent overfitting. It usually increases training loss but decreases test loss. It is a quick but rarely optimal form of regularisation.

Learning rate and regularisation rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero, while a high regularisation rate pulls weights towards zero. The goal is to find the equilibrium.

Source: Overfitting: L2 regularization | Machine Learning | Google for Developers

Interpreting loss curves

An ideal loss curve looks like this: metric-curve-ideal.png

To improve an oscillating loss curve:

  • Reduce the learning rate.
  • Reduce the training set to a tiny number of trustworthy examples.
  • Check your data against a data schema to detect bad examples, then remove the bad examples from the training set. metric-curve-ex03.png

Possible reasons for a loss curve with a sharp jump include:

  • The input data contains a burst of outliers.
  • The input data contains one or more NaNs (for example, a value caused by a division by zero). metric-curve-ex02.png

Test loss diverges from training loss when:

  • The model overfits the training set. metric-curve-ex01.png

The loss curve gets stuck when:

  • The training set is not shuffled well. metric-curve-ex05.png

Source: Overfitting: Interpreting loss curves | Machine Learning | Google for Developers

 
Read more...

from Stefan Angrick

This post is part of a four-part summary of Google's Machine Learning Crash Course. For context, check out this post. This first module covers the fundamentals of building regression and classification models.

Linear regression

Introduction

The linear regression model uses an equation $$ y' = b + w_1x_1 + w_2x_2 + \ldots $$ to represent the relationship between features and the label.

  • y' is the predicted label—the output
  • b is the bias of the model (the y-intercept in algebraic terms), sometimes referred to as w_0
  • w_1 is the weight of the feature (the slope in algebraic terms)
  • x_1 is a feature—the input

The label y and the features x are given; the bias b and the weights w are learned during training by minimising the difference between predicted and actual values.

Source: Linear regression | Machine Learning | Google for Developers

Loss

Loss is a numerical value indicating the difference between a model's predictions and the actual values.

The goal of model training is to minimize loss, bringing it as close to zero as possible.

Loss type Definition Equation
L1 loss The sum of the absolute values of the difference between the predicted values and the actual values. $$\sum |\text{actual value}-\text{predicted value}|$$
Mean absolute error (MAE) The average of L1 losses across a set of N examples. $$\frac{1}{N}\sum |\text{actual value}-\text{predicted value}|$$
L2 loss The sum of the squared difference between the predicted values and the actual values. $$\sum (\text{actual value}-\text{predicted value})^2$$
Mean squared error (MSE) The average of L2 losses across a set of N examples. $$\frac{1}{N}\sum (\text{actual value}-\text{predicted value})^2$$

The most common methods for calculating loss are Mean Absolute Error (MAE) and Mean Squared Error (MSE), which differ in their sensitivity to outliers.

A model trained with MSE sits closer to the outliers but farther from most of the other data points. model-mse.png

A model trained with MAE is farther from the outliers but closer to most of the other data points. model-mae.png
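
A minimal NumPy sketch comparing MAE and MSE on made-up values that include one outlier; squaring makes the outlier dominate MSE:

```python
import numpy as np

actual = np.array([10.0, 12.0, 11.0, 50.0])      # last value is an outlier
predicted = np.array([11.0, 11.0, 12.0, 20.0])

mae = np.mean(np.abs(actual - predicted))        # mean absolute error (L1-based)
mse = np.mean((actual - predicted) ** 2)         # mean squared error (L2-based)
print(mae)   # 8.25   -> the outlier contributes linearly
print(mse)   # 225.75 -> the outlier dominates because errors are squared
```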

Source: Linear regression: Loss | Machine Learning | Google for Developers

Gradient descent

Gradient descent is an iterative optimisation algorithm used to find the best weights and bias for a linear regression model by minimising the loss function.

  1. Calculate the loss with the current weight and bias.
  2. Determine the direction to move the weights and bias that reduce loss.
  3. Move the weight and bias values a small amount in the direction that reduces loss.
  4. Return to step one and repeat the process until the model can't reduce the loss any further.
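
The loop below is a minimal sketch of these four steps for a one-feature linear regression trained with MSE; the data and hyperparameters are made up for illustration.

```python
import numpy as np

# Toy data roughly following y = 2x + 1 with noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

w, b = 0.0, 0.0
learning_rate = 0.01

for step in range(2000):
    y_pred = w * x + b                 # step 1: compute predictions and loss
    error = y_pred - y
    dw = 2 * np.mean(error * x)        # step 2: gradient of MSE w.r.t. the weight
    db = 2 * np.mean(error)            #         and the bias
    w -= learning_rate * dw            # step 3: move a small amount downhill
    b -= learning_rate * db            # step 4: repeat

print(round(w, 2), round(b, 2))        # close to the true values 2.0 and 1.0
```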

A model is considered to have converged when further iterations do not significantly reduce the loss, indicating it has found the weights and bias that produce the lowest possible loss.

Loss curves visually represent the model's progress during training, showing how the loss decreases over iterations and helping to identify convergence.

Linear models have convex loss functions, ensuring that gradient descent will always find the global minimum, resulting in the best possible model for the given data.

Source: Linear regression: Gradient descent | Google for Developers

Hyperparameters

Hyperparameters, such as learning rate, batch size, and epochs, are external configurations that influence the training process of a machine learning model.

The learning rate determines the step size during gradient descent, impacting the speed and stability of convergence.

  • If the learning rate is too low, the model can take a long time to converge.
  • However, if the learning rate is too high, the model never converges, but instead bounces around the weights and bias that minimise the loss.

Batch size dictates the number of training examples processed before updating model parameters, influencing training speed and noise.

  • When a dataset contains hundreds of thousands or even millions of examples, using the full batch isn't practical.
  • Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent.
    • Stochastic gradient descent uses only a single random example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy.
    • Mini-batch stochastic gradient descent is a compromise between full-batch and SGD. For N number of data points, the batch size can be any number greater than 1 and less than N. The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.

Model trained with SGD: noisy-gradient.png

Model trained with mini-batch SGD: mini-batch-sgd.png

Epochs represent the number of times the entire training dataset is used during training, affecting model performance and training time.

  • For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch.

Source: Linear regression: Hyperparameters | Machine Learning | Google for Developers

Logistic regression

Introduction

Logistic regression is a model used to predict the probability of an outcome, unlike linear regression which predicts continuous numerical values.

Logistic regression models output probabilities, which can be used directly or converted to binary categories.

Source: Logistic Regression | Machine Learning | Google for Developers

Calculating a probability with the sigmoid function

A logistic regression model uses a linear equation and the sigmoid function to calculate the probability of an event.

The sigmoid function ensures the output of logistic regression is always between 0 and 1, representing a probability. $$ f(x) = \frac{1}{1 + e^{-x}} $$ sigmoid_function_with_axes.png

Linear component of a logistic regression model: $$ z = b + w_1 x_1 + w_2 x_2 + \ldots + w_N x_N $$ To obtain the logistic regression prediction, the z value is then passed to the sigmoid function, yielding a value (a probability) between 0 and 1: $$ y' = \frac{1}{1+e^{-z}} $$

  • y' is the output of the logistic regression model.
  • z is the linear output (as calculated in the preceding equation).

z is referred to as the log-odds because if you solve the sigmoid function for z you get: $$ z = \log(\frac{y}{1-y}) $$ This is the log of the ratio of the probabilities of the two possible outcomes: y and 1 - y.

When the linear equation becomes input to the sigmoid function, it bends the straight line into an s-shape. linear_to_logistic.png

Source: Logistic regression: Calculating a probability with the sigmoid function | Machine Learning | Google for Developers

Loss and regularisation

Logistic regression models are trained similarly to linear regression models but use Log Loss instead of squared loss and require regularisation.

Log Loss is used in logistic regression because the rate of change isn't constant, requiring varying precision levels unlike squared loss used in linear regression.

The Log Loss equation returns the logarithm of the magnitude of the change, rather than just the distance from data to prediction. Log Loss is calculated as follows: $$ \text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y') $$

  • (x,y) is the dataset containing many labelled examples, which are (x, y) pairs.
  • y is the label in a labelled example. Since this is logistic regression, every value of y must either be 0 or 1.
  • y' is your model's prediction (somewhere between 0 and 1), given the set of features in x.
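
A minimal NumPy sketch of this formula (clipping the predictions away from exactly 0 and 1 is an implementation detail to avoid log(0), not part of the formula itself):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 to avoid taking log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return np.sum(-y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.99])
print(log_loss(y_true, y_pred))   # small loss: predictions mostly agree with the labels
```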

Regularisation, such as L2 regularisation or early stopping, is crucial in logistic regression to prevent overfitting (due to the model's asymptotic nature) and improve generalisation.

Source: Logistic regression: Loss and regularization | Machine Learning | Google for Developers

Classification

Introduction

Logistic regression models can be converted into binary classification models for predicting categories instead of probabilities.

Source: Classification | Machine Learning | Google for Developers

Thresholds and the confusion matrix

To convert the raw output from a logistic regression model into binary classification (positive and negative class), you need a classification threshold.

Confusion matrix

|  | Actual positive | Actual negative |
| --- | --- | --- |
| Predicted positive | True positive (TP) | False positive (FP) |
| Predicted negative | False negative (FN) | True negative (TN) |

Total of each row = all predicted positives (TP + FP) and all predicted negatives (FN + TN). Total of each column = all actual positives (TP + FN) and all actual negatives (FP + TN).

  • When positive examples and negative examples are generally well differentiated, with most positive examples having higher scores than negative examples, the dataset is separated.
  • When the total of actual positives is not close to the total of actual negatives, the dataset is imbalanced.
  • When many positive examples have lower scores than negative examples, and many negative examples have higher scores than positive examples, the dataset is unseparated.

When we increase the classification threshold, TP and FP tend to decrease (or stay the same), while TN and FN tend to increase (or stay the same).

Source: Thresholds and the confusion matrix | Machine Learning | Google for Developers

Accuracy, Recall, Precision, and related metrics are all calculated at a single classification threshold value.

Accuracy is the proportion of all classifications that were correct. $$ \text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} = \frac{TP+TN}{TP+TN+FP+FN} $$

  • Use as a rough indicator of model training progress/convergence for balanced datasets. Typically the default.
  • For model performance, use only in combination with other metrics.
  • Avoid for imbalanced datasets. Consider using another metric.

Recall, or true positive rate, is the proportion of all actual positives that were classified correctly as positives. Also known as probability of detection. $$ \text{Recall (or TPR)} = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP+FN} $$

  • Use when false negatives are more expensive than false positives.
  • Better than Accuracy in imbalanced datasets.
  • Improves when false negatives decrease.

False positive rate is the proportion of all actual negatives that were classified incorrectly as positives. Also known as probability of a false alarm. $$ \text{FPR} = \frac{\text{incorrectly classified actual negatives}}{\text{all actual negatives}}=\frac{FP}{FP+TN} $$

  • Use when false positives are more expensive than false negatives.
  ‱ Less meaningful and useful in a dataset where the number of actual negatives is very, very low.

Precision is the proportion of all the model's positive classifications that are actually positive. $$ \text{Precision} = \frac{\text{correctly classified actual positives}}{\text{everything classified as positive}}=\frac{TP}{TP+FP} $$

  • Use when it's very important for positive predictions to be accurate.
  • Less meaningful and useful in a dataset where the number of actual positives is very, very low.
  • Improves as false positives decrease.

Precision and Recall often show an inverse relationship.

F1 score is the harmonic mean of Precision and Recall. $$ \text{F1} = 2 * \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}} = \frac{2TP}{2TP + FP + FN} $$

  • Preferable for class-imbalanced datasets.
  • When Precision and Recall are close in value, F1 will be close to their value.
  • When Precision and Recall are far apart, F1 will be similar to whichever metric is worse.
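
A minimal sketch computing these threshold-dependent metrics directly from made-up confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    fpr = fp / (fp + tn)             # false positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, fpr, f1

# Illustrative confusion-matrix counts for an imbalanced dataset.
print(classification_metrics(tp=40, fp=10, fn=20, tn=930))
# accuracy 0.97, precision 0.8, recall ~0.67, FPR ~0.011, F1 ~0.73
```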

Source: Classification: Accuracy, recall, precision, and related metrics | Machine Learning | Google for Developers

ROC and AUC

ROC and AUC evaluate a model's quality across all possible thresholds.

The ROC curve (receiver operating characteristic curve) plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. A perfect model would pass through (0,1), while a random guesser forms a diagonal line from (0,0) to (1,1).

AUC, or area under the curve, represents the probability that the model will rank a randomly chosen positive example higher than a negative example. A perfect model has AUC = 1.0, while a random model has AUC = 0.5.

ROC and AUC of a hypothetical perfect model (AUC = 1.0) and for completely random guesses (AUC = 0.5): auc_1-0.png auc_0-5.png

ROC and AUC are effective when class distributions are balanced. For imbalanced data, precision-recall curves (PRCs) can be more informative. prauc.png

A higher AUC generally indicates a better-performing model.

ROC and AUC of two hypothetical models; the second curve (AUC = 0.93) represents the better of the two models: auc_0-65.png auc_0-93.png

Threshold choice depends on the cost of false positives versus false negatives. The most relevant thresholds are those closest to (0,1) on the ROC curve. For costly false positives, a conservative threshold (like A in the chart below) is better. For costly false negatives, a more sensitive threshold (like C) is preferable. If costs are roughly equivalent, a threshold in the middle (like B) may be best. auc_abc.png

Source: Classification: ROC and AUC | Machine Learning | Google for Developers

Prediction bias

Prediction bias measures the difference between the average of a model's predictions and the average of the true labels in the data. For example, if 5% of emails in the dataset are spam, a model without prediction bias should also predict about 5% as spam. A large mismatch between these averages indicates potential problems.

Prediction bias can be caused by:

  • Biased and noisy data (e.g., skewed sampling)
  • Overly strong regularisation that oversimplifies the model
  • Bugs in the model training pipeline
  • Insufficient features provided to the model

Source: Classification: Prediction bias | Machine Learning | Google for Developers

Multi-class classification

Multi-class classification extends binary classification to cases with more than two classes.

If each example belongs to only one class, the problem can be broken down into a series of binary classifications. For instance, with three classes (A, B, C), you could first separate C from A+B, then distinguish A from B within the A+B group.

Source: Classification: Multi-class classification | Machine Learning | Google for Developers

 
Read more...

from Bloc de notas

I don't know if you remember, or if you're going so fast that by now you no longer care what happened / what's gone is gone, and so you really did learn something from me / a little about how to live

 
Read more...
