Computer vision


Each kind of AI model processes its input differently depending on the type of data it receives, even though the underlying math and the way data propagates through the model are conceptually similar. All of them follow the same pattern: input, processing, output.

flowchart LR
    A[Input]
    A --> B[Processing]
    B --> C[Output]

Images

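To a model, a digital image is a matrix of pixel values, typically ranging from 0 to 255 for 8-bit images. The model "sees" in the sense that it recognizes structures and relationships between the numbers in that matrix. Edge detection is one example: sharp changes between neighboring pixel values mark the boundary of an object.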
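As a rough illustration of edge detection on a pixel matrix, the sketch below slides a Sobel kernel over a tiny hand-made grayscale image. The image values and the `filter2d` helper are assumptions made up for this example. The helper applies the kernel without flipping it (cross-correlation), which is also how convolutional layers usually apply their filters.

```python
# A tiny 8-bit grayscale "image": dark left half, bright right half.
image = [
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
]

# Sobel kernel that responds to left-to-right changes in brightness,
# which is what a vertical edge looks like in the pixel matrix.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def filter2d(img, kernel):
    """Slide the kernel over every position where it fits fully inside the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(img) - kh + 1):
        row = []
        for c in range(len(img[0]) - kw + 1):
            # Weighted sum of the pixels currently under the kernel.
            total = sum(
                kernel[i][j] * img[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        out.append(row)
    return out


edges = filter2d(image, SOBEL_X)
for row in edges:
    print(row)
```

The output is zero where the neighborhood is uniform and large where the window straddles the dark-to-bright boundary, so the edge shows up purely as a pattern in the numbers.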

Text

With natural language processing, the model breaks sentences down into words, subwords, or even characters, a step known as tokenization. Each token is then embedded as a numerical vector in a high-dimensional space. Tokens with similar meanings or usage cluster closer together, allowing the model to contextualize the relationships between them.
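A minimal sketch of tokenization and embedding lookup follows. The `tokenize` helper is a naive word-level splitter, and the `EMBEDDINGS` table holds hand-picked, hypothetical 4-dimensional vectors. Real models learn their embeddings during training and typically use subword tokenizers and far more dimensions.

```python
import math


def tokenize(text):
    """Naive word-level tokenizer: lowercase, split on whitespace, strip punctuation."""
    return [word.strip(".,!?") for word in text.lower().split()]


# Hypothetical embeddings chosen by hand for illustration only.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.0],
    "car": [0.1, 0.0, 0.9, 0.8],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


tokens = tokenize("The cat chased the dog.")
print(tokens)

sim_cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
sim_cat_car = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these made-up vectors, "cat" and "dog" point in nearly the same direction while "cat" and "car" do not, which is the kind of closeness the model relies on.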

Audio

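Audio waveforms are converted into a spectrogram, an image-like representation where one axis is time, the other is frequency, and each pixel's intensity shows how strong that frequency is at that moment. The spectrogram then passes through the model to derive an output.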
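The conversion from waveform to spectrogram can be sketched with a plain discrete Fourier transform applied frame by frame. The sample rate, frame size, and test tone below are assumptions chosen so the numbers come out cleanly. The sketch skips the windowing and overlapping frames that real audio pipelines normally use.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this example)
FRAME_SIZE = 64     # samples per spectrogram column
TONE_HZ = 1000      # frequency of the synthetic test tone

# Synthetic waveform: a pure sine tone, four frames long.
signal = [
    math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
    for n in range(FRAME_SIZE * 4)
]


def dft_magnitudes(frame):
    """Magnitude of each frequency bin from 0 Hz up to half the sample rate."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, frame_size):
    """Split the waveform into non-overlapping frames and transform each one."""
    frames = [
        samples[i:i + frame_size]
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [dft_magnitudes(frame) for frame in frames]


# spec[t][k] is the strength of frequency bin k during time frame t.
spec = spectrogram(signal, FRAME_SIZE)

# Find the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(f"{len(spec)} frames x {len(spec[0])} bins, peak at {peak_hz} Hz")
```

Each entry in `spec` plays the role of one pixel: its position gives the time frame and frequency bin, and its value gives the strength of that frequency.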
