Each AI model processes its input differently depending on the kind of data it receives, even though the underlying math and the way data propagates through the model are conceptually similar. All of them follow the same three stages: input, processing, and output.

flowchart LR
    A[Input]
    A --> B[Processing]
    B --> C[Output]

Images

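To a model, a digital image is a matrix of pixel values, typically ranging from 0 to 255 for 8-bit images (a color image has one such matrix per channel). The model "sees" in the sense that it recognizes structures and relationships among the numbers in that matrix. Edge detection is one example: an edge shows up as a sharp change in value between neighboring pixels.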

processed-image
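
To make the edge detection idea concrete, here is a minimal sketch in plain Python. The 5x6 image, the choice of a Sobel kernel, and the helper name `filter_valid` are all assumptions made for illustration; a real pipeline would use an array or image library, and a trained model learns its own filters instead of using a fixed one.

```python
# Hypothetical 5x6 grayscale image: dark left half (0), bright right half (255).
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]

# Sobel kernel that responds to left-to-right changes in brightness.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def filter_valid(img, kernel):
    """Slide a 3x3 kernel over img, skipping the 1-pixel border."""
    rows, cols = len(img), len(img[0])
    out = []
    for i in range(rows - 2):
        row = []
        for j in range(cols - 2):
            # Weighted sum of the 3x3 neighborhood whose top-left corner is (i, j).
            total = sum(img[i + di][j + dj] * kernel[di][dj]
                        for di in range(3) for dj in range(3))
            row.append(total)
        out.append(row)
    return out

edges = filter_valid(image, SOBEL_X)
for row in edges:
    print(row)  # each row prints [0, 1020, 1020, 0]
```

The response is zero wherever the neighborhood is uniform and large where the window straddles the dark-to-bright boundary, which is the numerical pattern that marks a vertical edge.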

Text

In natural language processing, the model first breaks sentences down into words, subwords, or even characters, a step called tokenization. Each token is then mapped to a numerical vector, called an embedding, in a high-dimensional space (often hundreds or thousands of dimensions; 3D plots of embeddings are simplified projections). Tokens with similar meanings or usage cluster closer together, which lets the model capture relationships between them.
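
The sketch below illustrates tokenization and embedding lookup under heavy simplification. The word-level `tokenize` function and the hand-written 4-dimensional `EMBEDDINGS` table are hypothetical; production models typically use subword tokenizers and learn their embedding values during training. Cosine similarity is used here as one common way to measure how close two vectors are.

```python
import math
import re

def tokenize(text):
    """Toy word-level tokenizer: lowercase, then pull out runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical hand-written embeddings; trained models learn these values.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.0],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

tokens = tokenize("Cat, dog, car!")
cat, dog, car = (EMBEDDINGS[t] for t in tokens)

print(tokens)                        # ['cat', 'dog', 'car']
print(cosine_similarity(cat, dog))   # close to 1: similar usage
print(cosine_similarity(cat, car))   # much lower: unrelated usage
```

In this toy table, "cat" and "dog" point in nearly the same direction while "car" does not, which mirrors how related tokens end up near each other in a learned embedding space.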

Audio

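Audio waveforms are converted into a spectrogram, a 2D image in which one axis is time, the other is frequency, and each pixel's value is the signal's intensity at that frequency and moment. The spectrogram can then be passed through the model, much like an image, to derive an output.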

spectrogram
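
As a rough sketch of how a waveform becomes a spectrogram, the following plain-Python example cuts a signal into frames and computes a discrete Fourier transform for each one. The sample rate, frame size, 100 Hz test tone, and function names are assumptions chosen for illustration; practical implementations use an FFT library, overlapping frames, and a window function.

```python
import cmath
import math

SAMPLE_RATE = 800   # samples per second (assumed)
FRAME_SIZE = 64     # samples per time slice (assumed)

def dft_magnitudes(frame):
    """Magnitude of each frequency bin from 0 Hz up to the Nyquist frequency."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        mags.append(abs(coeff))
    return mags

def spectrogram(signal, frame_size):
    """Split the signal into non-overlapping frames; one spectrum per frame."""
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, frame_size)]
    return [dft_magnitudes(f) for f in frames]

# Hypothetical input: 256 samples of a pure 100 Hz tone.
signal = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(signal, FRAME_SIZE)   # spec[time_index][frequency_bin]
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
for peak in peak_bins:
    # Convert the strongest bin index back to a frequency in Hz.
    print(peak * SAMPLE_RATE / FRAME_SIZE)  # 100.0 for every frame
```

Each row of `spec` is one time slice and each column is one frequency bin, so the nested list has the same layout as the pixels of a spectrogram image. For this steady tone, the brightest "pixel" sits at 100 Hz in every time slice.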