Each AI model processes its input differently depending on the kind of data it receives, even though the underlying math and data propagation are conceptually similar. All of them, however, follow the same pattern: input, processing, and output.
flowchart LR
    A[Input] --> B[Processing]
    B --> C[Output]
Images
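To a model, a digital image is a matrix of pixel values, typically ranging from 0 to 255 for 8-bit images (one matrix for grayscale, or one per color channel for RGB). The model "sees" in the sense that it recognizes structures and relationships among the numbers in that matrix, for example by detecting edges where neighboring values change sharply.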
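To make the edge detection idea concrete, here is a minimal sketch using NumPy. The 6x6 image and the `filter2d` helper are made up for illustration, and the Sobel kernel is just one common choice of edge filter; trained models learn their own filters rather than using a fixed one.

```python
import numpy as np

# Hypothetical 6x6 grayscale image: dark left half (0), bright right half (255).
image = np.zeros((6, 6), dtype=np.uint8)
image[:, 3:] = 255

# Sobel kernel that responds to left-to-right changes in brightness,
# which is where vertical edges show up.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)

def filter2d(img, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i:i + kh, j:j + kw].astype(np.float64)
            out[i, j] = np.sum(patch * kernel)
    return out

edges = filter2d(image, sobel_x)
print(edges)
```

The response is zero wherever the 3x3 patch is uniformly dark or uniformly bright, and large only in the columns that straddle the dark-to-bright boundary. That is the sense in which the numbers alone reveal structure.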

Text
In natural language processing, the model first breaks sentences down into words, subwords, or even characters, a step called tokenization. Each token is then mapped to a vector of numbers (an embedding), so a sentence becomes a numerical matrix. These vectors live in a high-dimensional space, often hundreds or thousands of dimensions rather than three. Tokens with similar meanings or usage cluster closer together, which lets the model capture relationships between them.
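The tokenize-then-embed pipeline can be sketched roughly as follows. Everything here is a toy assumption: the sentence, the word-level split (real tokenizers usually produce subwords), and the hand-picked 4-dimensional embedding values, which a real model would learn during training.

```python
import numpy as np

sentence = "the cat chased the dog"

# Word-level tokenization for simplicity.
tokens = sentence.split()

# Vocabulary: each distinct token gets an integer id (alphabetical order).
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# Hypothetical embedding matrix, one row per vocabulary entry.
# Values are hand-picked for illustration, not learned.
embeddings = np.array([
    [0.9, 0.8, 0.1, 0.0],   # cat
    [0.1, 0.0, 0.9, 0.7],   # chased
    [0.8, 0.9, 0.2, 0.1],   # dog
    [0.0, 0.1, 0.1, 0.9],   # the
])

# Looking up each id turns the sentence into a (tokens x dimensions) matrix.
sequence = embeddings[token_ids]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; near 1 means similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, chased, dog = (embeddings[vocab[w]] for w in ("cat", "chased", "dog"))
sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_chased = cosine_similarity(cat, chased)
print(token_ids, sequence.shape, sim_cat_dog, sim_cat_chased)
```

With these made-up values, "cat" and "dog" point in nearly the same direction while "cat" and "chased" do not, which mirrors how related tokens end up close together in a learned embedding space.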
Audio
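Audio waveforms are converted into a spectrogram, a two-dimensional representation of time against frequency in which each pixel gives the intensity of a particular frequency at a particular moment. This spectrogram is then passed through the model to derive an output.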
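As a rough illustration of the waveform-to-spectrogram step, the sketch below computes a magnitude spectrogram with a short-time Fourier transform in NumPy. The synthetic 440 Hz tone, the sample rate, and the frame and hop sizes are all assumed values chosen for the example; real pipelines often add further steps such as mel scaling or log compression.

```python
import numpy as np

sample_rate = 8000                      # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t)   # one second of a synthetic 440 Hz tone

frame_size = 256                        # samples per analysis frame (assumed)
hop_size = 128                          # step between frames (assumed)
window = np.hanning(frame_size)         # taper each frame to reduce leakage

# Cut the waveform into overlapping, windowed frames.
n_frames = 1 + (len(waveform) - frame_size) // hop_size
frames = np.stack([
    waveform[i * hop_size: i * hop_size + frame_size] * window
    for i in range(n_frames)
])

# FFT magnitude per frame; transpose so rows are frequency bins
# and columns are time frames.
spectrogram = np.abs(np.fft.rfft(frames, axis=1)).T

# Frequency (in Hz) that each row of the spectrogram corresponds to.
freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
peak_freq = freqs[spectrogram.mean(axis=1).argmax()]
print(spectrogram.shape, peak_freq)
```

Each entry of `spectrogram` is the strength of one frequency bin in one time frame. For this pure tone, the strongest row is the bin closest to 440 Hz.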
