27 Jun 2025
Artificial Intelligence has evolved rapidly, reshaping industries and redefining how we interact with technology. Central to this evolution are Foundation Models: advanced AI systems trained on vast datasets, capable of adapting to numerous tasks without extensive retraining. In this article, I'll dive into the heart of Foundation Models, examining their current limitations, future potential, and the transformative concept of "Software 3.0."
The Problem Landscape
Humans vs LLMs: The Sample Efficiency Gap
Humans learn incredibly efficiently. By age four, a child has processed roughly 1 petabyte (1×10¹⁵ bytes) of visual information, about 50 times more data than GPT-4 was trained on, yet learns effortlessly and robustly. Current AI models, by contrast, are far less sample-efficient, requiring massive amounts of data and compute to reach comparable robustness.
Data Availability Ceiling
We’re quickly approaching a ceiling of available high-quality text data, predicted to be exhausted within this decade. This limitation pushes research towards multimodal models—AI that integrates diverse data sources such as images, audio, and sensor data—promising richer learning and greater efficiency.
Autoregressive Limits: Error Cascades
Traditional autoregressive models predict sequentially, token by token, so small errors early in a sequence compound as generation continues. Yann LeCun has famously argued that this approach is inherently limited as a path to Artificial General Intelligence (AGI).
Predictive World Models: The New Standard
Future AI must adopt internal predictive “world models,” continuously simulating outcomes before acting. This method enables deeper, strategic planning capabilities—moving beyond reactionary prediction to proactive imagination and reasoning.
Variable Computation: Adaptive “Thinking Time”
Today’s one-size-fits-all approach is inefficient. New models will employ adaptive computation, allocating computational effort dynamically—similar to human cognition—spending more resources on complex problems while swiftly managing simpler tasks.
Memory & Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation enhances model reliability by dynamically integrating external knowledge into AI responses, reducing hallucinations and ensuring accuracy. Models now effectively use real-time data retrieval to augment their internal knowledge, significantly boosting their adaptability and utility.
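To make the retrieve-then-generate loop concrete, here is a toy sketch; the corpus, the embed and retrieve helpers, and the prompt format are all my own illustrative stand-ins (a real system would use a learned embedding model and a vector database):

import re
import numpy as np

# Toy corpus standing in for an external knowledge base.
corpus = [
    "The Eiffel Tower is 330 metres tall.",
    "GPT-4 was released by OpenAI in March 2023.",
    "Mount Everest is the highest mountain on Earth.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

vocab = sorted({w for doc in corpus for w in tokenize(doc)})

def embed(text):
    # Bag-of-words vector; a stand-in for a learned embedding model.
    words = tokenize(text)
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(query, k=2):
    # Rank documents by cosine similarity to the query.
    q = embed(query)
    scores = [q @ embed(d) / (np.linalg.norm(embed(d)) * np.linalg.norm(q) + 1e-9)
              for d in corpus]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

query = "How tall is the Eiffel Tower?"
context = "\n".join(retrieve(query))
# The retrieved passages ground the model's answer in external knowledge.
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)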
The Power of Tokenization
Tokenization is fundamental, converting raw data into standardized discrete symbols that AI can understand. Effective tokenization facilitates multimodal fusion, where text, images, audio, and sensor data share a common language—transforming AI into versatile, unified foundation models.
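As a toy illustration of this "common language" idea (not any production tokenizer): byte-level tokenization already maps text and the raw bytes of an image file into the same discrete vocabulary of 256 symbols; real tokenizers such as BPE then merge frequent pairs into larger units:

# Byte-level "tokenization": any data becomes symbols drawn from 0..255.
text_tokens = list("Hello, world!".encode("utf-8"))
print(text_tokens)   # [72, 101, 108, 108, 111, 44, 32, 119, ...]

# The same vocabulary can carry other modalities, e.g. the bytes of a PNG header.
image_tokens = list(bytes([137, 80, 78, 71, 13, 10, 26, 10]))
print(image_tokens)  # [137, 80, 78, 71, 13, 10, 26, 10]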
Software 3.0: The Evolution of Development
Introduced by Andrej Karpathy, Software 3.0 redefines software development. In Software 1.0, developers explicitly code algorithms; in Software 2.0, data curation and model training become central. Software 3.0 positions large language models as general-purpose platforms programmed via natural language, significantly shifting development paradigms.
In Software 3.0, development cycles accelerate dramatically through rapid iteration driven by conversational prompts and AI-driven code refactoring. Developers evolve into orchestrators, shaping and guiding intelligent models rather than writing detailed procedural code.
LLM as Operating System
Foundation models are becoming akin to operating systems, providing memory management, scheduling, and API access. Applications become streamlined, relying on these powerful AI backbones. This paradigm shift raises new challenges, including:
- Reliability: Preventing hallucinations through structured grounding and sandboxed execution.
- Security & Privacy: Protecting against prompt injection and data leakage.
- Governance & Compliance: Establishing clear audit trails, licensing, and attribution.
- Sustainability: Reducing computational costs and environmental impacts.
- Ethical Alignment: Ensuring models produce safe, unbiased, and ethically sound outputs.
The Hopeful Future
Foundation models promise immense opportunities for collaboration between humans and AI, fostering new roles, enhancing creativity, and boosting productivity. However, responsible and ethical stewardship is crucial for ensuring these powerful tools positively impact society.
Foundation models and Software 3.0 represent an exciting AI frontier, blending technological innovation with human ingenuity to shape a future where technology complements and enhances human potential.
References & Further Reading
- Bommasani et al., Stanford CRFM, 2021.
- OpenAI, GPT-4 Technical Report, 2023.
- A. Karpathy, Software 2.0 (Medium, YouTube).
- Y. LeCun, World-Modeling AI (YouTube).
- 3Blue1Brown, Neural Networks Explained (YouTube).
18 Dec 2022
Without further ado, here is a one-pager for training ResNet50 on ImageNet in PyTorch:
import torch
import torchvision
import torchvision.transforms as transforms

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set hyperparameters
num_epochs = 10
batch_size = 64
learning_rate = 0.001

# Initialize transformations for data augmentation
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=45),
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Load the ImageNet Object Localization Challenge dataset
train_dataset = torchvision.datasets.ImageFolder(
    root='/kaggle/input/imagenet-object-localization-challenge/ILSVRC/Data/CLS-LOC/train',
    transform=transform
)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)

# Load the ResNet50 model
model = torchvision.models.resnet50(pretrained=True)

# Parallelize training across multiple GPUs
model = torch.nn.DataParallel(model)

# Set the model to run on the device
model = model.to(device)

# Define the loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model...
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        # Move input and label tensors to the device
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

    # Print the loss for every epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}')

print(f'Finished Training, Loss: {loss.item():.4f}')
This code trains a ResNet50 model on the ImageNet dataset for 10 epochs using the Adam optimizer with a learning rate of 0.001. The model is trained on a GPU if one is available; otherwise it falls back to the CPU.
Note that the code is adjusted to run with the ImageNet Object Localization Challenge on Kaggle. You can check some results in the notebook Train Resnet50 on Imagenet with PyTorch.
Expanded explanation of each training step
Import the necessary PyTorch modules:

import torch
import torchvision
import torchvision.transforms as transforms

- torch provides tensors and basic mathematical operations.
- torchvision provides utilities for loading and preprocessing image data.
- torchvision.transforms is a submodule of torchvision that provides functions for performing image preprocessing.
Set the device to use for training:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
This line sets the device variable to “cuda” if a GPU is available, otherwise it sets it to “cpu”. The model and tensors will be moved to this device later in the code.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=45),
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
This block of code defines the set of transformations applied during training. In particular, transforms.Normalize takes two arguments:
- [0.485, 0.456, 0.406]: the mean of the data along each channel (i.e., the red, green, and blue channels of an image).
- [0.229, 0.224, 0.225]: the standard deviation of the data along each channel.
These values are the per-channel statistics of the ImageNet training set, which consists of a large number of natural images, and they are the standard normalization for models pretrained on ImageNet.
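If you train from scratch on a different dataset, you could estimate your own per-channel statistics. A minimal sketch (my own helper, not part of the one-pager, assuming the loader yields (B, 3, H, W) tensors scaled to [0, 1]):

import torch

def channel_stats(loader):
    # Accumulate per-channel pixel sums to estimate the dataset mean/std.
    n_pixels = 0
    s = torch.zeros(3)
    s2 = torch.zeros(3)
    for images, _ in loader:
        n_pixels += images.size(0) * images.size(2) * images.size(3)
        s += images.sum(dim=[0, 2, 3])
        s2 += (images ** 2).sum(dim=[0, 2, 3])
    mean = s / n_pixels
    std = (s2 / n_pixels - mean ** 2).sqrt()
    return mean, std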
Load the ImageNet dataset:

train_dataset = torchvision.datasets.ImageFolder(
    root='/kaggle/input/imagenet-object-localization-challenge/ILSVRC/Data/CLS-LOC/train',
    transform=transform
)

These lines load the ImageNet dataset in Kaggle's directory layout and apply all the transformations defined above.
Create a dataloader for the dataset:

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)

The torch.utils.data.DataLoader function creates a dataloader for the dataset. The batch_size parameter specifies the number of samples per batch, the shuffle parameter specifies whether to shuffle the data at each epoch, and the num_workers parameter specifies the number of worker processes used to load the data. A rule of thumb for the number of workers is the number of CPU cores minus one, leaving a core for the controlling process; os.cpu_count() or multiprocessing.cpu_count() can help with this, as in the snippet below.
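For example, a small convenience snippet (my own, not part of the original one-pager):

import multiprocessing

# Leave one core free for the controlling (main) process.
num_workers = max(1, multiprocessing.cpu_count() - 1)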
Load the ResNet50 model:

model = torchvision.models.resnet50(pretrained=True)

This line uses the torchvision.models.resnet50 function to load the ResNet50 model, with the pretrained parameter set to True to use weights pretrained on ImageNet.
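Note that in torchvision 0.13 and later the pretrained flag is deprecated in favor of the weights argument; the equivalent call would be:

from torchvision.models import resnet50, ResNet50_Weights

# Equivalent to pretrained=True in torchvision >= 0.13
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)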
Parallelize training across multiple GPUs:

model = torch.nn.DataParallel(model)

torch.nn.DataParallel wraps a model, splits each input batch across the available GPUs, and replicates the model on each of them. The model then runs in parallel on every GPU, and the per-GPU results are gathered and concatenated together. This normally speeds up training significantly, especially for large models on GPUs with many parallel processing cores.
Move the model to the device:

model = model.to(device)

This line moves the model and its parameters to the device specified earlier.
Set the loss function and optimizer:

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

The torch.nn.CrossEntropyLoss function creates a cross-entropy loss criterion, which is commonly used for classification tasks. The torch.optim.Adam function creates an Adam optimizer with the specified learning rate. The model.parameters() method returns an iterator over the model's trainable parameters, which the optimizer adjusts during training.
Train the model:

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        # Move input and label tensors to the device
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

    # Print the loss for every epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}')
This block of code trains the model for 10 epochs. An epoch is a complete pass through the training data. In each epoch, the code iterates over the dataloader, which yields batches of inputs and labels. The inputs and labels are moved to the device, and the gradients are zeroed with the optimizer.zero_grad() method.
During training, the gradients of the model's parameters are computed using backpropagation: the loss gradient is propagated back through the model's layers to obtain the gradient of the loss with respect to each parameter. The optimizer then uses these gradients in its update rule.

However, if you don't zero the gradients before each training step, they accumulate, and the update rule is based on the sum of the gradients over all previous training steps. This can cause the model's parameters to oscillate or diverge, leading to poor convergence and potentially poor model performance.

By calling optimizer.zero_grad() before each training step, you reset the gradients of the model's parameters to zero, ensuring that the update is based only on the gradients of the current step.

Next, the model performs a forward pass on the inputs, producing output logits. The loss is computed from the output logits and the labels, and the model's gradients are computed with the loss.backward() method. Finally, the optimizer takes a step to update the model's parameters using those gradients. A small demonstration of the gradient accumulation behaviour follows below.
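Here is a tiny self-contained demo of that accumulation (my own illustration, not from the one-pager):

import torch

x = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(x^2)/dx at x=2 is 4.
(x ** 2).sum().backward()
print(x.grad)  # tensor([4.])

# Without zeroing, a second backward pass adds to the stored gradient.
(x ** 2).sum().backward()
print(x.grad)  # tensor([8.]) - accumulated, not replaced

# Zeroing the gradient restores the expected behaviour.
x.grad.zero_()
(x ** 2).sum().backward()
print(x.grad)  # tensor([4.])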
Training environment with Docker

FROM pytorch/pytorch

# Install additional dependencies
RUN apt-get update && apt-get install -y \
    wget \
    unzip

# Copy Kaggle credentials
RUN mkdir -p /root/.kaggle
ADD kaggle.json /root/.kaggle

# Install Kaggle toolchain and download ImageNet
# Remember to join the competition https://www.kaggle.com/c/imagenet-object-localization-challenge
RUN pip install -q kaggle
RUN kaggle competitions download -c imagenet-object-localization-challenge
RUN unzip imagenet-object-localization-challenge.zip -d imagenet-object-localization-challenge

# Download the pretrained model and store it in the image layer
# Available at the path /root/.cache/torch/
RUN python -c "import torchvision; model = torchvision.models.resnet50(pretrained=True)"

# Copy the code
COPY resnet50.py /app/

# Set the working directory
WORKDIR /app

# Run the code
CMD ["python", "resnet50.py"]
This Dockerfile is based on the pytorch/pytorch image, which provides all the necessary dependencies for running PyTorch programs with GPU acceleration.

The Dockerfile installs the wget and unzip utilities, which are needed to download and extract the ImageNet dataset. It then downloads the dataset and extracts the images into the imagenet-object-localization-challenge directory.

Next, the Dockerfile copies the resnet50.py file, which should contain the code for training the ResNet50 model, into the /app directory. It then sets the working directory to /app and specifies that the python resnet50.py command should be run when the container is started.
It is important to note that you have to provide your own secret in the kaggle.json file to be able to download the data from the ImageNet Object Localization Challenge on Kaggle. Here is the local directory structure:
➜ 2022-12-18-one-pager-training-resnet-on-imagenet git:(master) ✗ tree
.
├── .gitignore
├── Dockerfile
├── kaggle.json
├── resnet50.py
└── README.md
To build the Docker container, you can run the following command:
docker build -t resnet50 .
This will build the Docker container and tag it with the name “resnet50”.
To run the container, you can use the following command:
docker run --gpus all -it resnet50
This runs the container in interactive mode with access to all available GPUs and starts training the ResNet50 model on the ImageNet dataset. You should see the running loss printed to the console as training progresses.
I hope this helps! Let me know if you have any questions.
09 Dec 2022
Search across all databases (schemas) for tables without a primary key

The following query obtains the list of tables without a primary key, the ones that quietly destroy database performance:
SELECT
    t.table_name
FROM information_schema.tables as t
LEFT JOIN information_schema.table_constraints as tc
    ON (
        t.table_schema = tc.table_schema
        AND t.table_name = tc.table_name
        AND tc.constraint_type = 'PRIMARY KEY'
    )
WHERE
    t.table_type = 'BASE TABLE'
    AND t.table_schema not in ('pg_catalog', 'information_schema')
    AND tc.constraint_name is NULL
In this example, the information_schema.tables table is used to find all tables registered in PostgreSQL, and a LEFT JOIN with information_schema.table_constraints pulls in the related constraints. The WHERE clause filters out the system schemas and keeps only the tables that do not have a primary key.
Restrict the search for tables without a primary key to a specific database (schema)
SELECT
    t.table_name
FROM information_schema.tables as t
LEFT JOIN information_schema.table_constraints as tc
    ON (
        t.table_schema = tc.table_schema
        AND t.table_name = tc.table_name
        AND tc.constraint_type = 'PRIMARY KEY'
    )
WHERE
    t.table_type = 'BASE TABLE'
    AND t.table_schema not in ('pg_catalog', 'information_schema')
    AND t.table_schema = '<database name>' -- put database name here
    AND tc.constraint_name is NULL
A friendly piece of advice: the result of both queries should be an empty set.
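If you want to automate this check, for example in CI, here is a minimal sketch using psycopg2; the connection parameters are placeholders you would replace with your own:

import psycopg2

QUERY = """
SELECT t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS tc
    ON (t.table_schema = tc.table_schema
        AND t.table_name = tc.table_name
        AND tc.constraint_type = 'PRIMARY KEY')
WHERE t.table_type = 'BASE TABLE'
    AND t.table_schema NOT IN ('pg_catalog', 'information_schema')
    AND tc.constraint_name IS NULL
"""

# Placeholder connection parameters -- substitute your own.
conn = psycopg2.connect(host="localhost", dbname="mydb", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    offenders = cur.fetchall()

# The happy path: no tables without a primary key.
assert not offenders, f"Tables without a primary key: {offenders}"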
Happy querying!
03 Dec 2022
Have you ever built graphs in matplotlib or any other GUI running in Docker? You will have to overcome some obstacles:
- a missing X11 server inside Docker
- missing X11-related libraries needed to render your GUI elements
For example, I'm writing this simple Python program:
import matplotlib.pyplot as plt
import numpy as np
# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)
# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()
First, allow local clients to connect to the X11 server; a command like xhost +local: can do this (adjust to your setup).
Then I want to run it in the Docker container by issuing the command:
docker run -it --rm --env="DISPLAY" -v "/tmp/.X11-unix:/tmp/.X11-unix:rw" -v $PWD:/app python-container python /app/example.py
This fails with the warning:

/app/examples/example.py:9: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
plt.show()
To fix the Matplotlib is currently using agg error, one dependency must be installed. Below I provide a minimalist Dockerfile which includes the most important X11 library dependency, libx11-6; it works fine on Ubuntu 20.04-22.04 and Debian 11+.
FROM python

RUN apt-get update && \
    apt-get install -y \
    python3 \
    python3-setuptools \
    libx11-6 \
    ca-certificates

RUN pip3 install numpy matplotlib

WORKDIR /app
Et voilà, we have a perfectly working, minimalist Docker container that can connect to a local X11 server and create GUI elements!
Note 1. To build the image, run:

docker build -t python-container .

Note 2. What does --env="DISPLAY" stand for? --env propagates the $DISPLAY environment variable into the container.

Note 3. What does -v "/tmp/.X11-unix:/tmp/.X11-unix:rw" do? It mounts the X11 socket into the /tmp folder inside the container, where GUI applications use it to render their elements.
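As an aside: if you don't actually need an interactive window, you can avoid X11 entirely by rendering off-screen with the Agg backend and writing the figure to a file. A minimal sketch of that alternative:

import matplotlib
matplotlib.use("Agg")  # render off-screen; no X11 server required
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.savefig("plot.png")  # write the figure to a file instead of opening a window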
27 Nov 2022
When training, you'd love to know how efficiently the GPU is utilized. Nvidia provides the nvidia-smi tool with its driver. Invoking it without any parameters gives you a matrix of basic GPU parameters:
$ nvidia-smi
Fri Dec 2 23:13:41 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 20% 50C P5 25W / 250W | 744MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4804 G /usr/lib/xorg/Xorg 292MiB |
| 0 N/A N/A 4918 G /usr/bin/gnome-shell 108MiB |
| 0 N/A N/A 10549 G ...390539104842029425,131072 340MiB |
+-----------------------------------------------------------------------------+
But to monitor GPU usage continuously, we have to use the dmon subcommand with metric keys:
$ nvidia-smi dmon -s pucvmet
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk pviol tviol fb bar1 sbecc dbecc pci rxpci txpci
# Idx W C C % % % % MHz MHz % bool MB MB errs errs errs MB/s MB/s
0 24 46 - 2 4 0 0 810 1151 0 0 791 6 - - 0 14 0
0 19 46 - 0 3 0 0 810 1151 0 0 791 6 - - 0 0 2
0 20 45 - 0 3 0 0 810 1151 0 0 791 6 - - 0 0 2
0 21 46 - 1 3 0 0 810 1151 0 0 791 6 - - 0 0 2
0 19 45 - 9 10 0 0 810 1151 0 0 821 6 - - 0 0 0
The parameters to watch:
- p - Power Usage (in Watts) and Gpu/Memory Temperature (in C) if supported
- u - Utilization (SM, Memory, Encoder and Decoder Utilization in %)
- c - Proc and Mem Clocks (in MHz)
- v - Power Violations (in %) and Thermal Violations (as a boolean flag)
- m - Frame Buffer and Bar1 memory usage (in MB)
- e - ECC (Number of aggregated single bit, double bit ecc errors) and PCIe Replay errors
- t - PCIe Rx and Tx Throughput in MB/s (Maxwell and above)
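For monitoring from inside a training script, the NVML Python bindings (the pynvml module, e.g. from the nvidia-ml-py3 package) expose the same counters; a minimal sketch, assuming a single GPU at index 0:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# SM and memory-controller utilization, analogous to the 'u' key above
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
# Frame buffer memory usage, analogous to the 'm' key above
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"GPU util: {util.gpu}%  mem util: {util.memory}%")
print(f"FB memory: {mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB")

pynvml.nvmlShutdown()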