27 Jun 2025
Artificial Intelligence has evolved rapidly, reshaping industries and redefining how we interact with technology. Central to this evolution are Foundation Models: advanced AI systems trained on vast datasets, capable of adapting to numerous tasks without extensive retraining. In this article, I'll dive into the heart of Foundation Models, examining their current limitations, future potential, and the transformative concept of "Software 3.0."
The Problem Landscape
Humans vs LLMs: The Sample Efficiency Gap
Humans learn incredibly efficiently. By age four, a child has processed roughly 1 petabyte (1×10¹⁵ bytes) of visual information, about 50 times more data than GPT-4 was trained on, yet learns effortlessly and robustly. Current AI models, by contrast, are far less sample-efficient, requiring massive amounts of data and compute to reach comparable robustness.
Data Availability Ceiling
We’re quickly approaching a ceiling of available high-quality text data, predicted to be exhausted within this decade. This limitation pushes research towards multimodal models—AI that integrates diverse data sources such as images, audio, and sensor data—promising richer learning and greater efficiency.
Autoregressive Limits: Error Cascades
Traditional autoregressive models predict sequentially, token by token, so small errors early in a sequence compound as generation continues. Yann LeCun has famously argued that this approach is inherently limited as a path to Artificial General Intelligence (AGI).
Predictive World Models: The New Standard
Future AI must adopt internal predictive “world models,” continuously simulating outcomes before acting. This method enables deeper, strategic planning capabilities—moving beyond reactionary prediction to proactive imagination and reasoning.
Variable Computation: Adaptive “Thinking Time”
Today’s one-size-fits-all approach is inefficient. New models will employ adaptive computation, allocating computational effort dynamically—similar to human cognition—spending more resources on complex problems while swiftly managing simpler tasks.
Memory & Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation enhances model reliability by dynamically integrating external knowledge into AI responses, reducing hallucinations and ensuring accuracy. Models now effectively use real-time data retrieval to augment their internal knowledge, significantly boosting their adaptability and utility.
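To make the retrieve-then-generate loop concrete, here is a toy sketch; the corpus, the embed and retrieve helpers, and the prompt format are all my own illustrative stand-ins (a real system would use a learned embedding model and a vector database):

import re
import numpy as np

# Toy corpus standing in for an external knowledge base.
corpus = [
    "The Eiffel Tower is 330 metres tall.",
    "GPT-4 was released by OpenAI in March 2023.",
    "Mount Everest is the highest mountain on Earth.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

vocab = sorted({w for doc in corpus for w in tokenize(doc)})

def embed(text):
    # Bag-of-words vector; a stand-in for a learned embedding model.
    words = tokenize(text)
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(query, k=2):
    # Rank documents by cosine similarity to the query.
    q = embed(query)
    scores = [q @ embed(d) / (np.linalg.norm(embed(d)) * np.linalg.norm(q) + 1e-9)
              for d in corpus]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

query = "How tall is the Eiffel Tower?"
context = "\n".join(retrieve(query))
# The retrieved passages ground the model's answer in external knowledge.
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)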
The Power of Tokenization
Tokenization is fundamental, converting raw data into standardized discrete symbols that AI can understand. Effective tokenization facilitates multimodal fusion, where text, images, audio, and sensor data share a common language—transforming AI into versatile, unified foundation models.
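As a toy illustration of this "common language" idea (not any production tokenizer): byte-level tokenization already maps text and the raw bytes of an image file into the same discrete vocabulary of 256 symbols; real tokenizers such as BPE then merge frequent pairs into larger units:

# Byte-level "tokenization": any data becomes symbols drawn from 0..255.
text_tokens = list("Hello, world!".encode("utf-8"))
print(text_tokens)   # [72, 101, 108, 108, 111, 44, 32, 119, ...]

# The same vocabulary can carry other modalities, e.g. the bytes of a PNG header.
image_tokens = list(bytes([137, 80, 78, 71, 13, 10, 26, 10]))
print(image_tokens)  # [137, 80, 78, 71, 13, 10, 26, 10]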
Software 3.0: The Evolution of Development
Introduced by Andrej Karpathy, Software 3.0 redefines software development. In Software 1.0, developers explicitly code algorithms; in Software 2.0, data curation and model training become central. Software 3.0 positions large language models as general-purpose platforms programmed via natural language, significantly shifting development paradigms.
In Software 3.0, development cycles accelerate dramatically through rapid iteration driven by conversational prompts and AI-driven code refactoring. Developers evolve into orchestrators, shaping and guiding intelligent models rather than writing detailed procedural code.
LLM as Operating System
Foundation models are becoming akin to operating systems, providing memory management, scheduling, and API access. Applications become streamlined, relying on these powerful AI backbones. This paradigm shift raises new challenges, including:
- Reliability: Preventing hallucinations through structured grounding and sandboxed execution.
- Security & Privacy: Protecting against prompt injection and data leakage.
- Governance & Compliance: Establishing clear audit trails, licensing, and attribution.
- Sustainability: Reducing computational costs and environmental impacts.
- Ethical Alignment: Ensuring models produce safe, unbiased, and ethically sound outputs.
The Hopeful Future
Foundation models promise immense opportunities for collaboration between humans and AI, fostering new roles, enhancing creativity, and boosting productivity. However, responsible and ethical stewardship is crucial for ensuring these powerful tools positively impact society.
Foundation models and Software 3.0 represent an exciting AI frontier, blending technological innovation with human ingenuity to shape a future where technology complements and enhances human potential.
References & Further Reading
- Bommasani et al., Stanford CRFM, 2021.
- OpenAI, GPT-4 Technical Report, 2023.
- A. Karpathy, Software 2.0 (Medium, YouTube).
- Y. LeCun, World-Modeling AI (YouTube).
- 3Blue1Brown, Neural Networks Explained (YouTube).
18 Dec 2022
Without further ado, here is a one-pager for training ResNet50 on ImageNet in PyTorch:
import torch
import torchvision
import torchvision.transforms as transforms

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set hyperparameters
num_epochs = 10
batch_size = 64
learning_rate = 0.001

# Initialize transformations for data augmentation
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=45),
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Load the ImageNet Object Localization Challenge dataset
train_dataset = torchvision.datasets.ImageFolder(
    root='/kaggle/input/imagenet-object-localization-challenge/ILSVRC/Data/CLS-LOC/train',
    transform=transform
)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)

# Load the ResNet50 model
model = torchvision.models.resnet50(pretrained=True)

# Parallelize training across multiple GPUs
model = torch.nn.DataParallel(model)

# Set the model to run on the device
model = model.to(device)

# Define the loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model...
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        # Move input and label tensors to the device
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

    # Print the loss for every epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}')

print(f'Finished Training, Loss: {loss.item():.4f}')
This code trains a ResNet50 model on the ImageNet dataset for 10 epochs using the Adam optimizer with a learning rate of 0.001. The model is trained on a GPU if one is available; otherwise it falls back to the CPU.
Note that the code is adjusted to run with the ImageNet Object Localization Challenge on Kaggle. You can check some results in the notebook Train Resnet50 on Imagenet with PyTorch.
Expanded explanation of each training step
Import the necessary PyTorch modules:

import torch
import torchvision
import torchvision.transforms as transforms

- torch provides tensors and basic mathematical operations.
- torchvision provides utilities for loading and preprocessing image data.
- torchvision.transforms is a submodule of torchvision that provides functions for performing image preprocessing.
Set the device to use for training:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
This line sets the device variable to “cuda” if a GPU is available, otherwise it sets it to “cpu”. The model and tensors will be moved to this device later in the code.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=45),
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
This block of code defines the set of transformations applied during training. In particular, transforms.Normalize takes two arguments:
- [0.485, 0.456, 0.406]: the mean of the data along each channel (i.e., the red, green, and blue channels of an image).
- [0.229, 0.224, 0.225]: the standard deviation of the data along each channel.
These values are the per-channel statistics of the ImageNet training set, which consists of a large number of natural images, and they are the standard normalization for models pretrained on ImageNet.
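If you train from scratch on a different dataset, you could estimate your own per-channel statistics. A minimal sketch (my own helper, not part of the one-pager, assuming the loader yields (B, 3, H, W) tensors scaled to [0, 1]):

import torch

def channel_stats(loader):
    # Accumulate per-channel pixel sums to estimate the dataset mean/std.
    n_pixels = 0
    s = torch.zeros(3)
    s2 = torch.zeros(3)
    for images, _ in loader:
        n_pixels += images.size(0) * images.size(2) * images.size(3)
        s += images.sum(dim=[0, 2, 3])
        s2 += (images ** 2).sum(dim=[0, 2, 3])
    mean = s / n_pixels
    std = (s2 / n_pixels - mean ** 2).sqrt()
    return mean, std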
Load the ImageNet dataset:

train_dataset = torchvision.datasets.ImageFolder(
    root='/kaggle/input/imagenet-object-localization-challenge/ILSVRC/Data/CLS-LOC/train',
    transform=transform
)

These lines load the ImageNet dataset in Kaggle's directory layout and apply all the transformations defined above.
Create a dataloader for the dataset:

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)

The torch.utils.data.DataLoader function creates a dataloader for the dataset. The batch_size parameter specifies the number of samples per batch, the shuffle parameter specifies whether to shuffle the data at each epoch, and the num_workers parameter specifies the number of worker processes used to load the data. A rule of thumb for the number of workers is the number of CPU cores minus one, leaving a core for the controlling process; os.cpu_count() or multiprocessing.cpu_count() can help with this, as in the snippet below.
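For example, a small convenience snippet (my own, not part of the original one-pager):

import multiprocessing

# Leave one core free for the controlling (main) process.
num_workers = max(1, multiprocessing.cpu_count() - 1)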
Load the ResNet50 model:

model = torchvision.models.resnet50(pretrained=True)

This line uses the torchvision.models.resnet50 function to load the ResNet50 model, with the pretrained parameter set to True to use weights pretrained on ImageNet.
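Note that in torchvision 0.13 and later the pretrained flag is deprecated in favor of the weights argument; the equivalent call would be:

from torchvision.models import resnet50, ResNet50_Weights

# Equivalent to pretrained=True in torchvision >= 0.13
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)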
Parallelize training across multiple GPUs:

model = torch.nn.DataParallel(model)

torch.nn.DataParallel wraps a model, splits each input batch across the available GPUs, and replicates the model on each of them. The model then runs in parallel on every GPU, and the per-GPU results are gathered and concatenated together. This normally speeds up training significantly, especially for large models on GPUs with many parallel processing cores.
Move the model to the device:

model = model.to(device)

This line moves the model and its parameters to the device specified earlier.
Set the loss function and optimizer:

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

The torch.nn.CrossEntropyLoss function creates a cross-entropy loss criterion, which is commonly used for classification tasks. The torch.optim.Adam function creates an Adam optimizer with the specified learning rate. The model.parameters() method returns an iterator over the model's trainable parameters, which the optimizer adjusts during training.
Train the model:

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        # Move input and label tensors to the device
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

    # Print the loss for every epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}')
This block of code trains the model for 10 epochs. An epoch is a complete pass through the training data. In each epoch, the code iterates over the dataloader, which yields batches of inputs and labels. The inputs and labels are moved to the device, and the gradients are zeroed with the optimizer.zero_grad() method.
During training, the gradients of the model's parameters are computed using backpropagation: the loss gradient is propagated back through the model's layers to obtain the gradient of the loss with respect to each parameter. The optimizer then uses these gradients in its update rule.

However, if you don't zero the gradients before each training step, they accumulate, and the update rule is based on the sum of the gradients over all previous training steps. This can cause the model's parameters to oscillate or diverge, leading to poor convergence and potentially poor model performance.

By calling optimizer.zero_grad() before each training step, you reset the gradients of the model's parameters to zero, ensuring that the update is based only on the gradients of the current step.

Next, the model performs a forward pass on the inputs, producing output logits. The loss is computed from the output logits and the labels, and the model's gradients are computed with the loss.backward() method. Finally, the optimizer takes a step to update the model's parameters using those gradients. A small demonstration of the gradient accumulation behaviour follows below.
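Here is a tiny self-contained demo of that accumulation (my own illustration, not from the one-pager):

import torch

x = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(x^2)/dx at x=2 is 4.
(x ** 2).sum().backward()
print(x.grad)  # tensor([4.])

# Without zeroing, a second backward pass adds to the stored gradient.
(x ** 2).sum().backward()
print(x.grad)  # tensor([8.]) - accumulated, not replaced

# Zeroing the gradient restores the expected behaviour.
x.grad.zero_()
(x ** 2).sum().backward()
print(x.grad)  # tensor([4.])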
Training environment with Docker

FROM pytorch/pytorch

# Install additional dependencies
RUN apt-get update && apt-get install -y \
    wget \
    unzip

# Copy Kaggle credentials
RUN mkdir -p /root/.kaggle
ADD kaggle.json /root/.kaggle

# Install Kaggle toolchain and download ImageNet
# Remember to join the competition https://www.kaggle.com/c/imagenet-object-localization-challenge
RUN pip install -q kaggle
RUN kaggle competitions download -c imagenet-object-localization-challenge
RUN unzip imagenet-object-localization-challenge.zip -d imagenet-object-localization-challenge

# Download the pretrained model and store it in the image layer
# Available at the path /root/.cache/torch/
RUN python -c "import torchvision; model = torchvision.models.resnet50(pretrained=True)"

# Copy the code
COPY resnet50.py /app/

# Set the working directory
WORKDIR /app

# Run the code
CMD ["python", "resnet50.py"]
This Dockerfile is based on the pytorch/pytorch image, which provides all the necessary dependencies for running PyTorch programs with GPU acceleration.

The Dockerfile installs the wget and unzip utilities, which are needed to download and extract the ImageNet dataset. It then downloads the dataset and extracts the images into the imagenet-object-localization-challenge directory.

Next, the Dockerfile copies the resnet50.py file, which should contain the code for training the ResNet50 model, into the /app directory. It then sets the working directory to /app and specifies that the python resnet50.py command should be run when the container is started.
It is important to note that you have to provide your own secret in the kaggle.json file to be able to download the data from the ImageNet Object Localization Challenge on Kaggle. Here is the local directory structure:
➜ 2022-12-18-one-pager-training-resnet-on-imagenet git:(master) ✗ tree
.
├── .gitignore
├── Dockerfile
├── kaggle.json
├── resnet50.py
└── README.md
To build the Docker container, you can run the following command:
docker build -t resnet50 .
This will build the Docker container and tag it with the name “resnet50”.
To run the container, you can use the following command:
docker run --gpus all -it resnet50
This runs the container in interactive mode with access to all available GPUs and starts training the ResNet50 model on the ImageNet dataset. You should see the running loss printed to the console as training progresses.
I hope this helps! Let me know if you have any questions.
09 Dec 2022
Search across all databases (schemas) for tables without a primary key

The following query obtains the list of tables without a primary key, the ones that quietly destroy database performance:
SELECT
    t.table_name
FROM information_schema.tables as t
LEFT JOIN information_schema.table_constraints as tc
    ON (
        t.table_schema = tc.table_schema
        AND t.table_name = tc.table_name
        AND tc.constraint_type = 'PRIMARY KEY'
    )
WHERE
    t.table_type = 'BASE TABLE'
    AND t.table_schema not in ('pg_catalog', 'information_schema')
    AND tc.constraint_name is NULL
In this example, the information_schema.tables table is used to find all tables registered in PostgreSQL, and a LEFT JOIN with information_schema.table_constraints pulls in the related constraints. The WHERE clause filters out the system schemas and keeps only the tables that do not have a primary key.
Restrict the search for tables without a primary key to a specific database (schema)
SELECT
    t.table_name
FROM information_schema.tables as t
LEFT JOIN information_schema.table_constraints as tc
    ON (
        t.table_schema = tc.table_schema
        AND t.table_name = tc.table_name
        AND tc.constraint_type = 'PRIMARY KEY'
    )
WHERE
    t.table_type = 'BASE TABLE'
    AND t.table_schema not in ('pg_catalog', 'information_schema')
    AND t.table_schema = '<database name>' -- put database name here
    AND tc.constraint_name is NULL
A friendly piece of advice: the result of both queries should be an empty set.
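If you want to automate this check, for example in CI, here is a minimal sketch using psycopg2; the connection parameters are placeholders you would replace with your own:

import psycopg2

QUERY = """
SELECT t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS tc
    ON (t.table_schema = tc.table_schema
        AND t.table_name = tc.table_name
        AND tc.constraint_type = 'PRIMARY KEY')
WHERE t.table_type = 'BASE TABLE'
    AND t.table_schema NOT IN ('pg_catalog', 'information_schema')
    AND tc.constraint_name IS NULL
"""

# Placeholder connection parameters -- substitute your own.
conn = psycopg2.connect(host="localhost", dbname="mydb", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    offenders = cur.fetchall()

# The happy path: no tables without a primary key.
assert not offenders, f"Tables without a primary key: {offenders}"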
Happy querying!
03 Dec 2022
Have you ever built graphs in matplotlib or any other GUI running in Docker? You will have to overcome some obstacles:
- a missing X11 server inside Docker
- missing X11-related libraries needed to render your GUI elements
For example, I'm writing this simple Python program:
import matplotlib.pyplot as plt
import numpy as np
# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)
# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()
First, allow local clients to connect to the X11 server; a command like xhost +local: can do this (adjust to your setup).
Then I want to run it in the Docker container by issuing the command:
docker run -it --rm --env="DISPLAY" -v "/tmp/.X11-unix:/tmp/.X11-unix:rw" -v $PWD:/app python-container python /app/example.py
This fails with the warning:

/app/examples/example.py:9: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
plt.show()
To fix the Matplotlib is currently using agg error, one dependency must be installed. Below I provide a minimalist Dockerfile which includes the most important X11 library dependency, libx11-6; it works fine on Ubuntu 20.04-22.04 and Debian 11+.
FROM python

RUN apt-get update && \
    apt-get install -y \
    python3 \
    python3-setuptools \
    libx11-6 \
    ca-certificates

RUN pip3 install numpy matplotlib

WORKDIR /app
Et voilà, we have a perfectly working, minimalist Docker container that can connect to a local X11 server and create GUI elements!
Note 1. To build the image, run:

docker build -t python-container .

Note 2. What does --env="DISPLAY" stand for? --env propagates the $DISPLAY environment variable into the container.

Note 3. What does -v "/tmp/.X11-unix:/tmp/.X11-unix:rw" do? It mounts the X11 socket into the /tmp folder inside the container, where GUI applications use it to render their elements.
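As an aside: if you don't actually need an interactive window, you can avoid X11 entirely by rendering off-screen with the Agg backend and writing the figure to a file. A minimal sketch of that alternative:

import matplotlib
matplotlib.use("Agg")  # render off-screen; no X11 server required
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.savefig("plot.png")  # write the figure to a file instead of opening a window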
27 Nov 2022
When training, you'd love to know how efficiently the GPU is utilized. Nvidia provides the nvidia-smi tool with its driver. Invoking it without any parameters gives you a matrix of basic GPU parameters:
$ nvidia-smi
Fri Dec 2 23:13:41 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 20% 50C P5 25W / 250W | 744MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4804 G /usr/lib/xorg/Xorg 292MiB |
| 0 N/A N/A 4918 G /usr/bin/gnome-shell 108MiB |
| 0 N/A N/A 10549 G ...390539104842029425,131072 340MiB |
+-----------------------------------------------------------------------------+
But to monitor GPU usage continuously, we have to use the dmon subcommand with metric keys:
$ nvidia-smi dmon -s pucvmet
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk pviol tviol fb bar1 sbecc dbecc pci rxpci txpci
# Idx W C C % % % % MHz MHz % bool MB MB errs errs errs MB/s MB/s
0 24 46 - 2 4 0 0 810 1151 0 0 791 6 - - 0 14 0
0 19 46 - 0 3 0 0 810 1151 0 0 791 6 - - 0 0 2
0 20 45 - 0 3 0 0 810 1151 0 0 791 6 - - 0 0 2
0 21 46 - 1 3 0 0 810 1151 0 0 791 6 - - 0 0 2
0 19 45 - 9 10 0 0 810 1151 0 0 821 6 - - 0 0 0
The parameters to watch:
- p - Power Usage (in Watts) and Gpu/Memory Temperature (in C) if supported
- u - Utilization (SM, Memory, Encoder and Decoder Utilization in %)
- c - Proc and Mem Clocks (in MHz)
- v - Power Violations (in %) and Thermal Violations (as a boolean flag)
- m - Frame Buffer and Bar1 memory usage (in MB)
- e - ECC (Number of aggregated single bit, double bit ecc errors) and PCIe Replay errors
- t - PCIe Rx and Tx Throughput in MB/s (Maxwell and above)
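For monitoring from inside a training script, the NVML Python bindings (the pynvml module, e.g. from the nvidia-ml-py3 package) expose the same counters; a minimal sketch, assuming a single GPU at index 0:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# SM and memory-controller utilization, analogous to the 'u' key above
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
# Frame buffer memory usage, analogous to the 'm' key above
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"GPU util: {util.gpu}%  mem util: {util.memory}%")
print(f"FB memory: {mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB")

pynvml.nvmlShutdown()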