I agree that AGI is nowhere near.
An LLM writes code by generating the next token: it models the context well using word representations in an embedding space of over 1,000 dimensions.
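As a minimal sketch of what "generate the next word from the context" means, here is next-token prediction over a toy vocabulary. The embedding table is random and the context vector is just a mean of embeddings (a real Transformer computes it with attention layers), so this only illustrates the shape of the computation, not any actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and a tiny embedding table; real models use thousands
# of dimensions and tens of thousands of tokens.
vocab = ["for", "i", "in", "range", "(", ")", ":"]
d_model = 8
embed = rng.normal(size=(len(vocab), d_model))

def next_token_probs(context_ids):
    """Turn a context into a probability distribution over the vocabulary."""
    # Crude stand-in for the model: average the context embeddings.
    context_vec = embed[context_ids].mean(axis=0)
    logits = embed @ context_vec             # similarity score per token
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

probs = next_token_probs([0, 1, 2])          # context: "for i in"
predicted = vocab[int(np.argmax(probs))]     # most probable next token
```

The point is that the model never "knows" syntax as rules; it only ranks every possible next token by probability given the context.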
Humans learn to code differently, by learning the constructs of a language, such as a for loop.
The same difference applies elsewhere too; that is, a language model is, at bottom, based on probability.
The core of a Transformer is an embedding space. In principle, any data can be brought in as tokens and converted back.
The embedding space itself doesn’t use words; rather, it describes the relationships of things in the surrounding world with vectors.
This method allows for human-like activity, even if its learning and thinking are not human-like.
Transformers, or variations of them, are used extensively for processing all kinds of data. Multimodal reasoning as a whole rests on the fact that multiple types of data can be processed in the same space.
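The "same space" idea can be sketched concretely: each modality gets its own projection into a shared embedding dimension, after which one Transformer can attend over everything together. The shapes and projections below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d_model = 16   # shared embedding dimension (illustrative)

# Two modality-specific maps into the SAME d_model space:
text_embed = rng.normal(size=(100, d_model))    # lookup table for a 100-token toy vocabulary
patch_proj = rng.normal(size=(4 * 4, d_model))  # linear projection for flattened 4x4 patches

token_ids = np.array([7, 42, 3])                # a "sentence" of 3 tokens
patches = rng.normal(size=(5, 4 * 4))           # 5 image patches

text_tokens = text_embed[token_ids]             # shape (3, d_model)
image_tokens = patches @ patch_proj             # shape (5, d_model)

# Once both are rows of d_model-dimensional vectors, they can be
# concatenated into one sequence for a single Transformer to process.
sequence = np.concatenate([text_tokens, image_tokens])
```

This is the basic trick behind multimodal models: everything becomes tokens in one embedding space.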

I don’t know what this refers to, but Transformers are used in robots and autonomous cars.
Similarly, as far as I know, DeepSeek uses a Transformer-based approach, meaning they have made their own modifications to the basic model. "Transformer-based" here means Self-Attention-style processing of the embedding space.
I mentioned JEPA and Large Concept Models because Meta, led by LeCun, aims for human-like AI.
JEPA is a Self-Supervised learning model in which an x-encoder is trained to predict the information produced by a y-encoder. This is done so that the predictor works in an abstract representation space, where the aim is to form knowledge about what the concrete y-encoder represents (cat, dog, etc.).
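The key point is that the prediction and the loss live in the representation space, not in pixel space. A minimal sketch, with fixed random linear maps standing in for the encoder and predictor networks:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_rep = 32, 8

# Stand-ins for the networks; in JEPA these are trained encoders.
Wx = rng.normal(size=(d_in, d_rep))   # x-encoder (context view)
Wy = rng.normal(size=(d_in, d_rep))   # y-encoder (target view)
Wp = rng.normal(size=(d_rep, d_rep))  # predictor

x = rng.normal(size=d_in)             # context input (e.g. visible patches)
y = rng.normal(size=d_in)             # target input (e.g. masked patches)

sx = x @ Wx                           # abstract representation of the context
sy = y @ Wy                           # abstract representation of the target
pred = sx @ Wp                        # predict the target *representation*

# The loss compares representations, not raw data -- the core JEPA idea.
loss = float(np.mean((pred - sy) ** 2))
```

Training would minimize this loss, pushing the abstract space to capture what the target actually is rather than how it looks pixel by pixel.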

I-JEPA trains a context encoder against a target encoder. Because the predictor operates in an abstract, multi-dimensional space where knowledge about the image's subject forms, the image is not predicted at the pixel level; instead, the model captures what object is being represented.

V-JEPA is similar for video.
Both are built using multiple ViT (Vision Transformer) models.
Large Concept Models (LCMs), also from LeCun's direction, pursue a similar idea.
Because humans don't read a book word by word but take in a longer passage and internalize its content (a concept), an LCM aims to work the same way.
That is, concepts are identified from a long text. Because a concept is a whole, descriptive text can then be generated from it from different starting points; or an image can be produced from the concept, and so on.
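The shift can be shown in miniature: instead of a sequence of token embeddings, the model's unit of processing is one vector per sentence. The sentence encoder below is a random stand-in (an actual LCM uses a trained sentence encoder such as SONAR), and "generation" is replaced by simple nearest-sentence retrieval just to show the direction concept → text.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 12

sentences = ["The cat sat on the mat.",
             "It purred contentedly.",
             "Rain fell outside."]

# Stand-in encoder: one vector per whole sentence, not per token.
concepts = rng.normal(size=(len(sentences), d))

# An LCM would reason over this short sequence of 3 concept vectors
# instead of dozens of token vectors.

# A decoder would turn a predicted concept vector back into text (or an
# image); here we just retrieve the closest known sentence.
query = concepts[0] + 0.01 * rng.normal(size=d)
nearest = int(np.argmax(concepts @ query))
```

The compression from many tokens to a few concept vectors is the whole point: the model manipulates wholes, and surface text is produced from them afterwards.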

LCM is also implemented using a Transformer.

So, when processing text, images, and so on, LeCun's aim is to identify the larger concepts and objects behind the data, and to base the generation of text, images, etc. on those.
Regarding AGI.
Hype around AGI flares up from time to time, often because something behaves in a human-like way.
Two years ago, AGI hype arose when ChatGPT responded like a human.
For some, S-Group's food robots inspire belief that AGI is coming; currently, it's DeepSeek's excellent ideas for enhancing LLMs. And yet no significant new change enabling AGI has actually occurred in AI models.