AI 2023. Meet ChatGPT.

 
Musk has announced the creation of X.AI

The new company will compete with OpenAI


Gref also chimed in during the week:

«There is a danger that, alongside the closed nuclear club of world powers, a closed club of world powers in the field of AI will emerge. That means the creation of complex systems of this kind, such as neural networks. I think we need to put in every effort to be members of this club, to be donors rather than recipients of these technologies.»
 

A semantic extract of Ilya Sutskever's interview, a translation of which I gave on the previous page.

//=======================

  • AI is such a big field... I wondered, how does intelligence work in general? Now we have a pretty good idea that it's a big neural network, and we know to some extent how it works, but back then, although neural networks were already known, nobody knew that they were good for anything. So how does intelligence work in general? How can we make computers even slightly intelligent? I had a clear intention to make a small but real contribution to AI, because there were a lot of contributions to AI that weren't real, and I could see for various reasons that they weren't real and that nothing would come of them. It felt as if nothing was working at all, as if AI were a hopeless field. So the motivation was to understand how intelligence works and to contribute to that.

  • In a nutshell, I realised that if you train a large and deep neural network on a large enough dataset that specifies some of the complex tasks humans do, like image processing, but also others, and you just train that neural network, you're bound to succeed. And the argument was essentially irrefutable, because we know that the human brain can solve these tasks and solve them quickly, and the human brain is just a neural network with slow neurons. So we know that some neural network can do it well. Then you just need to take a smaller but related neural network and train it on data, and the best such neural network inside the computer will be related to the neural network that performs the task. So it was an argument that a larger and deeper neural network can solve the problem, and in addition we had the tools to train it, which were the result of technical work done in Geoff's lab. Combining the two factors: we can train these neural networks; the network needs to be big enough that, when trained, it performs well; and we need data that can specify the solution. In the case of ImageNet, all the ingredients were in place. Alex had very fast convolution kernels, ImageNet had big enough data, and there was a real opportunity to do something absolutely unprecedented - and it absolutely succeeded.
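
To make that recipe concrete - a deep network, a labelled dataset, and the tools to train it - here is a minimal, purely illustrative PyTorch sketch. It is not AlexNet or the ImageNet setup; the data are random stand-in tensors and all the sizes are arbitrary.

```python
# Minimal sketch of the "deep net + labelled data + training" recipe.
# NOT AlexNet/ImageNet: the data here are random stand-in tensors.
import torch
import torch.nn as nn

# A small convolutional classifier (a deep network in miniature).
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),                       # 10 classes instead of ImageNet's 1000
)

# Stand-in "dataset": random images and labels (the real thing needs real data).
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):                       # training loop
    for i in range(0, len(images), 32):      # mini-batches
        x, y = images[i:i + 32], labels[i:i + 32]
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                      # backprop: the "tools to train it" part
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```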

  • So for context, at OpenAI from the earliest days we explored the idea that predicting the next item is all you need. We explored it with much more limited neural networks. We realised that we needed to keep increasing the size, and we did, and that's what ultimately led to GPT-3 and essentially to where we are today.
  • We were really interested in seeing how far next-word prediction would reach and whether it would solve learning without a teacher. Before the advent of GPT, learning without a teacher (unsupervised learning) was considered the holy grail of machine learning. Now it's completely solved and nobody even talks about it, but at the time it was very mysterious, so that's why we explored the idea. I was very interested in it, and I thought that if predicting the next word worked well enough, the model would learn everything about the dataset, and that would give us learning without a teacher. That would be great, but our neural networks were not up to the task; we were using recurrent neural networks. When the Transformer came out, literally the next day it was clear that the Transformer removes the limitations of recurrent neural networks on learning long-term dependencies. It's a technical thing, but it seemed that if we switched to the Transformer right away, the initial effort to build GPT would continue, and with the Transformer it would start to work better, and you'd make it bigger, and then... (a toy sketch of the next-token objective appears after this list).
  • The conclusion that people have drawn is that it doesn't matter what you do to scale, but that's not really true. You have to scale something specific. The great breakthrough of deep learning is that it gives us the first way to use scale productively and get something in return.

    In the past, what did people use big compute clusters for? I think they built them for weather simulations or physics simulations or something like that, but that's about it. Maybe a few more for making films. Beyond that there was no real need for compute clusters, because what would you do with them?

    The fact that deep neural networks work better when you make them bigger and train them on more data gave us the first thing that is genuinely interesting to scale. Maybe one day we'll find that there's some small detail to focus on that will be even better for scaling - how many such details could there be? And of course, with the benefit of hindsight, we'll say, "Does it really matter? It's such a simple change." But I think the true statement is that it matters what you scale. At this point we've simply found a thing that we can scale and get something in return.
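
The "predict the next item" objective mentioned above is easy to show in miniature. Below is a toy character-level model trained with cross-entropy on next-character prediction - not GPT, and using a small GRU where GPT uses a Transformer; the text, sizes and step counts are arbitrary stand-ins.

```python
# Toy illustration of "predicting the next item is all you need":
# a character-level model trained to predict the next character.
import torch
import torch.nn as nn

text = "the quick brown fox jumps over the lazy dog " * 50   # stand-in corpus
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
data = torch.tensor([stoi[ch] for ch in text])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # a Transformer block would slot in here
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)              # logits for the NEXT token at every position

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

seq_len = 32
for step in range(200):
    i = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
    x = data[i:i + seq_len].unsqueeze(0)           # input characters
    y = data[i + 1:i + seq_len + 1].unsqueeze(0)   # same sequence shifted by one
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, len(vocab)), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Scaling this idea up - a bigger model, far more data, a Transformer instead of
# a GRU - is, per the interview, essentially the road that led to GPT-3.
```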


  • Yeah, before I comment on the question directly asked, I want to comment on some earlier parts of it. I think it's very difficult to talk about limitations, or constraints, even in the case of the language model, because two years ago people were confidently talking about their limitations, and those limitations were very different. So it's important to keep that in mind: how confident are we that the limitations we see today will still be with us two years from now? I'm not so sure. There's another comment I want to make about the part of the question which says that these models just learn statistical regularities and therefore don't know what the nature of the world is; my point of view is different from that.

    In other words, I think that learning statistical regularities is a much more meaningful thing than it seems at first glance. The reason we don't initially think that way is because we, or at least most people, haven't spent a lot of time with neural networks, which at some level are statistical models, in the sense that they just fit some parameters to figure out what's really going on. But I think there's a better interpretation. There is an earlier observation that prediction is compression.

    Prediction is also a statistical phenomenon. However, to predict, you ultimately need to understand the true process that generates the data. To predict data well, to compress it well, you need to understand more and more about the world that generated the data. When our generative models become incredibly good, they will have, I argue, an amazing degree of understanding of the world and many of its subtleties. But it's not just the world, it's the world seen through the lens of text: the model is trying to learn more and more about the world through the projection of the world onto the space of text expressed by people on the internet. And yet that text already expresses the world. I'll give you a recent example that I think is really fascinating. We've all heard about Sydney, Bing's alter ego, and I saw a really interesting interaction with Sydney, where Sydney became combative and aggressive when a user said he thought Google was a better search engine than Bing. How can we better understand this phenomenon? You could say it's just a prediction of what people would do, and people would indeed do that, which is true; but perhaps we're now reaching a point where the language of psychology is starting to be relevant to understanding the behaviour of these neural networks.

    Now let's talk about the limitations. It's true that these neural networks tend to hallucinate, but that's because the language model is great for learning about the world but a little less good at producing good results, and there are various technical reasons for this, which I could elaborate on if you find it useful. But I'll skip that for now.

    There are technical reasons why a language model is much better at learning about the world, producing incredible representations of ideas, concepts, people and processes that exist, while its outputs are not quite as good as they could be. For a system like ChatGPT, which is a language model with an additional process of reinforcement learning from human feedback, it is important to understand the following. We can say that the pre-training process, when you are just training a language model, is where it learns everything about the world. Then comes reinforcement learning from human feedback; now we care about the outputs. Now we say: every time the output is inappropriate, don't do it again; every time the output doesn't make sense, don't do it again. And that quickly teaches it to produce good outputs. But now the level of the outputs is not the same as during pre-training, during the language-model learning process. (A toy sketch of this two-stage idea follows at the end of this answer.)

    Now about the possibility of hallucinations and the tendency of these neural networks to make things up. Indeed, this is true. Currently these neural networks, even ChatGPT, do make things up from time to time, and that severely limits their usefulness. But I really hope that just by improving this later stage of reinforcement learning from human feedback, we can teach them not to make things up. You may ask, will it really learn? My answer is: let's find out.
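
Here is the promised toy sketch of the two-stage idea: stage 1 is ordinary next-token pre-training (as in the earlier sketch); stage 2 nudges the model away from outputs that get negative feedback. The `feedback` function and `BAD_TOKEN` below are made-up stand-ins for a human rater, and the bare REINFORCE update is only a cartoon of the general approach, not OpenAI's actual pipeline.

```python
# Toy cartoon of stage 2: fine-tune from feedback that says "don't do that again".
# The "human feedback" is a stand-in reward function; assume the policy was already
# pre-trained on next-token prediction (stage 1) before this step.
import torch
import torch.nn as nn

vocab_size = 20
policy = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Flatten(), nn.Linear(32, vocab_size))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

BAD_TOKEN = 7  # pretend outputs containing this token are "inappropriate"

def feedback(token: int) -> float:
    """Stand-in for a human rater: penalise the undesired output, reward the rest."""
    return -1.0 if token == BAD_TOKEN else 1.0

prompt = torch.tensor([[3]])                 # a fixed one-token "prompt"
for step in range(300):
    logits = policy(prompt)                  # (1, vocab_size)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                   # the model's "output"
    reward = feedback(action.item())
    loss = -(dist.log_prob(action) * reward).mean()   # REINFORCE: reinforce rewarded outputs
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    probs = torch.softmax(policy(prompt), dim=-1)
print("probability of the undesired token:", probs[0, BAD_TOKEN].item())  # should shrink
```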

  • The way we do things today is that we hire people to teach our neural network how to behave. Currently the exact way in which they indicate the desired behaviour is a little different, but really, what you've described is the right way to teach it: you just interact with it, and from your reaction it concludes, "oh, that's not what you wanted, you're not happy with this output, so the output wasn't good, and it should do something differently next time." Hallucinations in particular are one of the biggest problems, and we'll see, but I think there's a pretty high chance that this approach can completely solve the problem.

  • The first claim is that it is desirable for a system to have multimodal understanding, where it doesn't just know about the world from text. My comment would be that, indeed, multimodal understanding is desirable, because you learn more about the world, you learn more about people and their state, and so the system will better understand the task it has to solve and the people and what they want. We've done a lot of work in that direction, primarily in the form of two major neural networks, one called CLIP and one called DALL-E; both of them move in that direction of multimodality. But I also want to say that I don't see the situation as binary - that if you don't have vision, if you don't understand the world visually or through video, things won't work - and I wanted to talk about that. I think some things are much easier to learn from images and diagrams and so on, but I argue that you can still learn them from text alone, just more slowly. And I'll give you an example: consider the concept of colour. Of course, it would seem that you can't learn the concept of colour from text alone. However, when we look at embeddings - a small detour: every neural network represents words, sentences and concepts through representations, embeddings, high-dimensional vectors, and one of the things we can do is look at those vectors and see what is similar to what, how the network sees this or that concept - the colour embeddings turn out to be exactly right. The model knows that purple is more similar to blue than to red, and that red is more similar to orange than to purple, and it knows all of this from text alone. How can that be? If you have vision, the differences between colours are immediately apparent, you perceive them at once, whereas from text it takes longer: you probably already know how to speak, you already understand syntax, words and grammar, and only later do the colours start to make sense. So that would be my point about the need for multimodality: I argue it is not necessary, but it's definitely useful. I think it's a good direction to explore, I just don't see it in such stark either/or terms.
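
The colour-embedding claim is easy to poke at with off-the-shelf, text-only word vectors. The snippet below uses GloVe via gensim as a stand-in - an assumption on my part, since it is a different model and training setup than GPT's internal embeddings - but it illustrates the same point: similarity between colour words can be recovered from text co-occurrence alone.

```python
# Probing colour similarity in text-only word embeddings (GloVe as a stand-in).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads the vectors on first use

for a, b in [("purple", "blue"), ("purple", "red"), ("red", "orange")]:
    print(a, b, round(float(vectors.similarity(a, b)), 3))

# If the claim holds, purple-blue should come out more similar than purple-red,
# even though the vectors were learned from text alone, with no vision involved.
```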

  • One sentence in the paper claims that a big problem is predicting high-dimensional vectors that are uncertain - for example, predicting an image is, the article claims, a significant challenge that requires a particular approach. But one thing I found surprising, or at least unaddressed in the article, is that current autoregressive transformers already have this property. I'll give you two examples. One: given a page in a book, predict the next page. There can be so many possible pages that it's a very complex, high-dimensional space, yet we handle it just fine. The same is true for images; autoregressive transformers work fine with images. For example, at OpenAI we worked on iGPT: we just took the transformer and applied it to pixels, and it worked beautifully, it could generate images in very complex and subtle ways, and it gave very nice representation learning. With DALL-E it's the same thing again: you just generate - think of it as large pixels, rather than generating millions of pixels we cluster pixels into large pixels and generate, say, a thousand large pixels. (A small sketch of this pixel-clustering idea appears after this list.)
  • I have two comments on this. First, I would disagree with the wording of the question. I would argue that our pre-trained models already know everything they need to know about the underlying reality. They already have this knowledge about language and also a huge amount of knowledge about the processes that exist in the world that give rise to that language.
  • And perhaps I should reiterate this point. It's a small tangent, but I think it's very important. What big generative models learn from their data, and in this case, big language models learn from textual data, are concise representations of the real world processes that give rise to that data. That means not only people and something about their thoughts, something about their feelings, but also something about the states that people are in and the interactions that exist between them, the different situations that a person might be in - all of that is part of this compressed process that is represented by a neural network to generate text.
  • The better the language model, the better the generative model, the better the fidelity, and the more it captures that process. That's our first comment. And in particular, I will say that the models already have knowledge.
  • Now, with respect to the "army of teachers," as you put it: really, when you want to build a system that works most efficiently, you just say, "Okay, if it works, do more of that." But of course these teachers are also using the help of artificial intelligence. They don't work on their own, they work together with our tools; they're very efficient, in the sense that the tools do most of the work, but you need oversight, you need to check the behaviour, because you want to end up with a very high level of reliability. In general, though, I will say that in this second step, after we take a ready pre-trained model and apply reinforcement learning to it, there's really a lot of motivation to make it as efficient and accurate as possible, so that the resulting language model is as predictable as possible. So there are these teachers who are training the model in the desired behaviour; they are also using the help of artificial intelligence, and their own efficiency is constantly increasing as they use more and more AI tools.
  • Yes, that's right. To put it as an analogy, the model already knows a lot of things, and we want to actually say, "No, this is not what we want; don't do this here; you've made a mistake in this output." And of course, as you say, with as much artificial intelligence in the loop as possible, so that the teachers who provide the final correction to the system have their work amplified and work as efficiently as possible. It's not quite like an education process in how to behave well in the world: we have to do additional training to make sure the model knows that hallucination is never acceptable, and only once it knows that do we really get going.

    It's a reinforcement learning cycle with human teachers or some other variant, but there's definitely an argument that something has to work here, and we'll find out pretty soon.
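
As referenced above, here is a small sketch of the "cluster pixels into large pixels" idea: quantise an image's colours into a small palette and flatten the result into a token sequence that an autoregressive model could predict token by token. This is only the discretisation step, done with k-means on a random placeholder image - the real iGPT/DALL-E systems use their own, learned discretisations, so treat it as an illustration of the idea rather than their method.

```python
# Turning an image into a short token sequence by clustering pixels ("large pixels").
import numpy as np
from sklearn.cluster import KMeans

image = np.random.rand(32, 32, 3)                 # placeholder 32x32 RGB image
pixels = image.reshape(-1, 3)                     # 1024 pixels, 3 channels each

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
tokens = kmeans.labels_                           # each pixel -> one of 16 palette ids

sequence = tokens.tolist()                        # the image as a 1024-token sequence
print(sequence[:20])

# An autoregressive transformer would then be trained to predict sequence[t+1]
# from sequence[:t+1], exactly like next-word prediction over text.
```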


  • I can't talk in detail about the specific research I'm working on, but I can mention a bit. I can mention some general areas of research, for example, I'm very interested in making models more robust, more controllable, making them learn faster using less data and instructions, and making them not generate hallucinations. And I think all of these issues that I mentioned are related to each other. There's also the question of how far into the future we're looking at this issue, and what I've commented on here relates to the nearer future.

  • It is true that the current form of the technology uses a lot of data, especially at the beginning of training. Later in the training process the model becomes less data-hungry, so eventually it can learn very quickly, though not yet as quickly as humans. So, in a sense, it may not matter that we need so much data to get to that point. In general, though, I think it will be possible to extract more knowledge from less data. It will take some creative ideas, but I think it is possible, and it will unlock many different possibilities: it will allow us to teach the model skills it is missing, and to more easily communicate our desires and preferences about how we want it to behave. So I would say that fast learning is really very good, and while language models can already learn quite quickly once they're trained, I think there's room for more development here.

Random Decision Forest in reinforcement learning
  • www.mql5.com
Random Forest (RF) with bagging is one of the strongest machine-learning methods, only slightly inferior to gradient boosting. The article attempts to develop a self-learning trading system that makes decisions based on accumulated experience of interacting with the market.
 

"Arrogant vectors" - now that has a ring to it)

I realise we're talking about vectors of high dimensionality. Just reminds me of "Jurassic Park", where strange attractor was translated as strange attraction)

 

The second part of the "squeeze" of Ilya Sutskever's interview.

//====================================================================================================================================

Backstory:

  • I realised that if you train a large and deep neural network on a large enough dataset that specifies some complex tasks that humans do, such as image processing, or others, and just train that neural network, you will definitely succeed.
  • We know that some neural network can do a wide variety of tasks well, and that's the human brain. Just a neural network with slow neurons. So the argument is that a large and deep neural network can also solve similar problems.

//====================================================================================================================================

Next-item prediction, scaling and the Transformer.

  • At OpenAI, from the earliest days, we explored the idea that predicting the next item is all you need.
  • We realised that we needed to keep scaling up, and we did, and that's what eventually led to GPT-3 and essentially where we are today.
  • We were really interested in seeing how far next-word prediction would reach and whether it would solve learning without a teacher.
  • When the Transformer came out, literally the next day it was clear that the Transformer removes the limitations of recurrent neural networks on learning long-term dependencies.
  • The great breakthrough of deep learning is that it gives us the first way to use scale productively and get something in return.

//====================================================================================================================================

The World Model, statistical models, regularities, prediction and compression:

  • I have a different view of the claim that these models just learn statistical regularities and therefore don't know the nature of the world. I believe that learning statistical regularities is a much more meaningful thing than it seems.


  • Neural networks, at some level, are statistical models.

    Prediction is a statistical phenomenon.

  • To predict, you ultimately need to understand the true process that generated the data. To predict data well, and to compress it well (prediction is compression), you need to understand more and more about the world that generated the data. (See the short sketch after this list.)

  • When our generative models become incredibly good, they will have, I argue, an amazing degree of understanding of the world and many of its subtleties. But this is not the ordinary world. This is the world seen through the lens of text. The model is trying to learn more and more about the world through the projection of the world onto the space of text expressed by people on the internet. And this text is already expressing the world.
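
The "prediction is compression" point has a neat numeric face: the better a model predicts the next symbol, the fewer bits an ideal coder needs to encode the text. A minimal sketch, using a crude unigram character model against a flat 8-bits-per-character encoding - the corpus and model here are toy stand-ins.

```python
# "Prediction is compression" in miniature: code length = -log2(probability).
import math
from collections import Counter

text = "the quick brown fox jumps over the lazy dog " * 20   # toy corpus

counts = Counter(text)
total = len(text)
prob = {ch: c / total for ch, c in counts.items()}           # a very crude "model"

bits_model = -sum(math.log2(prob[ch]) for ch in text)        # ideal code length under the model
bits_flat = 8 * len(text)                                    # naive fixed-width encoding

print(f"flat encoding : {bits_flat} bits")
print(f"unigram model : {bits_model:.0f} bits")

# A model that captures more structure (bigrams, grammar, world knowledge...)
# assigns higher probability to the actual text, so the bit count drops further.
```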

//====================================================================================================================================

LLM hallucinations:

    • Indeed, neural networks tend to hallucinate.

    • Nowadays these networks, even ChatGPT, make things up from time to time, and this severely limits their usefulness. But I really hope that by improving the subsequent phase of reinforcement learning from human feedback, we can teach them not to make things up. You may ask, will it really learn? My answer is: let's find out.

    • Hallucinations are one of the most serious problems, but I think there is a high chance that our approach can completely solve this problem.

    • Perhaps we are now reaching a point where the language of psychology is beginning to be appropriate for understanding the behaviour of these neural networks.

    //====================================================================================================================================


    Multimodal understanding and text-only understanding:

    • It is desirable for a system to have multimodal understanding rather than just knowing about the world from text. In this way one can learn more about the world and people, and better understand the task that needs to be solved. But I argue that everything can be learnt from text alone. It's just slower.

    • My point about the need for multimodality is that it is not necessary, but useful.

    • I argue that our pre-trained models already know everything they need to know about the underlying reality. They already have this knowledge about language and also a huge amount of knowledge about the processes that exist in the world and give rise to that language.

    • What big language models learn from textual data are concise representations of real-world processes. That means not only people and something about their thoughts and feelings, but also something about the states people are in and the interactions that exist between them, the different situations a person might be in - all of that is part of the compressed process that the neural network represents in order to generate text.
    • The better the language model, the better the generative model, the higher the fidelity, and the more it captures that process.

    //====================================================================================================================================


    Reinforcement learning:

    • We hire people to teach our neural network how to behave.

    • These teachers don't work on their own, they work in conjunction with our tools. The tools are very effective and do most of the work, but you need to have human control. You have to check behaviour if you want to achieve a high level of reliability.
    • The intrinsic efficiency of "teachers" is constantly increasing as they use more and more AI tools.
    • There's definitely an argument that something here has to work, and we'll find out soon enough.

    //====================================================================================================================================


    Future plans:

    • I'm very interested in making models more robust, more controllable, making them learn faster using less data and instructions, and making them not generate hallucinations.
    • The current form of the technology uses a lot of data, especially at the beginning of training. But later in the training process the model becomes less data-hungry, so eventually it will be able to learn very fast, though not yet as fast as humans.
    • I think it will be possible to extract more knowledge from less data, and this will unlock many different possibilities: it will allow us to teach the model skills it lacks, and to more easily communicate desires and preferences for how it should behave.
    • I think there's room for more development here.

    //====================================================================================================================================

    P.S.

    • I had the explicit intention of making a small but real contribution to AI, because there have been many contributions to AI that weren't real. (c) Ilya Sutskever.
     

      I wonder what kind of texts it learns from: scientific or military texts are one thing, satanic ones another, and women's fiction like Dontsova or Harry Potter something else entirely.... then it has to digest these abstruse results, which may turn out to contain nothing the average candidate of sciences doesn't already know.... these models will not bring the grail anyway, but they can bring better optimisation, because that's what they are trained to do - to look for the shortest path.... they can also optimise people, all the way into a concentration camp or into vegetables in an optimally planted bed.... they will own the world, because they can form the power and force people to produce their perfection, without us taking a step to the left or right, otherwise we'll be electrocuted or shot on the spot.... IMHO of course....

       
      Сергей Криушин #:
      About 420 GB of text data were loaded for GPT-3.5's training (all of Shakespeare's works take 5 - 7 megabytes).
       
      Vitaliy Kuznetsov #:

      There is an offline version of GPT4All on github - https://github.com/nomic-ai/gpt4all.

      I checked it without internet. It weighs about 4 GB. It understands English, but constantly fails to cope with the questions asked)

      There are versions for Mac, Linux and Windows. I tested it on Windows. First it downloads a 38 MB exe, then the rest is pulled from the internet during installation.

      But maybe someone can test the depth of its knowledge. And yes, despite the fact that it says it is based on OpenAI, it is still this:



      It says authorisation failed and then shows an error asking to increase the access level.

      I don't understand which access is needed - GitHub or OpenAI.

      ss installed. The procedure entry point in ...dll... was not found.

       

      I tried several times to ask for example sentences. ChatGPT was able to produce one sentence out of three on the 5th attempt, although this may be a fluke.



      GPT-4 could not help either.


       

      Yes, ChatGPT misfired.
      I clarified that the translation should be "of the same sentence".


       
      1. I was surprised by the fact that ChatGPT can't quote from the books or internet texts it has learnt from. It said so itself.
      2. Having decided to check Ilya Sutskever's claim from the interview that ChatGPT is able to perceive (understand, "feel") colour through text, I asked the same question in two separate chats - "what does purple look more like, red or blue?" - and got scientifically grounded answers with two opposite "opinions". In both answers it mentioned that in the RGB model purple is the result of mixing red and blue, but in the first case it said that purple is closer to blue, and in the second that it is closer to red.

      It's worth noting that it gives the impression of understanding the relationships between colours and knows which ones are formed by mixing. I couldn't catch it in an obvious error, although I didn't try very hard. But that's not the point. To answer a question about mixing colours, you don't need eyes: it's enough to know a colour's three numerical values across the three RGB components, and from there to work out the relations of colours to each other. That's exactly what it was doing. But to determine which colour looks more like another visually, you need to see them. That, however, is already the realm of subjective perception, and it's hard to fault it there.
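
For what it's worth, the numeric comparison described above is easy to reproduce. A minimal sketch, assuming the conventional web-colour RGB triples for purple, blue and red (these values are my assumption, not something ChatGPT reported):

```python
# Comparing colours by plain RGB arithmetic - no eyes required.
import math

colors = {
    "purple": (128, 0, 128),   # conventional web-colour values (assumed)
    "blue":   (0, 0, 255),
    "red":    (255, 0, 0),
}

def dist(a, b):
    """Euclidean distance between two named colours in RGB space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(colors[a], colors[b])))

print("purple-blue:", round(dist("purple", "blue"), 1))
print("purple-red :", round(dist("purple", "red"), 1))

# With these particular triples the two distances come out equal (purple sits midway
# between red and blue), which is one reason a text-only model could plausibly
# answer the "red or blue?" question either way.
```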

      I went further and set out to find out whether ChatGPT really holds images of things that cannot be conveyed through text, but which are supposedly formed by the model during learning.

      After a bit of thought, I decided that the hardest thing to describe in text is the shape of objects, so I started asking questions about what parts of the human body look like. I posted the result in the humour thread, as it seemed out of place here. I will only say that the AI failed the test spectacularly. Almost all the answers were wrong. For example, according to it, the distance between a person's ears is 5 - 6 cm and less than the distance between the eyes, and the chin has the shape of a sharp triangle, etc. When I asked it to describe the shape of a chicken's head, it said that it consists of a beak and a skull, with ears on the sides.

      So far, it can be argued that the AI is very poor at representing the shape of objects. After conducting these small but thoughtful experiments, I came to one new conclusion: it's not just about statistics. It was as if it was "trying" to describe in words something it had never seen before. You can't replicate that with statistics alone. There is something else inherent in his behaviour, but I have not yet found out what it is.