As a new faculty member, when I arrived on the University of Notre Dame campus in 1975, I noticed that no one was using the IBM 360 computer in The Computer Center during the football games. Since this one computer was the only one that most faculty could use, I was eager to spend the time during the football games creating my punch cards and running them through the computer. I told my wife how eager I was to use the computer during the football games, but she said adamantly “No, we are going to the football games.” Oh, well. At least I was able to run my programs before and after, but not during, the football games.
In 1975 that one computer in The Computer Center had less computing power than the Apple Watch on your wrist. But back then it was all we had, so we had to make the best of it. Randomized trials were created and developed to deal with situations, especially in medicine, where the sample sizes were relatively small. Regression analysis could serve as a substitute for, or a complement to, randomized trials when larger samples were available. Artificial intelligence requires much, much larger samples and many simulations to be effective.
The scientific method recognizes that in any given sample, there are two types of relationships. The relationships we want to determine are the general population relationships that appear consistently in sample after sample. But in any given sample there are relationships that are unique to that particular sample and cannot be expected to appear in other samples and are certainly not the population relationships we are after.
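The distinction between the two types of relationships can be illustrated with a small simulation. This is only a sketch under assumed conditions: the population, the sample sizes, and the names `x_real` and `x_noise` are all hypothetical. One variable truly drives `y` in the population, and the other has no relationship to `y` at all.

```python
import numpy as np

def sample_correlations(n_samples=200, n_obs=25, true_slope=1.0, seed=0):
    """Draw many small samples from one assumed population and record,
    for each sample, the correlation of y with a real driver and with
    an unrelated variable."""
    rng = np.random.default_rng(seed)
    real, spurious = [], []
    for _ in range(n_samples):
        x_real = rng.normal(size=n_obs)    # truly related to y
        x_noise = rng.normal(size=n_obs)   # unrelated to y by construction
        y = true_slope * x_real + rng.normal(size=n_obs)
        real.append(np.corrcoef(x_real, y)[0, 1])
        spurious.append(np.corrcoef(x_noise, y)[0, 1])
    return np.array(real), np.array(spurious)

real, spurious = sample_correlations()
print("real driver:   mean r =", real.mean().round(2))
print("unrelated var: mean r =", spurious.mean().round(2),
      " largest |r| in any one sample =", np.abs(spurious).max().round(2))
```

In a run like this, the correlation with the real driver should show up in sample after sample, while the unrelated variable averages out near zero across samples even though some individual samples show a correlation that looks meaningful on its own.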
To avoid overfitting to any particular sample, the scientific method requires that we fully specify the procedure and functional forms of any statistical method we plan to use before we look at the data. By prespecifying the functional form we attempt to avoid the problem of overfitting to the sample and picking up the relationships that are unique to that particular sample and do not represent population relationships. Moreover, if you adjust the functional form to try to get a better fit, you are using the sample data to change your model, and that undermines your ability to track the statistical distributions. We cannot track the statistical distributional effects of adjustments made through your head. Adjusting the functional form means not getting the valid t-statistics or F-statistics that you need for determining the statistical significance of your results. Consequently, if you adjust the functional form of your model in any way to get a better fit to that particular sample, you lose track of the statistical distribution and are likely to overfit to that particular sample and not discover the true population relationships you want. Many people have lost their shirts in the stock market by overfitting to the sample data and getting great, but invalid, numbers labeled t-statistics, F-statistics and R-squared values without getting at the true population relationships they are after. (Note: So-called bootstrap methods use simulations to try to at least determine a good estimate of the variance of the distribution after functional form manipulation.)
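A small simulation can make the cost of adjusting the functional form concrete. The sketch below is illustrative only: the menu of functional forms, the sample size of 30, and the function names are assumptions of mine, not a prescribed procedure. In every simulated sample `y` is pure noise with no relationship to `x`, so any "significant" slope is a false positive. The code compares a prespecified linear form against picking whichever form gives the largest t-statistic in that sample.

```python
import numpy as np

def slope_t_stat(x, y):
    """t-statistic for the slope in a simple OLS regression of y on x."""
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    sxx = np.sum(xc ** 2)
    beta = np.sum(xc * yc) / sxx
    resid = yc - beta * xc
    sigma2 = np.sum(resid ** 2) / (n - 2)
    return beta / np.sqrt(sigma2 / sxx)

def false_positive_rates(n_samples=2000, n_obs=30, critical_t=2.048, seed=0):
    """Share of pure-noise samples declared 'significant' at the 5% level.

    critical_t = 2.048 is the two-sided 5% critical value for
    n_obs - 2 = 28 degrees of freedom.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical menu of functional forms a researcher might try.
    transforms = [
        lambda x: x,
        lambda x: x ** 2,
        lambda x: np.log(x),
        lambda x: np.sqrt(x),
        lambda x: 1.0 / x,
    ]
    prespecified_hits = 0
    searched_hits = 0
    for _ in range(n_samples):
        x = rng.uniform(1.0, 10.0, size=n_obs)
        y = rng.normal(size=n_obs)  # y is unrelated to x by construction
        t_values = [abs(slope_t_stat(f(x), y)) for f in transforms]
        prespecified_hits += t_values[0] > critical_t   # linear form only
        searched_hits += max(t_values) > critical_t     # best-fitting form
    return prespecified_hits / n_samples, searched_hits / n_samples

prespecified, searched = false_positive_rates()
print("prespecified form:", prespecified)
print("best of five forms:", searched)
```

The prespecified rate should land near the nominal 5 percent, while the searched rate can only be equal or higher, because the search reports the largest of several t-statistics but still compares it against a critical value that assumes only one was ever computed.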
The methods used for artificial intelligence intentionally violate the scientific method. They intentionally overfit to the data. They get away with this only by using simulations and a huge volume of data. Rather than acquiring just one sample, they attempt, to as great an extent as possible, to acquire an extremely large number of samples to discover the population relationships that show up in sample after sample.
I don’t know the details of the Transformer model for A.I., but I know what strategy I would pursue. Find the word that most frequently starts a sentence about the subject of interest, such as Alzheimer’s disease. Find the word that most frequently follows that word. Note the correlation. Then find the word that most frequently follows those two words. Here is where it gets interesting. You need to use the two-way correlations and the three-way correlations. As a fourth word is introduced in the same manner, you will need all two-way, three-way, and four-way correlations. Basically you use these correlations to produce the first sentence in your summary essay. This procedure can be followed with the word that most frequently starts the second most frequent sentence that starts with that word. My paper on “Composite Dummy Variables” provides the basic ideas for understanding interaction effects such as those used in ChatGPT, with all available interaction effects as the basis for generating sentences that summarize the large literature on a topic such as Alzheimer’s disease. Follow this link to my paper in ResearchGate: https://www.researchgate.net/search.Search.html?query=composite+dummy+variables&type=publication
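The word-by-word strategy described above can be sketched in a few lines. This is a toy illustration of that frequency-based strategy only, not a description of how a Transformer or ChatGPT actually works. The four sentences are a made-up stand-in for a real literature, raw follow-on counts stand in for the correlations, and the function names are hypothetical. Contexts of one, two, and three preceding words play the role of the two-way, three-way, and four-way relationships.

```python
from collections import Counter, defaultdict

def build_counts(sentences, max_context=3):
    """Count sentence-starting words and which word follows each
    context of 1 to max_context preceding words."""
    start_counts = Counter()
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        if not words:
            continue
        start_counts[words[0]] += 1
        for i in range(1, len(words)):
            for k in range(1, max_context + 1):
                if i - k < 0:
                    break
                context = tuple(words[i - k:i])
                follow_counts[context][words[i]] += 1
    return start_counts, follow_counts

def generate(start_counts, follow_counts, max_context=3, max_words=12):
    """Start with the most frequent opening word, then repeatedly append
    the most frequent follower of the longest context seen in the data."""
    output = [start_counts.most_common(1)[0][0]]
    while len(output) < max_words:
        next_word = None
        # Prefer the longest available context, fall back to shorter ones.
        for k in range(min(max_context, len(output)), 0, -1):
            context = tuple(output[-k:])
            if context in follow_counts:
                next_word = follow_counts[context].most_common(1)[0][0]
                break
        if next_word is None:  # no observed continuation: end the sentence
            break
        output.append(next_word)
    return " ".join(output)

# Made-up toy sentences, for illustration only.
toy_sentences = [
    "memory loss is an early symptom",
    "memory loss is an early symptom",
    "memory loss is common",
    "age is a risk factor",
]
starts, follows = build_counts(toy_sentences)
print(generate(starts, follows))
```

Always taking the single most frequent follower makes the output deterministic. With a corpus this small the generated sentence simply reproduces the most common sentence in the data, which is the overfitting concern in miniature.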