Machine learning ‘causing science crisis’

Posted: February 17, 2019 by oldbrew in Critique, methodology, research

Science under stress?

Problems can arise ‘because experiments are not designed well enough to ensure that the scientists don’t fool themselves and see what they want to see in the results.’ For example, it seems ‘up to 85% of all biomedical research carried out in the world is wasted effort’.

Machine-learning techniques used by thousands of scientists to analyse data are producing results that are misleading and often completely wrong, reports BBC News.

Dr Genevera Allen from Rice University in Houston said that the increased use of such systems was contributing to a “crisis in science”.

She warned scientists that if they didn’t improve their techniques they would be wasting both time and money. Her research was presented at the American Association for the Advancement of Science meeting in Washington.

A growing amount of scientific research involves using machine learning software to analyse data that has already been collected. This happens across many subject areas ranging from biomedical research to astronomy. The data sets are very large and expensive.

‘Reproducibility crisis’

But, according to Dr Allen, the answers they come up with are likely to be inaccurate or wrong because the software is identifying patterns that exist only in that data set and not the real world.

“Often these studies are not found out to be inaccurate until there’s another real big dataset that someone applies these techniques to and says ‘oh my goodness, the results of these two studies don’t overlap’,” she said.

“There is general recognition of a reproducibility crisis in science right now. I would venture to argue that a huge part of that does come from the use of machine learning techniques in science.”
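Dr Allen’s point about patterns that exist only in one dataset is easy to demonstrate with a small simulation (a hypothetical sketch of my own, not her actual analysis): if you search enough candidate features, pure noise always contains a striking “pattern”, and it fails to replicate on a second dataset.

```python
import random
import statistics

random.seed(42)

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

n_samples, n_features = 20, 1000

# Pure noise: no feature has any real relationship to the outcome.
outcome = [random.gauss(0, 1) for _ in range(n_samples)]
features = [[random.gauss(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

# "Discover" the feature most correlated with the outcome in this dataset.
best = max(range(n_features),
           key=lambda j: abs(correlation(features[j], outcome)))
r_discovery = correlation(features[best], outcome)

# Measure the same kind of "pattern" on an independent dataset: it vanishes.
outcome2 = [random.gauss(0, 1) for _ in range(n_samples)]
feature2 = [random.gauss(0, 1) for _ in range(n_samples)]
r_replication = correlation(feature2, outcome2)

print(f"discovery dataset:   r = {r_discovery:+.2f}")  # typically large
print(f"replication dataset: r = {r_replication:+.2f}")  # typically near zero
```

With 1,000 features and only 20 samples, the best in-sample correlation is large purely by chance, which is exactly the “doesn’t overlap with the second dataset” failure she describes.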

Continued here.

  1. JB says:

    Sifting a dataset isn’t learning. Computer code that writes new code resembles learning, but there’s no effective way for a computer to verify its new code against reality.

    This is something humans already have a difficult time with. If humans can’t eliminate a hypothesis based on the philosophical error of reification, how could a machine possibly learn to do that?

    We have manufactured a world of self-delusions, and the notion of “artificial intelligence” invested in a machine is one of them.

  2. Kip Hansen says:

    I don’t think the Reproducibility (or, as some have it, the Irreproducibility) Crisis has anything to do with machine learning, although some of the blame may lie with the use of computers that perform complicated statistical analysis without requiring any statistical understanding on the part of the researchers.

    Study design, registered and shared BEFORE any experimentation is done, is the first step. Statisticians should be involved in the study design to establish, before the fact, what statistical analysis should be done once data is available: what analysis is appropriate for the kind of data and its method of collection.

  3. Shaun says:

    Sounds as though she is describing automated data peeking.
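“Data peeking” can indeed manufacture significance. A minimal sketch of my own (not from the article): every “experiment” below is a fair coin, yet checking for p < 0.05 repeatedly as data accumulates, and stopping at the first “significant” result, rejects far more often than the nominal 5%.

```python
import math
import random

random.seed(0)

def two_sided_p(heads, n):
    """Exact two-sided binomial p-value against a fair coin."""
    k = max(heads, n - heads)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def peeking_run(max_n=100, peek_every=5):
    """Flip a fair coin, testing for p < 0.05 every `peek_every` flips."""
    heads = 0
    for n in range(1, max_n + 1):
        heads += random.random() < 0.5
        if n % peek_every == 0 and two_sided_p(heads, n) < 0.05:
            return True  # stop early and declare "significance"
    return False

trials = 1000
false_positives = sum(peeking_run() for _ in range(trials))
rate = false_positives / trials
print(f"nominal error rate: 5%; actual with peeking: {rate:.1%}")
```

Each individual test is valid; it is the repeated looking, with the option to stop on a good result, that inflates the false-positive rate, and an automated pipeline can do that looking thousands of times faster than a human.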

  4. dai davies says:

    AI has been going through hype cycles since the 1980s. We are in one now, with many $billions thrown at it. What’s missing in attempts to apply it to real-world problems is an understanding of complexity.
    Exponential increases like Moore’s law are relatively easy to comprehend, but we seem to lack an intuition for complexity, which grows exponentially with the number of components. I produced a graph the other day, for an article I’m writing, that shows the impact of increasing complexity simply by adding coin throws.

    “In Figure 1 the complexity increases with a doubling time of ten years. The different plots show the significance of the starting complexity, or initial number of coins (from 0 to 4). The “0” plot represents the exponential rise in raw computing power. Real-world problems greatly exceed the simple systems represented here.”
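The figure itself is not reproduced here, but the arithmetic behind the coin illustration is simple, assuming the natural reading of the comment: n coin throws have 2**n distinct outcome sequences, so every added coin doubles the state space. A minimal sketch:

```python
def n_states(coins: int) -> int:
    """Distinct outcome sequences of `coins` coin throws."""
    return 2 ** coins

# Starting complexities of 0 to 4 coins (the comment's five plots),
# each system then gaining four more coins:
for start in range(5):
    growth = [n_states(start + added) for added in range(5)]
    print(f"start with {start} coins: {growth}")
```

A system that starts with more components is not just offset upward; every further component still doubles its state space, which is why raw computing power growing exponentially in time does not, by itself, tame real-world complexity.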

  5. stpaulchuck says:

    Kip Hansen says:
    February 17, 2019 at 6:28 pm
    Too true, Kip. Even with my CS degree, I managed to shoot myself in the foot with a pre-analysis design error. I’ve spent the last year reading papers on ‘machine learning’ from all over the planet. I’m interested in the classic search for an accurate prediction of time series data.

    I thought I’d got the right architecture and input data sorted out, as I had a 98% correlation between the model’s output on the test data and the actual data. After a bit that started to worry me. Too good a result. I went back and looked at the INPUT data relationships instead of the architecture, biases and weights, and finally the light came on. Among other data I input the open, high, low and close, looking for a good output on any of them for ‘next day’.

    Silly me. Today’s close price is tomorrow’s open price (plus or minus any gap that might happen in a news affected data stream). Once I removed that linkage the score dropped to around 60%. duh. And that’s on some pretty well understood data. Think about more esoteric data where relationships are likely not understood at all. Too easy to fall into a logic trap and think you’ve found the Holy Grail of machine learning when in fact you’ve got a fake.
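The leakage described above is easy to reproduce. A hypothetical sketch on simulated prices (not the commenter’s data or model): when tomorrow’s open is essentially today’s close, “predicting” it scores almost perfectly without learning anything, while predicting something genuinely unknown, tomorrow’s price change, does not.

```python
import random

random.seed(1)

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Random-walk closing prices; each day's open is the prior close plus a tiny gap.
closes = [100.0]
for _ in range(499):
    closes.append(closes[-1] + random.gauss(0, 1))
opens = [closes[i - 1] + random.gauss(0, 0.05) for i in range(1, len(closes))]

# Leaky target: tomorrow's open "predicted" from today's close.
leaky = corr(closes[:-1], opens)

# Honest target: tomorrow's price change predicted from today's change.
changes = [closes[i + 1] - closes[i] for i in range(len(closes) - 1)]
honest = corr(changes[:-1], changes[1:])

print(f"leaky score:  {leaky:+.3f}")   # near-perfect, like the 98% above
print(f"honest score: {honest:+.3f}")  # near zero on a pure random walk
```

The near-perfect score comes entirely from the close-to-open linkage in the inputs, not from any predictive skill, which is exactly the trap the comment describes.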

  6. Phoenix44 says:

    Machine learning is not the problem; the problem is the well-known issues of confirmation bias, publication bias, poor statistics, data trawls, misuse of P values, the pressure to publish, and so on.

    The basic problem is that scientists are humans. The scientific method is designed to minimise the effects being human has on the search for “truth”. The BBC doesn’t want to hear that, though, as it would cast doubt on its beliefs in all sorts of areas.

  7. Phoenix44 says:

    And of course the fact that 85% of proper, RCT research in something like biomedicine is somewhere between useless and fake has no implications for climate change! That science is settled and far less than 5% of that work is useless or wrong. And even though there is a clear Reproducibility Crisis across all science, you are still a Science Denier if you question science.

  8. oldbrew says:

    the software is identifying patterns that exist only in that data set and not the real world

    Time to examine the software more closely?

  9. ivan says:

    All this shows is that the mainframe computer engineers knew what they were talking about back in the 50s when they said GIGO (Garbage In, Garbage Out). That statement is truer today than it was then.

    Until such time as an Artificial Intelligence actually is produced, or emerges, people are fooling themselves using that term. A computer will only do what the program dictates; the only ‘intelligence’ is that of the programmer, and even that seems doubtful, because too much programming uses ‘borrowed’ routines and relies on ‘getting it out of the door and we’ll fix the bugs later’ rather than supplying software that is as bug-free as possible.

    There is another stumbling block with this – the actual definition of what the computer has to do. This is the human interaction, which is subject to all sorts of bias and preconceived ideas – an excellent example being the climate change models. Until that human bias can be eliminated, GIGO still rules.