Doug Proctor: The rough and smooth of data as seen from afar

Posted: May 27, 2013 by tchannon in atmosphere, climate, Measurement, Natural Variation, volcanos, weather

Doug writes that he is not a computer man. Well, Doug, a computer is just a robot, a tool following our instructions exactly. To err is human, and a jolly good idea too, because Artificial Intelligence is and has been a dead end, damagingly so. The reason is simple enough: a computer lacks what Doug uses, his eyes and brain; he goes places, does and experiences things, reacts to changes. It is actually the noise, the randomness, which breaks the precise binary way of always being identical and turns it into variation and difference; without that there is no discovery.

The genetics of reproduction is doubly random: part tried and tested, part variation. Doubly, because a second stage kicks in as well, which I won’t go into here.

Clones are loved on battlefields: find the weakness and, hey look, they are predictable, all the same.

At the same time, in another sense, humans are all the same.

Following the same path, like sheep under guidance, is what I am trying to avoid.

Doug writes:

The smoothing in the data seems to get asked about only after the compilation is questioned: the spaghetti graphs that arrived after Mann was questioned, remember, showed that what we thought of (were shown) as a simple trend was, in fact, a mess with huge variability that disappeared after statistical fiddling (or appeared to disappear, for promotional purposes). I don’t know any more about the ice core data, or the CO2 data.

Recently I saw the raw data from Mauna Loa, the CO2 as actually measured. The article was on how CO2 from nearby eruptions was removed. Wow! The raw data has huge variations OUTSIDE of the volcanic emanations. There are daily and seasonal changes that have been removed. Considering what I saw of obvious multi-year variations in local flora (due, largely, to rainfall changes), I wondered if there had been other “adjustments”, including that of plankton, that weren’t discussed. The difference between what was collected and what was shown was staggering, and it makes me wonder how originally “clean” the ice core data is.

CO2 measurements may be cheap. Oxygen isotope measurements, I would think, are not cheap. But computational infill is cheap.

I’m not a computer man. I can barely put music from my laptop on my iPod (okay, I admit it, my son does it for me. I’d still be using a Walkman if I had a tape machine.) So understanding, let alone downloading and working with, digital data from the icecaps (including Lonnie’s alpine data just released from her Peru work) is beyond me. And most others. What I can do is see holes, and I can see where assumptions exist that may not be supportable. I can see disconnects and circular reasoning, all of which exist within the climate change establishment.

The fewer the data points, the larger the effect smoothing will have. Mannian algorithms have the ability to take patterns found in one or two data sets and impose them on a grouped 18 without setting off alarm bells amongst the collaborative illuminati. Why simple grouping and eyeballing is considered unable to see patterns in such certain science is outside of my ken, but so is the iPod, so what do I know?

So, to answer [prior comments on Suggestions]: I have no idea what smoothing as per Willis will do. But here is a thought.

Willis’ work concerned data taken on a daily basis. There, the smoothing resulted in a knock-back of 18 months, or 540X the input time period. If we allow this sort of knockback, and we know that the time period of glacial ice is 70 years (until sufficiently impermeable, snow-ice “breathes” with changes in atmospheric pressure, mixing atmospheric components), then it is possible that a knockback of 800 years could occur with a time-proportional data collection as per the temperature set.

Perhaps the question to put to statisticians is this: within any data set with a given variation, to what extent will smoothing of various types push back the point at which the data began to diverge from its prior pattern?

Doug Proctor, from Talkshop Suggestions where he is also responding to prior comments.

Tim responding: I do know a bit about filtering and digital data, Doug. What you are seeing is a mess, with few of those involved knowing what they are doing. Things are so dire generally in science that there is little point in speaking out.

Many things are counter intuitive.

I’ve yet to see it pointed out that 1D (one-dimensional) smoothing/filtering is routinely done without comment, and that this is wrong: the data are multi-dimensional. Simply taking ten points for today and averaging them is what they call a smooth, but it is purely spatial; at least two dimensions should be handled, including time, and yes, that means taking yesterday and tomorrow into account.
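
As a minimal sketch of that distinction, with invented numbers and no connection to any real station network: averaging ten stations for a single day is a purely spatial smooth, while a space-and-time smooth also pulls in yesterday and tomorrow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10 "stations" (space) observed over 30 "days" (time).
temps = 15 + rng.normal(0, 2, size=(30, 10))   # shape: (days, stations)

day = 15

# 1-D, spatial-only smoothing: average the ten stations for one day.
spatial_only = temps[day].mean()

# Space-and-time smoothing: the same ten stations, but averaged over
# yesterday, today and tomorrow as well.
spatio_temporal = temps[day - 1:day + 2].mean()

print(f"spatial-only average, day {day}: {spatial_only:.2f}")
print(f"space+time average, days {day - 1} to {day + 1}: {spatio_temporal:.2f}")
```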

Something statistics often does poorly, or not at all, is take time and sequence into account. Statistically, the lengths of the sticks in a bundle of wood have no relationship to their order; there is no sequence. Order does matter when time is involved, and it must be handled appropriately. Few would compare prices between competitors when the numbers are taken from different years.

Perhaps this is subtle: it is actually a reduction which trades one domain for another. Averaging 0 and 1 gives 0.5, a single-number answer with finer precision; two coarse values are traded for one finer one. Similarly there is an effect in the time domain. Done right, the analogue shape is preserved, giving a description at a more economical time scale. Certain loud people elsewhere on the web do not understand this and have laughed at what they see as stupid people, and yet it is not their field.
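
Here is a tiny numpy sketch of that trade, with invented numbers (a gently rising value read off a coarse, noisy instrument): two coarse samples go in, one value comes out at half the time resolution but with finer amplitude resolution.

```python
import numpy as np

rng = np.random.default_rng(1)

# A slowly rising true value, read with coarse one-unit resolution
# plus a little measurement noise (the noise is what lets averaging help).
true = np.linspace(0.0, 5.0, 1000)
measured = np.round(true + rng.normal(0.0, 0.5, true.size))

# Trade two coarse samples for one finer one: average adjacent pairs,
# halving the time resolution but improving the value resolution.
pair_avg = measured.reshape(-1, 2).mean(axis=1)
true_pair = true.reshape(-1, 2).mean(axis=1)

def rms(err):
    return float(np.sqrt(np.mean(err ** 2)))

print("RMS error, raw coarse readings :", round(rms(measured - true), 3))
print("RMS error, after pair averaging:", round(rms(pair_avg - true_pair), 3))
```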

It just occurred to me that I can disturb those who object to filtering, which of course they would never do themselves. Every time they produce one of their favourites, a straight-line “trend”, they are filtering all the way down to exactly two data points! That is what it is.

Give it a repeating wave and there goes the roller coaster, the straight-line trend bucking up and down.
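
To put numbers on that roller coaster, a small sketch with an invented sine wave: a least-squares straight line fitted to successive windows reduces each window to a slope and an intercept, and the slopes swing up and down with the phase of the wave even though nothing in the series is trending.

```python
import numpy as np

# A pure repeating wave: no long-term trend at all.
t = np.arange(120)
wave = np.sin(2 * np.pi * t / 40)          # 40-sample period

# Fit a straight-line "trend" to successive 20-sample windows.
for start in range(0, 120, 20):
    slope, _ = np.polyfit(t[start:start + 20], wave[start:start + 20], 1)
    print(f"window {start:3d}-{start + 19:3d}: trend slope = {slope:+.3f}")
```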

There is no single or easy answer, only compromises.

I suspect the problem being discussed is mostly cock-ups from mixing apples and oranges: items which have been treated dissimilarly.

What is being taken as raw data is actually already altered or poorly made in the first place.

One of the complaints being discussed is an assumed loss of time resolution: a hard event at a particular moment which seems to have been spread out. If this were correctly localised back to a hard event, at a different scaling, it would give exactly the right time.

If accurate low-pass filtering is done on an edge and then suitable reverse filtering is done, you end up back with an edge. Two edges which have been identically smoothed in time will still correlate at exactly the right lag.
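
A minimal scipy sketch of the second sentence, using made-up step data rather than anything from an ice core: identical zero-phase (forward-backward) low-pass filtering blurs two edges without moving them, so they still line up at zero lag.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Two step "edges" at the same sample position, different amplitudes.
n = 400
edge1 = np.zeros(n)
edge1[200:] = 1.0
edge2 = np.zeros(n)
edge2[200:] = 2.0

# Identical zero-phase (forward-backward) low-pass filtering of both.
b, a = butter(4, 0.05)
s1 = filtfilt(b, a, edge1)
s2 = filtfilt(b, a, edge2)

# The smoothed edges are blurred but not shifted: the half-amplitude
# crossing stays at sample ~200 for both...
print("edge 1 half-amplitude crossing:", int(np.argmin(np.abs(s1 - 0.5))))
print("edge 2 half-amplitude crossing:", int(np.argmin(np.abs(s2 - 1.0))))

# ...and the cross-correlation of the two edge responses peaks at lag 0.
d1, d2 = np.diff(s1), np.diff(s2)
xcorr = np.correlate(d1, d2, mode="full")
print("cross-correlation peak lag:", int(np.argmax(xcorr)) - (len(d1) - 1))
```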

And the problem is?

Request: please keep off Willis as a subject, it could have been one of many others and he is not popular with some people so it’s best left alone.

An excellent subject for discussion is the Mauna Loa data, and I write that knowing something surprising. I’m busy creating a large article which, all being well, will appear soon and which touches on historic CO2 data in a new and, I hope, fruitful way.


Tim (not Tallbloke)

Comments
  1. michael hart says:

    Looking forward to it.

    I’m also looking forward to the second coming of the orbiting carbon observatory, if it doesn’t blow up on the launch pad. I have a hunch that it will put the cat amongst the pigeons.

  2. Doug Proctor says:

    Okaaaaaaay ….

    Willis, anyone: it is not who they are, it is what they do. He appeared to explain a time divergence in global temperature (anomalies) attributable to volcanic eruptions and the time of those eruptions, such that the temp change (drop) preceded the eruptions. He did it in a way that perhaps the less well instructed could understand.

    If you say he was wrong in method or result, I’m not reading you. Beyond his technical contribution here, other aspects he might have are irrelevant.

    While I appreciate this response, I am somewhat unclear as to what you are saying, other than that it is standard for us to mistake the adjusted, corrected, smoothed and cleaned published trend for the data it purports to reflect. This is, I agree, a huge leap of understanding that the “experts” like to dismiss or diminish, but it is in fact why two such experts can come to diametrical points of view with the same data. (As a geologist, I am more than familiar with this situation. It is the spaces between the notes that determine the music, and it is the filled spaces between the datapoints that determine the geological map: there is a lot of leeway for modelling in a world in which “holes” dominate.)

    As for data comparisons:

    Whether temperature, CO2 content or O18 values, the limits of manipulation are determined by the number of datapoints in the series and the range of values. Temperature values have been determined in recent years by 1500 stations, each coalescing from daily averages to a monthly average, and then the 1500 collapsing to one monthly global point, as I understand it. A hundred-and-thirty-year period, when graphed, is then 1560 points, each of which has some internal error. This is what the world sees and the world interprets.
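
    As a scaled-down toy of that collapsing, with invented numbers and none of the gridding, area weighting or homogenisation the real products use, just to show how the reduction to 1560 points comes about:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Toy pipeline: station daily values -> station monthly means
    # -> a single global monthly value.
    n_stations = 15            # the real network is described as ~1500
    n_years = 130
    n_months = n_years * 12    # 130 years of months = 1560 points

    daily = rng.normal(0.0, 1.0, size=(n_stations, n_months, 30))  # ~30 days/month

    station_monthly = daily.mean(axis=2)            # days -> one value per month
    global_monthly = station_monthly.mean(axis=0)   # stations -> one global value

    print("global monthly series length:", global_monthly.size)    # 1560
    ```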

    Antarctic and Greenland and Peruvian ice goes back thousands of years, but we are most interested in the previous 1000 years, taking us back before the MWP. Ice is created annually, but not all is preserved. While multiple samplings of specific ice bands is possible, no more than 1000 datapoints can exist for the last 1000 years. And, as noted, the temperature and CO2 value of each point has an uncertainty somewhere around +/-35 years due to ice “breathing”, leaving aside problems of time-layer interpretation. It is thus, by intrinsic reasoning, clear that algorithms which produce time disconnects in the temperature data with respect to the volcanic forcing events could do something similar to the temp-CO2 data in principle. The two should be shifted the same … if the data density is the same. If 1-month data (I was incorrect in my 1-day-data thought) can lead to an 18-month offset, a 1-year dataset could have an 18-year offset, but since the datapoints are themselves 70-year smoothed, would the net result be more than 18 years?

    This is the biggest realization I have come to in the climate argument: we are not dealing with data, but with the result of calculations. (This is especially pertinent for sea-level measurements by satellite, as opposed to sea-level measurements by tidal gauges: the first is a calculation, the second a measurement. Computation vs observation, the first being the shadow on the wall, the second being the creature standing at the cave entrance.)

    And so, does the CO2 dataset have the same density and quality as the O18? The smaller the dataset, the greater the offset would be, I expect. But we should note that O18, being bound in the snow/ice, would not have the time smoothing of CO2, which continued to exist as a gas during the compaction process. CO2 would suffer most in its estimation of time as determined by its source layer position.

    It comes back to a statistician looking at the raw data and determining its “limits of knowledge”, i.e. how much it can be cleaned and smoothed before the changes are greater than its probable error and internal variability.

    Which I admit I cannot do.

  3. Long ago in a place far, far away, computers were said to do what we told them to, much faster and with fewer errors than we could.

    In those days I was a programmer, or a supervisor of programmers, and I insisted that our task was to work out how to get the computer to do what we wanted it to do in contrast with what we had told it to do.

  4. Richard111 says:

    It is my belief that all this intense concentration on temperature data is purely a distraction.
    I recently learnt that 99.9% of the atmosphere is unable to cool except by radiation from the so called ‘greenhouse gases’. The atmosphere cannot cool very efficiently by conduction with the surface as ‘cool’ prefers to move down. Heat transfer up the air column by convection is only effective if the upper regions are continuously losing energy to space. The only effective method is by radiation.
    The most effective ‘heat trapping’ gases in the atmosphere are nitrogen, oxygen and argon.
    If it wasn’t for the ‘anti-greenhouse gases’ carbon dioxide and water vapour this planet’s early atmosphere would never have cooled sufficiently for liquid water to form on the surface.

  5. Doug, Tim, I am no expert in statistics but I did a few courses both in my engineering and business studies and used statistics regularly in the past to analyse measured data (including stock exchange data and the outputs of programs such as Metastock).
    BTW I do follow EM Smith at http://chiefio.wordpress.com/ and like some of his stock and commodity charts. He has analysed some temperature data in a different way to eliminate fill-ins and adjustments (see http://chiefio.wordpress.com/category/dtdt/). A thought just came to me: it should be possible to analyse climate data with the indicators in a program such as Metastock. There are indicators such as waves, rate of change, high-low ratios, moneyflow (change in price times volume) and one I have used a few times for trading, CBL (count back line) buy & sell, which highlights a change in market perceptions of a stock. The indicators have a value or multiple values (which can be anything) related to a time.
    With regard to statistics, I find the Penguin book “Facts from Figures” by M J Moroney a good read and refresher so my eyes do not glaze over when I see an article such as by Willis on volcanoes.
    Now a couple of points that I see. A linear regression gives an estimated line (which is the equivalent of a line between two points) through a scatter of points relating two variables (e.g. measured temperature and time). Such an estimated line should always come with a regression coefficient which should be tested for significance. If the regression coefficient is not stated, the person(s) presenting the information are either incompetent or have something to hide. There are various tests to determine significance or goodness of fit. While some may like to argue, statistics books indicate a correlation coefficient less than 0.5 has no significance and the regression equation has no use. Instead of a linear equation it is possible to determine a fit to a power equation or a wave equation (sines and cosines), and one can include a time lag. A regression coefficient can be determined between actual data and estimated data from some curve, and the significance can be determined with a chi-squared test.
    Just a further point. Some smoothing can be obtained by discarding supposed outliers. This is a doubtful practice unless one knows an exact reason for a divergence, and it becomes very doubtful if there is little data. A running average in a time series always causes a time delay (sketched after this comment).
    I certainly agree that one should always be looking at raw data.
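
    On the running-average point above, a minimal numpy illustration with invented data: a trailing average pushes an abrupt jump later by about half the window, while a centred average does not, at the price of needing values from the “future”.

    ```python
    import numpy as np

    # A series with a single abrupt jump at sample 50 (invented data).
    n = 100
    x = np.zeros(n)
    x[50:] = 1.0

    window = 11
    kernel = np.ones(window) / window

    # Trailing (causal) running average: only past values are used, so the
    # jump shows up about (window - 1) / 2 = 5 samples late.
    trailing = np.convolve(x, kernel)[:n]

    # Centred running average: no delay, but it needs "future" values.
    centred = np.convolve(x, kernel, mode="same")

    print("true jump at sample:          ", int(np.argmax(x >= 0.5)))
    print("trailing average reaches 0.5: ", int(np.argmax(trailing >= 0.5)))
    print("centred average reaches 0.5:  ", int(np.argmax(centred >= 0.5)))
    ```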

  6. oldbrew says:

    According to Jeffrey Glassmann Ph.D. ‘The origin of the CO2 at Mauna Loa is dominated by Eastern Equatorial Pacific outgassing’.

    He also notes that ‘The CO2 growth rate at Mauna Loa is unprecedented because no comparable measurements exist.’

    http://rocketscientistsjournal.com/2007/06/on_why_co2_is_known_not_to_hav.html

  7. Paul Vaughan says:

    “[…] statistic books indicate a correlation coefficient less than 0.5 has no significance and the regression equation has no use.”

    Those are some bad stats books. Don’t use them.

  8. Paul Vaughan says:

    …or use them, but as firewood!

  9. ftp://ftp.cmdl.noaa.gov/ccg/co2/in-situ
    Hourly data with all the bad bits: many errors (no valid data) and much noise, but it shows the normal curve.

  10. Ok Paul V, I was typing without checking the texts. It is necessary to determine the significance. A low regression coefficient can be significant with a large number of data points and of course one can have a negative correlation.
    Moroney gives the standard error of a correlation coefficient r (if N is large, minimum 100) as (1 - r^2)/sqrt(N). So with 100 data points a correlation coefficient of 0.5 would be significant at the 1% level (the arithmetic is worked out below).
    Looking at the last 16 years of temperature data, no linear regression has any significance.
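
    Here is that arithmetic as a tiny script using the formula quoted from Moroney (the 1% threshold of roughly 2.6 standard errors is the usual two-sided normal value, an assumption added for illustration):

    ```python
    import math

    def se_of_r(r: float, n: int) -> float:
        """Moroney's large-N standard error of a correlation coefficient."""
        return (1.0 - r * r) / math.sqrt(n)

    r, n = 0.5, 100
    se = se_of_r(r, n)
    print(f"SE(r) = {se:.3f}")
    print(f"r / SE(r) = {r / se:.1f} standard errors from zero")
    # ~6.7 standard errors, comfortably beyond the ~2.6 needed at the 1% level.
    ```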

  11. tchannon says:

    I should have warned and thanked Doug before using his words as part of something larger.

    Belatedly thank you Doug.

  12. tchannon says:

    Doug,
    The world thermometer data is like what you are used to in geology: a few looks, interpreted as best you can over a massive area, with a lot of gaps where there ought to be more samples. This is not quite the same as regular gaps in a complete grid.

    In this case there is sampling by physical location on a sphere, but also sampling in time. Here the sampling points on the sphere also move over time, as well as there being holes in time.

    That is only part of the mess. Calibration, instruments, sites, standards etc. change too, mostly not written down, lost, withheld or never bothered with.

    In contrast, MLO CO2 is supposed to be for a single physical location, where the assumption is that the gas is mobile and will come to the measurement site.

    Provided sampling is done correctly, the result is exact within known sampling laws. Unfortunately science is not very good at getting this right.

  13. Given that the so-called “climate” data (CO2 level, temperature, RH, etc.) are not presented as raw data together with all the calculations, corrections, deletions, infilling of holes and so on, do we have any information that is not contaminated with original-observer and subsequent-user bias? Since we don’t have the information to make a rational judgement on that question, we are asked to “trust us”, with “us” meaning the members of the community of self-identified “climate scientists”. Clearly, we cannot and should not trust them. We should especially not trust them since their prescription for a cure for the alarm they say they see in their much over-cooked data is always the same: global governance, heavy-handed top-down command and control of energy sources and use, population control (read: population elimination), and massive public funds given to them to study the “problem” further.

    As a scientist and engineer, I understand that a solution to a problem is determined by the problem itself and the context in which the problem exists. There is no one solution that fits all problems. There is always a very specific problem-context-solution relationship. Since the solution is the same for global warming as for global cooling and climate disruption (or the term of the day), we have clear evidence the solution is itself bogus and is derived from a prior purpose having no connection to the presumed problem to be solved. So once again, we find that it is a serious error to accept their demand that we trust them on their word alone.

    Thus, based upon lack of transparency and the insistence of a single “solution” we have multiple independently sufficient reasons not to trust ANYTHING they say. That lack of trust has nothing to do with the complexities of statistics and the higher order of difficulty of understanding what the statistical calculations mean within the context they were done. Unless and until we can get past these issues, any discussion of the proper use of statistics is nothing but a smoke screen to hide the fact the whole discussion is irrelevant and has nothing to do with anything that actually makes climate what it is let alone how to control it.

    All we are doing is pretending our largely unfounded opinions/wishes/fears have global meaning and global social impact while they are without any rational connection to anything real. We simply feel they are real. In simple language they are all nothing but BS and not very good BS at that.

  14. tchannon says:

    oldbrew, “According to Jeffrey Glassmann Ph.D. ‘The origin of the CO2 at Mauna Loa is dominated by Eastern Equatorial Pacific outgassing’.”

    That is very likely correct. Close to Hawaii the great ocean conveyor has a major surfacing point for cold, old water. I have just done a rapid scan of the item and it looks about right.
    Deep ocean overturn is very poorly known or understood.

    thefordprefect,
    A tale has yet to come out. It will be a surprise.

  15. oldbrew says:

    TC: Thanks for that. Mauna Loa is a very long way from any significant industrial CO2 concentrations, and CO2 is not as ‘well mixed’ as the IPCC used to claim, so what exactly is ML measuring?

    As Glassmann says ‘the concentration of CO2 depends on where it is measured.’

    In a further comment (item 14 in the link above) he points out:

    ‘Also, small changes in ocean or atmospheric currents could have an additional profound effect on the CO2 measured at Mauna Loa. The center of the CO2 plume may now be moving toward Hawaii, causing an increase in CO2 concentration there. This could also account for the seasonal effects evident in the Keeling curve.’

  16. Doug Proctor says:

    Thanks! Blogs are attempts at dialogues using sequential monologues; the process of progressing further down the path of understanding is awkward and episodic, but it is doable.

    A final thought before I look up the raw data of the ice core:

    If glacial ice really has a 70-year “breathing” issue, then even the original raw data can be considered to have had a running 70-year smoothing algorithm applied to it. Thinking about it further, a 12,000-year ice core column will not have 12,000 datapoints to it, so the lessons of the temperature series of the instrumental age are applicable. Smoothing will push back the start period. IF the same adjustments are made to both temp and CO2 values, the pushback will be the same – IF the data densities are the same. We will have to determine whether these two “if” situations are true.
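
    A toy numpy sketch of that “already smoothed” idea, treating the 70-year breathing as nothing more than a centred running mean applied to an invented annual series (real firn diffusion is far more complicated): the smoothing makes an abrupt event appear to start decades before it actually did, which is the push-back of the start period.

    ```python
    import numpy as np

    # Invented annual "proxy": flat, then an abrupt rise at year 1000.
    years = np.arange(2000)
    signal = np.where(years < 1000, 0.0, 1.0)

    # Treat the 70-year firn "breathing" as a centred 70-year running mean.
    window = 70
    smoothed = np.convolve(signal, np.ones(window) / window, mode="same")

    onset_true = int(years[np.argmax(signal > 0.05)])
    onset_apparent = int(years[np.argmax(smoothed > 0.05)])
    print("true onset year:    ", onset_true)       # 1000
    print("apparent onset year:", onset_apparent)   # roughly 30 years earlier
    ```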

    It is worrisome for Representational Truth that simple analysis of excellent data can have cooling precede the events (volcanic emanations) that cause the cooling. Probably more worrisome that the researchers either didn’t notice it or didn’t address it.

  17. tchannon says:

    No competent researcher will be misled, we hope.

    A core point to keep in mind is comparing apples to apples: both datasets must be treated identically in all respects. If both are filtered, the filtering must be the same.

    Violation of this is a key part of the Hockey Stick fraud. I spotted an attempt by one of the well known names to try and intellectualise this away, but all he did was alert me to his knowing it was wrong. Do that only after legal advice; it risks things.
    A snippet in relation to that: I wondered what tail would be on the instrumental data, given what they could have known at that date. My best guess gave a hockey stick blade half the height.
    This is why data and code are needed, or at least a clear explanation.
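
    A small sketch of the apples-to-apples rule, with invented series rather than a reconstruction of any actual splice: smooth one record and not the other and their variability no longer matches, so the raw one looks anomalously lively beside the smoothed one; give both the identical treatment and they are comparable again.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    # Two invented records of the same underlying wiggle, each with its own noise.
    t = np.arange(500)
    underlying = np.sin(2 * np.pi * t / 125)
    proxy = underlying + rng.normal(0.0, 0.5, t.size)
    instrumental = underlying + rng.normal(0.0, 0.5, t.size)

    def smooth(x, w=31):
        """Centred running mean: the identical treatment for any series passed in."""
        return np.convolve(x, np.ones(w) / w, mode="same")

    # Mismatched treatment: one record smoothed, the other left raw.
    print("std, smoothed proxy       :", round(float(smooth(proxy).std()), 2))
    print("std, raw instrumental     :", round(float(instrumental.std()), 2))

    # Identical treatment: both smoothed the same way, comparable again.
    print("std, smoothed instrumental:", round(float(smooth(instrumental).std()), 2))
    ```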

  18. michael hart says:

    Regarding the problem of interpolating between data points in time series, Dr. Roy Spencer, until recently, posted his UAH temperatures with a cubic spline fit drawn in “for entertainment purposes only”.

    Although he didn’t say so, I made the working assumption that this was a none-too-private joke about the perils of such fits, aimed at an unmentioned scientist whose name begins with K- and ends in -evinTrenberth.