Part 1: Scientific Method

Hello. I hope everyone can hear us now. WebEx says that we have audio, so hopefully it does. My name is Jason Church, I am with the National Center for Preservation Technology and Training. We are a research office of the National Park Service, and I am here today to introduce our first in what will be a series of conservation science webinars. The first is Scientific Method and Experiment Design for Preservation and Conservation Research. For this I would like to introduce Dr. Chandra Reedy, conservation scientist and professor at the University of Delaware, and she’ll be presenting this webinar as a lecture today. It will be in 3 parts. At the end of each part we’ll take time to have about 10 minutes worth of questions. We do not have audio for the participants, but you can chat in your questions and Dr. Reedy will pull up those questions. Also we have handouts if you want to take notes on the PowerPoints and also experimental exercises that Dr. Reedy will be doing later. Those can be found on the NCPTT website at https://www.ncptt.nps.gov/blog/scientific-method-and-experimental-design/ or you can go to our home page and there’s a link right there on it. Then you can download those PDFs. That is also the link where we will place the recorded lecture for today. Without further ado, I’d like to turn it over to Dr. Reedy.

Thank you Jason, and hello everybody. Welcome to this webinar. First I should introduce myself a bit and explain why I’m so interested in scientific methodology and experimental design issues. I’m Chandra Reedy, and I originally did my PhD work at UCLA in a program that was interdepartmental and required working in at least 3 different departments. I ended up being one-third earth and space sciences, one-third anthropology, and one-third art history. My dissertation research was actually mostly carried out in the conservation center of the Los Angeles County Museum of Art, where my primary advisor was located and where I ended up working for the first few years after I did my PhD. I’ve been working in a very multi-disciplinary way that incorporates natural science, social science, and humanities research along with preservation and conservation science. This really requires that you think carefully and be clear about your experimental design.

I’ve been at the University of Delaware since 1989 where I’ve taught in the graduate program in conservation, the PhD program in art conservation research, the museum studies program, and the graduate program in historic preservation.

I’m currently in the Center for Historic Architecture and Design where I also serve as director for its Laboratory for Analysis of Cultural Materials. I still incorporate geology, anthropology, technical art history, heritage conservation science, and historic preservation research into my research program which means I’m constantly having to think carefully about my research designs and experimental design procedures.

Early in my career I co-authored books on experimental design and statistical analysis in conservation research. This required first reviewing 320 research papers to examine and try to take apart the experimental design structures of each of those. I found that very helpful and I’ve tried to use that information over the years to help students on their experimental designs and to try to polish up my own research designs.

I’ve continued to pay careful attention to research design while I served for 7 years as editor-in-chief of the Journal of the American Institute for Conservation, and for the last 6 years as editor-in-chief of Studies in Conservation for the International Institute for Conservation of Historic and Artistic Works. That requires reading through a lot of papers, many of them outside my area of expertise, but trying to look carefully at experimental design issues.

I’ve taught a workshop on this topic many times at many venues, and now this webinar can be made permanently available so anyone can access it at any time and from any location thanks to NCPTT.

I’d like to first go over some of the main goals of this webinar. First we’ll be reviewing what I think are the most crucial aspects to consider in experimental design. We won’t be talking about everything having to do with this subject; it’s a very, very large subject. We’ll be discussing experimental design and not statistical analysis. Although, as we’ll see, a valid statistical analysis does require a proper experimental design, so they certainly are related. Third, many of you perhaps, if not all, will have heard some or even all of these ideas before, maybe multiple times. Certainly, I hope, in science course work in school. I myself find it helpful to review and rethink about these issues from time to time to renew and refresh my ability to organize research. I think review of these concepts is always helpful. At least it is for me.

Fourth, today we want to think about how these experimental design concepts can be used in daily work for improvements in clarity of our thought process, improvement of design of testing procedures, and perhaps lead to outcomes in which you have more confidence. In preservation and conservation, we constantly have to make decisions that can have a major impact on culturally significant objects and important irreplaceable collections. We have to decide what preservation measures and conditions to choose, what treatments to choose, how to carry them out, and whether one choice is really better than other possible choices. The purpose of good research and experimental design is simply to help us make the best possible choices and to have confidence in those choices. Finally, a better understanding of what makes for good experimental and research design can improve our ability to evaluate the professional literature as well as improve our ability to contribute to that literature.

All of the PowerPoints that you see today are available to download on the NCPTT website either to jog your memory later or to print out and take notes as we go along. References for all the papers that I mention today are also given there as well as some practice exercises. The webinar is structured into 4 parts.

Part 1 will review the scientific method in general. We’ll discuss first a seminal paper by John Platt about what he thinks accounts for progress in scientific research. Then we’ll look at what Enrico Fermi and Louis Pasteur have had to say about the process in thinking in science and how that’s related to research design. We’ll review steps in the research cycle and then focus down on the step of formulating hypotheses, looking especially at the advantages of trying to have multiple hypotheses.

Part 2, after beginning with a discussion of the advantages of constructing formal experimental designs, will focus on the basic concepts of experimental design including object protocols, measurement protocols, and treatment protocols. After part 1 I’ll take a few minutes of questions if there are any, and also again after part 2.

Parts 3 and 4 will be done together with questions at the end. Part 3 will look at the more common type of experiment that we see in preservation and conservation research; the multiple object design. We’ll talk about using a design check sheet and research design flow charts or sketches or diagrams, and then we’ll look at some case studies of what I think are really good multiple group designs. Then part 4 will move into single object studies and a few other design variations. We’ll look at some case studies of how statistically valid studies can be conducted when we only have a single object available. Then we’ll conclude with a discussion of screening experiments and treatment trials and how these might be useful in conservation.

We’ll get right into the meat of things now with a brief overview of the scientific method as an approach to research. There was a classic paper by John Platt that was published in 1964 in the journal ‘Science’, and it received a lot of attention at the time it came out and for years afterwards. It’s still a classic and it’s definitely worth reading, although it is somewhat dated, especially, you’ll notice, in the sort of gendered language that’s used.

The main point is that he noted that some fields of science seemed to advance at a much faster pace than others and he wanted to try to understand why. He proposed that the main difference is really an intellectual one; that rapidly progressing fields systematically use an approach that he called strong inference. He says that includes steps used by all fields of science, but in some fields these are systematically and regularly applied as the standard accepted approach. This method includes first of all devising alternative hypotheses, then devising a crucial experiment or a series of them with alternative possible outcomes, each of which will as much as possible exclude one or more of the hypotheses. Then carrying out the experiment so as to get a clean result, and recycling the procedure to develop sub-hypotheses or sequential ones to try to refine the possibilities that remain, and doing all of this as part of long-term research programs.

He notes that while this is understood to be the way science works, actually in practice we often forget to clearly follow all of these steps. Instead what happens is when the paper’s written at the end of a project, it’s written after the fact as if this process had been followed, or he says we find ourselves doing what’s more like busywork; doing analysis without appropriately setting up good hypotheses. Maybe being too method oriented rather than problem oriented. In contrast, in some fields he felt that regular and explicit use of alternative hypotheses with clear exclusions is the norm and that that leads to fast and clear progress. He discusses in detail why just affirming a hypothesis is insufficient, because there might be a lot of factors or explanations involved that we haven’t really thought about to test for; that science really advances by disproving until all of the exclusions have been exhausted. Then the possibilities that remain are really more strongly supported.

2 examples he cites are high-energy physics and molecular biology. He said that any time you walked through Francis Crick’s laboratory, for example, you would see blackboards where the hot new results from his or some other lab were written, and then below that would be 2 or 3 alternative explanations that people had written down. Then others would fill in with a series of suggested experiments that could reduce the number of possibilities.

Platt talked about how Enrico Fermi, the physicist of the early to mid twentieth century who developed the first reactor, contributed to quantum theory, nuclear and particle physics, statistical mechanics, and was a Nobel Prize winning scientist renowned for both theory and experiment, so clearly a successful scientist … That he explicitly and regularly practiced this method of spending time every day thinking about and writing down his thoughts including writing down possible hypotheses and possible tests and using this as an expansion of a more typical notebook that you would use for laboratory work.

Platt called the strong inference method “the logical tree”. If you have multiple hypotheses or explanations for what you’ve observed and you design an experiment that should exclude one of those possibilities, then it’s like a tree; at the first fork of the road, the experiment sends you down one branch or closes off another. Then eventually this leads to forward progress rather than going around in circles or not going anywhere at all. We’re familiar with these sorts of logical trees because they’re used as identification keys for things like plants, trees, pollen, or here this example is for identification of parasites that can infect animals. And we use these in conservation, for example for pigment identification keys using the results of polarized light microscopy or simple mechanical tests. As you go through this tree, certain tests will allow you to exclude a possibility and eventually it takes you down to where you can pretty confidently identify what pigment you have.
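To make the logical tree idea a bit more concrete, here is a minimal sketch of how such an exclusion key could be written out in code; the tests and candidate names are hypothetical placeholders, not a real pigment key. The point is only that each test closes off a branch.

```python
# A hypothetical exclusion key written as a logical tree. The observations
# and candidate names are placeholders, not a real pigment identification key.
def identify_pigment(soluble_in_dilute_acid: bool,
                     isotropic_under_crossed_polars: bool,
                     particle_color: str) -> str:
    if soluble_in_dilute_acid:
        # This branch excludes all acid-stable candidates.
        return "candidate A (acid-soluble group)"
    if isotropic_under_crossed_polars:
        # Isotropy excludes the birefringent candidates.
        return "candidate B (isotropic group)"
    if particle_color == "deep blue":
        return "candidate C"
    return "unidentified - continue down the key with further tests"

print(identify_pigment(False, True, "pale blue"))   # -> candidate B (isotropic group)
```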

Another example of really thinking carefully through your experiments that Platt talks about is Louis Pasteur. This was a French chemist and microbiologist of the late 19th century and he made remarkable breakthroughs on the causes and prevention of diseases. He developed the first vaccines for rabies and anthrax, did experiments that supported the germ theory of disease, developed the pasteurization method of milk and wine to prevent disease, and he also made breakthroughs in the understanding of crystal structures in chemistry.

Platt noted that every few years Pasteur moved into studying a new problem where there were already a lot of experts who had devoted their lives or at least many years, and then he would solve problems that no one had been able to solve before. Which I’m sure was very annoying to some of those people. Clearly it wasn’t encyclopedic knowledge that was the secret, nor was it the amount of time spent doing the research. Platt said it was Pasteur’s systematic application of strong inference. He contrasts this approach with general surveys and observational studies that often don’t have clear alternative hypotheses or tests that could exclude a possibility, but which are designed really just to gather data. He says, “We call this science and say it’s important because it’s collecting information that might be useful later as another brick in the wall of science.” Platt said that most of those bricks just basically end up lying around in the brickyard if we’re just collecting them in the hopes that some day someone else can use them to answer a question of interest or address a hypothesis.

A good suggestion that comes out of following this method is the idea of keeping a notebook … Whether it’s physical or electronic or on the smart phone notes … That goes beyond the idea of just keeping a laboratory notebook for recording our actual experiments and data, but is used for constantly reflecting and writing down new ideas, new observations, hypotheses that might come from those, implications of those hypotheses, and therefore how you might design experiments to test them, and any other brainstorming ideas that come from sitting and spending at least a few minutes every day thinking about the problems that you’re interested in working on.

These are the basic parts of the scientific method that we’ll now discuss individually and a bit more in-depth. The observations, specifying a research question, constructing hypotheses, developing the implications or inferences of those hypotheses which lead to the design of experiments, explaining the rationale and the logic behind our hypotheses and our implications, conducting experiments, doing the analysis, publishing the results, and then building on them to create a long-term research program.

First is when you observe something intriguing, puzzling, or problematic, write it down. If a literature search doesn’t show that the answer’s already known, then this observation could be the basis for a full research project either now or later.

Next, based on that observation, specify or formulate the exact question of your study. You’re going to want to choose a research problem or question that’s clear and answerable within the period of time that you have available to you. That’s really important. If you have 1 year to carry out a research project, make sure that the research question, to the best of your ability, is something that could be answered in that period of time. You’ll have a very different type of research question if you’re going to conduct a 3-year project, and if you have only 3 months it will be different altogether.

There are different types and levels of research questions. Clinical questions might include something like: Which coatings in this group will pass or fail various tests for use on outdoor bronze sculptures? We’ll look at what phenomena are present, what processes are occurring, what results various treatments have. These questions may give you information that you need to conduct further research, and they’re always needed as a first stage in a new scientific research program. Other examples of clinical questions could be: Does a particular colorant fade over time? If so, how fast? What factors might affect the fading of this colorant? Is it light exposure, temperature, humidity, air pollution? Maybe, which of 3 proposed controlled environmental parameters most inhibits the fading of this colorant?

Scientific questions might be something more like: Why do some coatings hold up better in this particular environment over time? What do they have in common? With these types of questions, we can then predict future coatings performance better. We can gain knowledge about the underlying reasons for the observed behavior, not just observe the behavior. For example, if we try to answer the question: Why are some silver coatings permeable to hydrogen sulfide so the silver turns black, while others aren’t? We can then use that information to select appropriate coatings for future testing. Here we might be looking at things such as: Why are the observed phenomena present? Why do certain processes occur? Why do various treatments produce the results observed? It builds on the knowledge that we’ve gained from answering those clinical questions, but it allows you to go further; it allows you to predict the outcomes for conditions that you haven’t yet tested. We would look at: Why does this particular colorant fade over time? If ozone, for example, accelerates the fading of the colorant, what is the mechanism of that effect? Knowing that mechanism, can we then predict what other factors might also cause this colorant to fade? Can we predict what will happen in the environment of a particular building?

The difference really is that of product testing versus scientific research. If we can better predict which coating should work well, we can do more efficient selection of coatings for testing or even design an ideal sort of coating. With a clinical question, we might rank several coatings or colorants by the degree of change in a similar environment over time. The scientific questions would try to look at explaining the ranking and identifying the underlying relationships so we can predict what change might occur on untested products. We need both of these sorts of questions.

There are other kinds of questions we might ask also. For example, how much variability do we see in results? Is the change such as fading that we’re seeing significant enough to be worth worrying about?

Once a research question or problem is developed, then the next steps in research can continue with developing your hypotheses. These are going to be the possible answers to your research question or possible explanations of your original observation. Platt referred to a paper that had been published in the journal ‘Science’ way back in 1890 by TC Chamberlin, and then after Platt’s argument came out it was reprinted by ‘Science’ again in 1965. Again, of course it’s dated in language, even more so than Platt’s. This paper does make a few interesting points about the usefulness of trying to have multiple hypotheses rather than a single one and how that can encourage creativity.

He talked about an argument that was raging during his time regarding the origin of the Great Lakes basin. There were 3 competing theories that were being sort of violently argued. One was that they were river valleys whose outlets were blocked by glacial debris. The second was that they were excavated by ice, and the third was that they were made by crustal deformation. It turned out that it wasn’t actually a case of which one is supported because none of them could be refuted. The real question ended up being: How much does each contribute to the explanation? So often coming up with a single hypothesis may be too limiting because a problem often has multiple facets that can be better identified if you’re considering multiple hypotheses.

Chamberlin also talked about how if you have one hypothesis, it can become somewhat like what he called an intellectual child. If it’s our only one, we may simply not see anything that doesn’t support it. We become invested in trying to find things that support it and we just don’t see things that might refute it. Of course there are many journals that won’t accept papers that only report on lack of hypothesis support. It’s also sort of a scientific survival instinct to try and support a single hypothesis.

Chamberlin put this a lot more poetically, so I’m going to read the sentence from his paper on this. Quote: “The mind lingers with pleasure upon the facts that fall happily into the embrace of the theory and feels a natural coldness towards those that seem refractory. Instinctively there is a special searching out of phenomena that support it, for the mind is led by its desires. There springs up also an unconscious pressing of the theory to make it fit the facts, and a pressing of the facts to make them fit the theory.” End quote.

While one hypothesis might immediately come to mind when you’re looking for an explanation for a particular observation, if we can develop a habit of then trying to be more creative and come up with additional ones, that can feed creativity. It might ultimately provide for a more robust explanation, but it also means that we’ll have less invested in there being a specific outcome because if the original hypothesis is shown to be refuted, then one or more others still might emerge as a good possibility. We’re still going to have a good result. It’s good practice to try to bring into view then, every rational explanation for our observation and try to develop every possible reasonable hypothesis we can think of; not just the 1 or 2 most obvious ones.

We’ll look first at a clinical example. Here the observation is that some adhesives used in conservation of a particular type of paper discolor. The research question is: Which of 3 adhesives will discolor the least on a particular paper type?

In trying to develop alternative hypotheses, you can always negate the first hypothesis to come up with a second if you’re truly willing to take both outcomes seriously. This means that no matter what the experimental outcome, you’re still going to have a result; something is going to be supported, and you should at least hopefully be able to clearly eliminate one of them. Here we have as our 2 hypotheses: First, that there is no significant difference in discoloration over time for these 3 adhesives on this particular paper. The second hypothesis is: The 3 adhesives are significantly different enough to be ranked from better to worse. Clearly 1 of these will get refuted or eliminated.
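As a sketch of how that pair of hypotheses might eventually be separated once discoloration measurements exist, here is a minimal example; the color-change numbers are purely invented for illustration, not real data.

```python
# A minimal sketch: a one-way ANOVA comparing discoloration of 3 adhesives.
# The color-change values below are invented, illustrative numbers only.
from scipy.stats import f_oneway

adhesive_a = [2.1, 2.4, 1.9]   # color change for 3 replicates of adhesive A
adhesive_b = [2.2, 2.0, 2.3]
adhesive_c = [5.8, 6.1, 5.5]

f_stat, p_value = f_oneway(adhesive_a, adhesive_b, adhesive_c)
if p_value < 0.05:
    print("Hypothesis 1 (no significant difference) is refuted; rank the adhesives.")
else:
    print("No significant difference detected; hypothesis 1 stands for now.")
```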

Looking at a conservation example, one of my favorite ones is from a paper that Helen Alten, a conservator, wrote and presented back at the very first Materials Issues in Art and Archaeology symposium. It was an evening session running late at night; I think 7-10. Most of the other presenters, if not all, were scientists. Glass scientists. Helen was the only conservator that I recall, and it being late at night, the other papers were very convoluted and hard to follow. People were nodding off, falling asleep. With Helen’s paper, everybody woke up and paid attention. It was clear and easy for anyone to follow whether you’re an expert in that subject matter or not. She was looking at glass from a 13th-14th century archaeological site. Her first observation was that at this wet archaeological site there was a lot of variation in the amount of visible deterioration that could be seen with different wet glass fragments. The second observation was that air-drying the glass often seemed to cause loss of translucency, a visual emphasis of defects, or even cracking and fragmentation. Her research questions were: For both observations, what is the cause?

She developed 5 multiple hypotheses, and all of these had rationales that explained why she was proposing them. The first hypothesis was that waterlogged glass is damaged by air-drying. Her rationale is that she found published sources that indicated air-drying might damage glass by causing loss of material. The second hypothesis was that damage by air-drying is more severe in the more corroded samples, and her rationale is that the more corroded samples might be damaged more severely by air-drying through loss of structural strength. Third, the physical removal of water causes damage to the glass. Here her rationale was that water might act as a bulking agent, consolidating part of the glass structure, or surface tension of the water might cause physical damage to the glass during removal. The fourth hypothesis is that the refractive index of the water merely masks damage already present. The rationale is that the removal of the water itself is perhaps not damaging the glass; it’s just that the damage that was already there is no longer being masked. Fifth, she hypothesized that maybe initially-different glass compositions were causing differing glass corrosion, and the rationale is that the different compositions may behave differently in water removal and so will have had different initial corrosion experiences.

She then went on to present tests for each of these hypotheses and she again gave good rationale for why those tests were good to address these specific hypotheses. I invite you to go and look at that paper. It’s in the list of references up on the NCPTT website.

The next step is to explain all your relationships and rationales clearly, as Helen Alten explained why her multiple hypotheses each might have validity, and therefore we understand that they were worth examining. This is an important step because it’s not always clear to others what your reasoning is, and it might not even be clear to you later on if you step away from the project for a while. This type of lack of clarity, where the rationale or reasoning isn’t crystal clear, is also often a reason for papers reporting on the work to be rejected by journals or for grant proposals to be unsuccessful; because the readers or reviewers will look at that and say, “I don’t understand why you think these hypotheses are answering your question or why these tests were being done. How are they addressing your hypotheses?”

The next step is to infer. This is crucial because you’re going to be basing your experiments on these inferences. What are the implications of your hypothesis? If it’s true or if it’s going to be refuted, what do you expect to find? You need to be clear here in order to set up tests that are going to clearly and logically connect back to your hypotheses, so it’s clear to everyone that your tests are, in fact, clearly and cleanly eliminating or testing your hypotheses.

These combinations of hypotheses and their implications need to be explicit and written down in order to perform clean experiments and get a clear result. If you don’t have any hypotheses and you’re just collecting data that you’re then going to use later to try to come up with some interpretations or explanations, you’re never going to be wrong, but that type of work also tends to make really slow progress or even no progress. Funding and publication is also likely to be more difficult because reviewers want clear hypotheses and test implications.

One example that I know of is a colleague who had a fellow researcher die, unfortunately, and he left a huge amount of data that wasn’t accompanied by any written hypotheses or test implications, so the lab team needed to do something with this to get back to the funding agency. They worked for a long time just trying to dredge through the data and do various analyses to look for significant associations to try to come up with some useful result. They did finally produce a tiny, not very important, result. It certainly wasn’t worth the years of data collection and the person hours that were put into that.

A hypothesis also should not be so weak that no reasonable testing program could ever disprove it such as: This treatment might be valuable under some circumstances. This isn’t really the way to advance the field or to make efficient use of research time and funds. You also don’t want to try to say everything in one hypothesis and over-complicate it. But, you do want to try to say something worthwhile. Hypotheses should not be value judgments outside of the realm of testing … This is something that’s especially problematic in social sciences … Clear hypotheses with rationales for why those are possible explanations for your observations and clear test implications means that others can follow your thinking patterns and you yourself can follow them even years later.

We’ll look at a couple of hypothesis examples for scientific research questions. Here we have an observation that some adhesives used with this particular type of paper discolor. The question is: What factors determine how much an adhesive will discolor on this paper? The first hypothesis might be that adhesives with certain chemical bonds susceptible to hydration will react with water, causing discoloration. The implication here is that color measurement of some adhesives subjected to varying humidity will show greater discoloration after exposure to high humidity. The rationale is that excess moisture allows hydration reactions to occur.

A second hypothesis might be that adhesives containing a particular impurity will discolor over time. A rationale might be that this particular impurity is highly reactive and it’s known to be something that can be introduced during adhesive synthesis or processing. An implication might be that color measurements made before and after artificial aging on adhesives with and without the impurity will show greater discoloration on the ones with the impurity. Here we can see that both hypotheses could be supported rather than negated by experimental work and then we might be looking at what degree do each of these contribute, and get maybe a better explanation for what to expect.

The next step is to design a testing program using those test implications we’ve just come up with. Think back to that idea of a logical tree, and use that as sort of a road map for the design process. This type of design plan where you’re going to at least make an effort to eliminate at least one of your hypotheses will allow for some real conclusions and that will then strongly support whatever remains after your attempts at elimination.

Simplify: This is actually quite hard to do because we all have a tendency to try to design one experiment that can test for all possible variables. In conservation and preservation, you’ll be constantly pressured by colleagues to add this or that variable into the test, but don’t do it. Focus on trying to exclude at least one of your possibilities in a simple test that will give a clear result for that, and then move on to the next thing to test for.

For example, if you’re mainly interested in how well a coating for silver can protect against hydrogen sulfide, then the first logical test would be to test your potential coatings on coupons of one silver fineness under exposure to just hydrogen sulfide. If a colleague notes that maybe application method also has an effect so you should test 3 common application methods; and maybe there’s a difference between fine and sterling silver so add in that variable; and then maybe we should also check on how the coatings also perform on copper and lead too; and maybe metals of different geometries will vary, and so maybe we should try some curved coupons and some flat coupons; and what about trying to vary the coating thickness too? Pretty soon you’re going to have an overly complicated experiment that will take much too long to do and is likely to give a result that’s going to be difficult to interpret. Try not to do this. You can instead do a series of experiments in progressive, logical steps. You don’t need to do it all in one highly complex long experiment that may not have results you can clearly interpret or explain.

So, what is the simplest experiment that can exclude one of your alternatives? This is called a crucial experiment. Here we want to use the thinking again to come up with ways to try to exclude one or more hypotheses in the simplest, shortest, most clear experiment that you can versus the longest and most complicated and most expensive one you can. We carry out the experiment.

Analyze the results. This may require statistical analysis, it may not. However you’re going to interpret your results, what you’re going to be doing is looking at those actual results and comparing them back to those test implications you created for each hypothesis.

The next step is to publish your results so we don’t all have to keep reinventing the wheel. Also make sure to do a thorough literature search at the outset when developing your research questions and hypotheses so that you’re not reinventing the wheel. This is actually another reason papers often get turned down for publication, because we might sometimes see scientists in other fields who didn’t check the conservation or conservation science literature and are not aware of the work that’s already been done, or conservators or conservation scientists who didn’t check the allied literature. That will often get caught in peer review.

The next step is to build on your results if possible. Unfortunately, this is most often dependent on long-term or followup funding, and that’s often difficult to find in our field. Publishing then is even more important because some day someone else may get funding to build on your results or you might have to wait a couple of years before you can continue with that line of research, so at least publishing the initial results is going to get that information out into the field in the meanwhile.

This is a drawing of the research cycle, which you can see is not linear. You often need to backtrack and rethink things at the end when something doesn’t work out the way you planned. My experience is that always happens. Something somewhere always goes wrong or is unexpected and it means you’re then going to have to start over and tweak your design. You may start up here with your observations, you develop some hypotheses, you create a design, and you start to do your experiment, and then you find out that it’s not going to work. You have to go back and either redo your design or reformulate your hypotheses. You finally run your experiment; as you’re analyzing things, maybe you didn’t get a clear answer and you decide something was fatally wrong in your design or even further back, in your hypotheses. Finally you get through here, you publish it, and then you’re going to create sub-hypotheses or sequential hypotheses to hopefully continue on the research.

I’ll just show one quick example of when things go wrong and research is not linear. I was working on a silver coating project with a rather large team and things got a bit over-complicated. After weeks of planning as a team for which coatings to test and which shapes of the coupons, which application methods, how to set up the accelerated aging chamber, which pollutants to introduce, and when and how to monitor them, et cetera … We finally got started on the actual experiments and we got the coupons in the chamber and began what was supposed to be a long-term test. But it turned out that one of our primary coatings that we were testing … We were testing it because it was recommended and used in the preservation of silver … Turned out to be completely permeable to hydrogen sulfide; at least at the thickness that we had applied it. That’s a common off-gassing product, so within the first week of testing all of the coupons with that turned completely black and it would then swamp out the results of any further analyses we wanted to do. So we realized that we had to stop and rethink the entire experiment.

Later we actually found out that the same thing had actually happened to some silver collections where this coating had been used, but no one had ever reported that in the literature. Even when you try to think about every possible aspect of your design and you’ve done a literature search, things can still go wrong. But think about how many more things could go wrong if you didn’t spend a lot of time thinking, planning, writing out rationales, and checking the literature.

I want to end this section by quickly reviewing some of the advantages of trying to come up with multiple hypotheses prior to undertaking research rather than just grabbing onto and focusing on the one that seems most obvious to you. It can encourage creativity. You might come up with some potential answers that had never occurred to you before. It can also encourage consideration of more complex causes of a problem. You may have multiple solutions that are all contributing. You may end up with a more sophisticated answer. You try to think of as many potential answers to your research question as possible, and not just stop at one or a few.

I invite you to practice this at your leisure with a hypothesis, brainstorming a practice vignette that’s been posted on the NCPTT website. There’s a couple of brainstorming exercises there where you can try to think of as many hypotheses as possible that could be possible answers as to why deterioration is occurring in 2 particular case studies. You should be able to come up with 8 or 10 hypotheses. Some of them are things that you might be able to quickly eliminate through a literature search. Others could be eliminated through some simple tests, and others may require more detailed research.

Now we’re going to pause for a couple of questions if there are any.

Part 2: Basic Concepts of Experimental Design

I can’t hear them, so we need to have them show up in the chat. After this section, I’ll check the chat box again and answer anything that’s there. Before launching into the next topic, I want to first highlight some of the advantages of constructing a formal experimental design, and this is what we’re going to talk about now. This is when you actually write out your hypotheses and their implications and then follow a series of defined protocols, which we’ll be talking about, in your experimental work. First of all, this prevents unintentional bias, and that will allow you to be more confident that the results of a small study of a few objects or simulated objects can be generalized beyond that to actual treatment decisions. You’re also more likely to have accurate results, which is of course important, and the risk of self-deception is avoided or reduced by attention to design choices.

We’ll look at a few examples of when that would happen. In addition, some elements of a formal experimental design are required for meaningful statistical analysis, and in medical and agricultural research, this approach has brought about many great developments, and I think the same is likely to be true in conservation and preservation research. Here are 3 divisions where decisions have to be made within the design of experiments, and therefore these mark the major aspects or steps of an experimental design, your object protocols, measurement protocols, and treatment protocols, and we’ll be discussing each of these in turn.

For object protocols the major decisions to be made include whether to use real objects or facsimiles and how many replicas to test. As an example, for tests to identify the materials or technology used in making a ceramic pot, we need to decide how to conduct testing or sampling protocols for the pot itself. If we test only a few objects out of a larger group available, and these are randomly selected, we can at least generalize with some confidence to the collection itself. We know that our results will be something that is relatable to the entire collection. If we’ve randomly selected test objects from several sites or collections, we can then generalize even further to this particular type of ceramic. Or we may have a ceramic tradition with many similar sherds; these are still real objects, versus typical experiments using facsimiles, things such as metal coupons constructed for a test. If we can’t sample real objects, or if we prefer to begin a testing program first using facsimiles before sampling real sherds, then we can make ceramic test tiles that simulate the composition and structure of the sherds.
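As a small aside, random selection like that can be done with any method that has known odds; here is a minimal sketch using hypothetical sherd identifiers, just to show that the sample is drawn by chance rather than by convenience.

```python
# A minimal sketch of randomly selecting a few objects from a larger available
# group so results can be generalized back to it. Sherd IDs are hypothetical.
import random

available_sherds = [f"sherd-{i:03d}" for i in range(1, 61)]   # 60 sherds on hand
sample_for_testing = random.sample(available_sherds, k=6)     # unbiased selection of 6
print(sample_for_testing)
```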

For studies characterizing a material or technology, of course, we need the real objects, but what about tests of treatments? For example, I once observed a pot that had been damaged by multiple black spots along the surface. Someone had spot tested a potential surface cleaning method on a real object which happened to be very porous and filled with carbonized material within the microstructure. The applied solvent in the test carried the carbon up through the network of pores very quickly and deposited it on the surface. There were a lot of sherds stored along with the pot, so clearly those would have been much better for the initial testing, and thin sections made from one of the sherds actually would have revealed this very porous, carbonaceous interior, and you would have seen this pore network that would have brought the carbon up.

Not starting testing on simulated objects or similar sherds can have a very disastrous effect when we then go and put something on the actual whole object. There are a lot of examples in the literature of testing something like a new cleaning method on real objects, not always with mention of whether or not prior tests were done on simulated objects. Even if the new method doesn’t cause damage immediately, it’s often not known what the long-term effects might be.

One cautionary tale was reported in a letter to the editor of Studies in Conservation back in 1991 concerning a treatment for rock art sites that was published in the journal more than a decade earlier. It was an artificial desert varnish that was chemically induced over the rock art of many ancient petroglyph sites in the American southwest desert. It was a measure that was intended to protect the engravings from damage, but over time, it turned out that this artificial varnish darkened, and it had actually begun to obscure some of the petroglyphs, and it also began to show some sharp lines of demarcation with areas of natural desert varnish in the surrounding rock, and so it was becoming difficult to view the petroglyphs in the form that they had originally been created. Removal of the now discolored artificial desert varnish was not possible because it would have been too damaging to the stone and the petroglyphs. In this case, before any treatment was applied to real rock art sites, it would have been useful to apply the treatment to simulated rock art and then do artificial aging to see how well the treatment held up over time.

Typical experimentation in preservation involves experiments using things like paper strips, metal coupons or the like, rather than real objects. An example would be testing how well metal coatings hold up under long-term exposure, or how specific metals react to pollutants in the environment. Subjecting real objects to the test conditions would make no sense because it’s too risky, requiring cleaning and polishing and risking damage if we’re wrong about the potential materials or conditions, so we use simulated coupons, for example of the appropriate metal type that might be found with collection artifacts. We need to clearly define those simulated objects to be as close as possible to real objects, such as deciding which metals to test, whether to use fine or sterling silver, etc.

One mistake that scientists from other fields sometimes make in testing treatments or coatings for conservation use is to use substrates not typically found in ancient or historic collections. How much replication to use is also a crucial issue here. The issues with replicates include: How many do we need? How do we choose them? The main purpose of replicates is to identify random variation in objects, measurements, and treatments so we can, first of all, estimate variability, and second, estimate actual treatment effects more accurately. We can’t always find a way to have replicates with real objects, but there’s really generally no excuse for simulated objects not to have replication. We need some replication to do statistical analysis and to avoid random odd results or experimental blunder. Remember, back when I mentioned if something can go wrong, it probably will. We need replicas.

If we have too many replicas, then we can needlessly complicate things or add on too much time and effort without gaining appreciably more in terms of results. One example is with Oddy tests. A lot of people don’t do any replication at all, but sometimes we do see variation in replicates being tested with the same materials or even variation in the appearance of the controls, making interpretations difficult or even potentially inaccurate. Sometimes something goes drastically wrong, like you forgot to put water in one vessel, or put too little in, or the vessel wasn’t well sealed. Having at least 2 replicas can help you identify where something may need further followup testing.

A paper on your reference list by a man named Appleman back in 1990 noted that too much testing in the coatings industry used no replicates, and so was not actually reliable, and he pointed out a number of cases where results were actually quite wrong. I’m not sure this problem has really improved since then, and you can see it’s not just a conservation and preservation problem. How many replicates are needed depends on the variability that you expect. If you have no idea what that might be, then starting with 2 or 3 replicates is pretty reasonable. If the variability is expected to be quite large, then this should be increased. However, even for experiments involving human beings or other highly complicated and variable biological organisms, 8 to 25 replicates is generally considered sufficient. It’s wasting time and effort to go too wild with replication beyond what would actually give you new information.
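One rough way to reason about how many replicates is enough is a power calculation for comparing two groups. The sketch below assumes an effect size (expected difference divided by expected standard deviation) rather than measuring one, so it is only a planning aid, not a rule.

```python
# A rough planning sketch: how many replicates per group does a two-sample
# comparison need for a given assumed effect size? The effect sizes are guesses.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# A large, easy-to-see difference (1.5 standard deviations) needs few replicates.
print(power_calc.solve_power(effect_size=1.5, alpha=0.05, power=0.8))

# A smaller difference relative to the variability needs many more.
print(power_calc.solve_power(effect_size=0.5, alpha=0.05, power=0.8))
```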

Now let’s look at what constitutes a replicate, or how we define or choose them. We need to think carefully about the purpose of replication. In an artificial aging study of an adhesive applied to paper, for example, we could just apply the adhesive to one sheet of paper and then cut it into 3 pieces and say we have 3 replicas. If we think there may be variation in the paper, and that could be especially true if we’re using a historic paper, or if we think there might be variation in our application process, maybe it might be a bit thinner in some areas than others, for example, then using 3 separate sheets of paper with 3 separate applications of adhesive will better catch any potential variation. If we had to prepare the adhesive mix in house or have some reason to think that there might be variation in purchased packages, then we may even want to use 3 separate batches of adhesive.

For testing a treatment to be applied to natural history skins, for example, we know to expect some variation because we’re working with a complex biological specimen that’s also been treated in ways that might involve highly variable preparation methods. On the other hand, simulated objects, skins that we acquired and prepared ourselves, are difficult to acquire and time consuming to prepare, so we do want to be judicious in replication. We could use only one skin, and then apply 3 different treatments, each with replication, to different patches. Later when we talk about single object studies, we’ll look at how that can be done in a statistically valid way. If we expect a lot of variation in the substrate, we can get 2 separate prepared skins from 2 animals, and then we can apply the same number of replicates of each of the 3 treatments, but have each applied to a different skin. It really depends on where we expect variability, and what’s reasonable and doable, but still able to catch that expected variation.

Things to consider with measurement protocols include deciding on variable types: is it going to be quantitative data, presence/absence data, numbers that represent categories, numbers that represent real quantities or amounts, etc.? Other issues to consider are how many repeated readings are needed, how many repeated measurements, as well as various solutions to help us try to avoid bias. We’ll look at each of these individually in a moment.

The purpose of repeated readings is to increase accuracy by averaging out measurement error such as you might get with instrument noise or reading error. Here we’re going to want to average the data. Protocols often exist for the particular instrument that you are using. With repeated measurements, we’re taking measurements at different spots or at different times, and the purpose is to detect any spatial or temporal variation or change occurring in the sample. A spatial variation might be edge effects; for temporal variation, the most common thing we’re looking at is before and after treatment or before and after accelerated aging. These are not averaged. How many do we do? It depends on the importance and the expected significance of the time and space difference and the likelihood of there being a measurable difference or at least one that matters. These should be limited to what is really likely to be important, so you’re going to want to have a clear rationale for any repeated measurements that you are taking. You want to have a number that’s both reasonable and productive.

For example, if you have no reason to think that there’s any spatial difference, then you still might want to measure a spot on the interior and a spot on the edge of each specimen, but you’re not going to want to measure 6 or 10 different locations; that just wouldn’t make any kind of sense. It’s a waste of your time. And if your main question is whether there’s a significant difference in a treated area before and after accelerated aging, then it doesn’t really make sense to also add in measurements that involve removing the specimens from the aging chambers at a lot of different intervals along the way. You might want to do that if your research question is: at what point does change occur? If you only want to know whether change occurred after aging, then you’re wasting a lot of time taking a lot of interim measurements.
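As a tiny sketch of that distinction, with invented numbers: repeated readings of the same spot get averaged to smooth out instrument noise, while the before-and-after pair of repeated measurements is kept separate, because that difference is the result we actually care about.

```python
# Invented example numbers: repeated READINGS of one spot are averaged,
# while repeated MEASUREMENTS (before vs. after aging) are kept separate.
import statistics

readings_before = [54.2, 54.5, 54.1]    # three readings of the same spot, before aging
readings_after = [51.0, 50.8, 51.1]     # three readings of the same spot, after aging

spot_before = statistics.mean(readings_before)
spot_after = statistics.mean(readings_after)

# The before/after values are not averaged together; we report the change.
print(f"Change at this spot after aging: {spot_before - spot_after:.2f}")
```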

Avoidance of biased experiments is a crucial consideration because otherwise you may be getting a wrong answer because of poor research design. One classic example is that of N-rays. In 1903, a very distinguished French physicist announced a discovery that he had made of a new form of radiation. Objects placed in an x-ray beam, with the beam directed through a prism, would then emit light that could be seen on a phosphorescent detector. He named these N-rays. While many researchers claimed to have replicated the experiment, an equal number just could never replicate it. They could never see the N-rays. Finally someone visited the original physicist’s lab, and without the French physicist’s knowledge, removed the prism during the experiment and replaced it with inert wood, so no N-rays should have been produced, but the physicist still claimed to be able to see N-rays. It turned out that in fact they don’t exist. Once this whole chain of events was reported, no one ever saw them again. That was the end of N-rays, and we’ll be looking at what the problems were here and how lack of blinding and randomization led to this result.

Sometimes a person may feel their career, for example, depends on a specific technique being the best, or institutions may have spent millions of dollars on an object, and now it’s been questioned, and so they want you to authenticate it. There’s a lot of pressure there, so these things can bias us, and without us trying to, can lead us to see things that we want to see and to not see things that we’d prefer not to be there. It’s best just to accept that this can happen, and then choose techniques to avoid bias so we don’t have to worry about it.

Standard techniques for doing this include randomization of measurement order or measurement place and blind analysis, not knowing which objects received which treatments while you’re taking your measurements. Of course, if we’re doing this, we need to make sure we have some way of adding numbers or labels so we have a unique identification code that lets us track each particular object through the blinded or randomized measurement process. E. Bright Wilson wrote an excellent book on scientific method that’s on your list of references, and he gives an example of the need to randomize measurement order.

This was a case in an industrial lab where experiments were done to check the effect of the length of time of pressing plastic parts in the mold on the strength of the finished parts. Hot plastic was first pressed for 10 seconds and removed. The next batch was pressed for 20 seconds, etc., and the results showed a strong dependence of length of time pressed with strength of the part, but the supervisor rejected the results because pressing times had not been randomized. The lab then re-did the experiments with randomization. It was clear now that the dependence didn’t actually exist. It appeared that because the mold got hotter and hotter during the experiments, it was actually the heat of the mold, not the length of time being pressed, that was related to strength.

Another example Wilson gives is that in one lab carrying out a set of chemical analyses, standard practice was to run pairs of duplicate analyses together, and agreement was good. Another lab found a lot of discrepancy in their results. They couldn’t duplicate the results. It turned out that a zinc reductor through which all the samples ran gradually lost effectiveness over the day because of the presence of certain other elements in the samples. The effect from one sample to the next was small. If they had randomized the order so all duplicates were not done side by side, they would have seen the major errors.

Another common example following a similar idea that we see in conservation is that a measuring instrument can often drift gradually, so even if we calibrated it at the outset, if all the replicas of one treatment or coating, etc., were measured in order, the results might be different than if all of the treatments and coatings, etc., were randomized for the measurement order.

Blinding refers to taking measurements without knowing which sample received which treatment. This can sometimes be difficult or impossible to do with cultural heritage objects, but it’s especially important to try to do it for subjective measurements or assessments. An example is the use of the placebo in medicine. The physician who’s assessing how well the patient’s responding doesn’t know whether the patient received the proposed new treatment or standard treatment or even an inert placebo.

An example in conservation is the grading of Oddy test coupons. These are assessed as similar enough to the controls to be considered a pass for use as storage or exhibit material, a little worse than the controls but good enough for temporary short term use, or so much worse than the controls that they should be considered failed for any use near metal artifacts. I’ve found that if I know which 2 coupons are replicates housed with the same test material, then my mind expects that they should perform the same, and if one of them is borderline between a pass and a temporary or between a temporary and a fail, and I know which way I graded the other replicate, I tend to see it the same way. If instead they’re stamped with numbers on the back or coded in some other way for grading without knowing which test materials each had been housed with and which ones were replicates of each other, I often grade things differently. This same principle applies to grading the performance of coatings or any other treatments. This issue is less important maybe for automated measurement recording, but it’s crucial for subjective assessments.
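To show how that kind of blind coding and randomized grading order could be set up in practice, here is a minimal sketch with hypothetical coupon labels; the key that links each blind code back to its coupon is set aside until all the grades are recorded.

```python
# A minimal sketch of blind coding plus randomized grading order for a
# hypothetical set of Oddy test coupons. The coupon labels are invented.
import random

coupons = ["materialA-rep1", "materialA-rep2", "materialB-rep1",
           "materialB-rep2", "control-rep1", "control-rep2"]

random.shuffle(coupons)                               # randomize the order
codes = [f"S{i:02d}" for i in range(1, len(coupons) + 1)]
blind_key = dict(zip(codes, coupons))                 # set aside until grading is done

print("Grade the coupons in this order:", codes)
# Record each grade against its code only; afterwards, use blind_key to
# link the grades back to the coupons, their test materials, and replicates.
```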

Important things to consider regarding treatment protocols include controls and randomization. Controls receive either no treatment or a standard comparative treatment. Their main purpose is to be certain that the effects that we see are actually due to the treatment and not to chance or to some uncontrolled factor. Controls are not always needed. We need to think about the purpose of the control. For example, if the experiment is intended to compare one coating to another, then we might not need a third control.

Randomization of object treatments is important for a number of reasons. First, statistical tests assume treatments were assigned by a random process; it’s an underlying assumption of many statistical tests. Yes, of course you can still run a statistical package on your data if you didn’t randomize specimens to treatments. Alarms aren’t going to come on and the program won’t freeze up. There won’t be red flashing lights showing up on your computer, but your results will not necessarily have validity or be reproducible, and you’ll find that if you try to work with a professional statistician, he or she will likely refuse to do the analysis.

Second, randomization prevents bias in the selection of which objects get which treatment. We’ll look in a moment at cases where that has been a problem. Third, it distributes uncontrolled factors among all the treatment groups, preventing bias that would accidentally make one treatment’s effect appear to be different from another’s when it’s actually not. One cautionary tale E. Bright Wilson discussed in his book was a study of whether a program of offering milk to children during the school day would have an effect on their height and weight over a 3-month period. This was a large-scale experiment: 10,000 children were selected as test subjects who would get milk, and 10,000 as controls who would not get milk. Obviously there’s no problem with the amount of replication here.

There was initial randomization of children to treatment, to milk or not. However, a fatal flaw in the experiment was that teachers were allowed to change that selection if they felt their class appeared too unbalanced in the number selected for each group. It turned out that the natural human sympathy of teachers created a very biased sample. Children who were already taller and heavier than their peers were assigned by teachers to the control group because, after all, they didn’t really need more milk, did they, and those children who appeared to be under-nourished, the teachers felt sorry for them and assigned them to receive milk. In the end, the biases prevented any clear result, so this huge and costly experiment had no reliable results to show.

Contrary to popular conception, randomization is not haphazard picking out from a group. Instead, the term randomization refers specifically to the use of a variety of methods that have known statistical odds that can be calculated. Haphazard selection is not random, because there can still be an underlying bias in the order in which things are selected or assigned to a treatment group. Selecting laboratory mice to be placed in one treatment group or another, for example, is always done by a proper randomization method, because it was long ago noticed that various types of biases otherwise emerge. For example, if we say that the first 10 mice we happen to pull out from the cage are going to go into treatment group A and the second 10 we pull out will go into treatment group B, our experiment will be worthless. We can’t trust that any significant difference in outcome seen between the two treatments has anything to do with the treatment itself. Why? Well, because the healthiest, strongest mice may run away from our hand, leaving the sickliest, tired, old ones to be grabbed first. If the first treatment appears to result in more sickly or dead mice, well, it’s just that more of them may have started out with health problems to begin with.

Randomization of objects to treatments requires a bit more time and effort, but even in conservation and preservation, where our subjects don’t run away or try to bite us, not randomizing can affect the outcome. For myself, if I’m not randomizing, I find myself subtly choosing the order of specimens or their group assignment based on something like what’s easiest and fastest to set up. There are several methods for accomplishing randomization with known statistical odds, and the easiest one often depends on how many treatment groups you have.

Flipping a coin is one method; rolling a die is another, or you can use multiple dice and combine the numbers if you have more than 6 groups. A random number table can also be used. Here, you decide ahead of time where on the sheet you’re going to start and what direction you’re going to move in. Depending on the number of groups, you can use single digits or double or even triple digits. If you land on a number you’ve already used, then you move on until you land on a new number. You can also use a computerized random number generator. Those are actually called pseudo-random number generators because they satisfy statistical tests for randomness but are still produced using a definite mathematical procedure. Be careful about the program you use.

Do some research to see if it’s accepted by statisticians as a pseudo-random number generator. A colleague of mine once used a casually found website that claimed to generate random numbers, but once she carried out the assignment of specimens to treatments using it, she noticed there actually was a clear pattern, so it wasn’t random at all. Don’t just pick up any old website. Major statistical packages like SPSS, developed by teams of statisticians, have reliable pseudo-random number generators. You can also draw numbers or treatment groups out of a jar. This can be an easy way to randomize assignments for a relatively small set of specimens and treatment groups. However, this will only randomize if the pieces of paper used are very well mixed and if you can’t read any of them while you’re drawing them. This is the same basic principle as a lottery.
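A minimal sketch of using a well-tested pseudo-random number generator (Python’s standard random module) in place of dice, a random number table, or slips in a jar; the specimen labels and group sizes are hypothetical:

```python
# Minimal sketch: assign 12 hypothetical specimens to 3 treatment groups
# (4 replicates each) using a well-tested pseudo-random number generator.
import random

random.seed(2016)                  # recording the seed makes the assignment reproducible

specimens = [f"S{i:02d}" for i in range(1, 13)]
groups = ["A", "B", "C"] * 4       # 4 replicates per treatment group
random.shuffle(groups)             # randomize which specimen gets which group

for specimen, group in zip(specimens, groups):
    print(specimen, "->", group)
```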

There are different ways in which to construct a randomization procedure. The first is with the object fixed: you’ve decided which object to assign, and then you select the treatment group using a randomization procedure. In this case, for example, you might be drawing A’s and B’s out of a jar, first to assign a treatment group to sample 1, then to sample 2, etc. The second is with the treatment fixed: you start with a particular treatment group, and then fill it by assigning objects through the randomization procedure. Here, for example, we would start with treatment A and then randomly draw numbers until we have 3 samples assigned to that group.
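A minimal sketch of the two construction methods just described, with hypothetical samples 1–6 and treatments A and B:

```python
# Minimal sketch of the two ways of constructing a randomization procedure.
import random

samples = list(range(1, 7))   # hypothetical samples 1-6

# Object fixed: take each sample in turn and draw its treatment from a "jar"
# holding exactly three A's and three B's.
jar = ["A", "A", "A", "B", "B", "B"]
random.shuffle(jar)
object_fixed = dict(zip(samples, jar))

# Treatment fixed: start with treatment A and randomly draw the three samples
# that fill it; the remaining samples go to treatment B.
group_a = set(random.sample(samples, 3))
treatment_fixed = {s: ("A" if s in group_a else "B") for s in samples}

print("object fixed:   ", object_fixed)
print("treatment fixed:", treatment_fixed)
```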

You can develop a stratified random sampling strategy, restricting in some way the population that you’re going to be selecting from. For example, if you’re studying the pigment use of a 19th-century artist, you can’t sample the painting in random locations, but you can at least report what was done and how it might affect interpretation. If you never find a particular pigment, it might be because certain color areas could never be sampled, and you need to report that. Stratified random sampling might be possible along the edge, under the frame in each color area, or maybe just where there are existing cracks in each color. If you can’t sample some colors, you need to take account of that in reporting and interpretation. I’ve seen papers reporting on the pigment use of a particular artist with tables organized by color as if this were the full palette of the artist, but then, looking at the images in the publication, you can see that there are some pigment colors that are completely absent from the reported chart, which is rather puzzling.
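A minimal sketch of stratified random sampling, assuming hypothetical sampleable zones for each color area:

```python
# Minimal sketch: within each stratum (a color area's sampleable zones),
# draw a fixed number of sampling spots at random. Zones are invented.
import random

candidate_spots = {
    "red (crack sites)":  ["R1", "R2", "R3", "R4", "R5"],
    "blue (under frame)": ["B1", "B2", "B3", "B4"],
    "green (edge)":       ["G1", "G2", "G3", "G4", "G5", "G6"],
}

plan = {stratum: random.sample(spots, 2) for stratum, spots in candidate_spots.items()}

for stratum, chosen in plan.items():
    print(f"{stratum}: sample at {chosen}")
# Any color area with no sampleable zone is simply absent from the plan --
# that is the gap that should be reported alongside the results.
```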

A final example of unforeseen problems that can occur and affect results if randomization isn’t done is from a project discussed in one of my favorite books, The Cartoon Guide to Statistics; I highly recommend it. Here they’re talking about an experiment that was testing gas mileage for 2 different gas formulations using 2 taxi fleets. One fleet was assigned to drive on Tuesday using 1 gas formulation, and the other on Wednesday using the other gas formulation, and a significant difference in mileage was found. It turned out that the difference had absolutely nothing to do with the gas formulation. What happened is that on Tuesday it rained very heavily, so that taxi fleet had to drive a lot more cautiously, and it turned out that accounted for all the difference in mileage, because studies show that accelerating more slowly away from green lights and then stopping more gradually for red lights can cut fuel consumption by as much as 35%. This is also a driving behavior that drivers are more likely to follow in a heavy rain with slick roads. It would have been better to randomize the gas formulations to individual cars, or cars to test days, so that there would be fewer uncontrolled variables.

I invite you to go through the randomization practice exercises that are posted on the NCPTT website. These are intended to help you go through and try out various methods of accomplishing randomization, and it includes a random number table as well. We’ll try taking a few questions, again, here if you have any …

PART III — MULTIPLE OBJECT DESIGNS and PART IV — SINGLE OBJECT STUDIES; DESIGN VARIATIONS

We’ll move on to the last section, which is going to include parts three and four. Part three involves multiple object designs. We’re going to be looking at how to use a design check sheet to ensure that you thought of the most important aspects of your design and you’ve gotten them written down. This is definitely going to help you a lot when you go back to writing the final paper. We’re also going to talk about how sketching out the design or creating some kind of flowchart or diagram can be very useful as a summary tool for looking at your design and identifying whether or not you have any holes in that design.

Starting with the design check sheet. These are the aspects I like to have on my check sheets. The very first thing is to record the observation that led you to undertake the research in the first place. This section could also include a brief summary of the background and basic facts. Then comes the research problem or question. What is it that you really want to know? It’s important that you don’t try to test too many things in one experiment. Here’s the place where you can prioritize what it is you most want to know or need to resolve. Then your hypotheses, or the possible answers or solutions that you’ve come up with through your brainstorming process, however you carried that out.

The rationale for each hypothesis would explain why this is a potential answer to the research question. The implications of those hypotheses are going to state: if this hypothesis is true, what should you expect to have happen in an experiment? What would you expect to see? If the hypothesis is not supported, what would you expect to find instead? Again, a rationale would clearly explain why you expect those results. Then come the details of the object protocol. First, a definition of what constitutes an experimental unit is needed: are they real objects or facsimiles? Are your experimental units entire sheets of paper covered with adhesive? Or have you cut up an entire sheet of adhesive-covered paper into ten squares, with each square then being counted as a separate experimental unit? Your final interpretations will depend on how you defined an experimental unit, and so will the design of your statistical analysis, or the decision as to whether or not you can even do any statistical analysis.

The number of replicates is another issue you need to be clear about from the start. How are you going to define replicates? Where do you expect variation might occur? Have you defined your replicates in such a way as to be able to capture that expected variation? The measurement protocol will include the types and methods of outcome measures, such as visual evaluation, measurement of color, gloss change, etc. These should all relate back to the hypotheses and their implications. Clarifying this at the outset is helpful to ensure that you’re testing for the right things so that you can clearly tell if a hypothesis is eliminated or supported, and that you haven’t missed something crucial for making that decision. At the same time, you may find that some of your tests don’t address any of your hypotheses and would be a waste of your time. Consider whether you can randomize the measurement order so that all replicates of one group are not measured one after the other.

For the treatment protocol, it’s important to clearly define your groups and your goals. For example, if you’re comparing the performance of two coatings for protection of silver in museum collections, say Agatine and B 72, and you have two different application methods, spraying and brushing, you’d likely then define four groups: Agatine brushed, Agatine sprayed, B 72 brushed, B 72 sprayed. If you weren’t clear about that and were thinking that you only need two replicates of Agatine – one brushed, one sprayed – and two of B 72 – one brushed, one sprayed – you wouldn’t be able to look for differences between spraying and brushing, because you wouldn’t have those application methods being replicated within a coating type. To analyze for that factor, you need to have at least two of each.

Having uncoated controls here probably wouldn’t be very useful if you can predict that silver will of course tarnish if it’s exposed to pollutants with no protection, and if your main research question under study is which of these two coatings works better. Defining your test factors means clarifying and listing the types of things you’re testing, for example, coating type, application method, metal type, etc. These are the general categories that you’re going to need to be crystal clear about when setting up any statistical analyses and making interpretations. Finally, your randomization method should be planned out. In your final paper, it’s helpful to describe what randomization method you used, because then it’s going to be clear to readers that you actually used a randomization method versus the more common haphazard selection or picking of things that’s sometimes called randomizing.
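A minimal sketch of enumerating the four coating-by-application groups and randomly assigning replicate coupons to them; the coupon labels and replicate count are hypothetical:

```python
# Minimal sketch: enumerate the 2 x 2 factorial groups (coating x application)
# and randomly assign two replicate coupons to each group.
import itertools
import random

coatings = ["Agatine", "B 72"]
applications = ["brushed", "sprayed"]
groups = list(itertools.product(coatings, applications))   # the four groups

replicates_per_group = 2
coupons = [f"coupon-{i:02d}" for i in range(1, len(groups) * replicates_per_group + 1)]
random.shuffle(coupons)   # random assignment of coupons to groups

for i, (coating, method) in enumerate(groups):
    members = coupons[i * replicates_per_group:(i + 1) * replicates_per_group]
    print(f"{coating}, {method}: {members}")
```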

Then, what methods of interpretation or analysis do you have planned? How will you determine that the data produced by your measurements have supported or refuted your hypotheses? You may very well need statistical analysis to get a solid interpretation. If so, do you have a statistician to collaborate with? If you do, I can guarantee you he or she will be quite happy to see this experimental design check sheet, because it will make selecting the statistical test and setting up that test very fast and very easy. If you don’t think you need statistical analysis, what other methods are you going to use to interpret the results of your study?

Finally, sketching out the experimental design or creating some sort of research flowchart can help you clarify the logic in your plan and identify holes in your design. If you find it difficult to sketch out the design, it may be because your design is unclear or too convoluted. Drawing helps you find that out before you get started on any actual experimentation. Sketching out the design of an experiment you’re reading about in the published literature is also helpful, I find, because it will help you better understand exactly what they did and how they carried out the work.

For example, here’s a sketch for a simple experiment that was comparing the aging performance of four adhesives. Each experimental unit here is defined as a sheet of paper with adhesive applied. Four replicates were required for each adhesive. Color measurements are to be taken before and after accelerated aging, with color change and yellowing quantified. In addition, peel strength is to be measured. Because this test is destructive, an additional four replicates are going to be prepared for the initial before-aging tests. The final after-aging tests can be done on the original four replicates after the non-destructive color change and yellowing measurements are completed.
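A minimal sketch of one common way such color change could be quantified, the CIE76 color difference between CIELAB readings; the L*a*b* values here are invented and are not from the experiment described:

```python
# Minimal sketch: quantify color change before and after accelerated aging as
# a CIE76 Delta E between two CIELAB measurements (values are invented).
import math

def delta_e_cie76(lab1, lab2):
    """Euclidean distance between two (L*, a*, b*) readings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lab1, lab2)))

before = (92.1, 0.4, 3.2)   # hypothetical reading before aging
after = (89.7, 1.1, 9.8)    # hypothetical reading after aging (b* rises with yellowing)

print(f"Delta E = {delta_e_cie76(before, after):.2f}")
print(f"Delta b* (yellowing) = {after[2] - before[2]:.2f}")
```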

We could describe this particular experiment as having one test factor, type of adhesive, with four treatment groups, none of them being a control, because paper alone without any adhesive isn’t likely to be informative. Instead, we are probably including at least one adhesive that’s already routinely used in conservation and comparing it with the performance of at least one newer adhesive being proposed for similar uses. Even a very complicated multi-year project benefits from a research design sketch. You don’t have to try to read all the detail here. This is a sketch to illustrate a large project I planned out and submitted for funding. Many of the proposal reviewers mentioned that there was a lack of clarity in the research design. Of course, naturally, my first instinct was defensiveness: “Of course the research design was well thought out and clear,” but obviously it wasn’t clear to the readers.

I realized that I had violated my rule: I hadn’t sketched out the research plan. I figured, “I’ll sketch it out, and that will help readers see more clearly my brilliant research plan.” In my attempt to sketch it out, I realized the reviewers were actually right: there were a lot of holes in the research plan. After a lot of revision, rethinking, and re-planning, I was able to sketch out a research design that clearly shows a central research question, a central hypothesis, nine sub-hypotheses, and all of the tests we’d use to see if these hypotheses are negated or supported. Once I had written out this sketch, it was much easier to describe the research plan in a way that was much more clear to reviewers.

Many experimental designs that we use originated in agriculture, where a lot of research has been done testing variables such as seed varieties, pesticides, and watering practices. A typical design used in agricultural research is called the split plot block design, where multiple agricultural fields are used for testing, with each field divided into sections so that each factor is tested both within a single field and between fields. This makes for very robust and reliable comparison. This is a very appropriate design for conservation research, and it’s sometimes seen in the literature.

A very good description of the experimental design elements for this type of design is given in this paper by Green and Lease that appeared in Studies in Conservation. We’ll go through exactly how this project was set up. They started with the initial observation that paper often becomes acidic over time, and that this is caused by multiple factors; hence, de-acidification is carried out to stabilize the paper. Often an aqueous solution can be used, but if you have water-fugitive media present, a non-aqueous solution may be needed. At that time, the British Museum, where they worked, was using a solution of barium hydroxide in methanol. This wasn’t considered satisfactory because it was toxic, and the treatment produced an initially high pH within the paper, which they were afraid could cause hydrolysis and damage the cellulose chains, although the pH did gradually decrease on exposure to air as the hydroxide was converted to carbonate.

The research problem here was that additional non-aqueous solutions were needed. The main hypothesis to be supported or refuted was that methyl magnesium carbonate could be a successful alternative non-aqueous de-acidification agent. The main rationale for that is that it’s non-toxic. Also, the decomposition of the magnesium carbonate occurs rapidly on exposure to air, so they thought that maybe the paper would not have this initial high pH and that this should hold up over time. An implication would be that the treatment would not show adverse effects on tensile strength, either immediately after treatment or after accelerated aging.

Experiments were done on two types of paper – Whatman filter paper and 19th-century printed book pages. Here, the experimental units were defined as including both facsimiles and real objects. The larger papers were then divided into sections for treatments, so these are the plots, which makes it analogous to agriculture. There were three replicates for each combination of factors, paper type and treatment type. There were seven treatment groups: commercial methyl magnesium carbonate in methanol; methyl magnesium carbonate prepared in the lab, in methanol; barium hydroxide in methanol; calcium hydroxide in distilled water, which was included because it was another comparative standard treatment that would be used as an aqueous treatment; a control treated with only distilled water; another control treated with methanol alone; and an untreated control of just paper.

There are two test factors: paper type and de-acidification method. Here, the randomization method was clearly and explicitly described. Measurements included pH, measured by surface electrode, and tensile tests to check for paper damage. Both of these measurements were done before and after a period of accelerated aging. Interpretation of the data was by statistical analysis using analysis of variance. They were able to conclude that the effects of methyl magnesium carbonate on loss of tensile strength and on pH were similar to those produced by barium hydroxide. However, the new treatment has the advantage of much less toxicity, so it could be a very good alternative treatment.
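A minimal sketch of a two-factor analysis of variance of this general shape (paper type by treatment), using the statsmodels package; the tensile values are invented, not the published data:

```python
# Minimal sketch (invented numbers): two-factor ANOVA for tensile strength
# by paper type and de-acidification treatment, using pandas and statsmodels.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.DataFrame({
    "paper":     ["filter"] * 6 + ["book"] * 6,
    "treatment": ["MMC", "MMC", "Ba(OH)2", "Ba(OH)2", "control", "control"] * 2,
    "strength":  [41.2, 40.8, 39.9, 40.5, 37.1, 36.8,
                  28.4, 29.0, 27.9, 28.6, 25.2, 24.9],   # hypothetical tensile values
})

model = smf.ols("strength ~ C(paper) * C(treatment)", data=data).fit()
print(anova_lm(model, typ=2))   # main effects and the interaction term
```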

This last design included two factors: paper type and de-acidification treatment. We can increase the number of factors to some extent, provided there’s good reason, which means we’re giving clear rationales for every factor we’re adding in, and as long as we don’t let it get too complicated. Here’s one example: I had a recent project with the goal of developing protocols using image analysis software for ceramic thin sections that could reliably and reproducibly measure the area percentage of sand-sized particles in the fired clay. Rather than start with unknowns, which would have been real archaeological ceramics, we prepared test tiles in the lab using known proportions of clay and sand, and with purchased sand that had a known particle size.

Since image analysis tends to vary for different basic appearances, we used two types of clay typically found, one that fired to a red color and one that fired to a white color. Since we thought protocols might show some variation depending on particle size, we used three sand sizes: fine, which was 70 mesh; medium, 30 mesh; and coarse, 16 mesh. The sands were primarily quartz. Each of these was added to the clay in three different target proportions: 10%, 25%, and 40% by volume of loose sand mixed into the wet clay, in the same manner that a traditional potter might add in temper material.

We prepared five replicates of each combination of clay type, sand size, and sand amount, because we felt there was likely to be a lot of variation. There were also three replicates for each clay type with no sand added as our controls, to check for any natural sand that might be present as a background. After drying and firing the tiles, small slices were cut from each, and a thin section made for each of the 96 tiles. These were mounted on glass slides with blue-dyed epoxy so that quartz grains could be easily separated from pores for the image analysis, because in plane-polarized light both would appear clear if we used a clear epoxy. The thin sections were scanned in a high-resolution film scanner, at seven microns per pixel, and the images then used for image analysis. We used a software package called Image-Pro Premier by Media Cybernetics. The protocol that proved to be fastest and easiest to accomplish while giving accurate and reproducible results was then applied to real archaeological samples.
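This is not the Image-Pro Premier protocol used in the project, but a minimal sketch of the general idea, thresholding a grayscale image and reporting the area percentage classified as particles, using scikit-image on a synthetic stand-in image:

```python
# Minimal sketch: threshold a grayscale "thin section" and report the area
# percentage classified as sand-sized particles. A synthetic image stands in
# for a scanned thin section.
import numpy as np
from skimage import filters

rng = np.random.default_rng(0)

# Synthetic stand-in: dark clay matrix with a few bright square grains.
image = rng.normal(loc=0.25, scale=0.03, size=(400, 400))
for _ in range(40):
    r, c = rng.integers(0, 380, size=2)
    image[r:r + 20, c:c + 20] = rng.normal(0.8, 0.03, size=(20, 20))
image = np.clip(image, 0.0, 1.0)

threshold = filters.threshold_otsu(image)   # automatic global threshold
grains = image > threshold                  # True where a pixel is classed as grain

print(f"area classified as sand-sized particles: {100.0 * grains.mean():.1f}%")
```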

Now, since the amount of sand added to the tiles needed to be known for our protocol development, one way to incorporate blinding and randomization into the process was as a check on the results once the protocol was finalized. Then samples could be re-analyzed in random order without knowing the added sand amounts, which ones were replicates of each other, or what any of the previous analyses were – randomizing and using blinding. These results were compared back to the original results. When excellent comparability was found, that increased our confidence that the protocol was actually ready to apply to real archaeological samples of unknown sand content. This is one way we can sketch out the experimental design with this whole combination of factors: we have the white clay and the red clay; we have 10%, 25%, and 40% added sand using different symbols; and we have the fine, medium, and coarse sand shown by different sizes of each of the symbols.

If we try to get too complicated with too many factors, sketching, or even planning for and keeping track of the many combinations, would get extremely difficult. For example, if we try to run an experiment testing coatings on metal, and we want to test on several metal types – sterling silver, fine silver, copper, lead, bronze, and brass, perhaps of various surface shapes, flat and curved – and then we want to test different methods of surface preparation, several coating application methods, and perhaps several different thicknesses as well, all for nine coatings or so, things are quickly going to get very complicated to interpret. It’s likely we’re going to find at the end, when we’re trying to do our analysis, that we actually didn’t keep all of these variables in mind, so we didn’t get good replication of each. Even then, trying to discern which variables are important for the final results is going to be difficult. We’d be better off doing one experiment with the most crucial variables, such as metal types and coating types. Then, for those coatings that appear promising for specific metals, we could do another experiment to refine things regarding the effects of surface preparation, application method, and coating thickness.

One major problem with maintaining silver objects in historic houses or in museum storage and exhibit spaces is that many construction or decorative materials that might be used somewhere in the vicinity of silver off-gas hydrogen sulfide. Even a very small concentration of this pollutant is going to begin the tarnishing process and can quickly turn silver black. Regular cleaning of your silver artifacts is not a good solution, because that’s very staff- and time-intensive, and each cleaning is going to remove some silver and will eventually create damage to the surface. An example of an elegant, well-designed, well-described experiment in conservation that was intended to address this problem can be found in a paper by Grissom et al., published in the Journal of the American Institute for Conservation in 2013. It focuses on the evaluation of coatings for the protection of silver exposed to hydrogen sulfide.

One thing that makes this an elegant experiment is that they kept the focus on the most important factor for preventive conservation of historic silver collections: the ability of coatings to protect silver from hydrogen sulfide exposure, which is by far the most problematic pollutant for silver. They didn’t try to also test the effects of other pollutants. They kept things simple so that the most important factor could be clearly tested. They used only sterling silver coupons of one size and shape; they didn’t also try to test fine silver, add copper or lead, or use any other geometries. They tested 12 coatings, each with a stated rationale as to why that coating was chosen for testing. They selected one standardized application method. Each coating had four silver coupon replicates. The thickness of the coatings was measured with repeated measurements around the coupon. Coupons were exposed to hydrogen sulfide for 125 days, with periodic removal for continued evaluation, because they wanted to capture when change initiated for each coating.

Randomization was performed wherever it was appropriate, such as in the selection of spots for testing and in the order of coupon placement within the hydrogen sulfide chamber. Coatings were evaluated by visual observation, image analysis of digital photos, gloss measurements, and colorimetry. Each step of the experimental design is very clearly described, with all rationales, implications, measurement and treatment steps described in enough detail that there’s no need to guess as to what was done and why. As a result, some clear and useful results were obtained that are going to help move conservation practice forward. It would also be easy for anyone else to replicate the study or to carry it further.

While there are many possible variations in experimental designs, what’s most important to keep in mind are the basic ideas: aiming for a simple, elegant experiment that will provide clear results, rather than trying to do too much in one experiment and having very confusing results; keeping focused on the original observation that led to the research, and having a clearly defined research question or problem with clear hypotheses, implications, and rationales stated; defining the experimental units and having appropriate replication; keeping clear what the test factors are, and restricting these to the most important; carrying out randomization so that uncontrolled factors can be spread around and not affect interpretation and results; and then using statistical analysis where needed to aid interpretation. I recommend this paper as a good model, an example where all of these issues are clearly stated.

We’ll move on to the last part here, where we’re looking first, briefly, at single object studies, and then at some variations in experimental designs. Replication is usually important to identify variability and ensure that results of the experiment or study are generalizable to some sort of larger population. Sometimes you may have only a single object available for study. Generally this is going to be the case when you have a real cultural heritage object you’re working with, since experiments with facsimiles can always be replicated. With a single object study, you may primarily be interested in reaching conclusions that are best for the treatment of that one object. Here, the generalizability to other objects is of secondary interest.

The most important tool for being able to design a statistically valid study using only one object is to find a way of randomizing treatment and control areas within that one object. In essence, you’re making an object its own control by treating it more than once using a formal experimental design. You set up a design whereby there are several times or places at which one might apply the treatment; random assignment of the treatment to those times or places then gives an experimental design with known probabilities that can be used for statistical analysis. Especially if you can include blinding, then you have a really good study.

This is simply an extension of the idea of test patches, but instead of haphazardly testing a few patches on an object, you randomize which patches receive one or more test treatments and which ones remain as controls. This allows you to reliably compare the results. For qualitative assessments, it’s even better if the assessment can be done by someone who’s knowledgeable enough to make the judgment but who doesn’t know which patches received which treatments and which ones are the controls. With randomization, you now have the statistical probability assumption met, so you can reliably do statistical analysis for comparison. An analogy is seen in medicine in the idea of test patches for testing a person for allergic reactions.
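A minimal sketch of randomizing which patches on a single object are treated and which remain controls; the patch grid is hypothetical:

```python
# Minimal sketch: define a grid of candidate test patches on a single object,
# then randomly assign half to the test treatment and half as controls.
import random

patches = [f"{row}{col}" for row in "ABCD" for col in (1, 2)]   # 8 hypothetical patches

random.shuffle(patches)
half = len(patches) // 2
assignment = {p: "treated" for p in patches[:half]}
assignment.update({p: "control" for p in patches[half:]})

for patch in sorted(assignment):
    print(patch, assignment[patch])
# For blind assessment, the patches can additionally be given coded labels,
# with the key withheld from the assessor until grading is complete.
```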

While there are also larger-scale experiments about general causes of allergies, a number of people in the population tend to have allergic reactions to specific food items or other specific exposures. When an individual person is having a problem, they see a physician to receive tests of their own very specific allergic reactions. Test patches might be defined on their arm or back, with the patches numbered; the numbers, used with a coding sheet, give information about which possible allergens were applied to which patches. Placement is often randomized, with replication, and then the results are assessed by the patient or physician blind with respect to which allergens were used on which patches.

A case study of a single object in conservation that followed this formal design is a paper by Eric Hanson and Stan Derillian. It focused on an 18th-century tapestry woven in wool and silk that was installed in the Huntington Library between 1911 and 1920. Their observations included that it was very dirty and needed to be cleaned, but the silk portion was dry and brittle to the touch. While the effects of aqueous solutions or dry cleaning solvents on the physical and chemical stability of silk hadn’t yet been well investigated, there had been some reports of negative effects, especially of repeated aqueous treatments. Another tapestry in the group had recently been cleaned using a standard wet-cleaning technique with a non-ionic surfactant; with that one, they found that after washing and drying the silk appeared very different. It actually seemed to be more supple, with less breakage and less damage on handling.

Their research question then was: does cleaning cause a significant difference in the tensile properties of silk? The main hypothesis was that silk threads could be more resistant to fracture after washing. The rationales for the hypothesis were: first, that tensile properties of silk change with pH; second, that reintroduction of water to dehydrated silk fibroin might affect tensile properties; and third, that silk may undergo physical aging, but those changes might be reversible if it’s plasticized by water to the extent that the glass transition temperature is depressed near room temperature.

Here, the experimental units were defined as threads from one real 18th-century woven tapestry that had never been treated before, using multiple threads taken from within this single tapestry. A grid was superimposed over the tapestry, and a number was assigned to each coordinate on the grid. Using a random number table, threads were randomly selected for sampling. Replicate sampling was done before washing, where part of the thread was sampled. After the tapestry was washed, they returned to take more sample from the same thread or from an adjacent thread. In sampling, single threads of the two-ply weft were unwound at the random locations and about five millimeters was removed. They ended up with 20 paired samples of before and after treatment.
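A minimal sketch of that grid-based random selection, with a pseudo-random generator standing in for the printed random number table and a hypothetical grid size:

```python
# Minimal sketch: number every coordinate of a hypothetical grid over the
# tapestry and randomly select 20 locations for paired before/after sampling.
import random

rows, cols = 20, 30
coordinates = [(r, c) for r in range(rows) for c in range(cols)]

locations = random.sample(coordinates, 20)   # 20 paired sampling spots
for i, (r, c) in enumerate(locations, start=1):
    print(f"pair {i:02d}: grid row {r}, column {c}")
```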

There were two treatment groups defined: treatments and controls. The controls were the samples taken before any washing occurred. The treated samples were those taken after the entire tapestry had been washed. The test factor was washed or not. Measurement for effects was done by tensile testing, and interpretation by statistical analysis using analysis of variance. The results showed that the threads were actually significantly stronger following the washing procedure. The results pointed toward the need for more basic research on the effects of washing silk. They thought this could be especially relevant to handling of archaeological textiles in situ. They did recommend that there should be further study on long-term strength, on other potential negative effects of washing than the ones they tested for, and on the effects of washing on other factors such as dyes. However, because they kept their experiment simple, without trying to pull these other potential factors into one single experiment, they were able to get clear and useful results.
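The study used analysis of variance; purely as a sketch of the paired before-and-after structure, a paired t-test on invented breaking-load values shows how such a comparison can be set up:

```python
# Minimal sketch with invented numbers (the published study used analysis of
# variance): a paired comparison of breaking load for the same randomly
# selected locations before and after washing.
from scipy.stats import ttest_rel

before = [2.1, 1.8, 2.4, 1.9, 2.0, 2.2, 1.7, 2.3, 1.9, 2.1]   # hypothetical loads (N)
after = [2.6, 2.2, 2.9, 2.3, 2.4, 2.7, 2.1, 2.8, 2.2, 2.5]

mean_change = sum(a - b for a, b in zip(after, before)) / len(before)
result = ttest_rel(after, before)
print(f"mean change = {mean_change:.2f} N, t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```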

In preservation, single object studies can also include a survey or assessment of the potential degree and type of problems found in a library or a library system. In this case, the library itself is conceived of as the single object. The results are intended primarily to provide conclusions needed for making treatment plans within that one library; we’re not trying to generalize to other libraries in general. Here, the observation might be that deterioration is a crucial problem in large libraries where there’s a mix of age and type of collection items. The research question might be: what are the preservation and conservation needs of this particular library? The experimental units would be real books from that collection. The number of replicates you would need is likely to be pretty large, because you can expect a lot of variation if you think about all the different potential types of items in the collection: bound journals, older deteriorated books, new books, lots of different sorts of things.

In order to capture the full range of variation, you’re going to do a lot of replication. You could use a random number generator together with the already-existing call numbers on file. If there are different types of collections, you could do stratified random sampling. Some parts of the library, such as different departments, might be under different environmental conditions; again, you could do stratified random sampling among those. This design could be used for observational types of surveys where you’re qualitatively assessing, for each sampled book, things such as: Is the primary protector intact or not? Are the printed pages intact or not? Does there appear to be environmental damage such as fading or water damage, or not? Is immediate treatment needed? If so, what type? The goal is to get an overall profile of the deterioration present and the preservation needs of this library.
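A minimal sketch of stratified random sampling for such a survey; the collection names, sizes, and sampling fraction are invented:

```python
# Minimal sketch: stratified random sampling of a library collection for a
# condition survey. In practice the strata would come from the catalog
# (call-number ranges, locations, formats).
import random

strata = {
    "bound journals":  [f"J{i:05d}" for i in range(1, 4001)],
    "pre-1950 books":  [f"O{i:05d}" for i in range(1, 2501)],
    "post-1950 books": [f"N{i:05d}" for i in range(1, 8001)],
}

sample_fraction = 0.02   # survey 2% of each stratum
survey = {name: random.sample(items, int(len(items) * sample_fraction))
          for name, items in strata.items()}

for name, items in survey.items():
    print(f"{name}: {len(items)} items selected for condition assessment")
```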

It could include some basic quantitative data such as pH, but basic descriptive statistics, such as calculating percentages of each category of deterioration, are likely to be sufficient for your analysis. The main thing that would be needed for reliable results in this case is to be sure that all of the assessors receive joint training so that you have consistency in the qualitative judgments. The use of randomized selection and high replication for each potential type of library material also means that you’re going to have results that are likely to give you an accurate picture of the preservation and treatment needs of that library. A haphazard selection of books is never likely to give you accurate results, because there are a lot of things that can go wrong. For example, if you were to assign staff to select, haphazardly, 1,000 books, someone might walk along the aisle and pick things that are easily accessible. Others might be drawn to things that have more attractive covers. Somebody else might think they should be selecting the older, worn-out-looking books. Others might be attracted to selecting the newer books. In any case, you cannot assume this selection is going to be truly representative of the amount and type of problems found in the collection.

Now, a conservator is not likely to want to design a true single object study for every object that’s treated, but sometimes you’re going to be working on an important object for which there are at least two treatment options, and you’re going to want to be able to reliably choose between them and have confidence in your results. You may want to feel that you can prove to the owner or curator that, in fact, the best option was indeed chosen. In this case, it could be worth spending the time to define multiple patches, do randomization, and have a more formal test. Strictly speaking, the results of a single object study are only applicable to that object. However, the more reliable conclusions can give you a more solid foundation for planning a treatment for that object. Another advantage is that over time, a series of completed single object studies can lead to a more general conclusion.

Screening experiments are often used in medical research, for example, to do initial screening of a lot of different compounds that might be useful in fighting cancer. Those that appear promising are pulled aside for more intensive, in-depth research. These are very quick tests; you generally don’t do replication in a screening test. They’re useful for forming hypotheses and identifying possible treatments, but they always require follow-up testing with replication before going forward with anything to do with treatments. I’ve tried out a rapid corrosion test that was originally developed in industry as a quick screening test for potential corrosion inhibitors. Here, a small amount of water with an inhibitor is held against the surface of the metal for 24 hours. The ability of the inhibitor to prevent corrosion is assessed visually or quantified by image analysis. Anything that looks promising in this screening could then be used for more in-depth testing and evaluation by a variety of other analytical tests that would take a lot more time and expense.

A somewhat different goal for a screening test is described in a recent Studies in Conservation article written by two conservators. They developed a rapid, simple test for detecting chlorides that can be used during conservation surveys of collections that have a lot of porous material such as stone, ceramics, or unfired clay objects such as the cuneiform tablets we see here. These are situations where a large number of objects have to be evaluated very quickly and at low cost. In the quick test they developed, a small amount of surface dust is transferred with a soft brush from the object to weighing paper, and then into a vial containing a silver nitrate solution. A white precipitate, best seen using your smartphone light, will form if chlorides are present.

This test is intended to quickly confirm the suspicions of an experienced conservator and detect surface chlorides that might sometimes otherwise be overlooked. More precise quantitative tests that take a lot more time and money could be added for a selected smaller number of objects as a backup confirmation test. Having the simple, low-cost, rapid screening test available can improve the conservator’s confidence in their observations or results, and improve decision making in setting treatment priorities.

A treatment trial is another type of study, and the last one we’ll talk about. This one was developed in medicine. Medical and conservation research share the problem that the usefulness of laboratory experimentation is limited. Eventually treatments have to be carefully tested in the context of real practice: on human patients for doctors, and on cultural materials for conservators. This will almost always mean the introduction of additional, uncontrollable variables, but the results of careful experimentation with real subjects are usually a lot more applicable to treatment practice in general than the results of less realistic laboratory tests alone.

In medicine, the physician is the PI, or at least a co-PI, rather than or in addition to a scientist, since only the physician can determine the best treatment for a person, and only the physician can actually perform the treatment on that person. In our case, the person who can best make the treatment decision for an object and carry out the treatment is a conservator; the scientists can’t do that. The medical profession has put a lot of effort into perfecting and refining the optimum treatment trial procedures, and I think we ought to be able to benefit from that experience.

Treatment trials are really a response to ethical concerns of applying treatments. The author of a book on treatment trials in medicine, Clifford Meinert – his book is on your reading list – notes that the history of medicine is filled with drugs, devices and other treatments that were originally announced as great advances, but were later shown through treatment trials to be useless or even harmful. These mistakes include thousands of prescription drugs reviewed by the FDA with about 1/3 of them found through treatment trials to actually be ineffective even though they looked very promising in laboratory and animal studies. For this reason, the medical profession has devoted a lot of effort to developing and refining the concept of the treatment trial. I think that’s really the basis of modern medicine. Today, no drug can be approved by the FDA until it has been documented to be effective and safe in at least two separate clinical trials.

Clinical trials don’t just mean haphazard application of treatments to see what happens. There are certain protocols that must be present for a valid clinical trial. It’s a planned experiment that takes place in a clinical setting rather than a laboratory setting, using multiple patients exhibiting the same medical problem. An important aspect of a treatment trial is that both groups – your treatment and your control – are going to be treated and followed over the same time period. That takes away a lot of uncontrolled factors that might show up if you try to use historical studies. The control can either be no treatment or a standard treatment that’s already known to be effective and safe. A trial has to include enough patients so you can adequately evaluate the results.

Most clinical trials compare two treatments, with an average of about 25 patients in each group. This number is generally practical and doable in terms of time and resources. It’s also very important to clearly define the class of patients, or in our case objects, eligible for the study, so that other researchers can assess whether or not the results apply to their patients or objects. If you restrict eligibility too much – for example, instead of saying, “We’re going to test photographs with mold,” you say, “We’re going to test albumen prints produced in a given 20-year period with a particular species of mold” – it’s going to get more difficult to come up with enough objects in a reasonable period of time, and that might end up requiring a multi-institution trial.

The important components of a treatment trial are things that should all look familiar to you now, because these are the same things that are important components of any other sort of experimental design. You have randomization, blinding, and controls. You’ve described your treatment protocols in detail. You have good measurement protocols that are going to be relevant to testing your treatments. You have adequate sample size and replication. You’re going to be describing your implications and rationales for these at every stage. How the treatment trial is formulated and evaluated is really interesting: it’s constantly reassessed and reevaluated through specialized journals on the methodology of treatment trials in medicine. There are also thick books focused only on clinical trials, and they provide discussion of even the smallest detail of a clinical trial design.

There’s also the Society for Clinical Trials that helps to provide constant evaluation and improvement of methodologies. For example, some issues they consider include the additional design thought that has to go into the process if you have multi-center trials versus single-center trials, when those are needed to increase sample size. With multiple centers and a lot of different people involved, it becomes extremely important to have detailed forms for patient evaluation, with training sessions and pilot studies to make sure everyone is following the same consistent protocol before beginning a larger and more expensive clinical trial. If you think back to that study of milk with schoolchildren that I talked about, if they had done a pilot study before they launched into their study of 10,000 treated children and 10,000 control children, they would have very quickly found that there was a major problem with allowing teachers to undo the randomization and make their own assignments. They could have found that out before doing the entire 20,000-child study.

Books such as these also note that it’s crucial to avoid having to change your forms and procedures midway through the trial. Again, a pilot study will help you refine your procedures. If you didn’t do a pilot study and you’re partway through, it’s better to continue the way you started, because you can always do a follow-up trial in a different way if you changed your mind partway through about what should be done. They also talk about the need to avoid undisciplined data collection, the idea that, “Oh, while we’re at it, why don’t we record everything we can think of to say about each patient or object? Because someday we might want to ask additional questions about the data.” You shouldn’t have too many secondary projects going on. What often happens with clinical trials is that the main question is, say, a trial of cancer cures, but one of the investigators is also interested in using that same set of patients to study the relationship between smoking and baldness, and another wants to look at coffee drinking and high blood pressure. This very quickly becomes too complex, and you lose track of the original questions and data that you were trying to collect.

You also don’t need to worry too much about whether you’re going to have sophisticated research analyses. If the research questions can be answered through simple observations and simple measuring techniques, that’s fine. What’s important is good design, appropriate treatment protocols and outcome measures, and clear analysis and results. Basically, we’re looking at the same sorts of issues that are needed in any experiment: a simple, elegant one that gives a clear result is always best, and overcomplicating things by trying to do too much in one experiment tends to produce a lack of clarity. This just becomes more important when you have multi-center treatment trials where many researchers may be involved.

Now, there are obstacles to applying this in conservation. In fact, I’m not personally aware of any true treatment trials having been carried out in conservation and preservation. I’ve seen a few studies labeled treatment trials, but these have not contained the research design elements that I just mentioned. One problem may be funding, because true treatment trials are usually going to be multi-year projects. If they require participation and coordination of multiple centers so that you can enroll enough objects into the project, that requires even more funding. This level of funding is harder to come by in conservation than it is in medical research. The NIH and drug companies will fund large-scale multi-year treatment trials, but that kind of funding is difficult to find in conservation. Still, I think treatment trials could be a very useful tool to add to the experimental design arsenal of cultural heritage conservation and preservation.

This concludes our survey of scientific method and experimental design in preservation and conservation research. Thank you for your attention. I hope to be reading about many more wonderful research designs in the future, coming from you. I also want to thank NCPTT again for hosting this webinar and making it available. If there are any additional questions-

SCIENTIFIC METHOD AND EXPERIMENTAL DESIGN IN PRESERVATION AND CONSERVATION RESEARCH

Power Points
ExpDesignWebinarPart1
ExpDesignWebinarPart2
ExpDesignWebinarPart3_and4

Practice Handouts
HypothesisBrainstormingPractice
RandomizationPractice1
RandomizationPractice2
RandomNumberTable

Webinar References
Webinar_References

Instructor: Chandra L. Reedy
Recorded from a webinar broadcast on July 14, 2016 from the laboratories of the National Center for Preservation Technology and Training (NCPTT)

Introduction

PART I — SCIENTIFIC METHOD
Platt’s analysis of progress in scientific research
Thinking: Fermi and Pasteur
Steps in the research cycle
Multiple hypotheses and creativity

PART II — BASIC CONCEPTS OF EXPERIMENTAL DESIGN
Advantages of formal experimental designs
Object protocols
Selection methods
Real or facsimile
Replicates
Measurement protocols
Variable types
Repeated readings
Repeated measurements
Avoidance of bias through randomization and blinding
Treatment protocols
Selection of test factors
Controls
Randomization
PART III — MULTIPLE OBJECT DESIGNS
Using a design check sheet and research design flow chart
Case studies of multiple group designs in conservation

PART IV — SINGLE OBJECT STUDIES; DESIGN VARIATIONS
Introduction to concept of single object studies
Case studies of conservation applications of single object studies
Screening experiments
Treatment trials
