Spencer Bliven

Thoughts and Research

TV Oscilloscope Part 2

January 16, 2013 | Posted in Engineering,Technology

To further test out my TV oscilloscope, I hooked it up to the output from a small stereo. Changes in dynamics and instrumentation were very apparent in the songs I listened to.

Oscilloscope Music 1 Oscilloscope Music 2
Oscilloscope Music 3 Oscilloscope Music 4


I find the blue start and red tail of the trace interesting (clearest in the upper right image). Perhaps the red phosphors energize slightly faster, while the blue phosphors persist slightly longer? It could also have something to do with slightly different path lengths from each electron beam to the peripheral pixels due to the way color TVs work.

Another cool trick is that you can calculate the shutter speed I shot these photos at. They all have about 7.5 scans visible in the photo, giving a shutter speed of 7.5 scans / 60 Hz = 1/8 second. You can check that yourself against the EXIF data in the photos.
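
The same arithmetic as a tiny Python sketch. The function name is made up for illustration, and the 60 Hz sweep rate is the nominal figure used above rather than a measured one.

```python
from fractions import Fraction

def exposure_time(visible_scans, scan_rate_hz):
    """Exposure time in seconds if `visible_scans` sweeps were captured."""
    return Fraction(visible_scans) / Fraction(scan_rate_hz)

# 7.5 visible sweeps at the nominal 60 Hz vertical rate
t = exposure_time(Fraction(15, 2), 60)
print(t)  # 1/8
```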

Finally, if you look closely you can see that the traces are composed of alternating white and black dots. I think that the black parts represent the horizontal blanking interval, when the electron beam gets shut off between lines. If that is the case there should be 241 dots per trace, which doesn’t seem too far off from my quick estimate.

TV Oscilloscope

January 9, 2013 | Posted in Engineering

Someone left an old CRT TV on the laundry room swap table, so I decided to try and hack it. Following this instructable, I converted it into an extremely basic oscilloscope in about 30 minutes. Basically, you just disconnect the horizontal scan (which is too fast for most inputs), connect the vertical scan to the horizontal coil (since oscilloscopes normally have time along the x axis), and snake some wires from the vertical coil out of the case as an input. Setting the channel to static noise gave a nice bright line.

Oscilloscope with 0V DC

I next tried hooking up some DC. Four AA batteries gave a good signal, but I haven’t measured how much current they’re supplying. Tapping the leads together sporadically gives some cool traces as the current starts and stops:

Trace while connecting a 6V battery


Closeup of traces from 6VDC spikes. Note oscillations towards the right of the screen.



The horizontal oscillations near the right of the screen were unexpected. It would seem to indicate that the vertical oscillator (now driving the horizontal motion) has some feedback near the end of the trace. The TV didn’t show any distortion before I took it apart, so I’m not sure where this jiggle is coming from.

Another thing that one should be able to deduce from the above photo is the direction of the x-axis. All the spikes have very sharp left sides, with more gradual declines on the right. Although I’m not entirely sure about this, my intuition is that briefly tapping the leads together would result in a rapid surge as the leads contact, followed by a slower drop-off as the circuit breaks due to the energy stored in the magnetic field of the coil. If that’s the case, the oscilloscope is correctly wired with time increasing to the right.

The last thing I tried was attaching a small 12V AC transformer I salvaged. This results in a single period of a sinusoid. The AC should oscillate at 60 Hz, so a single trace must take 1/60 sec. I could also have found this out by looking up the NTSC field rate of 59.94 Hz. Had I left the horizontal oscillator attached, I would have had a 15750 Hz oscillation (30 fps × 525 lines/frame), or about 64 µs per trace. The phase of the wave drifts slowly because of the small difference between the NTSC field rate and the mains frequency.
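
Here is a rough Python sketch of those timing numbers. It assumes the vertical oscillator runs at exactly the color NTSC field rate (60000/1001 Hz) and that the mains is exactly 60 Hz; a free-running set showing static need not match either figure, so treat the drift estimate as an order of magnitude only.

```python
from fractions import Fraction

LINES_PER_FRAME = 525
NOMINAL_FRAME_RATE = 30                    # frames/s, the round number used above
COLOR_FIELD_RATE = Fraction(60000, 1001)   # ~59.94 fields/s (assumed exact)
MAINS_HZ = 60                              # assumed: exactly 60 Hz mains

# Horizontal line rate and the time for one horizontal sweep
line_rate = NOMINAL_FRAME_RATE * LINES_PER_FRAME   # 15750 Hz
line_period_us = 1e6 / line_rate                   # about 63.5 microseconds

# The sine wave slips one full cycle across the screen every 1/beat seconds
beat_hz = MAINS_HZ - COLOR_FIELD_RATE              # 60/1001 Hz
drift_period_s = float(1 / beat_hz)                # about 16.7 seconds

print(line_rate, round(line_period_us, 1), round(drift_period_s, 1))
```

Under these assumptions the wave would take roughly 17 seconds to crawl once across the screen.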

With 12V 60Hz AC

As an oscilloscope, this setup leaves much to be desired. It needs circuitry for changing the scan frequency, triggering scans, amplifying signals, etc. However, it should be cool for visualizing music. I have a couple old iPod speakers lying around, so hopefully I can cannibalize one for an amplifier. That should give some cool videos for a future post!

ISMB 2012

ISMB Badge

My badge for ISMB, showing admission to 3DSIG and BOSC, as well as ISMB membership and Open Access pride.

I just got back from a fantastic ISMB 2012. The talks were very good this year, and it was fun to see many people I met last year in Vienna again.

I presented three documents at the conference:

I went to talks at BOSC, 3DSIG, and ISMB. Here are my favorites:

  • Jonathan Eisen Science Wants to Be Open: If Only We Could Get Out of Its Way (BOSC)
    Great talk about the history of PLoS and the barriers to Open Access. Eisen calls for hiring and tenure committees to actually read papers, rather than making snap judgements based on journal names.
  • Titus Brown Doing Next-gen Sequencing Analysis in the Cloud (BOSC)
    Titus manages to speed up, reduce memory, and improve accuracy of genome assembly by throwing away unneeded reads.
  • Chris Sander Decoding genetic variation to compute 3D structures of proteins (3DSIG)
    Sander talks about his new EVFold method for predicting structure from multiple alignments. It’s a very exciting method with the potential to really leverage nextgen sequencing for structure prediction.
  • John-Marc Chandonia The next generation of SCOP and ASTRAL (3DSIG)
    SCOP has now given up on fully manual classification and started doing automated classification of close homologues, called SCOP 1.75A. They’re also working on SCOP 2.0, which will change from a structure hierarchy to a DAG representing both evolution and structural similarity, add hyperfamilies, show evolutionary clades, and allow tagging with other info like Uniprot IDs.
  • Ada Yonath What was first? The genetic code or its products (3DSIG)
    Ada talked a bit about her Nobel Prize work on the ribosome. I was particularly interested to hear how symmetry is essential at the ribosomal catalytic site. However, the most thought-provoking bit of her talk was speculation about the RNA-only protoribosome, which she thinks was a small, symmetric RNA dimer.
  • Sheng Wang Protein structure alignment beyond spatial proximity (3DSIG)
    DeepAlign is a good-looking protein alignment algorithm that uses both sequence and structural similarity in its scoring term. It also includes orientation information, so it’s better at avoiding register errors in sheets and helices. Oh, and it’s as fast as TM-Align! Jianzhu Ma from the same lab gave a good talk at ISMB about using DeepAlign for protein threading, so it’s practically useful.
  • Saliha Ece Acuner Ozbabacan Enriching the human apoptosis pathway by predicting the structures of protein-protein complexes (ISMB)
    Haiyuan Yu Understanding human disease through 3D protein interactome network (ISMB)
    Yu Xia A three-dimensional map of protein networks within and between species (ISMB)
    Several ISMB talks focused on using structural information about protein-protein interactions. Saliha Ece Acuner Ozbabacan introduced a pretty flexible method for modelling PPI structures based on structurally similar interfaces, although I wonder about the false positive rate. Haiyuan Yu showed that by localizing mutations to a particular protein interface we can much more accurately predict downstream effects and pathology, as opposed to just binning them by gene. Yu Xia compared viral-host interfaces with host-host interfaces and found significant overlap. Furthermore, the overlapping regions are rapidly evolving, suggesting a race between viruses creating inhibitors and hosts trying to avoid them. The increasing availability of protein complex structures and high-quality models seems to have stimulated some very interesting biology in the last few years.
  • Karen Lasker Molecular architecture of the 26S proteasome holocomplex determined by an integrative approach (ISMB)
    Andrej Šali Integrative Structural Biology (ISMB)
    Crystal structures are known for some subunits of the proteasome, but the whole complex seems to be too flexible to crystallize. So instead they took a low-resolution EM structure of the whole complex and combined it with high-resolution monomers, PPI information, and cross-linking data to get the complete structure. Karen talked about the proteasome in depth, while Andrej sold us on the integrative method as the next great thing for solving big biological structures. Phil has a planning grant relating to visualizing big hybrid structures, so we’re hoping to see more big complexes solved like this in the future.
  • Carl Kingsford Uncovering Ancient Networks from Present-Day Interactions (ISMB)
    Carl has an interesting method for creating plausible ancestral networks (e.g. the PPI network of LUCA/LUCE). Unfortunately it uses a pretty naive model of evolution, so the results can’t be taken too literally. Maybe this could be applied to structural similarity networks, given a smarter evolutionary model?
  • Arun Konagurthu Minimum Message Length Inference of Secondary Structure from Protein Coordinate Data (ISMB)
    The problem with secondary structure assignment is that no one can agree on exactly the same definition. Although it may be a little too complex for most biologists, Arun proposes an interesting definition based on the concept of compression, of all things. It’s a compelling mathematical definition, although I’ll have to see more examples to know if it’s really any good.
  • Alex Bateman Assessing the contribution of scientists to Wikipedia for Pfam and Rfam annotation (ISMB)
    Long before Topic Pages started, Alex started utilizing Wikipedia to crowd-source RNA and protein family annotation. He’s now compiled some fascinating data on how Rfam/Pfam articles grow, finding that they tend to be expanded gradually by wikipedians, punctuated by large, short increases when a scientist or annotator takes an interest. He also notes that professional curators are still essential for maintaining the quality and organization of the annotations, even if the brunt of the workload is distributed across the community.
  • Michal Linial Viral-host coevolution: Playing ‘seek and hide’ (ISMB)
    Viruses often steal genes from their hosts. Michal looks at such cases and finds a couple ways viruses modify the proteins. Because of the pressure viruses face for small genome sizes, they tend to shed unnecessary parts of the genes, such as domain linkers or even whole domains. This brings up some very interesting open questions about the function of those domains to the virus. Maybe they can compensate for domain loss by using symmetry and homomers? It would be interesting to look!
  • Fantastic (I thought) talks by my colleagues and collaborators:
    Andreas Prlić How to Use BioJava to Calculate One Billion Alignments at the RCSB PDB website (BOSC)
    Andreas Prlić Internal Pseudo-symmetry in Proteins (ISMB)
    Peter Rose Efficient Searching and Mining of the RCSB Protein Data Bank (ISMB)
    Philip Bourne Hiring and Supervising (ISMB)
    Lei Xie A structural systems biology approach to polypharmacological drug discovery (ISMB)
    Andreas talked about our research on symmetry and protein alignments. He has some really intriguing examples of why protein symmetry is important in his slides. Peter highlighted changes to the PDB over the last year, and Phil drew from his “Ten simple rules” series for a professional development talk. It was really nice to see Lei again, and hear about his latest research with his new lab at Hunter College.

The conference was fun too. It was my first time in Long Beach, and I grew to like the downtown area. The colored lights that project everywhere from fountains to bus stops at night are a little kitsch, but the restaurants and nightlife are nice. The Monday-night reception at the Aquarium of the Pacific was phenomenal—I want to go back and spend some more time with the bat rays! They have a couple cool webcams set up in lieu of visiting.


Topic Page update

April 30, 2012 | Posted in Science

Two weeks ago our Wikipedia article on Circular permutation in proteins was featured on Wikipedia’s main page as a “Did you know…” article:

Did you know... that the protein Concanavalin A (pictured) cuts itself in two and then reassembles in a circularly permuted order?

We were quickly replaced by the Rice stink bug, but a snapshot of the page Wednesday morning is available from WebCitation.

Submitting an article for DYK is a bit of a hassle: think up a good hook, make sure your article is up to snuff, find someone else to review your hook favorably… However, the results of getting on the main page are stunning. Here’s the number of page views for Circular permutation in proteins this month:

daily page views

Daily page views for Circular permutation in proteins in April 2012. The page was featured on DYK on the labelled day.

Prior to our update, the page was seeing under 20 views per day. Since then, we’ve mostly been getting 20-60 views a day, with a few spikes probably due to attention from blogs. But being featured on the main page for just 9 hours resulted in 1440 hits.

Topic Pages were also featured in Daniel Mietchen’s new post to the PLoS blog, Bridging the Journal-Wikipedia Gap.

Update: In the interest of full disclosure, the hook was badly phrased. Concanavalin does not really cut “itself” in two, as the full procedure requires a restriction enzyme to make the cuts. This is a flaw with the DYK hook rather than the article itself, which merely describes the procedure as an “unusual protein ligation.”

The First PLoS Comp Bio Topic Page

March 29, 2012 | Posted in General

Last summer, my boss Phil Bourne brought up an interesting problem for science. Wikipedia is the first place most people look for information, and yet most scientific research can only be found in scientific journals. Experts rarely have time or desire to move this information onto Wikipedia, where it will be accessible to the general public. Phil credits this to a lack of incentives for academics—the pressure to publish in peer-reviewed journals leaves no time for other kinds of writing.

His solution as Editor-in-Chief of PLoS Computational Biology is to start publishing Wikipedia pages. Or rather, to publish a new kind of article called a Topic Page that is suitable for broad audiences and will be published simultaneously on PLoS Comp Biol and Wikipedia. I thought this was a great idea, and Andreas Prlić and I volunteered to be guinea pigs.

Today, I am happy to announce the publication of the first Topic Page, “Circular Permutation in Proteins”. It is peer reviewed and indexed by PubMed. It also appears as the Wikipedia page Circular permutation in proteins. In the past few hours there have already been edits by 5 people. I am excited to see how the Wikipedia community will improve my humble article. The PLoS PDF looks nice, but it is already outdated. In contrast, the Wikipedia version feels very much alive.

It was fun to be a part of the process of starting a new method of publishing. One of the problems we had is that while PLoS Comp Biol is an open-access journal, it has a slightly different license from Wikipedia. So I had to set up a wiki of our own on which to draft Topic Pages. I now know far more about administering a website than I did from my comparatively simple other websites (such as this blog). I also liked drafting a paper on a wiki. It’s easy to collaborate with someone else (although not real-time, like Google Docs), and the markup syntax is powerful but intuitive. I would consider drafting other scientific papers on wikis in the future.

Finally, here are a few links if you found this interesting:

The Statistics of Monopoly

February 22, 2012 | Posted in Math

American Edition Monopoly board. (Source: wikipedia)

When we were kids, my siblings and I each had our own favorite properties when playing Monopoly. My brother always went for the hotel on Boardwalk. My favorite was St. Charles Place. It didn’t cost as much to develop, and I was convinced that people tended to land on St. Charles Place a disproportionate amount of the time. I’d be delighted when I managed to purchase St. Charles, and every time someone landed on that square it would reinforce my conviction that St. Charles was the most profitable property in Monopoly.

Armed with a college-level understanding of statistics, I thought I would test my childhood hypothesis that the probability of landing on St. Charles is higher than for other properties. The movement aspect of Monopoly can be modeled as a Markov process, which just means that the probability of landing on a square depends only on what square you were on last turn. Let \mathbf{x}_t be a row vector of length 40 giving the probability that a player will land on each square after t rolls. All players start on GO!, so

\mathbf{x}_0 = \left[ 1, 0, 0, \dots, 0\right].

Dice Movement

Now define transition probabilities for moving from one square to another. Ignoring things like chance cards, Monopoly uses the sum of two dice to move, so the probability of moving from one square to another is a triangle distribution with a peak 7 squares forward. For instance, the probability of moving to each square if you were previously on GO! is

 \left[ 0, 0, \frac{1}{36}, \frac{2}{36}, \frac{3}{36}, \frac{4}{36}, \frac{5}{36}, \frac{6}{36}, \frac{5}{36}, \frac{4}{36}, \frac{3}{36}, \frac{2}{36}, \frac{1}{36}, 0, 0, \dots \right].

Here’s some MATLAB code to generate the full transition matrix. D(i,j) gives the probability of moving from square i to square j after rolling the dice.

% Initialize transition matrix for dice movement
D = zeros(40);
D(1,3:13) = [1:6, 5:-1:1]/36;
for i = 2:40,
    D(i,:) = circshift(D(i-1,:),[0 1]);
end

We can calculate the probability of being on a particular square after t turns by repeatedly multiplying the current state vector by the transition matrix.

 \mathbf{x}_t = \mathbf{x}_{t-1}D = \mathbf{x}_0 D^t

The probability of landing on each square during the first 50 turns. The board is oriented as above, with GO! in the lower right corner. Pure white indicates a probability of 1, while 50% gray corresponds to a probability of 0.025, which is average for a 40-square board.

Applying this for a few steps, it appears that all squares rapidly become equally probable. But would they eventually all have probability 1/40=.025? This can be checked by examining the eigenvectors of the transition matrix. The eigenvector corresponding to \lambda=1 gives the probabilities of landing on each square after an infinite number of rolls.

%% Find steady state
% Dominant left eigenvector of D (eigenvalue 1), i.e. the stationary distribution
[V, l] = eigs(D',1,'lm');

xss = V'/sum(V); % x at steady state, i.e. t->infinity

Running this shows that all the squares do indeed have probability 1/40 at t=\infty. So if all movement were determined by dice roll, all spaces would be equally likely to be landed upon.
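
For readers without MATLAB or Octave, here is a minimal pure-Python sketch of the same check. It is not the code used for this post: it swaps the eigenvector calculation for plain repeated multiplication (power iteration), uses 0-based squares with GO at index 0, and the 300-turn cutoff is an arbitrary "long enough" choice.

```python
N = 40  # squares on the board; index 0 is GO (the MATLAB code is 1-based)

def dice_matrix(n=N):
    """D[i][j] = probability of moving from square i to j with two dice."""
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for a in range(1, 7):
            for b in range(1, 7):
                D[i][(i + a + b) % n] += 1 / 36
    return D

def step(x, M):
    """One turn: row vector x times matrix M."""
    n = len(x)
    return [sum(x[i] * M[i][j] for i in range(n)) for j in range(n)]

D = dice_matrix()
x = [1.0] + [0.0] * (N - 1)   # everyone starts on GO
for _ in range(300):          # arbitrary number of turns, chosen to be plenty
    x = step(x, D)

# Largest deviation from the uniform value 1/40; this should be tiny
max_dev = max(abs(p - 1 / N) for p in x)
print(max_dev)
```

The uniform result is no accident: every row of D is the same triangle distribution shifted by one square, so every column also sums to 1, and the uniform vector is left unchanged by a turn.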

Jail and Chance cards

Now let’s see what happens when all the details of Monopoly are included in the model. There are a couple of other ways players get moved around:

  • The dreaded Go to Jail space
  • Community Chest. Most of the 16 community chest cards deal with money, but one sends you to jail.
  • Chance. 10 of the 16 chance cards move the player around the board.

This type of movement can be modeled as a second transition matrix, which gets applied after each dice roll. Most squares leave the player where they ended up. Landing on ‘Go to Jail’ has a 100% probability of sending you to the jail spot. Community chest has a 1/16 chance of sending you to jail, and a 15/16 chance of remaining on the same spot. The Chance squares are complicated, but we can work out the transition probabilities from Chance squares too.

% Probabilities for non-dice movement.
ND = eye(40);

% Go directly to jail
ND(31,:) = 0; % 'Go to Jail' is square 31
ND(31,11) = 1; % 'Jail' is square 11

% Community Chest cards: 16 total
% Cards are assumed to be drawn uniformly at random with replacement.
% 15 Unchanged
chestSquares = [3, 18, 34];
ND(chestSquares,:) = ND(chestSquares,:)*15/16;
% 1 Go directly to Jail
ND(chestSquares,11) = ND(chestSquares,11)+1/16;

% Chance card calculations-download code for details

% A turn consists of a dice roll and then some non-dice movement
A = D*ND;
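
As a hedged illustration, here is a stripped-down Python version of the combined model. It is a sketch, not the code behind the results in this post: it includes only ‘Go to Jail’ and the single Community Chest jail card, leaves out the Chance cards entirely, and finds the steady state by repeated multiplication. Squares are 0-based with GO at 0, and the three Community Chest positions are those of the standard US board.

```python
N, GO_TO_JAIL, JAIL = 40, 30, 10   # 0-based squares, GO = 0
CHEST = (2, 17, 33)                # Community Chest squares on the US board

def dice_matrix():
    """Transition matrix for the sum of two dice."""
    D = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for a in range(1, 7):
            for b in range(1, 7):
                D[i][(i + a + b) % N] += 1 / 36
    return D

def non_dice_matrix():
    """Identity, except for 'Go to Jail' and the Community Chest jail card."""
    ND = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    ND[GO_TO_JAIL] = [0.0] * N
    ND[GO_TO_JAIL][JAIL] = 1.0
    for s in CHEST:
        ND[s][s] = 15 / 16          # 15 cards leave you in place (simplified)
        ND[s][JAIL] = 1 / 16        # 1 card sends you to jail
    return ND

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def steady_state(A, turns=500):
    """Approximate the stationary distribution by iterating from GO."""
    x = [1.0] + [0.0] * (N - 1)
    for _ in range(turns):
        x = [sum(x[i] * A[i][j] for i in range(N)) for j in range(N)]
    return x

A = matmul(dice_matrix(), non_dice_matrix())  # dice roll, then relocation
xss = steady_state(A)
print(xss.index(max(xss)) == JAIL)  # Jail is the most likely square
print(xss[GO_TO_JAIL])              # 0.0: a turn never ends on 'Go to Jail'
```

Because Chance is omitted, the numbers from this sketch will not match the steady-state results reported for the full model.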

By including non-dice movement, some squares become much more likely than others. For instance, it is impossible to end a turn on ‘Go to Jail’. Squares such as Railroads are more likely, since players who land on Chance could be sent there.

Probability of ending a turn on each space, including non-dice movement.

Steady state probabilities for each square using the full model. The red line shows the mean probability of 0.025, and a few notable squares are labelled.

Rank  Square            Probability
1     Jail              0.0570
2     Illinois          0.0317
3     B&O Railroad      0.0304
4     New York          0.0301
5     Reading Railroad  0.0301
6     Water Works       0.0294
7     Community Chest   0.0289
8     Tennessee         0.0286
9     Free Parking      0.0282
10    Kentucky          0.0278
11    St. Charles       0.0275

Conclusion

It turns out that St. Charles Place was not a particularly good space, since the probability of landing there is only 0.0275, slightly above average. I would have been much smarter to try to buy Illinois, which gets landed on significantly more often than average.

So what’s the optimum strategy for playing Monopoly? Looking at steady state probabilities of landing on each square gives a hint to this, but doesn’t capture the full complexity of the game. It doesn’t factor in the costs and revenues for each property, nor can it provide advice on trading, selling, or improving properties. Finally, looking at steady state probabilities can be deceiving since the game starts far from steady state. For instance, Vermont Ave has a lower than average chance of being landed on. However, it is almost 50% more likely to be landed on 5 turns after starting from GO! than it would have been if the players began already spread out. Sometimes extra revenue early in the game can translate to a big advantage later on.

Now, who wants to play Monopoly?

Code

Calculating the full transition matrix has a lot of cases, so I didn’t include the code here. If you’re curious, check out the full code from Bitbucket. It was tested on both MATLAB and the free alternative, Octave. The main script is called ‘MCMonopoly.m’.

DNA font

February 10, 2012 | Posted in Science,Technology
ATGCXYR

A short DNA 'word' showing the four bases A, T, C, and G, the 'unknown' base X, and the 'pyrimidine' and 'purine' characters Y and R.

I recently downloaded the DejaVu font family and discovered the open-source typography community. I thought it would be fun to try to make a font myself. Since much of my time is spent looking at biological sequences, I thought a way to visualize DNA molecules in a text editor would be cool. The result: DNA Type. DNA only contains a few letters, so I’ve only made 7 characters so far. However, you can use it to write secret messages to your fellow bio-nerds, as long as the messages contain only A, T, G, C, Y, R, and X! The message is read off from the top strand.

Spencer, what’s your favorite instrumental electronic band?
RATATAT
What was the name of that movie which predicted the infringement of human rights based on genetic predisposition, a possibility which seems very plausible in today’s high-throughput sequencing world?
GATACA

DNA is cool, but I’m really more of a protein guy. With a protein font you can make meaningful amino acid glyphs for the whole alphabet. However, polypeptides are usually displayed alternating up and down. I think this is possible with glyph variants, but I need to learn more about the TrueType format and the FontForge program before attempting something so complex.

Download

Version 0.
Download TrueType font
Download FontForge source file

Known bugs

  • Pretty much unreadable at smaller than 48pts.
  • Only 7 characters. Not even the lowercase works.
  • No hinting for small sizes beyond FontForge’s autohints.
  • Doesn’t work for RNA. No one cares about the difference between Uracil and Thymine, anyway.
  • Ugly glyphs. What are you, an artist? Go back to research.

Arduino IDE keywords

January 18, 2012 | Posted in Arduino,Technology

The other day I made my first library (a 7-segment display controller) for my new Arduino Uno, following two nice tutorials. They both mention that it’s a good idea to make a keywords.txt file for new libraries, which gives hints to the Arduino IDE’s syntax highlighter. However, neither gives a thorough explanation of the format of that file. I thought I would document my findings.

The built-in keywords are defined in a simple text file. On my computer, this lives at /Applications/Arduino.app/Contents/Resources/Java/lib/keywords.txt. Here’s how it starts:

# LITERAL1 specifies constants

HIGH	LITERAL1	Constants
LOW 	LITERAL1	Constants

The interesting thing here is that there are three fields which get parsed. Only the first two are useful.

  1. The keyword to highlight
  2. The type of keyword it is. This really just determines the color, but most people seem to use the following convention:
    • KEYWORD1 Classes, datatypes, and C++ keywords
    • KEYWORD2 Methods and functions
    • KEYWORD3 setup and loop functions, as well as the Serial keywords
    • LITERAL1 Constants
    • LITERAL2 Built-in variables (unused by default)
  3. Documentation page. This is used by the ‘Help > Find in Reference’ menu item. For example, the reference for HIGH in the example above would be file:///Applications/Arduino.app/Contents/Resources/Java/reference/Constants.html.
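
To make the three-field layout concrete, here is a small Python sketch that reads such a file into a dictionary. It assumes the fields are separated by tab characters, as in the excerpt above, and that lines starting with # are comments; the function name is made up, and the IDE’s own parser may well behave differently in edge cases.

```python
def parse_keywords(text):
    """Parse keywords.txt content into {keyword: (kind, reference_page)}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = [f.strip() for f in line.split("\t")]
        keyword = fields[0]
        kind = fields[1] if len(fields) > 1 else ""
        reference = fields[2] if len(fields) > 2 else ""
        entries[keyword] = (kind, reference)
    return entries

# The opening lines of the built-in file, as quoted above
sample = (
    "# LITERAL1 specifies constants\n"
    "\n"
    "HIGH\tLITERAL1\tConstants\n"
    "LOW \tLITERAL1\tConstants\n"
)
print(parse_keywords(sample))
# {'HIGH': ('LITERAL1', 'Constants'), 'LOW': ('LITERAL1', 'Constants')}
```

A library’s own keywords.txt can be checked the same way before dropping it into the library folder.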

By default, Arduino 1.0 colors all the KEYWORD types orange, and all the LITERAL types blue. These defaults are set in the /Applications/Arduino.app/Contents/Resources/Java/lib/theme/theme.txt. Here’s the relevant snippet (the comments seem to be inaccurate or outdated):

# TEXT - KEYWORDS

# e.g abstract, final, private
editor.keyword1.style = #cc6600,plain

# e.g. beginShape, point, line
editor.keyword2.style = #cc6600,plain

# e.g. byte, char, short, color
editor.keyword3.style = #cc6600,bold


# TEXT - LITERALS

# constants: e.g. null, true, this, RGB, TWO_PI
editor.literal1.style = #006699,plain

# p5 built in variables: e.g. mouseX, width, pixels
editor.literal2.style = #006699,plain

Just change any of the hexadecimal colors. I like the following:

editor.keyword1.style = #cc6600,plain
editor.keyword2.style = #993300,plain
editor.keyword3.style = #993300,bold
editor.literal1.style = #006699,plain
editor.literal2.style = #0099CC,plain

If loop and setup aren’t showing up bold, you may be using Monaco, which doesn’t have a bold style. I recommend using another fixed-width font which does have a bold style, such as DejaVu Sans Mono. This can be set in the Arduino preferences file, ~/Library/Arduino/preferences.txt:

editor.font=DejaVu Sans Mono,plain,10
editor.antialias=true

Make sure the Arduino IDE is not running, as it overwrites the preferences file upon exit.

EVfold

For our weekly journal club I talked about a new method for de novo protein folding called EVfold. [Slides] Details can be read in the paper (plus a 15-page supporting text):

Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., & Sander, C. (2011). Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE, 6(12), e28766. doi:10.1371/journal.pone.0028766

The authors are motivated by two observations:

“In spite of significant progress in the field of structural genomics over the last decade [20], only about half of all well-characterized protein families (PFAM-A, 12,000 families), have a 3D structure for any of their members [1].”
“As we are about to reach a truly explosive phase of massively parallel sequencing, we anticipate increased coverage of sequence space for protein families by several orders of magnitude, well above the level of 1000–10000 non-redundant sequences for protein family and with rich evolutionary information about protein structure directly from sequence.”

Basically, DNA sequencing is dirt cheap and will only get cheaper, but up until now this hasn’t been helping to solve protein structures.

Marks et al. try to remedy this situation by looking at co-evolving residue pairs. Basically, they hypothesize that residues which are located close together in 3D space will tend to evolve together. If one mutates to a smaller residue, the other will tend to mutate to something bigger to compensate. If one changes from positively charged to negative, the other will change from negative to positive to balance it out. The idea behind EVfold is to identify co-evolving residues from the thousands of sequences we have for some protein families, then use that information to provide distance constraints in order to predict the protein’s structure.

Of course, just because two residues co-vary doesn’t necessarily imply they are spatially close. They could indirectly influence each other, such as if both bind to a ligand or both bind some intermediate residue. So the authors use a technique called direct coupling analysis (DCA) to predict which residues are close together. This has been around for a few years (Weigt et al. (2009). PNAS, 106(1), 67–72), although that’s not immediately clear from the paper. DCA assigns a quantity called direct information (DI) to each pair of residues, which correlates really well with whether the pair is close together.
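
To give a feel for what “co-varying columns” means, here is a toy Python sketch that scores pairs of alignment columns by plain mutual information. This is explicitly not DCA or the DI score: mutual information is a simpler, purely pairwise measure that cannot separate direct from indirect coupling, which is exactly the limitation DCA is meant to address. The four-sequence alignment is made up for illustration.

```python
from collections import Counter
from math import log2

def mutual_information(msa, i, j):
    """Mutual information (bits) between columns i and j of an alignment."""
    n = len(msa)
    fi = Counter(seq[i] for seq in msa)             # residue counts, column i
    fj = Counter(seq[j] for seq in msa)             # residue counts, column j
    fij = Counter((seq[i], seq[j]) for seq in msa)  # joint pair counts
    return sum(
        (c / n) * log2((c / n) / ((fi[a] / n) * (fj[b] / n)))
        for (a, b), c in fij.items()
    )

# Toy alignment (made up): columns 0 and 2 swap charge together (K/E vs E/K),
# while column 1 varies independently of both.
msa = ["KAE", "KGE", "EAK", "EGK"]
print(mutual_information(msa, 0, 2))  # 1.0
print(mutual_information(msa, 0, 1))  # 0.0
```

Real alignments need thousands of sequences plus corrections for sampling bias and phylogeny before any such score means much.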

Marks et al. figure S2c. Grey regions indicate residues of Ras protein which are close together in the crystal structure, while red dots indicate pairs which were predicted to be close based on DI.

EVfold takes the top-ranked residue pairs and assumes they are close together. It then uses those pairs as distance constraints to solve the structure. This is identical to using distance constraints from NMR to solve a structure, and uses well-known simulated annealing/molecular dynamics algorithms. At the end, you get lovely protein structures with 3-5Å RMSD from the crystal structure.

Marks et al. figure 2. Predicted (left) and observed (right) structures for three proteins. A few minor differences are visible, such as missing beta-strands, but all three predictions are correct overall.

Perhaps the most impressive fact about this is that EVfold is able to predict a structure in less than an hour from only sequence information. That is incredible compared with the days of supercomputer time needed for other ab initio methods like ROSETTA.

So has EVfold solved the structure prediction problem? Hardly. There are many proteins where finding 1000+ homologous sequences will be hard, even with advances in sequencing technology (vertebrate-only proteins, for instance). Also, the authors suggest that even with perfect distance constraints the simulated annealing methods will not be able to predict structures at less than 2Å. So major advances at refining structures are needed before the crystallographers will be out of a job.

Still, there are lots of applications for which 3-5Å models of widespread folds would be useful. For instance, one of the major difficulties I’ve run into in my work on fold space is that we know there are thousands of proteins which are dissimilar to all known structures. Do these represent new folds, or are they just more variants of existing known folds? The speed of EVfold means that it should be fairly easy to predict structures for all of these domains which have enough sequence information out there. That’s not as good as having experimentally determined structures for everything, but it could give us some intriguing insights into the completeness of protein fold space.

PDB40

November 22, 2011 | Posted in Science

I had a great time over Halloween at PDB40 at Cold Spring Harbor Laboratory, celebrating 40 years of protein structure. I especially liked the talks from older structural biologists about the history of structural biology. It’s amazing to see the difficulties the early structural biologists overcame to solve and analyze proteins, as well as how far we’ve come since Kendrew solved the first protein structure in 1958. Still, I feel a little sad that I missed the days when proteins were solved like this:

John Kendrew with model of myoglobin in progress. © MRC Laboratory of Molecular Biology

A Richards Box, which used mirrors to visually overlay a Kendrew model on top of hand-drawn electron density maps. This public domain image comes from Proteopedia and depicts Fred Richards' original box.

and diagrams of proteins were works of art:

Ribbon schematic (hand drawn & colored, in 1981) of the 3D structure of the protein triose phosphate isomerase. The barrel of 8 beta-strands is shown by green arrows and the 8 alpha-helices as brown spirals. By Jane Richardson.

I presented an updated poster with my work on Fold-space [PDF, 3.3MB].