[medsbio-l] draft expanded goals for MEDSBIO

Herbert J. Bernstein yaya at bernstein-plus-sons.com
Sat May 20 16:38:36 BST 2006


The following is a draft of the goals for MEDSBIO expanded to
help clarify the technical focus of the group.  Specific
citations to other groups (PDB, BioSync, NOBUGS, etc.) will be
added in the next draft, but first it is important to clarify
the focus of this group.  Comments, criticism and suggestions
appreciated. -- H. J. Bernstein

Consortium on Management of Experimental Data in Structural Biology
(MEDSBIO)

There is a complex relationship among raw experimental data,
derived data and experimental models used in structural biology.
There are strong collaborative efforts that help to achieve coherence
and consistency in the nomenclature and representation of derived
data and of experimental models in structural biology.  There are
many existing efforts in the management of raw data, wherein lies a
problem.  Each vendor of data collection equipment defines their own
data acquisition protocols and data formats.  Each synchrotron
beamline development group layers their own data acquisition
protocols and formats on top of and sometimes in place of a variety
of vendor formats.  Multiple collaborations have developed to reduce
the complexity of raw data management protocols in structural
biology.  For image data in synchrotron-based protein crystallography
we have both imgCIF/CBF and NeXuS from collaborations as well as
multiple vendor image formats, with not only different formats for
different detectors, but even with different formats for the same
type of detector.  If we do not bring the imgCIF and NeXuS
collaborations together with some significant number of vendors
to establish clean, well-documented relationships among the
formats, then the standards, instead of producing coherence, may add
to the chaos as poorly documented variants of "standards" emerge.  We
are not certain that this risk can be, or even that it should be,
avoided completely.  Perfect standardization could suppress
creativity and scientific development.  We are creating a new
consortium on the Management of Experimental Data in Structural
Biology (MEDSBIO) not to enforce standardization on a single data
management protocol, but to document clearly the interfaces among
protocols, so that individual experimental efforts working in the
intersection of multiple protocols can function as efficiently as
possible and so that the competition among standards can be resolved
as an open competition of ideas to the betterment of the science
involved, rather than as a political exercise.

The goals of the MEDSBIO consortium are to collaboratively resolve
the interface issues among multiple structural biology data
management protocols, including imgCIF, NeXuS, vendor data formats,
instrument control and signaling protocols, local and remote
experiment control protocols, etc., with the objective of making the
collection, transfer and archiving of data for experiments in
structural biology as efficient as practicable; maintain an archive
of documentation on standards and proposals for ontologies, software,
hardware specifications, web templates and other documentation
related to such protocols; maintain an archive of open source
software and links to closed source software related to such
protocols; maintain an archive of samples and test cases related to
such protocols; run annual workshops on issues relating to such
protocols; contribute open source software to fill gaps in the
infrastructure related to such protocols; gather and where necessary
create curricular material to assist in training experimenters in
issues related to such protocols.

These efforts are primarily focused on the fine details of data
acquisition, of managing raw data in hardware and software in ways
that conserve resources.  These are issues that users of this data
often gloss over or do not consider at all.  For the users, data
derived from the raw data (e.g. structure factors derived from
pixel-by-pixel photon counts) are the primary data, to be provided
by "black-box" systems.  MEDSBIO is concerned with issues in the
innards of those black boxes.  There is a strong relationship between
these internal issues and the issues that users must confront.
They are connected by the data and the representations of the
derived data required by the users.  Thus if a particular user
community were to standardize on, say, imgCIF for their "raw"
data in a synchrotron environment using, say, NeXuS, for its
overall data management, working with detectors using an
idiosyncratic detector element coordinate system, the users might
well wish to be isolated from NeXuS and the oddities of the detector
coordinate system, but the beam line designers need to have a
detailed, well-documented understanding of how to interface among all
the messy innards that the users never wish to deal with.  If this is
not done well and done in a consistent manner at multiple beam lines,
then, instead of imgCIF providing a standard, it will exist in
multiple difficult-to-translate dialects.
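
As an illustration of the kind of interface documentation meant
here, the following is a minimal sketch in Python of a detector
element coordinate system written down explicitly, so that a pixel
index can be converted to a laboratory position without guessing at
conventions.  All of the names (DetectorGeometry, pixel_to_lab, the
field names) and all of the numerical values are hypothetical and
chosen only for illustration; they are not taken from imgCIF, NeXuS
or any vendor format.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class DetectorGeometry:
    """Hypothetical explicit record of one detector's element layout."""
    origin_mm: Vec3                     # lab position of the center of pixel (0, 0)
    fast_axis: Vec3                     # unit vector along which the fast index increases
    slow_axis: Vec3                     # unit vector along which the slow index increases
    pixel_size_mm: Tuple[float, float]  # pixel pitch along (fast, slow)


def pixel_to_lab(geom: DetectorGeometry, fast: int, slow: int) -> Vec3:
    """Return the lab-frame position (mm) of the center of pixel (fast, slow)."""
    step_fast = fast * geom.pixel_size_mm[0]
    step_slow = slow * geom.pixel_size_mm[1]
    x, y, z = (
        o + step_fast * f + step_slow * s
        for o, f, s in zip(geom.origin_mm, geom.fast_axis, geom.slow_axis)
    )
    return (x, y, z)


# Two made-up detectors that differ only in their index conventions.
# Detector A: fast index runs along +x, slow index runs along -y.
detector_a = DetectorGeometry(
    origin_mm=(-50.0, 50.0, 100.0),
    fast_axis=(1.0, 0.0, 0.0),
    slow_axis=(0.0, -1.0, 0.0),
    pixel_size_mm=(0.5, 0.5),
)
# Detector B: fast index runs along +y, slow index runs along +x.
detector_b = DetectorGeometry(
    origin_mm=(-50.0, -50.0, 100.0),
    fast_axis=(0.0, 1.0, 0.0),
    slow_axis=(1.0, 0.0, 0.0),
    pixel_size_mm=(0.5, 0.5),
)

position_a = pixel_to_lab(detector_a, 10, 20)
position_b = pixel_to_lab(detector_b, 10, 20)
```

The same pixel index (10, 20) lands at different laboratory
positions on the two detectors.  Unless the origin, axis directions
and pixel pitch are recorded alongside the image, software reading
the data at another beam line has no reliable way to tell which
convention applies.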

Because end users and developers have a lot in common and are tied
together by the data itself as it is transformed from raw images,
photon counts, axis settings, etc., it is important that there be
collegial collaboration between people working on problems on both
ends of the data stream, but it is equally important to allow the
technical issues on the raw data side to be fully discussed and
explored without being swamped by the equally demanding discussions
needed on the derived data side.  Therefore it is important to have
a collaborative consortium in the developer community that is neither
focused on a single data management protocol, nor dominated by
discussions of derived data user-level issues.

The MEDSBIO consortium formalizes several existing collaborations and
introduces a new level of coordination and cooperation in working
with raw experimental data of importance in structural biology,
complementing well-established efforts in working with the data
derived from this raw data, hopefully producing a better
understanding of the data upon which much experimental work in
structural biology is based and an understanding of the issues which
affect the quality and reliability of that data.  By clarifying and
codifying the parameters of the information streams that interact to
produce the raw data, we hope to bring a new level of consistency and
coherence to the presentation of scientific results of the
experiments that depend upon this data, thereby facilitating reliable
intercomparisons among experiments and facilitating analysis based
upon the results of multiple experiments.

-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

               Office:  +1-631-244-3035
            Lab (KSC 020): +1-631-244-3451
                  yaya at dowling.edu
=====================================================

