GAM - Data Issues, was Re: [seul-sci] GAM

Danilo Gonzalez Hashimoto wrote:
> Ok, let's see that in a deeper way...
>         Managing Data: what kind of data do you use? Here, the most usual
> is ".DAT" files. These are points of a graph. Some of these files have
> column titles, analysis parameters and sample id. The other most usual
> files are pictures (microscopy). I am not very mature in coding, but I'd
> like to try some, and I think writing such a data manager would be a nice
> fit for it.

     Image data presents an interesting set of challenges, and I think if we
plan properly, we can accommodate many different image types (or rather,
image uses) so someone can eventually build the necessary tools. In the
interim, I'd be inclined to see how we can store information about the image
(filename, location, image size, colour depth, description, etc.) in a file
that GAM could make use of. Some of the different image uses I can think of
off the top of my head include:

     - Microscopy images: used for presentation, measurement. I know of one
       lab measuring cricket foreleg and hindleg lengths for a genetic
       research project using image capture gear

     - Scanned graphs: used to obtain an approximation of other, existing
       data from graphs in other research papers - usually when someone
       is trying to place their data / science in the context of previous
       work; I've heard this approach called 'meta-analysis' (I have the
       transforms for this already, and I'm interested in building it)

     - Field Specimen images: a digital camera imaging an animal (say a
       fish or seahorse) against a measured background (ie. the grid on
       the plate of a 0.5 x 0.5 m straight cutter) would make for more
       rapid and likely more accurate estimates of size and length. Since
       the background behind the target changes little, it should be
       possible to automate some sort of subtraction algorithm to remove
       the background image and keep only the target, while keeping the
       measurements intact (form, height, length, etc.)

     - Astronomical images: positions? scale? spectra?

     - Presentation images (GIMP?) - since GIMP is already the de facto
       standard in image manipulation, it's probably also the best bet for
       image prep for presentations.

     I'm sure there are other image uses...
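
     To make the idea above concrete, here is a minimal sketch of how GAM
might record image metadata (filename, size, colour depth, description) in
an XML file it could make use of. All element and attribute names here are
hypothetical, not a settled format:

```python
# Sketch only: element/attribute names ("image", "imageset", etc.) are
# invented for illustration, not a fixed GAM format.
import xml.etree.ElementTree as ET

def image_record(filename, width, height, depth, description):
    """Build an <image> element describing one stored image."""
    img = ET.Element("image", attrib={"file": filename})
    ET.SubElement(img, "size").text = f"{width}x{height}"
    ET.SubElement(img, "colourdepth").text = str(depth)
    ET.SubElement(img, "description").text = description
    return img

root = ET.Element("imageset")
root.append(image_record("cricket_017.png", 1024, 768, 24,
                         "hindleg, specimen 17, 40x magnification"))
print(ET.tostring(root, encoding="unicode"))
```

     A file like this would let GAM index and describe images without ever
having to open them itself; measurement tools could be bolted on later.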

>         The questions are: do you have any different kind of data? How do
> you think it should be organized to be well indexed (XML? directory
> listings? is using current databases -- postgres, mysql -- too much?)?
> Which features should it have (date, id, equipment, etc...)

     I think your idea to use XML is probably the best direction we can take
- it should enable us to use a flexible enough data structure without making
things too complicated or too limiting. I don't know a lot about XML, though...
     Including info like Equipment, ID and Date is appropriate as well; there
is an issue here as to HOW the data is going to be organized in the XML; I
don't have any clear ideas right now.
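
     Just to have something to argue about, one possible shape for such a
file might look like this - purely a hypothetical sketch, with made-up
element names, not a proposal for the final layout:

```xml
<!-- Hypothetical layout only; all element names are invented -->
<dataset id="run-042" date="2001-03-15">
  <equipment>light spectrophotometer, lab unit 1</equipment>
  <column name="wavelength" units="nm" type="double"/>
  <column name="absorbance" units="dimensionless" type="double"/>
  <data>
    400.0 0.12
    410.0 0.15
  </data>
</dataset>
```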

>         Specificity / Generality: I think that modularity is the answer
> here. If you have all those nice common features and allow people to
> connect different modules, they could make it customizable enough. I'm not
> any expert on it, but it's just where I think environment interaction and
> use of its technology enters the room:
>         * Much equipment produces data in many data formats. Defining a
> 'standard' file format would make it easier to exchange data between apps.
> So, people would be able to use the same program to analyse, say,
> poresizer data from different equipment. They would just have to make a
> module (script?) which would take their equipment data to the standard
> data format.

     Good point. In our own lab, we have an old light spec that produces
data according to a simple proprietary protocol, an ancient spec that uses a
different protocol, and a third having a well-documented one. I'm often
leery of comparing results from these three devices, so some sort of
filtering program should not only be able to read the file, but also attach
some sort of tag describing the instrument used. I'm writing another email
about this, will send it shortly.

     As long as the common data format is open and can deal with different
data types (so that we don't unnecessarily lose accuracy because of data
type coercion), we should be fine. Converting data to and from the common
format should be easily done using scripts.
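
     A conversion script of the sort described above could be very small.
This is only a sketch: the input format (two whitespace-separated columns)
and the output tagging scheme are both invented for illustration, and the
instrument tag is exactly the kind of provenance marker mentioned earlier:

```python
# Sketch of a per-instrument conversion script. The raw two-column input
# format and the XML element names are invented, not a fixed specification.
import xml.etree.ElementTree as ET

def convert(lines, instrument):
    """Turn raw two-column instrument output into a tagged XML dataset."""
    root = ET.Element("dataset")
    ET.SubElement(root, "equipment").text = instrument  # record the source device
    data = ET.SubElement(root, "data")
    rows = []
    for line in lines:
        x, y = line.split()
        rows.append(f"{float(x)} {float(y)}")  # normalise number formatting
    data.text = "\n".join(rows)
    return ET.tostring(root, encoding="unicode")

raw = ["400 0.12", "410 0.15"]
print(convert(raw, "old light spec, lab 1"))
```

     Each instrument would get its own little script like this, and the rest
of the toolchain would only ever see the common format.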

     If we are thinking of using XML in the first place, there are ALL kinds
of things that we could take into account to make data management more
effective or at least less error-prone. For instance, the underlying data
file could contain a common set of descriptors, like:

     DataName:         (what is the name of this column)
     DataDescription:  (what does data describe)
     DataType:         (text, integer, long integer, single, double, etc)
     DataUnits:        (meter-kilogram-second units?)

     I'm suggesting the description and units fields because on more than
one occasion, I've had to come back to someone's poorly documented dataset
(or worse yet, my own) and lose a lot of time looking up the numbers and
retesting some of the models to ensure that I understood what everything
was. Also, I've seen far too many research projects make unit errors in
their analyses.
     The type field can be used to ensure proper treatment when the data are
exported to an output format (XML would be storing the data as strings, so
the type field tells the reading program how to interpret them). A couple of
other fields worth considering:

     DataSD:           (Significant Digits)
     DataStatus:       (raw data or calculated?)
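
     As a sketch of how the DataType field could drive that treatment when
a column is read back out of the XML - the field names follow the list
above, but the mapping itself is an illustration, not a fixed spec:

```python
# Hypothetical DataType -> Python type mapping; the type names follow the
# descriptor list above, the coercion scheme itself is only an illustration.
COERCERS = {
    "text": str,
    "integer": int,
    "long integer": int,
    "single": float,
    "double": float,
}

def coerce_column(values, data_type):
    """Convert string values (as stored in XML) to the declared type."""
    try:
        fn = COERCERS[data_type]
    except KeyError:
        raise ValueError(f"unknown DataType: {data_type}")
    return [fn(v) for v in values]

print(coerce_column(["1.5", "2.25"], "double"))  # -> [1.5, 2.25]
```

     An unknown type name fails loudly rather than silently passing strings
through, which seems the safer default for scientific data.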

>         * Most analyses require the same data and return the same kind
> of data.
>         I really don't know exactly how that works, but all that
> GNOME(Bonobo)/KDE(?) stuff would be nice here (I guess... :-). Which should
> be the general interfaces (logging should be here?)?

     I'm really, really tempted to try to avoid those issues at the start
and aim towards a more general graphical interface that can be modified
(customized) to make it easy to do things like call other programs and
scripts. My biggest concern is to get the basic functionality worked out,
because if we don't have that, we can get sidetracked very easily once we
start coding. It could otherwise also limit the ultimate functionality of
the program.

>         Logging: nice. With interfaces which are really wrappers to CLIs
> it would not be that difficult.

     Agreed. Logging in pretty much any program that follows the unix
philosophy (ie. CLI programs with a separate front end like R, Grace, etc.)
should be relatively straightforward.
     In the long term, the logs produced this way should have direct utility
for the user, so that s/he can build scripts to generate common graph types
or carry out common analyses. Some direct applications for 'scripted graphs'
that I can already think of are in limnology and in diatom stratigraphy: the
former often requires profile graphs of temperature or oxygen concentration
with depth, which are a pain to build because X (depth) is positive DOWN
and Y (DO/temp/etc) is positive RIGHT; in the latter, the graphs are just
scary because of their complexity.
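
     As a sketch of how a scripted profile graph could handle the
positive-DOWN axis problem - this uses matplotlib, and the sample depths and
temperatures are invented for illustration:

```python
# Sketch of a scripted limnology profile graph: depth increases downward.
# The sample numbers are invented; only the axis handling is the point.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

depth_m = [0, 2, 4, 6, 8, 10]                 # sample depths
temp_c = [22.0, 21.5, 18.0, 12.0, 9.0, 8.5]   # temperature at each depth

fig, ax = plt.subplots()
ax.plot(temp_c, depth_m)
ax.invert_yaxis()       # depth positive DOWN, as in a profile plot
ax.xaxis.tick_top()     # value axis along the top, the usual convention
ax.set_xlabel("Temperature (C)")
ax.set_ylabel("Depth (m)")
fig.savefig("profile.png")
```

     Once a script like this exists, the log-replay idea means any plot a
user has built once by hand could be regenerated on new data for free.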

>         I'd just like to point out a problem. One of the problems many
> projects like that have is that the user has to be sure he has every
> needed package before he installs it: qt, gnome, gtk, R, Motif,
> sci-libs, etc. What would be the best way to simplify this task?
> Packaging everything together? It would perhaps give much more than
> needed, or even end up installing old versions. Giving the links would
> also be an enormous task (for webmasters and users both).

      Very good point. Perhaps an initial goal could be to produce a general
CLI program to parse the XML, which could be called by a separate front end.
We could build the front end in whatever we want, so long as it can be
extended at run time; that way it can remain flexible enough for people in
different disciplines to use it effectively, but more importantly, flexible
enough to allow advanced users to script and customize as many tasks as
possible.
     As for the different packages, this could be a bit tricky. I'd be more
inclined to let the user worry about it a bit more, since GAM would
ostensibly sit on top of existing software - chances are that the user will
already have some handle on how these other programs work before s/he
tries to add the GAM plugin for it.

     Whew. Enough for now. :)


Pete St. Onge