Changes between Version 39 and Version 40 of Gec7InstMeasWGAgenda

03/24/10 17:51:02 (12 years ago)



  • Gec7InstMeasWGAgenda

    v39 v40  
    249 249 [  slides] [[BR]]
     251 Goal: [[BR]]
     252 +  DatCat was designed to improve data sharing by providing a unified metadata database for Internet data. [[BR]]
     253 +  Make the following easy for users: [[BR]]
     254 -  finding data sets of interest [[BR]]
     255 -  adding new data sets to the catalog [[BR]]
     256 -  annotating data sets in the catalog [[BR]]
     257 +  DOES NOT store data [[BR]]
     259 Database schema: [[BR]]
     260 +  Collection [[BR]]
     261 -  logical group of files (paper, project, ...) [[BR]]
     262 +  Data files [[BR]]
     263 -  raw data files (traceroutes, log dumps, ...) [[BR]]
     264 +  Packages [[BR]]
     265 -  downloadable files (single file, tarball, ...) [[BR]]
     266 +  Locations [[BR]]
     267 -  how to get the packages (URL, contact address, ...) [[BR]]
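The four object types above can be sketched as a small data model. This is only an illustration, not DatCat's actual schema: the class names, field names, and the direction of the links (collection to data files to packages to locations) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Location:
    """How to get a package (URL, contact address, ...)."""
    how_to_get: str  # hypothetical field name


@dataclass
class Package:
    """A downloadable file (single file, tarball, ...)."""
    name: str
    locations: List[Location] = field(default_factory=list)


@dataclass
class DataFile:
    """A raw data file (traceroute, log dump, ...)."""
    name: str
    # Assumption: a data file points at the packages that contain it.
    packages: List[Package] = field(default_factory=list)


@dataclass
class Collection:
    """A logical group of files (paper, project, ...)."""
    name: str
    data_files: List[DataFile] = field(default_factory=list)


def locations_for(collection: Collection) -> List[str]:
    """Walk collection -> data files -> packages -> locations."""
    return [
        loc.how_to_get
        for data_file in collection.data_files
        for package in data_file.packages
        for loc in package.locations
    ]


# Example with made-up names: one collection, one file, one package.
example = Collection(
    name="example-project",
    data_files=[
        DataFile(
            name="trace-001.txt",
            packages=[
                Package(
                    name="traces.tar.gz",
                    locations=[Location("https://example.org/traces.tar.gz")],
                )
            ],
        )
    ],
)
print(locations_for(example))
```

Note that the catalog objects hold only descriptions and pointers; the data itself stays wherever the location says it is.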
     269 Annotations: [[BR]]
     270 +  Provide an extensible namespace for assigning domain-specific values to files. [[BR]]
     271 +  Each user has their own hierarchical namespace [[BR]]
     272 -  passive.IPv4.packet_count [[BR]]
     273 -  active.RTT_95th_percentile [[BR]]
     274 +  Both data contributors and general DatCat users may attach annotations [[BR]]
     275 +  Any user may assign “note” annotations to any object [[BR]]
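A per-user hierarchical namespace with dotted keys such as passive.IPv4.packet_count might look roughly like the sketch below. The `AnnotationStore` class and its methods are hypothetical names invented for this example and do not describe DatCat's implementation.

```python
class AnnotationStore:
    """Toy store: object id -> user -> dotted annotation key -> value."""

    def __init__(self):
        self._data = {}

    def annotate(self, obj_id, user, key, value):
        """Attach a value under the given user's own namespace."""
        per_object = self._data.setdefault(obj_id, {})
        per_object.setdefault(user, {})[key] = value

    def under(self, obj_id, user, prefix):
        """Return this user's annotations at or below a dotted prefix."""
        parts = prefix.split(".")
        annotations = self._data.get(obj_id, {}).get(user, {})
        return {
            key: value
            for key, value in annotations.items()
            # Compare whole path components, so "passive" does not
            # accidentally match a key such as "passive_old.count".
            if key.split(".")[: len(parts)] == parts
        }


store = AnnotationStore()
store.annotate("file-1", "alice", "passive.IPv4.packet_count", 1000)
store.annotate("file-1", "alice", "active.RTT_95th_percentile", 42.5)
store.annotate("file-1", "bob", "passive.IPv4.packet_count", 999)
print(store.under("file-1", "alice", "passive"))
```

Keeping each user's keys separate means two users can annotate the same object with the same key without overwriting each other.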
     277 Metadata fields: [[BR]]
     278 +  collection [[BR]]
     279 -  fields: name, contents, summary, motivation, creators/primary contact/contributor, start/end time, keywords, short description/description/description URL [[BR]]
     280 -  annotations: note [[BR]]
     281 +  data [[BR]]
     282 -  fields: name, creators/primary contact/contributor, keywords, format, file size, start/end time, duration, geographic/network location, time zone, MD5, description, creation process [[BR]]
     283 -  annotations: passive.IPv4.packet_count, passive.IPv4.TCP.dst.port_count, cfg.passive.capture_len, AS_count, active.trace_count, active.RTT_10th_percentile, ... [[BR]]
     284 +  location [[BR]]
     285 -  fields: package, creators, primary contact, status, download procedure, download URL, geographic/logistic location, availability [[BR]]
     287 Submission tools: [[BR]]
     288 +  Perl API [[BR]]
     289 -  useful for integrating into existing data management systems [[BR]]
     290 -  flexible, but requires writing code [[BR]]
     291 +  subcat [[BR]]
     292 -  different approach (declarative) [[BR]]
     293 -  describe metadata in human-friendly text files (YAML) [[BR]]
     294 -  CAIDA provides tools to extract additional metadata (data-to-yaml) [[BR]]
     295 -  subcat intuitively joins information together [[BR]]
     297 DatCat web portal: [[BR]]
     298 +  Browse collections [[BR]]
     299 +  Search collections [[BR]]
     300 +  Search data [[BR]]
     302 Lessons learned: [[BR]]
     303 +  File-level metadata is hard [[BR]]
     304 -  hard to fix errors across thousands of files [[BR]]
     305 -  hard to display thousands of files [[BR]]
     306 -  hard to generate [[BR]]
     307 +  Submission process is too cumbersome for most users [[BR]]
     308 -  the majority of metadata is shared between files: creator, creation process, location, etc. [[BR]]
     309 -  many researchers are not programmers [[BR]]
     310 -  researchers have limited time and motivation [[BR]]
     311 +  Lots of redundant information: [[BR]]
     312 -  Within a single contribution, most data objects carry identical metadata. [[BR]]
     313 -  could be solved by pushing subcat-type categories into the database [[BR]]
     314 +  Move to stand-alone collections [[BR]]
     315 -  contributors will only need to fill in the collection information [[BR]]
     316 -  shorten search path from collection to locations [[BR]]
     317 +  Better to have lots of collections than lots of files [[BR]]
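The redundancy point can be made concrete with a short sketch: metadata that is identical across every file in a contribution is stored once at the collection level, and each file keeps only what differs. The function names and the sample field values here are hypothetical and are not taken from DatCat or subcat.

```python
def factor_shared(files):
    """Split per-file metadata dicts into (shared, per_file_remainders)."""
    if not files:
        return {}, []
    first = files[0]
    # A key/value pair is shared only if every file has the same value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in f and f[key] == value for f in files)
    }
    remainders = [
        {key: value for key, value in f.items() if key not in shared}
        for f in files
    ]
    return shared, remainders


def effective_metadata(collection_meta, file_meta):
    """Rebuild a file's full metadata; per-file values win on conflict."""
    merged = dict(collection_meta)
    merged.update(file_meta)
    return merged


files = [
    {"name": "a.pcap", "creator": "alice", "format": "pcap"},
    {"name": "b.pcap", "creator": "alice", "format": "pcap"},
]
shared, remainders = factor_shared(files)
print(shared)
print(remainders)
```

With this split, fixing an error in a shared field means one edit at the collection level rather than one edit per file.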
    251 320 ==  GENI I&M Architecture ==
    252 321 4:55pm  Harry Mussman  (GPO)