| 251 | Goal: [[BR]] |
| 252 | + DatCat was designed to improve data sharing by providing a unified metadata database for Internet data. [[BR]] |
| 253 | + Make the following easy for users: [[BR]] |
| 254 | - finding data sets of interest [[BR]] |
| 255 | - adding new data sets to the catalog [[BR]] |
| 256 | - annotating data sets in the catalog [[BR]] |
| 257 | + DOES NOT store data [[BR]] |
| 258 | |
| 259 | Database schema: [[BR]] |
| 260 | + Collection [[BR]] |
| 261 | - logical group of files (paper, project, ...) [[BR]] |
| 262 | + Data files [[BR]] |
| 263 | - raw data files (traceroutes, log dumps, ...) [[BR]] |
| 264 | + Packages [[BR]] |
| 265 | - downloadable files (single file, tarball, ...) [[BR]] |
| 266 | + Locations [[BR]] |
| 267 | - how to get the packages (URL, contact address, ...) [[BR]] |
| 268 | |
| 269 | Annotations: [[BR]] |
| 270 | + provide an extensible namespace for assigning domain-specific values to files [[BR]] |
| 271 | + each user has their own hierarchical namespace [[BR]] |
| 272 | - passive.IPv4.packet_count [[BR]] |
| 273 | - active.RTT_95th_percentile [[BR]] |
| 274 | + both data contributors and general DatCat users may attach annotations [[BR]] |
| 275 | + any user may assign “note” annotations to any object [[BR]] |
| 276 | |
| 277 | Metadata fields: [[BR]] |
| 278 | + collection [[BR]] |
| 279 | - fields: name, contents, summary, motivation, creators/primary contact/contributor, start/end time, keywords, short description/description/description URL [[BR]] |
| 280 | - annotations: note [[BR]] |
| 281 | + data [[BR]] |
| 282 | - fields: name, creators/primary contact/contributor, keywords, format, file size, start/end time, duration, geographic/network location, time zone, MD5, description, creation process [[BR]] |
| 283 | - annotations: passive.IPv4.packet_count, passive.IPv4.TCP.dst.port_count, cfg.passive.capture_len, AS_count, active.trace_count, active.RTT_10th_percentile, ... [[BR]] |
| 284 | + location [[BR]] |
| 285 | - fields: package, creators, primary contact, status, download procedure, download URL, geographic/logistic location, availability [[BR]] |
| 286 | |
| 287 | Submission tools: [[BR]] |
| 288 | + Perl API [[BR]] |
| 289 | - useful for integrating into existing data management systems [[BR]] |
| 290 | - flexible, but requires writing code [[BR]] |
| 291 | + subcat [[BR]] |
| 292 | - different approach (declarative) [[BR]] |
| 293 | - describe metadata in human-friendly text files (YAML) [[BR]] |
| 294 | - CAIDA provides tools to extract additional metadata (data-to-yaml) [[BR]] |
| 295 | - subcat intuitively joins information together [[BR]] |
| 296 | |
| 297 | DatCat web portal: [[BR]] |
| 298 | + Browse collections [[BR]] |
| 299 | + Search collections [[BR]] |
| 300 | + Search data [[BR]] |
| 301 | |
| 302 | Lessons learned: [[BR]] |
| 303 | + file-level metadata is hard [[BR]] |
| 304 | - hard to fix errors across thousands of files [[BR]] |
| 305 | - hard to display thousands of files [[BR]] |
| 306 | - hard to generate [[BR]] |
| 307 | + submission process too cumbersome for most users [[BR]] |
| 308 | - the majority of metadata is shared between files (creator, creation process, location, etc.) [[BR]] |
| 309 | - many researchers are not programmers [[BR]] |
| 310 | - researchers have limited time and motivation [[BR]] |
| 311 | + lots of redundant information [[BR]] |
| 312 | - within a single contribution, most data objects have identical metadata [[BR]] |
| 313 | - could be solved by pushing subcat-type categories into the database [[BR]] |
| 314 | + Move to stand-alone collections [[BR]] |
| 315 | - contributors will only need to fill in the collection information [[BR]] |
| 316 | - shorten search path from collection to locations [[BR]] |
| 317 | + better to have lots of collections than lots of files [[BR]] |
| 318 | |
| 319 | |
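The redundancy point under "Lessons learned" can be illustrated with a small sketch: metadata shared across a contribution is stored once at the collection level, and each data file records only the values that differ. This is a minimal illustration, not DatCat's actual schema or API. The class names `Collection` and `DataFile`, the `metadata_for` method, and all sample values are hypothetical; only the field names (creator, format, time zone, file size) come from the metadata fields listed in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical per-file record holding only file-specific values."""
    name: str
    overrides: dict = field(default_factory=dict)


@dataclass
class Collection:
    """Hypothetical collection holding metadata shared by all of its files."""
    name: str
    shared: dict = field(default_factory=dict)
    files: list = field(default_factory=list)

    def metadata_for(self, data_file: DataFile) -> dict:
        # Start from the shared values, then let per-file overrides win.
        return {**self.shared, **data_file.overrides, "name": data_file.name}


# Sample values are invented for illustration only.
collection = Collection(
    name="example-traces",
    shared={"creator": "Example Lab", "format": "pcap", "time_zone": "UTC"},
)
collection.files.append(DataFile("trace-001", {"file_size": 1024}))
collection.files.append(
    DataFile("trace-002", {"file_size": 2048, "format": "pcap.gz"})
)

for data_file in collection.files:
    print(collection.metadata_for(data_file))
```

With this layout, correcting a shared value such as the creator means changing one entry on the collection rather than editing every file record, which is the kind of saving the notes suggest when they propose pushing subcat-type categories into the database.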