A Compress Filesystem To Improve The Files Contents Search

International Journal of Digital Information and Wireless Communications (IJDIWC) 1(2): 554-562
The Society of Digital Information and Wireless Communications, 2011 (ISSN 2225-658X)

Nicola Corriero, Emanuele Covino, Giovanni Pani
University of Bari, Department of Computer Science
Via Orabona 4, Bari, Italy
nicolacorriero@gmail.com, {covino, pani}@di.uniba.it
ABSTRACT

SquashFS is a compressed Linux filesystem. Hixosfs is a filesystem that improves file content search by using metadata. In this paper we propose to apply the Hixosfs idea to SquashFS by creating a new filesystem, HSFS. HSFS is a compressed Linux filesystem that stores metadata within inodes. We test our idea with DICOM files, which are used to store medical images.

Keywords: Filesystem, Compression Data, Linux, DICOM, Metadata

I. INTRODUCTION

HixosFS is an Ext2 filesystem that allows the user to extend and modify the inode associated with a file. The main idea of HixosFS is to add to the ext2 inode a tag holding metadata that are normally encapsulated in the file itself.

Squashfs is a compressed Linux filesystem. It works with a virtual filesystem stored within a single file. Squashfs is very popular in embedded systems because it allows the user to compress data where little memory is available.

Based on the experience of HixosFS, we modify SquashFS to obtain HSFS (Hixos Squash File System), which merges the advantages of Hixosfs and Squashfs.

The idea was born in a real scenario. DICOM (Digital Imaging and Communications in Medicine) is the standard format adopted for the treatment of medical images. The standard provides for associating with the medical image, and incorporating into it, accurate and unambiguous information about the patient to whom the image data relate: personal details, medical reports related to clinical analyses, etc. Hospitals and major health organizations are turning today to computers to make enormous amounts of information secure, fast, reliable and easily accessible, in real time and at a distance.

Forensic images must be retained for a period of not less than 10 years. Since we are in the presence of a large number of images of considerable size, a method is needed to reduce the commitment of resources in terms of space occupied on storage devices. GNU/Linux offers an interesting solution, the SquashFS file system. Squashfs is a compressed read-only file system that compresses files, inodes and directories, and is usually used in systems that need minimal read-only storage. The interest of this file system lies in its mode of operation: once the image of the directory containing the compressed files is mounted, the files can be managed as normal files through the shell or the GUI, so the compression is completely transparent to the user.

The file in DICOM format contains, together with the image, the identification data used by applications to manage it. Having to extract these data directly from the compressed files may require an excessive commitment of resources in terms of time.

II. PROBLEM

The DICOM Standard is structured as a multi-part document in accordance with the guidelines set forth in the document ISO/IEC Directives, 1989. This was done so that a single part can be updated without having to republish the entire standard. The current version (2007) consists of 18 parts and can be found at http://medical.nema.org and http://www.dclunie.com/.

The general structure of a DICOM file includes a header containing data describing the image, the patient, time and date of acquisition, and more, and the image itself, all in one file. The standard since its first appearance has not definitively established the indivisibility between the image pixel data and the data describing the process that led to the formation of the image.

We are in the presence of a large number of images of considerable size. Examples of these sizes are:
Exam                  MB/Exam
Nuclear Medicine      1-2
Magnetic Resonance    6-12
Ultrasound            15-150
Echocardiography      150-800
Angiography           600-700
Coronary              700-800
TCMS 64               700-800
CR                    13
DR                    28
Digital Mammograms    120
DICOM images must be retained for a period of not less than 10 years. We therefore need a lot of space to store this information, and a lot of time to look for a single exam of a single patient.

III. OTHER SOLUTIONS

Normally these situations are handled using PCs with large primary and secondary memories, which make it possible to use any operating system and any means of saving and managing information. Other approaches require the use of complex databases on servers and/or embedded databases on embedded machines.

A. Sqlite

Embedded systems have evident memory constraints, which over time have called for lightweight, high-performance ad-hoc solutions. Sqlite is an application that implements all the functionalities of a database using ordinary disk files. This lightens the execution load of the system and facilitates integration into, for example, an embedded system. However, installing the system produces a certain load on mass storage.

B. Extended attributes

Currently Linux filesystems such as ext2 [4], ext3 and reiserfs allow metainformation related to a file to be managed with the xattr feature. Patching the kernel with xattr gives a way to extend inode attributes that does not physically modify the inode struct. This is possible because in xattr the attributes are stored as attribute-value pairs outside the inode, as variable-length strings. The basic command used to deal with extended attributes is attr, which provides options to set and get attribute values, to remove attributes, to list all of them, and to read or write these values to standard output. The programs we implemented in our testing scenario are based on this user space tool.

C. Oracle/Mysql

The tools used to handle large quantities of data are very efficient, although they require large hardware and software resources. Installing and running applications like Oracle or Mysql, in fact, requires large quantities of memory and hard disk space.

IV. FILESYSTEM SOLUTION: HSFS

Our proposal is to use a compressed filesystem, modified ad hoc to improve the indexing of files and facilitate the discovery of information. HSFS combines the search performance of Hixosfs with the compression of Squashfs. The idea is to compress the DICOM files via Squashfs and to index the contents of each file in the inode of the compressed file. In this way the images occupy less space, and information can be found without browsing gigabytes of data, by looking only at the file inodes.

A. System architecture

In the preparation of minimal and embedded Linux operating systems, each byte of the storage device (floppy, flash disk, etc.) is very important, so compression is used wherever possible. In addition, compressed file systems are usually necessary for archival purposes. This is essential for large public records or even personal archives, and it is the latter application that led us to study this file system.

The SquashFS file system brings everything to a new level. It is a read-only file system that allows compression of individual directories or entire file systems, writing them to other devices/partitions or to regular files, and mounting them directly (if the target is a partition) or through the loopback device (if it is a file). The modular and compact SquashFS is brilliant. For archival purposes, squashfs is much more flexible and faster than a tarball.

The squashfs distributed software includes a patch for the Linux kernel sources (which enables, in kernel versions prior to 2.6.34, support for reading the squashed file system), the mksquashfs tool, needed to create file systems, and the unsquashfs tool, which extracts files from an existing squashed file system.

The latest released version of squashfs is 4.x. This version was made for kernel 2.6.34 and later, where support for reading the squashed file system is integrated. The previous version, 3.x, made for versions of the Linux kernel prior to 2.6.34, includes a patch to apply.

To manage the new information that we intend to include in the inode structure, it is necessary:
- to create a new definition of the VFS inode with new fields;
- to create new syscalls to perform operations on the new inode;
- to create a new definition of inodes in the SquashFS filesystem;
- to change the squashfs tools so that the new inode fields can be populated;
- to create user space programs to interact with the modified file system.

B. Hixosfs

Hixosfs is an extension of the ext2 Linux filesystem (fs) able to classify and manage collections of files with metadata in a fast, high-performance way. This is possible because high priority is given to storing and retrieving the metadata tags, which are managed at fs level in kernel mode.

The core idea of hixosfs is that information regarding the content of a file, its metadata, belongs to the fs structure itself. A Linux fs generally stores common information for each file, such as permissions or creation and modification times, and the struct that represents a file inside the VFS, called the inode, keeps such data. To represent tags physically inside the inode, hixosfs reserves a greater amount of memory for the inode, storing the extra information needed to label a file with respect to its content, such as album, author, title and year for a music file.

In this paper we explain how the fundamental concepts implemented in hixosfs can help to solve problems in embedded systems. We had to implement two system calls to write and read the new information stored inside the inode: chtag and retag. The final user invokes the syscalls through the tools chtag and stattag, whose functionality is analogous to that of the well-known chown and stat but applies to generic tags.
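To make the idea of a fixed-size tag record concrete, the following is a minimal user-space sketch in C. It is not the hixosfs kernel code: the names file_tags, tag_set and file_tags_fill are hypothetical, the 30-byte field size is borrowed from the struct tag listed later in this section, and the music labels (album, author, title, year) are the example given above. The real chtag syscall is not shown, since its signature is not given in this paper.

```c
#include <string.h>

#define TAG_LEN 30  /* field size, taken from the struct tag shown in the paper */

/* Hypothetical user-space model of the per-file tag record. */
struct file_tags {
    char tag1[TAG_LEN];
    char tag2[TAG_LEN];
    char tag3[TAG_LEN];
    char tag4[TAG_LEN];
    unsigned int tag_valid;  /* non-zero once the tags have been set */
};

/* Copy src into one TAG_LEN-byte field, truncating long values
 * and always leaving the field NUL-terminated. */
static void tag_set(char *dst, const char *src)
{
    strncpy(dst, src, TAG_LEN - 1);
    dst[TAG_LEN - 1] = '\0';
}

/* Fill the four tags for a music file and mark the record as valid. */
void file_tags_fill(struct file_tags *t, const char *album,
                    const char *author, const char *title,
                    const char *year)
{
    memset(t, 0, sizeof *t);
    tag_set(t->tag1, album);
    tag_set(t->tag2, author);
    tag_set(t->tag3, title);
    tag_set(t->tag4, year);
    t->tag_valid = 1;
}
```

Because every field has a fixed size, the record can live inside the inode itself, which is what lets a search read the tags without opening the file contents. The price is that values longer than 29 characters are truncated in this sketch.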
Hixosfs has been used to tag gps files and mobile devices. In this way all the load has been transferred to the kernel, which handles and organizes the hixosfs files as they occur. The servers and the clients contain partitions that can read and set the hixosfs tags so as to manage the database.

The kernel struct for the management of all file types is the inode. The struct tag has four fields for a total of about 100 bytes of stored information; theoretically an inode can be extended up to 4 KB, so it is possible to customize it with many tags for a given purpose. It is convenient to choose the tags that are most often used in file searches to discriminate files by their content. For these reasons we decided to use a generic version of hixosfs, with a generic structure into which representative tags can be inserted case by case.

    struct tag {
    #ifdef CONFIG_HIXOSFS_TAG
        char tag1[30];
        char tag2[30];
        char tag3[30];
        char tag4[30];
        unsigned int tag_valid;
    #endif
    };

C. HSFS: Hixos Squash File System

DICOM header information is arranged in groups as follows: Patient, Study, Series and Image. Each attribute has a unique identifier consisting of two words (of two bytes each), the first identifying the parent group and the second the single attribute.

To manage the new information that we intend to include in the inode structure, we need to: create a new definition of the VFS inode with new fields; create new syscalls to perform operations on the new inode (1); create a new definition of the inode in the squashfs filesystem; modify the squashfs tools so that the new inode fields can be populated; create user space programs to interact with the modified file system.

(1) In our case the operations simply read the values, because we work on a read-only filesystem.

    ...
    struct squashfs_dicom_inode {
        char tag1[30];
        char tag2[30];
        char tag3[30];
        char tag4[30];
        unsigned int hixos_tag_valid;
    };
    ...
    struct squashfs_reg_inode {
        __le16 inode_type;
        __le16 mode;
        __le16 uid;
        __le16 guid;
        __le32 mtime;
        __le32 inode_number;
        __le32 start_block;
        __le32 fragment;
        __le32 offset;
        __le32 file_size;
        /* HSFS Field */
        struct squashfs_dicom_inode dicom;
        __le16 block_list[0];
    };
    ...

The position of the dicom field within the structure squashfs_reg_inode matters: it must precede block_list[0].

    ...
    int squashfs_read_inode(struct inode *inode, long long ino)
    {
        ...
        switch (type) {
        case SQUASHFS_REG_TYPE: {
            ...
            /* hsfs changes */
            strcpy((inode->i_tag).tag1, (sqsh_ino->dicom.tag1));
            strcpy((inode->i_tag).tag2, (sqsh_ino->dicom.tag2));
            strcpy((inode->i_tag).tag3, (sqsh_ino->dicom.tag3));
            strcpy((inode->i_tag).tag4, (sqsh_ino->dicom.tag4));
            /* hsfs changes */
            ...

User space tools. To use the hixosfs features we chose to store inside the squashfs inode information about:

    tag1 ----> id;
    tag2 ----> name;
    tag3 ----> patient's birthday;
    tag4 ----> study date.

For that reason we need tools that automatically populate the tags inside the inode by reading the information from the header. The reading of the DICOM file header, needed to populate the new struct inserted in the squashfs inode, was achieved through a function included in the program mksquashfs.c. The program mksquashfs.c is the tool used to create compressed filesystems and to append new files to existing ones. The definition of the new tag, which holds the values extracted from the file header and is used to populate the dicom tag added to the filesystem inode, has been included in the file squashfs_fs.h, while the file squashfs_swap.h completes the definition of the SWAP macros.

The changes made in kernel space allow us to handle the added tags, and with the squashfs tools we can fill the tags with the data taken from the header of the DICOM files; now we need tools that allow us to extract these data. For this purpose we have developed two programs: statdicom and finddicom. Applied to files in DICOM format present in a compressed filesystem, these programs use the syscall RETAG to show the data stored in the inodes of all files and to select the files that match specific search criteria.

    $ statdicom --help
    Usage: statdicom [OPTION...] FILE
    Shows the values of the new tags added to the squashfs inode, which contain the data drawn from the header of files of type dicom.
      -?, --help       Give this help list
          --usage      Give a short usage message
      -V, --version    Print program version
    $

The command finddicom allows us to search for and display the names of the files in DICOM format, together with the values of the added inode tags, that meet the search criteria passed as arguments to the command itself.

    $ finddicom --help
    Usage: finddicom [OPTION...]
    Search for files in DICOM format that meet the attribute values passed as parameters in the command line.
      -b, --birthdate=BIRTHDATE    Patient's birth date (yyyymmdd)
      -i, --id=ID                  Patient ID
      -n, --name=NAME              Patient's name
      -s, --studydate=STUDYDATE    Study date (yyyymmdd)
      -?, --help                   Give this help list
          --usage                  Give a short usage message
      -V, --version                Print program version
    $

V. A REAL SCENARIO

The test scenario was offered to us by a local medical analysis center, which helped us with a set of images in DICOM format. We created a folder containing all the images in DICOM format (with extension .dcm), and we created an hsfs filesystem with the command mksquashfs.

    $ mksquashfs immagini/ immagini.sq -always-use-fragments
    Parallel mksquashfs: Using 2 processors
    Creating 4.0 filesystem on immagini.sq, block size 131072.
    ....

Then, to test the result, we mounted the squashfs filesystem in /mnt/tmp. In this way the information is available for searching.
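Once the filesystem is mounted, the selection performed by finddicom can be pictured as a comparison between the criteria given on the command line and the four tags of each file. The following is a minimal sketch in C under stated assumptions. The names dicom_tags, dicom_query and dicom_match are hypothetical. Exact string equality is assumed, because the matching rule used by finddicom is not specified here. The tags are taken as already read from the inode, so the RETAG call itself is not shown.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space view of the four tags of one file,
 * following the tag1..tag4 assignment described earlier. */
struct dicom_tags {
    char id[30];
    char name[30];
    char birthdate[30];  /* yyyymmdd */
    char studydate[30];  /* yyyymmdd */
};

/* Search criteria: a NULL field means "do not filter on this tag". */
struct dicom_query {
    const char *id;
    const char *name;
    const char *birthdate;
    const char *studydate;
};

/* A criterion matches when it was not given or equals the stored value. */
static int field_matches(const char *wanted, const char *actual)
{
    return wanted == NULL || strcmp(wanted, actual) == 0;
}

/* Return 1 when every criterion that was given matches, 0 otherwise. */
int dicom_match(const struct dicom_query *q, const struct dicom_tags *t)
{
    return field_matches(q->id, t->id)
        && field_matches(q->name, t->name)
        && field_matches(q->birthdate, t->birthdate)
        && field_matches(q->studydate, t->studydate);
}
```

A query with only studydate set corresponds to a call such as finddicom -s, and setting several fields combines them with a logical AND. Since only the fixed-size tags are compared, no compressed file data need to be read during the search.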
    $ mount immagini.sq /mnt/tmp -t squashfs -o loop
    $ cd /mnt/tmp
    /mnt/tmp $ statdicom immagine19.dcm immagine7.dcm
    File: immagine19.dcm
    Data study: 20100428
    Name Surname: XXXXXTTIXXXXXLA
    Id: 2804101650
    Birthday: 19651128
    File: immagine7.dcm
    Data study: 20100604
    Name Surname: XXXXXCCIAXXXXXA
    Id: 0405101520
    Birthday: 19660530
    /mnt/tmp $
    /mnt/tmp $ finddicom -s 20100504
    File: immagine37.dcm
    Data study: 20100504
    Id: 0405101605
    Name Surname: XXXXXAXXXXX
    Birthday: 19601112
    File: immagine38.dcm
    Data study: 20100504
    Id: 0405101605
    Name Surname: XXXXXAXXXXX
    Birthday: 19601112
    /mnt/tmp $

The scenario works, although it is not very usable: all operations are currently done using the Linux command line. Our plan is to create a more usable interface to the system through a simple web application.

VI. TEST COMPARISONS

To test our filesystem we chose to create an ad-hoc minimal Linux distribution with a Bash command line and no graphical environment. The metric chosen for evaluation is the execution time. To obtain statistically significant results, the measurements were repeated 10 times for each instance created, so that the data are not distorted by running daemons.

We chose to compare a database of 20000 empty files managed by: hixosfs, ext2 with xattr, hsfs, a simple sqlite database and a mysql database. For the databases we populated the tables with random data. Each database (sqlite and mysql) is composed of a single table with 5 fields (1 for the id and 1 for each of the 4 hixos tags). We measured the time to create a file, to change a tag value, to read all tag values, and to find a tag value. We tried to create the same conditions for filesystem usage and database usage.

Here is the most interesting set of data collected during the tests.

Time to find a value among the tags: 20000 empty files
Hixosfs    Ext2 xattr    HSFS      SQLite    MySQL
60,301     234,516       60,286    68,467    188,543
60,527     234,867       70,285    68,708    188,931
60,590     235,581       70,704    68,999    188,300
60,781     236,751       60,339    69,245    188,812
60,681     236,469       60,296    69,309    188,874
60,507     235,106       60,293    69,327    189,345
60,505     236,507       60,333    68,377    188,647
60,574     234,757       61,271    69,690    188,651
60,869     235,916       60,299    70,297    189,075
60,594     236,326       60,404    96,133    188,396
Average time to find a value: 20000 empty files

              real       user       sys
hixosfs       60,603     3,364      19,066
Ext2 xattr    235,645    11,393     65,532
HSFS          62,451     4,089      18,347
SQLite        68,746     2,924      7,352
MySQL         189,012    112,247    61,000

Figure 2.
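Each real value in the table of averages is meant to summarize the ten runs listed for that system. As a worked check, the following minimal C sketch computes the arithmetic mean of the ten HSFS timings for the find test. Reading the decimal commas of the tables as decimal points is an assumption, and the names mean and hsfs_find are hypothetical.

```c
#include <stddef.h>

/* Arithmetic mean of n timing samples (0.0 for an empty set). */
double mean(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return n ? sum / (double)n : 0.0;
}

/* The ten HSFS "find a value" timings from the table of raw measurements,
 * with the decimal comma read as a decimal point (assumption). */
static const double hsfs_find[10] = {
    60.286, 70.285, 70.704, 60.339, 60.296,
    60.293, 60.333, 61.271, 60.299, 60.404
};
```

Calling mean(hsfs_find, 10) gives about 62.451, which agrees with the real value reported for HSFS in the table of averages.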
Time to read all values: 20000 empty files

Hixosfs    Ext2 xattr    HSFS      SQLite    MySQL
23,844     69,463        32,152    69,973    188,952
23,922     69,867        32,106    69,748    188,606
23,868     69,851        32,237    69,451    188,754
23,987     97,214        32,266    69,790    188,938
23,780     97,350        32,252    69,489    188,995
23,900     69,755        32,203    70,573    189,010
24,070     97,035        32,188    68,122    189,383
23,966     69,968        32,164    69,234    189,270
24,003     97,103        32,318    69,235    189,354
23,927     97,501        32,272    96,567    189,172
Average time to read all values: 20000 empty files

              real       user       sys
hixosfs       23,927     2,092      5,654
Ext2 xattr    97,006     8,978      18,598
HSFS          32,216     2,578      7,145
SQLite        69,563     3,016      7,196
MySQL         188,739    105,175    60,744
Figure 3. Read all values

The statistical test applied is Student's t-test, working on the average values of the metric under consideration. As the averages of the observed values show, and as the Student test confirms, HSFS represents a viable alternative to ext2 with xattr and to common databases. The only filesystem that appears to perform better is hixosfs, which however does not offer data compression as hsfs does.

VII. FUTURE WORKS

Today, most medical centers have completed or are completing the transition to digital technologies. At its heart is the PACS (Picture Archiving and Communications System), a system capable of receiving digital images from the workstations via DICOM, archiving them, sending them to other DICOM entities, responding to queries on the studies from other network nodes, and producing transportable media containing digital images. An interesting future development of this work could be to test HSFS within a PACS system, measuring its efficiency and possibly assessing the advantages for telereporting services.

Furthermore, since Squashfs is a read-only filesystem, a possible future evolution might concern the identification of techniques that allow, with operations transparent to the user, the values of the additional tags in the file inode to be changed.

From the assessment of the experimentation carried out, it is clearly advantageous to include the DICOM attribute values of interest in the file inode, since this gives more immediate access than reading the same values directly from the file headers. By contrast, the choice between implementing the file selection, based on the values passed as parameters to the command finddicom, inside the RETAG system call or in a dedicated command makes no substantial difference. However, on a system less minimal than the one chosen for the trial (Mvux), noise conditions could come into play that would favor a solution acting entirely in kernel space, that is, with the file selection implemented inside the RETAG system call.

REFERENCES

[1] DICOM Resources, http://star.pst.qub.ac.uk/idl/DICOM Resources.html, 2005.
[2] The SquashFS archives, http://sourceforge.net/projects/squashfs/files/squashfs/, 2011.
[3] Official SquashFS LZMA, http://www.squashfs-lzma.org/, 2006.
[4] Card, Ts'o and Tweedie, Design and Implementation of the Second Extended Filesystem, http://e2fsprogs.sourceforge.net/ext2intro.html.
[5] Sqlite, http://www.sqlite.org/, Home Page, 03.2011.
[6] Kernel, Torvalds, www.kernel.org, Home Page, 03.2011.
[7] Hixosfs, Hixos, www.di.uniba.it/~hixos/hixosfs, Home Page, 03.2009.
[8] Corriero and Zhupa, An embedded filesystem for mobile and ubiquitous multimedia, MMEDIA 2010, 978-1-4244-7277-2.
[9] Corriero and Zhupa, Hixosfs for ubiquitous commerce through bluetooth, FUTURETECH 2010, 978-1-4244-6948-2.
[10] Corriero and Cozza, The hixosfs music approach vs common musical file management solutions, SIGMAP 2009, 978-989-674-007-8, pp. 189-193.
[11] Corriero and Cozza, Hixosfs Music: A Filesystem in Linux Kernel Space for Musical Files, MMEDIA 2009, 978-0-7695-3693-4.
[12] Rubini, The virtual filesystem in Linux, Kernel Korner, http://www.linux.it/~rubini/docs/vfs/vfs.html.
[13] Bovet and Cesati, Understanding the Linux Kernel.
[14] Corriero, Covino, D'Amore and Pani, HSFS: a compress filesystem for metadata files, International Conference on Digital Information Processing and Communications, ICDIPC 2011, http://www.sdiwc.net/public/cz/public/index.php, CCIS 189, ISBN 978-3-642-22409-6, pp. 289-300, Springer.