|
|||||||||||||||||||
| В наши расчеты не входило преимущество долгой жизни. - М. Робеспьер | |||||||||||||||||||
Large, Multilingual Web-site Maintenance: The US-Russian Friends and Partners Database Approach.Natasha Bulashova
(
natasha@www.friends-partners.ru) Greg Cole (gcole@friends-partners.org) Table of ContentsIntroductionThe US Russian Friends and Partners project, based at the University of Tennessee, Knoxville Center for International Networking Initiatives and at the Friends and Partners Foundation in Moscow, has been active for nearly five years, promoting a grassroots effort to further US-Russian collaboration and exchange using communications and information technologies. Beyond the many information and communications services spawned and supported by "Friends and Partners" (F&P), the project has created a number of successful spin-off activities such as the "Russian Civic Networking Program" (developing community networks in several Russian cities with the support of the Ford and Eurasia Foundations) and the more recent NaukaNet initiative -- a $6.5 million, five year program funded by the US National Science Foundation and the Russian Ministry of Science and Technology, to develop a high-performance, next generation Internet infrastructure between US and Russian scientific communities. A major responsibility involves maintaining a large information site with several thousand pages and Internet "objects" on which it maintains information -- introducing the well-documented problems associated with large web-site maintenance. During the past two years the US and Russian F&P team members have developed a rather unique approach to web-site maintenance and presentation, driven entirely by an underlying relational database and involving dynamic construction and presentation of all web pages. While such use of databases and dynamic resource construction is not unique among corporate enterprise, it is rather rare in the nonprofit world. The system described in this paper features extensions to support special requirements of large, multicultural projects. These include support for multilingual material, creation of pages tailored for specific classes of browsers (with special attention to simultaneous support of high-end and low-end users), automated support of multiple character set encodings (important for Cyrillic text) and accommodating some user control of overall graphic design and appearance of entire web-sites through the use a facility we term "Personalities". The integration of overall page layout and design through use of a special "Styles" database enables site administrators to easily modify overall site. The object-oriented method of assembling web pages from discreet page elements also permits the introduction of random features such as random graphics, quotations, news updates, etc., into page construction. Further, this system integrates the "Dublin Core" metadata model, enabling the system to maintain uniform information about local resources and the several thousand external resources we find relevant to project activities. Use of the system has greatly simplified our work in maintaining a quality information resource. We feel that knowledge of this approach may be useful to other nonprofit organizations with similar challenges in maintaining information services. We are particularly interested in increasing awareness of these types of capabilities to nonprofit organizations who will find it increasingly difficult to compete with commercial organizations for providing quality information services to their constituencies. The system is quite inexpensive to maintain and is built exclusively from noncommercial software components that are readily available to anyone running a UNIX server. The remainder of this paper will discuss the capabilities of this system, the structure of the underlying database and a technical description of the developed software base. System CapabilitiesThe new database allows for much more dynamic information content (important for civic networking projects, etc.). The following describes the key capabilities of the database system.
Technical OverviewThe "Friends and Partners" system utilizes a surprisingly small number of software components. All of our project activities run on Sun SPARCstations running SOLARIS 2.5.1, 2.6 and 2.7. The entire system has also been built and operated on an Intel/LINUX platform as well. The four software packages required include.
With these basic software components, we have developed the following software to support our applications. First, the entire system is based on the MySQL database software (although the system could be easily modified to use almost any other relational/SQL database package). The current system is comprised of 24 individual database tables; all maintained in a single database named "Friends". This database maintains information on all Internet resources and page "objects" utilized by our site. The maintenance of all databases is handled by surprisingly simple code, written in PHP, which includes good support for MySQL (and other relational packages). Access to these applications is restricted by facilities of the PHPLIB library; access to the different database tables is controlled using standard MySQL authorization tables. Thus, there are two levels of authorization and authentication -- one at the database level; the other at the database maintenance application level. While complex, this yields a great degree of control over user access and user operations. The motivation is to permit as large a number of contributors to our activities as possible -- with careful restrictions on what individual users are allowed to do. Given all the information residing in these various database tables, the challenge is to provide for the dynamic and fast assembling of information from the database file and appropriate presentation to the user's browser. A former variant of this system required a Perl-based CGI script to be invoked every time a "Friends and Partners" page was requested from the web server. As the system evolved and became more complex, the cost of invoking Perl, making the necessary database connection, and then assembling the page from what often requires more than twenty database queries was much too high. The current approach is to maintain a special "server" program (which is written in Perl) and which listens for connections on a specific TCP/IP port, to which it responds with the assembled page. To connect to the server, we use a very small, simple and fast cgi client (a compiled object written in C). When invoked by the calling of certain URLs (using the ScriptAlias configuration directive within Apache), it connects to the server, passes its environment variables (which includes the full URL being requested by the user as well as the other environmental factors such as the users location, browser, etc.). After sending its environment variables, it waits on the server to return the HTML, which it, in turn, returns to the user. This "friends.server" program is roughly 2,600 lines of Perl code. As mentioned previously, Perl is used because of its excellent regular expression parser and its speed and flexibility with basic text manipulation. The use of the DBI module within Perl makes very simple the SQL queries and also enables us to maintain independence from a single database manager. Upon accepting a connection from the CGI client, the server first determines the record within the central "Friends" database based upon the requested URL. If it is a valid URL for which a record exists, it then begins the assembly process. This process is described in a more detailed section below. Obviously, the system is built using simple and widely used software components. The two critical elements of the system are the friends.server program and the structure of the underlying databases. The following section will describe the databases, and a later section describes the operation of the central page generation server. Database StructureThe "Friends and Partners" database system is comprised currently of 24 individual database tables. This number grows as new capabilities are added to the system. As a preface to describing the individual tables, the following section describes the components of a page generated from the databases which may help the reader better understand the underlying structure. All of the following comments refer to Figures 1a and 1b. The reader will note, the labeling of specific areas of a "web page" generated from the "Friends and Partners" database; each labeled element is a discreet element drawn from a specific database table. The friends.server program assembles the page based upon the user's request with information gathered from the individual databases. The logos at the top are drawn from definitions using the "Friends" database (label A). The pictures (label B) are drawn randomly from sets of pictures defined in the Friends database. The news items are drawn from another database (label C). The primary content for the web page (label D) is drawn from the central friends database. The Navigation list (label E) is generated from the "Navigation" table. Options for changing to a different display (reflecting different browser classes) (label F) are generated from the Friends server. The options for switching language and character set encoding (label G) are also set by the Friends server. The buttons (label H) are another form of navigation but appear at the bottom of most web pages and are drawn from the "Buttons" database. The trailer (label I) appearing at the bottom of every page is drawn from "Trailers" database. Random quotations and proverbs are not displayed in Figure 1a/b but are generated from corresponding databases for most F&P pages. ![]() Figure 1a. Friends and Partners Page ![]() Figure 1b. Friends and Partners Page With this as a general preface, a description is offered about the structure of the overall database system. While comprised of 24 individual tables, the system is driven by two primary tables called "Friends" and "Partners" (obviously named after the parent initiative). The "Partners" database is a relatively simple one and is used to maintain information on the "owners" of all material cited on any of our projects -- i.e., all resources to which we link, as well as our own local material. This database is primarily a bio/demographic profile of all authors and publishers of web content relevant to our activities. The "Partners" database is currently comprised of approximately 3,500 individuals and organizations. While used directly in the assembly of some web material, this database has more value as a means of tracking the individuals and organizations with whom we work. Indeed, the use of this database has become a critically important resource for all of our initiatives. A screen shot of the data maintenance application for "Partners" is included in Figures 2a and 2b to give the reader a better idea of the information maintained in this table. The "Friends" table is the central database of the entire F&P system. It is used to maintain information on three general types of resources.
The following describe the additional tables that are part of the Friends system. Closely related to the Central Friends database are the Friends Comments table and the FP text table.
Server OperationAs mentioned earlier in this paper, the core software which generates pages from the database is called the "Friends" server. While this program began as a CGI script that was called for every "Friends" page record, it has evolved into a full server to minimize resource requirements for page requests. The server is always running and listening on a certain TCP/IP port for connections. Upon receiving a connection, it generates HTML for the desired page and then releases the connection and continues listening for the next request. The following describes the operation of this server. Upon launching the roughly 2,600 line Perl script, it completes a set of initialization procedures, establishing a series of Perl data structures, setting the value of certain key variables and pre-loading a series of Perl hashes with values to accommodate the random selection of news items, quotations and pictures. Upon completion of the initializations, the script enters the central processing loop where it simply waits for connections on the specified TCP/IP port and completes requested transactions. Upon accepting a connection, the program accepts an incoming stream of environment variables from the connecting client. It fills an array with the environment variables which include the requested URL, the user's location, the user agent information (specifying such information as user platform and browser (if specified)), etc. Once it has received the environment variables, it has all of the information needed to generate the HTML. The next step is to determine if the requested URL is a valid "PAGE" record in the "Friends" database. The "Friends" database includes a URL index so the look-up for this information is handled quickly. If the URL, or a reasonable variation of the URL, is not found in the database (there is a simple algorithm for determining "reasonable alternatives") it next checks to see if the requested URL is a page from one or two old versions of the "Friends and Partners" site. If it finds that the URL points to an old page, it returns a redirect command to the requesting client and closes the connection. If the URL is not found in the database, nor in the old F&P sites, it sets the Friends "PAGE" record to a "We're Sorry" page and proceeds with HTML compilation. The next step is to determine the basic characteristics of the page requested -- such as language, browser class, character set encoding (if language is Russian) and "Personality" (governing certain stylistic attributes of the page). A word about how state information is carried forward in the system is important here. Since there is no good way to maintain state information that works with all browsers, we encode state information in the URL. The following state object is based at the end of the requested URL on Friends pages. (opt, mozilla, mac, russian, koi8, regal) This string indicates that the user requests a page for a Netscape Enhanced browser (mozilla); is using a Macintosh as the platform (a former version of this program formatted pages differently for different platforms); the user has requested a Russian language version of the page; has requested the KOI8 encoding and is using the "Regal" personality (which specifies certain colors, fonts, page layout, etc.) When a page is requested, any "local" URLs in the content, the navigation lists, etc. have the current state information placed at the end of those URLs. In this way, state information is carried forward. If the user visits a page for which there is no state information in the URL, reasonable values must be computed. The user's platform and browser are computed based on information in the user agent environment variable. The default language is chosen from a default for the "Friends Server" program (English for the server in the US, Russian for the server in Moscow). The encoding is similarly adopted from a preset default and the "personality" is likewise set in the browser program. The user can change these values by selecting different URLs on the given page. The next step is to obtain values for the specified Friends "PAGE" record by issuing a SQL query. Of special interest are the attributes which give the "Style" to be used for page layout and the navigation list, buttons, trailers, headers, random quotations, random news items, etc., which are associated with that page record. Note that selection of a style is determined, not only by the style attribute in the Friends record but also by the browser class carried forward from a previous request or determined by default. The "Style" record contains all of the information required for the general layout of the page including placeholders for all objects included on the page. Routines are called which, produce the text for each of these items and they are then substituted into the HTML included in the "Style" HTML. In this process, calls are also made to routines to produce text for randomized news items, quotations, and pictures. Finally, a call is made to the FPtext database with the friendsid and the requested language. If the language is not found, the program defaults to English. The HTML associated with the page can itself, have placeholders for which substitutions must be made. For example, the page may have a "LIST" request for which a list must computed and then substituted. The process of parsing the HTML associated with the page is described next. First, the HTML is split into a series of text records (splitting on carriage return, which is the expected delimiter). If the placeholder marker (##) is found, then further instructions are executed on that line. Otherwise, it is simply added to the text being assembled. Assuming that the markers precede the word "LIST" followed optionally by spaces, followed by an equal sign, followed optionally by spaces, followed by a number (indicating a "LIST" record), the program calls a list assembly routine with the provided number as the parameter. This routine takes the provided number, locates the requested "LIST" records, determines the list attributes regarding the format and sort order of the list and the reads the "Elements" database to find all records in the "Friends" database that have been attached to the list. Every attached record is read. If the requested language for that record is not found, it defaults to the English version of that record. These records are sorted and then the HTML is generated based on the attribute defining whether this is a menu list, a numbered list, an ordered list and whether the list contains title and URL only or whether it also contains a text description of each item. This routine returns the HTML for the requested list, which is then added to the assembled text. This process continues for all records that were included in the page text, thus assembling the text that is to be substituted for the "content" placeholder of the style record text. Currently, the only constructs expected within the HTML are the LIST indicators and an indicator for random quotations but we are currently adding SQL calls (and format/sort attributes) so that direct SQL queries can be imbedded within page records. Once the content has been substituted within the "Style" template, there is one remaining step. A call is made to the "Personalities" database to determine all attributes governing color, page elements, fonts, text size, link colors, etc. A single statement in Perl makes all the necessary substitutions in the assembled HTML so that the page is appropriately formatted with regards to the selected personality. There is one last step remaining for Russian language page requests. All of the Russian language content on "Friends and Partners" is stored using the KOI8 encoding. There are many users, however, who utilize different Cyrillic encodings or who do not have Cyrillic fonts installed on their system and desire to read the Russian text in a transliterated format (in which Cyrillic characters are transliterated into one or more Latin characters). If a non-KOI8 encoding is requested, the entire assembled text is passed through a conversion routine that converts from the KOI8 to the requested encoding or transliterated text. Obviously, assembling a page of HTML with the server is a resource intensive task. We maintain a timing log for all requests which shows that almost all requests which do not have to pass through the Russian encoding conversion, are satisfied in .12 - .18 seconds of a 40 MHz Sun SPARCstation 10 server. The Cyrillic conversion, however, requires 1.0 -- 1.5 seconds to complete.
Systems EvolutionWhile we in no way consider the design and implementation of the described system to be finished, we feel that it's evolution over the last five years and our own accompanying intensive experience with the associated information management has given us a fairly stable and well-proven design. The following provides a brief over-view of this evolution and its relevance to today's design. The Friends & Partners project was launched in early 1994 as a web site and accompanying e-mail list servers designed to attract those interested in U.S./Russian cooperation and exchange and to provide services to facilitate their communications and information sharing. This initial web system was merely a collection of static HTML files. The early interest in the project and the many contributions of information by volunteers around the world quickly led to an increasing problem managing the text files with the embedded links. It was not only difficult in maintaining the integrity of those links, but there were also challenges in adding new material to the site and determining where new resources should be placed. We also had the problem as began to make the site bilingual. At the beginning of the second year of our activities, we converted to a new model. Instead of using static HTML files we utilized our own SGML encoding which allowed us to include multiple language documents within single files and allowed us to add additional structural information to help us maintain a consistent design throughout our site. Thus, some of the SGML attributes included pointers to key graphic elements, pointers to files to be included on each page, and pointers to navigation lists associated with a given page. Associated with this new storage and organization model were CGI scripts to generate HTML from the underlying SGML base. This model resolved many of our concerns in ensuring site consistency and allowing us to change site design easily in separating discrete components of web pages into more easily managed objects and in dealing with multilingual content. However, we still had the problem with the numerous lists cataloguing resources of interest to the F&P community and embedded on the F&P pages. It was still difficult to determine how and where new material should be added to a site. During the third year of our effort, we began converting to a new model based on an underlying database. In addition to wanting to resolve the problems remaining from our current system regarding information organization and content maintenance, we wished to begin moving to a model providing metadata about all objects on which we maintained information. Because we did not have a very stable system definition at this time, we established the system on the FileMaker system (a very simple and easy to use relational database package that runs on the Macintosh and Windows platforms) on a Macintosh server. The FileMaker system allowed us to begin with what we understood about our system specifications and easily change them as we learned more. It was an excellent prototyping environment. The structure of the file maker system was in many ways similar to the system that has replaced it - the SQL database described in this paper. Of course, the problem with the Filemaker system is that we did not wish to serve web information from a Macintosh platform (although technically it would have been possible). Thus, every day we had to convert the Filemaker Pro data to a CSV format, move it to our UNIX host, and then refresh the database on the web server. Because we did not at that time have knowledge of or experience with the various SQL database platforms and because we were interested in very fast response to user requests, we implemented this database using the Berkley DB database, which accompanies most UNIX operating systems. Using the balanced binary tree structure provided by Berkley DB, we were able to provide very fast access to the database material. This system was unveiled during our fourth year of activity and has served us until the new system was finished. During the year that we used the system, we made many changes to the underlying database model and we also changed the method by which pages were generated - moving from static CGI Perl scripts to the Perl-based server which has proven much faster in satisfying user requests. Work began in November 1998 on moving from this system to the SQL-based system. The primary motivation for transitioning to the new system is that we wanted changes in any of the database tables to be immediately reflected on the web site. As our database grew to include over 6,000 resources, it became increasingly time intensive to perform the daily updates and it was both difficult and time consuming to test out changes to the system. The new system was completed in January of 1999 and everything transitioned in time for the five year anniversary of the first launching of Friends and Partners on January 19th. Note about inclusion of Dublin Core Metadata Components: As mentioned in the previous section, we decided to adopt the Dublin Core as our own metadata model as it provides a very reasonable and less time consuming model for assembling metadata. We looked at more extensive metadata models and found them simply too "expensive" in terms of personnel resources, but also required a level of expertise that we did not feel capable of committing. Data Maintenance SystemWithout question the most difficult component of the new system was the development of the data maintenance applications for the 24 tables comprising the database. To aid with the rather tedious task of designing the HTML forms and other myriad problems associated with data input validation and user authentication, we were very fortunate to find the PHP library which has good facilities for handling user authentication, data input forms, and data validation. We recommend that others involved in similar projects consider the PHP software package (for dynamic HTML page generation) and the PHP library for handling authentication and forms control. Supplementing our use of PHP LIB (and included oohforms), we created a special table called the "data dictionary" in which we maintain information on all field elements and all tables associated with this database. In addition to providing descriptive information about each field (including its representation in the database) we provide information about how that element should be represented on an HTML form (whether a text box, a text area, a select list, a check box, radio buttons, etc.), a regular expression for determining appropriate input for this field, a minimum and maximum length for the field, a help message associated with the field, and error messages associated with invalid length and a failure to meet the regular expression requirements. The use of this database in conjunction with the oohforms class of the PHP LIB library has yielded a good solution to the tedious and time consuming parts of data input, form creation, and data input validation. Each of the primary tables in the Friends system has one or more PHP "scripts" associated with its maintenance. All of these maintenance packages are included within one overall system which users find on our web site. In addition to user authorization using my SQL authorization tables, we additionally authenticate users to use certain data maintenance applications using PHP LIB facilities for that purpose. When a user attempts to "visit" an application URL, the authentication system checks to see if a "cookie" is associated with the user's browser and if that cookie meets a certain time limitation. If so, the user is admitted; if not, the user is presented with a user name and password entry screen. Assuming the user provides a valid user name and password, the user's authentication level (user, author, editor, administrator) is checked against the authentication level required for that application. If everything passes, the user is then presented with instructions on how to proceed. Once the authentication is complete, a cookie is passed to the user's browser which is used on subsequent authorization checks. Every transaction results in the cookie being time stamped. Currently a one hour limit between transactions is enforced (i.e. after an hour since the last transaction the user is required to provide the user name and password again). More information on the PHP LIB, oohforms class is available at http://www.php.net/. System Management FeaturesThe system provides a number of tools to help us better manage the database. Almost every data maintenance application provides an on-screen facility for performing common searches to help the user find requested records. The system also provides a full SQL search capability so that users can provide their own SQL based queries which yield a dynamically generated table listing the results of that query. To help us in the data entry process in maintaining valid data, the system features a number of date fields in which we can indicate when a record expires, when it was last reviewed, when it should next be reviewed, and there are additional attributes by which we can assign individuals who are responsible for reviewing each record (i.e. different individuals can be assigned responsibility for any number of database resources). Another very important feature of the overall management system is a status field which we use to "pipeline" entry processes. We can permit anyone on the web to add records to the database, but these records are marked as disabled (so that they do not appear on the site) and are also given a status indicating the need for review by an authorized user. Once a user reviews a record, they can assign any one of a number of different status levels indicating a problem with a record, question about its suitability for the site, or the need for a language translation to be provided. The use of the status information helps various members of our staff and a broader community of volunteers to share entry and maintenance responsibilities. Of course, when a record is complete it is assigned a status indicating such. Finally, the system carries a number of additional attributes which allow staff to indicate that the record should be reviewed for deletion or, alternatively, marked as "disabled" which prevents that resource from being included anywhere on the site. This latter mechanism allows us to quickly "turn off" records which need to be reviewed for some reason. Planned EnhancementsOne of the chief advantages of maintaining all web information in a database is that we can automate much of the maintenance of that information. The following represent two primary enhancements which are being implemented in the near future.
©1996 "Друзья и Партнеры"
Please
write to us
with your comments and suggestions.
|
|
||||||||||||||||||