hack.txt for Citadel/UX
 written by Art Cancro (ajc@uncnsrd.mt-kisco.ny.us)
   
   Much of this document is borrowed from the original hack.doc from
Citadel-CP/M and Citadel-86, because many of the concepts are the same.  Hats
off to whoever wrote the original, for a fine document that inspired the
implementation of Citadel for Unix. 
 
   Note that this document is really out of date.  It doesn't cover anything
about the threaded server architecture or any of the network stuff.  What is
covered here is the basic architecture of the databases.
 
   But enough of the preamble.  Here's how Citadel/UX works :)
  
   Here are the major databases to be discussed:
  
  msgmain         The big circular file that contains message text
  quickroom       Contains room info such as room names, stats, etc.
  fullroom        One fullrm file per room: message numbers and pointers.
  usersupp        Contains info for each user on the system.
 
   The fundamental structure of the system differs greatly from the way
Citadels used to work.  Citadel now depends on a record manager or database
manager of some sort.  Thanks to the API which is in place for connecting to
a data store, any record manager may be used as long as it supports the
storage and retrieval of large binary objects (blobs) indexed by unique keys.
Please see database.c for more information on data store primitives.
 
   The message base (MSGMAIN) is a big file of messages indexed by the message
number.  Messages are numbered consecutively and start with an FF (hex)
byte.  Except for this FF start-of-message byte, all bytes in the message
file have the high bit set to 0.  This means that in principle it is
trivial to scan through the message file and locate message N if it
exists, or return error.  (Complexities, as usual, crop up when we
try for efficiency...)
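
   As a rough illustration of the "trivial in principle" scan, here is a
minimal sketch in C.  It assumes the message file (or a chunk of it) has been
read into memory; the function name next_message_start() is made up for this
example and does not come from the Citadel source.

```c
#include <stddef.h>

/* Hypothetical helper: return the offset of the next 0xFF start-of-message
 * byte at or after `pos`, or -1 if there is none.  Because every other byte
 * in the file has its high bit clear, any 0xFF must begin a message. */
long next_message_start(const unsigned char *buf, size_t len, size_t pos)
{
    for (size_t i = pos; i < len; i++) {
        if (buf[i] == 0xFF)
            return (long)i;
    }
    return -1;
}
```

   Calling it repeatedly, each time starting one byte past the previous hit,
walks the messages in order.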
 
    Each room is basically just a list of message numbers.  Each time
we enter a new message in a room, its message number is appended to the end
of the list.  If an old message is to be expired, we must delete it from the
message base.  Reading a room is just a matter of looking up the messages
one by one and sending them to the client for display, printing, or whatever.
 
    Implementing the "new message" function is also trivial in principle:
we just keep track, for each caller in the userlog, of the highest-numbered
message which existed on the *last* call.  (Remember, message numbers are
simply assigned sequentially each time a message is created.  This
sequence is global to the entire system, not local within a room.)  If
we ignore all message-numbers in the room less than this, only new messages
will be printed.  Voila! 
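
    A small sketch of that idea, assuming a room's message list is held as
an array of message numbers in the order they were appended (so ascending).
The name first_new_index() is hypothetical.  The message whose number equals
the remembered mark already existed on the last call, so it is treated as old.

```c
#include <stddef.h>

/* Hypothetical helper: given a room's message numbers in ascending order
 * and the highest message number that existed on the caller's last call,
 * return the index of the first new message (== count if none are new). */
size_t first_new_index(const long *room_msgs, size_t count, long lastseen)
{
    size_t i = 0;
    while (i < count && room_msgs[i] <= lastseen)
        i++;
    return i;
}
```

    Everything from the returned index to the end of the list is new to the
caller.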
 
 
		Message format on disk	(MSGMAIN)

   As discussed above, each message begins with an FF byte.
 
   The next byte denotes whether this is an anonymous message.  The codes
available are MES_NORMAL, MES_ANON, or MES_AN2 (defined in citadel.h).
 
   The third byte is a "message type" code.  The following codes are defined:
 0 - "Traditional" Citadel format.  Message is to be displayed "formatted."
 1 - Plain pre-formatted ASCII text (otherwise known as text/plain)
 4 - MIME formatted message.  The text of the message which follows is
     expected to begin with a "Content-type:" header.
 
   After these three opening bytes, the remainder of
the message consists of a sequence of character strings.  Each string
begins with a type byte indicating the meaning of the string and is
ended with a null.  All strings are printable ASCII: in particular,
all numbers are in ASCII rather than binary.  This is for simplicity,
both in implementing the system and in implementing other code to
work with the system.  For instance, a database driven off Citadel archives
can do wildcard matching without worrying about unpacking binary data such
as message IDs first.  To provide later downward compatibility,
all software should be written to IGNORE fields not currently defined.

		  The type bytes currently defined are: 	

BYTE	Mnemonic	Comments

A	Author		Name of originator of message.
B	Phone number	The dialup number of the system this message
			originated on.  This is optional, and is only
			defined for helping implement C86Net gateways.
D	Destination	Contains name of the system this message should
			be sent to, for mail routing (private mail only).
E	Extended ID	A persistent alphanumeric Message ID used for
			network replication.  When a message arrives that
			contains an Extended ID, any existing messages which
			contain the same Extended ID and are *older* than this
			message should be deleted.  If there exist any messages
			with the same Extended ID that are *newer*, then this
			message should be dropped.
G	Gateway domain	This field is provided solely for the implementation
                        of C86Net gateways, and holds the C86Net domain of
                        the system this message originated on.  Unless you're
                        implementing such a gateway, there's no need to even
                        bother with this field.
H	HumanNodeName	Human-readable name of system message originated on.
I	Original ID	A 32-bit integer containing the message ID on the
			system the message *originated* on.
M	Message Text	Normal ASCII, newlines separated by CR's or LF's,
                        null terminated as always.
N	Nodename	Contains node name of system message originated on.
O	Room		Room of origin.
P	Path		Complete path of message, as in the UseNet news
			standard.  A user should be able to send Internet mail
			to this path. (Note that your system name will not be
			tacked onto this until you're sending the message to
			someone else)
R	Recipient	Only present in Mail messages.
S       Special field   Only meaningful for messages being spooled over a
                        network.  Usually means that the message isn't really
                        a message, but rather some other network function:
                        -> "S" followed by "FILE" (followed by a null, of
                        course) means that the message text is actually an
                        IGnet/Open file transfer.
T	Date/Time	A 32-bit integer containing the date and time of
                        the message in standard UNIX format (the number
                        of seconds since January 1, 1970 GMT).
U       Subject         Optional.  Developers may choose whether they wish to
                        generate or display subject fields.  Citadel/UX does
                        not generate them, but it does print them when found.
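
   The Extended ID rule in the E field can be sketched as follows.  This is
only an illustration: the function name and the idea of comparing T-field
timestamps are assumptions, and the text above does not say what to do when
two messages carry the same timestamp, so in that case this sketch leaves both
alone.

```c
#include <stddef.h>

/* Hypothetical helper.  `existing_times` holds the timestamps of messages
 * already on disk that share the incoming message's Extended ID.
 * On return, delete_existing[i] is 1 for each existing message that is
 * older than the incoming one.  The return value is 1 if the incoming
 * message should be kept, 0 if a newer copy exists and it should be
 * dropped.  Equal timestamps trigger neither rule (an assumption). */
int apply_extended_id_rule(long incoming_time, const long *existing_times,
                           size_t count, int *delete_existing)
{
    int keep_incoming = 1;
    for (size_t i = 0; i < count; i++) {
        delete_existing[i] = existing_times[i] < incoming_time;
        if (existing_times[i] > incoming_time)
            keep_incoming = 0;
    }
    return keep_incoming;
}
```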
  
			EXAMPLE

Let <FF> be a 0xFF byte, and <0> be a null (0x00) byte.  Then a message
which prints as...

Apr 12, 1988 23:16 From Test User In Network Test> @lifesys (Life BBS)
Have a nice day!

 might be stored as...
<FF><40><0>I12345<0>Pneighbor!lifesys!test_user<0>T576918988<0>    (continued)
-----------|Mesg ID#|--Message Path---------------|--Date------

AThe Test User<0>ONetwork Test<0>Nlifesys<0>HLife BBS<0>MHave a nice day!<0>
|-----Author-----|-Room name-----|-nodename-|Human Name-|--Message text-----

 Weird things can happen if fields are missing, especially if you use the
networker.  The date, author, room, and nodename may be in any order, but
the leading fields and the message text must remain in the same place.  The
H field looks better when it is placed immediately after the N field.
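
 To make the format concrete, here is a minimal parsing sketch.  It assumes
one message sits in a memory buffer starting at its FF byte, and that the M
field comes last.  The struct and function names are invented for this
example.  Fields with unknown type bytes are skipped, as recommended above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result structure for this sketch. */
struct parsed_msg {
    unsigned char anon_type;    /* second byte: MES_NORMAL, MES_ANON, ... */
    unsigned char format_type;  /* third byte: 0, 1 or 4                  */
    const char *field[26];      /* indexed by type byte - 'A', or NULL    */
};

/* Parse one message beginning at buf[0] == 0xFF.  Returns the number of
 * bytes consumed (through the null ending the M field), or 0 if the
 * message is malformed or has no M field. */
size_t parse_message(const unsigned char *buf, size_t len,
                     struct parsed_msg *out)
{
    if (len < 3 || buf[0] != 0xFF)
        return 0;
    out->anon_type = buf[1];
    out->format_type = buf[2];
    for (int i = 0; i < 26; i++)
        out->field[i] = NULL;

    size_t pos = 3;
    while (pos < len) {
        unsigned char type = buf[pos++];
        const unsigned char *end = memchr(buf + pos, 0, len - pos);
        if (end == NULL)
            return 0;                       /* string is not terminated */
        if (type >= 'A' && type <= 'Z')
            out->field[type - 'A'] = (const char *)(buf + pos);
        /* any other type byte is ignored, but its string is still skipped */
        pos = (size_t)(end - buf) + 1;
        if (type == 'M')
            return pos;                     /* message text ends the message */
    }
    return 0;
}
```

 The pointers in field[] point into the caller's buffer, so the buffer has to
outlive the parsed_msg.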

			    Networking

Citadel nodes network by sharing one or more rooms. Any Citadel node
can choose to share messages with any other Citadel node, through the sending
of spool files.  The sending system takes all messages it hasn't sent yet, and
spools them to the receiving system, which posts them in the rooms.

Complexities arise primarily from the possibility of densely connected
networks: one does not wish to accumulate multiple copies of a given
message, which can easily happen.  Nor does one want to see old messages
percolating indefinitely through the system.

This problem is handled by keeping track of the path a message has taken over
the network, like the UseNet news system does.  When a system sends out a
message, it adds its own name to the bang-path in the <P> field of the
message.  If no path field is present, it generates one.  
   
With the path present, all the networker has to do to assure that it doesn't
send another system a message it's already received is check the <P>ath field
for that system's name somewhere in the bang path.  If it's present, the system
has already seen the message, so we don't send it.  (Note that the current
implementation does not allow for "loops" in the network -- if you build your
net this way you will see lots of duplicate messages.)
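
The path check might look roughly like this.  Both function names are made up
for the sketch.  Adding the local name at the front of the path follows the
UseNet convention and matches the example message above, but that ordering is
an assumption here.  Note also that the sketch treats the final component (the
user name) like any other, so a node that shares its name with the author
would match.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if `node` appears as a complete
 * '!'-separated component of the bang path, else 0. */
int path_contains_node(const char *path, const char *node)
{
    size_t nlen = strlen(node);
    const char *p = path;
    while (*p != '\0') {
        size_t clen = strcspn(p, "!");      /* length of this component */
        if (clen == nlen && strncmp(p, node, nlen) == 0)
            return 1;
        p += clen;
        if (*p == '!')
            p++;
    }
    return 0;
}

/* Hypothetical helper: write "node!path" into out.  Returns 1 on success,
 * 0 if the result did not fit in outsz bytes. */
int prepend_node(char *out, size_t outsz, const char *node, const char *path)
{
    int n = snprintf(out, outsz, "%s!%s", node, path);
    return n >= 0 && (size_t)n < outsz;
}
```

A networker would skip any neighbor for which path_contains_node() returns 1.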

The above discussion should make the function of the fields reasonably clear:

 o  Travelling messages need to carry original message-id, system of origin,
    date of origin, author, and path with them, to keep reproduction and
    cycling under control.

(Not coincidentally) the format used to transmit messages for networking
purposes is precisely that used on disk, except that there may be any amount
of garbage between the null ending a message and the <FF> starting the next
one.  This allows greater compatibility if slight problems crop up. The current
distribution includes netproc.c, which is basically a database replicator;
please see network.txt on its operation and functionality (if any).

			Portability issues
 
 At this point, all hardware-dependent stuff has been removed from the 
system.  On the server side, most of the OS-dependent stuff has been isolated
into the sysdep.c source module.  The server should compile on any POSIX
compliant system with a full pthreads implementation and TCP/IP support.  In
the future, we may try to port it to non-POSIX systems as well.
 
 On the client side, it's also POSIX compliant.  The client even seems to
build ok on non-POSIX systems with porting libraries (such as the Cygnus
Win32 stuff).
  

                   "Room" records (quickroom)
 
The rooms are basically indices into msgmain, the message database.
As noted in the overview, each is essentially an array of pointers into
the message file.  Each pointer is a 32-bit message ID number
(we will wrap around at 32 bits for these purposes).

Since messages are numbered sequentially, the
set of messages existing in msgmain will always form a continuous
sequence at any given time.

That should be enough background to tackle a full-scale room.  From citadel.h:

struct quickroom {
	char QRname[20];		/* Max. len is 19, plus null term   */
	char QRpasswd[10];		/* Only valid if it's a private rm  */
	long QRroomaide;		/* User number of room aide         */
	long QRhighest;			/* Highest message NUMBER in room   */
	long QRgen;			/* Generation number of room        */
	unsigned QRflags;		/* See flag values below            */
	char QRdirname[15];		/* Directory name, if applicable    */
	char QRfloor;			/* (not yet implemented)            */
		};

#define QR_BUSY		1		/* Room is being updated, WAIT      */
#define QR_INUSE	2		/* Set if in use, clear if avail    */
#define QR_PRIVATE	4		/* Set for any type of private room */
#define QR_PASSWORDED	8		/* Set if there's a password too    */
#define QR_GUESSNAME	16		/* Set if it's a guessname room     */
#define QR_DIRECTORY	32		/* Directory room                   */
#define QR_UPLOAD	64		/* Allowed to upload                */
#define QR_DOWNLOAD	128		/* Allowed to download              */
#define QR_VISDIR	256		/* Visible directory                */
#define QR_ANONONLY	512		/* Anonymous-Only room              */
#define QR_ANON2	1024		/* Anonymous-Option room            */
#define QR_NETWORK	2048		/* Shared network room              */
#define QR_PREFONLY	4096		/* Preferred users only             */

[Note that all components start with "QR" for quickroom, to make sure we
 don't accidentally use an offset in the wrong structure. Be very careful
 also to get a meaningful sequence of components --
 some C compilers don't check this sort of stuff either.]

QRgen handles the problem of rooms which have died and been reborn
under another name.  This will be clearer when we get to the userlog.
For now, just note that each room has a generation number which is
bumped by one each time it is recycled.

QRflags is just a bag of bits recording the status of the room.  The
defined bits are:

QR_BUSY		This is to ensure that two processes don't update the same
		record at the same time, even though this hasn't been
		implemented yet.
QR_INUSE	1 if the room is valid, 0 if it is free for re-assignment.
QR_PRIVATE	1 if the room is not visible by default, 0 for public.
QR_PASSWORDED	1 if entry to the room requires a password.
QR_GUESSNAME	1 if the room can be reached by guessing the name.
QR_DIRECTORY	1 if the room is a window onto some disk/userspace, else 0.
QR_UPLOAD	1 if users can upload into this room, else 0.
QR_DOWNLOAD	1 if users can download from this room, else 0.
QR_VISDIR	1 if users are allowed to read the directory, else 0.
QR_ANONONLY	1 if all messages are to receive the "****" anon header.
QR_ANON2	1 if the user will be asked if he/she wants an anon message.
QR_NETWORK	1 if this room is shared on a network, else 0.
QR_PREFONLY	1 if the room is only accessible to preferred users, else 0.

QRname is just an ASCII string (null-terminated, like all strings)
giving the name of the room.

QRdirname is meaningful only in QR_DIRECTORY rooms, in which case
it gives the directory name to window.

QRpasswd is the room's password, if it's a QR_PASSWORDED room. Note that
if QR_PASSWORDED or QR_GUESSNAME are set, you MUST also set QR_PRIVATE.
QR_PRIVATE by itself designates invitation-only. Do not EVER set all three
flags at the same time.
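
The flag rules in the paragraph above can be checked mechanically.  A minimal
sketch, reusing the flag values from citadel.h as listed earlier; the function
name room_flags_valid() is hypothetical.

```c
#define QR_PRIVATE      4       /* values as listed above */
#define QR_PASSWORDED   8
#define QR_GUESSNAME    16

/* Hypothetical helper: return 1 if the privacy bits form a legal
 * combination, 0 otherwise. */
int room_flags_valid(unsigned flags)
{
    int priv  = (flags & QR_PRIVATE) != 0;
    int pass  = (flags & QR_PASSWORDED) != 0;
    int guess = (flags & QR_GUESSNAME) != 0;

    if ((pass || guess) && !priv)
        return 0;               /* these require QR_PRIVATE as well */
    if (priv && pass && guess)
        return 0;               /* never all three at once          */
    return 1;
}
```

QR_PRIVATE alone (invitation-only) and no privacy bits at all (public) both
pass the check.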

QRroomaide is the user number of the room's room-aide (or zero if the room
doesn't have a room aide). Note that if a user is deleted, his/her user number
is never used again, so you don't have to worry about a new user getting the
same user number and accidentally becoming a room-aide of one or more rooms.

The only field new to us in quickroom is QRhighest, recording the
most recent message in the room.  When we are searching for rooms with
messages a given caller hasn't seen, we can check this number
and avoid a whole lot of extra disk accesses.
 
   There used to also be a structure called "fullroom" which resided in one
file for each room on the system.  This has been abandoned in favour of
"message lists" which are variable sized and simply contain zero or more
message numbers.  The message numbers, in turn, point to messages on disk.

			User records (usersupp)

This is the fun one.  Get some fresh air and plug in your thinking cap
first.	(Time, space and complexity are the eternal software rivals.
We've got lots of log entries times lots of messages spread over up to nnn
rooms to worry about, and with multitasking, disk access time is important...
so perforce, we opt for complexity to keep time and space in bounds.)

To understand what is happening in the log code takes a little persistence.
You also have to disentangle the different activities going on and
tackle them one by one.

 o	We want to remember some random things such as terminal screen
	size, and automatically set them up for each caller at login.

 o	We want to be able to locate all new messages, and only new
	messages, efficiently.	Messages should stay new even if it
	takes a caller a couple of calls to get around to them.

 o	We want to remember which private rooms a given caller knows
	about, and treat them as normal rooms.	This means mostly
	automatically seeking out those with new messages.  (Obviously,
	we >don't< want to do this for unknown private rooms!)	This
	has to be secure against the periodic recycling of rooms
	between calls.

 o	We want to support private mail to a caller.

 o	We want to provide some protection of this information (via
	passwords at login) and some assurance that messages are from
	who they purport to be from (within the system -- one shouldn't
	be able to forge messages from established users).

Lifting another page from citadel.h gives us:

struct usersupp {			/* User record                      */
	int USuid;			/* uid account is logged in under   */
	char password[20];		/* password                         */
	long lastseen[MAXROOMS];	/* Last message seen in each room   */
	char generation[MAXROOMS];	/* Generation # (for private rooms) */
	char forget[MAXROOMS];		/* Forgotten generation number      */
	unsigned flags;			/* See US_ flags below              */
	int screenwidth;		/* For formatting messages          */
	int timescalled;		/* Total number of logins           */
	int posted;			/* Number of messages posted (ever) */
	char fullname[26];		/* Bulletin Board name for messages */
	char axlevel;			/* Access level                     */
	long usernum;			/* Eternal user number              */
	long lastcall;			/* Last time the user called        */
				};

#define US_PERM		1		/* Permanent user; don't scroll off */
#define US_LASTOLD	16		/* Print last old message with new  */
#define US_EXPERT	32		/* Experienced user		    */
#define US_UNLISTED	64		/* Unlisted userlog entry           */
#define US_NOPROMPT	128		/* Don't prompt after each message  */
#define US_PREF		1024		/* Preferred user                   */
 
Looks simple enough, doesn't it?  One topic at a time:

 Random configuration parameters:
-screenwidth is the caller's screen width.  We format all messages to this
width, as best we can.  flags is another bit-bag, recording things such as
whether the caller wants a prompt after each message, or wants to suppress
the little automatic hints all through the system.
 
  Attachments, names & numbers:
-USuid is the uid the account was established under. For most users it will
be the same as BBSUID, but it won't be for users that logged in from the shell.
-fullname is the user's full login name.
-usernum is the user's ID number.  It is unique to the entire system:
once someone has a user number, it is never used again after the user is
deleted. This allows an easy way to numerically represent people.
-password is the user's password.
-axlevel is the user's access level, so we know who's an Aide, who's a problem
user, etc.  These are defined and listed in the system.

  Feeping Creatures:
-timescalled is the number of times the user has called.
-posted is the number of messages the user has posted, public or private.

  Misc stuff:
-lastcall holds the date and time (standard Unix format) the user called, so
we can purge people who haven't called in a given amount of time.

  Finding new messages:
This is the most important.  Thus, it winds up being the most
elaborate.  Conceptually, what we would like to do is mark each
message with a bit after our caller has read it, so we can avoid
printing it out again next call.  Unfortunately, with lots of user
entries this would require adding lots of bits to each message... and
we'd wind up reading off disk lots of messages which would never
get printed.  So we resort to approximation and a small table.

The approximation comes in doing things at the granularity of
rooms rather than messages.  Messages in a given room are "new"
until we visit it, and "old" after we leave the room... whether
we read any of them or not.  This can actually be defended: anyone
who passes through a room without reading the contents probably just
isn't interested in the topic, and would just as soon not be dragged
back every visit and forced to read them.  Given that messages are
numbered sequentially, we can simply record the most recent message ID#
of each room as of the last time we visited it. Very simple.

Putting it all together, we can now compute whether a given room
has new messages for our current caller without reading the room's
message list (formerly fullroom) at all:

 > We get the usersupp.lastseen[] for the room in question
 > We compare this with the room's quickroom.QRhighest, which tells us
   what the most recent message in the room is currently.
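
In code, the comparison comes down to one line.  The cut-down structs below
carry only the fields involved, the MAXROOMS value is a placeholder, and the
names ending in _min are invented for this sketch.

```c
#define MAXROOMS 64                 /* placeholder value for this sketch */

/* Cut-down stand-ins for the real structures in citadel.h. */
struct quickroom_min {
    long QRhighest;                 /* highest message number in room   */
};
struct usersupp_min {
    long lastseen[MAXROOMS];        /* last message seen in each room   */
};

/* Hypothetical helper: 1 if room number `roomnum` holds messages the
 * user has not seen yet, else 0. */
int room_has_new(const struct usersupp_min *user,
                 const struct quickroom_min *room, int roomnum)
{
    return room->QRhighest > user->lastseen[roomnum];
}
```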


	     REMEMBERING WHICH PRIVATE ROOMS TO VISIT

This looks trivial at first glance -- just record one bit per room per
caller in the log records.  The problem is that rooms get recycled
periodically, and we'd rather not run through all the log entries each
time we do it.	So we adopt a kludge which should work 99% of the time.

As previously noted, each room has a generation number, which is bumped
by one each time it is recycled.  What was not noted is that this generation
number runs from 0 to 127 (and then wraps around and starts over).
  When someone visits a room, we set usersupp.generation for the room
equal to that of the room.  This flags the room as being available.
If the room gets recycled, on our next visit the two generation numbers
will no longer match, and the room will no longer be available -- just
the result we're looking for.  (Naturally, if a room is public,
all this stuff is irrelevant.)

This leaves only the problem of an accidental matchup between the two
numbers giving someone access to a Forbidden Room.  We can't eliminate
this danger completely, but it can be reduced to insignificance for
most purposes.	(Just don't bet megabucks on the security of this system!)
Each time someone logs in, we set all "wrong" generation numbers to -1.
So the room must be recycled 127 times before an accidental matchup
can be achieved.  (We do this for all rooms, INUSE or dead, public
or private, since any of them may be reincarnated as a Forbidden Room.)
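
The generation bookkeeping can be sketched like this.  All four function
names are hypothetical.  The sketch stores the per-user generations as signed
char so that -1 is representable everywhere (plain char is unsigned on some
platforms), and takes the room generations as a simple array rather than
reading them from quickroom records.

```c
#include <stddef.h>

/* Bump a room's generation when it is recycled: 0..127, then wrap. */
long next_generation(long gen)
{
    return (gen + 1) % 128;
}

/* When the user visits a room, remember its current generation. */
void visit_room(signed char *user_gen, const long *room_gen, size_t room)
{
    user_gen[room] = (signed char)room_gen[room];
}

/* 1 if the user's remembered generation matches the room's, else 0. */
int knows_room(const signed char *user_gen, const long *room_gen, size_t room)
{
    return user_gen[room] == room_gen[room];
}

/* At login, set every non-matching generation to -1.  Since room
 * generations only run 0..127, a -1 can never match by accident. */
void reset_stale_generations(signed char *user_gen, const long *room_gen,
                             size_t nrooms)
{
    for (size_t i = 0; i < nrooms; i++) {
        if (user_gen[i] != room_gen[i])
            user_gen[i] = -1;
    }
}
```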

Thus, for someone to accidentally be led to a Forbidden Room, they
must establish an account on the system, then not call until some room
has been recycled 127 to 128 times; that room must then be
reincarnated as a Forbidden Room; and that someone must now call back
(having not scrolled off the userlog in the meantime) and read new
messages.  The last clause is about the only probable one in the sequence.
The danger of this is much less than the danger that someone will
simply guess the name of the room outright (if it's a guess-name room)
or some other human loophole.

                     FORGOTTEN ROOMS

  This is exactly the opposite of private rooms. When a user chooses to
forget a room, we put the room's generation number in usersupp.forget for
that room. When doing a <K>nown rooms list or a <G>oto, any matchups cause
the room to be skipped. Very simple.
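
  A sketch of the forget logic, under the same assumptions as the private
room case: per-user values are kept as signed char, room generations are
passed in as an array, and the function names are invented here.

```c
#include <stddef.h>

/* Hypothetical helper: mark `room` as forgotten by copying its current
 * generation number into the user's forget table. */
void forget_room(signed char *forget, const long *room_gen, size_t room)
{
    forget[room] = (signed char)room_gen[room];
}

/* 1 if the user forgot this incarnation of the room, else 0.  Once the
 * room is recycled its generation changes and the match goes away. */
int room_is_forgotten(const signed char *forget, const long *room_gen,
                      size_t room)
{
    return forget[room] == room_gen[room];
}
```

  A <K>nown rooms listing would simply skip every room for which
room_is_forgotten() returns 1.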

		     SUPPORTING PRIVATE MAIL

   Can one have an elegant kludge?  This must come pretty close.
 
   Private mail is sent and received in the Mail> room, which otherwise
behaves pretty much as any other room.	To make this work, we have a
separate Mail> room for each user behind the scenes.  The actual room name
in the database looks like "0000001234.Mail" (where '1234' is the user
number) and it's flagged with the QR_MAILBOX flag.  The user number is
stripped off by the server before the name is presented to the client.
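
   The naming scheme might be handled along these lines.  The ten-digit,
zero-padded prefix is taken from the "0000001234.Mail" example above; the
function names are hypothetical, and the real server may do this differently.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the database room name for a user's mailbox,
 * e.g. user 1234 gives "0000001234.Mail".  Returns 1 on success, 0 if the
 * name did not fit in outsz bytes. */
int mailbox_room_name(char *out, size_t outsz, long usernum)
{
    int n = snprintf(out, outsz, "%010ld.Mail", usernum);
    return n > 0 && (size_t)n < outsz;
}

/* Hypothetical helper: return the name to show the client.  If the name
 * starts with ten digits and a dot, skip that prefix; otherwise return
 * the name unchanged. */
const char *display_room_name(const char *dbname)
{
    for (size_t i = 0; i < 10; i++) {
        if (dbname[i] < '0' || dbname[i] > '9')
            return dbname;      /* also stops safely at a short name's null */
    }
    return dbname[10] == '.' ? dbname + 11 : dbname;
}
```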

   This requires a little fiddling to get things just right.  For example,
make_message() has to be kludged to ask for the name of the recipient
of the message whenever a message is entered in Mail>.	But basically
it works pretty well, keeping the code and user interface simple and
regular.


		   PASSWORDS AND NAME VALIDATION
 
  This has changed a couple of times over the course of Citadel's history.  At
this point it's very simple, again due to the fact that record managers are
used for everything.    The user file (usersupp) is indexed using the user's
name, converted to all lower-case.  Searching for a user, then, is easy.  We
just lowercase the name we're looking for and query the database.  If no
match is found, it is assumed that the user does not exist.
   
  This makes it difficult to forge messages from an existing user.  (Fine
point: nonprinting characters are converted to printing characters, and
leading, trailing, and double blanks are deleted.)
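
  The name cleanup could be sketched as below.  The function name user_key()
is made up, and since the text does not say which printing character replaces
a nonprinting one, this sketch assumes a blank.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: turn a user name into a lookup key.  Lowercases the
 * name, turns nonprinting characters into blanks (an assumption), drops
 * leading and trailing blanks, and collapses runs of blanks into one.
 * The result is truncated if it does not fit in outsz bytes. */
void user_key(const char *name, char *out, size_t outsz)
{
    size_t o = 0;
    int pending_space = 0;

    if (outsz == 0)
        return;
    for (const char *p = name; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (!isprint(c))
            c = ' ';
        if (c == ' ') {
            pending_space = (o > 0);    /* ignore blanks before first char */
            continue;
        }
        if (pending_space && o + 1 < outsz)
            out[o++] = ' ';             /* emit one blank for the whole run */
        pending_space = 0;
        if (o + 1 < outsz)
            out[o++] = (char)tolower(c);
    }
    out[o] = '\0';
}
```

  Two names that differ only in case or spacing produce the same key, which
is what makes it hard to register a look-alike of an existing user.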