citadel/techdoc/hack.txt

   1             ------------------------------------------------------
   2              The totally incomplete guide to Citadel internals
   3             ------------------------------------------------------
   4
   5  Citadel has evolved quite a bit since its early days, and the data structures
   6 have evolved with it.  This document provides a rough overview of how the
   7 system works internally.  For details you're going to have to dig through the
   8 code, but this'll get you started.
   9
  10
  11  DATABASE TABLES
  12  ---------------
  13
  14  As you probably already know by now, Citadel uses a group of tables stored
  15 with a record manager (usually Berkeley DB).  Since we're using a record
  16 manager rather than a relational database, all record structures are managed
  17 by Citadel.  Here are some of the tables we keep on disk:
  18
  19
  20  USER RECORDS
  21  ------------
  22
  23  This table contains all user records.  It's indexed by
  24 user name (translated to lower case for indexing purposes).  The records in
  25 this file look something like this:
  26
struct ctdluser {                       /* User record                      */
        int version;                    /* Cit vers. which created this rec */
        uid_t uid;                      /* Associate with a unix account?   */
        char password[32];              /* password (for Citadel-only users)*/
        unsigned flags;                 /* See US_ flags below              */
        long timescalled;               /* Total number of logins           */
        long posted;                    /* Number of messages posted (ever) */
        CIT_UBYTE axlevel;              /* Access level                     */
        long usernum;                   /* User number (never recycled)     */
        time_t lastcall;                /* Last time the user called        */
        int USuserpurge;                /* Purge time (in days) for user    */
        char fullname[64];              /* Name for Citadel messages & mail */
};
  40
  41  Most fields here should be fairly self-explanatory.  The ones that might
  42 deserve some attention are:
  43
  44  uid -- if uid is not the same as the uid Citadel is running as, then the
  45 account is assumed to belong to the user on the underlying Unix system with
  46 that uid.  This allows us to require the user's OS password instead of having
  47 a separate Citadel password.
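
The uid rule boils down to a single comparison.  Here is a minimal sketch of
it; the function name is made up for illustration and is not taken from the
Citadel source.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical helper (not from the Citadel source): returns true when the
 * account should be checked against the host system's password. */
bool uses_host_password(uid_t record_uid, uid_t citadel_uid)
{
        /* A uid that differs from the server's own uid marks the record as
         * belonging to the Unix account with that uid. */
        return record_uid != citadel_uid;
}
```

A record whose uid equals the uid the server runs as is a Citadel-only
account and falls back to the password field.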
  48
  49  usernum -- these are assigned sequentially, and NEVER REUSED.  This is
  50 important because it allows us to use this number in other data structures
  51 without having to worry about users being added/removed later on, as you'll
  52 see later in this document.
  53
  54
  55  ROOM RECORDS
  56  ------------
  57
  58  These are room records.  There is a room record for every room on the
  59 system, public or private or mailbox.  It's indexed by room name (also in
  60 lower case for easy indexing) and it contains records which look like this:
  61
struct ctdlroom {
        char QRname[ROOMNAMELEN];       /* Name of room                     */
        char QRpasswd[10];              /* Only valid if it's a private rm  */
        long QRroomaide;                /* User number of room aide         */
        long QRhighest;                 /* Highest message NUMBER in room   */
        time_t QRgen;                   /* Generation number of room        */
        unsigned QRflags;               /* See flag values below            */
        char QRdirname[15];             /* Directory name, if applicable    */
        long QRinfo;                    /* Info file update relative to msgs*/
        char QRfloor;                   /* Which floor this room is on      */
        time_t QRmtime;                 /* Date/time of last post           */
        struct ExpirePolicy QRep;       /* Message expiration policy        */
        long QRnumber;                  /* Globally unique room number      */
        char QRorder;                   /* Sort key for room listing order  */
        unsigned QRflags2;              /* Additional flags                 */
        int QRdefaultview;              /* How to display the contents      */
};
  79
  80  Again, mostly self-explanatory.  Here are the interesting ones:
  81
  82  QRnumber is a globally unique room ID, while QRgen is the "generation number"
  83 of the room (it's actually a timestamp).  The two combined produce a unique
  84 value which identifies the room.  The reason for two separate fields will be
  85 explained below when we discuss the visit table.  For now just remember that
  86 QRnumber remains the same for the duration of the room's existence, and QRgen
  87 is timestamped once during room creation but may be restamped later on when
  88 certain circumstances exist.
  89
  90
  91
  92  FLOORTAB
  93  --------
  94
  95  Floors.  This is so simplistic it's not worth going into detail about, except
  96 to note that we keep a reference count of the number of rooms on each floor.
  97
  98
  99
 100  MSGLISTS
 101  --------
 102  Each record in this table consists of a bunch of message numbers
 103 which represent the contents of a room.  A message can exist in more than one
 104 room (for example, a mail message with multiple recipients -- 'single instance
 105 store').  This table is never, ever traversed in its entirety.  When you do
 106 any type of read operation, the server fetches the msglist for the room
 107 you're in (using the room's ID as the index key), and then you can go ahead
 108 and read those messages one by one.
 109
 110  Each room is basically just a list of message numbers.  Each time
 111 we enter a new message in a room, its message number is appended to the end
 112 of the list.  If an old message is to be expired, we must delete it from the
 113 message base.  Reading a room is just a matter of looking up the messages
 114 one by one and sending them to the client for display, printing, or whatever.
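
Treating a room as a plain list of message numbers might look like the sketch
below.  The struct and both function names are hypothetical; the real server
keeps this list in a database record rather than in a malloc'ed array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical in-memory stand-in for one msglists record. */
struct msglist {
        long *msgs;     /* message numbers, oldest first */
        size_t count;   /* how many entries are in use   */
};

/* Append a newly posted message number.  Returns 0 on success, -1 if out
 * of memory (the list is left unchanged in that case). */
int msglist_append(struct msglist *ml, long msgnum)
{
        assert(ml != NULL);
        long *grown = realloc(ml->msgs, (ml->count + 1) * sizeof *grown);
        if (grown == NULL) return -1;
        grown[ml->count] = msgnum;
        ml->msgs = grown;
        ml->count++;
        return 0;
}

/* Drop every occurrence of msgnum (e.g. on expiry), keeping the order of
 * the remaining entries.  Returns the number of entries removed. */
size_t msglist_remove(struct msglist *ml, long msgnum)
{
        assert(ml != NULL);
        size_t kept = 0;
        for (size_t i = 0; i < ml->count; i++) {
                if (ml->msgs[i] != msgnum) ml->msgs[kept++] = ml->msgs[i];
        }
        size_t removed = ml->count - kept;
        ml->count = kept;
        return removed;
}
```

Start from a zeroed struct msglist, and free() the msgs pointer when done.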
 115
 116
 117  VISIT
 118  -----
 119
 120  This is the tough one.  Put on your thinking cap and grab a fresh cup of
 121 coffee before attempting to grok the visit table.
 122
 123  This table contains records which establish the relationship between users
 124 and rooms.  Its index is a hash of the user and room combination in question.
 125 When looking for such a relationship, the record in this table can tell the
 126 server things like "this user has zapped this room," "this user has access to
 127 this private room," etc.  It's also where we keep track of which messages
 128 the user has marked as "old" and which are "new" (which are not necessarily
 129 contiguous; contrast with older Citadel implementations which simply kept a
 130 "last read" pointer).
 131
 132  Here's what the records look like:
 133
struct visit {
        long v_roomnum;
        long v_roomgen;
        long v_usernum;
        long v_lastseen;
        unsigned int v_flags;
        char v_seen[SIZ];
        int v_view;
};

#define V_FORGET        1       /* User has zapped this room        */
#define V_LOCKOUT       2       /* User is locked out of this room  */
#define V_ACCESS        4       /* Access is granted to this room   */
 147
 148  This table is indexed by a concatenation of the first three fields.  Whenever
 149 we want to learn the relationship between a user and a room, we feed that
 150 data to a function which looks up the corresponding record.  The record is
 151 designed in such a way that an "all zeroes" record (which is what you get if
 152 the record isn't found) represents the default relationship.
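
One way to picture the key is as the three numbers packed back to back.  The
sketch below does exactly that; the helper name and the raw byte layout are
assumptions made for illustration, so check the code for what the server
really stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VISIT_KEY_LEN (3 * sizeof(long))

/* Hypothetical helper: pack room number, room generation and user number,
 * in that order, into one lookup key. */
void make_visit_key(unsigned char key[VISIT_KEY_LEN],
                    long roomnum, long roomgen, long usernum)
{
        assert(key != NULL);
        memcpy(key, &roomnum, sizeof(long));
        memcpy(key + sizeof(long), &roomgen, sizeof(long));
        memcpy(key + 2 * sizeof(long), &usernum, sizeof(long));
}
```

Two lookups for the same user and room produce identical keys, while a change
in any one of the three numbers produces a key that no longer matches.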
 153
 154  With this data, we now know which private rooms we're allowed to visit: if
 155 the V_ACCESS bit is set, the room is one which the user knows, and it may
 156 appear in his/her known rooms list.  Conversely, we also know which rooms the
 157 user has zapped: if the V_FORGET flag is set, we relegate the room to the
 158 zapped list and don't bring it up during new message searches.  It's also
 159 worth noting that the V_LOCKOUT flag works in a similar way to administratively
 160 lock users out of rooms.
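
The flag tests can be sketched as follows.  The function is hypothetical, and
it assumes that a public room counts as known when no flags are set (the
"all zeroes" default); the V_ values are the ones defined above.

```c
#include <stdbool.h>

#define V_FORGET        1       /* User has zapped this room        */
#define V_LOCKOUT       2       /* User is locked out of this room  */
#define V_ACCESS        4       /* Access is granted to this room   */

/* Hypothetical helper: should this room show up in the known rooms list? */
bool room_is_known(unsigned int v_flags, bool room_is_private)
{
        if (v_flags & V_LOCKOUT) return false;  /* administratively barred */
        if (v_flags & V_FORGET) return false;   /* zapped by the user      */
        if (room_is_private) return (v_flags & V_ACCESS) != 0;
        return true;                            /* assumed public default  */
}
```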
 161
 162  Implementing the "cause all users to forget room" command, then, becomes very
 163 simple: we simply change the generation number of the room by putting a new
 164 timestamp in the QRgen field.  This causes all relevant visit records to
 165 become irrelevant, because they appear to point to a different room.  At the
 166 same time, we don't lose the messages in the room, because the msglists table
 167 is indexed by the room number (QRnumber), which never changes.
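
To see why restamping QRgen is enough, here is a toy version of the lookup.
The struct and function names are invented for this sketch, the visit table
is a plain array instead of a database, and the generation is held in a long
for simplicity.

```c
#include <assert.h>
#include <stddef.h>

struct room_id {                /* just the two identifying room fields */
        long QRnumber;
        long QRgen;
};

struct visit_rec {              /* cut-down visit record */
        long v_roomnum;
        long v_roomgen;
        long v_usernum;
        unsigned int v_flags;
};

/* Return the visit record for this user and room, or NULL if there is none
 * (in which case the caller falls back to the default relationship). */
const struct visit_rec *find_visit(const struct visit_rec *tab, size_t n,
                                   const struct room_id *room, long usernum)
{
        assert(room != NULL);
        for (size_t i = 0; i < n; i++) {
                if (tab[i].v_roomnum == room->QRnumber &&
                    tab[i].v_roomgen == room->QRgen &&
                    tab[i].v_usernum == usernum)
                        return &tab[i];
        }
        return NULL;
}

/* "Cause all users to forget room": only the generation changes. */
void forget_room_for_everyone(struct room_id *room, long now)
{
        assert(room != NULL);
        room->QRgen = now;
}
```

After the restamp, every old visit record still exists but no longer matches
the room, and QRnumber (and therefore the msglist) is untouched.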
 168
 169  v_seen contains a string which represents the set of messages in this room
 170 which the user has read (marked as 'seen' or 'old').  It follows the same
 171 syntax used by IMAP and NNTP.  When we search for new messages, we simply
 172 return any messages that are in the room that are *not* represented by this
 173 set.  Naturally, when we do want to mark more messages as seen (or unmark
 174 them), we change this string.  Citadel BBS client implementations are naive
 175 and think linearly in terms of "everything is old up to this point," but IMAP
 176 clients want to have more granularity.
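
Testing one message number against a seen set might look like this.  The
sketch assumes the IMAP-style spelling, with ':' for ranges and ',' between
items (for example "1:5,7,9:12"); the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: is msgnum covered by a set such as "1:5,7,9:12"? */
bool msg_is_seen(const char *seen, long msgnum)
{
        assert(seen != NULL);
        const char *p = seen;
        while (*p != '\0') {
                char *end;
                long lo = strtol(p, &end, 10);
                if (end == p) break;            /* not a number: give up */
                long hi = lo;
                p = end;
                if (*p == ':') {                /* range "lo:hi" */
                        hi = strtol(p + 1, &end, 10);
                        if (end == p + 1) break;
                        p = end;
                }
                if (msgnum >= lo && msgnum <= hi) return true;
                if (*p != ',') break;           /* end of set or junk */
                p++;
        }
        return false;
}
```

A new message search then keeps every number in the room's msglist for which
this test is false.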
 177
 178
 179  DIRECTORY
 180  ---------
 181
 182  This table simply maps Internet e-mail addresses to Citadel network addresses
 183 for quick lookup.  It is generated from data in the Global Address Book room.
 184
 185
 186  USETABLE
 187  --------
 188  This table keeps track of message ID's of messages arriving over a network,
 189 to prevent duplicates from being posted if someone misconfigures the network
 190 and a loop is created.  This table goes unused on a non-networked Citadel.
 191
 192  THE MESSAGE STORE
 193  -----------------
 194
 195  This is where all message text is stored.  It's indexed by message number:
 196 give it a number, get back a message.  Messages are numbered sequentially, and
 197 the message numbers are never reused.
 198
 199  We also keep a "metadata" record for each message.  This record is also stored
 200 in the msgmain table, using the index (0 - msgnum).  We keep in the metadata
 201 record, among other things, a reference count for each message.  Since a
 202 message may exist in more than one room, it's important to keep this reference
 203 count up to date, and to delete the message from disk when the reference count
 204 reaches zero.
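
The key trick and the reference count can be sketched in a few lines.  The
struct and function names here are placeholders for illustration, not the
server's actual metadata layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cut-down metadata record. */
struct msg_meta {
        long msgnum;
        int refcount;   /* how many rooms currently list this message */
};

/* Message text lives under key msgnum; metadata under (0 - msgnum). */
long metadata_key(long msgnum)
{
        return 0L - msgnum;
}

/* Drop one reference (the message left a room).  Returns true when no room
 * refers to the message any more, i.e. it should be deleted from disk. */
bool msg_meta_release(struct msg_meta *meta)
{
        assert(meta != NULL);
        if (meta->refcount > 0) meta->refcount--;
        return meta->refcount == 0;
}
```

Since message numbers are positive, the negated keys can never collide with
the keys of the messages themselves.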
 205
 206  Here's the format for the message itself:
 207
 208    Each message begins with an 0xFF 'start of message' byte.
 209
 210    The next byte denotes whether this is an anonymous message.  The codes
 211 available are MES_NORMAL, MES_ANON, or MES_AN2 (defined in citadel.h).
 212
 213    The third byte is a "message type" code.  The following codes are defined:
 214  0 - "Traditional" Citadel format.  Message is to be displayed "formatted."
 215  1 - Plain pre-formatted ASCII text (otherwise known as text/plain)
 216  4 - MIME formatted message.  The text of the message which follows is
 217      expected to begin with a "Content-type:" header.
 218
 219    After these three opening bytes, the remainder of
 220 the message consists of a sequence of character strings.  Each string
 221 begins with a type byte indicating the meaning of the string and is
 222 ended with a null.  All strings are printable ASCII: in particular,
 223 all numbers are in ASCII rather than binary.  This is for simplicity,
 224 both in implementing the system and in implementing other code to
 225 work with the system.  For instance, a database driven off Citadel archives
 226 can do wildcard matching without worrying about unpacking binary data such
 227 as message ID's first.  To provide later downward compatibility,
 228 all software should be written to IGNORE fields not currently defined.
 229
 230                   The type bytes currently defined are:
 231
 232 BYTE    Mnemonic        Enum / Comments
 233
 234 A       Author          eAuthor
 235                         Name of originator of message.
 236 B       Big message     eBig_message
 237                         This is a flag which indicates that the message is
 238                         big, and Citadel is storing the body in a separate
 239                         record.  You will never see this field because the
 240                         internal API handles it.
 241 C       RemoteRoom      eRemoteRoom
 242                         When sent via Citadel Networking, this is the room
 243                         the message will be placed in on the remote site.
 244 D       Destination     eDestination
 245                         Contains name of the system this message should
 246                         be sent to, for mail routing (private mail only).
 247 E       Exclusive ID    eExclusiveID
 248                         A persistent alphanumeric Message ID used for
 249                         network replication.  When a message arrives that
 250                         contains an Exclusive ID, any existing messages which
 251                         contain the same Exclusive ID and are *older* than this
 252                         message should be deleted.  If there exist any messages
 253                         with the same Exclusive ID that are *newer*, then this
 254                         message should be dropped.
 255 F       rFc822 address  erFc822Addr
 256                         For Internet mail, this is the delivery address of the
 257                         message author.
 258 H       Human node name eHumanNode
 259                         Human-readable name of system message originated on.
 260 I       Message ID      emessageId
 261                         An RFC822-compatible message ID for this message.
 262 J       Journal         eJournal
 263                         The presence of this field indicates that the message
 264                         is disqualified from being journaled, perhaps because
 265                         it is itself a journalized message and we wish to
 266                         avoid double journaling.
 267 K       Reply-To        eReplyTo
 268                         The Reply-To header for outbound mailing list messages.
 269 L       List-ID         eListID
 270                         Mailing list identification, as per RFC 2919
 271 M       Message Text    eMesageText
 272                         Normal ASCII, newlines separated by CR's or LF's,
 273                         null terminated as always.
 274 N       Nodename        eNodeName
 275                         Contains node name of system message originated on.
 276 O       Room            eOriginalRoom - Room of origin.
 277 P       Path            eMessagePath
 278                         Complete path of message, as in the UseNet news
 279                         standard.  A user should be able to send Internet mail
 280                         to this path. (Note that your system name will not be
 281                         tacked onto this until you're sending the message to
 282                         someone else)
 283 R       Recipient       eRecipient - Only present in Mail messages.
 284 S       Special field   eSpecialField
 285                         Only meaningful for messages being spooled over a
 286                         network.  Usually means that the message isn't really
 287                         a message, but rather some other network function:
 288                         -> "S" followed by "FILE" (followed by a null, of
 289                            course) means that the message text is actually an
 290                            IGnet/Open file transfer.  (OBSOLETE)
 291                         -> "S" followed by "CANCEL" means that this message
 292                            should be deleted from the local message base once
 293                            it has been replicated to all network systems.
 294 T       date/Time       eTimestamp
 295                         Unix timestamp containing the creation date/time of
 296                         the message.
 297 U       sUbject         eMsgSubject - Optional.
 298                         Developers may choose whether they wish to
 299                         generate or display subject fields.
 300 V       enVelope-to     eenVelopeTo
 301                         The recipient specified in incoming SMTP messages.
 302 W       Wefewences      eWeferences
 303                         Previous message ID's for conversation threading.  When
 304                         converting from RFC822 we use References: if present, or
 305                         In-Reply-To: otherwise.
 306                         (In extnotify spool messages, which don't need to know
 307                         other message IDs, this field holds "Who" instead.)
 308 Y       carbon copY     eCarbonCopY
 309                         Optional, and only in Mail messages.
 310 0       Error           eErrorMsg
 311                         This field is typically never found in a message on
 312                         disk or in transit.  Message scanning modules are
 313                         expected to fill in this field when rejecting a message
 314                         with an explanation as to what happened (virus found,
 315                         message looks like spam, etc.)
 316 1       suppress index  eSuppressIdx
 317                         The presence of this field indicates that the message is
 318                         disqualified from being added to the full text index.
 319 2       extnotify       eExtnotify - Used internally by the serv_extnotify module.
 320 3       msgnum          eVltMsgNum
 321                         Used internally to pass the local message number in the
 322                         database to after-save hooks.  Discarded afterwards.
 323
 324                         EXAMPLE
 325
 326 Let <FF> be a 0xFF byte, and <0> be a null (0x00) byte.  Then a message
 327 which prints as...
 328
 329 Apr 12, 1988 23:16 From Test User In Network Test> @lifesys (Life Central)
 330 Have a nice day!
 331
 332  might be stored as...
 333 <FF><40><0>I12345<0>Pneighbor!lifesys!test_user<0>T576918988<0>    (continued)
 334 -----------|Mesg ID#|--Message Path---------------|--Date------
 335
 336 AThe Test User<0>ONetwork Test<0>Nlifesys<0>HLife Central<0>MHave a nice day!<0>
 337 |-----Author-----|-Room name-----|-nodename-|Human Name-|--Message text-----
 338
 339  Weird things can happen if fields are missing, especially if you use the
 340 networker.  The date, author, room, and nodename fields may appear in any
 341 order, but the leading fields and the message text must remain in the same
 342 place.  The H field looks better when it is placed immediately after the N
 343 field.
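
Because every field is a type byte followed by a null-terminated string, a
reader for this format can be very small.  The sketch below is a hypothetical
scanner written for this document, not the server's own parser; it skips the
three opening bytes and returns the first field of the requested type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the first field of the given type in a
 * serialized message.  Returns a pointer to the field's string, or NULL if
 * the field is absent or the buffer is malformed. */
const char *msg_find_field(const unsigned char *buf, size_t len, char type)
{
        assert(buf != NULL);
        if (len < 3 || buf[0] != 0xFF) return NULL;     /* no start byte */
        size_t pos = 3;                 /* skip start, anon, type bytes  */
        while (pos < len) {
                unsigned char ftype = buf[pos++];
                const unsigned char *nul = memchr(buf + pos, 0, len - pos);
                if (nul == NULL) return NULL;           /* unterminated  */
                if (ftype == (unsigned char)type)
                        return (const char *)(buf + pos);
                pos = (size_t)(nul - buf) + 1;          /* next field    */
        }
        return NULL;
}
```

Fed the example message above, asking for 'A' yields "The Test User" and
asking for 'M' yields "Have a nice day!".  Unknown field types are simply
stepped over, which is the IGNORE rule in practice.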
 344
 345
 346  EUID (EXCLUSIVE MESSAGE ID'S)
 347  -----------------------------
 348
 349  This is where the groupware magic happens.  Any message in any room may have
 350 a field called the Exclusive message ID, or EUID.  We keep an index in the
 351 table CDB_EUIDINDEX which knows the message number of any item that has an
 352 EUID.  This allows us to do two things:
 353
 354  1. If a subsequent message arrives with the same EUID, it automatically
 355 *deletes* the existing one, because the new one is considered a replacement
 356 for the existing one.
 357  2. If we know the EUID of the item we're looking for, we can fetch it by EUID
 358 and get the most up-to-date version, even if it's been updated several times.
 359
 360  This functionality is made more useful by server-side hooks.  For example,
 361 when we save a vCard to an address book room, or an iCalendar item to a
 362 calendar room, our server modules detect this condition, and automatically set
 363 the EUID of the message to the UUID of the vCard or iCalendar item.  Therefore
 364 when you save an updated version of an address book entry or a calendar item,
 365 the old one is automatically deleted.
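
The replace-on-save behavior can be modeled with a tiny in-memory index.
Everything in this sketch (names, fixed sizes, return conventions) is made up
for illustration; the real index lives in the CDB_EUIDINDEX table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EUIDS 16
#define EUID_LEN  64

struct euid_entry { char euid[EUID_LEN]; long msgnum; };
struct euid_index { struct euid_entry e[MAX_EUIDS]; int count; };

/* Record that msgnum now carries this EUID.  Returns the message number it
 * supersedes (the caller should delete that message), 0 if the EUID is new,
 * or -1 if the EUID is too long or the toy index is full. */
long euid_index_save(struct euid_index *ix, const char *euid, long msgnum)
{
        assert(ix != NULL && euid != NULL);
        if (strlen(euid) >= EUID_LEN) return -1;
        for (int i = 0; i < ix->count; i++) {
                if (strcmp(ix->e[i].euid, euid) == 0) {
                        long old = ix->e[i].msgnum;
                        ix->e[i].msgnum = msgnum;
                        return old;
                }
        }
        if (ix->count == MAX_EUIDS) return -1;
        strcpy(ix->e[ix->count].euid, euid);    /* length checked above */
        ix->e[ix->count].msgnum = msgnum;
        ix->count++;
        return 0;
}

/* Fetch the current message number for an EUID, or 0 if unknown. */
long euid_index_lookup(const struct euid_index *ix, const char *euid)
{
        assert(ix != NULL && euid != NULL);
        for (int i = 0; i < ix->count; i++) {
                if (strcmp(ix->e[i].euid, euid) == 0) return ix->e[i].msgnum;
        }
        return 0;
}
```

Saving the same EUID twice hands back the first message number for deletion,
and a lookup afterwards returns only the newest one.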
 366
 367
 368
 369  NETWORKING (REPLICATION)
 370  ------------------------
 371
 372 Citadel nodes network by sharing one or more rooms. Any Citadel node
 373 can choose to share messages with any other Citadel node, through the sending
 374 of spool files.  The sending system takes all messages it hasn't sent yet, and
 375 spools them to the receiving system, which posts them in the rooms.
 376
 377 The EUID discussion above is extremely relevant, because the EUID is carried
 378 over the network as well, and the same replacement rules apply there.
 379 Therefore, when a message containing an EUID is saved in a networked room,
 380 it replaces any existing message with the same EUID *on every node in the
 381 network*.
 382
 383 Complexities arise primarily from the possibility of densely connected
 384 networks: one does not wish to accumulate multiple copies of a given
 385 message, which can easily happen.  Nor does one want to see old messages
 386 percolating indefinitely through the system.
 387
 388 This problem is handled by keeping track of the path a message has taken over
 389 the network, like the UseNet news system does.  When a system sends out a
 390 message, it adds its own name to the bang-path in the <P> field of the
 391 message.  If no path field is present, it generates one.
 392
 393 With the path present, all the networker has to do to ensure that it doesn't
 394 send another system a message it's already received is check the <P>ath field
 395 for that system's name somewhere in the bang path.  If it's present, the system
 396 has already seen the message, so we don't send it.
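
The bang path check is a matter of comparing whole path components.  A
possible sketch follows; the function name is hypothetical, and for
simplicity it treats every component alike, including the final one (which
in the earlier example is a user name rather than a system).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does the '!'-separated path contain node as a whole
 * component?  Partial matches ("life" vs. "lifesys") do not count. */
bool path_contains_node(const char *path, const char *node)
{
        assert(path != NULL && node != NULL);
        size_t nlen = strlen(node);
        const char *p = path;
        while (*p != '\0') {
                size_t seglen = strcspn(p, "!");        /* this component */
                if (seglen == nlen && strncmp(p, node, nlen) == 0)
                        return true;
                p += seglen;
                if (*p == '!') p++;                     /* skip separator */
        }
        return false;
}
```

The networker would skip any neighbor for which this returns true.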
 397
 398 We also keep a small database, called the "use table," containing the ID's of
 399 all messages we've seen recently.  If the same message arrives a second or
 400 subsequent time, we will find its ID in the use table, indicating that we
 401 already have a copy of that message.  It will therefore be discarded.
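
A toy use table could be a small ring of recently seen IDs, as below.  This
is only a sketch: the names, the slot count, and the oldest-first eviction
are assumptions for illustration, and IDs longer than the slot size are
silently truncated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define USE_SLOTS 4
#define USE_IDLEN 64

struct usetable {
        char ids[USE_SLOTS][USE_IDLEN]; /* recently seen message IDs  */
        int next;                       /* slot to overwrite next     */
};

/* Returns true if msgid was seen recently (discard the message).
 * Otherwise remembers it, evicting the oldest entry, and returns false. */
bool usetable_seen_before(struct usetable *ut, const char *msgid)
{
        assert(ut != NULL && msgid != NULL);
        for (int i = 0; i < USE_SLOTS; i++) {
                if (ut->ids[i][0] != '\0' && strcmp(ut->ids[i], msgid) == 0)
                        return true;
        }
        snprintf(ut->ids[ut->next], USE_IDLEN, "%s", msgid);
        ut->next = (ut->next + 1) % USE_SLOTS;
        return false;
}
```

Start from a zeroed struct usetable; the first arrival of an ID returns
false and any repeat while it is still in the ring returns true.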
 402
 403 The above discussion should make the function of the fields reasonably clear:
 404
 405  o  Travelling messages need to carry original message-id, system of origin,
 406     date of origin, author, and path with them, to keep reproduction and
 407     cycling under control.
 408
 409 (Uncoincidentally) the format used to transmit messages for networking
 410 purposes is precisely that used on disk, serialized.  The current
 411 distribution includes serv_network.c, which is basically a database replicator;
 412 please see network.txt on its operation and functionality (if any).
 413
 414
 415  PORTABILITY ISSUES
 416  ------------------
 417
 418  Citadel is 64-bit clean, architecture-independent, and Year 2000
 419 compliant.  The software should compile on any POSIX compliant system with
 420 a full pthreads implementation and TCP/IP support.  In the future we may
 421 try to port it to non-POSIX systems as well.
 422
 423  On the client side, it's also POSIX compliant.  The client even seems to
 424 build ok on non-POSIX systems with porting libraries (such as Cygwin).
 425
 426
 427
 428  SUPPORTING PRIVATE MAIL
 429  -----------------------
 430
 431    Can one have an elegant kludge?  This must come pretty close.
 432
 433    Private mail is sent and received in the Mail> room, which otherwise
 434 behaves pretty much as any other room.  To make this work, we have a
 435 separate Mail> room for each user behind the scenes.  The actual room name
 436 in the database looks like "0000001234.Mail" (where '1234' is the user
 437 number) and it's flagged with the QR_MAILBOX flag.  The user number is
 438 stripped off by the server before the name is presented to the client.  This
 439 provides the ability to give each user a separate namespace for mailboxes
 440 and personal rooms.
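
The naming scheme is easy to sketch.  Both helpers below are hypothetical and
assume the prefix is always the user number zero-padded to ten digits and
followed by a dot, as in the "0000001234.Mail" example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build the database name of a user's personal room. */
void mailbox_room_name(char *out, size_t outlen, long usernum,
                       const char *room)
{
        assert(out != NULL && room != NULL);
        snprintf(out, outlen, "%010ld.%s", usernum, room);
}

/* Strip the "NNNNNNNNNN." prefix for display.  Names that do not carry the
 * prefix are returned unchanged. */
const char *mailbox_display_name(const char *dbname)
{
        assert(dbname != NULL);
        for (int i = 0; i < 10; i++) {
                /* Stops at the terminating null of a short name too. */
                if (!isdigit((unsigned char)dbname[i])) return dbname;
        }
        return dbname[10] == '.' ? dbname + 11 : dbname;
}
```

For user number 1234 and room "Mail" this builds "0000001234.Mail", and the
display helper turns that back into "Mail".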
 441
 442    This requires a little fiddling to get things just right.  For example,
 443 make_message() has to be kludged to ask for the name of the recipient
 444 of the message whenever a message is entered in Mail>.  But basically
 445 it works pretty well, keeping the code and user interface simple and
 446 regular.
 447
 448
 449
 450  PASSWORDS AND NAME VALIDATION
 451  -----------------------------
 452
 453   This has changed a couple of times over the course of Citadel's history.  At
 454 this point it's very simple, again due to the fact that record managers are
 455 used for everything.  The user file (user) is indexed using the user's
 456 name, converted to all lower-case.  Searching for a user, then, is easy.  We
 457 just lowercase the name we're looking for and query the database.  If no
 458 match is found, it is assumed that the user does not exist.
 459
 460   This makes it difficult to forge messages from an existing user.  (Fine
 461 point: nonprinting characters are converted to printing characters, and
 462 leading, trailing, and double blanks are deleted.)
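
Turning a typed name into a lookup key might look like the sketch below.  The
function name is hypothetical, and replacing each nonprinting character with
a blank is an assumption; the text above only says that such characters
become printing ones.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: lowercase the name, turn nonprinting characters
 * into blanks, and drop leading, trailing, and repeated blanks.  Output is
 * truncated (but always null-terminated) if it does not fit. */
void make_user_key(char *out, size_t outlen, const char *name)
{
        assert(out != NULL && name != NULL && outlen > 0);
        size_t n = 0;
        int pending_blank = 0;
        for (const char *p = name; *p != '\0'; p++) {
                unsigned char c = (unsigned char)*p;
                if (!isprint(c)) c = ' ';       /* assumed replacement */
                if (c == ' ') {
                        pending_blank = (n > 0);        /* skip leading */
                        continue;
                }
                if (pending_blank) {
                        if (n + 2 >= outlen) break;     /* blank + char */
                        out[n++] = ' ';
                        pending_blank = 0;
                }
                if (n + 1 >= outlen) break;
                out[n++] = (char)tolower(c);
        }
        out[n] = '\0';
}
```

With this, "  Test   User " and "test user" map to the same key, which is
what keeps look-alike names from sneaking past the lookup.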