chapter 19

Good HTML

The purpose of a web page is the message. The message of the page is sometimes in the words, sometimes in the images, and sometimes in other elements of the page, but the message is rarely in the code itself. So why bother writing good HTML?

The way the message is delivered can have an impact on how it is received. HTML is one link in the chain of media that carries your web-based message. Ask a painter why they choose one type of canvas over another, or ask a musician why they choose a type of string or reed or bow for their instrument. You may not be able to discern what type of strings are on a guitar by listening to a recording (although Bill says he can), but it does affect the overall quality of the experience.

Why Write Good HTML?

There are both subjective and objective reasons for writing good HTML. Subjectively, it may or may not be important to you that you do as good a job as possible on every level of every project that you take on. We feel that doing something well is its own reward, but we recognize that it's not always practicable.

On the other hand, there are some very pragmatic reasons to at least make sure that your HTML is correct, in spite of the fact that it may already work. As a practical illustration, here's a page that works fine in browsers that are based on the original NCSA Mosaic (including Microsoft Internet Explorer and older Netscape browsers), but does not work in the current Netscape:

<HTML>
<HEAD>
<TITLE> Bad Table </TITLE>
</HEAD>
<BODY BGCOLOR=WHITE>
<TABLE>
<TR><TD>

<H1>This entire page is in a table. </H1>

</BODY>
</HTML>

Notice that there is no end tag for the TABLE element (</TABLE>). The end tag is required for the TABLE element--according to both the table specification and the HTML 4 specification. It works just fine in Microsoft Internet Explorer, but Netscape Navigator (beginning with version 3) won't display a table without an end tag.
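For comparison, here is a sketch of how the same page might look once it is corrected. The only change the specification demands is the added </TABLE> end tag. Closing the TD and TR elements as well, quoting the BGCOLOR value, and renaming the title are illustrative choices, not requirements.

```html
<HTML>
<HEAD>
<TITLE> Good Table </TITLE>
</HEAD>
<BODY BGCOLOR="white">
<TABLE>
<TR><TD>

<H1>This entire page is in a table. </H1>

</TD></TR>
</TABLE>
</BODY>
</HTML>
```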

In the case of the missing table end tags, there were a number of web sites that virtually "disappeared" when Netscape 3 was released. A similar problem happened with body backgrounds with the release of Netscape 4 (see the example later in this chapter).

HTML Terminology

Probably the single most important thing you can learn about HTML is the distinction between tags, attributes, containers, and elements. Once you understand these terms, it will be much easier for you to tell when your code is correct. Here's what they mean:

Tag
A tag is an HTML instruction enclosed in angle-brackets (e.g., <P>). Some tags may also have end tags that begin with a slash (e.g., </P>). The tag without the slash is sometimes called a begin tag or a start tag.

Attribute
An attribute is a property that works with a tag. Attributes go after the name of the tag, and before the right angle-bracket. For example, if you want a horizontal rule without the shading effect, you can use the NOSHADE attribute (e.g., <HR NOSHADE>). Some attributes have values like the ALIGN attribute (e.g., <P ALIGN=CENTER>), or the HREF attribute for the destination of a link (e.g., <A HREF="http://www.htmlbook.com/">). The part to the right of the equal sign is called the value of the attribute.

Container
A container is a tag that has both a beginning and an end, and generally has content that is placed in between. The beginning of a container is marked by a begin tag, and the end is marked by an end tag. For example, TITLE is a container because it has a distinct beginning and end. The content of a TITLE is in between the tags (e.g., <TITLE> content </TITLE>). In contrast, BR is not a container because it has no end tag; everything it needs is between the brackets of the BR tag. Some containers, like P for instance, do not require end tags if the end can be accurately determined by context. But they are still containers because they have content and a limited scope of operation. In the absence of an end tag, the effects of a P tag end when the next P, or some other tag that is not valid content for P, is encountered. This is true of many containers with optional end tags.

Element
Element is a general term for a chunk of HTML that can be treated as a distinct unit in some context. A container, along with all its content, can be considered an element (e.g., <STRONG> This is a STRONG element </STRONG>). A stand-alone tag, like IMG, can also be considered an element (e.g., <IMG SRC="element.gif">). This term is used as a convenience of nomenclature whenever we need to discuss some distinct part of a document or code.
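The four terms above can be seen together in one small fragment. This is only an illustrative sketch; the comments label each part using the definitions given here, and the paragraph text is made up for the example.

```html
<!-- Begin tag, with one attribute (HREF) and its value (the quoted URL) -->
<A HREF="http://www.htmlbook.com/">

<!-- A is a container, and this text is its content -->
Creative HTML Book Site

<!-- End tag: the begin tag, content, and end tag together are one element -->
</A>

<!-- A stand-alone tag with an attribute that takes no value -->
<HR NOSHADE>

<!-- P is a container with an optional end tag:
     the first paragraph ends where the second P begins -->
<P> First paragraph.
<P> Second paragraph.
```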

What You See AIN'T What You Get

WYSIWYG editors are a wonderful invention, and we encourage you to use them for prototyping your web sites. The use of a WYSIWYG editor can greatly reduce the amount of time it takes you to lay out, view, and re-lay out your site while you are in the process of designing it.

But for production work, we implore you to be careful. An excellent example of the problem is the "disappearing background" problem that happened with the release of Netscape 4.

The HTML specification allows for one BODY element per page. Both the begin and end tags are optional (that is, the body of the document can be implied if the default properties are acceptable), but you are not allowed to have more than one BODY element in a single document.

However, there are evidently some WYSIWYG editors that don't follow this rule. We have seen a number of sites with two or more BODY tags, and this has created problems with some browsers. The early release versions of Netscape Navigator 4 would ignore the additional BODY tags and only use the attributes of the first one. For example, consider this HTML:

<HTML>
<HEAD>
<TITLE> Bad Body </TITLE>
</HEAD>
<body>
<BODY background=white.gif>

<H1>This document has two BODY tags. </H1>

</BODY>
</HTML>

Later releases of Navigator 4 (beginning with 4.03) accumulate attributes from BODY tags. But you really can't count on a browser guessing what your HTML means when it's not correct. For instance, Mosaic 3.0 (the last version) also shows a gray background for this error.

The best defense is good HTML.
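As a sketch of the fix, the two BODY tags from the example can be merged into one that carries all of the attributes. The title and heading text are changed here only to tell the two versions apart.

```html
<HTML>
<HEAD>
<TITLE> Good Body </TITLE>
</HEAD>
<BODY BACKGROUND="white.gif">

<H1>This document has one BODY tag. </H1>

</BODY>
</HTML>
```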

Cleaning Up After a WYSIWYG Editor

As an example of the sorts of things you need to watch out for with your WYSIWYG editors, I have created a little page using Allaire's HomeSite.

Here's a screenshot of the page in the editor

Now here's what it looks like in Netscape Navigator: Notice anything different?

Let's look at the code and see if we can fix it up.

<!-- This document was created with HomeSite 2.5 -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<HTML>
<HEAD>
<TITLE>Test Page</TITLE>
</HEAD>

<BODY BACKGROUND="/usr/BILL/htmlbook/working/ch19/
  lgreentile.gif" TEXT="Navy" LINK="Olive" 
  VLINK="#999933" ALINK="Silver">

<TABLE BORDER=0 CELLSPACING=8 CELLPADDING=5 
  VALIGN="TOP" BGCOLOR="#CCFF99"
WIDTH=350>
<TR>
<TD>Something here</TD>
<TD>Something Else</TD>
</TR>
<TR>
<TD>Something New</TD>
<TD>Something Blue</TD>
</TR>
<TR>
<TD>Other things</TD>
<TD>Things X, Y, and Z</TD>
</TR>
<TR>
<TD>The Cat in the Hat</TD>
<TD>Dr Seuss' Toothbrush</TD>
</TR>
</TABLE>

<P> Here's a paragraph created in Home Site. 
It has <B>bold and <I>italic text in it.</I></B></P>

</BODY>
</HTML>

The most glaring problem in the HTML above is that the background image didn't show up in the browser (even though it was fine in the editor's preview screen). Notice that the URL for the BACKGROUND attribute is not a proper relative URL; it is an absolute path on the machine where the page was created. This is easy to fix, but it shows a flaw in the editor.
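One way to repair the BODY tag is sketched here. It assumes that lgreentile.gif is stored in the same folder as the page itself; if the image lives elsewhere, the relative path would need to change to match.

```html
<!-- Assumption: lgreentile.gif sits in the same folder as this page -->
<BODY BACKGROUND="lgreentile.gif" TEXT="Navy" LINK="Olive"
  VLINK="#999933" ALINK="Silver">
```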

The point here is for you to expect flaws in the code that the editor puts out. Always expect to have to fix the code that an automated tool generates. Some people say that the tools will get better, and that's probably true. But the fact remains that after 20 years of trying, there are still no automated tools for any programming language that do as good a job as a careful human. The promise of artificial intelligence that can better a human's creative efforts is yet to be realized. We don't expect that overall situation to change any time soon.

We also noticed that the tool doesn't break its lines to fit an 80-column screen (this is important for those of us who use multiple platforms to work on the same files), and the use of tabs for indenting is also not portable. Again, these are easy problems to fix, but they require effort. Always prepare for more complicated pages to have more complicated problems.

As a rule, we feel that the WYSIWYG editors are excellent tools for prototyping (indeed, we use them as such), but not for production use. If you must create and maintain a large and complex web site with constantly updated information (like a large news or periodical site), we recommend that you either create custom tools for that particular site (as most of the major sites do) or retain the services of a programmer to do that for you. For large one-time sites that won't change much over time, you can prototype with your WYSIWYG editor and then modify or rewrite the code by hand to make it correct.

Common HTML Gotchas

There are many common HTML "gotchas" that we see a lot on the web. Of course, each of us has our own peculiar predilection for error, and as such, our problems will not always fit nicely into a preordained list. But we've compiled a short list that you may want to watch out for anyway. These are some of the most frequent HTML problems we see on public web pages.

What's in a Quote?

Quotation marks (either double " or single ') are used in HTML to contain the values of some attributes. When do you need to use quotes? If all the characters in the value are either letters (a-z and A-Z), numbers (0-9), periods (.), or hyphens (-), you don't need to use quotes. If you have any characters besides those mentioned, you need to use quotes. When in doubt, use the quotes. They can't hurt.

The most common type of value that requires quotes, and often doesn't have them, is the URL (for example, <A HREF=http://www.htmlbook.com/>Creative HTML Book Site</A> is not legal HTML because it is missing the quotes around "http://www.htmlbook.com/"). Most URLs have slashes, colons, and other characters that must be quoted to be correct. We are not looking forward to the day Netscape starts requiring quotes around attributes that really need them. A lot of the web will need to be fixed!
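The quoting rule can be illustrated with a few attribute values. The link is the one from the text with its quotes restored; the two TABLE tags are made-up examples of a value that needs quotes and one that does not.

```html
<!-- The URL contains a colon and slashes, so it must be quoted -->
<A HREF="http://www.htmlbook.com/">Creative HTML Book Site</A>

<!-- The percent sign is not a letter, number, period, or hyphen: quote it -->
<TABLE WIDTH="100%">

<!-- Numbers only: quotes are optional here, but harmless -->
<TABLE WIDTH=350>
```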

Hanging Quotes

On the other hand, you have to use your quotes in matching pairs! For example, this doesn't work well:

<HTML>
<HEAD>
<TITLE> Bad Quotes </TITLE>
</HEAD>
<BODY BGCOLOR=white>

<P>This is a <a href="link.html>link</a> with a 
missing quote.


<P>You won't see any of this text until 
<a href="link.html">after</a> 
this other link. 

</BODY>
</HTML>

Notice the missing quote in the first link. You don't see it? Look here then. The folks at Netscape gave us this handy-dandy missing quote finder in their View:Source menu, starting with version 3. When you view the source of a document with a missing quote, all the text that's affected will be highlighted and blinking. Try this for yourself: find the bad-quote.html file in the chap19 folder of the CD-ROM and look at it in Netscape Navigator. Be sure to select View:Source. See it blink? Tell a friend.
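A corrected sketch of the two paragraphs needs only the closing quote added to the first HREF value. The paragraph wording is adjusted here purely for illustration.

```html
<P>This is a <a href="link.html">link</a> with both of its
quotes.

<P>You can see all of this text, including the
<a href="link.html">second</a> link.
```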

Straddling Containers

Considering the fact that a container--along with all of its content--is a single distinct element, it is reasonable that one container can have other containers as part of its content. That's why you can write something like this:

<P> This paragraph has <EM> emphasized and <STRONG> strong </STRONG> text </EM> inside it. </P>

In this perfectly legal example, the P element contains the EM element, which in turn contains the STRONG element.

Now consider this example:

<P> This paragraph has <EM> emphasized and <STRONG> strong </EM> text </STRONG> inside it. </P>

Here we decided to end the EM element before the end of the STRONG element. What's wrong with this picture? Notice that EM no longer contains STRONG (nor does STRONG contain EM). The elements are straddling each other.

It is perfectly legal to have one element contain another element, as long as the inner element is valid content for the outer element. But it is not legal to have two elements straddle each other. As with many common HTML errors, this may work in some browsers today, and it may not work in later versions of those same browsers.
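If the straddled markup was written to get a particular effect (emphasis that stops before the strong text does), one possible rewrite is to close the inner element first and then reopen it. This is a sketch of that approach, not the only legal arrangement.

```html
<P> This paragraph has <EM> emphasized and <STRONG> strong </STRONG></EM>
<STRONG> text </STRONG> inside it. </P>
```

Here STRONG is closed before EM ends, and a second STRONG element then covers the remaining word, so no element straddles another.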

Line Endings

Unless you are actually trying to make your HTML unreadable (some people actually want to make it a little tougher to "steal" their code), you should keep your lines to under 80 characters wide (75 is a good rule of thumb). That makes it easier to view your source code in the browser and to work on it on the widest possible variety of platforms.

You should also set your editor to use UNIX line-endings, especially if your server runs under UNIX.

There are three different types of line-endings:

Carriage Return: used by Macs
Carriage Return + Line Feed: used by PCs
Line Feed only: used by UNIX

The line-endings are invisible to you, but visible to your web server and many HTML editors. You will probably find the setting for UNIX line-endings in the Preferences menu of your HTML editor or word processor.

Entities vs. Numbers vs. Embedded Characters

HTML uses something called "entities" for characters outside of the normal English alpha-numeric character set (there's a nice list of them here, as well as a complete list in the HTML 4.0 Reference Chapter). Named entities (e.g., &copy; for the © symbol) are preferable to the numbered entities (e.g., &#169; also for the © symbol), because the names will work on multiple platforms. The numbered entities will not work on all platforms, nor will characters embedded from your word processor. (Some WYSIWYG editors use numbered entities by default.)
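As a brief illustration, here are a few named entities in use. The entity names are standard HTML; the sentences around them are made up for the example.

```html
<!-- &copy; displays the copyright symbol -->
<P>&copy; Creative HTML Book Site</P>

<!-- &eacute; is an accented e; &lt; and &amp; stand in for < and & -->
<P>Caf&eacute; prices are &lt; $3 for coffee &amp; tea.</P>
```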

Color Names not Browser-Safe

Remember that the named colors (e.g., teal) are not all browser-safe. Most of them will dither in 256-color systems. Use the hexadecimal colors instead (e.g., "#669999"). (Some WYSIWYG editors use color names by default.) In-depth information about browser-safe colors is in Chapter 4, "Web Color."
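As a sketch, a BODY tag that uses color names could be rewritten with hexadecimal values built only from the browser-safe pairs 00, 33, 66, 99, CC, and FF. The specific values below are our own rough stand-ins for Navy, Olive, and Silver, chosen by eye; they are not exact equivalents.

```html
<!-- Color names replaced with browser-safe hexadecimal values -->
<BODY TEXT="#000066" LINK="#666600" VLINK="#999933" ALINK="#CCCCCC">
```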

Empty ALT Attributes

The ALT attribute for the IMG tag is an important tool for making your pages work on non-graphical systems, but an empty ALT attribute (e.g., ALT=" ") can be annoying. In non-graphical systems, it will take up space without saying anything; and in many graphical systems, it will show an empty little tool-help (usually a little yellow square) when the mouse is passed over the graphic. If you don't have content for your ALT attributes, don't include them at all. (Some WYSIWYG editors insert these by default.)
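For an image that does carry meaning, the ALT text should say what the image says. A minimal sketch, with a hypothetical file name and description:

```html
<!-- logo.gif and its description are placeholders for this example -->
<IMG SRC="logo.gif" ALT="Creative HTML Book Site logo">
```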

Case-Sensitive File Names

Most web servers run under UNIX, which uses case-sensitive file names. Most web authors use Mac or PC platforms, which do not use case-sensitive file names. That means that if you have a file named Image.gif and you refer to it as IMAGE.GIF, it may work on your system at home, but not on the web. We recommend that you use all lowercase file names, just to avoid problems. They're easier to type anyway.
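The mismatch described above might look like this in practice. The file names follow the example in the text.

```html
<!-- May fail on a UNIX server: the file is Image.gif, not IMAGE.GIF -->
<IMG SRC="IMAGE.GIF">

<!-- Safer: save the file as image.gif and refer to it the same way -->
<IMG SRC="image.gif">
```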

Relative vs. Absolute Links

Always use relative links when possible. (See Chapter 12, "Organization".) Absolute links will become a major headache for you when you eventually have to move your site to another machine, or even just another folder on the same machine. (Some WYSIWYG editors use absolute links by default.)
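The difference can be sketched with a pair of links. The host name, folders, and file names below are hypothetical, used only to show the two forms.

```html
<!-- Absolute link: breaks if the site moves to another host or folder -->
<A HREF="http://www.example.com/book/chap19/summary.html">Summary</A>

<!-- Relative links: resolved from the location of the current page -->
<A HREF="summary.html">Summary</A>
<A HREF="../chap04/index.html">Web Color</A>
```

If the whole folder tree moves together, the relative links keep working without any edits.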

Chapter 19 Summary

Writing good HTML is not required. No one is going to force you to do it, and most people won't even notice if you don't. But it's a discipline that will serve you well in the long run. It will make life easier on you when new tools and browsers are released and whenever you need to make substantial changes to your site (which will likely be more often than you plan for).

In this chapter, you have seen some of the common problems with incorrect HTML, and how to correct them when they are encountered.

We encourage you to use the HTML reference that accompanies this book (it's on the CD-ROM, and available as a printed booklet) for an authoritative source of correct HTML syntax.