By Dr. Jack A. Shulman, rev .02
In the modern era, supercomputing has been successfully collapsed onto a single chip: the superscalar RISC chip. Several examples of the modern "supercomputer chip" (the Intel Itanium 2, the IBM RS/6000 POWER, the MIPS and the Sun SPARC) have made very successful debuts as standalone server platforms. In IBM's case, these chips have evolved into actual supercomputers, although at IBM the pen has proven mightier than the sword: the actual implementations have met with varying degrees of success, mainly a failure to harness all of the available horsepower. Until now, no one had considered how to automate the "supercomputer speedup process" so that software could make use of all that performance, that is, how to adopt an architecture suited to the actual demands software places on the average supercomputer (is that a malapropism? What supercomputer is average?).
American Computer diverged from the mainstream in creating its "Advanced HyperSystems" project in 1993. After I left IBM's contracting base, its owners asked me to come in and progressively blackboard new technology that goes beyond anything offered today by the mainstream, including IBM, until an acceptable design for a new kind of supercomputer was reached.
And so, for nine years, we designed, tested and built, for proof purposes, a number of supercomputers whose capabilities we did little to publicize. In fact we were an audience of one, for to use the marketing process that IBM uses would require an army of patent attorneys, a luxury only a company of IBM's size can afford, and one that also insures its internally competitive nature: patenting technologies when you are NOT their original author creates an enforcement nightmare. This may be deemed a "narcissistic" view, that is, no sharing of technology, but we wanted an answer to the problem, rather than to permeate the world with never-ending expansions of technology moving forward in baby steps, as IBM has with its Deep Thought, Deep Blue, Pacific Blue and ASCI Blue exercises.
By late 1994, we had determined that two courses would be pursued:

A) Type A Massively Parallel Supercomputer: CONCENTRATION OF PROCESSOR COUNT DENSITY AND BUS SPEED.
A massively parallel supercomputer is just as its name suggests: a massive aggregation of many smaller CPUs, harnessed in a manner intended to provide the fastest possible communication and arbitration between them. The key is arbitration, since you can build fast communications, but arbitration of many users to a single resource can produce the proverbial tortoise in a race meant to produce a "hare". We had learned from several mistakes made by IBM in attempting to copy some of my earliest work back in the 1980s on Proteus and Aerosphere: they simply do not understand granularity, and their idea of wide-band arbitration is brute-force overkill.
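To make the arbitration point concrete, the following is a minimal sketch, in Python, of a plain round-robin arbiter granting one shared resource among several requesters. It is a generic textbook illustration of the bottleneck being described, not the proprietary HyperSystems arbitration method, and the class and method names are my own.

class RoundRobinArbiter:
    """Toy arbiter: grants one shared resource per cycle among N requesters.

    Illustrates why arbitration, not raw link speed, bounds a massively
    parallel design: with N active requesters, each one sees the shared
    resource only about 1/N of the time.
    """

    def __init__(self, n_requesters):
        self.n = n_requesters
        self.next_grant = 0

    def grant(self, requests):
        """requests: list of booleans, one per requester. Returns winner or None."""
        for offset in range(self.n):
            candidate = (self.next_grant + offset) % self.n
            if requests[candidate]:
                self.next_grant = (candidate + 1) % self.n  # rotate priority
                return candidate
        return None

# Four CPUs all contending for one memory port: each wins one cycle in four.
arbiter = RoundRobinArbiter(4)
wins = [0, 0, 0, 0]
for _ in range(1000):
    winner = arbiter.grant([True, True, True, True])
    wins[winner] += 1
print(wins)  # roughly [250, 250, 250, 250]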
And so we decided to focus on what would be a good granule size. Our solution probably vexes IBM, because we decided that a single Granule was most efficient if it had three processors. However, a tri-processor design lacks redundancy, leading to chaos when heavily loaded. So we altered our thinking and fixed on a Four-Processor, Fixed-Memory-Size Granule, with Virtual Memory (in the traditional sense) and a unique method of resolving arbitration (which, being proprietary, I won't share here) in terms of resource sharing, resource exclusion and resource synchronization. This course required us to identify the processor mask for the "granule" we intended to use to build the ultimate "massively parallel" supercomputer, which in our case resulted in selection of the INTEL IA-64/128 PA-RISC superscalar processor with 2MB of cache or larger, along with 1 Gigabyte of dedicated 333-400 MHz double-data-rate ECC memory using dual channels and a 64-bit-wide data bus for transfers and addressing.
A single granule today consists of a large board implementation of:
x- Four ITANIUM-2 bare chips, 4MB cache each
y- 4 Gigabytes of shared dedicated DDR RAM
z- One HyperSystems Chipset including:
an East Bridge with four dedicated OC1000 Optical Busses
a West Bridge with a Dedicated 133MHz, 256-bit-wide Advanced Graphical Bus
a North Bridge with SMP 1.9b Quad APIC, instruction trace sharing and hyperthreading support
a South Bridge with a PCI-X Bridge and a PCI Bridge for locally attached controllers
And a manager/supermanager for four granules consists of (both configurations are restated in the sketch after this list):
a- 16 OC1000 Optical Busses
b- One Granule
c- Optical Twisted Double Helix Net
Connectors (2x) for connection to up to three other Managers and a
supermanager.
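This restates the two parts lists above as a simple configuration sketch. It is only a transcription of the figures quoted in this article into Python data structures; the field names are illustrative and are not drawn from any actual HyperSystems specification.

from dataclasses import dataclass, field

@dataclass
class Granule:
    """One Type A granule, per the parts list above."""
    cpus: int = 4                 # four Itanium 2 bare chips
    cache_per_cpu_mb: int = 4     # 4 MB cache each
    shared_ddr_gb: int = 4        # 4 GB shared dedicated DDR RAM
    optical_buses: int = 4        # East Bridge: four dedicated OC1000 optical busses
    agp_bus_mhz: int = 133        # West Bridge: 133 MHz Advanced Graphical Bus
    agp_bus_width_bits: int = 256 # 256 bits wide

@dataclass
class Manager:
    """One manager/supermanager for four granules, per the list above."""
    optical_buses: int = 16                           # 16 OC1000 optical busses
    granule: Granule = field(default_factory=Granule) # one granule of its own
    tdh_net_connectors: int = 2                       # Optical Twisted Double Helix Net connectors (2x)

if __name__ == "__main__":
    m = Manager()
    print(m.granule.cpus, "CPUs per granule,", m.optical_buses, "optical busses per manager")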
Replicating this pattern into a geodesic-like array of 4x16x64x256 Granule processors makes the basic 256-processor Supercomputer a reality (with 1108 CPU chips inside). The 84 manager and supermanager granules more than compensate, allowing all 256 processors to run at 86% of full speed or higher. And by adding a Superframe Manager of 4x16 processors (64 more), we can support up to 16 such geodesic-like arrays in a single system, without additional management, making 4096-CPU-Granule Supercomputers a reality. The chip-count arithmetic is worked out below.
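This works out the chip-count figures quoted above. The counting conventions for the manager granules are my own attempt to reconcile the 1108 figure here with the 1360 figure quoted later in this article; neither convention comes from a published specification.

# Worked arithmetic for the geodesic-like array described above.
compute_granules = 256          # basic 256-granule array
cpus_per_granule = 4            # four Itanium 2 chips per granule
manager_granules = 84           # manager and supermanager granules

compute_cpus = compute_granules * cpus_per_granule
print(compute_cpus)                                         # 1024

# If each manager contributes a single management chip: 1024 + 84 = 1108
print(compute_cpus + manager_granules)                      # 1108

# If each manager is itself a full four-CPU granule: 1024 + 84*4 = 1360
print(compute_cpus + manager_granules * cpus_per_granule)   # 1360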
Naturally, since we use an internal limited-overhead, minimum-distance interconnect (meaning: minimum conflicts and minimum arbitration multiplexing overhead), we can continue to expand the number of processors. Estimating the net performance of a single CPU Granule at somewhere in the vicinity of 2 Billion instructions per second average (with a peak speed of 3.6 Billion instructions per second in a burst), the aggregate speed of the 4096-Granule unit is thought to be 8000+ Billion instructions per second average and 14400+ Billion instructions per second during peak bursts. The interesting advance we made in the design was to re-equip each granule with a Pre-Fetch / Post-Op cache of an additional 256 Mbytes of bipolar memory, which serves to cache the elements most frequently needed to reload the processor's cache and uses the highest-speed transfer mechanism of each processor to reload its cache in the event of cache penalty states, resulting in the CPUs running at 99% of their rated SPEED at all times. That, coupled with the highly efficient TDH MDNet architecture, allows us to retain 99% of the computation capacity of each CPU. This compares very favorably to the 44% (and declining) rate of CPU execution retained by IBM's current top-of-the-line supercomputer, allowing us to achieve better than a 4:1.6 (roughly 2.5:1) performance ratio per unit of Central Processor speed, leading to a system about 2.5 times more powerful, at the price, than IBM is capable of producing.
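The throughput and efficiency figures above can be checked with simple multiplication. The numbers below are the article's own estimates multiplied out, not measured results.

# Back-of-the-envelope check of the figures quoted above.
granules = 4096
avg_bips_per_granule = 2.0      # ~2 billion instructions/second average per granule
peak_bips_per_granule = 3.6     # ~3.6 billion instructions/second in a burst

print(granules * avg_bips_per_granule)    # 8192.0  -> the "8000+ Billion" figure
print(granules * peak_bips_per_granule)   # 14745.6 -> the "14400+ Billion" figure

# Claimed retained CPU efficiency: 99% here versus the 44% attributed to IBM.
print(0.99 / 0.44)                        # ~2.25, close to the quoted 4:1.6 (2.5:1) ratio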
However, our research also produced a startling discovery: we could achieve ALL THAT PERFORMANCE with a single CPU chip, if only we built one using something other than silicon dioxide and conventional transistors. This led to consideration of a second architecture, the 'Type B'.
B) TYPE B Linear Hyper-Accelerated Supercomputer: REPLACEMENT OF THE MASSIVELY PARALLEL DESIGN WITH HYPER-ACCELERATION SEMICONDUCTORS (nAST)
After a while it became apparent that the complexity of larger and larger arrays of CPUs in a Supercomputer (in the case of the design above, 1360 Itanium 2 CPUs are used…) causes an exponential expansion in the complexity of the software needed to service the system, so we decided to investigate an early experiment in physics, the Near-Insulator/Near-Conductor Bimetallic Electron Trap I had demonstrated to the physics community in the mid-1980s.
The advantage of using a device tottering on the brink between conductivity and insulation is that such devices can be made to settle into one state or the other using very little quantum energy; in such experimental devices, for example, the mere presence of a trapped electron can tip the semiconductor scales in one direction or the other.
So, in 1994, I elected to start experimenting on building a 16-Boolean-function chip suitable for ALU use on single-bit, nybble, byte and word streams of data. As luck would have it, the first successful example, the nAST Oscillator, was ready by the spring of 1996. By mapping its transient states to "and", "or", "nand", "nor" and the other Boolean transfers, and by using a small array of such devices with dual photo-electronic inputs and an LC diode as a semaphore output, we were able to build a successful 32-bit-wide 16-Bool chip that could process two data streams in comparative mode and produce a result at the fastest possible rate that optics is capable of transmitting data, somewhere in the 500 Terahertz range. That meant such an ALU would be capable of register-to-register operations at about 500,000 Giga-instruction-cycles per second, if the full streaming rate were sustainable.
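A "16 Boolean function" ALU corresponds to the sixteen possible Boolean functions of two inputs. The sketch below enumerates that function table and applies one entry bitwise to two 32-bit operands; it models only the logic table, says nothing about how the nAST devices realize it, and all names in it are illustrative.

MASK32 = 0xFFFFFFFF

def bool16(func_index, x, y):
    """Apply one of the 16 two-input Boolean functions bitwise to 32-bit words.

    func_index is the 4-bit truth table of the function: bit (2*a + b) of
    func_index is the output for input bits (a, b).
    """
    out = 0
    for bit in range(32):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        out |= ((func_index >> ((a << 1) | b)) & 1) << bit
    return out & MASK32

# A few familiar entries in the 16-function table:
AND, OR, XOR, NOR = 0b1000, 0b1110, 0b0110, 0b0001
x, y = 0xF0F0F0F0, 0xFF00FF00
assert bool16(AND, x, y) == (x & y)
assert bool16(OR,  x, y) == (x | y)
assert bool16(XOR, x, y) == (x ^ y)
assert bool16(NOR, x, y) == (~(x | y)) & MASK32
print(hex(bool16(XOR, x, y)))   # 0xff00ff0

# The article's rate claim, taken at face value: 500 THz = 5e14 cycles/s,
# i.e. 500,000 giga-operations per second if one operation completes per cycle.
print(500e12 / 1e9)             # 500000.0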
Because such a supercomputer would be processing so much data in so short a period of time, I invented a new architecture, the "instruction execution streaming processor" (IESP).
The idea behind the IESP is that programs consist of compiled streams of execution code and code loops that are packeted with tags associating them with their Tasks, and then "streamed" in long data transfer queues into the Supercomputer; their results are combined according to any interdependencies, gated by dependency state switches, allowing the array to interoperate without reordering for the purposes of state dependency. A toy sketch of this packeting and gating idea follows.
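In this sketch, each packet carries its Task tag and the tags it depends on, and a packet executes only once its dependencies have completed, without reordering the queue. The packet fields and the scheduling loop are assumptions of mine, not the actual IESP design.

from collections import deque

class Packet:
    """A toy IESP-style execution packet: a chunk of code tagged with its Task
    and with the tasks whose results it depends on."""
    def __init__(self, task_tag, code, depends_on=()):
        self.task_tag = task_tag
        self.code = code                  # a callable standing in for a code stream
        self.depends_on = set(depends_on)

def stream(packets):
    """Release each packet only when its dependency tags have completed
    (a 'dependency state switch' gate). Assumes all dependencies are satisfiable."""
    queue = deque(packets)
    completed, results = set(), {}
    while queue:
        pkt = queue.popleft()
        if pkt.depends_on <= completed:   # gate open: all dependencies satisfied
            results[pkt.task_tag] = pkt.code(results)
            completed.add(pkt.task_tag)
        else:
            queue.append(pkt)             # gate closed: recirculate the packet
    return results

# Tiny example: task C combines the results of A and B.
out = stream([
    Packet("A", lambda r: 2),
    Packet("C", lambda r: r["A"] + r["B"], depends_on=("A", "B")),
    Packet("B", lambda r: 3),
])
print(out["C"])   # 5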
The result: an array of 16 of the Boolean function chips that could accept a page of up to 1,000,000 instructions, process them, and move to the next page, with all memory and cross-instruction references resolved, in a single "multi-clock" cycle, allowing up to 500 million such pages to be streamed to this array per second.
The astonishing outcome: a single small processor of only 16 mid-scale-complexity chips and a large input array buffer, all fabricated using nAlkane Silver Thiozole, that could execute 500,000 Billion instructions in a single second, so long as we could sustain the fetch cycle for the programming.
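Taking the page size and page rate quoted above at face value, the stated instruction rate follows directly; the multiplication below simply restates the article's figures.

# Page-streaming arithmetic using the figures quoted above (no measured data).
instructions_per_page = 1_000_000      # up to one million instructions per page
pages_per_second = 500_000_000         # up to 500 million pages streamed per second

instructions_per_second = instructions_per_page * pages_per_second
print(instructions_per_second)         # 500,000,000,000,000 (5e14)
print(instructions_per_second / 1e9)   # 500000.0 "Billion instructions" per second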
In tests, we have since added a 43-chip Boolean array that processes 128- and 256-bit-wide complex arithmetic (the original 16-chip array was more than capable of simulating integer arithmetic without major modification). This "floating point arithmetic" simulation processor also achieves a nearly sustained 500,000 Billion instructions per second rate during burst program transmission operations.
The two comprise the most complex elements of the Type B Linear Supercomputer. We have yet to design the input program fetch, page faulting, input/output management and communications processors to support this CPU.
However, one thing of note is that the devices run on an input of only 0.25 volts and draw less than 1 microwatt of power at all times. The device produces little heat, and only on the photo-optic side. The conclusion of the nAST Supercomputer experiment suggests that we are but a few million dollars away from a breakthrough: a Supercomputer of a few hundred mid-complexity nAST gate arrays, easily reduced to a single chip, capable of a sustained execution rate of half a million Billion instructions per second, so long as the programming could be delivered to it at that rate.
For a Single-CPU Linear Supercomputer, this breakthrough is very dramatic, because it eliminates the need to use three-quarters of a million 1 GHz Itanium 2 processors, or more, just to achieve the same performance.
The cost reduction that the Type B Linear Supercomputer represents is equally
dramatic, along with the power reduction.
At the time of completing the initial nAST project, we had come to conclude that the best direction for research dollars was to further commercialize nAST Semiconductors. With the prototype test of the Type B Linear Supercomputer, we proved not only that such a machine was thousands of times less expensive, but that it could yield a SINGLE CPU capable of outgunning the fastest CMOS silicon transistorized CPUs arranged in a Massively Parallel Array.
And, clearly, once miniaturized and commercialized, the Type B Linear Supercomputer CPU would be able to replace today's CPU chip as a single-CPU system, or even be incorporated into some futuristic Massively or Mid-scale Parallel computer that used multiple nAST SUPERCOMPUTER CPUs.
As of this time, we had not standardized on a logical instruction set for the Type B, but were considering incorporating simulations of a wide variety of existing CPU instruction sets, including x86, RISC86, IA-64, SPARC and Itanium 2, so as to provide the best overall home for future software developers wishing to avail themselves of these codes without the limitations forced on them by the design of their CPU chips. Such, of course, would require expanding the architecture to serve multiple arrangements of registers, but given the extraordinary speed of the Type B, we do not feel that would be a problem. A toy sketch of the multiple-instruction-set idea follows.
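One way to picture hosting several instruction sets on one core is a per-ISA decode table in front of a common set of internal operations. The sketch below is a deliberately simplified assumption of mine, using two invented guest encodings; it does not describe any actual Type B mechanism.

# Toy illustration of hosting several guest instruction sets on one core:
# each ISA supplies its own decoder that maps a guest instruction to a
# common internal operation. The ISAs and encodings here are invented.

def internal_add(state, dst, a, b):
    state[dst] = state[a] + state[b]

# Hypothetical decoders for two imaginary guest ISAs.
DECODERS = {
    "isa_x": lambda insn: ("add", insn["rd"], insn["rs1"], insn["rs2"]),
    "isa_y": lambda insn: ("add", insn["target"], insn["left"], insn["right"]),
}

INTERNAL_OPS = {"add": internal_add}

def execute(isa, insn, state):
    op, *operands = DECODERS[isa](insn)
    INTERNAL_OPS[op](state, *operands)

regs = {"r0": 0, "r1": 5, "r2": 7}
execute("isa_x", {"rd": "r0", "rs1": "r1", "rs2": "r2"}, regs)
print(regs["r0"])   # 12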
C) TYPE C STORAGE SUPERCOMPUTER.
We also evolved the notion of a "Storage Supercomputer" which, using basic nAST dual-triodes, implements terabytes of storage in a single-board design, so that disk drives need be used only as hard storage backup devices.
The exceptional speed (1 Million Gigabytes/second) such a Type C SC would provide would allow greater overall streaming rates to the central processor of the Type B Linear Supercomputer, eliminating one of the great foibles of modern computation: the overhead of the mechanical disk drive as an extension to Random Access Memory.
One of the primary concerns of the Type C was to equip the unit with File Service Acceleration, eliminating the need for repetitive storage retrieval and update by attaching the File System to other Supercomputers while containing it, and its service units, locally in the Type C. This "channelized Storage egress" greatly reduces overhead in the main supercomputer connected to the SC, by retrieving only the actual record needed and by performing all the file system maintenance and overhead the main supercomputer would otherwise carry. That more than quintupled the performance of our test vehicles, so long as we did not under-equip the design of the API between the Type C and the Type A, nor saddle the unit with poorly conceived drive topologies; for the latter we invented a UNIVERSAL HARD DRIVE STORAGE TOPOLOGY (UHD/ST) that can accommodate any and all other file systems as mere "conveniences" of applicability within the Type C, for application compatibility. A minimal sketch of the record-level egress idea follows.
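As I read it, the point is that the storage unit owns the file system bookkeeping and returns only the record asked for, rather than handing raw blocks to the host. The sketch below follows that reading; the class and method names are illustrative assumptions, not the UHD/ST interface.

class StorageSupercomputer:
    """Toy Type C model: the storage side owns file-system bookkeeping and
    returns only the requested record, so the host never walks directories,
    allocation tables or block lists itself."""

    def __init__(self):
        self._files = {}   # path -> list of records (stands in for terabytes of nAST storage)

    def append_record(self, path, record):
        self._files.setdefault(path, []).append(record)

    def fetch_record(self, path, index):
        # All lookup and maintenance overhead stays on the storage side;
        # only the single record crosses the channel to the host.
        return self._files[path][index]

store = StorageSupercomputer()
store.append_record("/logs/run1", b"first record")
store.append_record("/logs/run1", b"second record")
print(store.fetch_record("/logs/run1", 1))   # b'second record'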
In our opinion, semi-exotic approaches like nAST and the Type B Linear Hyper-Accelerated Supercomputer are not only feasible, but a better use of money than continuing on with current, slow CMOS processors in a Massively Parallel Array.
If emulating human intelligence ever becomes a necessity, say in simulacrum robotics, a massive array of Type B Linear HA Supercomputers can be assembled into a single "brain box" and applied to that end. In the meanwhile, the Type B satisfies every requirement there is.
Used as a Workstation, Server or Supercomputer, or even in a scaled-down version for handhelds, it would bring an entirely new dimension of computation to business and industry, accelerating it nearly a million-fold.
_____
© 2002/2003 American Computer Science Association Inc. All rights reserved.