CQBIC "Machine Check" Bug
New update (02/08/2006): this bug is solved. I met this bug again and
was very frustrated. I worked hard with my logic analyzer, photographed
the timing diagrams (chronograms) and searched for clues. After days of
hard work, I concluded that my hardware and software were not causing the
problem. Something had happened to the VAX3900. I took out the boards and
did two things:
- I exchanged the order of the two memory boards.
- After finding that the leaking Ni-Cd battery had corroded the
front panel and the small PCB attached to it, I took out the
battery, disassembled the board and panel, washed them with a brush, soap
and baking soda, and put them back.
I did nothing else, and the machine check bug was gone.
Old update: Maybe it was my fault that I did not monitor the DMR line
when performing DMA. I added the circuit and I haven't met the problem
since.
System settings
- KA655, 2x MS650-BA, MSCP SCSI board with CDROM and HD
- NetBSD 1.5.2
Symptoms (Update 2006/02/08, the following diagnoses must be wrong)
- If the MSCP SCSI board performs two burst mode DMA transactions in one DMA grant, the machine gets machine check 10 errors (a 10 us timeout when the CPU accesses memory that is mapped to the Qbus) from time to time.
- If the board runs one transaction per grant (single shot DMA), the
problem occurs much less often, but it still happens.
- Some of the DIN cycles are long. It is the CQBIC controller that
prolongs the DMA (slow RPLY in response to DIN). I recorded a 6 us
single shot DMA and a 9 us double shot burst mode DMA. See the
"long DMA" diagram below.
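The arithmetic behind this suspicion can be written out as a small sketch. It only restates the numbers recorded above (6 us, 9 us, and the 10 us timeout). Whether a long DMA actually causes the CPU timeout is the diagnosis that was later withdrawn, so this shows the margin under that assumption, nothing more. The names `CPU_TIMEOUT_US`, `measured` and `margin_us` are made up for illustration.

```python
# Timeout quoted in the notes for machine check 10 (CPU access to Qbus memory).
CPU_TIMEOUT_US = 10.0

# DMA durations recorded with the logic analyzer, as (transactions, total us).
measured = {
    "single shot DMA": (1, 6.0),
    "double shot burst DMA": (2, 9.0),
}


def margin_us(duration_us, timeout_us=CPU_TIMEOUT_US):
    """Time left before the timeout if the CPU had to wait for the whole DMA."""
    return timeout_us - duration_us


for name, (transactions, duration) in measured.items():
    per_transaction = duration / transactions
    print(f"{name}: {duration} us total, "
          f"{per_transaction:.1f} us per transaction, "
          f"{margin_us(duration):.1f} us left before the timeout")
```

Under this assumption the double shot case leaves only about 1 us of slack, which would fit the observation that burst mode fails more often than single shot mode.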
Solutions (Update 2006/02/08, the following diagnoses must be wrong)
- The OS should retry the memory access after a machine check and keep going.
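The retry idea can be sketched as a bounded retry loop. This is only a user-level illustration, not NetBSD's machine check handler: `MachineCheck`, `access_with_retry` and `flaky_read` are hypothetical names, and the failing access is simulated.

```python
class MachineCheck(Exception):
    """Stand-in for a machine check 10 trap (hypothetical, for illustration)."""


def access_with_retry(access, max_retries=3):
    """Call access(); on MachineCheck, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return access()
        except MachineCheck:
            if attempt == max_retries:
                raise  # give up and let the caller see the error


# Simulated Qbus read that times out twice and then succeeds.
failures = [True, True, False]


def flaky_read():
    if failures.pop(0):
        raise MachineCheck("10 us timeout")
    return 0x1234


value = access_with_retry(flaky_read)
print(hex(value))
```

Bounding the number of retries keeps a permanent fault from turning into an endless loop, while an occasional timeout is absorbed.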
Detailed description and first-hand materials (Update 02/08/2006: I
later suspected the strange DMR: no DMG, no SACK, only a lone DMR
following my DMR. I have photos.)
- The NetBSD 1.5.2 installer reported "machine check 10"
and "CDAL parity error", while the DMASER (DMA system error register)
read dmaser=0x88, which means a 10 us timeout occurred while the CPU was
accessing Qbus memory (page 3-97, KA655 CPU technical manual; search for
"ka655tm1" to find an online version). A trap happened after the installer
tried to recover. Note that in the first two screen images, the machine check happened while the CPU was executing the same instruction (same PC). The installer was busy unzipping the NetBSD .tgz files.
- Screen image: one incident. dmaser bit 7: timeout while the CPU was accessing Qbus memory; bit 3: another machine check happened before this one, so the Qbus memory and DMA addresses were not captured.
- screen image Another incident
- Screen image: yet another incident (when compiling the kernel). This
one has a different PC. Maybe it is the same PC mapped to a different
virtual address?
- A long DMA diagram. It is negative logic. It shows a burst mode DMA
with two transactions that takes about 9 us (see the diagram). The Q22
DMA protocol allows up to 4 transactions in one bus grant. This DMA is
long (slow) because the KA655 took a long time to respond to DIN with
RPLY. Maybe the CPU was competing for Qbus memory access and the CQBIC
was busy handling that.
- Some Google results. Nobody provided an explanation. I once found a
post saying that some NetBSD versions had this problem with some CMD
MSCP SCSI controllers, but I failed to find the post the second time.
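As a worked example of the dmaser=0x88 reading described above, here is a minimal decoding sketch. Only the two bit meanings given in these notes (bit 7 and bit 3) are included; the constant names are invented for this sketch and are not the official field names from the KA655 manual, and any other bits are simply reported as unknown.

```python
# Bit meanings as described in the notes; names are made up for this sketch.
DMASER_CPU_QBUS_TIMEOUT = 1 << 7  # bit 7: timeout on a CPU access to Qbus memory
DMASER_EARLIER_ERROR = 1 << 3     # bit 3: an earlier machine check was pending

DESCRIPTIONS = {
    DMASER_CPU_QBUS_TIMEOUT:
        "timeout while the CPU was accessing Qbus memory",
    DMASER_EARLIER_ERROR:
        "earlier machine check pending; Qbus and DMA addresses not captured",
}


def decode_dmaser(value):
    """Return (descriptions of the known set bits, mask of unrecognized bits)."""
    described = [text for mask, text in DESCRIPTIONS.items() if value & mask]
    known = 0
    for mask in DESCRIPTIONS:
        known |= mask
    return described, value & ~known


described, unknown = decode_dmaser(0x88)
for text in described:
    print(text)
print(f"unrecognized bits: {unknown:#x}")
```

0x88 is binary 1000 1000, so exactly bits 7 and 3 are set, which matches the two conditions listed for the first screen image.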