Opened 9 years ago

Last modified 9 years ago

#88 new

bbn-iscsi has an orange light on one of its disks

Reported by: Owned by: somebody
Priority: critical Milestone:
Component: Administration Version: SPIRAL4
Keywords: Cc:


When IBM came to install the extra nodes today, they noticed an orange light on one of the iSCSI disks (second from the top on the left), which they say is a failed disk. The top disk has a solid green light, while this second disk has blinking green and solid orange, so that definitely seems suspicious.

  • If that is indeed what the light means (i.e., the disk isn't intentionally offline or anything), someone should phone it in to IBM and get the disk replaced. What's the procedure for that?
  • We should verify that the iSCSI unit is configured to e-mail or otherwise notify someone about a failed or failing disk, since site admins are unlikely to notice a blinking light in a timely fashion.

Change History (4)

comment:1 Changed 9 years ago by

We need to establish the best process for handling drive failures, with respect to contacting the correct IBM representatives...

As for emails, I was just talking to Victor about this last Friday. It's somewhat complicated, actually. What we need to do is tweak the postfix install on every head node, to make it use as its smart relay host. Then worker nodes and devices like the iSCSI unit either need to directly relay to, or relay through, their local head node. Finally, on RENCI's Exchange Server, Casey Averill will have to whitelist as an allowed relay host. All this, to get the messages into will also need to have ACLs for the hosts it will relay for; otherwise it becomes an open relay.
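The head-node piece of that could look roughly like the following postfix main.cf fragment. This is only a sketch: the relay hostname and the rack subnet are placeholders, since the real site names aren't in this ticket.

```
# /etc/postfix/main.cf on a head node -- sketch only; hostname and subnet
# below are hypothetical placeholders, not the real site values.

# Send all outbound mail through the site's smart relay host.
relayhost = [smarthost.example.org]

# Accept mail from localhost and this rack's private subnet only, so the
# worker nodes and the iSCSI unit can relay through the head node.
mynetworks = 127.0.0.0/8 10.0.0.0/24

# Refuse to relay for anyone outside mynetworks -- this is the ACL that
# keeps the head node from becoming an open relay.
smtpd_recipient_restrictions = permit_mynetworks, reject_unauth_destination
```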

Please let me know if anyone knows of another/better way to set up the email stuff...

comment:2 Changed 9 years ago by

FWIW, I'm not 100% convinced that blinking-green-plus-solid-orange actually means "drive failure". As Nick pointed out, it might just mean "hot spare". I made the ticket because the IBM guys noticed it and thought it was a problem, but that was just an off-the-cuff diagnosis, and I didn't do anything to try to verify it.

comment:3 Changed 9 years ago by

Hmm. So the part I'm confused about is: why does need to be whitelisted by Casey Averill so that it can send mail to?

Because other than that, I would say either:

  A. configure every head node as a nullclient which relays all its mail via
  B. configure every head node as an MTA which relays mail for the private subnets in its own rack and directly delivers it to its destination

would work fine. We use both in our lab under different circumstances.

B has the advantage that it's a less complicated configuration if you want to be able to get mail from those things with private addresses (like the iSCSI), and puppet makes "push out identical aliases files to each host" pretty easy.
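In postfix terms the two options might look roughly like this. A sketch under assumed names: the relay host and subnet below are placeholders, not the real rack values.

```
# Option A: head node as a nullclient -- relays everything upstream and
# accepts no mail from the network. (Relay hostname is a placeholder.)
relayhost = [mail-relay.example.org]
inet_interfaces = loopback-only
mydestination =

# Option B: head node as a rack MTA -- accepts mail from its rack's private
# subnet and delivers outbound mail directly to the destination MX.
# (The subnet is a placeholder.)
inet_interfaces = all
mynetworks = 127.0.0.0/8 10.0.0.0/24
relayhost =
smtpd_recipient_restrictions = permit_mynetworks, reject_unauth_destination
```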

It sounds from what you're saying, though, that B would cause some kind of problem with sending mail to, and i'm not clear on why. If your aliases file on bbn-hn says:


that's not relaying, that's just delivery, and's normal mailservers should handle it the same way they would any non-local sender wanting to send mail to exogeni-ops. No?
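To illustrate the delivery-vs-relaying point with a made-up entry (the alias target here is hypothetical, not the real ops list):

```
# /etc/aliases on bbn-hn -- hypothetical entry for illustration only.
# Mail generated locally for "root" is rewritten to an off-site address and
# sent outbound like any other message; no third-party relaying is involved.
root: ops-list@example.org
```

After editing the file you'd run newaliases to rebuild the alias database.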

comment:4 Changed 9 years ago by

As a quick note, I also think that rack mail configuration is a bit out of scope for this ticket. If there's a disk problem in the iSCSI unit, that's worth dropping things and fixing. Our druthers would be for there to be a procedure in place for detecting disk issues in the future, but if that has a lot of dependencies that can't be met yet, let's just figure out whether this particular disk is having an issue and move on for now.

So: is this particular disk having an issue? How would we tell?
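One first-pass way to answer that, assuming the array's member disks are visible as block devices on some Linux host (an assumption; on an IBM iSCSI array you would more likely check via the vendor's management tool), is to ask each drive for its SMART health self-assessment. The device names here are illustrative.

```shell
#!/bin/sh
# Hedged sketch: query SMART overall-health status for each visible disk.
# Assumes smartmontools is installed and the disks are exported to this host.
count=0
for dev in /dev/sd[a-z]; do
  [ -b "$dev" ] || continue          # skip if the glob matched nothing
  count=$((count + 1))
  echo "== $dev =="
  # -H reports the drive's overall SMART health self-assessment.
  smartctl -H "$dev" || echo "WARNING: $dev reports a problem (or no SMART)"
done
echo "checked $count device(s)"
```

A drive that the controller has failed would typically show "FAILED" here (or refuse to answer at all), which would corroborate what the orange light is suggesting.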
